COVER FEATURE: ATTRIBUTES OF QUALITY

The Role of Field-Programmable Gate Arrays in the Acceleration of Modern High-Performance Computing Workloads

Manuel de Castro, University of Valladolid
David L. Vilariño, University of Santiago de Compostela
Yuri Torres and Diego R. Llanos, University of Valladolid

Reconfigurable hardware circuits, such as field-programmable gate arrays, have gained popularity in the high-performance computing (HPC) community in recent years. Nevertheless, their real contribution to accelerating HPC workloads is unclear in both potential and extent.

In the early 2000s, the increment in single-core CPU performance slowed down significantly with respect to previous decades. This caused new techniques and design paradigms, such as parallel (multicore) or vectorial processing, to emerge as alternatives to further increase CPU performance. Scientists also started investigating the potential use of GPUs as high-performance computational units for floating-point-intensive computations. That encouraged the main GPU vendors to develop frameworks, languages, and runtime environments to ease the programming of GPUs for purposes beyond graphics processing. Consequently, general-purpose computing on GPUs was born. This entailed a paradigm shift for the high-performance computing (HPC) community, as heterogeneous systems including regular CPUs and specialized hardware accelerators became the standard for supercomputers, and data parallelism took the spotlight.
As a consequence of this shift toward heterogeneous systems, different kinds of hardware accelerators, from GPUs to field-programmable gate arrays (FPGAs) to application-specific integrated circuits (ASICs), have appeared during the last two decades. Among them, FPGAs have recently gained interest in the literature as a promising HPC platform. However, there exists a sharp contrast between this increasing research interest in FPGAs' theoretical capabilities and their low general adoption. This situation begs some questions: Are current data center FPGAs well suited for accelerating modern HPC workloads? When and how is it advisable to leverage FPGA devices to accelerate scientific computations? Let us discuss these topics in more detail by first putting heterogeneous accelerators in perspective and, later, analyzing the characteristics, advantages, and drawbacks of FPGAs, including their programmability and the portability of their code, to offer an answer to these questions.

GPUs: THE STANDARD HPC ACCELERATOR
As efforts to increase processing performance since the early 2000s have focused on parallel computing and its many forms, GPUs have revolutionized the field, due to their massively parallel architectures. GPUs include thousands of processing cores, simpler than the ones used for CPUs, which are designed so that all of them perform the same computations (that is, instructions) on different and independent datasets. Even though each individual GPU core is considerably less computationally powerful than a CPU core, the sheer number of them that a single device can contain makes GPUs superior to CPUs when it comes to data-parallel processing, both in raw performance and energy efficiency.

The high interest in GPUs manifested by the HPC community from the beginning has greatly influenced the industry. We highlight here two main consequences. First, GPU vendors started assembling what we may call "general-purpose versions" of their cards, adding error-correcting code memory and other features to better suit HPC needs. More recently, mainly due to the artificial intelligence (AI) market (and its convergence with HPC), GPU vendors also started to develop GPUs with scientific/AI computations in mind. Second, programming languages, frameworks, and models for heterogeneous computing mainly targeting GPUs have been created. Thus, their design philosophy has been GPU centric, or at least data parallelism centric. For example, OpenCL, SYCL, and Data Parallel C++ include programming constructs that map particularly well to GPU architectures, even though all of them are designed to work with a wide range of computing devices, not only GPUs. CPUs can easily translate these constructs to their own architectural resources and efficiently work with them, but this is not the case for all computing devices supported by these models (for example, FPGAs).
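To make this GPU-centric design philosophy concrete, the following is a minimal, hypothetical OpenCL C kernel; the saxpy example is our own illustration, not code from the cited works. Each work item independently computes one element, identified by its global ID, which is precisely the data-parallel pattern that maps directly onto the thousands of simple GPU cores described above:

```c
/* A device-agnostic, data-parallel (ND-range) OpenCL C kernel.
 * Each work item handles one independent element of the dataset. */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global const float *y,
                    __global float *out)
{
    size_t i = get_global_id(0);  /* unique index of this work item */
    out[i] = a * x[i] + y[i];     /* same instruction, independent data */
}
```

A CPU runtime can map such work items onto cores and vector lanes with little effort; as discussed later, the mapping onto FPGA logic is far less direct.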
To maintain GPU dominance, vendors have recently started including more specific hardware in their devices, which further accelerates tasks of high current interest. For example, since 2018, Nvidia GPUs include dedicated tensor cores for the acceleration of deep learning workloads.

ASICs: SPECIFIC-PURPOSE ACCELERATORS
Some computational algorithms only moderately benefited from conventional GPU architectures, while others needed to be accelerated even further. In these cases, ASICs came to the rescue. ASICs are designed and built solely to solve the particular task of interest, with both increased performance and better energy efficiency as compared to those achievable by a CPU or GPU. This is another form of heterogeneous computing, although the adoption of ASICs is often limited to certain market niches, due to their specific nature. In research contexts, one of the most currently used ASICs is the tensor processing unit (TPU), developed by Google for neural network machine learning acceleration.1 Another example is the use of ASICs in the context of bitcoin mining.2

FPGAs: RECONFIGURABLE HARDWARE ACCELERATORS
FPGAs are reconfigurable hardware devices. They can be used to synthesize different hardware designs or architectures over and over again. FPGAs were introduced in the mid-1980s by Xilinx (now AMD) as a natural evolution of programmable logic devices. They were initially intended to serve as glue logic and for prototyping small digital circuits.
Since the beginning of the premulticore CPU era, FPGAs appeared as an excellent proof-of-concept device to shorten the software development cycle for ASICs, as this development was allowed to start before any test chip had been manufactured. The increase in the available logic cells, together with large random-access memory blocks, digital signal processor (DSP) arithmetic units, and even embedded microprocessors, moved FPGA usage beyond proof-of-concept prototyping to final production on their own. Thus, in the 2000s, high-performance FPGA-based architectures were developed. At that time, FPGAs already exhibited high efficiency as accelerators of applications in a wide variety of areas, such as cryptography, signal processing, genomics, or pattern recognition, to name just a few. As a consequence, they were adopted as accelerator devices in some supercomputing clusters.3

In the mid-2000s, GPUs came into the game as a serious rival of FPGAs. Even if FPGAs were initially competitive against GPUs, the fast development of the latter, and, more importantly, the support of Nvidia delivering the CUDA platform in 2007, restricted FPGAs to embedded application domains where energy efficiency was critical, and GPUs took their place as accelerators in HPC clusters.4

In fact, the decline of FPGAs was not just an issue of computing performance or efficiency but also a problem of productivity. The programming of FPGAs required working at the register transfer level (RTL) with intricate hardware description languages (HDLs), such as VHDL or Verilog, which are rather less user-friendly than high-level programming languages and models. As in the case of GPUs, which were first programmed using clever tricks to take advantage of their capabilities, with their vendors later developing friendlier programming environments, the main FPGA vendors have made efforts to provide high-level synthesis (HLS) tools, such as AMD's Vitis, which allow FPGA applications to be developed from a software perspective, viewing programmable logic as a computational resource instead of a hardware system.5

However, this improvement comes at the cost of increased compilation time. The translation of HLS code to the RTL, and from there to the desired FPGA configuration, involves multiple optimization steps to map the design onto the target FPGA architecture, and it usually takes a significant amount of time, on the order of hours.

Progress in the computing capabilities of FPGA technology has also been made so that these kinds of devices might be leveraged in research facilities, data centers, computing centers, and other similar environments. Several projects were also conducted in this regard, such as, for example, the Horizon 2020 Future and Emerging Technologies–HPC EuroEXA project (https://euroexa.eu/) and the Heterogeneous Accelerated Compute Clusters project (https://www.amd-haccs.io/), which remains ongoing. Regarding the integration of FPGAs in data centers, an overview of different developments is provided in Alonso and Bailis.6

Concerning vendors, there are currently two main FPGA vendors developing device models that target HPC contexts: Xilinx (property of AMD) and Intel (formerly Altera). Examples of HPC or data center accelerator FPGAs are Xilinx's Alveo and Versal FPGA families and Intel's Stratix 10 and Agilex FPGA families. However, even though data center FPGAs have been available for a few years now and interest among researchers has increased significantly, and although the prospect of near-future FPGA-powered supercomputers has existed since at least 2013,7,8 there has not been significant adoption of FPGA devices as general-purpose accelerators in the industry. For example, many of the TOP500 list's newest entries are multi-CPU-and-GPU supercomputers. One of the few FPGA-powered supercomputers found in the latest TOP500 list is Noctua 2, inaugurated in 2022, in Paderborn, Germany.

This situation leads us to the following question: Are FPGAs really useful to accelerate HPC workloads, where absolute performance is the ultimate goal? To try to answer this question, we should first understand why FPGA architecture and programmability are so special.


SPECIAL CHARACTERISTICS OF FPGAs
Reconfigurability is the main property of FPGAs. They contain an array of programmable logic blocks as well as reconfigurable interconnections to link these blocks together, which allows them to implement complex logic functions. FPGAs can implement any logical function that an ASIC can perform. Most FPGAs also include memory elements, such as flip-flops, and modern FPGAs even include logic blocks for the fast execution of common low-level computations, such as DSPs for floating-point operations. Although FPGAs are designed to be able to implement (synthesize) arbitrary logic functions, they are limited by their quantity of resources and their clock speed. Thus, high-complexity functions might not be synthesizable into a given FPGA. Nevertheless, the amount of resources present in FPGA models has greatly increased over time.

As FPGAs allow the programmer to implement custom hardware architectures, they, at first glance, seemed to be well suited for HPC computations. While common CPU execution must dedicate a significant amount of time and energy to fetching and decoding every instruction to execute,9 these steps and their cost can be avoided in custom hardware designs, where the computations to perform are known beforehand. Moreover, CPU instruction sets are composed mainly of simple operations that are combined to make more complex computations; however, FPGAs can potentially implement those complex computations directly, saving clock cycles in their execution. This includes the implementation of data- or task-parallel computations in hardware. By allowing specific computational tasks to be executed directly in hardware, FPGAs are highly power-efficient devices, reducing the need for general-purpose processor overheads. This direct execution path can significantly lower power consumption, especially for tasks that can be highly parallelized or require specialized processing. However, FPGAs offer lower clock speeds than CPUs. To overcome this limitation, engineers exploit the main strengths of FPGAs: fine- and coarse-grain parallelism as well as the previously mentioned low overhead in computations.

Overall, recent improvements in FPGA technology (both in device design and the software stack) have made the use of these devices seemingly viable as accelerators for HPC workloads. They are known for being able to successfully accelerate workloads composed of irregular data types and algorithms when compared to CPU executions as well as for achieving considerably higher energy efficiency.10,11 Additionally, FPGAs present a certain innate characteristic that cannot be replicated by any ASIC: reconfigurability. This is a crucial advantage in environments where multiple distinct applications need to be accelerated over different periods of time. Moreover, FPGAs can leverage dynamic partial reconfiguration to modify their behavior on the fly.12 This possibility further increases the accelerator's flexibility and enables it to widen the number of tasks it can serve without requiring a complete reconfiguration (which incurs higher overheads).

Nevertheless, as devices for heterogeneous computing, it would be more appropriate to compare FPGAs with the other accelerators used for heterogeneous computing. After all, modern general-purpose supercomputers rarely include only CPUs but, rather, a combination of CPUs and GPUs. This comparison should not only be carried out in terms of absolute performance and energy efficiency but also take into account programmability and portability issues.

FPGA PROGRAMMABILITY AND PORTABILITY ISSUES
As we stated above, FPGAs are often programmed using HDLs, such as Verilog or VHDL, which provide deep low-level control over the electronic components or behavior of the devices. Although the use of these languages maximizes FPGA performance and minimizes resource utilization, from an HPC perspective, these languages are cumbersome and error prone and incur high development times. The reason is that they are too low level, and HPC engineers are not usually very familiar with the constructs on which they are based. Trying to program HPC kernels entirely with an HDL leads to very high development costs, even more so if the user has to program the entire logic to communicate the FPGA (device) with the CPU (host) for data movement and task dispatching, which is architecture dependent. Thus, using HDL languages is deemed unfeasible in HPC contexts.

To alleviate these issues, HLS languages and frameworks have been developed, which leverage high-level software programming languages (mainly C based) for hardware design. HLS has succeeded in several areas, including deep learning, video transcoding, graph processing, and genome sequencing.13 Examples of these languages are Vitis HLS (for Xilinx FPGAs only) and OpenCL (commonly used for Intel FPGAs and previously for Xilinx ones too). OpenCL was designed from the beginning to target heterogeneous systems and allow all their resources to be efficiently exploited, and it has been extensively used for programming CPU + GPU applications. Its design philosophy is to enable code portability across many different computing devices, that is, to be able to write a single device-agnostic code and execute it on any OpenCL-supported device (including CPUs, GPUs, and FPGAs). This, in theory, is perfect for heterogeneous computing, especially for FPGAs. Not only does the language abstract away the complex low-level details of hardware design behind a popular software programming language, but it also allows any code written targeting any other accelerator (namely, GPUs) to execute on an FPGA.

Nevertheless, theory and reality are often known to differ. While it is true that OpenCL provides code portability across supported devices, it does not guarantee performance portability. Moreover, its high verbosity and the lack of support from important vendors (for example, Nvidia) have made it less commonly used lately. In the particular case of FPGA accelerators, although they are able to properly execute device-agnostic or GPU-optimized OpenCL code, the performance they achieve with such codes is, in general, considerably low.14,15 Some optimization techniques are known to alleviate this situation (see "FPGA-Specific Optimization Techniques"). Although we centered our discussion on OpenCL capabilities, it is worth noting that these conclusions may be extended to any programming model or framework targeting different kinds of accelerators (namely, GPUs and FPGAs), such as SYCL and all its derived implementations, although their actual performance depends on the particular application considered and the internal compiler optimizations available. Other pragma-based languages, such as OpenACC and OpenMP, are also used for this purpose.

Other languages and frameworks used for high-level synthesis use C pragmas to target particular devices. For example, Vitis HLS uses pragmas to target AMD Xilinx FPGAs. Pragmas allow the code to be annotated to target several architectures at the same time, whereas the use of OpenCL forces the rewriting of the code to take advantage of architectures whose vendors do not support OpenCL. Consequently, the use of C pragmas leads to an FPGA-centric design philosophy, which might result in fewer efforts and complexities to optimize naive or device-agnostic codes for FPGA execution. However, this optimization step is still unavoidable.
used for programming CPU + GPU it is worth noting that these conclusions adds to the development costs associated
applications. Its design philosophy is may be extended to any programming with these devices. Overlay architectures
to enable code portability across many model or framework targeting different for FPGAs show potential in reducing the
different computing devices, that is, to kinds of accelerators (namely, GPUs and long compilation and reconfiguration
be able to write a single device-agnos- FPGAs), such as SYCL and all its derived times traditionally associated with FPGA
tic code and execute it on any Open- implementations, although their actual deployment. By providing a higher-level
CL-supported device (including CPUs, performance depends on the particular abstraction, overlays can simplify FPGA
GPUs, and FPGAs). This, in theory, is application considered and the inter- programming, making it more accessible
perfect for heterogeneous computing, nal compiler optimizations available. and quicker to adapt to different appli-
especially for FPGAs. Not only does the Other pragma-based languages, such as cations.16 This approach allows for rapid
language make the complex low-level OpenACC and OpenMP, are also used for prototyping and iteration, which is crucial
details of hardware design abstract by this purpose. in research and development settings.

70 COMPUTER  W W W.CO M P U T E R .O R G /CO M P U T E R


CURRENT HARDWARE LIMITATIONS OF FPGA DEVICES
In addition to the programmability and portability issues described previously, current FPGA technology presents some significant limitations that hinder achieving high performance, no matter how thoughtful of the underlying architecture the programming might be.

Lower clock frequency
FPGA devices present considerably lower working clock frequencies than other kinds of accelerators. For example, Cong et al.17 studied the main performance differences between FPGAs and GPUs. The authors claim that the lower clock frequencies are partially alleviated by the fact that FPGAs are able to achieve a higher number of operations per cycle in each computing pipeline than GPUs. However, FPGAs still present a lower effective parallel factor, which makes GPUs the winner in terms of the absolute performance achievable.

FPGA-SPECIFIC OPTIMIZATION TECHNIQUES

To achieve high performance on FPGA devices, specific code optimizations are needed. These optimizations often differ considerably from CPU or GPU optimizations and require the programmer to be aware of the underlying architecture to a certain degree. The importance of optimizing the code for FPGAs is such that it can make the difference between underperforming and outperforming CPU executions of the same applications. There exist several particular optimization techniques that are known to considerably increase performance for FPGA executions of HPC workloads:

» Pipelined single-threaded versus ND-range kernels: Single-threaded loop-pipelined kernels usually achieve higher performance and allow for more FPGA-specific optimizations than multithreaded (also known as ND-range) kernels, which are commonly used for GPU and CPU execution. This is usually true even when the single-threaded kernels achieve a lower working frequency than the multithreaded ones since the further optimizations available to the single-threaded kernels enable a much higher number of computations per cycle to be achieved. FPGAs especially benefit from deep pipeline kernels, as executing every independent stage of the pipeline simultaneously on every single clock cycle achieves a computations-per-cycle rate proportional to the number of pipeline stages. Thus, algorithmic refactoring of a kernel might be needed to achieve the highest performance on FPGAs (see the sketch after this sidebar).
» Memory hierarchy usage: The memory hierarchy of FPGAs differs significantly from that of traditional general-purpose accelerators, and the user should take this into account when designing optimized kernels. Among others, the usage of the restrict C keyword, which is used in pointer declarations to indicate to the compiler that no other pointer will be used to access the object to which it points, usually provides a noticeable performance improvement. Some other well-known constructs, such as the use of shift registers and sliding-window strategies, are able to efficiently exploit the FPGA resources and achieve high performance.
» Other manual optimizations: In general, automatic compiler optimizations do not achieve performance increments comparable to those of manual optimizations. In particular, manual loop unrolling and manual vectorization often result in increased performance.

More advanced optimizing transformations, including pipelining, data reuse, and resolving interface contention, are discussed in de Fine Licht et al.S1

REFERENCE
S1. J. de Fine Licht, M. Besta, S. Meierhans, and T. Hoefler, "Transformations of high-level synthesis codes for high-performance computing," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 5, pp. 1014–1029, May 2021, doi: 10.1109/TPDS.2020.3039409.


Lower memory bandwidth and size
Cong et al. also state that the lower parallel factor presented by FPGAs, described previously, is largely caused by the FPGAs' far lower off-chip memory bandwidth. Low memory bandwidth is the other most important limitation of current FPGA devices, and it probably constitutes the main limiting factor for FPGAs to achieve high performance in numerous applications. Most bandwidth limits on FPGAs come from the use of Double Data Rate 4 technology, while GPUs have been using faster memory technology for some years now. This limitation is even more relevant when considering that available FPGA boards do not support the memory sizes available in GPUs, and getting data in and out of these cards is expensive and can easily destroy any potential benefit in the computation. FPGAs are designed for flexibility and programmability, with their architecture consisting of an array of programmable logic blocks and routing. This flexibility comes at the cost of not being optimized for high memory bandwidth in the same way GPUs are, since GPUs are designed with parallelism and high-bandwidth memory interfaces from the outset.

Zohouri et al.18 present a comprehensive analysis of the memory controller and memory bandwidth efficiency of Intel FPGAs, concluding that to achieve high memory performance, FPGA kernels must meet multiple strict requirements related to access patterns, alignment, and memory hierarchy usage. These requirements are hard to meet in real-world applications, and thus, for many applications, it might not be possible to achieve more than 70% of the peak memory bandwidth. Overall, the low off-chip memory bandwidth compared to CPUs and GPUs, as well as the difficulty of efficiently exploiting that bandwidth, puts FPGA accelerators at a disadvantage against other accelerators for many applications.
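To illustrate the kind of access-pattern requirement involved, consider the contrast below, a hypothetical sketch of our own rather than code from the cited study. Both kernels move the same amount of data, but only the first presents the memory controller with the contiguous, burst-friendly traffic it needs:

```c
/* Unit-stride accesses let the memory controller issue wide, aligned
 * bursts to off-chip DDR memory. */
__kernel void copy_contiguous(__global const float * restrict in,
                              __global float * restrict out,
                              const int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i];          /* contiguous: burst-friendly */
}

/* Large strides force narrow, scattered accesses that leave most of
 * each memory burst unused. */
__kernel void copy_strided(__global const float * restrict in,
                           __global float * restrict out,
                           const int n, const int stride)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i * stride]; /* strided: defeats bursting */
}
```

On a DDR4-based board, the strided variant can waste most of the nominal bandwidth on partially used bursts, which is one reason unoptimized or GPU-oriented codes fall so far below the peak figures.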
The cited works conducted their research using older FPGA and GPU models, so their conclusions might not seem representative of the current state of the art. To provide some insight into how the state of the art might have changed since those works were published, Table 1 provides a comparison of clock frequencies and memory bandwidths among different Intel FPGA and Nvidia GPU models, including recent ones. Comparing FPGAs and GPUs just in terms of maximum clock frequency is an oversimplification based on theoretical hard limits and should be taken lightly.

TABLE 1. A comparison of the clock frequency and peak memory bandwidth of several Xilinx and Intel FPGAs and Nvidia GPU models, sorted by release date.

Device | Release date | Processing clock frequency | Peak memory bandwidth
Virtex UltraScale+ | First quarter 2016 | Up to 819 MHz | 76.8 GB/s
Nvidia Tesla V100 GPU | First quarter 2017 | 1,245 MHz (base), 1,380 MHz (boost) | 900 GB/s
Intel PAC with Intel Arria 10 GX FPGA | Fourth quarter 2017 (FPGA model from 2013) | Up to 800 MHz | 34.8 GB/s
Intel FPGA PAC D5005 (with Intel Stratix 10 GX) | Fourth quarter 2019 (FPGA model from 2013) | Up to 1,000 MHz | 76.8 GB/s
Intel Stratix 10 MX* FPGA | FPGA model from 2017 | Up to 1,000 MHz | 512 GB/s
Nvidia A100 GPU | First quarter 2020 | 765 MHz (base), 1,410 MHz (boost) | 1,555 GB/s
Xilinx Alveo U55C | Fourth quarter 2021 | Up to 1,028 MHz | 460 GB/s
Intel Agilex 7 FPGA* M-Series 039 | FPGA model from first quarter 2022 | Up to 800 MHz | 1,000 GB/s
Nvidia H100 GPU | First quarter 2022 | 1,095 MHz (base), 1,755 MHz (boost) | 2,039 GB/s

*The entry is an FPGA integrated circuit (chiplet) model to be integrated into a hardware package or module with other components, not a commercially available ready-to-use accelerator itself.



It is worth noting that FPGA working clock frequencies depend on the specific hardware design synthesized and rarely come close to the reported theoretical maxima (shown in the table), especially when using a high-level synthesis language and compiler (such as OpenCL) instead of HDLs. For compute-intensive or HPC kernels, the cited works report working frequencies that usually range between 200 and 300 MHz on Arria 10 GX devices, which is 25% to 37.5%, respectively, of their reported peak frequency. In the case of Stratix 10 GX devices, it is usual to achieve working frequencies that range between 300 and 400 MHz (30% to 40% of their peak frequency).

Lower floating-point performance
One of the major limitations of FPGAs in the context of HPC seems to be the low single- and double-precision floating-point performance that could be achieved with such devices.19 Some recent research works, such as two by Calore and Schifano,20,21 attempted to measure it in the context of the roofline model. These works offer a performance estimation of function kernels developed with high-level synthesis tools, revealing some of the FPGA limitations in the context of HPC.
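The roofline model used in those works bounds attainable kernel performance by two device ceilings. The formula below, with an illustrative example assembled from the Table 1 figures (not a measured result), shows why memory bandwidth so often becomes the binding constraint on FPGAs:

```latex
% Roofline model: attainable performance is the lower of the compute
% ceiling and the memory ceiling, where I is the arithmetic intensity
% of the kernel in FLOP/byte and BW_peak is the off-chip bandwidth.
P_{\mathrm{attainable}} = \min\left( P_{\mathrm{peak}},\; I \cdot BW_{\mathrm{peak}} \right)
```

For instance, a hypothetical streaming kernel with I = 0.25 FLOP/byte on a 76.8-GB/s board (the Stratix 10 GX-based D5005 in Table 1) is capped at 0.25 × 76.8 = 19.2 GFLOP/s, no matter how much programmable logic remains unused.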
ON THE POTENTIAL OF FPGAs IN MODERN HPC WORKLOADS
Regarding the question of whether current FPGAs are suitable to accelerate modern HPC workloads, we now contextualize the points discussed above for the case of real-world applications. To review the potential of FPGAs in HPC contexts, not only must the absolute performance of FPGAs when executing common HPC tasks be considered but also their performance relative to other accelerators, since the potential of FPGAs is conditioned by the other available alternatives for HPC accelerators.

Back in 2014, Véstias et al.19 reviewed the trends of CPU, GPU, and FPGA devices for HPC. Their conclusions were that FPGAs were not keeping pace with other platforms in terms of performance, which caused HPC applications to be migrated to other, more powerful platforms, such as software-based manycore systems (CPUs and GPUs). The authors noted that FPGAs become competitive when working with applications with specific constructs or requirements for which general-purpose computing devices are not suited (for example, applications with operands with custom/user-defined data widths as well as combinational logic problems, finite-state machines, and parallel MapReduce problems). The authors also noted a trend for the HPC community toward adopting hybrid (heterogeneous) platforms with a mix of different kinds of devices working together.

As that work is almost a decade old, it is worth reviewing more recent works so as to analyze whether these trends have continued. For example, in 2020, Nguyen et al.11 explored the potential of FPGAs in HPC environments, testing both an Intel FPGA (Arria 10 GX) and a Xilinx FPGA (Alveo U280) and comparing them against other accelerators (namely, an Intel Xeon CPU and an Nvidia V100 GPU). They found that the single-precision FPGA performance and bandwidth still fall far below GPUs for compute- and memory-intensive tasks; however, FPGAs can deliver nearly the same energy efficiency as GPUs for most applications, and even exceed it in some cases. The authors also noted that FPGAs are likely to continue being competitive in areas for which GPU and CPU computing models do not match the nature of the problem. Their work led them to interesting conclusions. First, they point to the low memory bandwidths of the FPGAs as the main limiting factor for achieving high performance on FPGAs. Second, they also note that exploiting such bandwidths is rather difficult, and only a very small fraction of the theoretical memory bandwidth is achieved by unoptimized codes. Third, they note that FPGAs are not power-proportional devices, in the sense that a significant increase in performance might require only a moderate increase in power consumption. This contrasts with CPUs and GPUs, where the power consumption increase is more pronounced. Their conclusions were that vendors should prioritize maximizing development productivity for FPGAs rather than increasing their amount and type of resources, as their FPGA implementations required orders of magnitude more software development time than the equivalent (and often superior) CPU and GPU implementations.


Other works22 note that FPGAs might present certain advantages compared to GPUs for applications that can exploit temporal blocking or other forms of highly pipelined parallelism.

FPGAs can be beneficial in scientific computing applications where latency and predictability of execution times are crucial, such as in urgent HPC scenarios, including interactive prototyping, urgent streaming data analysis, application steering, and in situ visualization. There are several reasons for this. First, FPGAs excel in providing low-latency processing. Unlike CPUs and GPUs, which have fixed hardware structures and instruction sets, FPGAs can be configured to perform specific computations directly in hardware, reducing their overhead. This is particularly relevant in the case of irregular applications, where the single-instruction, multiple-data paradigm cannot be applied. Second, FPGAs offer more predictable performance compared to CPUs and GPUs. Since FPGAs can be configured with specific hardware paths for given tasks, they can execute these tasks consistently without the unpredictability introduced by shared resources (like caches or memory buses) in general-purpose processors. This predictability is critical in applications where timing and consistency of computation are vital. Third, FPGAs can be tailored for specific algorithms or data processing tasks. This customization allows for highly efficient execution of particular tasks in scientific computing, such as data analysis or simulation, which can be critical in urgent computing scenarios where quick, accurate results are required. For these reasons, FPGA benefits in scenarios like the ones described previously can be substantial, especially when immediate data processing and decision making are crucial.

For some years, FPGAs have been considered well suited for deep learning computations, due to the pipelined nature of deep learning models and the potential to optimize them by means of custom irregular data types as well as irregular algorithms. Nurvitadhi et al.23 conclude that recent trends in deep neural network algorithms might favor FPGAs over GPUs and that FPGAs have the potential to become the platform of choice for accelerating deep neural networks, offering superior performance. Nevertheless, the deep learning market has been of significant importance in recent times, and many vendors of electronic components (including CPUs, GPUs, and ASICs) have tried to get into and expand inside that market. Since the publication of that work, two major breakthroughs have been made concerning hardware acceleration of deep learning applications. First, many GPUs have started to include dedicated hardware for AI acceleration, such as Nvidia's tensor cores. Second, Google launched the TPU, an ASIC for AI acceleration. These new kinds of hardware attempt to accelerate AI tasks that GPUs are not well suited for, including algorithms dealing with custom irregular data types. Thus, both of them pose new challenges to FPGAs in becoming the platform of choice to accelerate deep neural networks. Over recent years, we have seen a significant increment in TPU and tensor core GPU utilization for accelerating real-world AI tasks; however, FPGAs do not seem to have made significant progress in this field, and nowadays, they are not the platform of choice for accelerating large-scale deep neural networks.

WHAT ABOUT THE USE OF FPGAs AS COOPERATIVE DEVICES IN HETEROGENEOUS SYSTEMS?
Besides the potential use of FPGAs as stand-alone devices for the acceleration of HPC workloads, whose strengths and limitations were described previously, there is also certain interest in studying the potential use of FPGAs cooperatively with other devices, both FPGAs and other accelerators, so as to exploit all the available resources of a given heterogeneous, potentially distributed system.

Many works explore the possibilities of using FPGAs cooperatively in heterogeneous environments. Some explore the possibility of using FPGA-powered network interface cards to carry out CPU-less processing of incoming and outgoing network data, thus reducing latency. This can be applied to inter-FPGA communications to efficiently connect multiple distributed FPGAs together. Other works discuss direct memory access (DMA) mechanisms to connect GPUs and FPGAs together in order to efficiently communicate different kinds of accelerators, from two perspectives: either the GPU is the peripheral component interconnect express (PCIe) master, or this task is assigned to the FPGA. Both approaches show that performance penalties are incurred for DMA transfers in which the PCIe master is the destination device. These two techniques can be combined to enable efficient cooperative work between GPUs and FPGAs over different nodes.24
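As a hedged sketch of the baseline that these DMA techniques improve upon, the following host-side fragment stages data between a GPU and an FPGA through host memory using two standard OpenCL copies. All handles are assumed to have been created beforehand, and the function itself is our own illustration, not code from the cited work:

```c
#include <CL/cl.h>

/* Baseline GPU-to-FPGA transfer without peer-to-peer DMA: the data
 * takes a round trip through a host staging buffer. */
void stage_through_host(cl_command_queue gpu_queue,
                        cl_command_queue fpga_queue,
                        cl_mem gpu_buf, cl_mem fpga_buf,
                        void *host_staging, size_t nbytes)
{
    /* Copy 1: GPU device memory -> host (blocking). */
    clEnqueueReadBuffer(gpu_queue, gpu_buf, CL_TRUE, 0, nbytes,
                        host_staging, 0, NULL, NULL);
    /* Copy 2: host -> FPGA device memory (blocking). */
    clEnqueueWriteBuffer(fpga_queue, fpga_buf, CL_TRUE, 0, nbytes,
                         host_staging, 0, NULL, NULL);
}
```

Peer-to-peer DMA removes the round trip through host_staging; the cited results suggest that which device acts as the PCIe master then determines where the remaining transfer penalties appear.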
Although these are promising techniques for heterogeneous environments, there do not seem to be many real-case applications that clearly benefit from cooperative FPGA approaches. While GPU technology is making significant progress in the distributed multi-GPU field for real-world applications, the multi-FPGA and hybrid GPU–FPGA fields seem to be considerably less explored.



One major cause for this seems to be the lower scaling capabilities of FPGA devices, which hinder the development of multi-FPGA solutions, as well as the higher hardware costs, which hinder the integration of FPGAs into heterogeneous clusters already including GPUs. Besides this, the existence of applications with computational patterns that would benefit from simultaneous GPU and FPGA acceleration is still unclear.

Overall, modern FPGA technology focused on HPC environments still presents important limitations that put FPGA devices at a disadvantage compared to GPUs, namely, low memory bandwidth and size, lower raw computational power, the need for sophisticated manual tuning due to poor automatic compiler optimizations, development complexity, and very long compilation times. FPGAs can still prove useful in the acceleration of irregular tasks for which general-purpose architectures (CPU and GPU) are poorly optimized, such as tasks with irregular data types or algorithms, as long as it is not profitable to build and deploy ASICs for those applications. FPGAs also show potential for accelerating tasks in environments where flexibility and/or energy efficiency are crucial. Nevertheless, FPGA technology still has to make some progress, both in hardware capabilities and ease of development, to become competitive at accelerating most modern HPC workloads.

ACKNOWLEDGMENT
The work of Manuel de Castro, Yuri Torres, and Diego R. Llanos has been supported in part by Grant PID2022-142292NB-I00 (NATASHA Project), funded by MCIN/AEI/10.13039/501100011033, and by the European Regional Development Fund's A Way of Making Europe project. Yuri Torres and Diego R. Llanos have been supported in part by Junta de Castilla y León FEDER Grant VA226P20 (PROPHET-2 Project). Diego R. Llanos has been supported in part by Grant TED2021-130367B-I00, funded by MCIN/AEI/10.13039/501100011033, and by Next Generation EU Plan de Recuperación, Transformación, y Resiliencia. The work of David L. Vilariño has been supported by Grants PID2022-141623NB-I00 and PID2019-104834GB-I00 (funded by MCIN/AEI/10.13039/501100011033/FEDER, UE) and by the Conselleria de Cultura, Educacion, e Ordenacion Universitaria, Xunta de Galicia (Accreditation ED431C 2022/16). Thanks to the anonymous reviewers for many useful suggestions.

REFERENCES
1. N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 1–12, Jun. 2017, doi: 10.1145/3140659.3080246.
2. M. Bedford Taylor, "The evolution of bitcoin hardware," Computer, vol. 50, no. 9, pp. 58–66, Sep. 2017, doi: 10.1109/MC.2017.3571056.
3. C. Stephen et al., "Examining the viability of FPGA supercomputing," EURASIP J. Embedded Syst., vol. 2007, Jan. 2007, Art. no. 93652, doi: 10.1155/2007/93652.
4. D. H. Jones et al., "GPU versus FPGA for high productivity computing," in Proc. Int. Conf. Field Programmable Logic Appl., 2010, pp. 119–124, doi: 10.1109/FPL.2010.32.
5. N. Brown, "Weighing up the new kid on the block: Impressions of using Vitis for HPC software development," in Proc. 30th Int. Conf. Field-Programmable Logic Appl. (FPL), 2020, pp. 335–340, doi: 10.1109/FPL50879.2020.00062.
6. G. Alonso and P. Bailis, "Research for practice: FPGAs in datacenters," Commun. ACM, vol. 61, no. 9, pp. 48–49, 2018, doi: 10.1145/3209275.
7. M. Baity-Jesi et al., "An FPGA-based supercomputer for statistical physics: The weird case of Janus," in High-Performance Computing Using FPGAs, W. Vanderbauwhede and K. Benkrid, Eds., New York, NY, USA: Springer-Verlag, 2013, pp. 481–506.
8. "Top500 NEWS: Good times for FPGA enthusiasts," Top500, Sinsheim, Germany, 2016. Accessed: Sep. 2023. [Online]. Available: https://www.top500.org/news/good-times-for-fpga-enthusiasts
9. S. R. Hines, "Improving processor efficiency through enhanced instruction fetch," Ph.D. thesis, Florida State Univ., Tallahassee, FL, USA, 2008.
10. S. Kestur, J. D. Davis, and O. Williams, "BLAS comparison on FPGA, CPU and GPU," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Piscataway, NJ, USA: IEEE, 2010, pp. 288–293, doi: 10.1109/ISVLSI.2010.84.
11. T. Nguyen, S. Williams, M. Siracusa, C. MacLean, D. Doerfler, and N. J. Wright, "The performance and energy efficiency potential of FPGAs in scientific computing," in Proc. IEEE/ACM Perform. Model., Benchmarking Simul. High Perform. Comput. Syst. (PMBS), Piscataway, NJ, USA: IEEE, 2020, pp. 8–19, doi: 10.1109/PMBS51919.2020.00007.
12. K. Vipin and S. A. Fahmy, "FPGA dynamic and partial reconfiguration: A survey of architectures, methods, and applications," ACM Comput. Surv., vol. 51, no. 4, Jul. 2018, Art. no. 72, doi: 10.1145/3193827.
13. J. Cong et al., "FPGA HLS today: Successes, challenges, and opportunities," ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 4, pp. 1–42, Aug. 2022, doi: 10.1145/3530775.

14. K. Krommydas et al., "Bridging the performance-programmability gap for FPGAs via OpenCL: A case study with OpenDwarfs," in Proc. IEEE 24th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2016, pp. 198–198, doi: 10.1109/FCCM.2016.56.
15. H. R. Zohouri et al., "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC), Nov. 2016, pp. 409–420, doi: 10.1109/SC.2016.34.
16. H. K.-H. So and C. Liu, "FPGA overlays," in FPGAs for Software Programmers, D. Koch, F. Hannig, and D. Ziener, Eds., Cham, Switzerland: Springer-Verlag, 2016, pp. 285–305.
17. J. Cong et al., "Understanding performance differences of FPGAs and GPUs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), New York, NY, USA: Association for Computing Machinery, 2018, p. 288, doi: 10.1145/3174243.3174970.
18. H. R. Zohouri et al., "The memory controller wall: Benchmarking the Intel FPGA SDK for OpenCL memory interface," in Proc. IEEE/ACM Int. Workshop Heterogeneous High-Perform. Reconfigurable Comput. (H2RC), Nov. 2019, pp. 11–18, doi: 10.1109/H2RC49586.2019.00007.
19. M. Véstias et al., "Trends of CPU, GPU and FPGA for high-performance computing," in Proc. 24th Int. Conf. Field Programmable Logic Appl. (FPL), 2014, pp. 1–6, doi: 10.1109/FPL.2014.6927483.
20. E. Calore and S. F. Schifano, "Performance assessment of FPGAs as HPC accelerators using the FPGA empirical roofline," in Proc. 31st Int. Conf. Field-Programmable Logic Appl. (FPL), Piscataway, NJ, USA: IEEE, 2021, pp. 83–90, doi: 10.1109/FPL53798.2021.00022.
21. E. Calore and S. F. Schifano, "FER: A benchmark for the roofline analysis of FPGA based HPC accelerators," IEEE Access, vol. 10, pp. 94,220–94,234, 2022, doi: 10.1109/ACCESS.2022.3203566.
22. H. R. Zohouri et al., "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), New York, NY, USA: Association for Computing Machinery, 2018, pp. 153–162, doi: 10.1145/3174243.3174248.
23. E. Nurvitadhi et al., "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?" in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), New York, NY, USA: Association for Computing Machinery, 2017, pp. 5–14, doi: 10.1145/3020078.3021740.
24. R. Kobayashi et al., "OpenCL-enabled high performance direct memory access for GPU-FPGA cooperative computation," in Proc. HPC Asia Workshops (HPCAsia Workshops), New York, NY, USA: Association for Computing Machinery, 2019, pp. 6–9, doi: 10.1145/3317576.3317581.

ABOUT THE AUTHORS

MANUEL de CASTRO is a Ph.D. candidate in the Department of Computer Science, University of Valladolid, 47011 Valladolid, Spain. His research interests include parallel and distributed computing and GPU and field-programmable gate array programming. de Castro received an M.S. in computer science from the Universidade de Coruña. Contact him at [email protected].

DAVID L. VILARIÑO is an associate professor in the Department of Electronic and Computation, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain. His research interests include the design of algorithms and special-purpose hardware modules for reconfigurable architectures (field-programmable gate array and coarse-grain reconfigurable architecture), with a focus on fast and efficient computation. Vilariño received a Ph.D. in computer science from the Universidade de Santiago de Compostela. Contact him at [email protected].

YURI TORRES is an associate professor in the Department of Computer Science, University of Valladolid, 47011 Valladolid, Spain. His research interests include parallel and distributed computing, parallel programming models, and embedded computing. Torres received a Ph.D. in computer science from the University of Valladolid. Contact him at [email protected].

DIEGO R. LLANOS is a full professor in the Department of Computer Science, University of Valladolid, 47011 Valladolid, Spain. His research interests include parallel and distributed computing, the Internet of Things, and embedded systems. Llanos received a Ph.D. in computer science from the University of Valladolid. He is a Senior Member of IEEE and the Association for Computing Machinery. Contact him at [email protected].