The Role of Field-Programmable Gate Arrays in the Acceleration of Modern High-Performance Computing Workloads
In the early 2000s, the growth of single-core CPU performance slowed down significantly with respect to previous decades. This caused new techniques and design paradigms, such as parallel (multicore) or vector processing, to emerge as alternatives to further increase CPU performance. Scientists also started investigating the potential use of GPUs as high-performance computational units for floating-point-intensive computations. That encouraged the main GPU vendors to develop frameworks, languages, and runtime environments to ease the programming of GPUs for purposes beyond graphics processing. Consequently, general-purpose computing on GPUs was born. This entailed a paradigm shift for the high-performance computing (HPC) community, as heterogeneous systems including regular CPUs and specialized hardware accelerators became the standard for supercomputers, and data parallelism took the spotlight.
As a consequence of this shift toward heterogeneous systems, different kinds of hardware accelerators, from GPUs to field-programmable gate arrays (FPGAs) to application-specific integrated circuits (ASICs), have appeared during the last two decades. Among them, FPGAs have recently gained interest in the literature as a promising HPC platform. However, there exists a sharp contrast between this increasing research interest in FPGAs' theoretical capabilities and their low general adoption. This situation begs some questions: Are current data center FPGAs well suited for accelerating modern HPC workloads? When and how is it advisable to leverage FPGA devices to accelerate scientific computations? Let us discuss these topics in more detail by first putting heterogeneous accelerators in perspective and, later, analyzing the characteristics, advantages, and drawbacks of FPGAs, including their programmability and the portability of their code, to offer an answer to these questions.

GPUs: THE STANDARD HPC ACCELERATOR
As efforts to increase processing performance since the early 2000s have focused on parallel computing and its many forms, GPUs have revolutionized the field, due to their massively parallel architectures. GPUs include thousands of processing cores, simpler than the ones used for CPUs, which are designed so that all of them perform the same computations (that is, instructions) on different and independent datasets. Even though each individual GPU core is considerably less computationally powerful than a CPU core, the sheer number of them that a single device can contain makes GPUs superior to CPUs when it comes to data-parallel processing, both in raw performance and energy efficiency.

The high interest in GPUs manifested by the HPC community from the beginning has greatly influenced the industry. We highlight here two main consequences. First, GPU vendors started assembling what we may call "general-purpose versions" of their cards, adding error-correcting code memory and other features to better suit HPC needs. More recently, mainly due to the artificial intelligence (AI) market (and its convergence with HPC), GPU vendors also started to develop GPUs with scientific/AI computations in mind. Second, programming languages, frameworks, and models for heterogeneous computing mainly targeting GPUs have been created. Thus, their design philosophy has been GPU centric, or at least data-parallelism centric. For example, OpenCL, SYCL, and Data Parallel C++ include programming constructs that map particularly well to GPU architectures, even though all of them are designed to work with a wide range of computing devices, not only GPUs. CPUs can easily translate these constructs to their own architectural resources and efficiently work with them, but this is not the case for all computing devices supported by these models (for example, FPGAs).
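As a concrete illustration of such constructs, the following minimal SYCL sketch (ours, not taken from the article; it assumes a SYCL 2020 compiler such as Intel's oneAPI DPC++) expresses a vector addition as one logical work item per element, the data-parallel style that maps naturally onto thousands of simple GPU cores yet can be dispatched to any supported device:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // The runtime picks a device: a CPU, a GPU, or even an FPGA.
    sycl::queue q{sycl::default_selector_v};
    {
        sycl::buffer<float> A{a}, B{b}, C{c};
        q.submit([&](sycl::handler& h) {
            sycl::accessor pa{A, h, sycl::read_only};
            sycl::accessor pb{B, h, sycl::read_only};
            sycl::accessor pc{C, h, sycl::write_only, sycl::no_init};
            // One work item per element: the construct that maps
            // particularly well to massively parallel GPUs.
            h.parallel_for(sycl::range<1>{n},
                           [=](sycl::id<1> i) { pc[i] = pa[i] + pb[i]; });
        });
    } // Buffer destructors synchronize and copy results back into c.
    return 0;
}
```

A CPU can run the same code by mapping work items onto threads and vector lanes; how well an FPGA can do so is a different matter, as discussed later.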
To maintain GPU dominance, vendors have recently started including more specific hardware in their devices, which further accelerates tasks of high current interest. For example, since 2018, Nvidia GPUs have included dedicated tensor cores for the acceleration of deep learning workloads.

ASICs: SPECIFIC-PURPOSE ACCELERATORS
Some computational algorithms only moderately benefited from conventional GPU architectures, while others needed to be accelerated even further. In these cases, ASICs came to the rescue. ASICs are designed and built solely to solve the particular task of interest, with both increased performance and better energy efficiency as compared to those achievable by a CPU or GPU. This is another form of heterogeneous computing, although the adoption of ASICs is often limited to certain market niches, due to their specific nature. In research contexts, one of the most currently used ASICs is the tensor processing unit (TPU), developed by Google for neural network machine learning acceleration.1 Another example is the use of ASICs in the context of bitcoin mining.2

FPGAs: RECONFIGURABLE HARDWARE ACCELERATORS
FPGAs are reconfigurable hardware devices. They can be used to synthesize different hardware designs or architectures over and over again. FPGAs were introduced in the mid-1980s by Xilinx (now AMD) as a natural evolution of programmable logic devices. They were initially intended to serve as glue logic and to prototype small digital circuits.
Since the beginning of the pre-multicore-CPU era, FPGAs appeared as an excellent proof-of-concept device to shorten the software development cycle for ASICs, as this development was allowed to start before any test chip had been manufactured. The increase in the available logic cells, together with large random-access memory blocks, digital signal processor (DSP) arithmetic units, and even embedded microprocessors, moved FPGA usage beyond proof-of-concept prototyping to final production on their own. Thus, in the 2000s, high-performance FPGA-based architectures were developed. At that time, FPGAs already exhibited high efficiency as accelerators of applications in a wide variety of areas, such as cryptography, signal processing, genomics, or pattern recognition, to name just a few. As a consequence, they were adopted as accelerator devices in some supercomputing clusters.3

In the mid-2000s, GPUs came into the game as a serious rival of FPGAs. Even if FPGAs were initially competitive against GPUs, the fast development of the latter, and, more importantly, Nvidia's delivery of the CUDA platform in 2007, restricted FPGAs to embedded application domains where energy efficiency was critical, and GPUs took their place as accelerators in HPC clusters.4

In fact, the decline of FPGAs was not just an issue of computing performance or efficiency but also a problem of productivity. The programming of FPGAs required working at the register transfer level (RTL) with intricate hardware description languages (HDLs), such as VHDL or Verilog, which are rather less user-friendly than high-level programming languages and models. As in the case of GPUs, which were first programmed using clever tricks to take advantage of their capabilities, with their vendors later developing friendlier programming environments, the main FPGA vendors have made efforts to provide high-level synthesis (HLS) tools, such as AMD's Vitis, which allow FPGA applications to be developed from a software perspective, viewing programmable logic as a computational resource instead of a hardware system.5
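To give a flavor of this software perspective, here is a minimal, hypothetical sketch of an HLS kernel (our illustration, not code from the article). A plain C++ loop is annotated with standard Vitis HLS pragmas, from which the tool synthesizes a pipelined hardware datapath:

```cpp
// Hypothetical Vitis HLS kernel computing y[i] = a*x[i] + y[i] (SAXPY).
extern "C" void saxpy(const float* x, float* y, float a, int n) {
    // Map the pointer arguments onto AXI master ports of the FPGA design.
#pragma HLS INTERFACE m_axi port=x bundle=gmem0
#pragma HLS INTERFACE m_axi port=y bundle=gmem1
    for (int i = 0; i < n; ++i) {
        // Start a new loop iteration every clock cycle (initiation
        // interval of 1), turning the loop body into a hardware pipeline.
#pragma HLS PIPELINE II=1
        y[i] = a * x[i] + y[i];
    }
}
```

No clocks, registers, or handshaking signals appear in the source; the pragmas merely guide how the compiler lays the loop out in hardware.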
However, this improvement comes at the cost of increased compilation time. The translation of HLS code to the RTL, and from there to the desired FPGA configuration, involves multiple optimization steps to map the design onto the target FPGA architecture, and it usually takes a significant amount of time, on the order of hours.

Progress in the computing capabilities of FPGA technology has also been made so that these kinds of devices might be leveraged in research facilities, data centers, computing centers, and other similar environments. Several projects were also conducted in this regard, such as, for example, the Horizon 2020 Future and Emerging Technologies-HPC EuroEXA project (https://euroexa.eu/) and the Heterogeneous Accelerated Compute Clusters project (https://www.amd-haccs.io/), which remains ongoing. Regarding the integration of FPGAs in data centers, an overview of different developments is provided in Alonso and Bailis.6

Concerning vendors, there are currently two main FPGA vendors developing device models that target HPC contexts: Xilinx (property of AMD) and Intel (formerly Altera). Examples of HPC or data center accelerator FPGAs are Xilinx's Alveo and Versal FPGA families and Intel's Stratix 10 and Agilex FPGA families. However, even though data center FPGAs have been available for a few years now and interest among researchers has increased significantly, and although the prospect of near-future FPGA-powered supercomputers has existed since at least 2013,7,8 there has not been significant adoption of FPGA devices as general-purpose accelerators in the industry. For example, many of the TOP500 list's newest entries are multi-CPU-and-GPU supercomputers. One of the few FPGA-powered supercomputers found in the latest TOP500 list is Noctua 2, inaugurated in 2022, in Paderborn, Germany.

This situation leads us to the following question: Are FPGAs really useful to accelerate HPC workloads, where absolute performance is the ultimate goal? To try to answer this question, we should first understand why FPGA architecture and programmability are so special.

SPECIAL CHARACTERISTICS OF FPGAs
Reconfigurability is the main property of FPGAs. They contain an array of programmable logic blocks and routing.
Developing with HDLs also requires the programmer to explicitly manage the communication of the FPGA (device) with the CPU (host) for data movement and task dispatching, which is architecture dependent. Thus, using HDL languages is deemed unfeasible in HPC contexts.

To alleviate these issues, HLS languages and frameworks have been developed, which leverage high-level software programming languages (mainly C based) for hardware design. HLS has succeeded in several areas, including deep learning, video transcoding, graph processing, and genome sequencing.13 Examples of these languages are Vitis HLS (for Xilinx FPGAs only) and OpenCL (commonly used for Intel FPGAs, and previously for Xilinx ones too). OpenCL was designed from the beginning to target heterogeneous systems and allow all their resources to be efficiently exploited, and it has been extensively used for programming CPU + GPU applications. Its design philosophy is to enable code portability across many different computing devices, that is, to be able to write a single device-agnostic code and execute it on any OpenCL-supported device (including CPUs, GPUs, and FPGAs). This, in theory, is perfect for heterogeneous computing, especially for FPGAs. Not only does the language abstract away the complex low-level details of hardware design behind a popular software programming language, but it also allows any code written targeting any other accelerator (namely, GPUs) to execute on an FPGA.
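For instance, a device-agnostic OpenCL C kernel such as the following sketch (ours, not from the article) compiles unmodified for a CPU, a GPU, or an FPGA; only the host-side device selection changes:

```c
// Device-agnostic OpenCL C kernel: one work item per output element.
// The host enqueues an NDRange of n work items; each one reads its
// global id and processes a single element, the same style used on GPUs.
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c,
                      const int n) {
    int i = get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
}
```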
Nevertheless, theory and reality are often known to differ. While it is true that OpenCL provides code portability across supported devices, it does not guarantee performance portability. Moreover, its high verbosity and the lack of support from important vendors (for example, Nvidia) have made it less commonly used lately. In the particular case of FPGA accelerators, although they are able to properly execute device-agnostic or GPU-optimized OpenCL code, the performance they achieve with such codes is, in general, considerably low.14,15 Some optimization techniques are known to alleviate this situation (see "FPGA-Specific Optimization Techniques").
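As one example of what such FPGA-specific optimization looks like (our sketch of a well-known idiom for the Intel FPGA OpenCL toolchain, not code from the article), the NDRange kernel shown earlier would typically be rewritten as a single work-item kernel, letting the compiler build one deep pipeline instead of emulating thousands of threads:

```c
// FPGA-oriented rewrite as a single work-item kernel: the compiler
// pipelines the loop, and #pragma unroll widens the datapath so that
// several elements are consumed per clock cycle. The restrict
// qualifiers rule out pointer aliasing, which would otherwise
// serialize the pipeline.
__kernel void vec_add_fpga(__global const float* restrict a,
                           __global const float* restrict b,
                           __global float* restrict c,
                           const int n) {
    for (int i = 0; i < n; i += 4) {
        #pragma unroll
        for (int j = 0; j < 4; ++j)
            if (i + j < n)
                c[i + j] = a[i + j] + b[i + j];
    }
}
```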
Although we centered our discussion on OpenCL capabilities, it is worth noting that these conclusions may be extended to any programming model or framework targeting different kinds of accelerators (namely, GPUs and FPGAs), such as SYCL and all its derived implementations, although their actual performance depends on the particular application considered and the internal compiler optimizations available. Other pragma-based languages, such as OpenACC and OpenMP, are also used for this purpose.

Other languages and frameworks used for high-level synthesis use C pragmas to target particular devices. For example, Vitis HLS uses pragmas to target AMD Xilinx FPGAs. Pragma annotations also allow the same code to be targeted at several architectures at the same time, whereas the use of OpenCL forces the rewriting of the code to take advantage of architectures whose vendors do not support OpenCL. Consequently, the use of C pragmas leads to an FPGA-centric design philosophy, which might result in fewer efforts and complexities to optimize naive or device-agnostic codes for FPGA execution. However, this optimization step is still unavoidable.
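The following sketch (ours; the function is hypothetical) illustrates that coexistence: the same C++ loop carries both an OpenMP offload pragma and a Vitis HLS pragma, and each toolchain simply ignores the annotation it does not recognize:

```cpp
// One loop, two target annotations. An OpenMP compiler offloads the
// loop to a GPU and ignores the HLS pragma; Vitis HLS synthesizes a
// pipeline and ignores the OpenMP pragma. (Hypothetical illustration.)
void scale(float* data, float factor, int n) {
#pragma omp target teams distribute parallel for map(tofrom: data[0:n])
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        data[i] *= factor;
    }
}
```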
Overall, concerning the programmability of FPGAs for HPC applications, as of today, it seems unfeasible to rely only on compiler optimizations to efficiently execute device-agnostic code on FPGAs. Therefore, HPC researchers and engineers are expected to have some knowledge of the underlying architecture when trying to maximize performance for FPGA devices.

Moreover, the compilation of FPGA codes is a time-consuming process, especially when high-level languages are used to describe sophisticated algorithms that lead to complex hardware descriptions. HPC kernels for FPGAs are known to take several hours to compile, which further adds to the development costs associated with these devices. Overlay architectures for FPGAs show potential in reducing the long compilation and reconfiguration times traditionally associated with FPGA deployment. By providing a higher-level abstraction, overlays can simplify FPGA programming, making it more accessible and quicker to adapt to different applications.16 This approach allows for rapid prototyping and iteration, which is crucial in research and development settings.
Low memory bandwidth is another key limitation of current FPGA devices, and it probably constitutes the main limiting factor for FPGAs to achieve high performance in numerous applications. Most bandwidth limits on FPGAs come from the use of Double Data Rate 4 (DDR4) technology, while GPUs have been using faster memory technology for some years now. This limitation is even more relevant when considering that available FPGA boards do not support the memory sizes available in GPUs, and getting data in and out of these cards is expensive and can easily destroy any potential benefit in the computation. FPGAs are designed for flexibility and programmability, with their architecture consisting of an array of programmable logic blocks and routing. This flexibility comes at the cost of not being optimized for high memory bandwidth in the same way GPUs are, since GPUs are designed with parallelism and high-bandwidth memory interfaces from the outset.

Zohouri et al.18 present a comprehensive analysis of the memory controller and memory bandwidth efficiency of Intel FPGAs, concluding that to achieve high memory performance, FPGA kernels must meet multiple and strict requirements related to access patterns, alignment, and memory hierarchy usage. These requirements are hard to meet in real-world applications, and thus, for many applications, it might not be possible to achieve more than 70% of the peak memory bandwidth. Overall, the low off-chip memory bandwidth compared to CPUs and GPUs, as well as the difficulties to efficiently exploit that bandwidth, put FPGA accelerators at a disadvantage against other accelerators for many applications.
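A back-of-the-envelope roofline bound shows how directly this caps performance (our arithmetic, using bandwidth figures from Table 1): a memory-bound kernel with arithmetic intensity I flops per byte cannot exceed I x B flops/s on a device with B bytes/s of memory bandwidth.

```cpp
#include <cstdio>

// Memory-bound ceiling: performance <= arithmetic_intensity * bandwidth.
// SAXPY performs 2 flops per element while moving 12 bytes (two 4-byte
// reads and one 4-byte write), so its intensity is 2/12 flops per byte.
int main() {
    const double intensity    = 2.0 / 12.0; // flops per byte (SAXPY)
    const double bw_fpga_ddr4 = 34.8e9;     // Arria 10 board (Table 1), B/s
    const double bw_gpu_hbm2  = 900.0e9;    // Tesla V100 (Table 1), B/s
    std::printf("FPGA ceiling: %.1f Gflop/s\n", intensity * bw_fpga_ddr4 / 1e9);
    std::printf("GPU ceiling:  %.1f Gflop/s\n", intensity * bw_gpu_hbm2 / 1e9);
    // Prints roughly 5.8 versus 150.0 Gflop/s: a ~26x gap that no amount
    // of logic-side optimization on the FPGA can close for this kernel.
    return 0;
}
```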
The cited works conducted their research using older FPGA and GPU models, so their conclusions might not be representative of the current state of the art. To provide some insight into how the state of the art might have changed since those works were published, Table 1 provides a comparison of clock frequencies and memory bandwidths among different Intel FPGA and Nvidia GPU models, including recent ones. Comparing FPGAs and GPUs just in terms of maximum clock frequency is an oversimplification based on theoretical hard limits and should be taken lightly. It is worth noting that FPGA working clock frequencies depend on the specific hardware design synthesized and rarely come close to the reported theoretical maxima (shown in the table), especially when the design can exploit temporal blocking or other forms of high pipelined parallelism.
TABLE 1. A comparison of the clock frequency and peak memory bandwidth of several Xilinx and Intel FPGA and Nvidia GPU models, sorted by release date.

| Device | Release date | Processing clock frequency | Peak memory bandwidth |
| --- | --- | --- | --- |
| Nvidia Tesla V100 GPU | First quarter 2017 | 1,245 MHz (base), 1,380 MHz (boost) | 900 GB/s |
| Intel PAC with Intel Arria 10 GX FPGA | Fourth quarter 2017 (FPGA model from 2013) | Up to 800 MHz | 34.8 GB/s |
| Intel FPGA PAC D5005 (with Intel Stratix 10 GX) | Fourth quarter 2019 (FPGA model from 2013) | Up to 1,000 MHz | 76.8 GB/s |
| Intel Stratix 10 MX FPGA* | FPGA model from 2017 | Up to 1,000 MHz | 512 GB/s |
| Nvidia A100 GPU | First quarter 2020 | 765 MHz (base), 1,410 MHz (boost) | 1,555 GB/s |
| Xilinx Alveo U55C | Fourth quarter 2021 | Up to 1,028 MHz | 460 GB/s |
| Intel Agilex 7 M-Series 039 FPGA* | FPGA model from first quarter 2022 | Up to 800 MHz | 1,000 GB/s |
| Nvidia H100 GPU | First quarter 2022 | 1,095 MHz (base), 1,755 MHz (boost) | 2,039 GB/s |

*The entry is an FPGA integrated circuit (chiplet) model to be integrated into a hardware package or module with other components, not a commercially available ready-to-use accelerator itself.
FPGAs can be beneficial in scientific computing applications where latency and predictability of execution times are crucial, such as in urgent HPC scenarios, including interactive prototyping, urgent streaming data analysis, application steering, and in situ visualization. There are several reasons for this. First, FPGAs excel in providing low-latency processing. Unlike CPUs and GPUs, which have fixed hardware structures and instruction sets, FPGAs can be configured to perform specific computations directly in hardware, reducing their overhead. This is particularly relevant in the case of irregular applications, where the single-instruction, multiple-data paradigm cannot be applied. Second, FPGAs offer more predictable performance compared to CPUs and GPUs. Since FPGAs can be configured with specific hardware paths for given tasks, they can execute these tasks consistently, without the unpredictability introduced by shared resources (like caches or memory buses) in general-purpose processors. This predictability is critical in applications where timing and consistency of computation are vital. Third, FPGAs can be tailored for specific algorithms or data processing tasks. This customization allows for highly efficient execution of particular tasks in scientific computing, such as data analysis or simulation, which can be critical in urgent computing scenarios where quick, accurate results are required. For these reasons, FPGA benefits in scenarios like the ones described previously can be substantial, especially when immediate data processing and decision making are crucial.

For some years, FPGAs have been considered well suited for deep learning computations, due to the pipelined nature of deep learning models and the potential to optimize them by means of custom irregular data types as well as irregular algorithms. Nurvitadhi et al.23 conclude that recent trends in deep neural network algorithms might favor FPGAs over GPUs and that FPGAs have the potential to become the platform of choice for accelerating deep neural networks, offering superior performance.
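One concrete way to exploit such custom data types (our sketch; it uses the arbitrary-precision ap_fixed types that ship with Vitis HLS, applied here to a hypothetical quantized neuron) is to synthesize arithmetic at exactly the width the model needs, for example, 8-bit fixed-point multiply-accumulate units:

```cpp
#include <ap_fixed.h>  // Vitis HLS arbitrary-precision fixed-point types

// 8 total bits, 3 integer bits: range [-4, 4) with a step of 1/32.
typedef ap_fixed<8, 3> weight_t;
// Wider accumulator so that repeated additions cannot overflow.
typedef ap_fixed<24, 10> acc_t;

// Dot product of a quantized neuron: the synthesized multipliers are
// only 8 bits wide, far cheaper in FPGA resources than float32 units,
// so many more of them fit on the device and run in parallel.
acc_t dot(const weight_t w[64], const weight_t x[64]) {
    acc_t acc = 0;
    for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
        acc += w[i] * x[i];
    }
    return acc;
}
```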
Nevertheless, the deep learning market has been of significant importance in recent times, and many vendors of electronic components (including CPUs, GPUs, and ASICs) have tried to get into and expand inside that market. Since the publication of that work, two major breakthroughs have been made concerning hardware acceleration of deep learning applications. First, many GPUs have started to include dedicated hardware for AI acceleration, such as Nvidia's tensor cores. Second, Google launched the TPU, an ASIC for AI acceleration. These new kinds of hardware attempt to accelerate AI tasks that GPUs are not well suited for, including algorithms dealing with custom irregular data types. Thus, both of them pose new challenges to FPGAs to become the platform of choice to accelerate deep neural networks. Over recent years, we have seen a significant increment in TPU and tensor core GPU utilization for accelerating real-world AI tasks; however, FPGAs do not seem to have made significant progress in this field, and nowadays, they are not the platform of choice for accelerating large-scale deep neural networks.

WHAT ABOUT THE USE OF FPGAs AS COOPERATIVE DEVICES IN HETEROGENEOUS SYSTEMS?
Besides the potential use of FPGAs as stand-alone devices for the acceleration of HPC workloads, whose strengths and limitations were described previously, there is also certain interest in studying the potential use of FPGAs cooperatively with other devices, both FPGAs and other accelerators, so as to exploit all the available resources of a given heterogeneous, potentially distributed system.

Many works explore the possibilities of using FPGAs cooperatively in heterogeneous environments. Some explore the possibility of using FPGA-powered network interface cards to carry out CPU-less processing of incoming and outgoing network data, thus reducing latency. This can be applied to inter-FPGA communications to efficiently connect multiple distributed FPGAs together. Other works discuss direct memory access (DMA) mechanisms to connect GPUs and FPGAs in order to efficiently communicate different kinds of accelerators from different perspectives: either the GPU is the peripheral component interconnect express (PCIe) master, or this task is assigned to the FPGA. Both approaches show that performance penalties are incurred for DMA transfers in which the PCIe master is the destination device. These two techniques can be combined to enable efficient cooperative work between GPUs and FPGAs over different nodes.24

Although these are promising techniques for heterogeneous environments, there do not seem to be many real case applications that clearly benefit from cooperative FPGA approaches. While GPU technology is making significant progress in the distributed multi-GPU field for real-world applications, the multi-FPGA and hybrid GPU-FPGA fields seem to be considerably less explored. One major cause for this seems to be the lower scaling capabilities of FPGA devices, which hinder the development of such systems.
Overall, modern FPGA technology focused on HPC environments still presents important limitations that put FPGA devices at a disadvantage compared to GPUs, namely, low memory bandwidth and size, lower raw computational power, the need for sophisticated manual tuning due to poor automatic compiler optimizations, development complexity, and very long compilation times. FPGAs can still prove useful in the acceleration of irregular tasks for which general-purpose architectures (CPU and GPU) are poorly optimized, such as tasks with irregular data types or algorithms, as long as it is not profitable to build and deploy ASICs for those applications. FPGAs also show potential for accelerating tasks in environments where flexibility and/or energy efficiency are crucial. Nevertheless, FPGA technology still has to make some progress, both in hardware capabilities and ease of development, to become competitive at accelerating most modern HPC workloads.

ACKNOWLEDGMENT
The work of Manuel de Castro, Yuri Torres, and Diego R. Llanos has been supported in part by Grant PID2022-142292NB-I00 (NATASHA Project), funded by MCIN/AEI/10.13039/501100011033, and by the European Regional Development Fund's A Way of Making Europe project. Yuri Torres and Diego R. Llanos have been supported in part by Junta de Castilla y León, by Grant PID2019-104834GB-I00 (funded by MCIN/AEI/10.13039/501100011033/FEDER, UE), and by the Conselleria de Cultura, Educacion, e Ordenacion Universitaria, Xunta de Galicia (Accreditation ED431C 2022/16). Thanks to the anonymous reviewers for many useful suggestions.

REFERENCES
1. N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 1-12, Jun. 2017, doi: 10.1145/3140659.3080246.
2. M. Bedford Taylor, "The evolution of bitcoin hardware," Computer, vol. 50, no. 9, pp. 58-66, Sep. 2017, doi: 10.1109/MC.2017.3571056.
3. S. Craven et al., "Examining the viability of FPGA supercomputing," EURASIP J. Embedded Syst., vol. 2007, Jan. 2007, Art. no. 93652, doi: 10.1155/2007/93652.
4. D. H. Jones et al., "GPU versus FPGA for high productivity computing," in Proc. Int. Conf. Field Programmable Logic Appl., 2010, pp. 119-124, doi: 10.1109/FPL.2010.32.
5. N. Brown, "Weighing up the new kid on the block: Impressions of using Vitis for HPC software development," in Proc. 30th Int. Conf. Field-Programmable Logic Appl. (FPL), 2020, pp. 335-340, doi: 10.1109/FPL50879.2020.00062.
6. G. Alonso and P. Bailis, "Research for practice: FPGAs in datacenters," Commun. ACM, vol. 61, no. 9, pp. 48-49, 2018, doi: 10.1145/3209275.
8. "Good times for FPGA enthusiasts," TOP500, Germany, 2016. Accessed: Sep. 2023. [Online]. Available: https://www.top500.org/news/good-times-for-fpga-enthusiasts
9. S. R. Hines, "Improving processor efficiency through enhanced instruction fetch," Ph.D. thesis, Florida State Univ., Tallahassee, FL, USA, 2008.
10. S. Kestur, J. D. Davis, and O. Williams, "BLAS comparison on FPGA, CPU and GPU," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Piscataway, NJ, USA: IEEE, 2010, pp. 288-293, doi: 10.1109/ISVLSI.2010.84.
11. T. Nguyen, S. Williams, M. Siracusa, C. MacLean, D. Doerfler, and N. J. Wright, "The performance and energy efficiency potential of FPGAs in scientific computing," in Proc. IEEE/ACM Perform. Model., Benchmarking Simul. High Perform. Comput. Syst. (PMBS), Piscataway, NJ, USA: IEEE, 2020, pp. 8-19, doi: 10.1109/PMBS51919.2020.00007.
12. K. Vipin and S. A. Fahmy, "FPGA dynamic and partial reconfiguration: A survey of architectures, methods, and applications," ACM Comput. Surv., vol. 51, no. 4, Jul. 2018, Art. no. 72, doi: 10.1145/3193827.
13. J. Cong et al., "FPGA HLS today: Successes, challenges, and opportunities," ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 4, pp. 1-42, Aug. 2022, doi: 10.1145/3530775.
with OpenDwarfs,” in Proc. IEEE 24th Intel FPGA SDK for OpenCL memory 22. H. R. Zohouri et al., “Combined spatial
Annu. Int. Symp. Field-Programmable interface,” in Proc. IEEE/ACM Int. and temporal blocking for high-
Custom Comput. Mach. (FCCM), 2016, pp. Workshop Heterogeneous High- performance stencil computation on
198–198, doi: 10.1109/FCCM.2016.56. Perform. Reconfigurable Comput. FPGAs using OpenCL,” in Proc. ACM/
15. H. R. Zohouri et al., “Evaluating and (H2RC), Nov. 2019, pp. 11–18, doi: SIGDA Int. Symp. Field-Programmable
optimizing OpenCL kernels for high 10.1109/H2RC49586.2019.00007. Gate Arrays (FPGA), New York, NY,
performance computing with FPGAs,” 19. M. Véstias et al., “Trends of CPU, GPU USA: Association for Computing
in Proc. Int. Conf. High Perform. Comput., and FPGA for high-performance com- Machinery, 2018, pp. 153–162, doi:
Netw., Storage Anal. (SC), Nov. 2016, pp. puting,” in Proc. 24th Int. Conf. Field 10.1145/3174243.3174248.
409–420, doi: 10.1109/SC.2016.34. Programmable Logic Appl. (FPL), 2014, 23. E. Nurvitadhi et al., “Can FPGAs beat
16. H. K.-H. So and C. Liu, “FPGA pp. 1–6, doi: 10.1109/FPL.2014.6927483. GPUs in accelerating next-generation
overlays,” in FPGAs for Software 20. E. Calore and S. F. Schifano, “Perfor- deep neural networks?” in Proc. ACM/
Programmers, D. Koch, F. Hannig, and mance assessment of FPGAs as HPC SIGDA Int. Symp. Field-Programmable Gate
D. Ziener, Eds., Cham, Switzerland: accelerators using the FPGA empirical Arrays (FPGA), New York, NY, USA: Asso-
Springer-Verlag, 2016, pp. 285–305. roofline,” in Proc. 31st Int. Conf. Field- ciation for Computing Machinery, 2017,
17. J. Cong et al., “Understanding Programmable Logic Appl. (FPL), Pisca- pp. 5–14, doi: 10.1145/3020078.3021740.
performance differences of FPGAs taway, NJ, USA: IEEE, 2021, pp. 83–90, 24. R. Kobayashi et al., “OpenCL-enabled
and GPUs,” in Proc. ACM/SIGDA Int. doi: 10.1109/FPL53798.2021.00022. high performance direct memory
Symp. Field-Programmable Gate Arrays 21. E. Calore and S. F. Schifano, “FER: access for GPU-FPGA cooperative
(FPGA), New York, NY, USA: Associa- A benchmark for the roofline computation,” in Proc. HPC Asia Work-
tion for Computing Machinery, 2018, analysis of FPGA based HPC accel- shops (HPCAsia Workshops), New York,
p. 288, doi: 10.1145/3174243.3174970. erators,” IEEE Access, vol. 10, pp. NY, USA: Association for Computing
18. H. R. Zohouri et al., “The memory 94,220–94,234, 2022, doi: 10.1109/ Machinery, 2019, pp. 6–9, doi: 10.1145/
controller wall: Benchmarking the ACCESS.2022.3203566. 3317576.3317581.