ispc: A SPMD Compiler for High-Performance CPU Programming
supporting this illusion falls on the compiler. Control flow constructs are compiled following the approach described by Allen et al. [2] and recently generalized by Karrenberg et al. [17], where control flow is transformed to data flow. A simple example of this transformation is shown in Figure 1, where assignments to a variable are controlled by an if statement. The SIMD code generated by this example maintains a mask that indicates which program instances are currently active during program execution. Operations with side-effects are masked so that they don't have any effect for program instances with an "off" mask value. This approach is also applied to loops (including break and continue statements) and to multiple return statements within one function.

Implementation of this transformation is complex on SSE hardware due to limited support for SIMD write-masks, in contrast to AVX, MIC and most GPUs. Instead, the compiler must use separate blend/select instructions designed for this purpose. Fortunately, masking isn't required for all operations; it is unnecessary for most temporaries computed when evaluating an expression, for example. Because the SPMD programming model is used pervasively on GPUs, most GPUs have some hardware/ISA support for SPMD control flow, thereby reducing the burden on the compiler [25, 27].
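To make the transformation concrete, here is a minimal ispc sketch (ours, not taken from the paper's Figure 1 and not actual compiler output); the comments describe the masked SIMD code the compiler conceptually emits for the if statement.

    // safe_sqrt() is an illustrative helper, not part of ispc or this paper.
    float safe_sqrt(float x) {
        float r = 0;
        if (x > 0)        // the compiler computes a per-instance mask m = (x > 0)
            r = sqrt(x);  // sqrt is evaluated for the whole gang; the assignment
                          // to r is done with a blend/select so that only
                          // instances whose mask is "on" observe the new value
        return r;
    }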
3.4 SPMD and Synchronization

ispc provides stricter guarantees of execution convergence than GPUs running SPMD programs do; these guarantees in turn provide ease-of-use benefits to the programmer. ispc specifically provides an important guarantee about the behavior of the program counter and execution mask: the execution of program instances within a gang is maximally converged. Maximal convergence means that if two program instances follow the same control path, they are guaranteed to execute each program statement concurrently. If two program instances follow diverging control paths, it is guaranteed that they will re-converge at the earliest point in the program where they could re-converge. (This guarantee is not provided across gangs in different threads; in that case, explicit synchronization must be used.)

In contrast, CUDA and OpenCL have much looser guarantees on execution order, requiring explicit barrier synchronization among program instances with __syncthreads() or barrier(), respectively, when there is communication between program instances via memory. Implementing these barriers efficiently for OpenCL on CPUs is challenging [10].

Maximally converged execution provides several advantages compared to the looser model on GPUs; it is particularly helpful for efficient communication of values between program instances without needing to explicitly synchronize among them. However, this property also can introduce a dependency on SIMD width; by definition, ordering changes if the gang size changes. The programmer generally only needs to consider this issue when doing cross-program-instance communication.

The concept of lockstep execution must be precisely defined at the language level in order to write well-formed programs where program instances depend on values that are written to memory by other program instances within their gang. With ispc, any side effect from one program instance is visible to other program instances in the gang after the next sequence point in the program, where sequence points are defined as in C. Generally, sequence points include the end of a full expression, before a function is entered in a function call, at function return, and at the end of initializer expressions. The fact that there is no sequence point between the increment of i and the assignment to i in i=i++ is why that expression yields undefined behavior in C, for example. Similarly, if multiple program instances write to the same location without an intervening sequence point, undefined behavior results. (The ispc User's Guide has further details about these convergence guarantees and resulting implications for language semantics [14].)
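As an illustration of the kind of cross-program-instance communication this model permits without an explicit barrier, here is a minimal sketch (assumed ispc; the function, array, and variable names are ours, and it assumes the whole gang is executing):

    // Each program instance deposits its value in a gang-wide scratch array;
    // after the sequence point that ends that statement, the stores are
    // visible to the rest of the gang, so each instance can read its
    // neighbor's value with no __syncthreads()/barrier() equivalent.
    float neighbor_sum(float x) {
        uniform float tmp[programCount];
        tmp[programIndex] = x;
        int next = (programIndex + 1) % programCount;
        return x + tmp[next];
    }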
3.5 Mapping SPMD To Hardware: Memory

The loads and stores generated by SPMD execution can present a performance challenge. Consider a simple array indexing operation like a[index]: when executed in SPMD, each of the program instances will in general have a different value for index and thus access a different memory location. Loads and stores of this type, corresponding to loads and stores with SIMD vectors of pointers, are typically called "gathers" and "scatters" respectively. It is frequently the case at runtime that these accesses are to the same location or to sequential locations in memory; we refer to this as a coherent gather or scatter. For coherent gather/scatter, modern DRAM typically delivers better performance from a single memory transaction than a series of discrete memory transactions.

Modern GPUs have memory controllers that coalesce coherent gathers and scatters into more efficient vector loads and stores [25]. The range of cases that this hardware handles has generally expanded over successive hardware generations. Current CPU hardware lacks "gather" and "scatter" instructions; SSE and AVX only provide vector load and store instructions for contiguous data. Therefore, when gathers and scatters are required, they must be implemented via a less-efficient series of scalar instructions. (This limitation will be removed in future hardware: the Haswell New Instructions provide gather [15] and MIC provides both gather and scatter [35].) ispc's techniques for minimizing unnecessary gathers and scatters are described in Section 6.4.
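The distinction can be seen in a short sketch (assumed ispc, using constructs introduced in Section 4; the function and parameter names are ours):

    // scale() is illustrative only. The indices used for a[] are contiguous
    // across the gang, so those accesses become regular vector loads/stores;
    // the indices used for b[] are arbitrary, so that access is a general
    // gather, implemented as a series of scalar loads on SSE/AVX.
    export void scale(uniform float a[], uniform float b[],
                      uniform int perm[], uniform int count) {
        foreach (i = 0 ... count) {
            float s = a[i];         // contiguous: vector load
            float t = b[perm[i]];   // arbitrary indices: gather
            a[i] = s * t;           // contiguous: vector store
        }
    }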
4. LANGUAGE OVERVIEW

To give a flavor of the syntax and how the language is used, here is a simple example of using ispc. For more extensive examples and language documentation, see the ispc online documentation [13].

First, we have some setup code in C++ that dynamically allocates and initializes two arrays. It then calls an update() function.

    float *values = new float[1024];
    int *iterations = new int[1024];
    // ... initialize values[], iterations[] ...
    update(values, iterations, 1024);

The call to update() is a regular function call; in this case it happens that update() is implemented in ispc. The function squares each element in the values array the number of times indicated by the corresponding entry in the iterations array.

    export void update(uniform float values[],
                       uniform int iterations[],
                       uniform int num_values) {
        for (int i = programIndex; i < num_values; i += programCount) {
            int iters = iterations[i];
            while (iters-- > 0)
                values[i] *= values[i];
        }
    }

The syntax and basic capabilities of ispc are based on C (C89 [3], specifically), though it adopts a number of constructs from C99 and C++. (Examples include the ability to declare variables anywhere in a function, a built-in bool type, references, and function overloading.) Matching C's syntax as closely as possible is an important aid to the adoptability of the language.

The update() function has an export qualifier, which indicates that it should be made callable from C++; the uniform variable qualifier specifies scalar storage and computation and will be described in Section 5.1.

ispc supports arbitrary structured control flow within functions, including if statements, switch statements, for, while, and do loops, as well as break, continue, and return statements in all of the places where they are allowed in C. (Unstructured control flow, i.e. goto statements, is more difficult to support efficiently, though ispc does support goto in cases where it can be statically determined that all program instances will execute the goto.) Different program instances can follow different control paths; in the example above, the while loop may execute a different number of times for different elements of the array.

ispc provides a standard library of useful functions, including hardware atomic operations, transcendentals, functions for communication between program instances, and various data-parallel primitives such as reductions and scans across the program instances.
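For instance, a gang-wide reduction from the standard library can be combined with the looping idiom shown above (a minimal sketch; sum_array() is our name, and reduce_add() is assumed to be the standard-library reduction):

    // Sum an array: each program instance accumulates a partial sum, and
    // reduce_add() combines the per-instance partial sums into one uniform
    // value. Similar standard-library functions cover min/max and scans.
    export uniform float sum_array(uniform float vals[], uniform int count) {
        float partial = 0;
        for (int i = programIndex; i < count; i += programCount)
            partial += vals[i];
        return reduce_add(partial);
    }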
4.1 Mapping Computation to Data

Given a number of instances of the program running in SPMD (i.e. one gang), it's necessary for the instances to iterate over the input data (which is typically larger than a gang). The example above does this using a for loop and the built-in variables programIndex and programCount. programCount gives the total number of instances running (i.e. the gang size) and programIndex gives each program instance an index from zero to programCount-1. Thus, in the above, for each for loop iteration a programCount-sized number of contiguous elements of the input arrays are processed concurrently by the program instances.

ispc's built-in programIndex variable is analogous to the threadIdx variable in CUDA and to the get_global_id() function in OpenCL, though a key difference is that in ispc, looping over more than a gang's worth of items to process is implemented by the programmer as an in-language for or foreach loop, while in those languages the corresponding iteration is effectively done by the hardware and runtime thread scheduler outside of the user's kernel code. Performing this mapping in user code gives the programmer more control over the structure of the parallel computation.
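As a concrete example of the foreach alternative mentioned above, the update() function from earlier could also be written as follows (a sketch; update_foreach() is our name for it):

    // foreach maps the iteration space onto gangs automatically, including
    // any partial gang at the end of the array, so the explicit
    // programIndex/programCount bookkeeping is no longer needed.
    export void update_foreach(uniform float values[],
                               uniform int iterations[],
                               uniform int num_values) {
        foreach (i = 0 ... num_values) {
            int iters = iterations[i];
            while (iters-- > 0)
                values[i] *= values[i];
        }
    }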
4.2 Implementation

The ispc compiler uses flex and bison for tokenization and parsing. The compiler front-end performs type-checking and standard early optimizations such as constant folding before transforming the program to the vector intermediate representation of the LLVM toolkit [21]. LLVM then performs an additional set of traditional optimizations. Next our custom optimizations are applied, as discussed in Section 6. LLVM then generates final assembly code.

It is reasonably easy to add support for new target instruction sets: most of the compiler is implemented in a fashion that is target agnostic (e.g. "gathers" are issued generically and only late in the compilation process are they transformed to a target-specific operation).

5. DESIGN FEATURES

We'll now describe some key features of ispc and how they support the goals introduced in Section 2.

5.1 "Uniform" Datatypes

In a SPMD language like ispc, a declaration of a variable like float x represents a variable with a separate storage location (and thus, potentially different value) for each of the program instances. However, some variables and their associated computations do not need to be replicated across program instances. For example, address computations and loop iteration variables can often be shared.

Since CPU hardware provides separate scalar computation units, it is important to be able to express non-replicated storage and computation in the language. ispc provides a uniform storage class for this purpose, which corresponds to a single value in memory and thus, a value that is the same across all elements. In addition to the obvious direct benefits, the use of uniform variables facilitates additional optimizations as discussed in Section 6.1. It is a compile-time error to assign a non-uniform (i.e., "varying") value to a uniform variable.
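A minimal sketch of the distinction (assumed ispc; the function and variable names are ours):

    export void scale_and_bias(uniform float data[], uniform int count,
                               uniform float gain) {
        uniform float bias = gain * 0.5f;    // one shared scalar value
        foreach (i = 0 ... count) {
            float v = data[i] * gain;        // varying: one value per instance
            // uniform float u = v;          // compile-time error: a varying
            //                               // value cannot be assigned to a
            //                               // uniform variable
            data[i] = v + bias;
        }
    }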
In the absence of the uniform storage class, an optimiz-

[Figure: AOS and SOA memory layouts for an array of structures with members x, y, z.
    AOS:        x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 ...        element access: float v = a[index].x
    hybrid SOA: x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 x4 x5 ...  element access: float v = a[index / 4].x[index & 3]
    short SOA:  x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3]
These coherent control flow variants do not affect program correctness or the final results computed, but can potentially lead to higher performance.
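For reference, a sketch of what such a variant looks like in source code; we assume ispc's cif keyword here (the coherent form of if), and the function is ours:

    // cif hints that all program instances are expected to take the same
    // branch; when that holds at runtime, the compiler's fast path skips the
    // per-instance mask bookkeeping. The results are identical either way.
    float attenuate(float dist) {
        float c = 0;
        cif (dist > 0)
            c = exp(-dist);
        return c;
    }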
For similar reasons, ispc provides convenience foreach constructs that loop over arrays of one or more dimensions and automatically set the execution mask at boundaries. These constructs allow the ispc compiler to easily produce optimized code for the subset of iterations that completely fill a gang of program instances (see Section 6.6 for a description of these optimizations).
5.7 Native Object Files and Function Calls

The ispc compiler generates native object files that can be linked into the application binary in the same way that other object files are. ispc code can be split into multiple object files if desired, with function calls between them resolved at link time. Standard debugging information is optionally included. These capabilities allow standard debuggers and disassemblers to be used with ispc programs and make it easy to add ispc code to existing build systems.

ispc's calling conventions are based on the platform's standard ABI, though functions not marked export are augmented with an additional parameter to provide the current execution mask. Functions that are marked export can be called with a regular function call from C or C++; calling an ispc function is thus a lightweight operation: it carries the same overhead as calling an externally-defined C or C++ function. In particular, no data copying or reformatting is performed, other than possibly pushing parameters onto the stack if required by the platform ABI. While there are some circumstances where such reformatting could lead to improved performance, introducing such a layer is against our goals of performance transparency.

Lightweight function calls are a significant difference from OpenCL on the CPU, where an API call to a driver must be made in order to launch a kernel and where additional API calls are required to set each kernel parameter value.
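For example, calling an exported ispc function from C++ requires nothing beyond an ordinary declaration with C linkage (a sketch; the declaration below is written by hand for illustration rather than copied from an ispc-generated header):

    // C++ side: declare the exported ispc function and call it like any
    // externally-defined C function; no driver API, kernel-argument setup,
    // or data reformatting is involved.
    extern "C" void update(float values[], int iterations[], int num_values);

    void run_update(float *values, int *iterations, int n) {
        update(values, iterations, n);  // direct call into the linked ispc object file
    }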
6. EFFICIENT SPMD-ON-SIMD

There are a number of specialized optimizations that ispc applies to generate efficient code for SPMD on CPUs. We describe several of the most important of these optimizations in the remainder of this section. Several of them build on the uniform qualifier (Section 5.1), which provides a number of benefits on CPUs:

• CPUs can co-issue scalar and vector instructions, so that scalar and vector computations can happen concurrently.

• In the usual case of using 64-bit pointers, pointer arithmetic (e.g. for addressing calculations) is more efficient for scalar pointers than for vector pointers.

• Dereferencing a uniform pointer (or using a uniform value to index into an array) corresponds to a single scalar or vector memory access, rather than a general gather or scatter.

• Code for control flow based on uniform quantities can be more efficient than code for control flow based on non-uniform quantities (Section 6.2).

For the workloads we use for evaluation in Section 7, if all uses of the uniform qualifier were removed, thus eliminating all of the above benefits, the workloads ran at geometric mean (geomean) 0.45x the speed of when uniform was present. The ray tracer was hardest hit, running at 0.05x of its previous performance; "aobench" ran at 0.36x its original performance without uniform, and "stencil" at 0.21x.

There were multiple causes of these substantial performance reductions without uniform; the most significant were the higher overhead of non-uniform control flow and the much greater expense of varying pointer operations compared to uniform pointer operations. Increased pressure on the vector registers, which in turn led to more register spills to memory, also impacted performance without uniform.

6.2 Uniform Control Flow

When a control flow test is based on a uniform quantity, all program instances will follow the same path at that point in a function. Therefore, the compiler is able to generate regular jump instructions for control flow in this case, avoiding the costs of mask updates and overhead for handling control flow divergence.

Treating all uniform control flow statements as varying caused the example workloads to run with performance geomean 0.91x as fast as when this optimization was enabled. This optimization had roughly similar effectiveness on all of the workloads, though the ray tracer was particularly hard-hit without it, running 0.65x as fast as it did with the optimization enabled.
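A small sketch of the difference (assumed ispc; the function and flag names are ours):

    // The outer if tests a uniform value, so it compiles to a regular scalar
    // branch taken identically by the whole gang; the inner if tests a
    // varying value and therefore requires mask/blend handling.
    export void clamp_negatives(uniform float data[], uniform int count,
                                uniform bool enabled) {
        if (enabled) {                  // uniform test: ordinary jump
            foreach (i = 0 ... count) {
                float v = data[i];
                if (v < 0)              // varying test: masked assignment
                    data[i] = 0;
            }
        }
    }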
6.3 Benefits of SOA

We measured the benefits of SOA versus AOS layout with a workload based on a collision detection algorithm that computed collision points, when present, between two groups of spheres. We implemented this workload with AOS layout and then modified the implementation to also use SOA. By avoiding the gathers required with AOS layout, the SOA version was 1.25x faster than the AOS one.
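To illustrate the two layouts in ispc terms (a sketch assuming ispc's soa<N> qualifier for structure-of-arrays layout; the Sphere type and array names are ours, and the real workload's structures differ):

    struct Sphere { float x, y, z, r; };

    // AOS: each sphere's members are adjacent in memory, so reading
    // spheres_aos[i].x across a gang requires a gather.
    uniform Sphere spheres_aos[1024];

    // SOA (8-wide blocks): the x members of 8 consecutive spheres are
    // contiguous, so the same access becomes a regular vector load.
    soa<8> Sphere spheres_soa[1024];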
would be hard to interpret the results in that the effects of different compiler optimizations and code generators would be confounded with the effects of the impact of the language designs. Instead, we have focused on evaluating the performance benefit of various ispc features by disabling them individually, thus isolating the effect of the factor under evaluation.

Table 1 recaps the effects of the various compiler optimizations that were reported in Section 6.

    Sec.       Optimization                     Perf. when disabled
    6.1, 6.2   Uniform data & control flow      0.45x
    6.2        Uniform control flow             0.91x
    6.4        Gather/scatter improvements      0.79x
    6.5        Coherent control flow            0.85x
    6.6        "All on" mask improvements       0.73x

Table 1: Effect of individually disabling various optimizations (geometric mean over all of the example workloads).

7.1 Speedup Compared to Serial Code

Table 2 shows speedups due to ispc's effective use of SIMD hardware and due to ispc's use of task parallelism on a 4-core system. The table compares three cases for each workload: a serial non-SIMD C++ implementation; an ispc implementation running in a single hardware thread on a single core of the system; and an ispc implementation running eight threads on the four two-way hyper-threaded cores of the system. The four-core performance shows the result of filling the entire processor with computation via both task-parallelism and SPMD. For the four-core results, the workloads were parallelized over cores using the tasking functionality described in Section 5.4.

Table 2: Speedup of various workloads on a single core and on four cores of a system with 8-wide SIMD units, compared to a serial C++ implementation. The one-core speedup shows the benefit from using the SIMD lanes of a single core efficiently, while the four-core speedup shows the benefit from filling the entire processor with useful computation.

7.2 Speedup Versus Intrinsics

The complexity of the example workloads (which are as much as 700 lines of ispc code) makes it impractical to also implement intrinsics-based versions of them for performance comparisons. However, a number of users of the system have implemented computations in ispc after previously implementing the same computation with intrinsics and seen good results. The examples we've seen are an image downsampling kernel (ispc performance 0.99x of intrinsics), a collision detection computation (ispc 1.05x faster), and a particle system rasterizer (ispc 1.01x faster).

7.3 Speedup with Wider Vectors

We compared the performance of compiling the example workloads to use four-wide SSE vector instructions versus eight-wide AVX on a system that supported both instruction sets. No changes were made to the workloads' ispc source code. The geometric mean of the speedup for the workloads when going from SSE to AVX was 1.42x. Though this is not as good as the potential 2x speedup from the doubling of vector width, there are a number of microarchitectural details in the first generation of AVX systems that inhibit ideal speedups; they include the fact that the integer vector units are still only four-wide, as well as the fact that cache write bandwidth was not doubled to keep up with the widening of the vector registers.

7.4 Scalability on Larger Systems

Table 3 shows the result of running the example workloads with 80 threads on a 2-way hyper-threaded 40-core Intel Xeon E7-8870 system at 2.40 GHz, using the SSE4 instruction set and running Microsoft Windows Server 2008 Enterprise. For these tests, the serial C/C++ baseline code was compiled with MSVC 2010. No changes were made to the implementation of the workloads after their initial parallelization, though the "aobench" and options pricing workloads were run with larger data sets than the four-core runs (2048x2048 image resolution versus 512x512, and 2M options versus 128k options, respectively).

Table 3: Speedup versus serial C++ implementations of various workloads on a 40-core system with 4-wide SIMD units.

The results fall into three categories. Some workloads (aobench, ray tracer, and volume rendering) saw substantial speedups versus the serial baseline, thanks to effective use of all of the system's computational resources, achieving speedups of more than the theoretically-ideal 160x (the product of the number of cores and the SIMD width on each core); again, the super-linear component of the speedups is mostly due to hyper-threading. Other workloads (both of the options pricing workloads and the Mandelbrot set workload) saw speedups around 2x the system's core count; for these, the MSVC compiler seems to have been somewhat effective at automatically vectorizing them, thus improving the serial baseline performance. Note, however, that these are the simplest of the workloads; for the more complex workloads the auto-vectorizer is much less effective.

The stencil computation saw a poor speedup versus the serial baseline (and indeed, a worse speedup than on the four-core system). The main issue is that the computation is iterative, requiring that each set of asynchronous tasks complete before the set of tasks for the next iteration can be launched; the repeated ramp up and ramp down of parallel computation hurts scalability. Improvements to the implementation of the underlying task system could presumably reduce the impact of this issue.
7.5 Users

There have been over 1,500 downloads of the ispc binaries since the system was first released; we don't know how many additional users are building the system from source. Users have reported roughly fifty bugs and made a number of suggestions for improvements to code generation and language syntax and capabilities.

Overall feedback from users has been positive, both from users with a background in SPMD programming from GPUs and from users with an extensive background in intrinsics programming. Their experience has generally been that ispc's interoperability features and close relationship to C have made it easy to adopt the system; users can port existing code to ispc by starting with existing C/C++ code, updating it to remove any constructs that ispc doesn't support (like classes), and then modifying it to use ispc's parallel constructs. It hasn't been unusual for a user with a bit of ispc experience to port an existing 500-1000 line program from C++ to ispc in a morning's work. From the other direction, many ispc programs can be compiled as C with the introduction of a few preprocessor definitions; being able to go back to serial C with the same source code has been useful for a number of users as well.

Applications that users have reported using ispc for include implementing a 2D Jacobi Poisson solver (achieving a 3.60x speedup compared to the previous implementation, both on a single core); implementing a variety of image processing operations for a production imaging system (achieving a 3.2x speedup, again both on a single core); and implementing physical simulation of airflow for aircraft design (speedups not reported to us). Most of these users had not previously bothered to try to vectorize their workloads with intrinsics, but have been able to see substantial speedups using ispc; they have generally been quite happy with both performance transparency and absolute performance.

8. RELATED WORK

The challenge of providing language and compiler support for parallel programming has received considerable attention over many years. To keep the discussion of related work tractable, we focus on languages whose goal is high performance (or more precisely, high efficiency) programming of SIMD hardware. We further focus on general purpose languages (in contrast to domain-specific languages), with a particular emphasis on languages that are C-like. We do not discuss languages and libraries that are focused just on multi-core or distributed parallelism, such as OpenMP, TBB, Cilk, and MPI, even though some of these languages use an SPMD programming model.

8.1 Historical systems

In the late 1980s and early 1990s, there was a wave of interest in SIMD architectures and accompanying languages. In all of the cases we discuss, SIMD computations were supported with a true superset of C; that is, serial C code could always be compiled, but the SIMD hardware was accessible via language extensions. The Pixar FLAP computer had a scalar integer ALU and a 4-wide SIMD floating-point ALU, with an accompanying extended-C language [24]. FLAP is also notable for providing hardware support for SIMD mask operations, like the MIC ISA and some modern GPUs. The Thinking Machines CM-1 and CM-2 and the MasPar MP-1 and MP-2 supercomputers used very wide SIMD (1000s of ALUs), programmed in the extended-C languages C* [33] and MPL [29] respectively.

All of these systems used a single language for both serial and parallel computations; had a single hardware program counter; and provided keywords similar to ispc's uniform and varying to distinguish between scalar and SIMD variables. MPL provided vector control constructs with syntax similar to ispc, OpenCL, and CUDA; C* provided a more limited capability just for if statements. MPL provided generalized SIMD pointers similar to the ones in ispc, but each SIMD ALU and the scalar ALU had its own memory, so these pointers could not be used to communicate data between units as they can in ispc. Both C* and MPL had sophisticated communication primitives for explicitly moving data between SIMD ALUs.

ClearSpeed's Cn is a more recent example of this family of languages; the paper describing it has a good discussion of design trade-offs [26].

8.2 Contemporary systems

CUDA is a SPMD language for NVIDIA GPUs [31] and OpenCL is a similar language developed as an open standard, with some enhancements such as API-level task parallelism designed to make it usable for CPUs as well as GPUs [10, 19, 34]. At a high level, the most important differences between these languages and ispc are that ispc's design was not restricted by GPU constraints such as a separate memory system, and that ispc includes numerous features designed specifically to provide efficient performance on CPUs. All three languages are C-like but do not support all features of C. ispc and CUDA have some C++ features as well.

The difference in hardware focus between CUDA/OpenCL and ispc drives many specific differences. OpenCL has several different address spaces, including a per-SIMD-lane memory address space (called "private") and a per-work-group address space (called "local"), whereas ispc has a single global coherent address space for all storage. OpenCL and CUDA also have complex APIs for moving data to and from a discrete graphics card that are unnecessary in ispc. ispc has language-level support for task parallelism, unlike OpenCL and CUDA. CUDA and OpenCL lack ispc's support for "uniform" variables and convenient declaration of structure-of-arrays data types. Although these features are less important for performance on GPUs than on CPUs, we believe they would provide some benefit even on GPUs.

There are several implementations of CUDA and OpenCL for CPUs. Some do not attempt to vectorize across SIMD lanes in the presence of control flow [10, 36]. Intel's OpenCL compiler does perform SIMD vectorization [34], using an approach related to Karrenberg et al.'s [17] (who also applied their technique to OpenCL kernels).

Parker et al.'s RTSL system provided SPMD-on-SIMD on current CPUs in a domain-specific language for implementing ray tracers [32].

Microsoft's C++ AMP [30] provides a set of extensions to C++ to support GPU programming. As with CUDA and OpenCL, its design was constrained by the goal of running on today's GPUs. It is syntactically very different from CUDA, OpenCL, and ispc because of its choice of mechanisms for extending C++.

The UPC language extends C to provide an SPMD programming model for multiple cores [5]. UPC includes mechanisms for scaling to very large systems that lack hardware memory coherence, but the language was not designed to target SIMD parallelism within a core and, as far as we know, it has never been used for this purpose.
8.3 Concurrently-developed systems

The IVL and VecImp languages described in a recent paper are similar to ispc in a number of ways [23]; they were developed concurrently with ispc, with some cross-pollination of ideas. These three languages are the only C-like general-purpose languages that we are aware of that provide a mechanism for creating a structure-of-arrays variant of a previously-declared struct data type.

There are substantial differences in emphasis between the VecImp/IVL paper and this work. The VecImp/IVL paper focuses on describing the language and formally proving the soundness of the type system, whereas we focus on justifying and quantitatively evaluating language features such as uniform variables and structure-of-arrays support. IVL and its evaluation focus on the MIC architecture, whereas ispc focuses on the SSE and AVX architectures, which have less dedicated ISA support for SPMD-style computation. This paper also introduces and analyzes the compiler optimizations required to reap the full benefit of language features such as uniform variables.

There are a variety of other detailed differences between ispc, IVL, and VecImp. For example, IVL supports function polymorphism, which is not currently supported in ispc, and ispc's pointer model is more powerful than IVL's. ispc uses LLVM for code generation, but the IVL compiler generates C++ code with intrinsics. ispc is the only one of the three languages with an implementation available for public use.

The Intel C/C++ compiler provides an "elemental functions" extension of C++ that is intended to provide SPMD as an extension of a full C++ compiler [16]. Its language functionality for SPMD is more limited than ispc's; for example, its equivalent of uniform can only be applied to function parameters, and there is no general facility for creating SOA types from AOS types. It has been demonstrated that its capabilities can be used to achieve good utilization of SIMD units [20].

9. CONCLUSION

We have presented ispc, a SPMD language for programming CPU vector units that is easy to adopt and productive to use. We have shown that a few key language features (uniform data types, native support for SOA structure layout, and in-language task launch), coupled with a series of custom optimization passes, make it possible to efficiently execute SPMD programs on the SIMD hardware of modern CPUs. These programs can effectively target the full capabilities of CPUs, executing code with performance essentially the same as hand-written intrinsics. Support for uniform types is particularly important; our experiments showed that this capability provides over a 2x performance improvement.

In the future, we plan to further refine the ispc language, eliminating remaining differences with C and adding convenience features like polymorphic functions. We are already adding support for the MIC architecture, which is an attractive target due to its 16-wide SIMD and good ISA support for SPMD execution.

Experience with ispc suggests a number of avenues for improving future hardware architectures. For conventional CPUs, improved support for masking and scatter would be desirable, and extending vector units to operate on 64-bit integer values at the same performance as when operating on 32-bit integer values (for vector pointer addressing calculations) may be helpful.

The decision to include both scalar and SIMD computation as first-class operations in the language may be applicable to other architectures. For example, AMD's forthcoming GPU has a scalar unit alongside its vector unit [27], as does a research architecture from NVIDIA [18]. Such architectures could have a variety of efficiency advantages versus a traditional "brute force" SIMD-only GPU implementation [6]. More broadly, many of the available approaches for achieving high SIMD efficiency can be implemented in different ways: by the programmer/language, by the compiler, or by the hardware. In the power-constrained environment that limits all hardware architectures today, we expect continued exploration of the complex trade-offs between these different approaches.

Acknowledgments

The parser from the (non-SPMD) C-based "nit" language written by Geoff Berry and Tim Foley at Neoptica provided the starting point for the ispc implementation; ongoing feedback from Geoff and Tim about design and implementation issues in ispc has been extremely helpful. Tim suggested the "SPMD on SIMD" terminology and has extensively argued for the advantages of the SPMD model.

We'd like to specifically thank the LLVM development team; without LLVM, this work wouldn't have been possible. Bruno Cardoso Lopes's work on support for AVX in LLVM was particularly helpful for the results reported here.

We have had many fruitful discussions with Ingo Wald that have influenced the system's design; ispc's approach to SPMD and Ingo's approach with the IVL language have had bidirectional and mutually-beneficial influence. More recently, discussions with Roland Leißa and Sebastian Hack about VecImp have been quite helpful.

We appreciate the support of Geoff Lowney and Jim Hurley for this work, as well as Elliot Garbus's early enthusiasm and support for it. Thanks to Kayvon Fatahalian, Solomon Boulos, and Jonathan Ragan-Kelley for discussions about SPMD parallel languages and SIMD hardware architectures. Discussions with Nadav Rotem about SIMD code generation and LLVM, as well as discussions with Matt Walsh, have also directly improved the system. Ali Adl-Tabatabai's feedback and detailed questions about the precise semantics of ispc have been extremely helpful as well.

Thanks to Tim Foley, Mark Lacey, Jacob Munkberg, Doug McNabb, Andrew Lauritzen, Misha Smelyanskiy, Stefanus Du Toit, Geoff Berry, Roland Leißa, Aaron Lefohn, Dillon Sharlet, and Jean-Luc Duprat for comments on this paper, and thanks to the early users of ispc inside Intel (Doug McNabb, Mike MacPherson, Ingo Wald, Nico Galoppo, Bret Stastny, Andrew Lauritzen, Jefferson Montgomery, Jacob Munkberg, Masamichi Sugihara, and Wooyoung Kim), in particular for helpful suggestions and bug reports as well as for their patience with early versions of the system.
10. REFERENCES

[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proc. High-Performance Graphics 2009, pages 145-149, 2009.
[2] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proc. POPL '83.
[3] American National Standards Institute. American National Standard Programming Language C, ANSI X3.159-1989, 1989.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP), July 1995.
[5] W.-Y. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick. A performance analysis of the Berkeley UPC compiler. In Proc. of 17th Annual Intl. Conf. on Supercomputing, pages 63-73, 2003.
[6] S. Collange, D. Defour, and Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proc. of the 2009 Intl. Conf. on Parallel Processing, Euro-Par '09.
[7] F. Darema, D. George, V. Norton, and G. Pfister. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing, 7(1), 1988.
[8] M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, Sept. 1972.
[9] S. Fujita. AOBench. http://code.google.com/p/aobench.
[10] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster, and B. Zheng. Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors. In PACT '10.
[11] P. Hanrahan and J. Lawson. A language for shading and lighting calculations. SIGGRAPH Comput. Graph., 24:289-298, September 1990.
[12] R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. Passing a language through the eye of a needle. ACM Queue, 9(5).
[13] Intel. Intel SPMD Program Compiler documentation. http://ispc.github.com/documentation.html.
[14] Intel. Intel SPMD Program Compiler User's Guide. http://ispc.github.com/ispc.html.
[15] Intel. Intel Advanced Vector Extensions Programming Reference, June 2011.
[16] Intel. Intel Cilk Plus Language Extension Specification, Version 1.1, 2011. Online document.
[17] R. Karrenberg and S. Hack. Whole-function vectorization. In Proc. of CGO 2011.
[18] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31:7-17, Sept-Oct 2011.
[19] Khronos OpenCL Working Group. The OpenCL Specification, Sept. 2010.
[20] C. Kim, N. Satish, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey. Closing the ninja performance gap through traditional programming and compiler technology. Technical report, Intel Corporation, Dec 2011.
[21] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proc. of CGO '04, Mar 2004.
[22] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In Proc. ISCA 2010.
[23] R. Leißa, S. Hack, and I. Wald. Extending a C-like language for portable SIMD programming. In PPoPP, Feb 2012.
[24] A. Levinthal, P. Hanrahan, M. Paquette, and J. Lawson. Parallel computers for graphics applications. SIGPLAN Not., October 1987.
[25] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, Mar-April 2008.
[26] A. Lokhmotov, B. Gaster, A. Mycroft, N. Hickey, and D. Stuttard. Revisiting SIMD programming. In Languages and Compilers for Parallel Computing, pages 32-46, 2008.
[27] M. Mantor and M. Houston. AMD Graphics Core Next: Low power high performance graphics and parallel compute. Hot3D, High Performance Graphics Conf., 2011.
[28] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. ACM Trans. Graph., July 2003.
[29] MasPar Computer Corporation. MasPar Programming Language (ANSI C compatible MPL) Reference Manual, Software Version 3.0, July 1992.
[30] Microsoft Corporation. MSDN Library: Overview of C++ Accelerated Massive Parallelism (C++ AMP), 2011. Online preview documentation, visited Dec 2011.
[31] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6:40-53, March 2008.
[32] S. G. Parker, S. Boulos, J. Bigler, and A. Robison. RTSL: a ray tracing shading language. In Proc. of the 2007 IEEE Symp. on Interactive Ray Tracing, 2007.
[33] J. Rose and G. Steele. C*: An extended C language for data parallel programming. In Proc. of the Second Intl. Conf. on Supercomputing, May 1987.
[34] N. Rotem. Intel OpenCL SDK vectorizer. LLVM Developer Conference presentation, Nov. 2011.
[35] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., August 2008.
[36] J. A. Stratton, S. S. Stone, and W.-M. W. Hwu. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proc. 21st Int'l Workshop on Languages and Compilers for Parallel Computing, 2008.
[37] M. Wolfe, C. Shanklin, and L. Ortega. High-Performance Compilers for Parallel Computing. Addison-Wesley, 1995.