ispc: A SPMD Compiler for High-Performance CPU Programming

Matt Pharr                         William R. Mark
Intel Corporation                  Intel Corporation

ABSTRACT
SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs have not kept up with the hardware's capabilities. Existing CPU parallel programming models focus primarily on multi-core parallelism, neglecting the substantial computational capabilities that are available in CPU SIMD vector units. GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs.

We have developed a compiler, the Intel(R) SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with each instance of the program mapped to one SIMD lane. We discuss language features that make ispc easy to adopt and use productively with existing software systems and show that ispc delivers up to 35x speedups on a 4-core system and up to 240x speedups on a 40-core system for complex workloads (compared to serial C++ code).

Categories and Subject Descriptors
D.3.4 [Programming languages]: Processors—Compilers; D.1.3 [Concurrent programming]: Parallel programming

Keywords
SPMD, parallel programming, SIMD, CPUs

1. INTRODUCTION
Recent work has shown that CPUs are capable of delivering high performance on a variety of highly parallel workloads by using both SIMD and multi-core parallelism [22]. Coupled with their ability to also efficiently execute code with moderate to small amounts of parallelism, this makes CPUs an attractive target for a range of computationally-intensive applications, particularly those that exhibit varying amounts of parallelism over the course of their execution.

However, achieving this performance is difficult in practice; although techniques for parallelizing across CPU cores are well-known and reasonably easily adopted, parallelizing across SIMD vector lanes remains difficult, often requiring laboriously writing intrinsics code to generate desired instruction sequences by hand. The most common parallel programming languages and libraries designed for CPUs—including OpenMP, MPI, Thread Building Blocks, UPC, and Cilk—focus on multi-core parallelism and do not provide any assistance for targeting SIMD parallelism within a core. There has been some hope that CPU implementations of GPU-oriented languages that support SIMD hardware (such as OpenCL) might address this gap [10], but OpenCL lacks capabilities needed to achieve maximum efficiency on CPUs and imposes productivity penalties caused by needing to accommodate GPU limitations such as a separate memory system. This situation led us to ask what would be possible if one were to design a language specifically for achieving high performance and productivity for using SIMD vector units on modern CPUs.

We have implemented a language and compiler, the Intel(R) SPMD Program Compiler (ispc), that extends a C-based language with "single program, multiple data" (SPMD) constructs for high-performance SIMD programming. (ispc is available for download in both source and binary form from http://ispc.github.com.) ispc's "SPMD-on-SIMD" execution model provides the key feature of being able to execute programs that have divergent control flow across the SIMD lanes. ispc's main focus is effective use of CPU SIMD units, though it supports multi-core parallelism as well.

The language and underlying programming model are designed to fully expose the capabilities of modern CPU hardware, while providing ease of use and high programmer productivity. Programs written in ispc generally see their performance scale with the product of both the number of processing cores and their SIMD width; this is a standard characteristic of GPU programming models but one that is much less common on the CPU.

The most important features of ispc for performance are:

• Explicit language support for both scalar and SIMD operations.

• Support for structure-of-arrays data structures, including for converting previously-declared data types into structure of arrays layout.

• Access to the full flexibility of the underlying CPU hardware, including the ability to launch asynchronous tasks and to perform fast cross-lane SIMD operations.
The most important features of ispc for usability are:

• Support for tight coupling between C++ and ispc, including the ability to directly call ispc routines from C++ and to also call C++ routines from ispc. (Here and throughout this paper, we use "C++ code" or "application code" to indicate the rest of the software system that ispc is being used with; this could include, for example, Fortran or Python code that called ispc code.)

• Coherent shared memory between C++ and ispc.

• Familiar syntax and language features due to its basis in C.

Most ispc language features are inspired by earlier SIMD languages such as C* [33], Pixar's FLAP C [24], the RenderMan Shading Language [11] and the MasPar programming language [29], and in some cases by more modern GPU-oriented languages such as CUDA [31] and OpenCL [19]. The primary contribution of this paper is to design and implement a language and compiler targeted at modern CPU architectures, and to evaluate the performance impact of key language features on these architectures.

2. DESIGN GOALS
In order to motivate some of the design differences between ispc and other parallel languages, we will discuss the specific goals of the system and the key characteristics of the hardware that it targets.

2.1 Goals
Performance on today's CPU hardware: The target users for ispc are performance-focused programmers. Therefore, a key design goal is that the system should provide performance transparency: just as with C, it should be straightforward for the user to understand how code written in the language will be compiled to the hardware and roughly how the code will perform. The target hardware is modern CPU hardware, with SIMD units from four to sixteen elements wide, and in particular x86 CPUs with SSE or AVX instructions. A skilled user should be able to achieve performance similar (85% or better) to that achievable by programming in C with SSE and AVX intrinsic functions. Modern x86 workstations have up to forty cores and the Intel MIC architecture will have over fifty cores on a single chip, so ispc programs should be able to scale to these core counts and beyond.

Programmer productivity: Programmer productivity should be substantially higher than that achievable by programming with intrinsic functions, and should be comparable to that of writing high-performance serial C code or OpenCL kernels. Productivity is measured not just in terms of writing code, but also by the ease of reading and modifying code. Of course, it is expected that it will take longer and require more skill to write highly-tuned code than it does to write less efficient code, just as when programming in other languages. It should be possible (though not mandatory) to write code that is portable across architectures with differing SIMD widths.

Ease of adoption and interoperability: It should be easy for programmers to adopt the language, both for new code and for incremental enhancements to existing systems. The language should be as familiar as possible so that it is easy to learn. Ideally the language should be so similar to C that porting code or sharing data structure definitions between it and C/C++ is easy. To support incremental use of the system, it should be easy to call back and forth between C/C++ and ispc code, and it should be easy to share complex pointer-based data structures. Code generated by the compiler should interoperate with existing memory allocators and task schedulers, rather than imposing its own. Finally, the system should easily work with existing build systems, debuggers and tools like memory verifiers (e.g. valgrind). (Lua's "embeddability" goals are similar [12].)

2.2 Non-goals
It is useful to specifically list several non-goals of ispc.

No support for GPUs: CPU and GPU architectures are sufficiently different that a single performance-focused programming model for both is unlikely to be an ideal fit for either. Thus, we focus exclusively on CPU architectures. For example, we assume a single cache-coherent address space, which would not be possible if the language had to support today's discrete GPUs.

Don't try to provide "safe" parallelism: We do not attempt to protect the programmer by making races or deadlock difficult or impossible. Doing so would place too many levels of abstraction between the programmer and the underlying hardware, and we choose to focus on programmers who are willing to give up some safety in return for achieving peak machine performance, just as they might do when they choose C or C++ over Java.

2.3 Target Hardware
Since one of the primary goals of the ispc language is to provide high efficiency on modern CPU hardware, it is helpful to review some of the characteristics of this hardware that impact the language and compiler design.

Multi-core and SIMD parallelism: A modern CPU consists of several cores, each of which has a scalar unit and a SIMD unit. The instructions for accessing the SIMD unit have different names on different architectures: SSE for 128-bit wide SIMD on x86 processors, AVX for 256-bit wide SIMD on Intel processors, AltiVec on PowerPC processors, and Neon on ARM processors. The ispc compiler currently supports SSE and AVX.

Simultaneous execution of scalar and SIMD instructions: Modern CPU architectures can issue multiple instructions per cycle when appropriate execution units are available for those instructions. There is often a performance advantage from replacing a SIMD instruction with a scalar instruction due to better occupancy of execution units. The architects of the Pixar FLAP observed long ago that even SIMD-heavy code has a large number of addressing and control computations that can be executed on a scalar unit [24].

One program counter per core: The scalar unit and all lanes of the associated SIMD unit share a single hardware program counter.

Single coherent memory: All cores share a single cache-coherent address space and memory system for both scalar and SIMD operations. This capability greatly simplifies the sharing of data structures between serial and parallel code. This capability is lacking on today's GPUs.
Cross-lane SIMD operations: SSE and AVX efficiently support various cross-lane SIMD operations such as swizzles via a single instruction. GPUs generally provide weaker support for these operations, although they can be mimicked at lower performance via memory.

Tightly defined execution order and memory model: Modern CPUs have relatively strict rules on the order in which instructions are completed and on when memory stores become visible to memory loads. GPUs have more relaxed rules, which provides greater freedom for hardware scheduling but makes it more difficult to provide ordering guarantees at the language level.

3. PARALLELISM MODEL: SPMD ON SIMD
Any language for parallel programming requires a conceptual model for expressing parallelism in the language and for mapping this language-level parallelism to the underlying hardware. For the following discussion of ispc's approach, we rely on Flynn's taxonomy of programming models into SIMD, MIMD, etc. [8], with Darema's enhancement to include SPMD (Single Program Multiple Data) [7].

3.1 Why SPMD?
Recall that our goal is to design a language and compiler for today's SIMD CPU hardware. One option would be to use a purely sequential language, such as unmodified C, and rely on the compiler to find parallelism and map it to the SIMD hardware. This approach is commonly referred to as auto-vectorization [37]. Although auto-vectorization can work well for regular code that lacks conditional operations, a number of issues limit the applicability of the technique in practice. All optimizations performed by an auto-vectorizer must honor the original sequential semantics of the program; the auto-vectorizer thus must have visibility into the entire loop body, which precludes vectorizing loops that call out to externally-defined functions, for example. Complex control flow and deeply nested function calls also often inhibit auto-vectorization in practice, in part due to heuristics that auto-vectorizers must apply to decide when to try to vectorize. As a result, auto-vectorization fails to provide good performance transparency—it is difficult to know whether a particular fragment of code will be successfully vectorized by a given compiler and how it will perform.

To achieve ispc's goals of efficiency and performance transparency it is clear that the language must have parallel semantics. This leads to the question: how should parallelism be expressed? The most obvious option is to express SIMD operations as explicit vector computations. This approach works acceptably in many cases when the SIMD width is four or less, since explicit operations on 3-vectors and 4-vectors are common in many algorithms. For SIMD widths greater than four, this option is still effective for algorithms without data-dependent control flow, and can be implemented in C++ using operator overloading layered over intrinsics. However, this option becomes less viable once complex control flow is required.

Given complex control flow, what the programmer ideally wants is a programming model that is as close as possible to MIMD, but that can be efficiently compiled to the available SIMD hardware. SPMD provides just such a model: with SPMD, there are multiple instances of a single program executing concurrently and operating on different data. SPMD programs largely look like scalar programs (unlike explicit SIMD), which leads to a productivity advantage for programmers working with SPMD programs. Furthermore, the SPMD approach aids with performance transparency: vectorization of a SPMD program is guaranteed by the underlying model, so a programmer can write SPMD code with a clear mental model of how it will be compiled. Over the past ten years the SPMD model has become widely used on GPUs, first for programmable shading [28] and then for more general-purpose computation via CUDA and OpenCL.

ispc implements SPMD execution on the SIMD vector units of CPUs; we refer to this model as "SPMD-on-SIMD". Each instance of the program corresponds to a different SIMD lane; conditionals and control flow that are different between the program instances are allowed. As long as each program instance operates only on its own data, it produces the same results that would be obtained if it was running on a dedicated MIMD processor. Figure 1 illustrates how SPMD execution is implemented on CPU SIMD hardware.

3.2 Basic Execution Model
Upon entry to an ispc function called from C/C++ code, the execution model switches from the application's serial model to ispc's SPMD model. Conceptually, a number of program instances start running concurrently. The group of running program instances is called a gang (harkening to "gang scheduling", since ispc provides certain guarantees about when program instances running in a gang run concurrently with other program instances in the gang, detailed below). Program instances thus correspond to threads in CUDA and work items in OpenCL; a gang roughly corresponds to a CUDA warp. The gang of program instances starts executing in the same hardware thread and context as the application code that called the ispc function; no thread creation or implicit context switching is done by ispc.

The number of program instances in a gang is relatively small; in practice, it is no more than twice the SIMD width of the hardware that it is executing on. (Running gangs wider than the SIMD width can give performance benefits from amortizing shared computation (such as scalar control flow overhead) over more program instances, from better cache reuse across the program instances, and from more instruction-level parallelism being available. The costs are greater register pressure and potentially more control flow divergence across the program instances.) Thus, there are four or eight program instances in a gang on a CPU using the 4-wide SSE instruction set, and eight or sixteen on a CPU using 8-wide AVX. The gang size is set at compile time.

SPMD parallelization across the SIMD lanes of a single core is complementary to multi-core parallelism. For example, if an application has already been parallelized across cores, then threads in the application can independently call functions written in ispc to use the SIMD unit on the core where they are running. Alternatively, ispc has capabilities for launching asynchronous tasks for multi-core parallelism; they will be introduced in Section 5.4.

3.3 Mapping SPMD To Hardware: Control
One of the challenges in SPMD execution is handling divergent control flow. Consider a while loop with a termination test n > 0; when different program instances have different values for n, they will need to execute the loop body different numbers of times.
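To make this concrete, here is a small example in ispc (our sketch, not code from the paper) in which the loop trip count depends on per-instance data, so the program instances in a gang may diverge:

// Each program instance reads its own n, so the while loop may execute a
// different number of iterations in different SIMD lanes; the compiler
// maintains an execution mask so that instances that have finished the
// loop stop having side effects.
export void countdown_sums(uniform int data[]) {
    int n = data[programIndex];   // varying: one value per program instance
    int sum = 0;
    while (n > 0) {               // divergent loop
        sum += n;
        --n;
    }
    data[programIndex] = sum;     // write back each instance's result
}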
Figure 1: Execution of a 4-wide SPMD program on 4-wide SIMD vector hardware. On the left we have a
short program with simple control flow; the right illustrates how this program is compiled to run on SIMD
vector hardware. Here, the if statement has been converted into partially predicated instructions, so the
instructions for both the “true” and “false” cases are always executed. A mask is used to prevent side effects
for program instances that should not themselves be executing instructions in a particular control flow path.

ispc's SPMD-on-SIMD model provides the illusion of separate control flow for each SIMD lane, but the burden of supporting this illusion falls on the compiler. Control flow constructs are compiled following the approach described by Allen et al. [2] and recently generalized by Karrenberg et al. [17], where control flow is transformed to data flow.

A simple example of this transformation is shown in Figure 1, where assignments to a variable are controlled by an if statement. The SIMD code generated for this example maintains a mask that indicates which program instances are currently active during program execution. Operations with side-effects are masked so that they don't have any effect for program instances with an "off" mask value. This approach is also applied to loops (including break and continue statements) and to multiple return statements within one function.

Implementation of this transformation is complex on SSE hardware due to limited support for SIMD write-masks, in contrast to AVX, MIC and most GPUs. Instead, the compiler must use separate blend/select instructions designed for this purpose. Fortunately, masking isn't required for all operations; it is unnecessary for most temporaries computed when evaluating an expression, for example. Because the SPMD programming model is used pervasively on GPUs, most GPUs have some hardware/ISA support for SPMD control flow, thereby reducing the burden on the compiler [25, 27].

3.4 SPMD and Synchronization
ispc provides stricter guarantees of execution convergence than GPUs running SPMD programs do; these guarantees in turn provide ease-of-use benefits to the programmer. ispc specifically provides an important guarantee about the behavior of the program counter and execution mask: the execution of program instances within a gang is maximally converged. Maximal convergence means that if two program instances follow the same control path, they are guaranteed to execute each program statement concurrently. If two program instances follow diverging control paths, it is guaranteed that they will re-converge at the earliest point in the program where they could re-converge. (This guarantee is not provided across gangs in different threads; in that case, explicit synchronization must be used.)

In contrast, CUDA and OpenCL have much looser guarantees on execution order, requiring explicit barrier synchronization among program instances with __syncthreads() or barrier(), respectively, when there is communication between program instances via memory. Implementing these barriers efficiently for OpenCL on CPUs is challenging [10].

Maximally converged execution provides several advantages compared to the looser model on GPUs; it is particularly helpful for efficient communication of values between program instances without needing to explicitly synchronize among them. However, this property can also introduce a dependency on SIMD width; by definition, ordering changes if the gang size changes. The programmer generally only needs to consider this issue when doing cross-program-instance communication.

The concept of lockstep execution must be precisely defined at the language level in order to write well-formed programs where program instances depend on values that are written to memory by other program instances within their gang. With ispc, any side effect from one program instance is visible to other program instances in the gang after the next sequence point in the program, where sequence points are defined as in C. Generally, sequence points include the end of a full expression, the point before a function is entered in a function call, the point of function return, and the end of an initializer expression. The fact that there is no sequence point between the increment of i and the assignment to i in i=i++ is why that expression yields undefined behavior in C, for example. Similarly, if multiple program instances write to the same location without an intervening sequence point, undefined behavior results. (The ispc User's Guide has further details about these convergence guarantees and the resulting implications for language semantics [14].)
3.5 Mapping SPMD To Hardware: Memory
The loads and stores generated by SPMD execution can present a performance challenge. Consider a simple array indexing operation like a[index]: when executed in SPMD, each of the program instances will in general have a different value for index and thus access a different memory location. Loads and stores of this type, corresponding to loads and stores with SIMD vectors of pointers, are typically called "gathers" and "scatters" respectively. It is frequently the case at runtime that these accesses are to the same location or to sequential locations in memory; we refer to this as a coherent gather or scatter. For coherent gather/scatter, modern DRAM typically delivers better performance from a single memory transaction than a series of discrete memory transactions.

Modern GPUs have memory controllers that coalesce coherent gathers and scatters into more efficient vector loads and stores [25]. The range of cases that this hardware handles has generally expanded over successive hardware generations. Current CPU hardware lacks "gather" and "scatter" instructions; SSE and AVX only provide vector load and store instructions for contiguous data. Therefore, when gathers and scatters are required, they must be implemented via a less-efficient series of scalar instructions. (This limitation will be removed in future hardware: the Haswell New Instructions provide gather [15] and MIC provides both gather and scatter [35].) ispc's techniques for minimizing unnecessary gathers and scatters are described in Section 6.4.

4. LANGUAGE OVERVIEW
To give a flavor of the syntax and how the language is used, here is a simple example of using ispc. For more extensive examples and language documentation, see the ispc online documentation [13].

First, we have some setup code in C++ that dynamically allocates and initializes two arrays. It then calls an update() function.

float *values = new float[1024];
int *iterations = new int[1024];
// ... initialize values[], iterations[] ...
update(values, iterations, 1024);

The call to update() is a regular function call; in this case it happens that update() is implemented in ispc. The function squares each element in the values array the number of times indicated by the corresponding entry in the iterations array.

export void update(uniform float values[],
                   uniform int iterations[],
                   uniform int num_values) {
    for (int i = programIndex; i < num_values;
         i += programCount) {
        int iters = iterations[i];
        while (iters-- > 0)
            values[i] *= values[i];
    }
}

The syntax and basic capabilities of ispc are based on C (C89 [3], specifically), though it adopts a number of constructs from C99 and C++. (Examples include the ability to declare variables anywhere in a function, a built-in bool type, references, and function overloading.) Matching C's syntax as closely as possible is an important aid to the adoptability of the language.

The update() function has an export qualifier, which indicates that it should be made callable from C++; the uniform variable qualifier specifies scalar storage and computation and will be described in Section 5.1.

ispc supports arbitrary structured control flow within functions, including if statements, switch statements, for, while, and do loops, as well as break, continue, and return statements in all of the places where they are allowed in C. (Unstructured control flow, i.e. goto statements, is more difficult to support efficiently, though ispc does support goto in cases where it can be statically determined that all program instances will execute the goto.) Different program instances can follow different control paths; in the example above, the while loop may execute a different number of times for different elements of the array.

ispc provides a standard library of useful functions, including hardware atomic operations, transcendentals, functions for communication between program instances, and various data-parallel primitives such as reductions and scans across the program instances.

4.1 Mapping Computation to Data
Given a number of instances of the program running in SPMD (i.e. one gang), it's necessary for the instances to iterate over the input data (which is typically larger than a gang). The example above does this using a for loop and the built-in variables programIndex and programCount. programCount gives the total number of instances running (i.e. the gang size) and programIndex gives each program instance an index from zero to programCount-1. Thus, in the above, for each for loop iteration a programCount-sized number of contiguous elements of the input arrays are processed concurrently by the program instances.

ispc's built-in programIndex variable is analogous to the threadIdx variable in CUDA and to the get_global_id() function in OpenCL, though a key difference is that in ispc, looping over more than a gang's worth of items to process is implemented by the programmer as an in-language for or foreach loop, while in those languages the corresponding iteration is effectively done by the hardware and runtime thread scheduler outside of the user's kernel code. Performing this mapping in user code gives the programmer more control over the structure of the parallel computation.
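The same computation can also be written with ispc's foreach construct (described further in Section 5.6); the following variant is our sketch rather than code from the paper:

export void update_foreach(uniform float values[],
                           uniform int iterations[],
                           uniform int num_values) {
    // foreach partitions [0, num_values) into gang-sized chunks and
    // automatically masks the final partial chunk, replacing the explicit
    // programIndex/programCount loop used above.
    foreach (i = 0 ... num_values) {
        int iters = iterations[i];
        while (iters-- > 0)
            values[i] *= values[i];
    }
}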
4.2 Implementation
The ispc compiler uses flex and bison for tokenization and parsing. The compiler front-end performs type-checking and standard early optimizations such as constant folding before transforming the program to the vector intermediate representation of the LLVM toolkit [21]. LLVM then performs an additional set of traditional optimizations. Next our custom optimizations are applied, as discussed in Section 6. LLVM then generates final assembly code.

It is reasonably easy to add support for new target instruction sets: most of the compiler is implemented in a fashion that is target agnostic (e.g. "gathers" are issued generically and only late in the compilation process are they transformed to a target-specific operation).

5. DESIGN FEATURES
We'll now describe some key features of ispc and how they support the goals introduced in Section 2.
5.1 "Uniform" Datatypes
In a SPMD language like ispc, a declaration of a variable like float x represents a variable with a separate storage location (and thus, potentially a different value) for each of the program instances. However, some variables and their associated computations do not need to be replicated across program instances. For example, address computations and loop iteration variables can often be shared.

Since CPU hardware provides separate scalar computation units, it is important to be able to express non-replicated storage and computation in the language. ispc provides a uniform storage class for this purpose, which corresponds to a single value in memory and thus a value that is the same across all program instances. In addition to the obvious direct benefits, the use of uniform variables facilitates additional optimizations as discussed in Section 6.1. It is a compile-time error to assign a non-uniform (i.e., "varying") value to a uniform variable.
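A brief illustration of the distinction (our sketch, not code from the paper):

uniform int count = 10;      // one shared value; scalar storage and computation
int lane = programIndex;     // varying: a distinct value in each program instance
for (uniform int i = 0; i < count; ++i) {
    // loop bookkeeping here stays scalar; work inside may still be varying
}
uniform int bad = lane;      // compile-time error: varying assigned to uniform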
In the absence of the uniform storage class, an optimizing compiler could convert varying variables into uniform variables when appropriate. (For example, in OpenCL or CUDA, all kernel parameters are effectively uniform and only variables that have values that are derived from the thread index are varying.) However, there are a number of reasons why having uniform as an explicit property of types in the language is important:

• Interoperability with C/C++ data structures: uniform is necessary to explicitly declare in-memory variables of just a single element, as is common in C/C++.

• Performance transparency: Treating uniform as an optimization rather than an explicit type property would make it difficult for the programmer to reason about performance. A small change to a program could inadvertently inhibit the optimization elsewhere, resulting in significant and difficult-to-understand performance regressions.

• Support for separate compilation: Optimizations cannot cross separate-compilation boundaries, so at a minimum it must be possible to define a formal function parameter as uniform. But to guarantee that a call to such a function with a variable as an actual parameter is legal, uniform must be an explicit part of the type system. Otherwise, the legality of the function call would depend on the optimizer's behavior for the variable.

There is a downside to distinguishing between uniform and varying types in the type system: with separately compiled libraries of functions, to provide optimum performance it may be necessary to have multiple variants of functions that take different combinations of uniform and varying parameters.

The uniform and varying keywords were first used in the RenderMan shading language [11], but a similar distinction was made even earlier in general-purpose SIMD languages. To designate a SIMD variable, C* uses poly; Pixar's FLAP-C uses parallel; and MPL uses plural. CUDA and OpenCL do not provide this distinction; all variables are semantically varying in those languages.

5.2 Support For SOA Layout
It is well known that the standard C/C++ layout in memory for an "array of structures" (AOS) leads to sub-optimal performance for SIMD code. The top third of Figure 2 illustrates the issue using a simple structure, which corresponds to the ispc code below:

struct Foo { float x, y, z; };
uniform Foo a[...] = { ... };
int index = ...;
float x = a[index].x;

Even if program instances access the elements of contiguous structures (i.e. the values of index are sequential over the program instances), the locations accessed are strided in memory and performance suffers from gathers (Section 3.5).

Figure 2: "Array of structures" layout (top), "hybrid structure of arrays" layout (middle), and "short structure of arrays" layout (bottom) of the example structure from Section 5.2. Reading data in an AOS layout generally leads to expensive gather instructions, while the SOA layouts lead to efficient vector load instructions. The layouts and corresponding access expressions are:
AOS:         x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 ...         float v = a[index].x
hybrid SOA:  x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 x4 x5 ...   float v = a[index / 4].x[index & 3]
short SOA:   x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3             float v = a.x[index]

A better performing in-memory layout is "hybrid structure of arrays" (hybrid SOA layout), where the structure members are widened to be short arrays. On a system with a 4-wide vector unit, one might instead use the following struct declaration and access code:

struct Foo4 { float x[4], y[4], z[4]; };
uniform Foo4 a[...] = { ... };
int index = ...;
float x = a[index / 4].x[index & 3];

The corresponding memory layout is shown in the middle third of Figure 2. In many cases, accessing structure elements in hybrid SOA layout can be done with efficient vector load and store instructions.

The above syntax for declaring hybrid SOA layout and accessing hybrid SOA data is awkward and unnecessarily verbose; each element of Foo4 has the same array width repeated in its declaration. If we want both SOA and AOS versions of the struct, we would have to declare two structs with different types, which is undesirable. Furthermore, accessing elements of the structure is much more unwieldy to express than in the AOS case.
ispc addresses these problems and encourages more efficient hybrid SOA data layout by introducing a keyword soa, which modifies existing types to be laid out in SOA format. The soa qualifier converts primitive types (e.g. float or int) to fixed-sized arrays of that type, while for nested data structures or arrays, soa propagates downward through the data structure until it reaches a primitive type. Traditional array indexing syntax is used for indexing into hybrid SOA data, while the code generated by the compiler actually implements the two-stage indexing calculation. Thus, use of the more efficient hybrid SOA layout can be expressed as follows in ispc:

struct Foo { float x, y, z; };
soa<4> struct Foo a[...] = { ... };
int index = ...;
float x = a[index].x;

Other than the soa<4> keyword, the code looks just like what one would write for an AOS layout, yet it delivers all of the performance benefits of hybrid SOA. As far as we know, these SOA capabilities have not been provided before in a general-purpose C-like language.

SOA layout also improves the performance of accesses to variables used by each program instance in a gang. We refer to this layout as a "short SOA layout" and illustrate it in the bottom of Figure 2. In the SPMD programming model, such variables should "look" scalar when they are used in expressions, so the indexing of such variables by programIndex should be implicit. Note that CUDA and OpenCL achieve similar behavior by storing such variables in a separate per-lane memory space. The keyword varying produces the desired behavior in ispc: it causes a structure to be widened to the gang size and to be implicitly indexed by programIndex. In the code below, after the expensive AOS structure loads have been performed by the indexing operation, the elements of fv are laid out contiguously in memory and so can be accessed efficiently.

uniform struct Foo a[...] = {...};
int index = ...;
varying Foo fv = a[index];
// now e.g. fv.x is contiguous in memory
fv.x = fv.y + fv.z; // looks scalar

varying structures of this form are also available in the VecImp and IVL languages, which were designed concurrently with ispc [23].

The ability to conveniently but explicitly declare and access hybrid SOA and short SOA data structures is one of the major advantages of ispc over OpenCL when targeting CPU hardware. Note that languages that do not define memory layout as strictly as C/C++ (and hence typically forbid pointers or restrict pointer arithmetic) may choose to optimize layout to SOA form even when the declaration appears to be AOS. For languages with strict layout rules, the compiler may still optimize layout to SOA form if it can guarantee that pointers are never used to access the data. However, these approaches provide less performance transparency than ispc's approach and cannot be used for zero-copy data structures that are shared with the C/C++ application.

5.3 Full C Pointer Model
ispc generalizes the full set of C pointer operations to SPMD, including both uniform and varying pointers, pointers to pointers, and function pointers. This feature is important for the expressibility of algorithms that use complex pointer-based data structures in ispc and is also critical for allowing ispc programs to interoperate with existing application data structures. Often the code that builds these data structures is not performance-critical and can be left in C/C++, while the performance-critical portions of the application that read or update the data structure can be rewritten in ispc.

The distinction between uniform and varying data exists for both the pointer itself and for the data that is pointed to. (MasPar's C extensions make a similar distinction [29].) Thus, there are four kinds of pointers:

uniform float * uniform x;
varying float * uniform x;
uniform float * varying x;
varying float * varying x;

The first two declarations above are uniform pointers; the first points to uniform data and the second to varying data. Both are thus represented as single scalar pointers. The second two declarations are varying pointers, representing a separate pointer for each program instance. Because all variables and dynamically allocated storage reside in a single coherent address space, any pointer can point to any data of the appropriate underlying type in memory.
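A small usage sketch (ours, not from the paper) contrasting a uniform pointer with a varying pointer, both pointing to uniform data:

uniform float data[256];
uniform float * uniform p = &data[0];             // one pointer shared by the gang
float a = p[programIndex];                        // scalar base plus per-instance offset
uniform float * varying q = &data[programIndex];  // a distinct pointer per instance
float b = *q;                                     // dereferencing is a gather in general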
In OpenCL and CUDA, all locally-declared pointers are in effect varying pointers to data, with additional limitations imposed by the fragmented memory architecture. CUDA supports function pointers and pointers to pointers, whereas OpenCL does not support function pointers and only supports certain cases of pointers to pointers.

5.4 Task Launch
In order to make it easy to fill multiple CPU cores with computation, ispc provides an asynchronous task launch mechanism, closely modeled on the "spawn" facility provided by Cilk [4]. ispc functions called in this manner are semantically asynchronous function calls that may run concurrently in different hardware threads than the function that launched them. This capability makes multi-core parallelization of ispc programs straightforward when independent computation is available; generally just a few lines of additional code are needed to use this construct.

Any complex multi-core C++ application typically has its own task system or thread pool, which may be custom designed or may be an existing one such as Microsoft's Concurrency Runtime, or Intel Thread Building Blocks. To interoperate with the application's task system, ispc allows the user to provide a callback to a task enqueue function, and then uses this callback to enqueue asynchronous tasks.

As in Cilk, all tasks launched from an ispc function must have returned before the function is allowed to return. This characteristic ensures parallel composability by freeing callers of functions from having to be aware of whether tasks are still executing (or yet to be executed) from functions they called. ispc also provides an explicit built-in sync construct that waits for tasks launched earlier in the function to complete.

Current GPU programming languages have no support for task launch from the GPU, although it is possible to implement a task system in "user space" in CUDA [1].
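A sketch of how this looks in use (our example, not code from the paper; the task, launch, and sync constructs are documented in the ispc manual, and their exact syntax has varied somewhat across ispc versions):

task void double_row(uniform float a[], uniform int width, uniform int y) {
    // Each launched task processes one row using the gang of program instances.
    foreach (x = 0 ... width)
        a[y * width + x] *= 2;
}

export void double_all(uniform float a[], uniform int width,
                       uniform int height) {
    for (uniform int y = 0; y < height; ++y)
        launch double_row(a, width, y);  // asynchronous; may run in another hardware thread
    sync;                                // wait for launched tasks (also implied at return)
}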
5.5 Cross-lane operations
One strength of SIMD capabilities on CPUs is the rich set of fast cross-lane operations. For example, there are instructions for broadcasting a value from one lane to all other lanes, and instructions for permuting values between lanes. ispc exposes these capabilities through built-in functions that allow the program instances in a gang to exchange data. These operations are particularly lightweight thanks to the gang convergence guarantees described in Section 3.4.
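A few of these built-in functions are sketched below; the names follow the ispc standard library documentation, but treat the exact signatures as illustrative:

float x = ...;
uniform float total = reduce_add(x);  // sum of x across the whole gang
float fromLane0 = broadcast(x, 0);    // every instance receives lane 0's value
float neighbor = rotate(x, 1);        // value held by the neighboring lane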
5.6 Coherent Control Flow Hints
As described in Section 3.3, divergent control flow requires extra instructions on CPU hardware compared to regular control flow. In many uses of control flow, the common case is that all program instances follow the same control path. If the compiler had a way to know this, it could perform a number of optimizations, which are introduced in Section 6.5. ispc provides language constructs to express the programmer's expectation that control flow will typically be converged at a given point in the program. For each control flow construct, there is a corresponding "coherent" variant with the character "c" prepended to it. The following code shows cif in use:

float x = ...;
cif (x < 0) {
    // handle negative x
}

These coherent control flow variants do not affect program correctness or the final results computed, but can potentially lead to higher performance.

For similar reasons, ispc provides convenience foreach constructs that loop over arrays of one or more dimensions and automatically set the execution mask at boundaries. These constructs allow the ispc compiler to easily produce optimized code for the subset of iterations that completely fill a gang of program instances (see Section 6.6 for a description of these optimizations).
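For example, a two-dimensional foreach might look like the following sketch (ours, not from the paper); the execution mask is set automatically for the partial gangs at the edges of the iteration domain:

export void dim_image(uniform float img[], uniform int width,
                      uniform int height) {
    // Iterate over all pixels; ispc maps gang-sized chunks of the 2D domain
    // to the program instances and masks the ragged edges.
    foreach (y = 0 ... height, x = 0 ... width)
        img[y * width + x] *= 0.5;
}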
5.7 Native Object Files and Function Calls
The ispc compiler generates native object files that can be linked into the application binary in the same way that other object files are. ispc code can be split into multiple object files if desired, with function calls between them resolved at link time. Standard debugging information is optionally included. These capabilities allow standard debuggers and disassemblers to be used with ispc programs and make it easy to add ispc code to existing build systems.

ispc's calling conventions are based on the platform's standard ABI, though functions not marked export are augmented with an additional parameter that provides the current execution mask. Functions that are marked export can be called with a regular function call from C or C++; calling an ispc function is thus a lightweight operation—the overhead is the same as that of calling an externally-defined C or C++ function. In particular, no data copying or reformatting is performed, other than possibly pushing parameters onto the stack if required by the platform ABI. While there are some circumstances where such reformatting could lead to improved performance, introducing such a layer is against our goals of performance transparency.

Lightweight function calls are a significant difference from OpenCL on the CPU, where an API call to a driver must be made in order to launch a kernel and where additional API calls are required to set each kernel parameter value.
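To illustrate this interoperability, here is a sketch of the C++ side of the update() example from Section 4. We assume the ispc source was compiled to an object file along with a generated header (for instance via ispc's -h option), and that the header is named update_ispc.h and places the exported declarations in an ispc namespace; adjust these assumptions to match the header your ispc version emits.

// main.cpp -- hypothetical host-side code calling the exported ispc function.
#include "update_ispc.h"   // assumed name of the ispc-generated header

int main() {
    float *values = new float[1024];
    int *iterations = new int[1024];
    // ... initialize values[] and iterations[] ...
    ispc::update(values, iterations, 1024);  // an ordinary function call; no driver API
    delete[] values;
    delete[] iterations;
    return 0;
}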
6. EFFICIENT SPMD-ON-SIMD
There are a number of specialized optimizations that ispc applies to generate efficient code for SPMD on CPUs. We will show how the features introduced in Section 5 make a number of these optimizations possible. We focus on the optimizations that are unique to SPMD-on-SIMD; ispc also applies a standard set of traditional optimizations (constant folding, inlining, etc.).

6.1 Benefits of "Uniform"
Having scalar uniform data types, as introduced in Section 5.1, provides a number of benefits compared to always having a separate per-program-instance storage location for each variable in the program:

• It reduces the total amount of in-memory storage used for data, which in turn can lead to better cache performance.

• Less bandwidth is consumed when reading and writing scalar values to memory.

• CPUs have separate register sets for scalar and vector values; storing values in scalar registers when possible reduces pressure on vector registers.

• CPUs can co-issue scalar and vector instructions, so that scalar and vector computations can happen concurrently.

• In the usual case of using 64-bit pointers, pointer arithmetic (e.g. for addressing calculations) is more efficient for scalar pointers than for vector pointers.

• Dereferencing a uniform pointer (or using a uniform value to index into an array) corresponds to a single scalar or vector memory access, rather than a general gather or scatter.

• Code for control flow based on uniform quantities can be more efficient than code for control flow based on non-uniform quantities (Section 6.2).

For the workloads we use for evaluation in Section 7, if all uses of the uniform qualifier were removed, thus eliminating all of the above benefits, the workloads ran at geometric mean (geomean) 0.45x the speed of when uniform was present. The ray tracer was hardest hit, running at 0.05x of its previous performance; "aobench" ran at 0.36x its original performance without uniform, and "stencil" at 0.21x.

There were multiple causes of these substantial performance reductions without uniform; the most significant were the higher overhead of non-uniform control flow and the much greater expense of varying pointer operations compared to uniform pointer operations. Increased pressure on the vector registers, which in turn led to more register spills to memory, also impacted performance without uniform.
6.2 Uniform Control Flow
When a control flow test is based on a uniform quantity, all program instances will follow the same path at that point in a function. Therefore, the compiler is able to generate regular jump instructions for control flow in this case, avoiding the costs of mask updates and the overhead of handling control flow divergence.

Treating all uniform control flow statements as varying caused the example workloads to run at geomean 0.91x the performance of when this optimization was enabled. The optimization had roughly similar effectiveness on all of the workloads, though the ray tracer was particularly hard-hit without it, running at 0.65x of its performance with the optimization enabled.
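For instance, in the sketch below (ours, not from the paper), the if test involves only a uniform value, so the compiler emits an ordinary branch for it rather than mask updates, even though the work inside the branch is varying:

export void clamp_all(uniform float a[], uniform int count, uniform int enabled) {
    if (enabled != 0) {               // uniform test: a regular branch, no mask updates
        foreach (i = 0 ... count)
            a[i] = max(a[i], 0.0f);   // varying work inside the uniform branch
    }
}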
6.3 Benefits of SOA
We measured the benefits of SOA versus AOS layout with a workload based on a collision detection algorithm that computed collision points, when present, between two groups of spheres. We implemented this workload with AOS layout and then modified the implementation to also use SOA. By avoiding the gathers required with AOS layout, the SOA version was 1.25x faster than the AOS one.

6.4 Coherent Memory Access
After conventional compiler optimizations have been applied, it is often possible to detect additional cases where the program instances are actually accessing memory coherently [17]. The ispc compiler performs an additional optimization pass late in compilation that detects cases where all the instances, even if using "varying" indexing or pointers, are actually accessing the same location or consecutive locations.

When this optimization was disabled for the example workloads, performance was geomean 0.79x that of when it is enabled. This optimization had a significant effect on the "stencil" workload, which ran 0.23x as fast when it was disabled.

6.5 Dynamic Control Flow Coherence
Recall from Section 3.3 that control flow is generally transformed by the compiler to data flow with masking, so that, for example, both the "if" and "else" clauses of an if statement are executed. In many such cases, the executing program instances will actually follow a converged control-flow path at runtime; for example, only the "else" clause might actually be needed. The code generated by the compiler can check for this case at points in the program where control flow could diverge. When it actually does not diverge, a more efficient code path can be followed. Performing this check can be especially helpful to performance for code paths that are rarely executed (corner case handling, etc.).

The ispc compiler uses the "coherent" control flow statements described in Section 5.6 to indicate when these additional tests should be performed. Performing this check for dynamic convergence at runtime gives two main advantages:

• It makes it possible to avoid executing instructions when the mask is "all off" and to jump over them.

• It gives an opportunity for dynamically reestablishing that the mask is "all on" and then taking a specialized code path for that case; the advantages of doing so are discussed in the following subsection.

Disabling the coherent control flow statements caused the example workloads to run at geomean 0.85x of their performance when these statements are enabled. This optimization was particularly important for "aobench", which ran at 0.33x of regular performance without it. For the workloads that only have "uniform" control flow (e.g. Black-Scholes), disabling this optimization had no effect.

6.6 All On Mask
When it can be determined (statically or dynamically) that all of the program instances in a gang are executing at a point in the program, there are additional optimization opportunities. For example, scatters need to be "scalarized" on current CPUs; they are turned into a scalar store for each currently-executing program instance. In the general case, this scalarization requires a conditional test for each program instance before the corresponding store instruction. If all program instances are known to be executing, however, the per-lane mask check can be omitted.

There is furthermore some benefit to turning masked loads and stores into regular loads and stores when the mask is known to be all on, even on systems that support masked memory instructions natively. Doing so can in turn allow those memory operations to be emitted as direct memory operands to instructions, without needing to be first loaded into registers.

Disabling all of the optimizations that take advantage of statically determining that the execution mask is all on led to geomean 0.73x the performance of when they were enabled.

Sec.      Optimization                    Perf. when disabled
6.1, 6.2  Uniform data & control flow     0.45x
6.2       Uniform control flow            0.91x
6.4       Gather/scatter improvements     0.79x
6.5       Coherent control flow           0.85x
6.6       "All on" mask improvements      0.73x

Table 1: Effect of individually disabling various optimizations (geometric mean over all of the example workloads).

7. RESULTS
We have measured performance of a variety of workloads written in ispc, comparing to serial, non-SIMD C++ implementations of the same algorithms. Both the C++ and ispc implementations received equivalent amounts of performance tuning. (In general, the ispc and C++ implementations are syntactically very similar.)

These workloads are all included in the open-source ispc distribution. They include two options pricing workloads, a third-order stencil computation, a ray tracer, a volume renderer, Mandelbrot set computation, and "aobench", a Monte Carlo rendering benchmark [9]. Most of these are not suitable for conventional auto-vectorization, due to complex data-dependent control flow and program complexity.

For the results reported here, we did a number of runs of each workload, reporting the minimum time; the results were within a few percent over each run. Other than the results on a 40-core system, results were measured on a 4-core Apple iMac with a 3.4GHz Intel(R) Core i7 processor using the AVX instruction set. The basis for comparison is a reference C++ implementation compiled with a version of the clang compiler built using the same version of the LLVM libraries that are used by ispc. (We have also tested with various versions of gcc, with essentially the same results.) Thus, the results should generally indicate the performance due to more effective use of the vector unit rather than differences in the implementation of traditional compiler optimizations or code generation.
Workload 1 core / 4 cores / Workload 40 cores /
1 thread 8 threads 80 threads
aobench 5.58x 26.26x aobench 182.36x
Binomial Options 4.39x 18.63x Binomial Options 63.85x
Black-Scholes 7.43x 26.69x Black-Scholes 83.97x
Mandelbrot Set 5.85x 24.67x Mandelbrot Set 76.48x
Ray Tracer 6.85x 34.82x Ray Tracer 195.67x
Stencil 3.37x 12.03x Stencil 9.40x
Volume Rendering 3.24x 15.92x Volume Rendering 243.18x

Table 2: Speedup of various workloads on a single Table 3: Speedup versus serial C++ implementa-
core and on four cores of a system with 8-wide SIMD tions of various workloads on a 40-core system with
units, compared to a serial C++ implementation. 4-wide SIMD units.
The one core speedup shows the benefit from using
the SIMD lanes of a single core efficiently, while the
four core speedup shows the benefit from filling the vector width, there are a number of microarchitectural de-
entire processor with useful computation. tails in the first generation of AVX systems that inhibit ideal
speedups; they include the fact that the integer vector units
are still only four-wide, as well as the fact that cache write
would be hard to interpret the results in that the effects of bandwidth was not doubled to keep up with the widening of
different compiler optimizations and code generators would the vector registers.
be confounded with the effects of the impact of the lan-
guage designs. Instead, we have focused on evaluating the 7.4 Scalability on Larger Systems
performance benefit of various ispc features by disabling Table 3 shows the result of running the example work-
them individually, thus isolating the effect of the factor un- loads with 80 threads on a 2-way hyper-threaded 40-core
der evaluation. Intel R Xeon E7-8870 system at 2.40GHz, using the SSE4
Table 1 recaps the effects of the various compiler opti- instruction set and running Microsoft Windows Server 2008
mizations that were reported in Section 6. Enterprise. For these tests, the serial C/C++ baseline code
was compiled with MSVC 2010. No changes were made to
7.1 Speedup Compared to Serial Code the implementation of workloads after their initial paral-
Table 2 shows speedups due to ispc’s effective use of lelization, though the “aobench” and options pricing work-
SIMD hardware and due to ispc’s use of task parallelism loads were run with larger data sets than the four-core runs
on a 4-core system. The table compares three cases for (2048x2048 image resolution versus 512x512, and 2M op-
each workload: a serial non-SIMD C++ implementation; tions versus 128k options, respectively).
an ispc implementation running in a single hardware thread The results fall into three categories: some (aobench, ray
on a single core of the system; and an ispc implementation tracer, and volume rendering), saw substantial speedups ver-
running eight threads on the four two-way hyper-threaded sus the serial baseline, thanks to effective use of all of the sys-
cores of the system. The four core performance shows the tem’s computational resources, achieving speedups of more
result of filling the entire processor with computation via than the theoretically-ideal 160x (the product of number of
both task-parallelism and SPMD. For the four-core results, cores and SIMD width on each core); again, the super-linear
the workloads were parallelized over cores using the tasking component of the speedups is mostly due to hyper-threading.
functionality described in Section 5.4. Other workloads (both of the options pricing workloads and
7.2 Speedup Versus Intrinsics

The complexity of the example workloads (which are as much as 700 lines of ispc code) makes it impractical to also implement intrinsics-based versions of them for performance comparisons. However, a number of users of the system have implemented computations in ispc after previously implementing the same computation with intrinsics, and have seen good results; the examples we've seen are an image downsampling kernel (ispc performance 0.99x of intrinsics), a collision detection computation (ispc 1.05x faster), and a particle system rasterizer (ispc 1.01x faster).
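As a small illustration of the difference in programmer effort (our own example, not one of the workloads above), the following shows a trivial scaling loop as it might be hand-written with SSE intrinsics and as it is written in ispc; the ispc version leaves vector-width selection, masking of the final partial iteration, and instruction selection to the compiler:

    // For comparison, the same loop hand-written with SSE intrinsics in C
    // (assumes n is a multiple of 4):
    //
    //     #include <xmmintrin.h>
    //     void scale(float *a, float s, int n) {
    //         __m128 vs = _mm_set1_ps(s);
    //         for (int i = 0; i < n; i += 4)
    //             _mm_storeu_ps(a + i, _mm_mul_ps(_mm_loadu_ps(a + i), vs));
    //     }
    //
    // The equivalent ispc code; the compiler maps the foreach iterations onto
    // the SIMD lanes and handles any ragged final iteration with masking.
    export void scale(uniform float a[], uniform float s, uniform int n) {
        foreach (i = 0 ... n)
            a[i] *= s;
    }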
7.3 Speedup with Wider Vectors

We compared the performance of compiling the example workloads to use four-wide SSE vector instructions versus eight-wide AVX on a system that supported both instruction sets. No changes were made to the workloads' ispc source code. The geometric mean of the speedup for the workloads when going from SSE to AVX was 1.42x. Though this is not as good as the potential 2x speedup from the doubling of vector width, there are a number of microarchitectural details in the first generation of AVX systems that inhibit ideal speedups; they include the fact that the integer vector units are still only four wide, and the fact that cache write bandwidth was not doubled to keep up with the widening of the vector registers.

7.4 Scalability on Larger Systems

Table 3 shows the result of running the example workloads with 80 threads on a 2-way hyper-threaded 40-core Intel® Xeon E7-8870 system at 2.40GHz, using the SSE4 instruction set and running Microsoft Windows Server 2008 Enterprise. For these tests, the serial C/C++ baseline code was compiled with MSVC 2010. No changes were made to the implementation of the workloads after their initial parallelization, though the "aobench" and options pricing workloads were run with larger data sets than in the four-core runs (2048x2048 image resolution versus 512x512, and 2M options versus 128k options, respectively).

The results fall into three categories. Some workloads (aobench, ray tracer, and volume rendering) saw substantial speedups versus the serial baseline, thanks to effective use of all of the system's computational resources, achieving speedups of more than the theoretically-ideal 160x (the product of the number of cores and the SIMD width of each core); again, the super-linear component of the speedups is mostly due to hyper-threading. Other workloads (both of the options pricing workloads and the Mandelbrot set workload) saw speedups around 2x the system's core count; for these, the MSVC compiler seems to have been somewhat effective at automatically vectorizing them, thus improving the serial baseline performance. Note, however, that these are the simplest of the workloads; for the more complex workloads the auto-vectorizer is much less effective.

The stencil computation saw a poor speedup versus the serial baseline (and indeed, a worse speedup than on a four-core system). The main issue is that the computation is iterative, requiring that each set of asynchronous tasks complete before the set of tasks for the next iteration can be launched; the repeated ramp-up and ramp-down of parallel computation hurts scalability. Improvements to the implementation of the underlying task system could presumably reduce the impact of this issue.

7.5 Users

There have been over 1,500 downloads of the ispc binaries since the system was first released; we don't know how many additional users are building the system from source. Users have reported roughly fifty bugs and made a number of suggestions for improvements to code generation and language syntax and capabilities.

Overall feedback from users has been positive, both from users with a background in SPMD programming from GPUs and from users with extensive background in intrinsics programming. Their experience has generally been that ispc's interoperability features and close relationship to C have made it easy to adopt the system; users can port existing code to ispc by starting with existing C/C++ code, updating it to remove any constructs that ispc doesn't support (like classes), and then modifying it to use ispc's parallel constructs. It hasn't been unusual for a user with a bit of ispc experience to port an existing 500–1000 line program from C++ to ispc in a morning's work. From the other direction, many ispc programs can be compiled as C with the introduction of a few preprocessor definitions; being able to go back to serial C with the same source code has been useful for a number of users as well.
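As a rough sketch of what those definitions can look like (this mapping is our own illustration and covers only the simplest cases, not a complete or official compatibility layer), a guarded block of macros can map common ispc qualifiers and built-ins to serial-C equivalents when the same file is built with a C compiler:

    /* Sketch only: ispc predefines the ISPC macro, so these definitions take
       effect only when the file is compiled as serial C.  Constructs such as
       foreach or launch need handling beyond what the preprocessor provides. */
    #ifndef ISPC
      #define uniform          /* every value is scalar in the serial build */
      #define varying          /* likewise */
      #define export           /* exported functions become plain C functions */
      #define programCount 1   /* a single "program instance" */
      #define programIndex 0
    #endif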
Applications that users have reported using ispc for include implementing a 2D Jacobi Poisson solver (achieving a 3.60x speedup compared to the previous implementation, both on a single core); implementing a variety of image processing operations for a production imaging system (achieving a 3.2x speedup, again both on a single core); and implementing physical simulation of airflow for aircraft design (speedups not reported to us). Most of these users had not previously bothered to try to vectorize their workloads with intrinsics, but have been able to see substantial speedups using ispc; they have generally been quite happy with both performance transparency and absolute performance.

8. RELATED WORK

The challenge of providing language and compiler support for parallel programming has received considerable attention over many years. To keep the discussion of related work tractable, we focus on languages whose goal is high-performance (or more precisely, high-efficiency) programming of SIMD hardware. We further focus on general-purpose languages (in contrast to domain-specific languages), with a particular emphasis on languages that are C-like. We do not discuss languages and libraries that are focused just on multi-core or distributed parallelism, such as OpenMP, TBB, Cilk, and MPI, even though some of these languages use an SPMD programming model.

8.1 Historical systems

In the late 1980s and early 1990s, there was a wave of interest in SIMD architectures and accompanying languages. In all of the cases we discuss, SIMD computations were supported with a true superset of C; that is, serial C code could always be compiled, but the SIMD hardware was accessible via language extensions. The Pixar FLAP computer had a scalar integer ALU and a 4-wide SIMD floating-point ALU, with an accompanying extended-C language [24]. FLAP is also notable for providing hardware support for SIMD mask operations, like the MIC ISA and some modern GPUs. The Thinking Machines CM-1 and CM-2 and the MasPar MP-1 and MP-2 supercomputers used very wide SIMD (1000s of ALUs), programmed in the extended-C languages C* [33] and MPL [29] respectively.

All of these systems used a single language for both serial and parallel computations; had a single hardware program counter; and provided keywords similar to ispc's uniform and varying to distinguish between scalar and SIMD variables. MPL provided vector control constructs with syntax similar to ispc, OpenCL, and CUDA; C* provided a more limited capability just for if statements. MPL provided generalized SIMD pointers similar to the ones in ispc, but each SIMD ALU and the scalar ALU had its own memory, so these pointers could not be used to communicate data between units as they can in ispc. Both C* and MPL had sophisticated communication primitives for explicitly moving data between SIMD ALUs.

ClearSpeed's Cn is a more recent example of this family of languages; the paper describing it has a good discussion of design trade-offs [26].

8.2 Contemporary systems

CUDA is a SPMD language for NVIDIA GPUs [31] and OpenCL is a similar language developed as an open standard, with some enhancements, such as API-level task parallelism, designed to make it usable for CPUs as well as GPUs [10, 19, 34]. At a high level, the most important differences between these languages and ispc are that ispc's design was not restricted by GPU constraints such as a separate memory system, and that ispc includes numerous features designed specifically to provide efficient performance on CPUs. All three languages are C-like but do not support all features of C. ispc and CUDA have some C++ features as well.

The difference in hardware focus between CUDA/OpenCL and ispc drives many specific differences. OpenCL has several different address spaces, including a per-SIMD-lane memory address space (called "private") and a per-work-group address space (called "local"), whereas ispc has a single global coherent address space for all storage. OpenCL and CUDA also have complex APIs for moving data to and from a discrete graphics card that are unnecessary in ispc. ispc has language-level support for task parallelism, unlike OpenCL and CUDA. CUDA and OpenCL lack ispc's support for "uniform" variables and convenient declaration of structure-of-arrays data types. Although these features are less important for performance on GPUs than on CPUs, we believe they would provide some benefit even on GPUs.
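For concreteness, here is a small sketch of what these two features look like in ispc (our own example; the names and the SOA width of 8 are illustrative, and the soa<> syntax follows the ispc documentation rather than anything shown earlier in this paper):

    struct Point { float x, y, z; };   // conventional AOS struct declaration
    soa<8> Point pts[1024];            // SOA variant derived from the same struct

    export void scale_points(uniform int count, uniform float s) {
        uniform float factor = s;      // uniform: one scalar shared by all lanes
        foreach (i = 0 ... count) {    // varying loop over program instances
            // With the SOA layout, the x (and y, z) members of neighboring
            // elements are contiguous, so these accesses become regular
            // vector loads and stores rather than gathers and scatters.
            pts[i].x *= factor;
            pts[i].y *= factor;
            pts[i].z *= factor;
        }
    }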
There are several implementations of CUDA and OpenCL for CPUs. Some do not attempt to vectorize across SIMD lanes in the presence of control flow [10, 36]. Intel's OpenCL compiler does perform SIMD vectorization [34], using an approach related to Karrenberg et al.'s [17] (who also applied their technique to OpenCL kernels).

Parker et al.'s RTSL system provided SPMD-on-SIMD on current CPUs in a domain-specific language for implementing ray tracers [32].

Microsoft's C++ AMP [30] provides a set of extensions to C++ to support GPU programming. As with CUDA and OpenCL, its design was constrained by the goal of running on today's GPUs. It is syntactically very different from CUDA, OpenCL, and ispc because of its choice of mechanisms for extending C++.

The UPC language extends C to provide an SPMD programming model for multiple cores [5]. UPC includes mechanisms for scaling to very large systems that lack hardware memory coherence, but the language was not designed to target SIMD parallelism within a core and, as far as we know, it has never been used for this purpose.

8.3 Concurrently-developed systems

The IVL and VecImp languages described in a recent paper are similar to ispc in a number of ways [23]; they were developed concurrently with ispc, with some cross-pollination of ideas. These three languages are the only C-like general-purpose languages that we are aware of that provide a mechanism for creating a structure-of-arrays variant of a previously-declared struct data type.

There are substantial differences in emphasis between the VecImp/IVL paper and this work. The VecImp/IVL paper focuses on describing the language and formally proving the soundness of the type system, whereas we focus on justifying and quantitatively evaluating language features such as uniform variables and structure-of-arrays support. IVL and its evaluation focus on the MIC architecture, whereas ispc focuses on the SSE and AVX architectures, which have less dedicated ISA support for SPMD-style computation. This paper also introduces and analyzes the compiler optimizations required to reap the full benefit of language features such as uniform variables.

There are a variety of other detailed differences between ispc, IVL, and VecImp. For example, IVL supports function polymorphism, which is not currently supported in ispc, and ispc's pointer model is more powerful than IVL's. ispc uses LLVM for code generation, while the IVL compiler generates C++ code with intrinsics. ispc is the only one of the three languages with an implementation available for public use.

The Intel C/C++ compiler provides an "elemental functions" extension of C++ that is intended to provide SPMD as an extension of a full C++ compiler [16]. Its language functionality for SPMD is more limited than ispc's; for example, its equivalent of uniform can only be applied to function parameters, and there is no general facility for creating SOA types from AOS types. It has been demonstrated that its capabilities can be used to achieve good utilization of SIMD units [20].

9. CONCLUSION

We have presented ispc, a SPMD language for programming CPU vector units that is easy to adopt and productive to use. We have shown that a few key language features (uniform data types, native support for SOA structure layout, and in-language task launch), coupled with a series of custom optimization passes, make it possible to efficiently execute SPMD programs on the SIMD hardware of modern CPUs. These programs can effectively target the full capabilities of CPUs, executing code with performance essentially the same as hand-written intrinsics. Support for uniform types is particularly important; our experiments showed that this capability provides over a 2x performance improvement.

In the future, we plan to further refine the ispc language, eliminating remaining differences with C and adding convenience features like polymorphic functions. We are already adding support for the MIC architecture, which is an attractive target due to its 16-wide SIMD and good ISA support for SPMD execution.

Experience with ispc suggests a number of avenues for improving future hardware architectures. For conventional CPUs, improved support for masking and scatter would be desirable, and extending vector units to operate on 64-bit integer values at the same performance as when operating on 32-bit integer values (for vector pointer addressing calculations) may be helpful.

The decision to include both scalar and SIMD computation as first-class operations in the language may be applicable to other architectures. For example, AMD's forthcoming GPU has a scalar unit alongside its vector unit [27], as does a research architecture from NVIDIA [18]. Such architectures could have a variety of efficiency advantages versus a traditional "brute force" SIMD-only GPU implementation [6]. More broadly, many of the available approaches for achieving high SIMD efficiency can be implemented in different ways: by the programmer/language, by the compiler, or by the hardware. In the power-constrained environment that limits all hardware architectures today, we expect continued exploration of the complex trade-offs between these different approaches.

Acknowledgments

The parser from the (non-SPMD) C-based "nit" language written by Geoff Berry and Tim Foley at Neoptica provided the starting point for the ispc implementation; ongoing feedback from Geoff and Tim about design and implementation issues in ispc has been extremely helpful. Tim suggested the "SPMD on SIMD" terminology and has extensively argued for the advantages of the SPMD model.

We'd like to specifically thank the LLVM development team; without LLVM, this work wouldn't have been possible. Bruno Cardoso Lopes's work on support for AVX in LLVM was particularly helpful for the results reported here.

We have had many fruitful discussions with Ingo Wald that have influenced the system's design; ispc's approach to SPMD and Ingo's approach with the IVL languages have had bidirectional and mutually-beneficial influence. More recently, discussions with Roland Leißa and Sebastian Hack about VecImp have been quite helpful.

We appreciate the support of Geoff Lowney and Jim Hurley for this work, as well as Elliot Garbus's early enthusiasm and support for it. Thanks to Kayvon Fatahalian, Solomon Boulos, and Jonathan Ragan-Kelley for discussions about SPMD parallel languages and SIMD hardware architectures. Discussions with Nadav Rotem about SIMD code generation and LLVM, as well as discussions with Matt Walsh, have also directly improved the system. Ali Adl-Tabatabai's feedback and detailed questions about the precise semantics of ispc have been extremely helpful as well.

Thanks to Tim Foley, Mark Lacey, Jacob Munkberg, Doug McNabb, Andrew Lauritzen, Misha Smelyanskiy, Stefanus Du Toit, Geoff Berry, Roland Leißa, Aaron Lefohn, Dillon Sharlet, and Jean-Luc Duprat for comments on this paper, and thanks to the early users of ispc inside Intel (Doug McNabb, Mike MacPherson, Ingo Wald, Nico Galoppo, Bret Stastny, Andrew Lauritzen, Jefferson Montgomery, Jacob Munkberg, Masamichi Sugihara, and Wooyoung Kim) in particular for helpful suggestions and bug reports, as well as for their patience with early versions of the system.
10. REFERENCES

[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proc. High-Performance Graphics 2009, pages 145–149, 2009.
[2] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proc. POPL '83.
[3] American National Standards Institute. American National Standard Programming Language C, ANSI X3.159-1989, 1989.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP), July 1995.
[5] W.-Y. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick. A performance analysis of the Berkeley UPC compiler. In Proc. of 17th Annual Intl. Conf. on Supercomputing, pages 63–73, 2003.
[6] S. Collange, D. Defour, and Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proc. of the 2009 Intl. Conf. on Parallel Processing, Euro-Par '09.
[7] F. Darema, D. George, V. Norton, and G. Pfister. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing, 7(1), 1988.
[8] M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948–960, Sept. 1972.
[9] S. Fujita. AOBench. http://code.google.com/p/aobench.
[10] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster, and B. Zheng. Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors. In PACT '10.
[11] P. Hanrahan and J. Lawson. A language for shading and lighting calculations. SIGGRAPH Comput. Graph., 24:289–298, September 1990.
[12] R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. Passing a language through the eye of a needle. ACM Queue, 9(5).
[13] Intel. Intel SPMD Program Compiler documentation. http://ispc.github.com/documentation.html.
[14] Intel. Intel SPMD Program Compiler User's Guide. http://ispc.github.com/ispc.html.
[15] Intel. Intel Advanced Vector Extensions programming reference. June 2011.
[16] Intel. Intel Cilk Plus Language Extension Specification, Version 1.1, 2011. Online document.
[17] R. Karrenberg and S. Hack. Whole-function vectorization. In CGO 2011.
[18] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31:7–17, Sept–Oct 2011.
[19] Khronos OpenCL Working Group. The OpenCL Specification, Sept. 2010.
[20] C. Kim, N. Satish, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey. Closing the ninja performance gap through traditional programming and compiler technology. Technical report, Intel Corporation, Dec 2011.
[21] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proc. of CGO '04, Mar 2004.
[22] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In Proc. ISCA 2010.
[23] R. Leißa, S. Hack, and I. Wald. Extending a C-like language for portable SIMD programming. In PPoPP, Feb 2012.
[24] A. Levinthal, P. Hanrahan, M. Paquette, and J. Lawson. Parallel computers for graphics applications. SIGPLAN Not., October 1987.
[25] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, Mar–April 2008.
[26] A. Lokhmotov, B. Gaster, A. Mycroft, N. Hickey, and D. Stuttard. Revisiting SIMD programming. In Languages and Compilers for Parallel Computing, pages 32–46, 2008.
[27] M. Mantor and M. Houston. AMD Graphics Core Next: Low power high performance graphics and parallel compute. Hot3D, High Performance Graphics Conf., 2011.
[28] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. ACM Trans. Graph., July 2003.
[29] MasPar Computer Corporation. MasPar Programming Language (ANSI C compatible MPL) Reference Manual, Software Version 3.0, July 1992.
[30] Microsoft Corporation. MSDN Library: Overview of C++ Accelerated Massive Parallelism (C++ AMP), 2011. Online preview documentation, visited Dec. 2011.
[31] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6:40–53, March 2008.
[32] S. G. Parker, S. Boulos, J. Bigler, and A. Robison. RTSL: a ray tracing shading language. In Proc. of the 2007 IEEE Symp. on Interactive Ray Tracing, 2007.
[33] J. Rose and J. G. Steele. C*: An extended C language for data parallel programming. In Proc. of the Second Intl. Conf. on Supercomputing, May 1987.
[34] N. Rotem. Intel OpenCL SDK vectorizer. LLVM Developer Conf. presentation, Nov. 2011.
[35] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., August 2008.
[36] J. A. Stratton, S. S. Stone, and W.-M. W. Hwu. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proc. 21st Int'l Workshop on Languages and Compilers for Parallel Computing, 2008.
[37] M. Wolfe, C. Shanklin, and L. Ortega. High-Performance Compilers for Parallel Computing. Addison Wesley, 1995.