Roofline: An Insightful Visual Performance Model For Floating-Point Programs and Multicore Architectures
modeled has a peak double precision floating-point performance of 17.6 GFlops/sec and a peak memory bandwidth of 15 GBytes/sec from our benchmark. This latter measure is the steady state bandwidth potential of the memory in a computer, not the pin bandwidth of the DRAM chips.

We can plot a horizontal line showing peak floating-point performance of the computer. Obviously, the actual floating-point performance of a floating-point kernel can be no higher than the horizontal line, since that is a hardware limit.

How could we plot the peak memory performance? Since the X-axis is GFlops per byte and the Y-axis is GFlops per second, bytes per second—which equals (GFlops/second)/(GFlops/byte)—is just a line at a 45-degree angle in this figure. Hence, we can plot a second line that gives the maximum floating-point performance that the memory system of that computer can support for a given operational intensity. This formula drives the two performance limits in the graph in Figure 1a:

Attainable GFlops/sec = Min(Peak Floating-Point Performance, Peak Memory Bandwidth x Operational Intensity)

These two lines intersect at the point of peak computational performance and peak memory bandwidth. Note that these limits are created once per multicore computer, not once per kernel.
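To make the bound concrete, here is a minimal C sketch of the formula using the Opteron X2 figures quoted above (17.6 GFlops/sec peak, 15 GBytes/sec measured bandwidth); the two operational intensities are made-up inputs, one on each side of the ridge point.

    #include <stdio.h>

    /* Roofline bound: attainable GFlops/sec for a kernel, given the machine's
       peak compute rate, its peak memory bandwidth, and the kernel's
       operational intensity (Flops per DRAM byte). */
    static double attainable_gflops(double peak_gflops,
                                    double peak_gbytes_per_sec,
                                    double operational_intensity) {
        double memory_bound = peak_gbytes_per_sec * operational_intensity;
        return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
    }

    int main(void) {
        const double peak = 17.6;  /* Opteron X2 peak GFlops/sec (from the text) */
        const double bw   = 15.0;  /* measured DRAM bandwidth, GBytes/sec */
        /* Two hypothetical kernels: one left of the ridge point, one right of it. */
        printf("OI 0.5: %.1f GFlops/sec\n", attainable_gflops(peak, bw, 0.5));
        printf("OI 4.0: %.1f GFlops/sec\n", attainable_gflops(peak, bw, 4.0));
        return 0;
    }

At an intensity of 0.5 the bound is 7.5 GFlops/sec (memory bound); at 4.0 the kernel is limited by the 17.6 GFlops/sec flat roof.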
For a given kernel, we can find a point on the X-axis based on its operational intensity. If we draw a (pink dashed) vertical line through that point, the performance of the kernel on that computer must lie somewhere along that line.

The horizontal and diagonal lines give this bound model its name. The Roofline sets an upper bound on performance of a kernel depending on its operational intensity. If we think of operational intensity as a column that hits the roof, either it hits the flat part of the roof, which means performance is compute bound, or it hits the slanted part of the roof, which means performance is ultimately memory bound. In Figure 1a, a kernel with operational intensity 2 is compute bound and a kernel with operational intensity 1 is memory bound. Given a Roofline, you can use it repeatedly on different kernels, since the Roofline doesn't vary.

Note that the ridge point, where the diagonal and horizontal roofs meet, offers an insight into the overall performance of the computer. The x-coordinate of the ridge point is the minimum operational intensity required to achieve maximum performance. If the ridge point is far to the right, then only kernels with very high operational intensity can achieve the maximum performance of that computer. If it is far to the left, then almost any kernel can potentially hit the maximum performance. As we shall see (Section 6.3.5), the ridge point suggests the level of difficulty for programmers and compiler writers to achieve peak performance.

To illustrate, let's compare the Opteron X2 with two cores in Figure 1a to its successor, the Opteron X4 with four cores. To simplify board design, they share the same socket. Hence, they have the same DRAM channels and can thus have the same peak memory bandwidth, although the prefetching is better in the X4. In addition to doubling the number of cores, the X4 also has twice the peak floating-point performance per core: X4 cores can issue two floating-point SSE2 instructions per clock cycle while X2 cores can issue two every other clock. As the clock rate is slightly faster—2.2 GHz for X2 versus 2.3 GHz for X4—the X4 has slightly more than four times the peak floating-point performance of the X2 with the same memory bandwidth.

Figure 1b compares the Roofline models for both systems. As expected, the ridge point shifts right from 1.0 in the Opteron X2 to 4.4 in the Opteron X4. Hence, to see a performance gain in the X4, kernels need an operational intensity higher than 1.

Figure 1. Roofline Model for (a) AMD Opteron X2 on left and (b) Opteron X2 vs. Opteron X4 on right.

4. ADDING CEILINGS TO THE MODEL

The Roofline model gives an upper bound to performance. Suppose your program is performing far below its Roofline. What optimizations should you perform, and in what order? Another advantage of bound and bottleneck analysis is [20]

“a number of alternatives can be treated together, with a single bounding analysis providing useful information about them all.”

We leverage this insight to add multiple ceilings to the Roofline model to guide which optimizations to perform, which are similar to the guidelines that loop balance gives the compiler. We can think of each of these optimizations as a “performance ceiling” below the appropriate Roofline, meaning that you cannot break through a ceiling without performing the associated optimization. For example, to reduce computational bottlenecks on the Opteron X2, two optimizations can help almost any kernel:

1. Improve instruction level parallelism (ILP) and apply SIMD. For superscalar architectures, the highest performance comes when fetching, executing, and committing the maximum number of instructions per clock cycle. The goal here is to improve the code from the compiler to increase ILP. The highest performance comes from completely covering the functional unit latency. One way is by unrolling loops. For the x86-based architectures, another way is using floating-point SIMD instructions whenever possible, since an SIMD instruction operates on pairs of adjacent operands (see the sketch after this list).

2. Balance floating-point operation mix. The best performance requires that a significant fraction of the instruction mix be floating-point operations (see Section 7). Peak floating-point performance typically also requires an equal number of simultaneous floating-point additions and multiplications, since many computers have multiply-add instructions or because they have an equal number of adders and multipliers.
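As an illustration of optimization 1, the sketch below unrolls a multiply-add loop and expresses it with SSE2 intrinsics so that each instruction works on a pair of adjacent double-precision operands. The function and its assumptions (aligned arrays, n a multiple of 4) are ours, not code from the paper; a real kernel would also need remainder handling.

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* y[i] += a * x[i], unrolled by 4 and expressed with SSE2 so each
       multiply/add works on a pair of adjacent doubles.
       Assumes n is a multiple of 4 and x, y are 16-byte aligned. */
    void scaled_add_sse2(double *y, const double *x, double a, int n) {
        __m128d va = _mm_set1_pd(a);
        for (int i = 0; i < n; i += 4) {
            __m128d x0 = _mm_load_pd(&x[i]);
            __m128d x1 = _mm_load_pd(&x[i + 2]);
            __m128d y0 = _mm_load_pd(&y[i]);
            __m128d y1 = _mm_load_pd(&y[i + 2]);
            /* two independent multiply-add chains help cover FP unit latency */
            y0 = _mm_add_pd(y0, _mm_mul_pd(va, x0));
            y1 = _mm_add_pd(y1, _mm_mul_pd(va, x1));
            _mm_store_pd(&y[i], y0);
            _mm_store_pd(&y[i + 2], y1);
        }
    }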
To reduce memory bottlenecks, three optimizations can help:

3. Restructure loops for unit stride accesses. Optimizing for unit stride memory accesses engages hardware prefetching, which significantly increases memory bandwidth.

4. Ensure memory affinity. Most microprocessors today include a memory controller on the same chip with the processors. If the system has two multicore chips, then some addresses go to the DRAM local to one multicore chip and the rest must go over a chip interconnect to access the DRAM that is local to another chip. This latter case lowers performance. This optimization allocates data and the threads tasked to that data to the same memory-processor pair, so that the processors rarely have to access the memory attached to other chips.

5. Use software prefetching. Usually the highest performance requires keeping many memory operations in flight, which is easier to do via prefetching rather than waiting until the data is actually requested by the program. On some computers, software prefetching delivers more bandwidth than hardware prefetching alone (see the sketch after this list).
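A minimal sketch of optimization 5 using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 64 elements is an assumed tuning knob, not a value from the paper.

    /* Stream through a large array while asking the hardware to start
       fetching data that will be needed well ahead of its use.
       PREFETCH_DIST is a tunable; 64 doubles = 512 bytes here. */
    #define PREFETCH_DIST 64

    double sum_with_prefetch(const double *x, long n) {
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&x[i + PREFETCH_DIST], 0 /* read */, 0 /* low locality */);
            sum += x[i];
        }
        return sum;
    }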
Like the computational Roofline, the computational ceilings can come from an optimization manual [2], although it's easy to imagine collecting the necessary parameters from simple microbenchmarks. The memory ceilings require running experiments on each computer to determine the gap between them (see Appendix A.1). The good news is that, like the Roofline, the ceilings only need be measured once per multicore computer.

Figure 2 adds ceilings to the Roofline model in Figure 1a: Figure 2a shows the computational ceilings and Figure 2b the memory bandwidth ceilings. Although the higher ceilings are not labeled with lower optimizations, they are implied: to break through a ceiling, you need to have already broken through all the ones below. Figure 2a shows the computational “ceilings” of 8.8 GFlops/sec if the floating-point operation mix is imbalanced and 2.2 GFlops/sec if the optimizations to increase ILP or SIMD are also missing. Figure 2b shows the memory bandwidth ceilings of 11 GBytes/sec without software prefetching, 4.8 GBytes/sec without memory affinity optimizations as well, and 2.7 GBytes/sec with only unit stride optimizations.

Figure 2c combines the other two figures into a single graph. The operational intensity of a kernel determines the optimization region, and thus which optimizations to try. The middle of Figure 2c shows that the computational optimizations and the memory bandwidth optimizations overlap. The colors were picked to highlight that overlap. For example, Kernel 2 falls in the blue trapezoid on the right, which suggests working only on the computational optimizations. If a kernel fell in the yellow triangle on the lower left, the model would suggest trying just memory optimizations. Kernel 1 falls in the green (= yellow + blue) parallelogram in the middle, which suggests trying both types of optimizations. Note that the Kernel 1 vertical line falls below the floating-point imbalance optimization, so optimization 2 may be skipped.

The ceilings of the Roofline model suggest which optimizations to perform. The height of the gap between a ceiling and the next higher one is the potential reward for trying that optimization. Thus, Figure 2 suggests that optimization 1, which improves ILP/SIMD, has a large potential benefit for improving computation on that computer, and optimization 4, which improves memory affinity, has a large potential benefit for improving memory bandwidth on that computer.

The order of the ceilings suggests the optimization order, so we rank the ceilings from bottom to top: those most likely to be realized by a compiler or with little effort by a programmer are at the bottom, and those that are difficult for a programmer to implement or inherently lacking in a kernel are at the top. The one quirk is floating-point balance, since the actual mix is dependent on the kernel. For most kernels, achieving parity between multiplies and additions is very difficult, but for a few, parity is natural. One example is sparse matrix-vector multiplication. For that domain, we would place floating-point mix as the lowest ceiling, since it is inherent. Like the 3Cs model, as long as the Roofline model delivers on insights, it need not be perfect.
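The following sketch encodes the Opteron X2 ceilings quoted above (2.2 and 8.8 GFlops/sec below the 17.6 GFlops/sec roof; 2.7, 4.8, and 11 GBytes/sec below the 15 GBytes/sec roof) and reports the next ceiling a kernel would have to break. The ceiling labels and the decision logic are our paraphrase of Figure 2, not code from the paper.

    #include <stdio.h>

    /* Ceiling ladders for the Opteron X2 example in the text, bottom to top. */
    static const double compute_ceil[] = { 2.2, 8.8, 17.6 };  /* GFlops/sec */
    static const char  *compute_name[] = { "TLP only (no ILP/SIMD)",
                                           "ILP/SIMD but imbalanced FP mix",
                                           "peak (balanced multiply/add)" };
    static const double memory_ceil[]  = { 2.7, 4.8, 11.0, 15.0 };  /* GBytes/sec */
    static const char  *memory_name[]  = { "unit-stride accesses only",
                                           "no memory affinity, no sw prefetch",
                                           "no software prefetching",
                                           "peak stream bandwidth" };

    /* Report the next ceiling above the achieved rate; the gap up to it is the
       potential reward of the associated optimization. */
    void next_ceiling(double oi, double achieved_gflops) {
        if (oi * memory_ceil[3] < compute_ceil[2]) {        /* left of the ridge point */
            double achieved_gbs = achieved_gflops / oi;     /* bytes/s = flops/s / (flops/byte) */
            for (int i = 0; i < 4; i++)
                if (achieved_gbs < memory_ceil[i]) {
                    printf("next memory ceiling: %s (%.1f GB/s)\n",
                           memory_name[i], memory_ceil[i]);
                    return;
                }
        } else {
            for (int i = 0; i < 3; i++)
                if (achieved_gflops < compute_ceil[i]) {
                    printf("next compute ceiling: %s (%.1f GFlops/s)\n",
                           compute_name[i], compute_ceil[i]);
                    return;
                }
        }
        printf("already at the Roofline for this operational intensity\n");
    }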
5. Tying the 3Cs to Operational Intensity

Operational intensity tells us which ceilings to look at. Thus far, we have been assuming that the operational intensity is fixed, but that is not really the case. For example, there are kernels where the operational intensity increases with problem size, such as for Dense Matrix and FFT problems.

Clearly, caches affect the number of accesses that go to memory, so optimizations that improve cache performance increase operational intensity. Hence, we can connect the 3Cs model to the Roofline model. Compulsory misses set the minimum memory traffic and hence the highest possible operational intensity. Memory traffic from conflict and capacity misses can considerably lower the operational intensity of a kernel, so we should try to eliminate such misses.

One example is padding arrays to change cache line addressing. A second example is that some computers have a no-allocate store instruction, so stores go directly to memory and do not affect the caches. This optimization prevents loading a cache block with data to be overwritten, thereby reducing memory traffic. It also prevents displacing useful items in the cache with data that will not be read, thereby saving conflict misses.

This shift right of operational intensity could put a kernel in a different optimization region. The advice is generally to improve operational intensity of the kernel before other optimizations.
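As a sketch of the no-allocate store idea on x86, the copy below uses the SSE2 streaming-store intrinsic so that write-only data bypasses the cache; the function name and alignment assumptions are ours, and whether this helps depends on whether the destination is re-read soon.

    #include <emmintrin.h>  /* _mm_load_pd, _mm_stream_pd (SSE2) */

    /* Copy a block using non-temporal ("no-allocate") stores: destination
       lines are written toward DRAM without first being read into the cache,
       cutting the write-allocate traffic and avoiding displacement of useful
       cache contents.  Assumes dst/src are 16-byte aligned and n is even. */
    void copy_streaming(double *dst, const double *src, long n) {
        for (long i = 0; i < n; i += 2)
            _mm_stream_pd(&dst[i], _mm_load_pd(&src[i]));
        _mm_sfence();  /* make the streamed stores globally visible */
    }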
6. DEMONSTRATION OF THE MODEL

To demonstrate the utility of the model, we develop Roofline models for 4 recent multicore computers and then optimize 4 floating-point kernels. We then show that the ceilings and rooflines bound the achieved results for all computers and kernels.

6.1 Four Diverse Multicore Computers

Given the lack of conventional wisdom for multicore architecture, it's not surprising that there are as many different designs as there are chips. Table 1 lists the key characteristics of the four multicore computers of this section, which are all dual-socket systems.

The Intel Xeon uses relatively sophisticated processors, capable of executing two SIMD instructions per clock cycle that can each perform two double-precision floating-point operations. It is the only one of the four machines with a front side bus connecting to a common north bridge chip and memory controller. The other three have the memory controller on chip.

The Opteron X4 also uses sophisticated cores with high peak floating-point performance, but it is the only computer of the four with on-chip L3 caches. The two sockets communicate over separate, dedicated Hypertransport links, which makes it possible to build a “glueless” multi-chip system.

The Sun UltraSPARC T2+ uses relatively simple processors at a modest clock rate compared to the others, which allows it to have twice as many cores per chip. It is also highly multithreaded, with eight hardware-supported threads per core. It has the highest memory bandwidth of the four, for each chip has two dual-channel memory controllers that can drive four sets of DDR2/FBDIMMs.

The clock rate of the IBM Cell QS20 is the highest of the four multicores at 3.2 GHz. It is also the most unusual. It is a heterogeneous design, with a relatively simple PowerPC core and with eight SPEs (Synergistic Processing Elements) that have their own unique SIMD-style instruction set. Each SPE also has its own local memory instead of a cache. An SPE must transfer data from main memory into the local memory to operate on it and then back to main memory when it is completed. It uses DMA, which has some similarity to software prefetching. The lack of caches means porting programs to Cell is more challenging.

Table 1. Characteristics of four recent multicores.
MPU Type       | Intel Xeon (Clovertown, e5345) | AMD Opteron X4 (Barcelona, 2356) | Sun UltraSPARC T2+ (Niagara 2, 5120) | IBM Cell QS20
ISA            | x86/64 | x86/64 | SPARC | Cell SPEs
Total Threads  | 8 | 8 | 128 | 16
Total Cores    | 8 | 8 | 16 | 16
Total Sockets  | 2 | 2 | 2 | 2
GHz            | 2.33 | 2.30 | 1.17 | 3.20
Peak GFlop/s   | 75 | 74 | 19 | 29
Peak DRAM GB/s | 21.3 read, 10.6 write | 2 x 10.6 | 2 x 21.3 read, 2 x 10.6 write | 2 x 25.6
Stream GB/s    | 5.9 | 16.6 | 26.0 | 47.0
DRAM Type      | FBDIMM | DDR2 | FBDIMM | XDR

6.2 Four Diverse Floating-Point Kernels

Rather than pick programs from some standard parallel benchmark suite such as Parsec [5] or Splash-2 [30], we were inspired by the work of Phil Colella [10]. This expert in scientific computing has identified seven numerical methods that he believes will be important for science and engineering for at least the next decade. Because he picked seven, they have become known as the Seven Dwarfs. The dwarfs are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of implementations. The widely read “Berkeley View” report [4] found that if the data types were changed from floating point to integer, those same dwarfs could also be found in many other programs. Note that the claim is not that the dwarfs are easy to parallelize. The claim is that they will be important to computing in most current and future applications, so designers are advised to make sure they run well on systems that they create, whether or not their creations are parallel.

One advantage of using these higher-level descriptions of programs is that we are not tied to code that may have been written originally to optimize an old computer to evaluate future systems. Another advantage of the restricted number is that we can create autotuners for each kernel that would search the space of alternatives to produce the best code for that multicore computer, including extensive cache optimizations [13].
With that background, Table 2 lists the four kernels from the dwarfs that we use to demonstrate the Roofline Model on the four multicore computers of Table 1. The auto-tuning for this section is from [12], [25] and [26].

For these kernels, there is sufficient parallelism to utilize all the cores and threads and to keep them load balanced. (Appendix A.2 describes how to handle cases when load is not balanced.)

Table 2. Characteristics of four FP Kernels.
Name         | Oper. Inten. | Description
SpMV [26]    | 0.17 to 0.25 | Sparse Matrix-Vector multiply: y = A*x where A is a sparse matrix and x, y are dense vectors; multiplies and adds equal.
LBMHD [25]   | 0.70 to 1.07 | Lattice-Boltzmann Magnetohydrodynamics is a structured grid code with a series of time steps.
Stencil [12] | 0.33 to 0.50 | A multigrid kernel that updates 7 nearby points in a 3-D stencil for a 256^3 problem.
3-D FFT      | 1.09 to 1.64 | Three-Dimensional Fast Fourier Transform (2 sizes: 128^3 and 512^3).

6.3 Roofline Models and Results

Figure 3 shows the Roofline models for Xeon, X4, and Cell. The pink vertical dashed lines show the operational intensity and the red X marks performance achieved for that kernel. As mentioned above, adds and multiplies are naturally equal in SpMV, so balance is easy for this kernel but hard for the others. Hence, there are two graphs per computer in Figure 3: the left graphs have multiply-add balance as the top ceiling for LBMHD, Stencil, and 3-D FFT, and those on the right have multiply-add as the bottom ceiling for SpMV. Since the T2+ does not have a fused multiply-add instruction nor can it simultaneously issue multiplies and adds, Figure 4 shows a single roofline for the four kernels for T2+ without the multiply-add balance ceiling.

The Intel Xeon has the highest peak double precision performance of the four multicores. However, the Roofline model in Figure 3a shows that this can be achieved only with operational intensities of at least 6.7; stated alternatively, balance requires 55 floating-point operations for every double precision operand (8 bytes) going to DRAM. This high ratio is due in part to the limitation of the front side bus, which also carries coherency traffic that can consume half the bus bandwidth. Intel includes a snoop filter to prevent unnecessary coherency traffic on the bus. If the working set is small enough for the hardware to filter, the snoop filter nearly doubles the delivered memory bandwidth.

The Opteron X4 has a memory controller on chip, its own path to 667 MHz DDR2 DRAM, and separate paths for coherency. Figure 3 shows that the ridge point in the Roofline model is to the left of the Xeon, at an operational intensity of 4.4 Flops per byte. The Sun T2+ has the highest memory bandwidth, so the ridge point is an exceptionally low operational intensity of just 0.33 Flops per byte. It keeps multiple memory transfers in flight by using many threads. The IBM Cell ridge point of operational intensity is 0.65.

6.3.1 Sparse Matrix-Vector Multiplication

The first example kernel of the sparse matrix computational dwarf is Sparse Matrix-Vector multiply (SpMV). The computation is y = A*x where A is a sparse matrix and x and y are dense vectors. SpMV is popular in scientific computing, economic modeling, and information retrieval. Alas, conventional implementations often run at less than 10% of peak floating-point performance in uniprocessors. One reason is the irregular accesses to memory, which you might expect from sparse matrices. The operational intensity varies from 0.17 before a register blocking optimization to 0.25 Flops per byte afterwards [29]. (See Appendix A.1.)

Given that the operational intensity of SpMV was below the ridge point of all four multicores in Figure 3, most of the optimizations involved the memory system. Table 3 summarizes the optimizations used by SpMV and the rest of the kernels. Many are associated with the ceilings in Figure 3, and the height of the ceilings suggests the potential benefit of these optimizations.
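For reference, a plain compressed sparse row (CSR) SpMV in C (generic names, not the autotuned code of [26]). Each nonzero performs one multiply and one add and streams in an 8-byte value plus a 4-byte column index, which is roughly where an operational intensity in the 0.17 to 0.25 Flops per byte range comes from.

    /* y = A*x for a sparse matrix A stored in compressed sparse row form.
       Each nonzero performs one multiply and one add (multiplies and adds
       are naturally balanced) and streams in one 8-byte value plus one
       4-byte column index. */
    void spmv_csr(int nrows,
                  const int *row_ptr,   /* nrows+1 entries */
                  const int *col_idx,   /* one per nonzero */
                  const double *val,    /* one per nonzero */
                  const double *x,
                  double *y) {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }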
6.3.2 Lattice-Boltzmann Magnetohydrodynamics

Like SpMV, LBMHD tends to get a small fraction of peak performance on uniprocessors because of the complexity of the data structures and the irregularity of memory access patterns. The Flops-to-byte ratio is 0.70 versus 0.25 or less in SpMV. By using the no-allocate store optimization, the LBMHD intensity rises to 1.07. Both x86 multicores offer this cache optimization, and Cell does not have this problem since it uses DMA. Hence, T2+ is the only one with the lower intensity of 0.70.

Figures 3 and 4 show that the operational intensity of LBMHD is high enough that both computational and memory bandwidth optimizations make sense on all multicores but the T2+, whose Roofline ridge point is below that of LBMHD. The T2+ reaches its performance ceiling using only the computational optimizations.

6.3.3 Stencil

In general, a stencil on a structured grid is defined as a function that updates a point based on the values of its neighbors. The stencil structure remains constant as it moves from one point in space to the next. For this work, we use the stencil derived from the explicit heat equation PDE on a uniform 256^3 3-D grid [12]. The neighbors for this stencil are the nearest 6 points along each axis as well as the center point itself. This stencil will do 8 floating-point operations for every 24 bytes of compulsory memory traffic on write-allocate architectures, yielding an operational intensity of 0.33.
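A minimal 7-point stencil sweep in C with our own array and coefficient names (not the autotuned code of [12]): the center point plus its six axis neighbors yield the 8 floating-point operations per updated point mentioned above.

    /* One sweep of a 7-point stencil over the interior of an n^3 grid.
       Each update reads the center point and its six axis neighbors and
       performs 8 floating-point operations (6 adds, 2 multiplies). */
    void stencil7(int n, double alpha, double beta,
                  const double *in, double *out) {
    #define IDX(i, j, k) ((size_t)(i) * n * n + (size_t)(j) * n + (k))
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                for (int k = 1; k < n - 1; k++)
                    out[IDX(i, j, k)] =
                        alpha * in[IDX(i, j, k)] +
                        beta  * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                                 in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                                 in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
    #undef IDX
    }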
6.3.4 3-D FFT

This fast Fourier transform is the classic divide and conquer algorithm that recursively breaks down a discrete Fourier transform into many smaller ones. The FFT is ubiquitous in many domains, such as image processing and data compression. An efficient approach for 3-D FFT is to perform 1-D transforms along each dimension to maintain unit-stride accesses. We computed the 1-D FFTs on Xeon, X4, and T2+ using an autotuned library (FFTW) [15]. For Cell, we implemented a radix-2 FFT.
Figure 3. Roofline Model for Intel Xeon, AMD Opteron X4, and IBM Cell (see Table 1).
FFT differs from the three kernels above in that its operational intensity is a function of problem size. For the 128^3- and 512^3-point transforms we examine, the operational intensities are 1.09 and 1.41, respectively. (Cell's 1 GB main memory is too small to hold 512^3 points, so we estimate this result.) On Xeon and X4, an entire 128x128 plane fits in cache, increasing temporal locality and improving the intensity to 1.64 for the 128^3-point transform.

good performing code from the compiler and then use as many threads as possible. The downside was that the L2 cache was only 16-way set associative, which can lead to conflict misses when 64 threads access the cache, as it did for Stencil.

In contrast, the computer with the highest ridge point had the lowest unoptimized performance. The Intel Xeon was difficult because it was hard to understand the memory behavior of the dual front side buses, hard to understand how hardware prefetching worked, and because of the difficulty of getting good SIMD code from the compiler. The C code for it and for the Opteron X4 are liberally sprinkled with intrinsic statements involving SIMD instructions to get good performance. With a ridge point close to the Xeon, the Opteron X4 was about as much effort, since the Opteron X4 benefited from the most types of optimizations. However, the memory behavior of the Opteron X4 was easier to understand than that of the Xeon.

The IBM Cell, with a ridge point almost as low as the Sun T2+, provided two types of challenges. First, it was awkward to compile for the SIMD instructions of Cell's SPE, so at times we needed to help the compiler by inserting intrinsic statements with assembly language instructions into the C code. This comment reflects the immaturity of the IBM compiler as well as the difficulty of compiling for these SIMD instructions. Second, the memory system was more challenging. Since each SPE has local memory in a separate address space, we could not simply port the code and start running on the SPE. We needed to change the program to issue DMA commands to transfer data back and forth between local store and memory. The good news is that DMA played the role of software prefetch in caches. DMA for a local store is easier to program, to achieve good memory performance, and to overlap with computation than prefetching to caches.

Figure 4. Roofline Model for Sun UltraSPARC T2+.

Table 3. Kernel Optimizations [12], [25], [26].
Memory Affinity: Reduce accesses to DRAM memory attached to the other socket.
Long unit-stride accesses: Change loop structures to generate long unit-stride accesses to engage the prefetchers. Also reduces TLB misses.
Software Prefetching: To get the most out of the memory systems, both software and hardware prefetching were used.
Reduce conflict misses: Pad arrays to improve cache-hit rates.
Unroll and Reorder Loops: To expose sufficient parallelism and improve cache utilization, unroll and reorder loops to group statements with similar addresses; improves code quality, reduces register pressure, and facilitates SIMD.
"SIMD-ize" the code: The x86 compilers didn't generate good SSE code, so we made a code generator to produce SSE intrinsics.
Compress Data Structures (SpMV only): Since bandwidth limits performance, use smaller data structures: 16-bit vs. 32-bit index and smaller representations of non-zero subblocks [24].

6.3.6 Summary of Roofline Model Demonstration

To demonstrate the utility of the Roofline Model, Table 4 shows upper and lower ceilings and the GFlops/s and GByte/s per kernel-computer pair; recall that operational intensity is the ratio between the two rates. The ceilings listed are the ceilings that sandwich the actual performance. All 16 cases validate this bound and bottleneck model since the upper and lower ceilings of the Roofline bound performance and the kernels were optimized as the lower ceilings suggest. The metric that limits performance is in bold: 15 of 16 ceilings are memory bound for Xeon and X4, while it's almost evenly split for T2+ and Cell. For FFT, interestingly, the surrounding ceilings are memory bound for Xeon and X4 but compute bound for T2+ and Cell.
Table 4. Achieved Performance and Nearest Roofline Ceilings, with Metric Limiting Performance in Bold (3-D FFT is 128^3).
Kernel | Upper Ceiling (Type, Name, Value) | Achieved (Compute, Memory, O.I.) | Lower Ceiling (Type, Name, Value)

Intel Xeon
SpMV    | Memory, Stream BW, 11.2 GByte/s     | 2.8 GFlop/s, 11.1 GB/s, 0.25  | Memory, Snoop filter, 5.9 GByte/s
LBMHD   | Memory, Snoop filter, 5.9 GByte/s   | 5.6 GFlop/s, 5.3 GB/s, 1.07   | Memory, (none), 0.0 GByte/s
Stencil | Memory, Snoop filter, 5.9 GByte/s   | 2.5 GFlop/s, 5.1 GB/s, 0.50   | Memory, (none), 0.0 GByte/s
3-D FFT | Memory, Snoop filter, 5.9 GByte/s   | 9.7 GFlop/s, 5.9 GB/s, 1.64   | Compute, TLP only, 6.2 GFlop/s

AMD Opteron X4
SpMV    | Memory, Stream BW, 17.6 GByte/s     | 4.2 GFlop/s, 16.8 GB/s, 0.25  | Memory, Copy BW, 13.9 GByte/s
LBMHD   | Memory, Copy BW, 13.9 GByte/s       | 11.4 GFlop/s, 10.7 GB/s, 1.07 | Memory, No Affinity, 7.0 GByte/s
Stencil | Memory, Stream BW, 17.6 GByte/s     | 8.0 GFlop/s, 16.0 GB/s, 0.50  | Memory, Copy BW, 13.9 GByte/s
3-D FFT | Memory, Copy BW, 13.9 GByte/s       | 14.0 GFlop/s, 8.6 GB/s, 1.64  | Memory, No Affinity, 7.0 GByte/s

Sun T2+
SpMV    | Memory, Stream BW, 36.7 GByte/s     | 7.3 GFlop/s, 29.1 GB/s, 0.25  | Memory, No Affinity, 19.8 GByte/s
LBMHD   | Memory, No Affinity, 19.8 GByte/s   | 10.5 GFlop/s, 15.0 GB/s, 0.70 | Compute, 25% issued FP, 9.3 GFlop/s
Stencil | Compute, 25% issued FP, 9.3 GFlop/s | 6.8 GFlop/s, 20.3 GB/s, 0.33  | Memory, No Affinity, 19.8 GByte/s
3-D FFT | Compute, Peak DP, 19.8 GFlop/s      | 9.2 GFlop/s, 10.0 GB/s, 1.09  | Compute, 25% issued FP, 9.3 GFlop/s

IBM Cell
SpMV    | Memory, Stream BW, 47.6 GByte/s     | 11.8 GFlop/s, 47.1 GB/s, 0.25 | Memory, FMA, 7.3 GFlop/s
LBMHD   | Memory, No Affinity, 23.8 GByte/s   | 16.7 GFlop/s, 15.6 GB/s, 1.07 | Memory, Without FMA, 14.6 GFlop/s
Stencil | Compute, Without FMA, 14.6 GFlop/s  | 14.2 GFlop/s, 30.2 GB/s, 0.47 | Memory, No Affinity, 23.8 GByte/s
3-D FFT | Compute, Peak DP, 29.3 GFlop/s      | 15.7 GFlop/s, 14.4 GB/s, 1.09 | Compute, SIMD, 14.6 GFlop/s
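A small sketch of the bookkeeping behind Table 4: recover the operational intensity from the two measured rates and check that the limiting rate falls between the lower and upper ceilings. The numbers are copied from the Intel Xeon SpMV row; the code itself is ours.

    #include <stdio.h>

    /* Validate one kernel-computer pair the way Table 4 does. */
    int main(void) {
        /* Intel Xeon running SpMV (first row of Table 4). */
        double gflops = 2.8, gbytes = 11.1;          /* achieved rates      */
        double lower_ceil = 5.9, upper_ceil = 11.2;  /* GBytes/sec ceilings */
        double oi = gflops / gbytes;                 /* Flops per byte      */
        int ok = (gbytes >= lower_ceil) && (gbytes <= upper_ceil);
        printf("operational intensity = %.2f, within ceilings: %s\n",
               oi, ok ? "yes" : "no");
        return 0;
    }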
Section 2 shows that the memory bandwidth measures of the computer do include prefetching and any other optimization that can improve memory performance, such as blocking. Similarly, some of the optimizations in Table 3 explicitly involve memory. Moreover, Section 5 demonstrates their effect on increasing operational intensity by reducing capacity and conflict misses.

Fallacy: Doubling cache size will increase operational intensity.

Autotuning three of the four kernels gets very close to the compulsory memory traffic; in fact, the resultant working set is sometimes only a small fraction of the cache. Increasing cache size helps only with capacity misses and possibly conflict misses, so a larger cache can have no effect on the operational intensity for those three kernels. For 128^3 3-D FFT, however, a large cache can capture a whole plane of a 3-D cube, which improves operational intensity by reducing capacity and conflict misses.

Fallacy: The model doesn't account for the long memory latency.

The ceilings for no software prefetching in Figures 3 and 4 are at lower memory bandwidth precisely because they cannot hide the long memory latency.

Fallacy: The model ignores integer units in floating-point programs, which can limit performance.

For the examples in this paper, the amount of integer code and the integer performance can affect performance. For example, the Sun UltraSPARC T2+ fetches two instructions per core per clock cycle, and it doesn't have the SIMD instructions of the x86 that can operate on two double-precision floating-point operands at a time. Relative to others, T2+ executes more integer instructions and executes them at a lower rate, which hurts overall performance.

Fallacy: The model has nothing to do with multicore.

Little's Law [21][20][17] dictates that to really push the limits of the memory system, considerable concurrency is necessary. That concurrency is more easily satisfied in a multicore than in a uniprocessor. While the bandwidth orientation of the Roofline model certainly works for uniprocessors, it is even more helpful for multicores.
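A back-of-the-envelope Little's Law calculation: the 15 GBytes/sec bandwidth is the figure used earlier in the paper, while the 100 ns DRAM latency and 64-byte line size are assumed round numbers.

    #include <stdio.h>

    /* Little's Law: concurrency = bandwidth x latency.  How much data must
       be in flight to sustain a given DRAM bandwidth? */
    int main(void) {
        const double bandwidth  = 15e9;    /* bytes/sec (figure from the text) */
        const double latency    = 100e-9;  /* assumed DRAM latency, seconds    */
        const double line_bytes = 64.0;    /* assumed cache-line size          */
        double bytes_in_flight  = bandwidth * latency;
        printf("%.0f bytes in flight, i.e. about %.0f outstanding cache lines\n",
               bytes_in_flight, bytes_in_flight / line_bytes);
        return 0;
    }

Roughly two dozen cache misses must be outstanding at all times, which a single core struggles to supply but many cores and threads can.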
Fallacy: You need to recalculate the model for every kernel.

The Roofline needs to be calculated for given performance metrics and computer just once, and can then guide the design for any program for which that metric is the critical performance metric. The examples in this paper used floating-point operations and memory traffic. The ceilings are measured once, but they can be reordered depending on whether the multiplies and adds are naturally balanced or not in the kernel (see Section 4).

Note that the heights of the ceilings in this paper document the maximum potential gain of a code performing this optimization. An interesting future direction is to use performance counters to adjust the height of the ceilings and the order of the ceilings for a particular kernel to show the actual benefits of each optimization and the recommended order to try them (see Appendix A.3).

Fallacy: The model is limited to easily optimized kernels that never hit in the cache.

First, these kernels do hit in the cache. For example, the cache-hit rates of our three multicores with on-chip caches are at least 94% for Stencil and 98% for FFT. Second, if the dwarfs were easy to optimize, that would bode well for the future of multicores. Our experience, however, is that it was not easy to create the fastest version of these numerical methods on the divergent multicore architectures presented here. Indeed, three of the results were considered significant enough to be accepted for publication at major conferences [12][25][26].

Fallacy: The model is limited to floating-point programs.

Our focus in this paper has also been on floating-point programs, so the two axes of the model are floating-point operations per second and the floating-point operational intensity of accesses to main memory. However, we believe the Roofline model can work for other kernels where the performance is a function of different performance metrics.

A concrete example is the transpose phase of 3-D FFT, which does no floating-point operations at all. Figure 5 shows a Roofline model for just this phase on Cell, with exchanges replacing Flops in the model. One exchange involves reading and writing 16 bytes, so its operational intensity is 1/32. Despite the computational metric being memory exchanges, note that there is still a computational horizontal Roofline, since local stores and caches could affect the number of exchanges that go to DRAM.
Figure 5. Roofline for transpose phase of 3-D FFT for the Cell.

Fallacy: The Roofline model must use DRAM bandwidth.

If the working set fits in the L2 cache, the diagonal Roofline could be L2 cache bandwidth instead of DRAM bandwidth, and the operational intensity on the X-axis would be based on Flops per L2 cache byte accessed. The diagonal memory performance line would move up, and the ridge point would surely move to the left.

For example, Jike Chong ported two financial PDE solvers to four other multicore computers: the Intel Penryn and Larrabee and the NVIDIA G80 and GTX280 [9]. He used the Roofline model to keep track of the platforms' peak arithmetic throughput and L1, L2, and DRAM bandwidths. By analyzing an algorithm's working set and operational intensity, he was able to use the Roofline model to quickly estimate the needs for algorithmic improvements. Specifically, for the option-pricing problem with an implicit PDE solver, the working set is small enough to fit into L1 and the L1 bandwidth is sufficient to support peak arithmetic throughput, so the Roofline model indicates that no optimization is necessary. For option pricing with an explicit PDE formulation, the working set is too large to fit into cache, and the Roofline model helps to indicate the extent to which cache blocking is necessary to extract peak arithmetic performance.

8. CONCLUSIONS

The sea change from sequential computing to parallel computing is increasing the diversity of computers that programmers must confront in making correct, efficient, scalable, and portable software [4]. This paper describes a simple and visual model to help see which systems would be a good match to important kernels, or conversely, to see how to change kernel code or hardware to run desired kernels well. For floating-point kernels that do not fit completely in caches, we showed how operational intensity—the number of floating-point operations per byte transferred from DRAM—is an important parameter for both the kernels and the multicore computers.

We applied the model to four kernels from the seven dwarfs [10][4] to four recent multicore designs: the AMD Opteron X4, Intel Xeon, IBM Cell, and Sun T2+. The ridge point—the minimum operational intensity to achieve maximum performance—proved to be a better predictor of performance than clock rate or peak performance. Cell offered the highest performance on these kernels, but T2+ was the easiest computer on which to achieve its highest performance. One reason is that the ridge point of the Roofline model for T2+ was the lowest.

Just the graphical Roofline offers insights into the difficulty of achieving the peak performance of a computer, as it makes obvious when a computer is imbalanced. The operational ridge points for the two x86 computers were 4.4 and 6.7—meaning 35 to 55 Flops per 8-byte operand that accesses DRAM—yet the operational intensities for the 16 combinations of kernels and computers in Table 4 ranged from 0.25 to just 1.64, with a median of 0.60. Architects should keep the ridge point in mind if they want programs to reach peak performance on their new designs.

We measured the roofline and ceilings using microbenchmarks, but we could have used performance counters (see Appendix A.1 and A.3). In fact, we believe there may be a synergistic relationship between performance counters and the Roofline model. The requirements for automatic creation of a Roofline model could guide the designer as to which metrics should be collected when faced with literally hundreds of candidates but a limited hardware budget [6].

We believe Roofline models can offer insights to other types of multicore systems such as vector processors and GPUs (Graphical Processing Units); other kernels such as sort and ray tracing; other computational metrics such as pair-wise sorts per second and frames per second; and other traffic metrics such as L3 cache bandwidth and I/O bandwidth. Alas, there are many more opportunities than we can pursue. Thus, we invite others to join us in the exploration of the effectiveness of Roofline models.

9. ACKNOWLEDGMENTS

This research was sponsored in part by the Universal Parallel Computing Research Center, funded by Intel and Microsoft, and in part by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231. We'd like to thank FZ-Jülich and Georgia Tech for access to Cell blades. Our thanks go to Joseph Gebis, Leonid Oliker, John Shalf, Katherine Yelick, and the rest of the Par Lab for feedback on the Roofline model, and to Jike Chong, Kaushik Datta, Mark Hoemmen, Matt Johnson, Jae Lee, Rajesh Nishtala, Heidi Pan, David Wessel, Mark Hill and the anonymous reviewers for feedback on early drafts of this paper.

10. REFERENCES

[1] V. Adve, Analyzing the Behavior and Performance of Parallel Programs, PhD thesis, Univ. of Wisconsin, 1993.

[2] AMD, Software Optimization Guide for AMD Family 10h Processors, Publication 40546, April 2008.

[3] G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (1967), 483-485.
[4] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick. "The landscape of parallel computing research: A view from Berkeley." Tech. Rep. UCB/EECS-2006-183, EECS, U.C. Berkeley, Dec 2006.

[5] C. Bienia, S. Kumar, J. Singh, and K. Li. "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Princeton University Technical Report TR-811-008, January 2008.

[6] S. Bird et al, "A Case for Sensible Performance Counters," submitted to the First USENIX Workshop on Hot Topics in Parallelism (HotPar '09), Berkeley CA, March 2009.

[7] E. Boyd, W. Azeem, H. Lee, T. Shih, S. Hung, and E. Davidson, "A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1," Proc. 1994 Int'l Conf. on Parallel Processing, vol. 3, pp. 188-192, 1994.

[8] D. Callahan, J. Cocke, and K. Kennedy. "Estimating interlock and improving balance for pipelined machines," J. Parallel Distrib. Comput. 5, 334-358, 1988.

[9] J. Chong, Private Communication, 2008.

[10] P. Colella, "Defining Software Requirements for Scientific Computing," presentation, 2004.

[11] S. Carr and K. Kennedy, "Improving the Ratio of Memory Operations to Floating-Point Operations in Loops," ACM TOPLAS 16(4) (Nov. 1994).

[12] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures," Supercomputing (SC08), 2008.

[13] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. Whaley, and K. Yelick, "Self Adapting Linear Algebra Algorithms and Software," Proc. IEEE: Special Issue on Program Generation, Optimization, and Adaptation, 93 (2) 2005.

[14] M. Dubois and F. A. Briggs, "Performance of Synchronized Iterative Processes in Multiprocessor Systems," IEEE Trans. on Software Engineering SE-8, 4 (July 1982), 419-431.

[15] M. Frigo and S. Johnson, "The Design and Implementation of FFTW3," Proc. IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, 93 (2) 2005.

[16] M. Harris, "Mapping Computational Concepts to GPUs," ACM SIGGRAPH Tutorials, Chapter 31, 2005.

[17] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th ed., Boston, MA: Morgan Kaufmann Publishers, 2007.

[18] M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, July 2008.

[19] M. Hill and A. Smith, "Evaluating Associativity in CPU Caches," IEEE Trans. on Computers, 38(12), pp. 1612-1630, Dec. 1989.

[20] E. Lazowska, J. Zahorjan, S. Graham, and K. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice Hall, Upper Saddle River, NJ, 1984.

[21] J. D. C. Little, "A Proof of the Queueing Formula L = λW," Operations Research, 9, 383-387 (1961).

[22] J. McCalpin, "STREAM: Sustainable Memory Bandwidth in High Performance Computers," www.cs.virginia.edu/stream, 1995.

[23] D. Patterson, "Latency Lags Bandwidth," CACM 47:10, Oct. 2004.

[24] S. Williams, Autotuning Performance on Multicore Computers, PhD thesis, U.C. Berkeley, 2008.

[25] S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms," Int'l Parallel & Distributed Processing Symposium (IPDPS), 2008.

[26] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms," Supercomputing (SC07), 2007.

[27] M. Tikir, L. Carrington, E. Strohmaier, A. Snavely, "A Genetic Algorithms Approach to Modeling the Performance of Memory-bound Computations," Supercomputing (SC07), 2007.

[28] A. Thomasian and P. Bay, "Analytic Queueing Network Models for Parallel Processing of Task Systems," IEEE Trans. on Computers C-35, 12 (December 1986), 1045-1054.

[29] R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, and B. Lee. "Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply," Supercomputing (SC02), Nov. 2002.

[30] S. Woo, M. Ohara, E. Torrie, J-P Singh, and A. Gupta. "The SPLASH-2 programs: characterization and methodological considerations," Proc. 22nd annual Int'l Symp. on Computer Architecture (ISCA '95), May 1995, 24-36.

Categories and Subject Descriptors
B.8.2 [Performance And Reliability]: Performance Analysis and Design Aids, D.1.3 [Programming Techniques]: Concurrent Programming

General Terms
Measurement, Performance, Experimentation

Keywords
Performance model, Parallel Computer, Multicore Computer, Multiprocessor, Kernel, Sparse Matrix, Structured Grid, FFT, Stencil, AMD Opteron X4, AMD Barcelona, Intel Xeon, Intel Clovertown, IBM Cell, Sun UltraSPARC T2+, Sun Niagara 2

Appendix A
Appendix A is found online at the CACM website: cacm.acm.org.