An Analytical Model for a GPU Architecture with
Memory-level and Thread-level Parallelism Awareness

Sunpyo Hong, Electrical and Computer Engineering, Georgia Institute of Technology, [email protected]
Hyesoon Kim, School of Computer Science, Georgia Institute of Technology, [email protected]
ABSTRACT
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications.

To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.

Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques; C.5.3 [Computer System Implementation]: Microcomputers

General Terms
Measurement, Performance

Keywords
Analytical model, CUDA, GPU architecture, Memory level parallelism, Warp level parallelism, Performance estimation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA'09, June 20–24, 2009, Austin, Texas, USA.
Copyright 2009 ACM 978-1-60558-526-0/09/06 ...$5.00.

1. INTRODUCTION
The increasing computing power of GPUs gives them considerably higher peak computing power than CPUs. For example, NVIDIA's GTX280 GPUs [3] provide 933 Gflop/s with 240 cores, while Intel's Core2Quad processors [2] deliver only 100 Gflop/s. Intel's next generation of graphics processors will support more than 900 Gflop/s [26]. AMD/ATI's latest GPU (HD4870) provides 1.2 Tflop/s [1]. However, even though hardware is providing high performance computing, writing parallel programs to take full advantage of this high performance computing power is still a big challenge.

Recently, there have been new programming languages that aim to reduce programmers' burden in writing parallel applications for the GPUs such as Brook+ [5], CUDA [22], and OpenCL [16]. However, even with these newly developed programming languages, programmers still need to spend enormous time and effort to optimize their applications to achieve better performance [24]. Although the GPGPU community [11] provides general guidelines for optimizing applications using CUDA, clearly understanding various features of the underlying architecture and the associated performance bottlenecks in their applications is still remaining homework for programmers. Therefore, programmers might need to vary all the combinations to find the best performing configurations [24].

To provide insight into performance bottlenecks in massively parallel architectures, especially GPU architectures, we propose a simple analytical model. The model can be used statically without executing an application. The basic intuition of our analytical model is that estimating the cost of memory operations is the key component of estimating the performance of parallel GPU applications. The execution time of an application is dominated by the latency of memory instructions, but the latency of each memory operation can be hidden by executing multiple memory requests concurrently. By using the number of concurrently running threads and the memory bandwidth consumption, we estimate how many memory requests can be executed concurrently, which we call memory warp¹ parallelism (MWP). We also introduce computation warp parallelism (CWP). CWP represents how much computation can be done by other warps while one warp is waiting for memory values. CWP is similar to a metric, arithmetic intensity² [23], in the GPGPU community. Using both MWP and CWP, we estimate effective costs of memory requests, thereby estimating the overall execution time of a program.

¹ A warp is a batch of threads that are internally executed together by the hardware. Section 2 describes a warp.
² Arithmetic intensity is defined as math operations per memory operation.
We evaluate our analytical model based on the CUDA [20, 22] programming language, which is C with extensions for parallel threads. We compare the results of our analytical model with the actual execution time on several GPUs. Our results show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on the Merge benchmarks [17]³ is 13.3%.

³ The Merge benchmarks consist of several media processing applications.

The contributions of our work are as follows:

1. To the best of our knowledge, we propose the first analytical model for the GPU architecture. This can be easily extended to other multithreaded architectures as well.

2. We propose two new metrics, MWP and CWP, to represent the degree of warp level parallelism that provide key insights identifying performance bottlenecks.

[Figure 1: An overview of the GPU architecture. Threads are grouped into warps and warps into blocks; blocks are assigned to streaming multiprocessors (multithreaded processors, each with a PC, I-cache, decoder, shared memory, and a SIMD execution unit of stream processors), which are connected through an interconnection network to the global memory (device memory).]

2. BACKGROUND AND MOTIVATION
We provide a brief background on the GPU architecture and programming model that we modeled. Our analytical model is based on the CUDA programming model and the NVIDIA Tesla architecture [3, 8, 20] used in the GeForce 8-series GPUs.

2.1 Background on the CUDA Programming Model
The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model. The GPU is treated as a coprocessor that executes data-parallel kernel functions.

CUDA provides three key abstractions, a hierarchy of thread groups, shared memories, and barrier synchronization. Threads have a three level hierarchy. A grid is a set of thread blocks that execute a kernel function. Each grid consists of blocks of threads. Each block is composed of hundreds of threads. Threads within one block can share data using shared memory and can be synchronized at a barrier. All threads within a block are executed concurrently on a multithreaded architecture.

The programmer specifies the number of threads per block, and the number of blocks per grid. A thread in the CUDA programming language is much lighter weight than a thread in traditional operating systems. A thread in CUDA typically processes one data element at a time. The CUDA programming model has two shared read-write memory spaces, the shared memory space and the global memory space. The shared memory is local to a block and the global memory space is accessible by all blocks. CUDA also provides two read-only memory spaces, the constant space and the texture space, which reside in external DRAM and are accessed via read-only caches.

2.2 Background on the GPU Architecture
Figure 1 shows an overview of the GPU architecture. The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing eight streaming processor (SP) cores, two special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a 16KB read/write shared memory [8].

The SM executes a batch of 32 threads together called a warp. Executing a warp instruction applies the instruction to 32 threads, similar to executing a SIMD instruction like an SSE instruction [14] in X86. However, unlike SIMD instructions, the concept of warp is not exposed to the programmers; rather, programmers write a program for one thread, and then specify the number of parallel threads in a block, and the number of blocks in a kernel grid. The Tesla architecture forms a warp using a batch of 32 threads [13, 9] and in the rest of the paper we also use a warp as a batch of 32 threads.

All the threads in one block are executed on one SM together. One SM can also have multiple concurrently running blocks. The number of blocks that are running on one SM is determined by the resource requirements of each block such as the number of registers and shared memory usage. The blocks that are running on one SM at a given time are called active blocks in this paper. Since one block typically has several warps (the number of warps is the same as the number of threads in a block divided by 32), the total number of active warps per SM is equal to the number of warps per block times the number of active blocks.

The shared memory is implemented within each SM multiprocessor as an SRAM and the global memory is part of the offchip DRAM. The shared memory has very low access latency (almost the same as that of a register) and high bandwidth. However, since a warp of 32 threads accesses the shared memory together, when there is a bank conflict within a warp, accessing the shared memory takes multiple cycles.

2.3 Coalesced and Uncoalesced Memory Accesses
The SM processor executes one warp at one time, and schedules warps in a time-sharing fashion. The processor has enough functional units and register read/write ports to execute 32 threads (i.e. one warp) together. Since an SM has only 8 functional units, executing 32 threads takes 4 SM processor cycles for computation instructions.⁴

When the SM processor executes a memory instruction, it generates memory requests and switches to another warp until all the memory values in the warp are ready. Ideally, all the memory accesses within a warp can be combined into one memory transaction. Unfortunately, that depends on the memory access pattern within a warp. If the memory addresses are sequential, all of the memory requests within a warp can be coalesced into a single memory transaction. Otherwise, each memory address will generate a different transaction. Figure 2 illustrates two cases. The CUDA manual [22] provides detailed algorithms to identify types of coalesced/uncoalesced memory accesses. If memory requests in a warp are uncoalesced, the warp cannot be executed until all memory transactions from the same warp are serviced, which takes significantly longer than waiting for only one memory request (coalesced case).

⁴ In this paper, a computation instruction means a non-memory instruction.
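The difference between the two access patterns can be made concrete with a small kernel. The following sketch is ours, not from the paper: the kernel names, array arguments, and the stride parameter are hypothetical, and the strided loop is only meant to force each thread of a warp onto a different memory segment so the loads cannot be combined.

    // Illustrative only: contrasts a coalesced and an uncoalesced global-load
    // pattern in the sense of Section 2.3. Names and parameters are hypothetical.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Consecutive threads of a warp touch consecutive addresses,
        // so the warp's 32 loads can be serviced as one memory transaction.
        if (i < n)
            out[i] = in[i];
    }

    __global__ void copy_uncoalesced(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // With a large stride (e.g., 32), the 32 threads of a warp touch
        // 32 widely separated addresses, so each load becomes its own
        // memory transaction and the warp waits for all of them.
        long j = (long)i * stride;
        if (j < n)
            out[i] = in[j];
    }

In the terminology used later in the model, the first kernel issues one memory transaction per warp, while the second issues up to 32 per warp on the compute-capability-1.0/1.1 GPUs studied in this paper.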
[Figure 2: Memory requests from a single warp. (a) coalesced memory access (b) uncoalesced memory access. In (a), Addr 1 through Addr 32 are combined into a single memory transaction; in (b), they generate multiple memory transactions.]

2.4 Motivating Example
To motivate the importance of a static performance analysis on the GPU architecture, we show an example of performance differences among three different versions of the same algorithm in Figure 3. The SVM benchmark is a kernel extracted from a face classification algorithm [28]. The performance of applications is measured on an NVIDIA QuadroFX5600 [4]. There are three different optimized versions of the same SVM algorithm: Naive, Constant, and Constant+Optimized. Naive uses only the global memory, Constant uses the cached read-only constant memory⁵, and Constant+Optimized also optimizes memory accesses⁶ on top of using the constant memory. Figure 3 shows the execution time when the number of threads per block is varied. In this example, the number of blocks is fixed, so the number of threads per block determines the total number of threads in a program. The performance improvements of Constant+Optimized and of Constant over the Naive implementation are 24.36x and 1.79x respectively. Even though the performance of each version might be affected by the number of threads, once the number of threads exceeds 64, the performance does not vary significantly.

[Figure 3: Optimization impacts on SVM. Execution time (ms) of the Naive, Constant, and Constant+Optimized versions versus the number of threads per block (4 to 484).]

Figure 4 shows SM processor occupancy [22] for the three cases. The SM processor occupancy indicates the resource utilization, which has been widely used to optimize GPU computing applications. It is calculated based on the resource requirements for a given program. Typically, high occupancy (the max value is 1) is better for performance since many actively running threads would more likely hide the DRAM memory access latency. However, SM processor occupancy does not sufficiently estimate the performance improvement, as shown in Figure 4. First, when the number of threads per block is less than 64, all three cases show the same occupancy values even though the performances of the 3 cases are different. Second, even though SM processor occupancy is improved, for some cases there is no performance improvement. For example, the performance of Constant is not improved at all even though the SM processor occupancy is increased from 0.35 to 1. Hence, we need other metrics to differentiate the three cases and to understand what the critical component of performance is.

[Figure 4: Occupancy values of SVM. Occupancy (0 to 1) of the Naive, Constant, and Constant+Optimized versions versus the number of threads per block (4 to 484).]

⁵ The benefits of using the constant memory are (1) it has an on-chip cache per SM and (2) using the constant memory can reduce register usage, which might increase the number of running blocks in one SM.
⁶ The programmer optimized the code to have coalesced memory accesses instead of uncoalesced memory accesses.

3. ANALYTICAL MODEL

3.1 Introduction to MWP and CWP
The GPU architecture is a multithreaded architecture. Each SM can execute multiple warps in a time-sharing fashion while one or more warps are waiting for memory values. As a result, the execution cost of warps that are executed concurrently can be hidden. The key component of our analytical model is finding out how many memory requests can be serviced and how many warps can be executed together while one warp is waiting for memory values.

To represent the degree of warp parallelism, we introduce two metrics, MWP (Memory Warp Parallelism) and CWP (Computation Warp Parallelism). MWP represents the maximum number of warps per SM that can access the memory simultaneously during the time period from right after the SM processor executes a memory instruction from one warp (therefore, memory requests are just sent to the memory system) until all the memory requests from the same warp are serviced (therefore, the processor can execute the next instruction from that warp). The warp that is waiting for memory values is called a memory warp in this paper. The time period from right after one warp sent memory requests until all the memory requests from the same warp are serviced is called one memory warp waiting period. CWP represents the number of warps that the SM processor can execute during one memory warp waiting period, plus one. A value of one is added to include the warp itself that is waiting for memory values. (This means that CWP is always greater than or equal to 1.)

MWP is related to how much memory parallelism exists in the system. MWP is determined by the memory bandwidth, memory bank parallelism and the number of running warps per SM. MWP plays a very important role in our analytical model. When MWP is higher than 1, the cost of memory access cycles from (MWP-1) warps is all hidden, since they are all accessing the memory system together. The detailed algorithm for calculating MWP will be described in Section 3.3.1.
CWP is related to the program characteristics. It is similar to arithmetic intensity, but unlike arithmetic intensity, higher CWP means less computation per memory access. CWP also considers timing information, which arithmetic intensity does not. CWP is mainly used to decide whether the total execution time is dominated by computation cost or memory access cost. When CWP is greater than MWP, the execution cost is dominated by memory access cost. However, when MWP is greater than CWP, the execution cost is dominated by computation cost. How to calculate CWP will be described in Section 3.3.2.

3.2 The Cost of Executing Multiple Warps in the GPU Architecture
To explain how executing multiple warps in each SM affects the total execution time, we will illustrate several scenarios in Figures 5, 6, 7 and 8. A computation period indicates the period when instructions from one warp are executed on the SM processor. A memory waiting period indicates the period when memory requests are being serviced. The numbers inside the computation period boxes and memory waiting period boxes in Figures 5, 6, 7 and 8 indicate a warp identification number.

3.2.1 CWP is Greater than MWP

[Figure 5: Total execution time when CWP is greater than MWP: (a) 8 warps (b) 4 warps. First and second computation periods and memory waiting periods are shown per warp; CWP is 4, MWP is 2, and idle cycles appear between computation periods in (b).]

For Case 1 in Figure 5a, we assume that all the computation periods and memory waiting periods are from different warps. The system can service two memory warps simultaneously. Since one computation period is roughly one third of one memory waiting warp period, the processor can finish 3 warps' computation periods during one memory waiting warp period. (i.e., MWP is 2 and CWP is 4 for this case.) As a result, the 6 computation periods are completely overlapped with other memory waiting periods. Hence, only 2 computations and 4 memory waiting periods contribute to the total execution cycles.

For Case 2 in Figure 5b, there are four warps and each warp has two computation periods and two memory waiting periods. The second computation period can start only after the first memory waiting period of the same warp is finished. MWP and CWP are the same as Case 1. First, the processor executes four of the first computation periods from each warp one by one. By the time the processor finishes the first computation periods from all warps, two memory waiting periods are already serviced, so the processor can execute the second computation periods for these two warps. After that, there are no ready warps: the first memory waiting periods for the remaining two warps are still not finished. As soon as these two memory requests are serviced, the processor starts to execute the second computation periods for the other warps. Surprisingly, even though there are some idle cycles between computation periods, the total execution cycles are the same as Case 1. When CWP is higher than MWP, there are enough warps that are waiting for the memory values, so the cost of computation periods can be almost always hidden by memory access periods.

For both cases, the total execution cycles are only the sum of 2 computation periods and 4 memory waiting periods. Using MWP, the total execution cycles can be calculated using the two equations below. We divide Comp_cycles by #Mem_insts to get the number of cycles in one computation period.

    Exec_cycles = Mem_cycles × N/MWP + Comp_p × MWP    (1)
    Comp_p = Comp_cycles / #Mem_insts                  (2)

Mem_cycles: Memory waiting cycles per each warp (see Equation (18))
Comp_cycles: Computation cycles per each warp (see Equation (19))
Comp_p: execution cycles of one computation period
#Mem_insts: Number of memory instructions
N: Number of active running warps per SM

3.2.2 MWP is Greater than CWP
In general, CWP is greater than MWP. However, for some cases, MWP is greater than CWP. Let's say that the system can service 8 memory warps concurrently. Again CWP is still 4 in this scenario. In this case, as soon as the first computation period finishes, the processor can send memory requests. Hence, a memory waiting period of a warp always immediately follows the previous computation period. If all warps are independent, the processor continuously executes another warp. Case 3 in Figure 6a shows the timing information. In this case, the memory waiting periods are all overlapped with other warps except the last warp. The total execution cycles are the sum of 8 computation periods and only one memory waiting period.

[Figure 6: Total execution time when MWP is greater than CWP: (a) 8 warps (b) 4 warps. First and second computation periods and memory waiting periods are shown per warp; all memory waiting periods except the last one are overlapped with computation.]

Even if not all warps are independent, when MWP is higher than CWP, many of the memory waiting periods are overlapped. Case 4 in Figure 6b shows an example. Each warp has two computation periods and two memory waiting periods. Since the computation time is dominant, the total execution cycles are again the sum of 8 computation periods and only one memory waiting period.

Using MWP and CWP, the total execution cycles can be calculated using the following equation:

    Exec_cycles = Mem_p + Comp_cycles × N    (3)

Mem_p: One memory waiting period (= Mem_L in Equation (12))

Case 5 in Figure 7 shows an extreme case. In this case, not even one computation period can be finished while one memory waiting period is completed. Hence, CWP is less than 2. Note that CWP is always greater than 1. Even if MWP is 8, the application cannot take advantage of any memory warp parallelism. Hence, the total execution cycles are 8 computation periods plus one memory waiting period. Note that even in this extreme case, the total execution cycles of Case 5 are the same as those of Case 4. Case 5 happens when Comp_cycles are longer than Mem_cycles.
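To make Equations (1)–(3) concrete, consider a hypothetical warp profile (these numbers are ours, chosen only for illustration, not taken from the paper): Mem_cycles = 400, Comp_cycles = 120, #Mem_insts = 2, and N = 8 active warps, so Comp_p = 120/2 = 60.

    If MWP = 2 (so CWP = (400 + 120)/120 ≈ 4.3 > MWP), Equation (1) gives
        Exec_cycles = 400 × 8/2 + 60 × 2 = 1720 cycles,
    and the total time is dominated by serialized groups of memory waiting periods.
    If instead MWP = 8 (MWP > CWP), Equation (3) gives
        Exec_cycles = 400 + 120 × 8 = 1360 cycles,
    and the total time is dominated by the computation periods.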
[Figure 7: Total execution time when computation cycles are longer than memory waiting cycles (8 warps). CWP is less than 2 and MWP is 8.]

3.2.3 Not Enough Warps Running
The previous two sections described situations when there are enough warps running on one SM. Unfortunately, if an application does not have enough warps, the system cannot take advantage of all available warp parallelism. MWP and CWP cannot be greater than the number of active warps on one SM.

[Figure 8: Total execution time when MWP is equal to N: (a) 1 warp (b) 2 warps.]

Case 6 in Figure 8a shows when only one warp is running. All the executions are serialized. Hence, the total execution cycles are the sum of the computation and memory waiting periods. Both CWP and MWP are 1 in this case. Case 7 in Figure 8b shows there are two running warps. Let's assume that MWP is two. Even if one computation period is less than half of one memory waiting period, because there are only two warps, CWP is still two. Because of MWP, the total execution time is roughly half of the sum of all the computation periods and memory waiting periods.

Using MWP, the total execution cycles of the above two cases can be calculated using the following equation:

    Exec_cycles = Mem_cycles × N/MWP + Comp_cycles × N/MWP + Comp_p × (MWP − 1)
                = Mem_cycles + Comp_cycles + Comp_p × (MWP − 1)    (4)

Note that for both cases, MWP and CWP are equal to N, the number of active warps per SM.

3.3 Calculating the Degree of Warp Parallelism

3.3.1 Memory Warp Parallelism (MWP)
MWP is slightly different from MLP [10]. MLP represents how many memory requests can be serviced together. MWP represents the maximum number of warps in each SM that can access the memory simultaneously during one memory warp waiting period. The main difference between MLP and MWP is that MWP counts all memory requests from a warp as one unit, while MLP counts all individual memory requests separately. As we discussed in Section 2.3, one memory instruction in a warp can generate multiple memory transactions. This difference is very important because a warp cannot be executed until all values are ready.

MWP is tightly coupled with the DRAM memory system. In our analytical model, we model the DRAM system as a simple queue and each SM has its own queue. Each active SM consumes an equal amount of memory bandwidth. Figure 9 shows the memory model and a timeline of memory warps.

The latency of each memory warp is at least Mem_L cycles. Departure_delay is the minimum departure distance between two consecutive memory warps. Mem_L is a round trip time to the DRAM, which includes the DRAM access time and the address and data transfer time.

[Figure 9: Memory system model: (a) memory model, with SMs sharing the memory bandwidth, each with its own queue; (b) timeline of memory warps, showing Mem_L per warp and Departure_delay between consecutive memory warps.]

MWP represents the number of memory warps per SM that can be handled during Mem_L cycles. MWP cannot be greater than the number of warps per SM that reach the peak memory bandwidth (MWP_peak_BW) of the system, as shown in Equation (5). If fewer SMs are executing warps, each SM can consume more bandwidth than when all SMs are executing warps. Equation (6) represents MWP_peak_BW. If an application does not reach the peak bandwidth, MWP is a function of Mem_L and Departure_delay; MWP_Without_BW is calculated using Equations (10) – (17). MWP also cannot be greater than the number of active warps, as shown in Equation (5). If the number of active warps is less than MWP_Without_BW_full, the processor does not have enough warps to utilize memory level parallelism.

    MWP = MIN(MWP_Without_BW, MWP_peak_BW, N)                  (5)
    MWP_peak_BW = Mem_Bandwidth / (BW_per_warp × #ActiveSM)    (6)
    BW_per_warp = Freq × Load_bytes_per_warp / Mem_L           (7)

[Figure 10: Illustrations of departure delays for uncoalesced and coalesced memory warps: (a) uncoalesced case (b) coalesced case. In the uncoalesced case one warp issues multiple transactions separated by Departure_del_uncoal, giving Mem_L_Uncoal; in the coalesced case one warp issues a single transaction, giving Mem_L_Coal.]

The latency of memory warps is dependent on the memory access pattern (coalesced/uncoalesced) as shown in Figure 10. For uncoalesced memory warps, since one warp requests multiple transactions (#Uncoal_per_mw), Mem_L includes departure delays for all #Uncoal_per_mw transactions. Departure_delay also includes #Uncoal_per_mw Departure_del_uncoal cycles. Mem_LD is a round-trip latency to the DRAM for each memory transaction. In this model, Mem_LD for uncoalesced and coalesced accesses is considered to be the same, even though a coalesced memory request might take a few more cycles because of its larger data size.
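To give a feel for the magnitudes involved, the numbers from the worked example in Table 2 (Section 3.6) can be plugged into Equations (6) and (7): with Freq = 1 GHz, Load_bytes_per_warp = 128 B, Mem_L = 730 cycles, Mem_Bandwidth = 80 GB/s and 16 active SMs,

    BW_per_warp = 1 GHz × 128 B / 730 ≈ 0.175 GB/s         (Equation (7))
    MWP_peak_BW = 80 GB/s / (0.175 GB/s × 16) ≈ 28.57      (Equation (6))

so in that example the bandwidth limit of roughly 28.6 memory warps is far above MWP_Without_BW, and MWP ends up being limited by Departure_delay instead.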
In an application, some memory requests would be coalesced and some would not. Since multiple warps are running concurrently, the analytical model simply uses the weighted average of the coalesced and uncoalesced latencies for the memory latency (Mem_L). A weight is determined by the number of coalesced and uncoalesced memory requests as shown in Equations (13) and (14). MWP is calculated using Equations (10) – (17). The parameters used in these equations are summarized in Table 1. Mem_LD, Departure_del_coal and Departure_del_uncoal are measured with micro-benchmarks, as we will show in Section 5.1.

3.3.2 Computation Warp Parallelism (CWP)
Once we calculate the memory latency for each warp, calculating CWP is straightforward. CWP_full is the value of CWP when there are enough warps. When CWP_full is greater than N (the number of active warps in one SM), CWP is N; otherwise, CWP_full becomes CWP.

    CWP_full = (Mem_cycles + Comp_cycles) / Comp_cycles    (8)
    CWP = MIN(CWP_full, N)                                 (9)

3.4 Putting It All Together in CUDA
So far, we have explained our analytical model without strongly coupling it to the CUDA programming model, to simplify the model. In this section, we extend the analytical model to consider the CUDA programming model.

3.4.1 Number of Warps per SM
The GPU SM multithreading architecture executes 100s of threads concurrently. Nonetheless, not all threads in an application can be executed at the same time. The processor fetches a few blocks at one time, and fetches additional blocks as soon as one block retires. #Rep represents how many times a single SM executes its set of active blocks. For example, when there are 40 blocks in an application and 4 SMs, if each SM can execute 2 blocks concurrently, then #Rep is 5. Hence, the total number of warps per SM is #Active_warps_per_SM (N) times #Rep. N is determined by machine resources.

3.4.2 Total Execution Cycles
Depending on MWP and CWP values, total execution cycles for an entire application (Exec_cycles_app) are calculated using Equations (22), (23), and (24). Mem_L is calculated in Equation (12). Execution cycles that consider synchronization effects will be described in Section 3.4.6.

3.4.3 Dynamic Number of Instructions
Total execution cycles are calculated using the number of dynamic instructions. The compiler generates intermediate assembler-level instructions, the NVIDIA PTX instruction set [22]. PTX instructions translate nearly one to one into native binary microinstructions later.⁷ We use the number of PTX instructions for the dynamic number of instructions.

The total number of instructions is proportional to the number of data elements. Programmers must decide the number of threads and blocks for each input data. The number of total instructions per thread is related to how many data elements are computed in one thread; programmers must know this information. If we know the number of elements per thread, counting the number of total instructions per thread is simply counting the number of computation instructions and the number of memory instructions per data element. The detailed algorithm to count the number of instructions from PTX code is provided in an extended version of this paper [12].

3.4.4 Cycles Per Instruction (CPI)
Cycles per Instruction (CPI) is commonly used to represent the cost of each instruction. Using total execution cycles, we can calculate CPI using Equation (25). Note that CPI is the cost when an instruction is executed by all threads in one warp.

    CPI = Exec_cycles_app / (#Total_insts × (#Threads_per_block / #Threads_per_warp) × (#Blocks / #Active_SMs))    (25)

3.4.5 Coalesced/Uncoalesced Memory Accesses
As Equations (15) and (12) suggest, the latency of a memory instruction is heavily dependent on the memory access type. Whether memory requests inside a warp can be coalesced or not depends on the microarchitecture of the memory system and the memory access pattern in a warp. The GPUs that we evaluated have two coalesced/uncoalesced policies, specified by the compute capability version. The CUDA manual [22] describes in more detail when memory requests in a warp can be coalesced or not. Earlier compute capability versions have two differences compared with the later version (1.3): (1) stricter rules are applied for accesses to be coalesced, and (2) when memory requests are uncoalesced, one warp generates 32 memory transactions. In the latest version (1.3), the rules are more relaxed and all memory requests are coalesced into as few memory transactions as possible.⁸

The detailed algorithms to detect coalesced/uncoalesced memory accesses and to count the number of memory transactions per warp at static time are provided in an extended version of this paper [12].

⁷ Since some PTX instructions expand to multiple binary instructions, using the PTX instruction count could be one of the error sources in the analytical model.
⁸ In the CUDA manual, compute capability 1.3 says all requests are coalesced because all memory requests within each warp are always combined into as few transactions as possible. However, in our analytical model, we use the coalesced memory access model only if all memory requests are combined into one memory transaction.
    Mem_L_Uncoal = Mem_LD + (#Uncoal_per_mw − 1) × Departure_del_uncoal                                           (10)
    Mem_L_Coal = Mem_LD                                                                                           (11)
    Mem_L = Mem_L_Uncoal × Weight_uncoal + Mem_L_Coal × Weight_coal                                               (12)
    Weight_uncoal = #Uncoal_Mem_insts / (#Uncoal_Mem_insts + #Coal_Mem_insts)                                     (13)
    Weight_coal = #Coal_Mem_insts / (#Coal_Mem_insts + #Uncoal_Mem_insts)                                         (14)
    Departure_delay = (Departure_del_uncoal × #Uncoal_per_mw) × Weight_uncoal + Departure_del_coal × Weight_coal  (15)
    MWP_Without_BW_full = Mem_L / Departure_delay                                                                 (16)
    MWP_Without_BW = MIN(MWP_Without_BW_full, #Active_warps_per_SM)                                               (17)
    Mem_cycles = Mem_L_Uncoal × #Uncoal_Mem_insts + Mem_L_Coal × #Coal_Mem_insts                                  (18)
    Comp_cycles = #Issue_cycles × #Total_insts                                                                    (19)
    N = #Active_warps_per_SM                                                                                      (20)
    #Rep = #Blocks / (#Active_blocks_per_SM × #Active_SMs)                                                        (21)

    If (MWP is N warps per SM) and (CWP is N warps per SM):
        Exec_cycles_app = (Mem_cycles + Comp_cycles + (Comp_cycles / #Mem_insts) × (MWP − 1)) × #Rep              (22)
    If (CWP >= MWP) or (Comp_cycles > Mem_cycles):
        Exec_cycles_app = (Mem_cycles × N/MWP + (Comp_cycles / #Mem_insts) × (MWP − 1)) × #Rep                    (23)
    If (MWP > CWP):
        Exec_cycles_app = (Mem_L + Comp_cycles × N) × #Rep                                                        (24)

All the parameters are summarized in Table 1.
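The equations above are simple enough to evaluate by hand, but for clarity the following C sketch shows one way they could be wired together. It is our illustration, not the authors' tool: the struct fields mirror the parameters in Table 1, and the function covers only Equations (10)–(24); the synchronization term of Section 3.4.6 would be added on top.

    /* A sketch (ours, not the authors') of evaluating Equations (10)-(24).
       Field names follow Table 1. Inputs are per-thread static counts plus
       machine parameters; the result is Exec_cycles_app. */
    #include <math.h>

    typedef struct {
        double mem_LD, departure_del_uncoal, departure_del_coal;  /* machine (Table 6) */
        double freq_GHz, mem_bandwidth_GBps;                      /* machine (Table 3) */
        double n_active_SMs, n_active_blocks_per_SM, n_active_warps_per_SM;
        double n_blocks, issue_cycles, load_bytes_per_warp;
        double n_comp_insts, n_coal_mem_insts, n_uncoal_mem_insts;
        double uncoal_per_mw;
    } ModelParams;

    static double min3(double a, double b, double c) { return fmin(a, fmin(b, c)); }

    double exec_cycles_app(const ModelParams *p)
    {
        double mem_insts   = p->n_coal_mem_insts + p->n_uncoal_mem_insts;
        double total_insts = p->n_comp_insts + mem_insts;

        /* Equations (10)-(15): average memory latency and departure delay */
        double mem_L_uncoal = p->mem_LD + (p->uncoal_per_mw - 1.0) * p->departure_del_uncoal;
        double mem_L_coal   = p->mem_LD;
        double w_uncoal     = p->n_uncoal_mem_insts / mem_insts;
        double w_coal       = p->n_coal_mem_insts / mem_insts;
        double mem_L        = mem_L_uncoal * w_uncoal + mem_L_coal * w_coal;
        double dep_delay    = p->departure_del_uncoal * p->uncoal_per_mw * w_uncoal
                            + p->departure_del_coal * w_coal;

        /* Equations (16)-(17) and (5)-(7): MWP */
        double N           = p->n_active_warps_per_SM;
        double mwp_no_bw   = fmin(mem_L / dep_delay, N);
        double bw_per_warp = p->freq_GHz * p->load_bytes_per_warp / mem_L;   /* GB/s */
        double mwp_peak_bw = p->mem_bandwidth_GBps / (bw_per_warp * p->n_active_SMs);
        double mwp         = min3(mwp_no_bw, mwp_peak_bw, N);

        /* Equations (18)-(19) and (8)-(9): cycles and CWP */
        double mem_cycles  = mem_L_uncoal * p->n_uncoal_mem_insts + mem_L_coal * p->n_coal_mem_insts;
        double comp_cycles = p->issue_cycles * total_insts;
        double cwp         = fmin((mem_cycles + comp_cycles) / comp_cycles, N);

        /* Equations (21)-(24) */
        double rep = p->n_blocks / (p->n_active_blocks_per_SM * p->n_active_SMs);
        if (mwp == N && cwp == N)
            return (mem_cycles + comp_cycles + comp_cycles / mem_insts * (mwp - 1.0)) * rep;
        if (cwp >= mwp || comp_cycles > mem_cycles)
            return (mem_cycles * N / mwp + comp_cycles / mem_insts * (mwp - 1.0)) * rep;
        return (mem_L + comp_cycles * N) * rep;                              /* MWP > CWP */
    }

With the inputs of the worked example in Table 2 (Mem_LD = 420, Departure_del_uncoal = 10, 80 GB/s, 1 GHz, 16 SMs, 5 active blocks per SM, N = 20, 27 computation and 6 uncoalesced memory instructions, 32 transactions per memory warp, 128 bytes per warp), this sketch evaluates to approximately 38,400 cycles, matching the Exec_cycles_app row of Table 2 (38,450 there, computed with MWP rounded to 2.28).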

3.4.6 Synchronization Effects

[Figure 11: Additional delay effects of thread synchronization: (a) no synchronization (b) thread synchronization after each memory access period.]

The CUDA programming model supports thread synchronization through the __syncthreads() function. Typically, all the threads are executed asynchronously whenever all the source operands in a warp are ready. However, if there is a barrier, the processor cannot execute the instructions after the barrier until all the threads reach the barrier. Hence, there will be additional delays due to thread synchronization. Figure 11 illustrates the additional delay effect. Surprisingly, the additional delay is less than one waiting period. Actually, the additional delay per synchronization instruction in one block is the product of Departure_delay and (MWP-1). Since the synchronization occurs at block granularity, we need to account for the number of blocks in each SM. The final execution cycles of an application with the synchronization delay effect can be calculated by Equation (27).

    Synch_cost = Departure_delay × (MWP − 1) × #Synch_insts × #Active_blocks_per_SM × #Rep    (26)
    Exec_cycles_with_synch = Exec_cycles_app + Synch_cost                                     (27)

3.5 Limitations of the Analytical Model
Our analytical model does not consider the cost of cache misses such as I-cache, texture cache, or constant cache misses. The cost of these cache misses is negligible due to an almost 100% cache hit ratio.

The current G80 architecture does not have a hardware cache for the global memory. Typical stream applications running on the GPUs do not have strong temporal locality. However, if an application has temporal locality and a future architecture provides a hardware cache, the model should include a cache model. In future work, we will include cache models.

The cost of executing branch instructions is not modeled in detail. Double counting the number of instructions in both paths will probably provide an upper bound on execution cycles.

3.6 Code Example
To provide a concrete example, we apply the analytical model for the tiled matrix multiplication example in Figure 12 to a system that has 80GB/s memory bandwidth, 1GHz frequency and 16 SM processors. Let's assume that the programmer specified 128 threads per block (4 warps per block), and 80 blocks for execution. And 5 blocks are actively assigned to each SM (Active_blocks_per_SM) instead of the 8 maximum blocks⁹ due to high resource usage.

1: MatrixMulKernel<<<80, 128>>> (M, N, P);
2: ....
3: MatrixMulKernel(Matrix M, Matrix N, Matrix P)
4: {
5:   // init code ...
6:
7:   for (int a=starta, b=startb, iter=0; a<=enda;
8:        a+=stepa, b+=stepb, iter++)
9:   {
10:    __shared__ float Msub[BLOCKSIZE][BLOCKSIZE];
11:    __shared__ float Nsub[BLOCKSIZE][BLOCKSIZE];
12:
13:    Msub[ty][tx] = M.elements[a + wM * ty + tx];
14:    Nsub[ty][tx] = N.elements[b + wN * ty + tx];
15:
16:    __syncthreads();
17:
18:    for (int k=0; k < BLOCKSIZE; ++k)
19:      subsum += Msub[ty][k] * Nsub[k][tx];
20:
21:    __syncthreads();
22:  }
23:
24:  int index = wN * BLOCKSIZE * by + BLOCKSIZE * bx;
25:  P.elements[index + wN * ty + tx] = subsum;
26: }

Figure 12: CUDA code of tiled matrix multiplication

⁹ Each SM can have up to 8 blocks at a given time.
Table 1: Summary of Model Parameters

 #   Model Parameter           Definition                                                           Obtained
 1   #Threads_per_warp         Number of threads per warp                                           32 [22]
 2   Issue_cycles              Number of cycles to execute one instruction                          4 cycles [13]
 3   Freq                      Clock frequency of the SM processor                                  Table 3
 4   Mem_Bandwidth             Bandwidth between the DRAM and GPU cores                             Table 3
 5   Mem_LD                    DRAM access latency (machine configuration)                          Table 6
 6   Departure_del_uncoal      Delay between two uncoalesced memory transactions                    Table 6
 7   Departure_del_coal        Delay between two coalesced memory transactions                      Table 6
 8   #Threads_per_block        Number of threads per block                                          Programmer specifies inside a program
 9   #Blocks                   Total number of blocks in a program                                  Programmer specifies inside a program
 10  #Active_SMs               Number of active SMs                                                 Calculated based on machine resources
 11  #Active_blocks_per_SM     Number of concurrently running blocks on one SM                      Calculated based on machine resources [22]
 12  #Active_warps_per_SM (N)  Number of concurrently running warps on one SM                       #Active_blocks_per_SM x Number of warps per block
 13  #Total_insts              (#Comp_insts + #Mem_insts)
 14  #Comp_insts               Total dynamic number of computation instructions in one thread       Source code analysis
 15  #Mem_insts                Total dynamic number of memory instructions in one thread            Source code analysis
 16  #Uncoal_Mem_insts         Number of uncoalesced memory type instructions in one thread         Source code analysis
 17  #Coal_Mem_insts           Number of coalesced memory type instructions in one thread           Source code analysis
 18  #Synch_insts              Total dynamic number of synchronization instructions in one thread   Source code analysis
 19  #Coal_per_mw              Number of memory transactions per warp (coalesced access)            1
 20  #Uncoal_per_mw            Number of memory transactions per warp (uncoalesced access)          Source code analysis [12] (Table 3)
 21  Load_bytes_per_warp       Number of bytes for each warp                                        Data size (typically 4B) x #Threads_per_warp

We assume that the inner loop is iterated only once and the outer loop is iterated 3 times to simplify the example. Hence, #Comp_insts is 27, which is 9 computation instructions (Figure 13 lines 5, 7, 8, 9, 10, 11, 13, 14, and 15) times 3. Note that the ld.shared instructions in Figure 13 lines 9 and 10 are also counted as computation instructions, since the latency of accessing the shared memory is almost as fast as that of the register file. Lines 13 and 14 in Figure 12 show global memory accesses in the CUDA code. The memory indexes (a+wM*ty+tx) and (b+wN*ty+tx) determine memory access coalescing within a warp. Since a and b are most likely not a multiple of 32, we treat all the global loads as uncoalesced [12]. So #Uncoal_Mem_insts is 6, and #Coal_Mem_insts is 0.

1:  ...                                 // Init Code
2:
3:  $OUTERLOOP:
4:  ld.global.f32  %f2, [%rd23+0];      //
5:  st.shared.f32  [%rd14+0], %f2;      //
6:  ld.global.f32  %f3, [%rd19+0];      //
7:  st.shared.f32  [%rd15+0], %f3;      //
8:  bar.sync       0;                   // Synchronization
9:  ld.shared.f32  %f4, [%rd8+0];       // Innerloop unrolling
10: ld.shared.f32  %f5, [%rd6+0];       //
11: mad.f32        %f1, %f4, %f5, %f1;  //
12: // the code of the unrolled loop is omitted
13: bar.sync       0;                   // Synchronization
14: setp.le.s32    %p2, %r21, %r24;     //
15: @%p2 bra       $OUTERLOOP;          // Branch
16: ...                                 // Index calculation
17: st.global.f32  [%rd27+0], %f1;      // Store in P.elements

Figure 13: PTX code of tiled matrix multiplication

Table 2 shows the necessary model parameters and intermediate calculation steps used to compute the total execution cycles of the program. Since CWP is greater than MWP, we use Equation (23) to calculate Exec_cycles_app. Note that in this example, the execution cost of synchronization instructions is a significant part of the total execution cost. This is because we simplified the example. In most real applications, the number of dynamic synchronization instructions is much smaller than that of other instructions, so the synchronization cost is not that significant.

4. EXPERIMENTAL METHODOLOGY

4.1 The GPU Characteristics
Table 3 shows the list of GPUs used in this study. GTX280 supports 64-bit floating point operations and also has a later compute capability version (1.3) that improves uncoalesced memory accesses. To measure the GPU kernel execution time, the cudaEventRecord API, which uses GPU shader clock cycles, is used. All the measured execution times are the average of 10 runs.

4.2 Micro-benchmarks
All the benchmarks are compiled with NVCC [22]. To test the analytical model and also to find memory model parameters, we design a set of micro-benchmarks that simply repeat a loop 1000 times. We vary the number of load instructions and computation instructions per loop. Each micro-benchmark has two memory access patterns: coalesced and uncoalesced memory accesses. (An illustrative sketch of such a loop is shown after Section 4.3.)

4.3 Merge Benchmarks
To test how well our analytical model can predict typical GPGPU applications, we use 6 different benchmarks that are mostly used in the Merge work [17]. Table 5 describes each benchmark and summarizes its characteristics. The number of registers used per thread and the shared memory usage per block are statically obtained by compiling the code with the -cubin flag. The number of dynamic PTX instructions is calculated using the program's input values [12]. The rest of the characteristics are statically determined and can be found in the PTX code. Note that, since we estimate the number of dynamic instructions based only on static information and an input size, the counted number is an approximate value. To simplify the evaluation, depending on the majority load type, we treat all memory accesses as either coalesced or uncoalesced for each benchmark. For the Mat. (tiled) benchmark, the number of memory instructions and computation instructions changes with respect to the number of warps per block, which the programmers specify. This is because the number of inner loop iterations for each thread depends on blocksize (i.e., the tile size).
Table 5: Characteristics of the Merge Benchmarks (Arith. intensity means arithmetic intensity.)

 Benchmark          Description                                       Input size     Comp insts    Mem insts                 Arith. intensity  Registers  Shared Mem
 Sepia [17]         Filter for artificially aging images              7000 x 7000    71            6 (uncoalesced)           11.8              7          52B
 Linear [17]        Image filter for computing the avg. of 9-pixels   10000 x 10000  111           30 (uncoalesced)          3.7               15         60B
 SVM [17]           Kernel from a SVM-based algorithm                 736 x 992      10871         819 (coalesced)           13.3              9          44B
 Mat. (naive)       Naive version of matrix multiplication            2000 x 2000    12043         4001 (uncoalesced)        3                 10         88B
 Mat. (tiled) [22]  Tiled version of matrix multiplication            2000 x 2000    9780 - 24580  201 - 1001 (uncoalesced)  48.7              18         3960B
 Blackscholes [22]  European option pricing                           9000000        137           7 (uncoalesced)           19                11         36B

Table 2: Applying the Model to Figure 12

 Model Parameter        Obtained                  Value
 Mem_LD                 Machine conf.             420
 Departure_del_uncoal   Machine conf.             10
 #Threads_per_block     Figure 12 Line 1          128
 #Blocks                Figure 12 Line 1          80
 #Active_blocks_per_SM  Occupancy [22]            5
 #Active_SMs            Occupancy [22]            16
 #Active_warps_per_SM   128/32 (Table 1) × 5      20
 #Comp_insts            Figure 13                 27
 #Uncoal_Mem_insts      Figure 12 Lines 13, 14    6
 #Coal_Mem_insts        Figure 12 Lines 13, 14    0
 #Synch_insts           Figure 12 Lines 16, 21    6 = 2 × 3
 #Coal_per_mw           see Sec. 3.4.5            1
 #Uncoal_per_mw         see Sec. 3.4.5            32
 Load_bytes_per_warp    Figure 13 Lines 4, 6      128B = 4B × 32
 Departure_delay        Equation (15)             320 = 32 × 10
 Mem_L                  Equations (10), (12)      730 = 420 + (32 − 1) × 10
 MWP_without_BW_full    Equation (16)             2.28 = 730/320
 BW_per_warp            Equation (7)              0.175 GB/s = 1 GHz × 128B / 730
 MWP_peak_BW            Equation (6)              28.57 = 80 GB/s / (0.175 GB/s × 16)
 MWP                    Equation (5)              2.28 = MIN(2.28, 28.57, 20)
 Comp_cycles            Equation (19)             132 cycles = 4 × (27 + 6)
 Mem_cycles             Equation (18)             4380 = 730 × 6
 CWP_full               Equation (8)              34.18 = (4380 + 132)/132
 CWP                    Equation (9)              20 = MIN(34.18, 20)
 #Rep                   Equation (21)             1 = 80/(16 × 5)
 Exec_cycles_app        Equation (23)             38450 = 4380 × 20/2.28 + 132/6 × (2.28 − 1)
 Synch_cost             Equation (26)             12288 = 320 × (2.28 − 1) × 6 × 5
 Final Time             Equation (27)             50738 = 38450 + 12288

Table 3: The specifications of GPUs used in this study

 Model                 8800GTX    Quadro FX5600  8800GT     GTX280
 #SM                   16         16             14         30
 (SP) Processor Cores  128        128            112        240
 Graphics Clock        575 MHz    600 MHz        600 MHz    602 MHz
 Processor Clock       1.35 GHz   1.35 GHz       1.5 GHz    1.3 GHz
 Memory Size           768 MB     1.5 GB         512 MB     1 GB
 Memory Bandwidth      86.4 GB/s  76.8 GB/s      57.6 GB/s  141.7 GB/s
 Peak Gflop/s          345.6      384            336        933
 Computing Version     1.0        1.0            1.1        1.3
 #Uncoal_per_mw        32         32             32         [12]
 #Coal_per_mw          1          1              1          1

Table 4: The characteristics of micro-benchmarks

 # inst. per loop  Mb1      Mb2     Mb3      Mb4      Mb5      Mb6      Mb7
 Memory            0        1       1        2        2        4        6
 Comp. (FP)        23 (20)  17 (8)  29 (20)  27 (12)  35 (20)  47 (20)  59 (20)

5. RESULTS

5.1 Micro-benchmarks
The micro-benchmarks are used to measure the constant variables that are required to model the memory system. We vary three parameters (Mem_LD, Departure_del_uncoal, and Departure_del_coal) for each GPU to find the best fitting values. FX5600, 8800GTX and 8800GT use the same model parameters. Table 6 summarizes the results. Departure_del_coal is related to the memory access time to a single memory block. Departure_del_uncoal is longer than Departure_del_coal, due to the overhead of 32 small memory access requests. Departure_del_uncoal for GTX280 is much longer than that of FX5600. GTX280 coalesces the 32 thread memory requests per warp into the minimum number of memory access requests, and the overhead per access request is higher, with fewer accesses.

Table 6: Results of the Memory Model Parameters

 Model                 FX5600  GTX280
 Mem_LD                420     450
 Departure_del_uncoal  10      40
 Departure_del_coal    4       4

Using the parameters in Table 6, we calculate CPI for the micro-benchmarks. Figure 14 shows the average CPI of the micro-benchmarks for both the measured value and the value estimated using the analytical model. The results show that the average geometric mean of the error is 5.4%. As we can predict, as a benchmark has more load instructions, the CPI increases. For the coalesced load cases (Mb1_C – Mb7_C), the cost of load instructions is almost hidden because of high MWP, but for the uncoalesced load cases (Mb1_UC – Mb7_UC), the cost of load instructions increases linearly as the number of load instructions increases.

5.2 Merge Benchmarks
Figure 15 and Figure 16 show the measured and estimated execution times of the Merge benchmarks on FX5600 and GTX280. The number of threads per block is varied from 4 to 512 (512 is the maximum value that one block can have in the evaluated CUDA programs). Even though the number of threads is varied, the programs calculate the same number of data elements. In other words, if we increase the number of threads in a block, the total number of blocks is also reduced to process the same amount of data in one application. That is why the execution times are mostly the same. For the Mat.(tiled) benchmark, as we increase the number of threads the execution time reduces, because the number of active warps per SM increases.

Figure 17 shows the average of the measured and estimated CPIs across the four GPUs for the configurations in Figures 15 and 16. The average values of CWP and MWP per SM are also shown in Figures 18 and 19, respectively.
[Figure 15: The total execution time of the Merge benchmarks on FX5600. Six panels (Mat. (naive), Mat. (tiled), Blackscholes, Sepia, Linear, SVM) plot measured and modeled execution time (ms) against threads per block (0 to 480).]

[Figure 16: The total execution time of the Merge benchmarks on GTX280. Six panels (Mat. (naive), Mat. (tiled), Blackscholes, Sepia, Linear, SVM) plot measured and modeled execution time (ms) against threads per block (0 to 480).]

8800GT has the least bandwidth compared to the other GPUs, resulting in the highest CPI, in contrast to GTX280. Generally, higher arithmetic intensity means lower CPI (lower CPI is higher performance). However, even though the Mat.(tiled) benchmark has the highest arithmetic intensity, SVM has the lowest CPI value. SVM has higher MWP and CWP than those of Mat.(tiled), as shown in Figures 18 and 19. SVM has the highest MWP and the lowest CPI because only SVM has fully coalesced memory accesses. MWP in GTX280 is higher than in the rest of the GPUs because, even though most memory requests are not fully coalesced, they are still combined into as few requests as possible, which results in higher MWP. All other benchmarks are limited by Departure_delay, which makes all other applications never reach the peak memory bandwidth.

Figure 20 shows the average occupancy of the Merge benchmarks. Except for Mat.(tiled) and Linear, all other benchmarks have higher occupancy than 70%. The results show that occupancy is less correlated to the performance of applications.

The final geometric mean of the estimated CPI error on the Merge benchmarks in Figure 17 over all four different types of GPUs is 13.3%. Generally the error is higher for GTX280 than the others, because we have to estimate the number of memory requests that are generated by partially coalesced loads per warp in GTX280, unlike the other GPUs, which have the fixed value 32. On average, the model estimates the execution cycles of FX5600 better than the others. This is because we set the machine parameters using FX5600.

There are several error sources in our model: (1) We used a very simple memory model and we assume that the characteristics of the memory behavior are similar across all the benchmarks. We found out that the outcome of the model is very sensitive to MWP values. (2) We assume that the DRAM memory scheduler schedules memory requests equally for all warps. (3) We do not consider the bank conflict latency in the shared memory. (4) All computation instructions have the same latency even though some special functional unit instructions have longer latency than others. (5) For some applications, the number of threads per block is not always a multiple of 32. (6) The SM retires warps at block granularity. Even though there are free cycles, the SM cannot start to fetch new blocks, but the model assumes average active warps.
[Figure 14: CPI on the micro-benchmarks. Measured and modeled CPI for FX5600 and GTX280 on the uncoalesced (Mb1_UC–Mb7_UC) and coalesced (Mb1_C–Mb7_C) variants of Mb1–Mb7.]

[Figure 17: CPI on the Merge benchmarks. Measured and modeled CPI on 8800GT, FX5600, 8800GTX, and GTX280 for Mat.(naive), Mat.(tiled), SVM, Sepia, Linear, and Blackscholes.]

[Figure 19: MWP per SM on the Merge benchmarks for 8800GT, FX5600, 8800GTX, and GTX280.]

[Figure 20: Occupancy on the Merge benchmarks for 8800GT, FX5600, 8800GTX, and GTX280.]

[Figure 18: CWP per SM on the Merge benchmarks for 8800GT, FX5600, 8800GTX, and GTX280.]

6. RELATED WORK
We discuss research related to our analytical model in the areas of analytical performance modeling and GPU performance estimation. No previous work we are aware of has proposed a way of accurately predicting GPU performance or multithreaded program performance at compile time using only statically available information. Our cost estimation metrics provide a new way of estimating the performance impacts.

6.1 Analytical Modeling
There have been many analytical models proposed for superscalar processors [21, 19, 18]. Most of that work did not consider memory level parallelism or even cache misses. Karkhanis and Smith [15] proposed a first-order superscalar processor model to analyze the performance of processors. They modeled long latency cache misses and other major performance bottleneck events using a first-order model. They used different penalties for dependent loads. Recently, Chen and Aamodt [7] improved the first-order superscalar processor model by considering the cost of pending hits, data prefetching and MSHRs (Miss Status/Information Holding Registers). They showed that not modeling prefetching and MSHRs can increase errors significantly in the first-order processor model. However, they only showed memory instructions' CPI results compared with the results of a cycle accurate simulator.

There is a rich body of work that predicts parallel program performance using stochastic modeling or task graph analysis, which is beyond the scope of our work. Saavedra-Barrera and Culler [25] proposed a simple analytical model for multithreaded machines using stochastic modeling. Their model uses memory latency, switching overhead, the number of threads that can be interleaved and the interval between thread switches. Their work provided insights into performance estimation on multithreaded architectures. However, they did not consider synchronization effects. Furthermore, the application characteristics are represented with statistical modeling, which cannot provide detailed performance estimation for each application. Their model also provided insights into a saturation point and an efficiency metric that could be useful for reducing the optimization space, even though they did not discuss that benefit in their work.

Sorin et al. [27] developed an analytical model to calculate the throughput of processors in a shared memory system. They developed a model to estimate processor stall times due to cache misses or resource constraints. They also discussed coalesced memory effects inside the MSHR. The majority of their analytical model is also based on statistical modeling.
6.2 GPU Performance Modeling
Our work is closely related to other GPU optimization techniques. The GPGPU community provides insights into how to optimize GPGPU code to increase memory-level parallelism and thread-level parallelism [11]. However, all of these heuristics are discussed qualitatively, without using any analytical models. The most relevant metric is the occupancy metric, which provides only general guidelines, as we showed in Section 2.4. Recently, Ryoo et al. [24] proposed two metrics to reduce the optimization space for programmers by calculating the utilization and efficiency of applications. However, their work focused on non-memory-intensive workloads, whereas we thoroughly analyze both memory-intensive and non-memory-intensive workloads to estimate application performance. Furthermore, their work only narrows the optimization space to reduce program tuning time; in contrast, we predict the actual program execution time. Bakhoda et al. [6] recently implemented a GPU simulator and analyzed the performance of CUDA applications using the simulation output.

7. CONCLUSIONS
This paper proposed and evaluated a memory-parallelism-aware analytical model to estimate execution cycles for the GPU architecture. The key idea of the analytical model is to find the maximum number of memory warps that can execute in parallel, a metric we call MWP, to estimate the effective memory instruction cost. The model calculates an estimated CPI (cycles per instruction), which could provide a simple performance estimation metric for programmers and compilers to decide whether or not they should perform certain optimizations. Our evaluation shows that the geometric mean of the absolute error of our analytical model is 5.4% on the micro-benchmarks and 13.3% on the GPU computing applications. We believe that this analytical model can provide insights into how programmers should improve their applications, which will reduce the burden on parallel programmers.

Acknowledgments
Special thanks to John Nickolls for insightful and detailed comments in preparation of the final version of the paper. We thank the anonymous reviewers for their comments. We also thank Chi-keung Luk, Philip Wright, Guru Venkataramani, Gregory Diamos, and Eric Sprangle for their feedback on improving the paper. We gratefully acknowledge the support of Intel Corporation, Microsoft Research, and the equipment donations from NVIDIA.

8. REFERENCES
[1] ATI Mobility Radeon HD4850/4870 Graphics-Overview. http://ati.amd.com/products/radeonhd4800.
[2] Intel Core2 Quad Processors. http://www.intel.com/products/processor/core2quad.
[3] NVIDIA GeForce series GTX280, 8800GTX, 8800GT. http://www.nvidia.com/geforce.
[4] NVIDIA Quadro FX5600. http://www.nvidia.com/quadro.
[5] Advanced Micro Devices, Inc. AMD Brook+. http://ati.amd.com/technology/streamcomputing/AMD-Brookplus.pdf.
[6] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE ISPASS, April 2009.
[7] X. E. Chen and T. M. Aamodt. A first-order fine-grained multithreaded throughput model. In HPCA, 2009.
[8] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, March-April 2008.
[9] M. Fatica, P. LeGresley, I. Buck, J. Stone, J. Phillips, S. Morton, and P. Micikevicius. High performance computing with CUDA. SC08, 2008.
[10] A. Glew. MLP yes! ILP no! In ASPLOS Wild and Crazy Idea Session '98, Oct. 1998.
[11] GPGPU. General-purpose computation using graphics hardware. http://www.gpgpu.org/.
[12] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. Technical Report TR-2009-003, Atlanta, GA, USA, 2009.
[13] W. Hwu and D. Kirk. ECE 498AL1: Programming massively parallel processors, Fall 2007. http://courses.ece.uiuc.edu/ece498/al1/.
[14] Intel SSE / MMX2 / KNI documentation. http://www.intel80386.com/simd/mmx2-doc.html.
[15] T. S. Karkhanis and J. E. Smith. A first-order superscalar processor model. In ISCA, 2004.
[16] Khronos. OpenCL - the open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/.
[17] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. In ASPLOS XIII, 2008.
[18] P. Michaud and A. Seznec. Data-flow prescheduling for large instruction windows in out-of-order processors. In HPCA, 2001.
[19] P. Michaud, A. Seznec, and S. Jourdan. Exploring instruction-fetch bandwidth requirement in wide-issue superscalar processors. In PACT, 1999.
[20] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40-53, March-April 2008.
[21] D. B. Noonburg and J. P. Shen. Theoretical modeling of superscalar processor performance. In MICRO-27, 1994.
[22] NVIDIA Corporation. CUDA Programming Guide, Version 2.1.
[23] M. Pharr and R. Fernando. GPU Gems 2. Addison-Wesley Professional, 2005.
[24] S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO, 2008.
[25] R. H. Saavedra-Barrera and D. E. Culler. An analytical solution for a Markov chain modeling multithreaded. Technical report, Berkeley, CA, USA, 1991.
[26] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 2008.
[27] D. J. Sorin, V. S. Pai, S. V. Adve, M. K. Vernon, and D. A. Wood. Analytic evaluation of shared-memory systems with ILP processors. In ISCA, 1998.
[28] C. A. Waring and X. Liu. Face detection using spectral histograms and SVMs. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(3):467-476, June 2005.