
Parallel programming:

Introduction to GPU architecture

Sylvain Collange
Inria Rennes – Bretagne Atlantique
sylvain.collange@inria.fr

PPAR - 2017
Outline of the course

March 6: Introduction to GPU architecture
  Parallelism and how to exploit it
  Performance models

March 13: GPU programming
  The software side
  Programming model

March 20: Performance optimization
  Possible bottlenecks
  Common optimization techniques

4 lab sessions, starting March 14-15
  Labs 1&2: computing log(2) the hard way
  Labs 3&4: Conway's Game of Life
Graphics processing unit (GPU)

Graphics rendering accelerator for computer games
  Mass market: low unit price, amortized R&D
  Increasing programmability and flexibility

Inexpensive, high-performance parallel processor
  GPUs are everywhere, from cell phones to supercomputers
  General-Purpose computation on GPU (GPGPU)

3
GPUs in high-performance computing

GPU/accelerator share in Top500 supercomputers
  In 2010: 2%
  In 2016: 17%

2016+ trend: heterogeneous multi-core processors influenced by GPUs
  #1 Sunway TaihuLight (China): 40,960 × SW26010 (4 big + 256 small cores)
  #2 Tianhe-2 (China): 16,000 × (2×12-core Xeon + 3×57-core Xeon Phi)

5
GPGPU in the future?

Yesterday (2000-2010)
  Homogeneous multi-core
  Discrete components: Central Processing Unit (CPU) + Graphics Processing Unit (GPU)

Today (2011-...)
  Chip-level integration, many embedded SoCs
  Intel Sandy Bridge
  AMD Fusion
  NVIDIA Denver/Maxwell project…

Tomorrow
  Heterogeneous multi-core
  GPUs to blend into throughput-optimized cores?
  [Diagram: heterogeneous multi-core chip combining latency-optimized cores, throughput-optimized cores, and hardware accelerators]

6
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling

7
The free lunch era... was yesterday

1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements
  Exponential performance increase
  Software compatibility preserved
  Do not rewrite software, buy a new machine!

Hennessy, Patterson. Computer Architecture, a quantitative approach. 4th Ed. 2006

8


Technology evolution

Memory wall
  Memory speed does not increase as fast as computing speed
  Harder to hide memory latency
  [Graph: compute vs. memory performance gap widening over time]

Power wall
  Power consumption of transistors does not decrease as fast as density increases
  Performance is now limited by power consumption
  [Graph: transistor density keeps rising while per-transistor power falls too slowly, so total power grows over time]

ILP wall
  Law of diminishing returns on Instruction-Level Parallelism
  Pollack's rule: cost ≃ performance²
  [Graph: cost grows quadratically with serial performance]

9
Usage changes

New applications demand parallel processing
  Computer games: 3D graphics
  Search engines, social networks… "big data" processing

New computing devices are power-constrained
  Laptops, cell phones, tablets…
  Small, light, battery-powered
  Datacenters: high power supply and cooling costs

10
Latency vs. throughput

Latency: time to solution
  Minimize time, at the expense of power
  Metric: time, e.g. seconds

Throughput: quantity of tasks processed per unit of time
  Assumes unlimited parallelism
  Minimize energy per operation
  Metric: operations / time, e.g. Gflops/s

CPU: optimized for latency
GPU: optimized for throughput

11
Amdahl's law

Bounds the speedup attainable on a parallel machine:

  S = 1 / ((1 − P) + P/N)

  S: speedup
  P: ratio of parallel portions
  N: number of processors

Total time = time to run the sequential portions (1 − P) + time to run the parallel portions (P/N).

[Graph: speedup S versus available processors N, saturating at 1/(1−P)]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.

12
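Worked example (not on the slide): with P = 95% parallel code, even N = 1000 processors give S = 1 / (0.05 + 0.95/1000) ≈ 19.6, so the 5% sequential portion caps the speedup below 20.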
Why heterogeneous architectures?

Recall Amdahl's law: S = 1 / ((1 − P) + P/N)

Latency-optimized multi-core (CPU)
  Low efficiency on parallel portions: spends too many resources

Throughput-optimized multi-core (GPU)
  Low performance on sequential portions

Heterogeneous multi-core (CPU+GPU)
  Use the right tool for the right job
  Allows aggressive optimization for latency or for throughput

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.

13
Example: System on Chip for smartphone

[Die photo: small cores for background activity, big cores for applications, a GPU, special-purpose accelerators, and lots of interfaces]

14
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling

15
The (simplest) graphics rendering pipeline

Vertices
  → Vertex shader (programmable stage)
  → Clipping, Rasterization, Attribute interpolation (parametrizable stage) — turns primitives (triangles…) into fragments
  → Fragment shader (programmable stage) — reads textures
  → Z-Compare, Blending (parametrizable stage) — against the Z-Buffer
  → Pixels, written to the Framebuffer

16
How much performance do we need

… to run 3DMark 11 at 50 frames/second?

Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G

Intel Core i7 2700K: 56 Ginsn/s peak
  We need to go 13x faster
  Make a special-purpose accelerator

Source: Damien Triolet, Hardware.fr

17
Beginnings of GPGPU

[Timeline, 2000-2010:
 Microsoft DirectX: 7.x, 8.0, 8.1, 9.0a, 9.0b, 9.0c, 10.0, 10.1, 11 — unified shaders arrive with DirectX 10
 NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100 — milestones: programmable shaders, FP 16, FP 32, dynamic control flow, SIMT, CUDA
 ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen — milestones: FP 24, CTM, FP 64, CAL
 GPGPU traction grows over the decade]

18
Today: what do we need GPUs for?

1. 3D graphics rendering for games
   Complex texture mapping, lighting computations…

2. Computer Aided Design workstations
   Complex geometry

3. GPGPU
   Complex synchronization, data movements

One chip to rule them all: find the common denominator

19
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling

20
What is parallelism?

Parallelism: independent operations whose execution can be overlapped
  Operations: memory accesses or computations

How much parallelism do I need?
  Little's law in queuing theory: L = λ × W
    λ: average customer arrival rate ← throughput
    W: average time spent ← latency
    L: average number of customers ← parallelism = throughput × latency

Units
  For memory: B = GB/s × ns
  For arithmetic: flops = Gflops/s × ns

J. Little. A proof for the queuing formula L = λW. Operations Research, 1961.

21


Throughput and latency: CPU vs. GPU

CPU memory: Core i7 4790, DDR3-1600, 2 channels
  Throughput: 25.6 GB/s, latency: 67 ns

GPU memory: NVIDIA GeForce GTX 980, GDDR5-7010, 256-bit
  Throughput: 224 GB/s, latency: 410 ns

GPU vs. CPU: throughput ×8, latency ×6 → parallelism ×56
→ Need 56 times more parallelism!

22
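Applying Little's law to these figures (a back-of-the-envelope check, not on the slide): the CPU needs about 25.6 GB/s × 67 ns ≈ 1.7 KB of memory accesses in flight, while the GPU needs 224 GB/s × 410 ns ≈ 92 KB — roughly 54× more, matching the ×56 above.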
Sources of parallelism

ILP: Instruction-Level Parallelism
  Between independent instructions in a sequential program
  Example: add r3 ← r1, r2 and mul r0 ← r0, r1 are parallel; sub r1 ← r3, r0 depends on both

TLP: Thread-Level Parallelism
  Between independent execution contexts: threads
  Example: an add in thread 1 and a mul in thread 2 are parallel

DLP: Data-Level Parallelism
  Between elements of a vector: same operation on several elements
  Example: vadd r ← a, b computes r1 = a1 + b1, r2 = a2 + b2, r3 = a3 + b3 in parallel

25
Example: X ← a×X

In-place scalar-vector product: X ← a×X

Sequential (ILP):
  for i = 0 to n-1 do:
    X[i] ← a * X[i]

Threads (TLP):
  launch n threads, each computing X[tid] ← a * X[tid]

Vector (DLP):
  X ← a * X

Or any combination of the above (see the CUDA sketch below)

26
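As a preview of the March 13 lecture, here is a minimal CUDA sketch of the TLP variant; the kernel name scal and the launch configuration are illustrative choices, not from the slides.

// Each thread scales one element: X[tid] ← a * X[tid]
__global__ void scal(float a, float *X, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid < n)                                      // guard: n need not be a multiple of the block size
        X[tid] = a * X[tid];
}

// Host-side launch, one thread per element, 256 threads per block:
//   scal<<<(n + 255) / 256, 256>>>(a, d_X, n);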
Uses of parallelism

"Horizontal" parallelism, for throughput
  More units working in parallel

"Vertical" parallelism, for latency hiding
  Pipelining: keep units busy when waiting for dependencies, memory

[Diagram: operations A, B, C, D run side-by-side across units (horizontal) or staggered down a pipeline over cycles 1-4 (vertical)]

27
How to extract parallelism?

        Horizontal         Vertical
ILP     Superscalar        Pipelined
TLP     Multi-core, SMT    Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT        Vector / temporal SIMT

We have seen the first row: ILP
We will now review techniques for the next rows: TLP, DLP

28
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling

29
Sequential processor

for i = 0 to n-1
    X[i] ← a * X[i]
Source code

move i ← 0
loop:
    load t ← X[i]
    mul t ← a×t
    store X[i] ← t
    add i ← i+1
    branch i<n? loop
Machine code

[Pipeline diagram of a sequential CPU: Fetch, Decode, Execute, Memory stages, with add i ← 18 being fetched, store X[17] decoded, and mul executed]

Focuses on instruction-level parallelism
  Exploits ILP: vertically (pipelining) and horizontally (superscalar)

30
The incremental approach: multi-core

Several processors on a single chip, sharing one memory space

[Die photo: Intel Sandy Bridge. Source: Intel]

Area: benefits from Moore's law

Power: extra cores consume little when not in use
  e.g. Intel Turbo Boost

31
Homogeneous multi-core

Horizontal use of thread-level parallelism

[Diagram: two cores, each with its own Fetch, Decode, Execute and Load-Store stages, sharing memory; thread T0 runs add i ← 18, store X[17], mul on core 1 while thread T1 runs add i ← 50, store X[49], mul on core 2]

Improves peak throughput

32
Example: Tilera Tile-GX

Grid of (up to) 72 tiles
Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz

[Diagram: 9×8 grid of tiles, from tile (1,1) to tile (9,8)]

33
Interleaved multi-threading

Vertical use of thread-level parallelism

[Diagram: four threads T0-T3 time-share one pipeline (Fetch, Decode, Execute, Memory, load-store unit); in flight: mul, add i ← 73, add i ← 50, load X[89], store X[72], load X[17], store X[49]]

Hides latency thanks to explicit parallelism
  Improves achieved throughput

34
Example: Oracle Sparc T5

16 cores / chip
Core: out-of-order superscalar, 8 threads
15 pipeline stages, 3.6 GHz

[Diagram: cores 1-16, each interleaving threads 1-8]

35
Clustered multi-core

For each individual unit, select between
  Horizontal replication
  Vertical time-multiplexing

[Diagram: threads T0, T1 map to cluster 1 and T2, T3 to cluster 2; clusters replicate execution units but time-share the front-end (Fetch, Decode) and the load-store unit; in flight: br, mul, store, add i ← 73, add i ← 50, load X[89], store X[72], load X[17], store X[49]]

Examples
  Sun UltraSparc T2, T3
  AMD Bulldozer
  IBM Power 7

Area-efficient tradeoff
Blurs boundaries between cores

36
Implicit SIMD

Factorization of fetch/decode, load-store units
  Fetch 1 instruction on behalf of several threads
  Read 1 memory location and broadcast to several registers

[Diagram: threads T0-T3 share one Fetch/Decode/Memory path; their four mul instructions (0)-(3) execute together]

In NVIDIA-speak
  SIMT: Single Instruction, Multiple Threads
  Convoy of synchronized threads: warp

Extracts DLP from multi-threaded applications

37
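In CUDA terms (a hedged sketch; the kernel name is illustrative): consecutive threads of a block are grouped into warps, so one fetched instruction drives 32 lanes at once.

// Each group of 32 consecutive threads in a block forms one warp,
// sharing a single instruction stream.
__global__ void whoami(int *warp_id) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_id[tid] = threadIdx.x / warpSize;   // warpSize is 32 on current NVIDIA GPUs
}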
Explicit SIMD

Single Instruction Multiple Data
Horizontal use of data-level parallelism

loop:
    vload T ← X[i]
    vmul T ← a×T
    vstore X[i] ← T
    add i ← i+4
    branch i<n? loop
Machine code

[Diagram: SIMD CPU pipeline, with add i ← 20 fetched, vstore X[16..19] decoded, vmul executed]

Examples
  Intel MIC (16-wide)
  AMD GCN GPU (16-wide × 4-deep)
  Most general-purpose CPUs (4-wide to 8-wide)

38
Quiz: link the words

Parallelism                       Architectures
  ILP                               Superscalar processor
  TLP                               Homogeneous multi-core
  DLP                               Multi-threaded core
Use                                 Clustered multi-core
  Horizontal: more throughput       Implicit SIMD
  Vertical: hide latency            Explicit SIMD

39
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling

42
Example CPU: Intel Core i7

Is a wide superscalar, but also has
  Multiple cores
  Multiple threads per core
  SIMD units

Up to 117 operations/cycle from 8 threads

[Die photo: 4 CPU cores, each a wide superscalar with Simultaneous Multi-Threading (2 threads) and 256-bit SIMD units (AVX)]

44
Example GPU: NVIDIA GeForce GTX 980

SIMT: warps of 32 threads
16 SMs / chip
4×32 cores / SM, 64 warps / SM

[Diagram: each SM schedules its 64 warps over four 32-core clusters (128 cores) over time; 16 such SMs side by side]

4612 Gflop/s
Up to 32,768 threads in flight

45
Taxonomy of parallel architectures

        Horizontal          Vertical
ILP     Superscalar / VLIW  Pipelined
TLP     Multi-core, SMT     Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT         Vector / temporal SIMT

46
Classification: multi-core

                    ILP   TLP (horizontal × vertical)      DLP
Intel Haswell       8     4 cores × 2 (Hyper-Threading)    8 (AVX)
Fujitsu SPARC64 X   8     16 cores × 2 threads             2
IBM Power 8         10    12 cores × 8 threads
Oracle Sparc T5     2     16 cores × 8 threads

General-purpose multi-cores: balance ILP, TLP and DLP
Sparc T: focus on TLP

47
How to read the table

Given an application with known ILP, TLP, DLP
  How much throughput / latency hiding can I expect?
  For each cell, take the minimum of existing parallelism and hardware capability
  The column-wise product gives throughput / latency hiding

Example: sequential code (ILP = 10, no TLP, no DLP) on Haswell:

        App   Horizontal       Vertical
ILP     10    min(8, 10) = 8
TLP     1     min(4, 1) = 1    min(2, 1) = 1
DLP     1     min(8, 1) = 1

Max throughput = 8×1×1 = 8 for this application
Peak throughput = 8×4×8 = 256
→ Efficiency: ~3%

48
Classification: GPU and many small-core

                 ILP   TLP (horizontal × vertical)     DLP (horizontal × vertical)
Intel MIC        2     60 cores × 4 (multi-threading)  16 (SIMD)
Nvidia Kepler    2     16×4 (cores × units) × 32       32 (SIMT)
AMD GCN                20×4 (cores × units) × 40       16 × 4

GPU: focus on DLP and TLP, horizontal and vertical

                 ILP   TLP
Tilera Tile-GX   3     72 (cores)
Kalray MPPA-256  5     17×16 (clusters × cores)

Many small-core: focus on horizontal TLP

49
Takeaway

Parallelism for throughput and latency hiding
Types of parallelism: ILP, TLP, DLP
All modern processors exploit the 3 kinds of parallelism
GPUs focus on thread-level and data-level parallelism

50
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
51
What is inside a graphics card?

NVIDIA GeForce GTX 980 Maxwell GPU. Artistic rendering!

52
External memory: discrete GPU

Classical CPU-GPU model
  Split memory spaces
  Need to transfer data explicitly
  Highest bandwidth from GPU memory
  Transfers to main memory are slower

[Diagram: motherboard with CPU and 16 GB main memory at 26 GB/s, connected through PCI Express (16 GB/s) to a graphics card with GPU and 4 GB graphics memory at 224 GB/s]

Example configuration: Intel Core i7 4790, Nvidia GeForce GTX 980

We will assume this model for CUDA programming (a sketch follows)

54
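A minimal sketch of the explicit-transfer model in CUDA, reusing the hypothetical scal kernel from earlier; the function name and array names are illustrative.

// Assumes the scal kernel defined above; h_X is a host array of n floats.
void scale_on_gpu(float a, float *h_X, int n) {
    size_t bytes = n * sizeof(float);
    float *d_X;
    cudaMalloc(&d_X, bytes);                              // allocate in graphics memory
    cudaMemcpy(d_X, h_X, bytes, cudaMemcpyHostToDevice);  // crosses PCI Express: ~16 GB/s
    scal<<<(n + 255) / 256, 256>>>(a, d_X, n);            // kernel accesses graphics memory: ~224 GB/s
    cudaMemcpy(h_X, d_X, bytes, cudaMemcpyDeviceToHost);  // results back over PCI Express
    cudaFree(d_X);
}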
External memory: embedded GPU

Most GPUs today are integrated (System on Chip)
  Same physical memory
  May support memory coherence: the GPU can read directly from CPU caches
  More contention on external memory

[Diagram: CPU and GPU on one chip, sharing a cache and 8 GB of main memory at 26 GB/s]

55
GPU high-level organization

Processing units
  Streaming Multiprocessors (SM) in Nvidia jargon, Compute Unit (CU) in AMD's
  Closest equivalent to a CPU core
  Today: from 1 to 20 SMs in a GPU

Memory system: caches
  Keep frequently-accessed data
  Reduce throughput demand on main memory
  Managed by hardware (L1, L2) or software (shared memory)

[Diagram: GPU chip with SMs, each holding an L1 cache and shared memory (1 MB aggregate, ~2 TB/s), connected through a crossbar to L2 slices (6 MB total) and global memory at 290 GB/s]

56
GPU processing unit organization

Each SM is a highly multi-threaded processor
  Today: 24 to 48 warps of 32 threads each
  → ~1K threads on each SM, ~10K threads on a GPU

[Diagram: each SM holds the registers of its threads (grouped into warps), execution units, shared memory and an L1 cache; SMs connect to the L2 cache and external memory]

57
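These figures can be queried at run time; a small hedged sketch using the CUDA runtime API (device 0 chosen arbitrarily):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // properties of GPU 0
    printf("SMs:              %d\n", prop.multiProcessorCount);
    printf("warp size:        %d threads\n", prop.warpSize);
    printf("max threads / SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("global memory:    %zu MB\n", prop.totalGlobalMem >> 20);
    return 0;
}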
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling

64
First-order performance model

Questions you should ask yourself before starting to code or optimize:
  Will my code run faster on the GPU?
  Is my existing code running as fast as it should?
  Is performance limited by computations or memory bandwidth?

Pencil-and-paper calculations can (often) answer such questions

65
Performance: metrics and definitions

Optimistic evaluation: upper bound on performance
  Assume perfect overlap of computations and memory accesses

Memory accesses: bytes
  Only external memory, not caches or registers

Computations: flops
  Only "useful" computations (usually floating-point), not address calculations or loop iterators

Arithmetic intensity: flops / byte = computations / memory accesses
  Property of the code

Arithmetic throughput: flops / s
  Property of code + architecture

66
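Worked example (not on the slide): the earlier X ← a×X kernel performs 1 useful flop per element while reading and writing one 4-byte float each, so its arithmetic intensity is 1 flop / 8 B = 0.125 flop/B — even lower than the dot product analyzed below.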
The roofline model

How much performance can I get for a given arithmetic intensity?
  Upper bound on arithmetic throughput, as a function of arithmetic intensity
  Property of the architecture

[Plot: arithmetic throughput (Gflops/s) versus arithmetic intensity (flops/byte); a bandwidth-limited slope on the left (memory-bound) meets a flat compute peak on the right (compute-bound)]

S. Williams, A. Waterman, D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 2009.

70
Building the machine model

Compute or measure:
  Peak memory throughput — GTX 980: 224 GB/s
  Ideal arithmetic intensity = peak compute throughput / memory throughput
    GTX 980: 4612 (Gflop/s) / 224 (GB/s) = 20.6 flop/B
    × 4 (B/word) = 82 flop/word (dimensionless)
  Beware of units: float = 4 B, double = 8 B!

[Plot: roofline for the GTX 980 — arithmetic throughput (Gflop/s) flat at 4612 Gflop/s past the ridge point of 20.6 flop/B]

Achievable peaks may be lower than theoretical peaks
  Lower curves when adding realistic constraints

71
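The roofline bound itself is one line of arithmetic; a hedged helper (GTX 980 numbers from the slide above, single precision):

// Attainable throughput is capped both by the compute peak and by
// memory bandwidth × arithmetic intensity (the two roofline segments).
double roofline_gflops(double intensity_flop_per_byte,
                       double peak_gflops, double mem_gbps) {
    double memory_bound = mem_gbps * intensity_flop_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

// e.g. roofline_gflops(0.25, 4612.0, 224.0) → 56 Gflop/s,
// the dot-product bound computed on the next slides.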
Using the model

Compute arithmetic intensity, measure performance of program
  Identify bottleneck: memory or computation
  Take optimization decision

[Plot: measured performance sits below the roofline; left of the ridge point, optimize memory accesses; right of it, optimize computation; reusing data moves the code right by raising arithmetic intensity]

72
Example: dot product

for i = 1 to n
    r += a[i] * b[i]

How many computations?
How many memory accesses?
Arithmetic intensity?
Compute-bound or memory-bound?
How many Gflop/s on a GTX 980 GPU?
  With data in GPU memory?
  With data in CPU memory?
How many Gflop/s on an i7 4790 CPU?

GTX 980: 4612 Gflop/s, 224 GB/s
i7 4790: 460 Gflop/s, 25.6 GB/s
PCIe link: 16 GB/s

73
Example: dot product

for i = 1 to n
    r += a[i] * b[i]

How many computations? → 2n flops
How many memory accesses? → 2n words
Arithmetic intensity? → 1 flop/word = 0.25 flop/B
Compute-bound or memory-bound? → highly memory-bound
How many Gflop/s on a GTX 980 GPU?
  With data in GPU memory: 224 GB/s × 0.25 flop/B → 56 Gflop/s
  With data in CPU memory: 16 GB/s × 0.25 flop/B → 4 Gflop/s
How many Gflop/s on an i7 4790 CPU?
  25.6 GB/s × 0.25 flop/B → 6.4 Gflop/s

Conclusion: don't bother porting to GPU!

GTX 980: 4612 Gflop/s, 224 GB/s
i7 4790: 460 Gflop/s, 25.6 GB/s
PCIe link: 16 GB/s

74
Takeaway

Result of many tradeoffs
  Between locality and parallelism
  Between core complexity and interconnect complexity

GPU optimized for throughput
  Exploits primarily DLP, TLP
  Energy-efficient on parallel applications with regular behavior

CPU optimized for latency
  Exploits primarily ILP
  Can use TLP and DLP when available

Performance models
  Back-of-the-envelope calculations and common sense can save time

Next time: GPU programming in CUDA

75
