PPAR 2017: GPU 1
Sylvain Collange
Inria Rennes – Bretagne Atlantique
[email protected]
PPAR - 2017
Outline of the course
2016+ trend: heterogeneous multi-core processors influenced by GPUs
GPGPU in the future?
Yesterday (2000-2010)
Homogeneous multi-core
Discrete components: Central Processing Unit (CPU), Graphics Processing Unit (GPU)
Today (2011-...)
Chip-level integration: CPU and GPU on the same chip
Many embedded SoCs, Intel Sandy Bridge, AMD Fusion, NVIDIA Denver/Maxwell project…
Tomorrow
Heterogeneous multi-core chip: latency-optimized cores, throughput-optimized cores, hardware accelerators
GPUs to blend into throughput-optimized cores?
Outline
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
High-level performance modeling
The free lunch era... was yesterday
1980s to 2002: Moore's law, Dennard scaling, micro-architecture improvements
Exponential increase in serial performance
Software compatibility preserved
Usage changes
Latency vs. throughput
Amdahl's law

    S = 1 / ((1 − P) + P / N)

S: speedup
P: ratio of parallel portions
N: number of processors
Time to run sequential portions: 1 − P; time to run parallel portions: P / N
G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.
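The formula above is easy to evaluate numerically. A minimal sketch (the function name `speedup` is mine, not from the slides):

```python
def speedup(p, n):
    """Amdahl's law: speedup for a fraction p of parallel work
    spread over n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with many processors, the sequential fraction dominates:
for p in (0.5, 0.9, 0.99):
    print(f"p={p}: S(16) = {speedup(p, 16):.1f}, S(inf) = {1 / (1 - p):.0f}")
```

With p = 0.9, a thousand processors still give less than a 10× speedup: the sequential 10% caps S at 1/(1−P).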
Why heterogeneous architectures?

    S = 1 / ((1 − P) + P / N)

The sequential portions (1 − P) want a fast latency-optimized core; the parallel portions (P / N) want many throughput-optimized cores.
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
Example: System on Chip for smartphone
Big cores for applications
Small cores for background activity
GPU
Special-purpose accelerators
Lots of interfaces
The (simplest) graphics rendering pipeline
Vertices → Vertex shader → Primitives (triangles…) → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare, Blending (Z-Buffer) → Pixels → Framebuffer
Programmable stages: Vertex shader, Fragment shader
Parametrizable stages: Clipping/Rasterization/Attribute interpolation, Z-Compare/Blending
How much performance do we need?
Source: Damien Triolet, Hardware.fr
Beginnings of GPGPU
Timeline 2000 → 2010:
Microsoft DirectX: 7.x, 8.0, 8.1, 9.0a, 9.0b, 9.0c, 10.0, 10.1, 11
NVIDIA GPUs: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100
Features: FP16 → programmable shaders → FP32 → dynamic control flow → unified shaders, SIMT, CUDA → GPGPU traction
Today: what do we need GPUs for?
3. GPGPU: complex synchronization, data movements
One chip to rule them all: find the common denominator
What is parallelism?
Parallelism: independent operations whose execution can be overlapped
Operations: memory accesses or computations
Units
For memory: B = GB/s × ns
For arithmetic: flops = Gflops/s × ns
Example: throughput ×8, latency ×6 (67 ns → 410 ns)
→ Need 56 times more parallelism!
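The units above encode Little's law: parallelism in flight = throughput × latency. A quick sketch pairing the 410 ns latency with the GTX 980 memory bandwidth quoted later in the course (my pairing of numbers, for illustration only):

```python
# Little's law: work in flight = throughput x latency
bandwidth_gb_s = 224   # peak memory throughput (GeForce GTX 980)
latency_ns = 410       # memory latency from the example above

# GB/s x ns = bytes that must be in flight to sustain peak bandwidth
bytes_in_flight = bandwidth_gb_s * latency_ns
accesses_in_flight = bytes_in_flight // 4   # concurrent 4-byte accesses
print(bytes_in_flight, accesses_in_flight)  # 91840 22960
```

Tens of thousands of concurrent accesses: this is why GPUs need massive parallelism.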
Sources of parallelism
ILP: Instruction-Level Parallelism, between independent instructions of the same thread
TLP: Thread-Level Parallelism, between independent execution contexts: threads (e.g. Thread 1 runs an add while Thread 2 runs a mul)
DLP: Data-Level Parallelism, between elements of a vector
Example: X ← a×X
Vector (DLP) version: X ← a * X as one vector operation
Uses of parallelism
“Horizontal” parallelism for throughput: more units working in parallel
“Vertical” parallelism for latency hiding: pipelining keeps units busy when waiting for dependencies, memory
How to extract parallelism?
Sequential processor

Source code:
    for i = 0 to n-1
        X[i] ← a * X[i]

Machine code:
    move i ← 0
loop:
    load t ← X[i]
    mul t ← a × t
    store X[i] ← t
    add i ← i+1
    branch i<n? loop

A sequential CPU runs one instruction through each pipeline stage at a time (e.g. add i ← 18 in Fetch, store X[17] in Decode, mul in Execute, with the Memory stages accessing memory).
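The source and machine code above implement the same scalar loop; as a runnable sketch in Python:

```python
def scale_in_place(x, a):
    """Scalar X <- a * X: one element per iteration, fully sequential."""
    for i in range(len(x)):   # move i <- 0; add i <- i+1; branch i<n?
        x[i] = a * x[i]       # load, mul, store
    return x

print(scale_in_place([1.0, 2.0, 3.0], 2.0))  # [2.0, 4.0, 6.0]
```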
The incremental approach: multi-core
Several processors on a single chip, sharing one memory space
Source: Intel
Homogeneous multi-core
Horizontal use of thread-level parallelism
Threads T0, T1, … each run on their own core; fetch, decode, execute and load-store units are replicated per core (e.g. T0 at iteration i = 18 while T1 is at i = 50)
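Horizontal TLP applied to the loop from before: split the iteration space, one chunk per thread. A sketch with Python's threading module (in CPython the GIL prevents a real arithmetic speedup, so this shows the partitioning, not the performance):

```python
import threading

def scale_chunk(x, a, lo, hi):
    # Each thread scales its own slice: iterations are independent,
    # so no synchronization is needed inside the loop
    for i in range(lo, hi):
        x[i] = a * x[i]

def parallel_scale(x, a, num_threads=4):
    n = len(x)
    bounds = [n * t // num_threads for t in range(num_threads + 1)]
    threads = [threading.Thread(target=scale_chunk,
                                args=(x, a, bounds[t], bounds[t + 1]))
               for t in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return x

print(parallel_scale(list(range(8)), 10))  # [0, 10, 20, 30, 40, 50, 60, 70]
```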
Example: Tilera Tile-GX
Grid of identical tiles, Tile (1,1) … Tile (9,8): each tile a full core connected to an on-chip network
Interleaved multi-threading
Vertical use of thread-level parallelism
The pipeline cycles through threads (Thread 1, Thread 2, …, Thread 8): each cycle, an instruction from a different thread enters Fetch
Area-efficient tradeoff
Blurs boundaries between cores
Implicit SIMD
Factorization of fetch/decode and load-store units
Fetch 1 instruction on behalf of several threads
Read 1 memory location and broadcast to several registers
Example: one mul is fetched and decoded once, then executed by threads 0-3 in lockstep
In NVIDIA-speak
SIMT: Single Instruction, Multiple Threads
Convoy of synchronized threads: warp
Extracts DLP from multi-threaded applications
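The SIMT idea fits in a few lines: one fetched instruction drives every lane of a warp (warp width 4 here for readability; real NVIDIA warps are 32 threads):

```python
WARP_SIZE = 4

def warp_execute(op, lanes):
    """One instruction fetch/decode, executed by every thread's lane."""
    return [op(r) for r in lanes]   # lanes run in lockstep

# Four threads of a warp each hold one element of X in a register:
a = 2.0
regs = [1.0, 2.0, 3.0, 4.0]
regs = warp_execute(lambda t: a * t, regs)  # a single "mul" for the warp
print(regs)  # [2.0, 4.0, 6.0, 8.0]
```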
Explicit SIMD
Single Instruction, Multiple Data
Horizontal use of data-level parallelism

Machine code:
loop:
    vload T ← X[i]
    vmul T ← a × T
    vstore X[i] ← T
    add i ← i+4
    branch i<n? loop

A SIMD CPU pipelines vector instructions (e.g. add i ← 20 in Fetch, vstore X[16..19] in Decode, vmul in Execute).
Examples
Intel MIC (16-wide)
AMD GCN GPU (16-wide × 4-deep)
Most general-purpose CPUs (4-wide to 8-wide)
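The vectorized machine code above maps to a strided loop over 4-element slices; in this sketch, Python slices stand in for vector registers (assuming, like the machine code, that n is a multiple of the vector width):

```python
def simd_scale(x, a, width=4):
    """X <- a * X, processing `width` elements per 'vector instruction'."""
    assert len(x) % width == 0
    for i in range(0, len(x), width):   # add i <- i+4
        t = x[i:i + width]              # vload
        t = [a * v for v in t]          # vmul
        x[i:i + width] = t              # vstore
    return x

print(simd_scale([1, 2, 3, 4, 5, 6, 7, 8], 3))  # [3, 6, 9, 12, 15, 18, 21, 24]
```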
Quiz: link the words
Parallelism: ILP, TLP, DLP
Use: horizontal (more throughput), vertical (hide latency)
Architectures: superscalar processor, homogeneous multi-core, multi-threaded core, clustered multi-core, implicit SIMD, explicit SIMD
Example CPU: Intel Core i7
4 CPU cores
Wide superscalar
Simultaneous Multi-Threading: 2 threads
256-bit SIMD units: AVX
Example GPU: NVIDIA GeForce GTX 980
SIMT: warps of 32 threads
16 SMs / chip
4×32 cores / SM, 64 warps / SM
4612 Gflop/s
Up to 32768 threads in flight
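The totals above follow directly from the per-SM figures:

```python
sms = 16           # streaming multiprocessors per chip
warps_per_sm = 64  # resident warps per SM
warp_size = 32     # threads per warp (SIMT)

cores = sms * 4 * 32                              # 4x32 cores per SM
threads_in_flight = sms * warps_per_sm * warp_size
print(cores, threads_in_flight)                   # 2048 32768
```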
Taxonomy of parallel architectures
Classify by source of parallelism (ILP, TLP, DLP) and by use (horizontal, vertical)
Classification: multi-core

                Intel Haswell          Fujitsu SPARC64 X
             Horizontal  Vertical   Horizontal  Vertical
    ILP          8                      8
    TLP          4     ×     2         16     ×     2
    DLP          8                      2

ILP: superscalar width; TLP: cores × threads/core (Hyperthreading); DLP: SIMD width (AVX)
General-purpose multi-cores balance ILP, TLP and DLP
How to read the table
The hardware can exploit no more parallelism than the code offers: usable parallelism = min(hardware, application).
Example: sequential code (ILP 10, no TLP, no DLP) on Intel Haswell:
    ILP: min(8, 10) = 8
    TLP: min(4, 1) = 1
    DLP: min(8, 1) = 1
Parallel code example: ILP 2, TLP 60, DLP 16.
Takeaway
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
What is inside a graphics card?
External memory: discrete GPU
Example configuration: Intel Core i7 4790 + NVIDIA GeForce GTX 980
Main memory: 8 GB
GPU high-level organization
Processing units: Streaming Multiprocessors (SM) in NVIDIA jargon, Compute Units (CU) in AMD's
Closest equivalent to a CPU core
Today: from 1 to 20 SMs in a GPU
Each SM has an L1 cache and shared memory (SMem): ~2 TB/s aggregate bandwidth, 1 MB aggregate capacity
Memory system: caches, connected to the SMs through a crossbar
GPU processing unit organization
Each SM is a highly multi-threaded processor
Today: 24 to 48 warps of 32 threads each
→ ~1K threads on each SM, ~10K threads on a GPU
Each SM contains execution units, registers, shared memory and an L1 cache; SMs connect to the L2 cache / external memory
First-order performance model
Performance: metrics and definitions
The roofline model
How much performance can I get for a given arithmetic intensity?
Upper bound on arithmetic throughput, as a function of arithmetic intensity
Property of the architecture
Plot: arithmetic throughput (Gflop/s) against arithmetic intensity (flop/byte): a bandwidth slope on the left (memory-bound), a flat compute ceiling on the right (compute-bound)
Compute or measure:
Peak memory throughput. GTX 980: 224 GB/s
Ideal arithmetic intensity = peak compute throughput / memory throughput
GTX 980: 4612 (Gflop/s) / 224 (GB/s) = 20.6 flop/B
× 4 (B/float) = 82 flops per float accessed (dimensionless)
Beware of units: float = 4 B, double = 8 B!
Below the ridge point, optimize memory accesses; above it, optimize computation
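The whole roofline is one min(): attainable throughput is bounded by the compute peak and by bandwidth × intensity. A sketch with the GTX 980 figures above:

```python
PEAK_GFLOPS = 4612   # GeForce GTX 980 peak arithmetic throughput
PEAK_GB_S = 224      # peak memory throughput

def roofline(intensity_flop_per_byte):
    """Attainable Gflop/s for a kernel of the given arithmetic intensity."""
    return min(PEAK_GFLOPS, PEAK_GB_S * intensity_flop_per_byte)

ridge = PEAK_GFLOPS / PEAK_GB_S       # ~20.6 flop/B
print(roofline(1), roofline(100))     # 224 4612: memory-bound vs compute-bound
```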
Example: dot product
for i = 1 to n
r += a[i] * b[i]
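Where does this kernel land on the roofline? Each iteration does 2 flops (one multiply, one add) and reads 8 bytes (two single-precision floats), far below the GTX 980 ridge point, so the dot product is memory-bound. A quick check (my arithmetic, using the figures from the roofline slide):

```python
flops_per_iter = 2       # one mul + one add
bytes_per_iter = 2 * 4   # load a[i] and b[i], single precision

intensity = flops_per_iter / bytes_per_iter   # flop/B
attainable = min(4612, 224 * intensity)       # roofline: min(peak, BW x intensity)
print(intensity, attainable)                  # 0.25 56.0
```

56 Gflop/s out of a 4612 Gflop/s peak: the compute units are idle most of the time, waiting on memory.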