GPU Architectures: A CPU Perspective

Goals
- Data Parallelism: What is it, and how to exploit it?
  - Workload characteristics
- Data Parallel Execution on GPUs
Data Parallelism, Programming Models, SIMT
Graphics Workloads
- Identical, Independent, Streaming computation on pixels

[Figure: the same function =() maps each input pixel (7,0), (6,0), (5,0), (4,0) to an output pixel (0,7), (1,7), (2,7), (3,7)]
Naïve Approach
- Split independent work over multiple processors

[Figure: pixels (0,7)-(3,7) computed on CPU0-CPU3, one function application per CPU]
[Figure: MIMD execution: four CPUs (CPU0-CPU3), each with its own Fetch, Decode, Execute, Memory, and Writeback pipeline, each running its own copy of the program on its own pixel]
When work is identical (same program): Single Program Multiple Data (SPMD)

[Figure: the four CPUs now share a single program, but each CPU still has its own full Fetch/Decode/Execute/Memory/Writeback pipeline]
[Figure: SIMD execution: one shared Fetch/Decode frontend feeding four Execute/Memory/Writeback lanes, processing pixels (3,7)...(7,0)-(4,0) under a single program]
[Figure: the same SIMD pipeline, with a shared Register File across the four lanes]
[Figure: SIMT execution: one Fetch/Decode frontend and four Execute/Memory/Writeback lanes running a wavefront (WF0) over pixels (1,7)-(3,7), (7,0)-(4,0)]
Terminology Headache #1
- Multiple independent threads
- SIMD/Vector
- SIMT
Example Architectures
- MIMD/SPMD: Multicore CPUs
  - Pros: More general: supports TLP
- SIMD/Vector: x86 SSE/AVX
  - Cons: Gather/Scatter can be awkward
- SIMT: GPUs
  - Cons: Divergence kills performance
- OoO/Dynamic Scheduling: needs ILP
- Multicore/Multithreading/SMT: needs independent threads
[Figure: a GPU Device contains multiple GPU Cores]

Hardware terminology (name / alias):
- Processing Element (Lane)
- SIMD Unit (Pipeline)
- Compute Unit (Core)
- GPU Device (Device)
GPU Programming Models
OpenCL
OpenCL
- Early CPU languages were light abstractions of physical hardware
  - E.g., C
- OpenCL similarly abstracts GPU hardware:

GPU Architecture -> OpenCL Model
- GPU Device -> NDRange
- GPU Core -> Workgroup
- (within a workgroup: Wavefronts and Work-items)
NDRange
- N-Dimensional (N = 1, 2, or 3) index space
- Partitioned into workgroups, wavefronts, and work-items

[Figure: an NDRange divided into Workgroups]
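Each work-item can query its position in this index space; a small illustrative sketch (the kernel itself is hypothetical, the `get_*` calls are standard OpenCL C):

```opencl
// Illustrative kernel: how a work-item locates itself in a 1D NDRange.
__kernel void where_am_i(__global int *out)
{
    int gid = get_global_id(0);   // position in the whole NDRange
    int wg  = get_group_id(0);    // which workgroup this work-item is in
    int lid = get_local_id(0);    // position within that workgroup
    // These identities relate by: gid == wg * get_local_size(0) + lid
    out[gid] = wg * (int)get_local_size(0) + lid;   // same value as gid
}
```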
Kernel
- Run an NDRange on a kernel (i.e., a function)
- Same kernel executes for each work-item
- Smells like MIMD/SPMD... but beware, it's not!

[Figure: the kernel =() applied to each work-item (0,7)-(3,7), with work-items grouped into Workgroups]
OpenCL Code
__kernel
void flip_and_recolor(__global float3 *in_image,
                      __global float3 *out_image,
                      int img_dim_x, int img_dim_y)
{
    int x = get_global_id(0); // work-item id in dimension 0
    int y = get_global_id(1); // work-item id in dimension 1
    // Mirror both coordinates (the -1 keeps indices in bounds);
    // OpenCL buffers are flat, so 2D indices are linearized by hand
    out_image[(img_dim_x - 1 - x) * img_dim_y + (img_dim_y - 1 - y)] =
        recolor(in_image[x * img_dim_y + y]);
}
GPU Microarchitecture
AMD Graphics Core Next

[Figure: a GPU device with multiple GPU Cores sharing an L2 Cache; each GPU Core contains several SIMT units, an L1 Cache, and Local Memory]
[Figure: a Workgroup is assigned to a GPU Core containing four SIMT units, an L1 Cache, and Local Memory]
[Figure: wavefront scheduling timeline over cycles 1-12: each SIMT unit (SIMT0-SIMT3) interleaves instructions from several resident wavefronts (e.g., WF1_0..WF1_3, WF5_0..WF5_3, WF9_0.. on SIMT0), round-robin]
[Figure: a large Register File per SIMT unit: separate registers for each resident wavefront]
Address Coalescing
- Wavefront: issues 64 memory requests
- Common case: work-items in same wavefront touch same cache block
- Coalescing: merge many work-items' requests into a single cache-block request
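The difference between a coalescable and a non-coalescable pattern shows up directly in kernel indexing; a hypothetical sketch:

```opencl
// Adjacent work-items read adjacent floats: a wavefront's 64 requests fall
// into a handful of cache blocks and coalesce into a few memory accesses.
__kernel void coalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = 2.0f * in[i];
}

// Adjacent work-items read floats `stride` apart: with a large stride, each
// request touches a different cache block and nothing can be merged.
__kernel void strided(__global const float *in, __global float *out, int stride)
{
    int i = get_global_id(0);
    out[i] = 2.0f * in[i * stride];
}
```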
GPU Memory
- GPUs have caches.
GPU (GCN)
[Table: cache capacity per resident work-item: a GPU core's 16KB L1 serves 2,560 resident work-items (about 6.4 bytes of L1 each); the chip's 768KB L2 serves 81,920 work-items (about 9.6 bytes of L2 each); other listed capacities: 16KB, 16KB, 8MB, 1MB]
GPU Caches
- Maximize throughput, not hide latency
- Not there for either spatial or temporal locality
Scratchpad Memory
- GPUs have scratchpads (Local Memory)
  - Allocated to a workgroup, i.e., shared by wavefronts in workgroup
- Software must:
  - Rename addresses
  - Manage capacity: manual fill/eviction

[Figure: four SIMT units alongside the L1 Cache and Local Memory]
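A minimal sketch of explicit scratchpad management (not from the slides; assumes a workgroup size of 64): each workgroup stages a tile of the input into Local Memory, synchronizes, then reads neighbors out of the fast scratchpad instead of global memory.

```opencl
__kernel void blur_1d(__global const float *in, __global float *out)
{
    __local float tile[64];            // scratchpad, sized to the workgroup
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];               // manual fill
    barrier(CLK_LOCAL_MEM_FENCE);      // wait until the whole tile is loaded

    // Interior work-items average with neighbors from the scratchpad
    if (lid > 0 && lid < 63)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    else
        out[gid] = tile[lid];
}
```

Capacity management here is simply sizing `tile` to fit; there is no hardware eviction to fall back on.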
A Rose by Any Other Name
AMD/OpenCL               NVIDIA/CUDA                Generic
Processing Element       CUDA Processor             Lane
SIMD Unit                CUDA Core                  Pipeline
Compute Unit             Streaming Multiprocessor   GPU Core
GPU Device               GPU Device                 Device
OpenCL/AMD    Hennessy & Patterson               CUDA
Work-item     Sequence of SIMD Lane Operations   Thread
Wavefront     Thread of SIMD Instructions        Warp
Workgroup     Body of vectorized loop            Block
NDRange       Vectorized loop                    Grid
Recap
- Data Parallelism: Identical, Independent work over multiple data inputs
- GPU version: add streaming access pattern
Advanced Topics
GPU Limitations, Future of GPGPU
if (x <= 0)
    y = 0;
else
    y = x;

Branch Divergence
- When control flow diverges, all lanes take all paths
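For this particular branch, a common remedy is a branch-free rewrite; a sketch (not from the slides):

```opencl
__kernel void relu(__global const float *x, __global float *y)
{
    int i = get_global_id(0);
    // Equivalent to: if (x[i] <= 0) y[i] = 0; else y[i] = x[i];
    // fmax() issues the same instruction on every lane, so no lanes are
    // masked off and the wavefront never walks both sides of a branch.
    y[i] = fmax(x[i], 0.0f);
}
```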
Beware!
- Divergence isn't just a performance problem:

// acquire lock
while (test&set(lock, 1) == false) {
    // spin
}
// lock acquired; critical section follows

- If one lane acquires the lock while the other lanes in its wavefront keep spinning, the wavefront can deadlock: the winning lane cannot proceed until the spinning lanes do.
Memory Bandwidth
- Memory divergence

[Figure: SIMT lanes (Lane 0-3) feeding DRAM banks (Bank 0-3): a parallel access pattern hits all four banks at once; a sequential access pattern serializes on the banks]
Memory Divergence
- One work-item stalls -> entire wavefront must stall
- Cause: bank conflicts, cache misses
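The classic padding trick illustrates how bank conflicts arise and how to sidestep them; a sketch (not from the slides; assumes 32 four-byte-wide Local Memory banks and a 32x32 workgroup):

```opencl
#define TILE 32
__kernel void transpose_tile(__global const float *in, __global float *out)
{
    // Without the +1, every element of a column lives in the same bank, so a
    // column read serializes; the padding shifts each row by one bank.
    __local float tile[TILE][TILE + 1];
    int x = get_local_id(0);
    int y = get_local_id(1);

    tile[y][x] = in[y * TILE + x];     // row-major load: conflict-free
    barrier(CLK_LOCAL_MEM_FENCE);
    out[y * TILE + x] = tile[x][y];    // column read: padding spreads the banks
}
```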
Safety net: Fence
- "Make sure all previous accesses are visible before proceeding"
- Built-in barriers are also fences
- A wrench: GPU fences are scoped: they apply only to a subset of work-items in the system
  - E.g., a local barrier
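Scoped synchronization in OpenCL C might look like this (a hypothetical kernel; `barrier` and `mem_fence` are standard built-ins):

```opencl
__kernel void scoped_fences(__global int *g, __local int *l)
{
    int lid = get_local_id(0);
    l[lid] = lid;

    // Scoped to the workgroup: makes Local Memory writes visible to, and
    // synchronizes with, only the work-items in this workgroup.
    barrier(CLK_LOCAL_MEM_FENCE);

    g[get_global_id(0)] = l[(lid + 1) % (int)get_local_size(0)];

    // Orders this work-item's global-memory accesses, but still promises
    // nothing to work-items in other workgroups.
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
```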
GPU Coherence?
- Notice: the GPU consistency model does not require coherence
  - i.e., Single Writer, Multiple Reader
- Reliability:
  - Historically: who notices a bad pixel?
  - Future: GPU compute demands correctness
- Power: Mobile, mobile, mobile!!!