GPGPU
Intel 4004
• Introduced in 1971
  – The first microprocessor
• 2,250 transistors
• 12 mm²
• 108 kHz
Intel 8086 Die Scan
• 29,000 transistors
• 33 mm²
• 5 MHz
• Introduced in 1978
  – Basic architecture of the IA32 PC
Pentium III
• 9,500,000 transistors
• 125 mm²
• 450 MHz
• Introduced in 1999
HW-SW
Heterogeneity Everywhere
• Neuromorphic chips
• Latency vs. throughput
• Interconnect
• Memory
• Storage
Use best match for the job!
Software Perspective
Two types of developers
CPU is optimized for sequential code performance.
[Figures: CPU vs. GPU block diagrams — the CPU die is dominated by control logic and cache with a few ALUs, the GPU die by many small ALUs; each has its own DRAM.]
(a) and (b) represent discrete GPU solutions, with a CPU-integrated memory controller in (b). Diagram (c) corresponds to integrated CPU-GPU solutions, such as AMD's Accelerated Processing Unit (APU) chips.
source: Multicore and GPU Programming: An Integrated Approach by G. Barlas, 2014
• Heck no!
• You will get the best performance from a GPU if your application:
  – Is computation-intensive
  – Has many independent computations
  – Has many similar computations
A Glimpse at a GPGPU:
• 16 highly threaded SMs
• >128 FPUs, 367 GFLOPS
• 768 MB DRAM
• 86.4 GB/s memory BW
• 4 GB/s BW to CPU
[Figure: Host → Input Assembler → Thread Execution Manager feeding an array of SMs, each with a Parallel Data Cache and a Texture unit, all connected to Global Memory.]
A Glimpse at GPU: Streaming Multiprocessor (SM)
[Same figure, with one Streaming Multiprocessor highlighted.]
A Glimpse at GPU: Streaming Processor (SP)
• SPs within an SM share control logic and an instruction cache.
[Same figure, with the SPs inside one SM highlighted.]
A Glimpse at GPU: Global Memory
• Much higher bandwidth than typical system memory
• A bit slower than typical system memory
• Communication between GPU memory and system memory is slow
[Same figure, with Global Memory highlighted.]
Winning Applications Use Both CPU and GPU
[Figure: G80-style GPU — rows of SPs under a thread processor, texture filtering (TF) units, L1 and L2 caches, and frame-buffer (FB) memory partitions.]
Early GPGPU constraints:
• Addressing modes
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of integer & bit ops
• Communication limited
• No user-defined data types
[Figure: shader pipeline with Texture, Constants, Output Registers, and FB Memory.]
The Birth of GPU Computing
• Step 1: Designing high-efficiency floating-point and integer processors.
• Step 2: Exploiting data parallelism by having a large number of processors.
• Step 3: Making shader processors fully programmable, with large instruction cache, instruction memory, and instruction control logic.
• Step 4: Reducing hardware cost by having multiple shader processors share their cache and control logic.
• Step 5: Adding memory load/store instructions with random byte-addressing capability.
• Step 6: Developing the CUDA C/C++ compiler, libraries, and runtime software model.
A Quick Glimpse at Flynn Classification
• A taxonomy of computer architecture
• Proposed by Michael Flynn in 1966
• It is based on two things:
  – Instructions
  – Data
                 Single instruction    Multiple instruction
  Single data    SISD                  MISD
  Multiple data  SIMD                  MIMD
PU = Processing Unit
Which one is closest to GPU?
Problem With GPUs: Power
Source: http://www.eteknix.com/gigabyte-g1-gaming-geforce-gtx-980-4gb-graphics-card-review/17/
Problems Faced by GPUs
• Need enough parallelism
• Under-utilization
• Bandwidth to CPU
Still a way to go
Let’s Take A Closer Look:
The Hardware
Simplified View
Source: “The CUDA Handbook” by Nicholas Wilt .. Copyright (c) by Pearson Education Inc.
A Closer Look …
Source: “The CUDA Handbook” by Nicholas Wilt .. Copyright (c) by Pearson Education Inc.
source: http://static.ddmcdn.com/gif/graphics-card-5.jpg
PROCESSING FLOW
[Figure: the CPU and GPU are connected over the PCI (PCIe) bus — input data is copied across the bus to GPU memory, the kernel executes on the GPU, and results are copied back.]
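As a minimal sketch of this flow (my example, not from the slides) — a SAXPY kernel with the three numbered steps marking the copy, execute, and copy-back phases that cross the PCIe bus:

// Sketch only: a SAXPY (y = a*x + y) kernel plus the host-side copy–execute–copy flow.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];            // one independent element per thread
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes), *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    // 1. Copy input data from host (CPU) memory to device (GPU) memory over the PCIe bus
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 2. Execute the kernel on the GPU
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    // 3. Copy the results back from device to host over the PCIe bus
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", h_y[0]);                // expect 4.0
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}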
Speed of PCIe (per lane, x1)
Version                   Transfer rate    Throughput (x1)
1.0                       2.5 GT/s         250 MB/s
2.0                       5 GT/s           500 MB/s
3.0                       8 GT/s           984.6 MB/s
4.0                       16 GT/s          1969 MB/s
5.0 (expected in 2019)    32 or 25 GT/s    3938 or 3077 MB/s
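A rough back-of-the-envelope calculation (mine, not from the slide): a GPU normally sits in an x16 slot, so PCIe 3.0 x16 gives about 16 × 984.6 MB/s ≈ 15.8 GB/s per direction — well below on-board GPU memory bandwidth (86.4 GB/s already on the G80 above, far more on current parts), which is why host–device transfers are the slow path.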
[Motherboard photo: 3 x1 PCIe slots, 2 PCI slots]
source: http://www.nvidia.com/object/nvlink.html
NVLINK
Source: http://gadgets.ndtv.com/laptops/news/nvidia-announces-nvlink-architecture-3d-stacked-memory-pascal-gpu-500335
This is how we expose the GPU as a parallel processor.
Quick Glimpse At GPU Programming Model
Application → Kernels → Grid → Blocks → Threads
Quick Glimpse At GPU Programming Model
• Application can include multiple kernels
• Threads of the same block run on the same SM
  – So threads in the same block can cooperate and share memory
  – A block in an SM is divided into warps of 32 threads each
  – A warp is the fundamental unit of dispatch in an SM
• Blocks in a grid can coordinate using global memory
• Each grid executes a kernel
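To make the hierarchy concrete, a small sketch (mine, not from the slides): a kernel launched as a 2-D grid of 2-D blocks, where each thread derives its element from blockIdx, blockDim, and threadIdx, and each 256-thread block is dispatched as 8 warps of 32 threads.

#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles one matrix element; the grid is 2-D, made of 2-D blocks.
__global__ void matAdd(const float *a, const float *b, float *c, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < w && y < h)
        c[y * w + x] = a[y * w + x] + b[y * w + x];
}

int main() {
    int w = 1024, h = 1024;
    size_t bytes = (size_t)w * h * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < w * h; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    dim3 block(16, 16);   // 256 threads per block = 8 warps of 32 threads
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    matAdd<<<grid, block>>>(a, b, c, w, h);   // one launch = one grid of blocks
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}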
Scheduling In Modern NVIDIA GPUs
Amdahl's Law:

  S = 1 / ((1 − P) + P / N)

  S: speedup
  P: ratio of parallel portions
  N: number of processors
  (1 − P): time to run the sequential portions;  P / N: time to run the parallel portions

[Plot: S (speedup) versus N (available processors)]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.
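A quick worked example (numbers are my own, not from the slide): with P = 0.95 and N = 8 processors, S = 1 / (0.05 + 0.95/8) = 1 / 0.16875 ≈ 5.9. Even with unlimited processors the speedup is capped at 1 / (1 − P) = 20×, which is why the curve of S versus N flattens out.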
Why heterogeneous architectures?
  S = 1 / ((1 − P) + P / N)
  ((1 − P): time to run sequential portions;  P / N: time to run parallel portions)

• Latency-optimized multi-core (CPU)
  – Low efficiency on parallel portions: spends too many resources
• Throughput-optimized multi-core (GPU)
  – Low performance on sequential portions
• Heterogeneous multi-core (CPU+GPU)
  – Use the right tool for the right job
  – Allows aggressive optimization for latency or for throughput

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
Example: System on Chip for a smartphone
• Big cores for applications
• Small cores for background activity
• GPU
• Special-purpose accelerators
• Lots of interfaces
CUDA
[Diagram: mixed-precision multiply-accumulate — F16 × F16 + F32 → F32]
D = A × B + C, where A, B, C, and D are 4×4 matrices with elements A0,0 … A3,3, B0,0 … B3,3, C0,0 … C3,3.
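In CUDA this tile operation is exposed on Volta-class and newer GPUs through the WMMA API; a minimal sketch (mine, not from the slides), assuming 16×16×16 tiles as the API requires rather than the 4×4 tile drawn above, and row-major operands:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C on a 16x16x16 tile using tensor cores.
// A and B are FP16, the accumulator C/D is FP32 (compile with -arch=sm_70 or newer).
__global__ void wmma_gemm_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                       // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);          // acc = A*B + acc

    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}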
RESNET-50 FP32 PERFORMANCE
[Chart: ResNet-50 FP32 training throughput (images per second, 0–2000) for Caffe, Caffe2, TensorFlow, MXNet, Torch, CNTK, and Chainer on 1, 2, 4, and 8 GPUs.]
RESNET-50 MIXED PRECISION AND FP32
[Chart: ResNet-50 training throughput (images per second, 0–7000) on 1, 2, 4, and 8 GPUs for MXNet FP32 at GTC 2017, MXNet FP32 at GTC 2018, and MXNet mixed precision at GTC 2018.]
NVIDIA DGX SOFTWARE STACK
DEEP LEARNING FRAMEWORKS
[Diagram: training loop — update model, check if good enough.]
The V100
1·t   256·t + 7·t =