
GPGPU

Hariharan Venugopal | Deep Learning Solution Architect


Computer architecture crash course

• How does a processor work?
• Or rather, how one worked in the 1980s to 1990s: modern
processors are much more complicated!
Topics: machine language (the instruction set); the Von Neumann processor;
step-by-step execution (fetch, decode, read operands, execute operation,
write back, increment PC); load/store instructions; branch instructions;
the state machine; going faster using ILP: the pipeline; pipelined and
superscalar execution; branch prediction; caches.
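As a toy illustration of that fetch/decode/execute cycle, here is a minimal sketch of an interpreter for a hypothetical accumulator machine (the opcodes are invented for illustration, not a real instruction set):

// Fetch-decode-execute loop for a hypothetical accumulator machine.
enum Opcode { LOAD, ADD, STORE, JUMP, HALT };
struct Instr { Opcode op; int addr; };

void run(const Instr* program, int* mem) {
    int pc  = 0;   // program counter
    int acc = 0;   // accumulator register
    for (;;) {
        Instr inst = program[pc];                            // fetch
        switch (inst.op) {                                   // decode
            case LOAD:  acc = mem[inst.addr];  pc++; break;  // read operand
            case ADD:   acc += mem[inst.addr]; pc++; break;  // execute operation
            case STORE: mem[inst.addr] = acc;  pc++; break;  // write back
            case JUMP:  pc = inst.addr;              break;  // branch
            case HALT:  return;
        }
    }
}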
Intel 4004 Die Photo
• Introduced in 1971
  – First microprocessor
• 2,250 transistors
• 12 mm²
• 108 kHz

Intel 8086 Die Scan
• Introduced in 1978
  – Basic architecture of the IA32 PC
• 29,000 transistors
• 33 mm²
• 5 MHz

Pentium III
• Introduced in 1999
• 9,500,000 transistors
• 125 mm²
• 450 MHz
HW-SW

• How to control software cost?
  – By reducing how often the software must be redesigned.
• And how do we do that?
  – By making the application scalable
    • More cores
    • More threads per core
    • More memory
    • Faster interconnect
    • Basically: scalability in the face of hardware growth.
  – By making the application portable
    • Across different instruction sets (x86, ARM, …)
    • From multicore to GPU to FPGA to …
    • Shared vs. distributed memory
Not just parallel, but heterogeneous parallel programming

Heterogeneity Everywhere
• Multicore CPUs, GPUs, FPGAs, neuromorphic chips, …
Latency vs. Throughput
• Latency: time to solution
  – CPUs: minimize time, at the expense of power
• Throughput: quantity of tasks processed per unit of time
  – GPUs: assume unlimited parallelism, minimize energy per operation
[Diagram: multicore CPUs, GPUs, FPGAs, automata processing, and neuromorphic
chips, tied together by an interconnect to memory and storage]
Use the best match for the job!
Software Perspective
Two types of developers:
• Performance group (C/C++, CUDA, OpenCL, …)
• Productivity group (Python, Scala, …)
Attempts to Make Parallel Programming Easy
• 1st idea: The right computer language would make parallel programming
  straightforward
  – Result so far: Some languages have made parallel programming easier,
    but none has made it as fast, efficient, and flexible as traditional
    sequential programming.

Attempts to Make Parallel Programming Easy
• 2nd idea: If you just design the hardware properly, parallel programming
  will become easy.
  – Result so far: no one has succeeded yet!

Attempts to Make Parallel Programming Easy
• 3rd idea: Write software that automatically parallelizes existing
  sequential programs.
  – Result so far: success here is inversely proportional to the number
    of cores!
Two Main Goals
• Maintain execution speed of old sequential programs → CPU
• Increase throughput of parallel programs → GPU + CPU

[Diagram: CPU with a large control unit, a big cache, and a few ALUs vs. GPU
with many small ALUs; each attached to its own DRAM]
CPU is optimized for sequential code performance.
GPU offers almost 10x the memory bandwidth of a multicore CPU
(with a relaxed memory model).
How to Choose A Processor for Your
Application?
• Performance
• Very large installation base
• Practical form-factor and easy
accessibility
• Support for IEEE floating point
standard
Integrated GPU vs. Discrete GPU

(a) and (b) represent discrete GPU solutions, with a CPU-integrated memory
controller in (b). Diagram (c) corresponds to integrated CPU-GPU solutions,
such as AMD's Accelerated Processing Unit (APU) chips.
source: Multicore and GPU Programming: An Integrated Approach by G. Barlas, 2014

Copyright © 2015 Elsevier Inc. All rights reserved.
Tradeoff: low energy vs. higher performance

Integrated CPU+GPU processors
• More than 90% of processors shipping today include a GPU on die
• Low energy use is a key design goal

Intel 4th Generation Core Processor "Haswell":
• 4-core GT2 desktop: 35 W package
• 2-core GT2 ultrabook: 11.5 W package

AMD Kaveri APU:
• Desktop: 45-95 W package
• Mobile, embedded: 15 W package

http://www.geeks3d.com/20140114/amd-kaveri-a10-7850k-a10-7700k-and-a8-7600-apus-announced/
source: Performance and Programmability Trade-offs in the OpenCL 2.0 SVM and Memory Model
by Brian T. Lewis, Intel Labs
Is Any Application Suitable for GPU?

• Heck no!
• You will get the best performance from
GPU if your application is:
– Computation intensive
– Many independent computations
– Many similar computations
A Glimpse at a GPGPU:
• 16 highly threaded SMs
• >128 FPUs, 367 GFLOPS
• 768 MB DRAM
• 86.4 GB/s memory bandwidth
• 4 GB/s bandwidth to CPU

[Diagram: Host → Input Assembler → Thread Execution Manager → arrays of
processors with parallel data caches, texture units, and load/store units,
all sharing a Global Memory]
A Glimpse at GPU: Streaming Multiprocessor (SM)
[Diagram as above, highlighting one streaming multiprocessor]
A Glimpse at GPU: Streaming Processor (SP)
• SPs within an SM share control logic and an instruction cache
[Diagram as above, highlighting the streaming processors]
A Glimpse at GPU: Global Memory
• Much higher bandwidth than typical system memory
• A bit slower than typical system memory
• Communication between GPU memory and system memory is slow
[Diagram as above, highlighting global memory]
Winning Applications Use Both CPU and GPU
• CPUs for sequential parts where latency matters
  – CPUs can be 10X+ faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
  – GPUs can be 10X+ faster than CPUs for parallel code

Source: NVIDIA GPU teaching kit


History of GPUs …
How did they evolve?
Why Look at GPU History?
• Looking at how things evolved can highlight future directions.
• Some current architecture decisions won't make sense without historical
  perspective.
A Little Bit of Vocabulary
• Rendering: the process of generating an
image from a model
• Vertex: the corner of a polygon (usually
that polygon is a triangle)
• Pixel: smallest addressable screen
element
From Numbers to Screen
Before GPUs
• Vertices to pixels:
  – Transformations done on the CPU
  – Compute each pixel "by hand", in series… slow!
Example: 1 million triangles × 100 pixels per triangle × 10 lights ×
4 cycles per light computation = 4 billion cycles
Early GPUs:
Early 80s to Late 90s
Fixed-Function Pipeline
Early GPUs: Early 80s to Late 90s
• Fixed-Function Pipeline
• Receives graphics commands and data from the CPU

Early GPUs: Early 80s to Late 90s
• Fixed-Function Pipeline
• Receives triangle data
• Converts it into a form the hardware understands
• Stores the prepared data in the vertex cache

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Vertex shading, transform, and lighting (VS/T&L)
• Assigns per-vertex values (colors, …)

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Creates edge equations to interpolate colors across the pixels touched by
  the triangle

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Determines which pixels fall into which triangle
• For each pixel, interpolates per-pixel values from the vertices

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Determines the final color of each pixel

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• The raster operation performs color raster operations that blend the
  colors of overlapping objects for transparency and antialiasing

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• The frame buffer interface manages memory reads/writes
Next Steps
• In 2001:
  – NVIDIA exposed the application developer to the instruction set of the
    VS/T&L stage
• Later:
  – General programmability was extended to the shader stage → a trend toward
    unifying the functionality of the different stages as seen by the
    application programmer.
  – In graphics pipelines, certain stages do a great deal of floating-point
    arithmetic on completely independent data.
    • This data independence is exploited → a key assumption in GPUs
Fragment = a technical term usually meaning a single pixel
In 2006
• The NVIDIA GeForce 8800 mapped the separate graphics stages to a unified
  array of processors
  – For vertex shading, geometry processing, and pixel processing
  – Allows dynamic partitioning
Regularity + Massive Parallelism

[Diagram: Host → Input Assembler and Setup/Rstr/ZCull; vertex, geometry, and
pixel thread issue feeding a unified array of SPs with texture units (TF),
L1/L2 caches, and frame buffer (FB) partitions]

Exploring the use of GPUs to solve compute-intensive problems:
the birth of GPGPU. But there were many constraints, because GPUs and their
associated APIs were designed to process graphics.
Previous GPGPU Constraints
• Dealing with the graphics API
  – Working within the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of integer & bit ops
• Communication limited
• No user-defined data types

[Diagram: fragment program with per-thread input registers, temporary
registers, per-shader/per-context texture and constants, output registers,
and frame-buffer memory]
The Birth of GPU Computing
• Step 1: Designing high-efficiency floating-point and integer processors.
• Step 2: Exploiting data parallelism by having a large number of processors.
• Step 3: Making shader processors fully programmable, with a large
  instruction cache, instruction memory, and instruction control logic.
• Step 4: Reducing hardware cost by having multiple shader processors share
  their cache and control logic.
• Step 5: Adding memory load/store instructions with random byte-addressing
  capability.
• Step 6: Developing the CUDA C/C++ compiler, libraries, and runtime software
  model.
A Quick Glimpse at the Flynn Classification
• A taxonomy of computer architectures
• Proposed by Michael Flynn in 1966
• Based on two things: instructions and data

                  Single instruction   Multiple instruction
Single data       SISD                 MISD
Multiple data     SIMD                 MIMD

PU = Processing Unit
Which one is closest to a GPU?
Problem With GPUs: Power

Source: http://www.eteknix.com/gigabyte-g1-gaming-geforce-gtx-980-4gb-graphics-card-review/17/
Problems Faced by GPUs
• Need enough parallelism
• Under-utilization
• Bandwidth to CPU

Still a way to go
Let’s Take A Closer Look:
The Hardware
Simplified View

Source: “The CUDA Handbook” by Nicholas Wilt .. Copyright (c) by Pearson Education Inc.
A Closer Look …

Source: “The CUDA Handbook” by Nicholas Wilt .. Copyright (c) by Pearson Education Inc.
source: http://static.ddmcdn.com/gif/graphics-card-5.jpg
PROCESSING FLOW
1. Copy data from host memory (CPU) to device memory (GPU) over the PCI bus.
2. CPU launches the kernel. The kernel accesses device memory at a much
   faster rate and utilizes on-chip cache memory.
3. Copy the results back from device memory (GPU) to host memory (CPU).
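A minimal sketch of this flow in CUDA C (the scale kernel and the factor 2.0f are illustrative, not from the slides):

#include <cuda_runtime.h>

// Illustrative kernel: scales each element by 2.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];                          // host buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // 1. host -> device

    scale<<<(n + 255) / 256, 256>>>(d, n);                        // 2. CPU launches kernel
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // 3. device -> host
    cudaFree(d);
    delete[] h;
    return 0;
}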
The Interconnection: CPU-GPU and GPU-GPU

About Connections: PCIe, NVLINK

PCIe
• Peripheral Component Interconnect Express
• Developed by Intel
• A high-performance I/O interconnect for peripherals
• A serial, point-to-point interconnect between two devices
• Data sent in packets
• Each lane provides 250 MB/s of bandwidth per direction (PCIe 1.0)
• Synchronous
• No shared bus, but a shared switch
Speed of PCIe
Version                  Transfer rate    Speed (x1)
1.0                      2.5 GT/s         250 MB/s
2.0                      5 GT/s           500 MB/s
3.0                      8 GT/s           984.6 MB/s
4.0                      16 GT/s          1969 MB/s
5.0 (expected in 2019)   32 or 25 GT/s    3938 or 3077 MB/s
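For example, scaling the per-lane numbers above to a full graphics slot: a
PCIe 3.0 x16 link provides roughly 16 × 984.6 MB/s ≈ 15.8 GB/s in each
direction.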
[Motherboard photos: three x1 PCIe slots, one x16 PCIe slot, and two legacy
PCI slots. Source: National Instruments]
NVLINK
• From NVIDIA
• Available starting with Pascal-generation chips
• A higher-bandwidth alternative to PCI Express 3.0
• GPU-to-GPU connections
• Also expected: CPU-GPU connections
• Allows data sharing at rates 5 to 12 times faster than traditional PCIe
• The next generation will support coherence among chips
NVLINK

source: http://www.nvidia.com/object/nvlink.html
NVLINK

source: NVIDIA® NVLink TM High-Speed Interconnect: Application Performance


whitepaper, November 2014.
NVLINK

Source: http://gadgets.ndtv.com/laptops/news/nvidia-announces-nvlink-architecture-3d-stacked-memory-pascal-gpu-500335
This is how we expose the GPU as a parallel processor.

Quick Glimpse at the GPU Programming Model
• Application → Kernels → Grid → Blocks → Threads
• An application can include multiple kernels
• Threads of the same block run on the same SM
  – So threads in an SM can cooperate and share memory
  – A block in an SM is divided into warps of 32 threads each
  – A warp is the fundamental unit of dispatch in an SM
• Blocks in a grid can coordinate using global shared memory
• Each grid executes a kernel
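A minimal sketch of how that hierarchy appears inside a kernel (the kernel
itself is an invented example, not from the slides):

// Mapping the grid/block/warp/thread hierarchy to indices.
__global__ void whereAmI(int* out, int n) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;  // thread within the grid
    int warp_id   = threadIdx.x / 32;                       // warp within its block
    int lane_id   = threadIdx.x % 32;                       // thread within its warp
    if (global_id < n) out[global_id] = warp_id * 32 + lane_id;
}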
Scheduling in Modern NVIDIA GPUs
• At any point in time, the entire device is dedicated to a single
  application (well, more on that later!)
  – Switching from one application to another takes ~25 microseconds
• The GPU can simultaneously execute multiple kernels of the same application
• Two warps from different blocks (or even different kernels) can be issued
  and executed simultaneously
Scheduling In GPUs
• Two-level, distributed thread scheduler
– At the device level: a global work
distribution engine schedules thread blocks
to various SMs
– At the SM level, each warp
scheduler distributes warps of 32
threads to its execution units.
Amdahl's Law
Bounds the speedup attainable on a parallel machine:

    S = 1 / ((1 - P) + P / N)

where S is the speedup, P the ratio of parallel portions, and N the number of
processors; (1 - P) is the time to run the sequential portions and P / N the
time to run the parallel portions. The speedup S saturates as the number of
available processors N grows.

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities. AFIPS 1967.
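For example, with P = 0.9 (90% of the work parallelizable) and N = 8
processors, S = 1 / (0.1 + 0.9/8) ≈ 4.7; even with N → ∞ the speedup is
capped at 1 / 0.1 = 10.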
Why Heterogeneous Architectures?

    S = 1 / ((1 - P) + P / N)

• Latency-optimized multi-core (CPU): low efficiency on parallel portions;
  spends too many resources on them
• Throughput-optimized multi-core (GPU): low performance on sequential
  portions
• Heterogeneous multi-core (CPU+GPU): use the right tool for the right job;
  allows aggressive optimization for each portion

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
Example: System-on-Chip for a smartphone
• Small cores for background activity
• Big cores for applications
• GPU
• Special-purpose accelerators
• Lots of interfaces
CUDA
• Compute Unified Device Architecture
• An extension of the C language
• Used to control the device
• The programmer specifies CPU and GPU functions
  – The host code can be C++
  – Device code may only be C
• The programmer specifies the thread layout
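A minimal sketch of how those roles are expressed in CUDA (the saxpy kernel
and the block size of 256 are illustrative choices, not from the slides):

// Device code: runs on the GPU, one thread per element.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host code: runs on the CPU and specifies the thread layout at launch.
void run_saxpy(int n, float a, const float* d_x, float* d_y) {
    dim3 block(256);                          // threads per block
    dim3 grid((n + block.x - 1) / block.x);   // blocks per grid
    saxpy<<<grid, block>>>(n, a, d_x, d_y);
}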
DGX-1 Architecture
Training and Inferencing

Volta Tensor Core
Volta Tensor Operation:
FP16 storage/input → full-precision product → sum with FP32 accumulator
→ convert to FP32 result, or accumulate more products.

    F16 × F16 + F32 → F32

Also supports an FP16 accumulator mode for inferencing.

https://www.nvidia.com/en-us/data-center/tensorcore/
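Conceptually, each element is an FP16 multiply accumulated in FP32. A scalar
sketch of that arithmetic (only an illustration; this is not how Tensor Cores
are actually programmed):

#include <cuda_fp16.h>

// One element of D = A*B + C: FP16 inputs, full-precision product,
// FP32 accumulation.
__device__ float mixed_fma(__half a, __half b, float acc) {
    return fmaf(__half2float(a), __half2float(b), acc);
}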
Tensor Core: Mixed-Precision Matrix Math on 4x4 matrices

    D = A · B + C

where A and B are FP16 4x4 matrices and C and D are FP16 or FP32 4x4
matrices; each D(i,j) is the sum over k of A(i,k) · B(k,j), plus C(i,j).
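In CUDA, Tensor Cores are exposed through the warp-level WMMA API; here is a
minimal sketch for one 16x16x16 tile (the pointers, leading dimension of 16,
and layouts are illustrative assumptions):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B + C with FP16 inputs and
// FP32 accumulation (requires compute capability 7.0+).
__global__ void wmma_tile(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // load A tile
    wmma::load_matrix_sync(b_frag, B, 16);                        // load B tile
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major); // load C tile
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A*B + C
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}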
ResNet-50 FP32 Performance
[Chart: training throughput in images per second for Caffe, Caffe2,
TensorFlow, MXNet, Torch, CNTK, and Chainer on 1, 2, 4, and 8 GPUs]
ResNet-50 Mixed Precision and FP32
[Chart: training throughput in images per second for MXNet FP32 (GTC 2017),
MXNet FP32 (GTC 2018), and MXNet mixed precision (GTC 2018) on 1, 2, 4, and
8 GPUs]
NVIDIA DGX Software Stack
Fully integrated software for deep learning:
• Deep learning user software: NVIDIA DIGITS™ and deep learning frameworks
  (Caffe, CNTK, MXNet, PyTorch, TensorFlow, Theano, and Torch)
• Containerization tool: NVIDIA Docker
• GPU driver: NVIDIA Driver
• System: Host OS

Advantages:
• Instant productivity with NVIDIA-optimized deep learning frameworks
• Performance optimized across the entire stack
• Faster time-to-insight with pre-built, tested, ready-to-run framework
  containers
• Flexibility to use different versions of libraries like libc and cuDNN in
  each framework container
What is a learning algorithm?

Recall Mitchell’s definition of a learning algorithm:


‘A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at
tasks in T , as measured by P, improves with experience E .’

What kinds of tasks T are machine learning algorithms suited to?


What does training look like?

A loop: build a model, grab new data, check whether the model is good
enough, and if not, update the model and repeat.
The V-100
And why is it so good at Machine Learning?

Strengths of V100
● Built for massively parallel computations
● Specific hardware/software to manage deep learning workloads (Tensor Cores,
  mixed-precision execution, etc.)

Tesla SXM V100
● 5376 cores (FP32)
My Questions Around the GPU
What are we going to do with 5376 FP32 cores?

The Unsatisfactory Answer
What are we going to do with 5376 FP32 cores?
"Execute things in parallel"!

Yes, but how exactly can we do that for ML workloads?
● We may have a huge number of layers
● Each layer can have a huge number of neurons
→ There may be hundreds of millions or even billions of * and + operations

All knobs are W (weight) values that we need to tune so that, given a certain
input, they generate the correct output.
"Matrix Multiplication is
EATING (the computing resources of) THE
WORLD"
hi_j = [X0, X1, X2, ...] * [W0, W1, W2, ...]

hi_j = X0*W0 + X1*W1 + X2*W2 + ...


Matmul
X = [1.0, 2.0, ..., 256.0] # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1] # Then we need to have 256 weight values
= X * W # [1*0.1 + 2*0.1 + ... + 256*0.1] == 32389.6
h0,0
Comparing Orders of Magnitude
For the 256-element dot product above:
• Single-threaded execution: ≈ 256 · t (one multiply-add after another)
• GPU multi-threaded execution: ≈ 1 · t + 7 · t (all multiplies in parallel,
  followed by a tree reduction of the partial products)