
GPGPU

Hariharan Venugopal | Deep Learning Solution Architect


Computer architecture crash course

• How does a processor work?
• Or rather, how one worked in the 1980s to 1990s: modern
processors are much more complicated!
Topics: machine language (the instruction set); the Von Neumann processor;
step-by-step execution (fetch, decode, read operands, execute operation,
write back, increment PC); load/store instructions; branch instructions;
the state machine; going faster using ILP: the pipeline; pipelined and
superscalar execution; branch prediction; caches.
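As a toy illustration of that fetch/decode/execute cycle, here is a minimal sketch of an interpreter for a hypothetical accumulator machine (the opcodes are invented for illustration, not a real instruction set):

// Fetch-decode-execute loop for a hypothetical accumulator machine.
enum Opcode { LOAD, ADD, STORE, JUMP, HALT };
struct Instr { Opcode op; int addr; };

void run(const Instr* program, int* mem) {
    int pc  = 0;   // program counter
    int acc = 0;   // accumulator register
    for (;;) {
        Instr inst = program[pc];                            // fetch
        switch (inst.op) {                                   // decode
            case LOAD:  acc = mem[inst.addr];  pc++; break;  // read operand
            case ADD:   acc += mem[inst.addr]; pc++; break;  // execute operation
            case STORE: mem[inst.addr] = acc;  pc++; break;  // write back
            case JUMP:  pc = inst.addr;              break;  // branch
            case HALT:  return;
        }
    }
}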
Intel 4004 Die Photo
• Introduced in 1971
  – First microprocessor
• 2,250 transistors
• 12 mm²
• 108 kHz

Intel 8086 Die Scan
• Introduced in 1978
  – Basic architecture of the IA32 PC
• 29,000 transistors
• 33 mm²
• 5 MHz

Pentium III
• Introduced in 1999
• 9,500,000 transistors
• 125 mm²
• 450 MHz
HW-SW

• How to control software cost?
  – By reducing how often the software must be redesigned.
• And how do we do that?
  – By making the application scalable
    • More cores
    • More threads per core
    • More memory
    • Faster interconnect
    • Basically: scalability in the face of hardware growth.
  – By making the application portable
    • Across different instruction sets (x86, ARM, …)
    • From multicore to GPU to FPGA to …
    • Shared vs. distributed memory
Not just parallel, but heterogeneous parallel programming

Heterogeneity Everywhere
• Multicore CPUs, GPUs, FPGAs, neuromorphic chips, …
Latency vs. Throughput
• Latency: time to solution
  – CPUs: minimize time, at the expense of power
• Throughput: quantity of tasks processed per unit of time
  – GPUs: assume unlimited parallelism, minimize energy per operation
[Diagram: multicore CPUs, GPUs, FPGAs, automata processing, and neuromorphic
chips, tied together by an interconnect to memory and storage]
Use the best match for the job!
Software Perspective
Two types of developers:
• Performance group (C/C++, CUDA, OpenCL, …)
• Productivity group (Python, Scala, …)
Attempts to Make Parallel Programming Easy
• 1st idea: The right computer language would make parallel programming
  straightforward
  – Result so far: Some languages have made parallel programming easier,
    but none has made it as fast, efficient, and flexible as traditional
    sequential programming.

Attempts to Make Parallel Programming Easy
• 2nd idea: If you just design the hardware properly, parallel programming
  will become easy.
  – Result so far: no one has succeeded yet!

Attempts to Make Parallel Programming Easy
• 3rd idea: Write software that automatically parallelizes existing
  sequential programs.
  – Result so far: success here is inversely proportional to the number
    of cores!
Two Main Goals
• Maintain execution speed of old sequential programs → CPU
• Increase throughput of parallel programs → GPU + CPU

[Diagram: CPU with a large control unit, a big cache, and a few ALUs vs. GPU
with many small ALUs; each attached to its own DRAM]
CPU is optimized for sequential code performance.
GPU offers almost 10x the memory bandwidth of a multicore CPU
(with a relaxed memory model).
How to Choose A Processor for Your
Application?
• Performance
• Very large installation base
• Practical form-factor and easy
accessibility
• Support for IEEE floating point
standard
Integrated GPU vs. Discrete GPU

(a) and (b) represent discrete GPU solutions, with a CPU-integrated memory
controller in (b). Diagram (c) corresponds to integrated CPU-GPU solutions,
such as AMD's Accelerated Processing Unit (APU) chips.
source: Multicore and GPU Programming: An Integrated Approach by G. Barlas, 2014

Copyright © 2015 Elsevier Inc. All rights reserved.
Tradeoff: low energy vs. higher performance

Integrated CPU+GPU processors
• More than 90% of processors shipping today include a GPU on die
• Low energy use is a key design goal

Intel 4th Generation Core Processor "Haswell":
• 4-core GT2 desktop: 35 W package
• 2-core GT2 ultrabook: 11.5 W package

AMD Kaveri APU:
• Desktop: 45-95 W package
• Mobile, embedded: 15 W package

http://www.geeks3d.com/20140114/amd-kaveri-a10-7850k-a10-7700k-and-a8-7600-apus-announced/
source: Performance and Programmability Trade-offs in the OpenCL 2.0 SVM and Memory Model
by Brian T. Lewis, Intel Labs
Is Any Application Suitable for GPU?

• Heck no!
• You will get the best performance from
GPU if your application is:
– Computation intensive
– Many independent computations
– Many similar computations
A Glimpse at a GPGPU:
• 16 highly threaded SMs
• >128 FPUs, 367 GFLOPS
• 768 MB DRAM
• 86.4 GB/s memory bandwidth
• 4 GB/s bandwidth to CPU

[Diagram: Host → Input Assembler → Thread Execution Manager → arrays of
processors with parallel data caches, texture units, and load/store units,
all sharing a Global Memory]
A Glimpse at GPU: Streaming Multiprocessor (SM)
[Diagram as above, highlighting one streaming multiprocessor]
A Glimpse at GPU: Streaming Processor (SP)
• SPs within an SM share control logic and an instruction cache
[Diagram as above, highlighting the streaming processors]
A Glimpse at GPU: Global Memory
• Much higher bandwidth than typical system memory
• A bit slower than typical system memory
• Communication between GPU memory and system memory is slow
[Diagram as above, highlighting global memory]
Winning Applications Use Both CPU and GPU
• CPUs for sequential parts where latency matters
  – CPUs can be 10X+ faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
  – GPUs can be 10X+ faster than CPUs for parallel code

Source: NVIDIA GPU teaching kit


History of GPUs …
How did they evolve?
Why Look at GPU History?
• Looking at how things evolved can highlight future directions.
• Some current architecture decisions won't make sense without historical
  perspective.
A Little Bit of Vocabulary
• Rendering: the process of generating an
image from a model
• Vertex: the corner of a polygon (usually
that polygon is a triangle)
• Pixel: smallest addressable screen
element
From Numbers to Screen
Before GPUs
• Vertices to pixels:
  – Transformations done on the CPU
  – Compute each pixel "by hand", in series… slow!
Example: 1 million triangles × 100 pixels per triangle × 10 lights ×
4 cycles per light computation = 4 billion cycles
Early GPUs:
Early 80s to Late 90s
Fixed-Function Pipeline
Early GPUs: Early 80s to Late 90s
• Fixed-Function Pipeline
• Receives graphics commands and data from the CPU

Early GPUs: Early 80s to Late 90s
• Fixed-Function Pipeline
• Receives triangle data
• Converts it into a form the hardware understands
• Stores the prepared data in the vertex cache

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Vertex shading, transform, and lighting (VS/T&L)
• Assigns per-vertex values (colors, …)

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Creates edge equations to interpolate colors across the pixels touched by
  the triangle

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Determines which pixels fall into which triangle
• For each pixel, interpolates per-pixel values from the vertices

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• Determines the final color of each pixel

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• The raster operation performs color raster operations that blend the
  colors of overlapping objects for transparency and antialiasing

Early GPUs: Early 80s to Late 90s
Fixed-Function Pipeline
• The frame buffer interface manages memory reads/writes
Next Steps
• In 2001:
  – NVIDIA exposed the application developer to the instruction set of the
    VS/T&L stage
• Later:
  – General programmability was extended to the shader stage → a trend toward
    unifying the functionality of the different stages as seen by the
    application programmer.
  – In graphics pipelines, certain stages do a great deal of floating-point
    arithmetic on completely independent data.
    • This data independence is exploited → a key assumption in GPUs
Fragment = a technical term usually meaning a single pixel
In 2006
• The NVIDIA GeForce 8800 mapped the separate graphics stages to a unified
  array of processors
  – For vertex shading, geometry processing, and pixel processing
  – Allows dynamic partitioning
Regularity + Massive Parallelism

[Diagram: Host → Input Assembler and Setup/Rstr/ZCull; vertex, geometry, and
pixel thread issue feeding a unified array of SPs with texture units (TF),
L1/L2 caches, and frame buffer (FB) partitions]

Exploring the use of GPUs to solve compute-intensive problems:
the birth of GPGPU. But there were many constraints, because GPUs and their
associated APIs were designed to process graphics.
Previous GPGPU Constraints
• Dealing with the graphics API
  – Working within the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of integer & bit ops
• Communication limited
• No user-defined data types

[Diagram: fragment program with per-thread input registers, temporary
registers, per-shader/per-context texture and constants, output registers,
and frame-buffer memory]
The Birth of GPU Computing
• Step 1: Designing high-efficiency floating-point and integer processors.
• Step 2: Exploiting data parallelism by having a large number of processors.
• Step 3: Making shader processors fully programmable, with a large
  instruction cache, instruction memory, and instruction control logic.
• Step 4: Reducing hardware cost by having multiple shader processors share
  their cache and control logic.
• Step 5: Adding memory load/store instructions with random byte-addressing
  capability.
• Step 6: Developing the CUDA C/C++ compiler, libraries, and runtime software
  model.
A Quick Glimpse at the Flynn Classification
• A taxonomy of computer architectures
• Proposed by Michael Flynn in 1966
• Based on two things: instructions and data

                  Single instruction   Multiple instruction
Single data       SISD                 MISD
Multiple data     SIMD                 MIMD

PU = Processing Unit
Which one is closest to a GPU?
Problem With GPUs: Power

Source: http://www.eteknix.com/gigabyte-g1-gaming-geforce-gtx-980-4gb-graphics-card-review/17/
Problems Faced by GPUs
• Need enough parallelism
• Under-utilization
• Bandwidth to CPU

Still a way to go
Let’s Take A Closer Look:
The Hardware
Simplified View

Source: “The CUDA Handbook” by Nicholas Wilt .. Copyright (c) by Pearson Education Inc.
A Closer Look …

Source: “The CUDA Handbook” by Nicholas Wilt .. Copyright (c) by Pearson Education Inc.
source: http://static.ddmcdn.com/gif/graphics-card-5.jpg
PROCESSING FLOW
1. Copy data from host memory (CPU) to device memory (GPU) over the PCI bus.
2. CPU launches the kernel. The kernel accesses device memory at a much
   faster rate and utilizes on-chip cache memory.
3. Copy the results back from device memory (GPU) to host memory (CPU).
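A minimal sketch of this flow in CUDA C (the scale kernel and the factor 2.0f are illustrative, not from the slides):

#include <cuda_runtime.h>

// Illustrative kernel: scales each element by 2.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];                          // host buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // 1. host -> device

    scale<<<(n + 255) / 256, 256>>>(d, n);                        // 2. CPU launches kernel
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // 3. device -> host
    cudaFree(d);
    delete[] h;
    return 0;
}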
The Interconnection: CPU-GPU and GPU-GPU

About Connections: PCIe, NVLINK

PCIe
• Peripheral Component Interconnect Express
• Developed by Intel
• A high-performance I/O interconnect for peripherals
• A serial, point-to-point interconnect between two devices
• Data sent in packets
• Each lane provides 250 MB/s of bandwidth per direction (PCIe 1.0)
• Synchronous
• No shared bus, but a shared switch
Speed of PCIe
Version                  Transfer rate    Speed (x1)
1.0                      2.5 GT/s         250 MB/s
2.0                      5 GT/s           500 MB/s
3.0                      8 GT/s           984.6 MB/s
4.0                      16 GT/s          1969 MB/s
5.0 (expected in 2019)   32 or 25 GT/s    3938 or 3077 MB/s
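For example, scaling the per-lane numbers above to a full graphics slot: a
PCIe 3.0 x16 link provides roughly 16 × 984.6 MB/s ≈ 15.8 GB/s in each
direction.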
[Motherboard photos: three x1 PCIe slots, one x16 PCIe slot, and two legacy
PCI slots. Source: National Instruments]
NVLINK
• From NVIDIA
• Available starting with Pascal-generation chips
• A higher-bandwidth alternative to PCI Express 3.0
• GPU-to-GPU connections
• Also expected: CPU-GPU connections
• Allows data sharing at rates 5 to 12 times faster than traditional PCIe
• The next generation will support coherence among chips
NVLINK

source: http://www.nvidia.com/object/nvlink.html
NVLINK

source: NVIDIA® NVLink TM High-Speed Interconnect: Application Performance


whitepaper, November 2014.
NVLINK

Source: http://gadgets.ndtv.com/laptops/news/nvidia-announces-nvlink-architecture-3d-stacked-memory-pascal-gpu-500335
This is how we expose the GPU as a parallel processor.

Quick Glimpse at the GPU Programming Model
• Application → Kernels → Grid → Blocks → Threads
• An application can include multiple kernels
• Threads of the same block run on the same SM
  – So threads in an SM can cooperate and share memory
  – A block in an SM is divided into warps of 32 threads each
  – A warp is the fundamental unit of dispatch in an SM
• Blocks in a grid can coordinate using global shared memory
• Each grid executes a kernel
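A minimal sketch of how that hierarchy appears inside a kernel (the kernel
itself is an invented example, not from the slides):

// Mapping the grid/block/warp/thread hierarchy to indices.
__global__ void whereAmI(int* out, int n) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;  // thread within the grid
    int warp_id   = threadIdx.x / 32;                       // warp within its block
    int lane_id   = threadIdx.x % 32;                       // thread within its warp
    if (global_id < n) out[global_id] = warp_id * 32 + lane_id;
}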
Scheduling in Modern NVIDIA GPUs
• At any point in time, the entire device is dedicated to a single
  application (well, more on that later!)
  – Switching from one application to another takes ~25 microseconds
• The GPU can simultaneously execute multiple kernels of the same application
• Two warps from different blocks (or even different kernels) can be issued
  and executed simultaneously
Scheduling In GPUs
• Two-level, distributed thread scheduler
– At the device level: a global work
distribution engine schedules thread blocks
to various SMs
– At the SM level, each warp
scheduler distributes warps of 32
threads to its execution units.
Amdahl's Law
Bounds the speedup attainable on a parallel machine:

    S = 1 / ((1 - P) + P / N)

where S is the speedup, P the ratio of parallel portions, and N the number of
processors; (1 - P) is the time to run the sequential portions and P / N the
time to run the parallel portions. The speedup S saturates as the number of
available processors N grows.

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities. AFIPS 1967.
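For example, with P = 0.9 (90% of the work parallelizable) and N = 8
processors, S = 1 / (0.1 + 0.9/8) ≈ 4.7; even with N → ∞ the speedup is
capped at 1 / 0.1 = 10.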
Why Heterogeneous Architectures?

    S = 1 / ((1 - P) + P / N)

• Latency-optimized multi-core (CPU): low efficiency on parallel portions;
  spends too many resources on them
• Throughput-optimized multi-core (GPU): low performance on sequential
  portions
• Heterogeneous multi-core (CPU+GPU): use the right tool for the right job;
  allows aggressive optimization for each portion

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
Example: System-on-Chip for a smartphone
• Small cores for background activity
• Big cores for applications
• GPU
• Special-purpose accelerators
• Lots of interfaces
CUDA
• Compute Unified Device Architecture
• An extension of the C language
• Used to control the device
• The programmer specifies CPU and GPU functions
  – The host code can be C++
  – Device code may only be C
• The programmer specifies the thread layout
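A minimal sketch of how those roles are expressed in CUDA (the saxpy kernel
and the block size of 256 are illustrative choices, not from the slides):

// Device code: runs on the GPU, one thread per element.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host code: runs on the CPU and specifies the thread layout at launch.
void run_saxpy(int n, float a, const float* d_x, float* d_y) {
    dim3 block(256);                          // threads per block
    dim3 grid((n + block.x - 1) / block.x);   // blocks per grid
    saxpy<<<grid, block>>>(n, a, d_x, d_y);
}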
DGX-1 Architecture
Training and Inferencing

Volta Tensor Core
Volta Tensor Operation:
FP16 storage/input → full-precision product → sum with FP32 accumulator
→ convert to FP32 result, or accumulate more products.

    F16 × F16 + F32 → F32

Also supports an FP16 accumulator mode for inferencing.

https://www.nvidia.com/en-us/data-center/tensorcore/
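Conceptually, each element is an FP16 multiply accumulated in FP32. A scalar
sketch of that arithmetic (only an illustration; this is not how Tensor Cores
are actually programmed):

#include <cuda_fp16.h>

// One element of D = A*B + C: FP16 inputs, full-precision product,
// FP32 accumulation.
__device__ float mixed_fma(__half a, __half b, float acc) {
    return fmaf(__half2float(a), __half2float(b), acc);
}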
Tensor Core: Mixed-Precision Matrix Math on 4x4 matrices

    D = A · B + C

where A and B are FP16 4x4 matrices and C and D are FP16 or FP32 4x4
matrices; each D(i,j) is the sum over k of A(i,k) · B(k,j), plus C(i,j).
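In CUDA, Tensor Cores are exposed through the warp-level WMMA API; here is a
minimal sketch for one 16x16x16 tile (the pointers, leading dimension of 16,
and layouts are illustrative assumptions):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B + C with FP16 inputs and
// FP32 accumulation (requires compute capability 7.0+).
__global__ void wmma_tile(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // load A tile
    wmma::load_matrix_sync(b_frag, B, 16);                        // load B tile
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major); // load C tile
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A*B + C
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}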
ResNet-50 FP32 Performance
[Chart: training throughput in images per second for Caffe, Caffe2,
TensorFlow, MXNet, Torch, CNTK, and Chainer on 1, 2, 4, and 8 GPUs]
ResNet-50 Mixed Precision and FP32
[Chart: training throughput in images per second for MXNet FP32 (GTC 2017),
MXNet FP32 (GTC 2018), and MXNet mixed precision (GTC 2018) on 1, 2, 4, and
8 GPUs]
NVIDIA DGX Software Stack
Fully integrated software for deep learning:
• Deep learning user software: NVIDIA DIGITS™ and deep learning frameworks
  (Caffe, CNTK, MXNet, PyTorch, TensorFlow, Theano, and Torch)
• Containerization tool: NVIDIA Docker
• GPU driver: NVIDIA Driver
• System: Host OS

Advantages:
• Instant productivity with NVIDIA-optimized deep learning frameworks
• Performance optimized across the entire stack
• Faster time-to-insight with pre-built, tested, ready-to-run framework
  containers
• Flexibility to use different versions of libraries like libc and cuDNN in
  each framework container
What is a learning algorithm?

Recall Mitchell’s definition of a learning algorithm:


‘A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at
tasks in T , as measured by P, improves with experience E .’

What kinds of tasks T are machine learning algorithms suited to?


What does training look like?

A loop: build a model, grab new data, check whether the model is good
enough, and if not, update the model and repeat.
The V-100
And why is it so good at Machine Learning?

Strengths of V100
● Built for massively parallel computations
● Specific hardware/software to manage deep learning workloads (Tensor Cores,
  mixed-precision execution, etc.)

Tesla SXM V100
● 5376 cores (FP32)
My Questions Around the GPU
What are we going to do with 5376 FP32 cores?

The Unsatisfactory Answer
What are we going to do with 5376 FP32 cores?
"Execute things in parallel"!

Yes, but how exactly can we do that for ML workloads?
● We may have a huge number of layers
● Each layer can have a huge number of neurons
→ There may be hundreds of millions or even billions of * and + operations

All knobs are W (weight) values that we need to tune so that, given a certain
input, they generate the correct output.
"Matrix Multiplication is
EATING (the computing resources of) THE
WORLD"
hi_j = [X0, X1, X2, ...] * [W0, W1, W2, ...]

hi_j = X0*W0 + X1*W1 + X2*W2 + ...


Matmul
X = [1.0, 2.0, ..., 256.0] # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1] # Then we need to have 256 weight values
= X * W # [1*0.1 + 2*0.1 + ... + 256*0.1] == 32389.6
h0,0
Comparing Orders of Magnitude
For the 256-element dot product above:
• Single-threaded execution: ≈ 256 · t (one multiply-add after another)
• GPU multi-threaded execution: ≈ 1 · t + 7 · t (all multiplies in parallel,
  followed by a tree reduction of the partial products)