The Evolution of GPUs for General Purpose Computing
Ian Buck | Sr. Director, GPU Computing Software
San Jose Convention Center, CA | September 20-23, 2010
Talk Outline
History of early graphics hardware
First GPU Computing
Dates: prior to 1987
Early Framebuffers
By the mid-1970s one could afford framebuffers with a few bits per pixel at modest resolution
"A Random Access Video Frame Buffer," Kajiya, Sutherland, Cheadle, 1975
Pixel: lighting
Dates: 1987-1992
Pixel: more, faster
Dates: 1990s
Desktop 3D workstations under $5000
Single-board, multi-chip graphics subsystems
Rise of 3D on the PC
40-company free-for-all until intense competition knocked out all but a few players
Many were decelerators, and easy to beat
Single-chip GPUs
Interesting hardware experimentation
PCs would take over the workstation business
Interesting consoles
3DO, Nintendo, Sega, Sony
DirectX / GPU / game timeline:

  Year       API           Shader capability         NVIDIA GPU    Game
  1998       DirectX 6     Multitexturing            Riva TNT      Half-Life
  1999-2000  DirectX 7     T&L, TextureStageState    GeForce 256   Quake 3
  2001       DirectX 8     SM 1.x                    GeForce 3     Giants
  2002-2003  DirectX 9     SM 2.0, Cg (2002)         GeForce FX    Halo
  2004       DirectX 9.0c  SM 3.0                    GeForce 6     Far Cry, UE3
[Figure: per-pixel lighting fragment program instruction listing (ADDR, DP3R, RSQR, MULR, ADDR, DP3R, RSQR, MADR, MULR, DP3R, MAXR), each op reading input and temp registers and writing a temp]
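As a rough guide to what that instruction sequence computes, here is a hedged C for CUDA sketch (the function and parameter names are illustrative, not from the slides): two normalizations built from dot products and reciprocal square roots, then a clamped dot product for the diffuse term.

#include <cuda_runtime.h>

// Hedged sketch of the math behind the listing above: DP3 is a 3-component dot
// product, RSQ a reciprocal square root, MAD a multiply-add, MAX a clamp.
__device__ float3 per_pixel_diffuse(float3 N, float3 L, float3 albedo)
{
    // DP3 + RSQ + MUL: normalize the interpolated surface normal
    float invN = rsqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
    N.x *= invN; N.y *= invN; N.z *= invN;

    // DP3 + RSQ + MUL: normalize the light direction
    float invL = rsqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
    L.x *= invL; L.y *= invL; L.z *= invL;

    // DP3 + MAX: clamped Lambertian term
    float ndotl = fmaxf(N.x*L.x + N.y*L.y + N.z*L.z, 0.0f);

    return make_float3(albedo.x * ndotl, albedo.y * ndotl, albedo.z * ndotl);
}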
[Image comparison: No Lighting vs. Per-Vertex Lighting vs. Per-Pixel Lighting]
Unreal © Epic Games
Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.
Full Spectrum Warrior: Ten Hammers © 2006 Pandemic Studios, LLC. All rights reserved. © 2006 THQ Inc. All rights reserved.
[Chart: recent trends in multiplies per second (observed peak), in GFLOPS, July 2001 through January 2004]
GPU history: NVIDIA historicals

  Date    Product          Process  Trans  MHz  GFLOPS (MUL)
  Aug-02  GeForce FX5800   0.13     121M   500
  Jan-03  GeForce FX5900   0.13     130M   475  20
  Dec-03  GeForce 6800     0.13     222M   400  53
www.gpgpu.org
Early Raytracing
Brook (2003)
C with streams

streams
  collection of records requiring similar computation
  e.g. particle positions, voxels, FEM cells, ...

    Ray r<200>;
    float3 velocityfield<100,100,100>;

kernels
  functions applied to streams
  similar to a for_all construct

    kernel void add(float a<>, float b<>,
                    out float result<>) {
        result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    add(a, b, c);
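For comparison, a minimal sketch (not from the original slides; kernel and buffer names are illustrative) of how the same element-wise add maps onto CUDA C, where each thread stands in for one stream element.

#include <cuda_runtime.h>

// Each CUDA thread handles one element, much like one stream record in Brook.
__global__ void add(const float *a, const float *b, float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = a[i] + b[i];
}

// Host-side launch over n = 100 elements, mirroring add(a, b, c) in Brook.
void run_add(const float *d_a, const float *d_b, float *d_c, int n)
{
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);
}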
Challenges

Hardware / Software:
  Addressing modes
  Shader capabilities: limited outputs
  Instruction sets: integer & bit ops
  Communication limited: between pixels; scatter (a[i] = p)

[Diagram, shown twice: the DX9 fragment-processor model, where Input Registers, Texture, and Constants feed the Fragment Program and its Registers, and results leave only through Output Registers]

Building the GPU Computing Ecosystem
Thread Programs

[Diagram: a Thread Number feeds a Thread Program with access to Texture, Constants, and Registers, writing to Output Registers]

Features:
  Millions of instructions
  Full integer and bit instructions
  No limits on branching, looping
  1D, 2D, or 3D thread ID allocation (see the sketch below)
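A minimal sketch (illustrative kernel, not from the slides) of those features in a CUDA thread program: a 2D thread ID, a data-dependent loop, and branching, none of which a DX9 fragment program could express freely.

#include <cuda_runtime.h>

// Each thread gets a 2D ID and iterates until its own convergence test fails.
__global__ void iterate2d(float *grid, int width, int height, int max_steps)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 2D thread ID
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;           // branching

    float v = grid[y * width + x];
    for (int step = 0; step < max_steps && fabsf(v) < 4.0f; ++step)
        v = v * v + 0.25f;                           // arbitrary per-thread iteration

    grid[y * width + x] = v;
}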
Global Memory

[Diagram: a Thread Number feeds a Thread Program with access to Texture, Constants, and Registers; the program now loads from and stores to Global Memory]

Features: full load/store access to global memory from the thread program (gather and scatter; see the sketch below)
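A short sketch (the histogram framing is illustrative) of the scatter, a[i] = p with a data-dependent address, that the fragment-program model could not express; atomicAdd resolves collisions when threads target the same bin.

#include <cuda_runtime.h>

// Scatter: each thread computes a destination index and writes there.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // data-dependent write address
}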
Shared Memory

[Diagram: a Thread Number feeds a Thread Program with access to Texture, Constants, Registers, and Shared memory, backed by Global Memory]

Features:
  Explicitly managed
  As fast as registers (see the sketch below)
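A minimal sketch of explicitly managed shared memory (illustrative kernel, launch with 256 threads per block): the block stages its values in __shared__ storage, synchronizes, and combines them, the Pn = P1 + P2 + P3 + P4 pattern of the figures that follow.

#include <cuda_runtime.h>

// Each block loads its elements into shared memory, then cooperatively sums them.
// Assumes blockDim.x == 256 (a power of two).
__global__ void block_sum(const float *p, float *block_totals, int n)
{
    __shared__ float s[256];                       // explicitly managed, on-chip
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? p[i] : 0.0f;                // stage P1, P2, ... in shared memory
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];             // threads cooperate through shared data
        __syncthreads();
    }

    if (tid == 0)
        block_totals[blockIdx.x] = s[0];           // Pn for this block
}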
GPGPU
[Figure: computing Pn = P1 + P2 + P3 + P4 the GPGPU way. On the CPU, Control and ALUs with a Cache over DRAM run a single thread out of cache; on the GPU, per-pixel Control/ALU units under Program/Control combine pairs (P1, P2) and (P3, P4) through multiple passes over Video Memory.]

GPU Computing
[Figure: the same Pn = P1 + P2 + P3 + P4 under GPU Computing. A Thread Execution Manager dispatches cooperating threads (Control and ALUs) that exchange partial results P1 ... P5 through Shared Data backed by DRAM.]
GeForce 8800
Build the architecture around the processor
[Diagram: GeForce 8800 block diagram. The Host and Input Assembler feed an array of thread processors (SPs) grouped with texture fetch (TF) units and L1 caches, connected through L2 caches to the frame buffer (FB) partitions.]
[Diagram: the Host and Input Assembler feed a Thread Execution Manager, which dispatches work to eight groups of Thread Processors, each with its own Parallel Data Cache; all groups load/store to Global Memory.]
Flexibility
Data layout no longer forces the algorithm
Blocking computation for the memory hierarchy (shared memory; see the sketch below)
Think about the algorithm, not the data
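A hedged sketch of blocking for the memory hierarchy (illustrative kernel and tile size, not from the slides): each block copies a tile of the input into shared memory once, then every thread in the block reuses it, so both global-memory accesses stay coalesced.

#include <cuda_runtime.h>

#define TILE 16

// Blocked matrix transpose: stage a tile in shared memory so that the read
// from and the write to global memory are both coalesced.
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Write the transposed tile; block indices swap roles.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];
}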
Foundations
Baseline HPC solution
Ubiquity: CUDA Everywhere
Software:
  C99 Math.h
  BLAS & FFT
  GPU co-processor

Hardware:
  IEEE math (G80)
  Double precision (GT200)
  ECC (Fermi)
Customizing Solutions

[Stack, from ease of adoption at the top to generality at the bottom:
  Ported Applications
  Domain Libraries
  Domain-specific languages
  C
  Driver API
  PTX
  HW]
[Diagram: DSL, Fortran, online code generation, and C/C++ front ends feed the Compiler, which emits PTX Code; a Translator then produces target code for a specific GPU (e.g. Tesla, SM 1.3).]

PTX carries variable declarations, data initialization, and instructions with their operands.
PTX to Target
  Programming model
  Execution resources and state
  Abstract and unify target details

[Diagram: the same PTX Code is translated for either a Tesla target (SM 1.0) or a Fermi target (SM 2.0).]
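A hedged sketch of the Driver API path from the stack above (file name, kernel name, and sizes are illustrative; error checking omitted; uses the cuLaunchKernel entry point from later driver-API revisions): load a PTX module at runtime and let the driver translate it to the installed GPU's target code. It assumes the kernel was compiled with nvcc -ptx from a definition declared extern "C", so its PTX entry is named add.

#include <cuda.h>

// Load PTX produced offline and launch its "add" kernel over n floats.
void launch_from_ptx(CUdeviceptr d_a, CUdeviceptr d_b, CUdeviceptr d_c, int n)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "add.ptx");            // driver JIT-translates PTX for this GPU
    cuModuleGetFunction(&fn, mod, "add");

    void *args[] = { &d_a, &d_b, &d_c, &n };
    int threads = 128, blocks = (n + threads - 1) / threads;
    cuLaunchKernel(fn, blocks, 1, 1, threads, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}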
Foundation Libraries
CUBLAS, CUFFT, CULA, NVCUVID/VENC, NVPP, Magma
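A minimal sketch of leaning on one of these foundation libraries, CUFFT, for a batched 1D complex-to-complex transform (sizes are illustrative; error checking omitted).

#include <cufft.h>
#include <cuda_runtime.h>

// In-place forward FFT on `batch` signals of length `nx`, already resident on the GPU.
void forward_fft(cufftComplex *d_signal, int nx, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);               // plan once, reuse if possible
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place transform
    cudaDeviceSynchronize();
    cufftDestroy(plan);
}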
Development Environment
C, C++, Fortran, Python, Java, OpenCL, DirectCompute, ...
Directions
Hardware and Software are one
Within the Node
OS integration: Scheduling, Preemption, Virtual Memory
Results: Programming model simplification
GPU on-load
Enhance the programming model to keep more of the computation on the GPU (less CPU interaction) and more of the data (less host-side shadowing).
Thank You!