
The Evolution of GPUs for General

Purpose Computing
Ian Buck | Sr. Director, GPU Computing Software
San Jose Convention Center, CA | September 20-23, 2010

Talk Outline
- History of early graphics hardware
- First GPU Computing: when GPUs became programmable
- Creating GPU Computing
- Future trends and directions

First Generation - Wireframe
- Vertex: transform, clip, and project
- Rasterization: lines only
- Pixel: no pixels! calligraphic display
- Dates: prior to 1987

Storage Tube Terminals
- CRTs with analog charge persistence
- Accumulate a detailed static image by writing points or line segments
- Erase the stored image to start a new one

Early Framebuffers
- By the mid-1970s one could afford framebuffers with a few bits per pixel at modest resolution
  ("A Random Access Video Frame Buffer", Kajiya, Sutherland, Cheadle, 1975)
- Vector displays were still better for fine position detail
- Framebuffers were used to emulate storage tube vector terminals on a raster display

Second Generation - Shaded Solids
- Vertex: lighting
- Rasterization: filled polygons
- Pixel: depth buffer, color blending
- Dates: 1987 - 1992

Third Generation - Texture Mapping
- Vertex: more, faster
- Rasterization: more, faster
- Pixel: texture filtering, antialiasing
- Dates: 1992 - 2001

IRIS 3000 Graphics Cards
- Geometry Engines & Rasterizer
- 4 bit/pixel Framebuffer (2 instances)

1990s
- Desktop 3D workstations under $5000
- Single-board, multi-chip graphics subsystems

Rise of 3D on the PC
- 40-company free-for-all until intense competition knocked out all but a few players
  - Many were decelerators, and easy to beat
- Single-chip GPUs
- Interesting hardware experimentation
- PCs would take over the workstation business
- Interesting consoles: 3DO, Nintendo, Sega, Sony

Before Programmable Shading
- Computing through image processing, circa 1995
- GL_ARB_imaging

Moving toward programmability

- DirectX 5: Riva 128
- 1998: DirectX 6 (multitexturing), Riva TNT; Half-Life
- 1999: DirectX 7 (T&L, TextureStageState), GeForce 256; Quake 3
- 2000: Giants
- 2001: DirectX 8 (SM 1.x), GeForce 3
- 2002: DirectX 9 (SM 2.0), GeForceFX, Cg
- 2003: Halo
- 2004: DirectX 9.0c (SM 3.0), GeForce 6; Far Cry, UE3

All images are the property of their respective owners.

Programmable Shaders: GeForceFX (2002)
- Vertex and fragment operations specified in a small (macro) assembly language
- User-specified mapping of input data to operations
- Limited ability to use intermediate computed values to index input data (textures and vertex uniforms)

(Diagram: Input 0/1/2 feed an OP, which writes Temp 0/1/2)

    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x;

(Image comparison: No Lighting, Per-Vertex Lighting, Per-Pixel Lighting. Unreal, Epic)
Copyright NVIDIA Corporation 2006

Stunning Graphics Realism

Lush, Rich Worlds

Crysis 2006 Crytek / Electronic Arts

Incredible Physics Effects

Core of the Definitive Gaming Platform


Hellgate: London 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.

Full Spectrum Warrior: Ten Hammers 2006 Pandemic Studios, LLC. All rights reserved. 2006 THQ Inc. All rights reserved.

Recent trends
(Chart: multiplies per second, observed peak GFLOPS, July 2001 - January 2004: NVIDIA NV30/35/40 and ATI R300/360/420 pulling away from the Pentium 4)

GPU history: NVIDIA historicals

Date    Product         Process  Trans  MHz  GFLOPS (MUL)
Aug-02  GeForce FX5800  0.13     121M   500  -
Jan-03  GeForce FX5900  0.13     130M   475  20
Dec-03  GeForce 6800    0.13     222M   400  53

Translating transistors into performance:
- 1.8x increase in transistors
- 20% decrease in clock rate
- 6.6x GFLOPS speedup
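A quick numerical check of these ratios against the GPU history table above (the FX5800 GFLOPS entry is missing from the table, so the ~8 GFLOPS baseline below is inferred from the quoted 6.6x speedup, not taken from the slide):

```python
# Figures from the slide's table: GeForce FX5800 (121M transistors, 500 MHz)
# and GeForce 6800 (222M transistors, 400 MHz, 53 GFLOPS).
fx5800_trans, fx5800_mhz = 121e6, 500.0
gf6800_trans, gf6800_mhz, gf6800_gflops = 222e6, 400.0, 53.0

trans_ratio = gf6800_trans / fx5800_trans    # ~1.8x more transistors
clock_ratio = gf6800_mhz / fx5800_mhz        # 0.8, i.e. a 20% clock decrease
implied_fx5800_gflops = gf6800_gflops / 6.6  # inferred baseline, ~8 GFLOPS

print(round(trans_ratio, 2), clock_ratio, round(implied_fx5800_gflops, 1))
```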

Early GPGPU (2002)

www.gpgpu.org

Early Raytracing
- "Ray Tracing on Programmable Graphics Hardware", Purcell et al.
- "PDEs in Graphics Hardware", Strzodka, Rumpf
- "Fast Matrix Multiplies using Graphics Hardware", Larsen, McAllister
- "Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis", Thompson et al.

Programming model challenge
- Demonstrate GPU performance
- Required a PhD in computer graphics to do this
- Financial companies hiring game programmers
- GPU as a processor

Brook (2003): C with streams
- streams: collections of records requiring similar computation
  (e.g. particle positions, voxels, FEM cells)

    Ray r<200>;
    float3 velocityfield<100,100,100>;

- similar to arrays, but index operations disallowed: position[i]
- read/write stream operators:

    streamRead(positions, p_ptr);
    streamWrite(velocityfield, v_ptr);

kernels: functions applied to streams, similar to a for_all construct

    kernel void add (float a<>, float b<>,
                     out float result<>) {
        result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    add(a, b, c);

equivalent to:

    for (i=0; i<100; i++)
        c[i] = a[i] + b[i];
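The kernel/loop equivalence above can be sketched on the CPU. This Python model is illustrative only (the names are mine, not Brook syntax): a kernel is applied elementwise to every record of its input streams, which is exactly what the for loop spells out.

```python
# CPU sketch of Brook's kernel-over-streams semantics.
def run_kernel(kernel, *streams):
    """Apply `kernel` to corresponding records of the input streams."""
    return [kernel(*records) for records in zip(*streams)]

def add(a, b):
    # Stands in for: kernel void add(float a<>, float b<>, out float result<>)
    return a + b

a = [1.0] * 100
b = [2.0] * 100
c = run_kernel(add, a, b)  # same result as: for i in range(100): c[i] = a[i] + b[i]
print(c[0], len(c))
```

The point is that the programmer never writes the loop; the runtime is free to run the records in any order, or in parallel.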

Challenges

Hardware:
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Instruction sets: no integer & bit ops
- Communication limited: between pixels; no scatter (a[i] = p)

Software:
- Building the GPU Computing Ecosystem

(Diagram: GeForce 7800 pixel pipeline: Input Registers -> Fragment Program, with Texture, Constants, and Registers -> Output Registers)

Thread Programs

(Diagram: Thread Number -> Thread Program, with Texture, Constants, and Registers -> Output Registers)

Features:
- Millions of instructions
- Full integer and bit instructions
- No limits on branching, looping
- 1D, 2D, or 3D thread ID allocation

Global Memory

(Diagram: Thread Number -> Thread Program, with Texture, Constants, and Registers -> Global Memory)

Features:
- Fully general load/store to GPU memory: scatter/gather
- Programmer flexibility in how memory is accessed
- Untyped, not limited to fixed texture types
- Pointer support

Shared Memory

(Diagram: Thread Number -> Thread Program, with Texture, Constants, Registers, and Shared memory -> Global Memory)

Features:
- Dedicated on-chip memory
- Shared between threads for inter-thread communication
- Explicitly managed
- As fast as registers
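A CPU sketch of the pattern shared memory enables. This Python model is illustrative only (real CUDA code would use __shared__ and __syncthreads()): a block of threads loads its tile into a fast, explicitly managed shared buffer, then cooperates on a tree reduction.

```python
# CPU sketch of a shared-memory tree reduction within one thread block.
# Assumes the tile length is a power of two.
def block_reduce(tile):
    shared = list(tile)          # "shared memory": one copy per block
    stride = len(shared) // 2
    while stride > 0:
        for t in range(stride):  # each "thread" t adds its partner's value
            shared[t] += shared[t + stride]
        stride //= 2             # a barrier would sit here between steps
    return shared[0]

print(block_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Because the partial sums live on-chip, the threads communicate without round trips through video memory.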

Managing Communication with Shared Memory

(Diagram comparing three models, each computing Pn = P1+P2+P3+P4:
- CPU: Control, ALU, Cache, DRAM; a single thread sums P1..P4 out of cache
- GPGPU: fragment programs sum pairs (P1,P2 and P3,P4) with multiple passes through video memory
- GPU Computing: a Thread Execution Manager dispatches threads that cooperate on shared data (P1..P5) next to the ALUs, backed by DRAM)

GeForce 8800
- Build the architecture around the processor

(Diagram: Host -> Input Assembler; Setup/Rstr/ZCull; Vtx, Geom, and Pixel Thread Issue feed a unified Thread Processor array of SPs, grouped with texture fetch (TF) units and L1 caches, connected to L2 caches and framebuffer (FB) partitions)

NVIDIA Corporation 2007

GeForce 8800 GPU Computing
- Next step: expose the GPU as a massively parallel processor

(Diagram: Host -> Input Assembler -> Thread Execution Manager; eight groups of Thread Processors, each with its own Parallel Data Cache; load/store to Global Memory)

Building the GPU Computing Ecosystem
- Convince the world to program an entirely new kind of processor
- Tradeoffs between functional vs. performance requirements
- Deliver HPC feature parity
- Seed the larger ecosystem with foundational components

CUDA: C on the GPU
- A simple, explicit programming language solution
- Extend only where necessary:

    __global__ void KernelFunc(...);
    __shared__ int SharedVar;
    KernelFunc<<< 500, 128 >>>(...);

- Explicit GPU memory allocation: cudaMalloc(), cudaFree()
- Memory copy from host to device, etc.: cudaMemcpy(), cudaMemcpy2D(), ...

CUDA: Threading in a Data Parallel World
- Operations drive execution, not data
- Users are simply given a thread id
  - They decide which data element each thread accesses
  - One thread = a single data element, or a block, or a variable, or nothing
- No need for accessors, views, or built-ins

Flexibility:
- The data layout does not force the algorithm
- Blocking computation for the memory hierarchy (shared memory)
- Think about the algorithm, not the data
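A minimal CPU emulation of this threading model (the grid/block sizes and function names here are made up for illustration): each thread derives a global id from its block and thread indices and decides for itself which element, if any, it touches.

```python
# Serial emulation of KernelFunc<<<grid_dim, block_dim>>>(args...).
def launch(kernel, grid_dim, block_dim, *args):
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block, thread, block_dim, *args)

def scale(block_idx, thread_idx, block_dim, data, factor):
    i = block_idx * block_dim + thread_idx  # global thread id
    if i < len(data):                       # a thread may also do nothing
        data[i] *= factor

data = list(range(10))
launch(scale, 4, 8, data, 2.0)  # 32 threads cover 10 elements; extras idle
print(data)
```

Note that nothing about `data`'s layout dictated the launch shape; the mapping from thread id to element is entirely the programmer's choice.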

Divergence in Parallel Computing
- Removing divergence pain from parallel programming
- SIMD pain:
  - User required to SIMD-ify
  - User suffers when computation goes divergent
- GPUs decouple execution width from the programming model:
  - Threads can diverge freely
  - Inefficiency only when granularity exceeds native machine width
  - Hardware managed
- Managing divergence becomes a performance optimization
- Scalable
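The cost model behind this can be sketched as follows. Assuming an illustrative 32-wide SIMD group that executes both sides of a branch (masking lanes off) whenever its lanes disagree, divergence only adds time; it never produces wrong results.

```python
# Sketch of SIMD branch cost: a group pays for each side of an if/else
# that at least one of its lanes takes.
def simd_branch_cost(conditions, then_cost, else_cost):
    """Cycles for one SIMD group to execute an if/else over its lanes."""
    cost = 0
    if any(conditions):        # some lane takes the 'then' side
        cost += then_cost
    if not all(conditions):    # some lane takes the 'else' side
        cost += else_cost
    return cost

uniform   = simd_branch_cost([True] * 32, 10, 50)               # all lanes agree
divergent = simd_branch_cost([True] * 16 + [False] * 16, 10, 50)  # both sides run
print(uniform, divergent)
```

This is why managing divergence is an optimization rather than a correctness requirement: grouping like-minded threads recovers the uniform cost.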

Foundations
- Baseline HPC solution
- Ubiquity: CUDA everywhere

Software:
- C99, math.h
- BLAS & FFT

Hardware:
- GPU co-processor
- IEEE math (G80)
- Double precision (GT200)
- ECC (Fermi)

Customizing solutions, from ease of adoption to generality:
- Ported applications
- Domain libraries
- Domain-specific languages
- C
- Driver API
- PTX
- HW

PTX Virtual Machine and ISA

(Diagram: C/C++, DSL, and Fortran applications compile via a C/C++ compiler to PTX code; a translator or online code generator (OCG) produces target code for Tesla SM 1.0, Tesla SM 1.3, or Fermi SM 2.0)

PTX Virtual Machine:
- Programming model
- Execution resources and state
- Abstract and unify target details

PTX ISA (Instruction Set Architecture):
- Variable declarations
- Data initialization
- Instructions and operands

PTX Translator (OCG): PTX to target code

Translate PTX code to target code:
- At program build time
- At program install time
- Or JIT at program run time

The driver implements the PTX VM runtime, coupled with the translator

GPU Computing Software: Libraries and Engines
- GPU Computing Applications
- Application Acceleration Engines (AXEs): SceniX, CompleX, OptiX, PhysX
- Foundation Libraries: CUBLAS, CUFFT, CULA, NVCUVID/VENC, NVPP, Magma
- Development Environment: C, C++, Fortran, Python, Java, OpenCL, DirectCompute, ...
- CUDA Compute Architecture

Directions
- Hardware and software are one
- Within the node:
  - OS integration: scheduling, preemption, virtual memory
  - Result: programming model simplification
- Expanding to the cluster:
  - Cluster-wide communication and synchronization
- GPU on-load:
  - Enhance the programming model to keep more of the computation on the GPU (less CPU interaction) and more of the data (less host-side shadowing)

Thank You!


Additional slide credits: John Montrym & David Kirk
