The Evolution of GPUs for General Purpose Computing
Ian Buck | Sr. Director, GPU Computing Software
San Jose Convention Center, CA | September 20-23, 2010
Talk Outline
History of early graphics hardware
First GPU Computing
Dates: prior to 1987
Early Framebuffers
By the mid-1970s one could afford framebuffers with a few bits per pixel at modest resolution
"A Random Access Video Frame Buffer," Kajiya, Sutherland, Cheadle, 1975
Pixel: lighting
Dates: 1987-1992
Pixel: more, faster
Dates: 1990s
Desktop 3D workstations under $5000
Single-board, multi-chip graphics subsystems
Rise of 3D on the PC
40-company free-for-all until intense competition knocked out all but a few players
Many were decelerators, and easy to beat
Single-chip GPUs
Interesting hardware experimentation
PCs would take over the workstation business
Interesting consoles
3DO, Nintendo, Sega, Sony
DirectX / GPU / game timeline:

  Year       API           Shader capability         NVIDIA GPU    Game
  1998       DirectX 6     Multitexturing            Riva TNT      Half-Life
  1999-2000  DirectX 7     T&L, TextureStageState    GeForce 256   Quake 3
  2001       DirectX 8     SM 1.x                    GeForce 3     Giants
  2002-2003  DirectX 9     SM 2.0, Cg (2002)         GeForce FX    Halo
  2004       DirectX 9.0c  SM 3.0                    GeForce 6     Far Cry, UE3
[Figure: per-pixel lighting fragment program instruction listing (ADDR, DP3R, RSQR, MULR, ADDR, DP3R, RSQR, MADR, MULR, DP3R, MAXR), each op reading input and temp registers and writing a temp]
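As a rough guide to what that instruction sequence computes, here is a hedged C for CUDA sketch (the function and parameter names are illustrative, not from the slides): two normalizations built from dot products and reciprocal square roots, then a clamped dot product for the diffuse term.

#include <cuda_runtime.h>

// Hedged sketch of the math behind the listing above: DP3 is a 3-component dot
// product, RSQ a reciprocal square root, MAD a multiply-add, MAX a clamp.
__device__ float3 per_pixel_diffuse(float3 N, float3 L, float3 albedo)
{
    // DP3 + RSQ + MUL: normalize the interpolated surface normal
    float invN = rsqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
    N.x *= invN; N.y *= invN; N.z *= invN;

    // DP3 + RSQ + MUL: normalize the light direction
    float invL = rsqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
    L.x *= invL; L.y *= invL; L.z *= invL;

    // DP3 + MAX: clamped Lambertian term
    float ndotl = fmaxf(N.x*L.x + N.y*L.y + N.z*L.z, 0.0f);

    return make_float3(albedo.x * ndotl, albedo.y * ndotl, albedo.z * ndotl);
}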
[Image comparison: No Lighting vs. Per-Vertex Lighting vs. Per-Pixel Lighting]
Unreal © Epic Games
Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.
Full Spectrum Warrior: Ten Hammers © 2006 Pandemic Studios, LLC. All rights reserved. © 2006 THQ Inc. All rights reserved.
[Chart: recent trends in multiplies per second (observed peak), in GFLOPS, July 2001 through January 2004]
GPU history: NVIDIA historicals

  Date    Product          Process  Trans  MHz  GFLOPS (MUL)
  Aug-02  GeForce FX5800   0.13     121M   500
  Jan-03  GeForce FX5900   0.13     130M   475  20
  Dec-03  GeForce 6800     0.13     222M   400  53
www.gpgpu.org
Early Raytracing
Brook (2003)
C with streams

streams
  collection of records requiring similar computation
  e.g. particle positions, voxels, FEM cells, ...

    Ray r<200>;
    float3 velocityfield<100,100,100>;

kernels
  functions applied to streams
  similar to a for_all construct

    kernel void add(float a<>, float b<>,
                    out float result<>) {
        result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    add(a, b, c);
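For comparison, a minimal sketch (not from the original slides; kernel and buffer names are illustrative) of how the same element-wise add maps onto CUDA C, where each thread stands in for one stream element.

#include <cuda_runtime.h>

// Each CUDA thread handles one element, much like one stream record in Brook.
__global__ void add(const float *a, const float *b, float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = a[i] + b[i];
}

// Host-side launch over n = 100 elements, mirroring add(a, b, c) in Brook.
void run_add(const float *d_a, const float *d_b, float *d_c, int n)
{
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);
}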
Challenges

Hardware / Software:
  Addressing modes
  Shader capabilities: limited outputs
  Instruction sets: integer & bit ops
  Communication limited: between pixels; scatter (a[i] = p)

[Diagram, shown twice: the DX9 fragment-processor model, where Input Registers, Texture, and Constants feed the Fragment Program and its Registers, and results leave only through Output Registers]

Building the GPU Computing Ecosystem
Thread Programs

[Diagram: a Thread Number feeds a Thread Program with access to Texture, Constants, and Registers, writing to Output Registers]

Features:
  Millions of instructions
  Full integer and bit instructions
  No limits on branching, looping
  1D, 2D, or 3D thread ID allocation (see the sketch below)
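A minimal sketch (illustrative kernel, not from the slides) of those features in a CUDA thread program: a 2D thread ID, a data-dependent loop, and branching, none of which a DX9 fragment program could express freely.

#include <cuda_runtime.h>

// Each thread gets a 2D ID and iterates until its own convergence test fails.
__global__ void iterate2d(float *grid, int width, int height, int max_steps)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 2D thread ID
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;           // branching

    float v = grid[y * width + x];
    for (int step = 0; step < max_steps && fabsf(v) < 4.0f; ++step)
        v = v * v + 0.25f;                           // arbitrary per-thread iteration

    grid[y * width + x] = v;
}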
Global Memory

[Diagram: a Thread Number feeds a Thread Program with access to Texture, Constants, and Registers; the program now loads from and stores to Global Memory]

Features: full load/store access to global memory from the thread program (gather and scatter; see the sketch below)
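A short sketch (the histogram framing is illustrative) of the scatter, a[i] = p with a data-dependent address, that the fragment-program model could not express; atomicAdd resolves collisions when threads target the same bin.

#include <cuda_runtime.h>

// Scatter: each thread computes a destination index and writes there.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // data-dependent write address
}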
Shared Memory

[Diagram: a Thread Number feeds a Thread Program with access to Texture, Constants, Registers, and Shared memory, backed by Global Memory]

Features:
  Explicitly managed
  As fast as registers (see the sketch below)
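A minimal sketch of explicitly managed shared memory (illustrative kernel, launch with 256 threads per block): the block stages its values in __shared__ storage, synchronizes, and combines them, the Pn = P1 + P2 + P3 + P4 pattern of the figures that follow.

#include <cuda_runtime.h>

// Each block loads its elements into shared memory, then cooperatively sums them.
// Assumes blockDim.x == 256 (a power of two).
__global__ void block_sum(const float *p, float *block_totals, int n)
{
    __shared__ float s[256];                       // explicitly managed, on-chip
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? p[i] : 0.0f;                // stage P1, P2, ... in shared memory
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];             // threads cooperate through shared data
        __syncthreads();
    }

    if (tid == 0)
        block_totals[blockIdx.x] = s[0];           // Pn for this block
}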
GPGPU
[Figure: computing Pn = P1 + P2 + P3 + P4 the GPGPU way. On the CPU, Control and ALUs with a Cache over DRAM run a single thread out of cache; on the GPU, per-pixel Control/ALU units under Program/Control combine pairs (P1, P2) and (P3, P4) through multiple passes over Video Memory.]

GPU Computing
[Figure: the same Pn = P1 + P2 + P3 + P4 under GPU Computing. A Thread Execution Manager dispatches cooperating threads (Control and ALUs) that exchange partial results P1 ... P5 through Shared Data backed by DRAM.]
GeForce 8800
Build the architecture around the processor
[Diagram: GeForce 8800 block diagram. The Host and Input Assembler feed an array of thread processors (SPs) grouped with texture fetch (TF) units and L1 caches, connected through L2 caches to the frame buffer (FB) partitions.]
[Diagram: the Host and Input Assembler feed a Thread Execution Manager, which dispatches work to eight groups of Thread Processors, each with its own Parallel Data Cache; all groups load/store to Global Memory.]
Flexibility
Data layout no longer forces the algorithm
Blocking computation for the memory hierarchy (shared memory; see the sketch below)
Think about the algorithm, not the data
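A hedged sketch of blocking for the memory hierarchy (illustrative kernel and tile size, not from the slides): each block copies a tile of the input into shared memory once, then every thread in the block reuses it, so both global-memory accesses stay coalesced.

#include <cuda_runtime.h>

#define TILE 16

// Blocked matrix transpose: stage a tile in shared memory so that the read
// from and the write to global memory are both coalesced.
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Write the transposed tile; block indices swap roles.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];
}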
Foundations
Baseline HPC solution
Ubiquity: CUDA Everywhere
Software:
  C99 Math.h
  BLAS & FFT
  GPU co-processor

Hardware:
  IEEE math (G80)
  Double precision (GT200)
  ECC (Fermi)
Customizing Solutions

[Stack, from ease of adoption at the top to generality at the bottom:
  Ported Applications
  Domain Libraries
  Domain-specific languages
  C
  Driver API
  PTX
  HW]
[Diagram: DSL, Fortran, online code generation, and C/C++ front ends feed the Compiler, which emits PTX Code; a Translator then produces target code for a specific GPU (e.g. Tesla, SM 1.3).]

PTX carries variable declarations, data initialization, and instructions with their operands.
PTX to Target
  Programming model
  Execution resources and state
  Abstract and unify target details

[Diagram: the same PTX Code is translated for either a Tesla target (SM 1.0) or a Fermi target (SM 2.0).]
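A hedged sketch of the Driver API path from the stack above (file name, kernel name, and sizes are illustrative; error checking omitted; uses the cuLaunchKernel entry point from later driver-API revisions): load a PTX module at runtime and let the driver translate it to the installed GPU's target code. It assumes the kernel was compiled with nvcc -ptx from a definition declared extern "C", so its PTX entry is named add.

#include <cuda.h>

// Load PTX produced offline and launch its "add" kernel over n floats.
void launch_from_ptx(CUdeviceptr d_a, CUdeviceptr d_b, CUdeviceptr d_c, int n)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "add.ptx");            // driver JIT-translates PTX for this GPU
    cuModuleGetFunction(&fn, mod, "add");

    void *args[] = { &d_a, &d_b, &d_c, &n };
    int threads = 128, blocks = (n + threads - 1) / threads;
    cuLaunchKernel(fn, blocks, 1, 1, threads, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}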
Foundation Libraries
CUBLAS, CUFFT, CULA, NVCUVID/VENC, NVPP, Magma
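A minimal sketch of leaning on one of these foundation libraries, CUFFT, for a batched 1D complex-to-complex transform (sizes are illustrative; error checking omitted).

#include <cufft.h>
#include <cuda_runtime.h>

// In-place forward FFT on `batch` signals of length `nx`, already resident on the GPU.
void forward_fft(cufftComplex *d_signal, int nx, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);               // plan once, reuse if possible
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place transform
    cudaDeviceSynchronize();
    cufftDestroy(plan);
}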
Development Environment
C, C++, Fortran, Python, Java, OpenCL, DirectCompute, ...
Directions
Hardware and Software are one
Within the Node
OS integration: Scheduling, Preemption, Virtual Memory
Results: Programming model simplification
GPU on-load
Enhance the programming model to keep more of the computation on the GPU (less CPU interaction) and more of the data (less host-side shadowing).
Thank You!