The End of the GPU Roadmap: Tim Sweeney, CEO and Founder, Epic Games

Tim Sweeney is the CEO and founder of Epic Games. He discusses the limitations of current GPU architectures and graphics pipelines, arguing that the fixed-function nature of GPUs has reached a plateau and is limiting further advances in graphics quality. Sweeney envisions a future where software rendering techniques bypass GPU limitations and unlock new graphics capabilities such as ray tracing, advanced antialiasing, and volumetric rendering. This would require general-purpose programmability of graphics processors.



Tim Sweeney

CEO, Founder
Epic Games
[email protected]

THE END OF THE GPU ROADMAP
Background:
Epic Games
Background: Epic Games
 Independent game developer
 Located in Raleigh, North Carolina, USA
 Founded in 1991
 Over 30 games released
 Gears of War
 Unreal series
 Unreal Engine 3 is used by hundreds of games
History:
Unreal Engine
Unreal Engine 1
1996-1999
 First modern game engine
 Object-oriented
 Real-time, visual toolset
 Scripting language
 Last major software renderer
 Software texture mapping
 Colored lighting, shadowing
 Volumetric lighting & fog
 Pixel-accurate culling
 25 games shipped
Unreal Engine 2
2000-2005

 PlayStation 2, Xbox, PC
 DirectX 7 graphics
 Single-threaded
 40 games shipped
Unreal Engine 3
2006-2012
 PlayStation 3, Xbox 360, PC
 DirectX 9 graphics
 Pixel shaders
 Advanced lighting & shadowing
 Multithreading (6 threads)
 Advanced physics
 More visual tools
 Game Scripting
 Materials
 Animation
 Cinematics…
 150 games in
development
Unreal Engine 3 Games

Army of Two (Electronic Arts)


Mass Effect (BioWare)

Undertow (Chair Entertainment)


BioShock (2K Games)
Game Development: 2009
Gears of War 2: Project Overview
 Project Resources
 15 programmers
 45 artists
 2-year schedule
 $12M development budget
 Software Dependencies
 1 middleware game engine
 ~20 middleware libraries
 Platform libraries
Gears of War 2: Software Dependencies

Gears of War 2
Gameplay Code
~250,000 lines C++, script code

Unreal Engine 3
Middleware Game Engine
~2,000,000 lines C++ code

DirectX (Graphics), OpenAL (Audio), SpeedTree (Rendering), FaceFX (Face Animation), Bink (Movie Codec), ZLib (Data Compression), …
Hardware:
History
Computing History

1985 Intel 80386: Scalar, in-order CPU


1989 Intel 80486: Caches!
1993 Pentium: Superscalar execution
1995 Pentium Pro: Out-of-order execution
1999 Pentium 3: Vector floating-point
2003 AMD Opteron: Multi-core
2006 PlayStation 3, Xbox 360: “Many-core”
…and we’re back to in-order execution
Graphics History

1984 3D workstation (SGI)


1997 GPU (3dfx)
2002 DirectX9, Pixel shaders (ATI)
2006 GPU with full programming language
(NVIDIA GeForce 8)
2009? x86 CPU/GPU Hybrid
(Intel Larrabee)
Hardware:
2012-2020
Hardware: 2012-2020
[Diagram: ten in-order processor cores, each running 4 threads with its own instruction and data caches (I$/D$), all sharing an L2 cache]

Intel Larrabee
 x86 CPU-GPU Hybrid
 C/C++ Compiler
 DirectX/OpenGL
 Many-core, vector architecture
 Teraflop-class performance

NVIDIA GeForce 8
 General Purpose GPU
 CUDA “C” Compiler
 DirectX/OpenGL
 Many-core, vector architecture
 Teraflop-class performance
Hardware: 2012-2020

CONCLUSION
CPU, GPU architectures are getting closer
THE GPU TODAY
The GPU Today

 Large frame buffer


 Complicated pipeline
 It’s fixed-function
 But we can specify
shader programs
that execute in certain pipeline stages
Shader Program Limitations

 No random-access memory writes


 Can write to current pixel in frame buffer
 Can’t create data structures
 Can’t traverse data structures
 Can hack it using texture accesses
 Hard to share data between the main program
and shader programs
 Weird programming language
 HLSL rather than C/C++

Result: “The Shader ALU Plateau”


Antialiasing Limitations

 MSAA & Oversampling


 Every 1 bit of output precision costs up to
2X memory & performance!
 Ideally want 10-20 bits
 Discrete sampling (in general)
 Texture filtering only implies antialiasing when
shader equation is linear
 Most shader equations are nonlinear

Aliasing is the #1 visual artifact in Gears of War


Texture Sampling Limitations

 Inherent artifacts of bilinear/trilinear


 Poor approximation of Integrate(color,area)
in the presence of:
 Small triangles
 Texture seams
 Alpha translucency
 Masking
 Fixed-function = poor scalability
 Megatexture, etc
Frame Buffer Model Limitation

 Frame buffer: 1 (or n) layers of 4-vectors,


where n = small constant
 Ineffective for
 General translucency
 Complex shadowing models
 Memory bandwidth requirement =
FPS × Pixel Count × Layers × Depth × 2^n,
where n = quality of MSAA
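
As a rough back-of-the-envelope check of the formula above (the numbers here, 60 FPS at 1920x1080 with 4 layers, 16 bytes per pixel per layer, and 4x MSAA, are illustrative assumptions, not figures from the talk):

#include <cstdio>

int main() {
    // Illustrative assumptions, not figures from the talk.
    double fps    = 60.0;
    double pixels = 1920.0 * 1080.0;
    double layers = 4.0;              // frame buffer layers
    double depth  = 16.0;             // bytes per pixel per layer
    double msaa   = 4.0;              // 2^n with n = 2
    double bytesPerSecond = fps * pixels * layers * depth * msaa;
    std::printf("%.1f GB/s\n", bytesPerSecond / 1e9);   // about 32 GB/s
    return 0;
}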
Summary of Limitations

 “The Shader ALU Plateau”


 Antialiasing limitations
 Texture Sampling limitations
 Frame Buffer limitations
The Meta-Problem:

 The fixed-function pipeline is


too fixed to solve its problems
 Result:
 All games look similar
 Derive little benefit from Moore’s Law
 Crysis on a high-end NVIDIA SLI system looks at most
marginally better than top Xbox 360 games

This is a market BEGGING


to be disrupted :-)
SO...
Return to 100% “Software” Rendering

 Bypass the OpenGL/DirectX API


 Implement a 100% software renderer
 Bypass all fixed-function pipeline hardware
 Generate image directly
 Build & traverse complex data structures
 Unlimited possibilities

Could implement this…


 On Intel CPU using C/C++
 On NVIDIA GPU using CUDA (no DirectX)
Software Rendering in Unreal 1 (1998)

Ran 100% on CPU


No GPU required!

Features
 Real-time colored lighting
 Volumetric Fog
 Tiled Rendering
 Occlusion Detection
Software Rendering in 1998 vs 2012

60 MHz Pentium could execute:


16 operations per pixel
at 320x200, 30 Hz

In 2012, a 4 Teraflop processor


would execute:
16000 operations per pixel
at 1920x1080, 60 Hz

Assumption: Using 50% of computing power for graphics, 50% for gameplay
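
A minimal sketch verifying the ops-per-pixel arithmetic above, assuming half the machine's compute is reserved for graphics:

#include <cstdio>

int main() {
    // 1998: 60 MHz Pentium, 320x200 at 30 Hz, 50% of cycles for graphics.
    double ops1998 = 60e6 * 0.5 / (320.0 * 200.0 * 30.0);      // about 16 ops/pixel
    // 2012: 4 TFLOP processor, 1920x1080 at 60 Hz, 50% for graphics.
    double ops2012 = 4e12 * 0.5 / (1920.0 * 1080.0 * 60.0);    // about 16000 ops/pixel
    std::printf("1998: %.0f ops/pixel, 2012: %.0f ops/pixel\n", ops1998, ops2012);
    return 0;
}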
Future Graphics:
Raytracing

 For each pixel


 Cast a ray off into scene
 Determine which objects were hit
 Continue for reflections, refraction, etc

 Consider
 Less efficient than pure rendering
 Can use for reflections in traditional render
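
A minimal per-pixel ray-cast sketch against a single sphere; the scene, camera, and ASCII output are illustrative assumptions, not the renderer described in the talk:

#include <cmath>
#include <cstdio>

struct Vec { float x, y, z; };
static float dot(Vec a, Vec b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// True if a ray from 'origin' along unit direction 'dir' hits the sphere.
static bool hitSphere(Vec origin, Vec dir, Vec center, float radius) {
    Vec oc = { origin.x - center.x, origin.y - center.y, origin.z - center.z };
    float b = 2.0f * dot(oc, dir);
    float c = dot(oc, oc) - radius * radius;
    return b * b - 4.0f * c >= 0.0f;                 // discriminant test
}

int main() {
    const int W = 32, H = 16;
    Vec eye = { 0, 0, 0 }, center = { 0, 0, -3 };
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            // For each pixel: cast a ray off into the scene and see what it hits.
            Vec d = { (x - W / 2) / float(W), (y - H / 2) / float(H), -1.0f };
            float len = std::sqrt(dot(d, d));
            d = { d.x / len, d.y / len, d.z / len };
            std::putchar(hitSphere(eye, d, center, 1.0f) ? '#' : '.');
        }
        std::putchar('\n');
    }
    return 0;
}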
Future Graphics:
The REYES Rendering Model
 “Dice” all objects in scene down into sub-pixel-
sized triangles
 Rendering with
 Flat Shading (!)
 Analytic antialiasing
 Per-pixel occlusion
(A-Buffer/BSP)
 Benefits
 Displacement maps for free
 Analytic Antialiasing
 Advanced filtering (Gaussian)
 Eliminates texture sampling
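
A minimal sketch of the dicing step, assuming a hypothetical parametric patch and a fixed projected size; a real REYES pipeline chooses dice rates adaptively per patch:

#include <cmath>
#include <cstdio>

struct Vec { float x, y, z; };

// Hypothetical parametric patch over [0,1]^2 (a gently curved surface).
static Vec patch(float u, float v) {
    return { u, v, 0.25f * std::sin(6.28318f * u) * std::sin(6.28318f * v) };
}

int main() {
    const float pixelsAcross = 64.0f;                 // assumed projected size of the patch
    // Choose a dice rate so every micropolygon is smaller than one pixel.
    const int rate = int(std::ceil(pixelsAcross)) + 1;
    int vertices = 0;
    for (int i = 0; i <= rate; i++)
        for (int j = 0; j <= rate; j++) {
            Vec p = patch(i / float(rate), j / float(rate));
            (void)p;                                  // flat-shade, then sample into the A-buffer
            vertices++;
        }
    std::printf("diced into %d micropolygon vertices (%d quads)\n", vertices, rate * rate);
    return 0;
}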
Future Graphics:
The REYES Rendering Model

Today’s Pipeline
 Build 4M poly “high-res” character
 Generate normal maps from the high-res geometry
 Rendering 20K poly “low-res” character in-game

Potential 2012 Pipeline
 Build 4M poly “high-res” character
 Render it in-game!
 Advanced LOD scheme assures proper sub-pixel sized triangles
Future Graphics:
Volumetric Rendering
 Direct Voxel Rendering
 Raycasting
 Efficient for trees, foliage
 Tessellated Volume Rendering
 Marching Cubes
 Marching Tetrahedrons
 Point Clouds
 Signal-Space Volume Rendering
 Fourier Projection Slice Theorem
 Great for clouds, translucent volumetric data
Future Graphics:
Software Tiled Rendering

 Split the frame buffer up into bins


 Example: 1 bin = 8x8 pixels
 Process one bin at a time
 Transform, rasterize all objects in the bin

 Consider
 Cache efficiency
 Deep frame buffers, antialiasing
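
A minimal sketch of the binning idea above (structures and sizes are illustrative assumptions):

#include <cstdio>
#include <vector>

struct Bounds { int x0, y0, x1, y1; };                // screen-space bounding box

int main() {
    const int W = 64, H = 32, BIN = 8;                // 1 bin = 8x8 pixels
    const int binsX = W / BIN, binsY = H / BIN;
    std::vector<Bounds> objects = { { 3, 3, 20, 12 }, { 30, 5, 60, 28 } };
    std::vector<std::vector<int>> bins(binsX * binsY);

    // Binning pass: record which objects touch which bin.
    for (int i = 0; i < (int)objects.size(); i++) {
        const Bounds& b = objects[i];
        for (int by = b.y0 / BIN; by <= b.y1 / BIN; by++)
            for (int bx = b.x0 / BIN; bx <= b.x1 / BIN; bx++)
                bins[by * binsX + bx].push_back(i);
    }

    // Processing pass: one bin at a time, so its pixels (and deep frame
    // buffer samples) stay resident in cache while its objects are rasterized.
    for (int by = 0; by < binsY; by++)
        for (int bx = 0; bx < binsX; bx++)
            std::printf("bin (%d,%d): %zu objects\n", bx, by, bins[by * binsX + bx].size());
    return 0;
}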
Hybrid Graphics Algorithms

 Analytic Antialiasing
– Analytic solution, better than 1024x MSAA

 Sort-independent translucency
– Sorted linked-list per pixel of fragments requiring per-pixel memory
allocation, pointer-following, conditional branching (A-Buffer).

 Advanced shadowing techniques


– Physically accurate per-pixel penumbra volumes
– Extension of well-known stencil buffering algorithm
– Requires storing, traversing, and updating a very simple BSP tree per-
pixel, with memory allocation and pointer following.

 Scenes with very large numbers of objects


– Fixed-function GPU + API has 10X-100X state change disadvantage
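
A minimal sketch of the per-pixel fragment list (A-buffer) that the sort-independent translucency and penumbra techniques above rely on; the structures and resolve policy here are illustrative assumptions:

#include <algorithm>
#include <cstdio>
#include <vector>

struct Fragment { float depth; unsigned color; int next; };   // next = pool index or -1

struct ABuffer {
    std::vector<int> head;                   // one linked-list head per pixel
    std::vector<Fragment> pool;              // shared fragment pool (memory allocation)
    explicit ABuffer(int pixels) : head(pixels, -1) {}

    // Append a fragment to a pixel's list, in any order.
    void insert(int pixel, float depth, unsigned color) {
        pool.push_back({ depth, color, head[pixel] });
        head[pixel] = (int)pool.size() - 1;
    }

    // Walk the list (pointer following) and sort back-to-front for compositing.
    std::vector<Fragment> resolve(int pixel) const {
        std::vector<Fragment> frags;
        for (int i = head[pixel]; i != -1; i = pool[i].next) frags.push_back(pool[i]);
        std::sort(frags.begin(), frags.end(),
                  [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });
        return frags;
    }
};

int main() {
    ABuffer ab(4);
    ab.insert(0, 0.3f, 0xff0000u);
    ab.insert(0, 0.7f, 0x00ff00u);
    for (const Fragment& f : ab.resolve(0)) std::printf("depth %.1f color %06x\n", f.depth, f.color);
    return 0;
}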
Graphics: 2012-2020
Potential Industry Goals

Achieve movie-quality:
 Antialiasing
 Direct Lighting
 Shadowing
 Particle Effects
 Reflections

Significantly improve:
 Character animation
 Object counts
 Indirect lighting
SOFTWARE IMPLICATIONS
Software Implications

Software must scale to…


• 10’s – 100’s of threads
• Vector instruction sets
Software Implications

Programming Models
• Shared State Concurrency
• Message Passing
• Pure Functional Programming
• Software Transactional Memory
Multithreading in Unreal Engine 3:
“Task Parallelism”
 Gameplay thread
 AI, scripting
 Thousands of interacting objects

 Rendering thread
 Scene traversal, occlusion
 Direct3D command submission

 Pool of helper threads for other work


 Physics Solver
 Animation Updates

Good for 4 threads.


No good for 100 threads!
“Shared State Concurrency”
The standard C++/Java threading model

 Many threads are running


 There is 512MB of data
 Any thread can modify any data at any time
 All synchronization is explicit, manual
 See: LOCK, MUTEX, SEMAPHORE
 No compile-time verification of correctness properties:
 Deadlock-free
 Race-free
 Invariants
Multithreaded Gameplay Simulation:
Manual Synchronization

Idea:
 Update objects in multiple threads
 Each object contains a lock
 “Just lock an object before using it”

Problems:
 “Deadlocks”
 “Data Races”
 Debugging is difficult/expensive
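
A minimal sketch of the deadlock hazard: two threads that lock the same pair of objects in opposite order. The object and function names are illustrative, and note that this program can genuinely hang, which is the point:

#include <mutex>
#include <thread>

struct GameObject { std::mutex lock; int health = 100; };

void attack(GameObject& attacker, GameObject& target) {
    std::lock_guard<std::mutex> a(attacker.lock);    // thread 1 takes A then B...
    std::lock_guard<std::mutex> b(target.lock);      // ...thread 2 takes B then A: deadlock
    target.health -= 10;
}

int main() {
    GameObject a, b;
    std::thread t1(attack, std::ref(a), std::ref(b));
    std::thread t2(attack, std::ref(b), std::ref(a));  // opposite lock order
    t1.join();
    t2.join();
    // Acquiring both locks atomically (e.g. std::scoped_lock(attacker.lock, target.lock))
    // avoids this particular deadlock, but the discipline is still manual.
    return 0;
}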
Multithreaded Gameplay Simulation:
“Message Passing”

Idea:
 Update objects in multiple threads
 Each object can only modify itself
 Communicate with other objects by sending
messages

Problems:
 Requires writing 1000’s of message protocols
 Still need synchronization
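
A minimal single-threaded sketch of the message-passing style: each object owns its state and a mailbox, and other code communicates by posting messages instead of mutating the object directly. The message protocol here is illustrative; across threads the mailbox itself would still need synchronization:

#include <cstdio>
#include <deque>

struct Message { int type; int value; };              // e.g. type 0 = "take damage"

struct Actor {
    int health = 100;
    std::deque<Message> mailbox;

    void send(const Message& m) { mailbox.push_back(m); }

    // Only the owning thread drains the mailbox, so 'health' needs no lock.
    void update() {
        while (!mailbox.empty()) {
            Message m = mailbox.front();
            mailbox.pop_front();
            if (m.type == 0) health -= m.value;
        }
    }
};

int main() {
    Actor player;
    player.send({ 0, 25 });
    player.update();
    std::printf("health = %d\n", player.health);
    return 0;
}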
Pure Functional Programming

“Pure Functional” programming style:


• Define algorithms that don’t write to shared
memory or perform I/O operations

(their only effect is to return a result)

Examples:
• Collision Detection
• Physics Solver
• Pixel Shading
Pure Functional Programming

“Inside a function with no side effects,


sub-computations can be run in any order,
or concurrently,
without affecting the function’s result”

With this property:


• A programmer can explicitly multithread the
code, safely.
• Future compilers will be able to automatically
multithread the code, safely.

See: “Implementing Lazy Functional Languages on Stock Hardware”;


Simon Peyton Jones; Journal of Functional Programming 2005
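
A minimal sketch of that property: a side-effect-free function applied to independent inputs can be launched in any order or concurrently without changing the result. The function here is an illustrative stand-in for collision-detection-style math:

#include <cmath>
#include <cstdio>
#include <future>
#include <vector>

// Pure function: reads only its arguments, writes nothing shared, returns a result.
static float circleDistance(float ax, float ay, float bx, float by, float ra, float rb) {
    float dx = ax - bx, dy = ay - by;
    return std::sqrt(dx * dx + dy * dy) - (ra + rb);   // negative means overlap
}

int main() {
    std::vector<std::future<float>> jobs;
    for (int i = 0; i < 8; i++)
        // Each call touches only its own inputs, so launch order is irrelevant.
        jobs.push_back(std::async(std::launch::async, circleDistance,
                                  float(i), 0.0f, 0.0f, 0.0f, 1.0f, 1.0f));
    for (auto& j : jobs) std::printf("%g\n", j.get());
    return 0;
}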
Multithreaded Gameplay Simulation:
Software Transactional Memory
Idea:
 Update objects in multiple threads
 Each thread runs inside a transaction block
and has an atomic view of its “local” changes to memory
 C++ runtime detects conflicts between transactions
 Non-conflicting transactions are applied to “global” memory
 Conflicting transactions are “rolled back” and re-run
Implemented 100% in software; no custom hardware required.

Problems:
 “Object update” code must be free of side-effects
 Requires C++ runtime support
 Cost around 30% performance

See: “Composable Memory Transactions”; Tim Harris, Simon Marlow, Simon Peyton Jones,
and Maurice Herlihy. ACM Conference on Principles and Practice of Parallel Programming 2005
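
C++ has no standard STM, so the following is only a toy illustration of the optimistic run, detect-conflict, roll-back-and-retry loop, using a single atomic word as the "global memory"; a real STM tracks read and write sets across many locations:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> gold{ 1000 };

void spendGold(int amount) {
    for (;;) {
        int snapshot = gold.load();                    // "local" view of memory
        int updated  = snapshot - amount;              // side-effect-free update
        // Commit only if nobody changed it meanwhile; otherwise roll back and re-run.
        if (gold.compare_exchange_weak(snapshot, updated))
            return;
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 10; i++) threads.emplace_back(spendGold, 10);
    for (auto& t : threads) t.join();
    std::printf("gold = %d\n", gold.load());           // always 900, no lost updates
    return 0;
}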
Vectorization
Supporting “Vector Instruction Sets” efficiently

NVIDIA GeForce 8:
• 8 to 15 cores
• 16-wide vectors
Vectorization

C++, Java compilers generate “scalar” code

GPU Shader compilers generate “vector” code


 Arbitrary vector size (4, 16, 64, …)
 N-wide vectors yield N-wide speedup
Vectorization: “The Old Way”

 “Old Vectors” (SIMD):


Intel SSE, Motorola Altivec
 4-wide vectors
 4-wide arithmetic operations
 Vector loads
Load vector register from vector stored in memory

 Vector swizzle & mask


Future Programming Models:
Vectorization

 “Old Vectors”
 Intel SSE, Motorola Altivec

vec4 x, y, z;
...
z = x + y;

  x0 x1 x2 x3
+ y0 y1 y2 y3
= z0 z1 z2 z3
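
A minimal working version of the 4-wide addition above using SSE intrinsics:

#include <cstdio>
#include <xmmintrin.h>      // SSE intrinsics (x86)

int main() {
    float x[4] = { 1, 2, 3, 4 };
    float y[4] = { 10, 20, 30, 40 };
    float z[4];
    __m128 vx = _mm_loadu_ps(x);        // vector load from memory
    __m128 vy = _mm_loadu_ps(y);
    __m128 vz = _mm_add_ps(vx, vy);     // one instruction adds all four lanes
    _mm_storeu_ps(z, vz);
    std::printf("%g %g %g %g\n", z[0], z[1], z[2], z[3]);
    return 0;
}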
Vectorization: “New Vectors”

(ATI, NVIDIA GeForce 8, Intel Larrabee)


 16-wide vectors
 16-wide arithmetic
 Vector loads/stores
 Load 16-wide vector register from scalars
from 16 independent memory addresses,
where the addresses are stored in a vector!
 Analogy: Register-indexed constant access in DirectX

 Conditional vector masks
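
A minimal scalar emulation of the gather described above: a 16-wide load from 16 independent addresses held in an index vector, honoring a per-lane mask. The names are illustrative, not a real instruction set:

#include <cstdio>

const int N = 16;

// Masked gather: each active lane reads base[idx[lane]] into its own slot.
void gather(const float* base, const int idx[N], const bool mask[N], float out[N]) {
    for (int lane = 0; lane < N; lane++)
        if (mask[lane])
            out[lane] = base[idx[lane]];
}

int main() {
    float table[64];
    for (int i = 0; i < 64; i++) table[i] = 0.5f * float(i);
    int idx[N];
    bool mask[N];
    float v[N] = {};
    for (int lane = 0; lane < N; lane++) {
        idx[lane]  = (lane * 7) % 64;        // 16 unrelated addresses
        mask[lane] = (lane % 2 == 0);        // only even lanes are active
    }
    gather(table, idx, mask, v);
    for (int lane = 0; lane < N; lane++) std::printf("%g ", v[lane]);
    std::printf("\n");
    return 0;
}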


“New SIMD” is better than “Old SIMD”

 “Old Vectors” were only useful when dealing


with vector-like data types:
 “XYZW” vectors from graphics
 4x4 matrices

 “New Vectors” are far more powerful:


Any loop whose body has a statically-known call graph
free of sequential dependencies can be “vectorized”,
or compiled into an equivalent 16-wide vector
program. And it runs up to 16X faster!
“New Vectors” are universal
(Mandelbrot set generator)

int n;
cmplx coords[];
int color[] = new int[n];

for(int i=0; i<n; i++) {
  int j=0;
  cmplx c=cmplx(0,0);
  while(mag(c) < 2) {
    c=c*c + coords[i];
    j++;
  }
  color[i] = j;
}

This code…
 is free of sequential dependencies
 has a statically known call graph
Therefore, we can mechanically transform it into an equivalent data-parallel code fragment.
“New Vectors” Translation

Scalar loop:

for(int i=0; i<n; i++) {
  …
}

Data-parallel loop setup:

for(int i=0; i<n; i+=N) {
  i_vector = {i, i+1, .., i+N-1}
  i_mask   = {i<n, i+1<n, .., i+N-1<n}
  …
}

Standard data-parallel loop setup.

Note: Any code outside this loop (which invokes the loop) is necessarily scalar!
“New Vectors” Translation

Original scalar code:

int n;
cmplx coords[];
int color[] = new int[n];

for(int i=0; i<n; i++) {
  int j=0;
  cmplx c=cmplx(0,0);
  while(mag(c) < 2) {
    c=c*c + coords[i];
    j++;
  }
  color[i] = j;
}

Vectorized translation:

int n;
cmplx coords[];
int color[] = new int[n];

for(int i=0; i<n; i+=N) {
  int[N] i_vector = {i, i+1, .., i+N-1}               // loop index vector
  bool[N] i_mask = {i<n, i+1<n, .., i+N-1<n}          // loop mask vector
  cmplx[N] c_vector = {cmplx(0,0), ..}                // vectorized loop variable

  while(1) {
    bool[N] while_vector = {                          // vectorized conditional:
      i_mask[0] && mag(c_vector[0])<2,                // propagates the loop mask
      ..                                              // into the local condition
    }
    if(all_false(while_vector))
      break;
    c_vector = c_vector*c_vector + coords[i..i+N-1 : i_mask]   // mask-predicated vector read
  }
  color[i..i+N-1 : i_mask] = c_vector                 // mask-predicated vector write
}

Note: Any code outside this loop (which invokes the loop) is necessarily scalar!
Vectorization Tricks
 Vectorization of loops
 Subexpressions independent of loop variable are scalar and can be
lifted out of loop
 Subexpressions dependent on loop variable are vectorized
 Each loop iteration computes an “active mask” enabling operation
on some subset of the N components
 Vectorization of function calls
 For every scalar function, generate an N-wide vector version of the
function taking an N-wide “active mask”
 Vectorization of conditionals
 Evaluate N-wide conditional and combine it with the current active
mask
 Execute “true” branch if any masked conditions true
 Execute “false” branch if any masked conditions false
 Will often execute both branches
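
A minimal scalar emulation of the conditional trick above: evaluate the condition per lane, combine it with the active mask, run both branches, and keep each lane's selected result. The loop body is an illustrative stand-in:

#include <cstdio>

const int N = 8;

int main() {
    float x[N] = { -3, -2, -1, 0, 1, 2, 3, 4 };
    float y[N];
    bool active[N], cond[N];
    for (int lane = 0; lane < N; lane++) active[lane] = true;          // incoming active mask

    // Evaluate the N-wide condition and combine it with the active mask.
    for (int lane = 0; lane < N; lane++) cond[lane] = active[lane] && x[lane] < 0;

    // Execute both branches across all lanes...
    float thenVal[N], elseVal[N];
    for (int lane = 0; lane < N; lane++) thenVal[lane] = -x[lane];      // "true" branch
    for (int lane = 0; lane < N; lane++) elseVal[lane] = 2 * x[lane];   // "false" branch

    // ...then blend: each lane keeps the branch its own condition selected.
    for (int lane = 0; lane < N; lane++) y[lane] = cond[lane] ? thenVal[lane] : elseVal[lane];

    for (int lane = 0; lane < N; lane++) std::printf("%g ", y[lane]);
    std::printf("\n");
    return 0;
}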
Vectorization Paradigms
 Hand-coded vector operations
 Current approach to SSE/Altivec

 Loop vectorization
 See: Vectorizing compilers

 Run a big function with a big bundle of data


 CUDA/OpenCL

 Nested Data Parallelism


 See: NESL
 Very general set of “vectorization” transforms
for many categories of nested computations
Layers: Multithreading & Vectors

Workloads mapped onto programming-model layers:

 Graphics shader programs → Vector (Data Parallel) Subset
 Physics, collision detection, scene traversal, path finding… → Purely functional core
 Game World State → Software Transactional Memory
 (all layered above Sequential Execution and Hardware I/O)
Potential Performance Gains*: 2012-2020
Up to...
 64X for multithreading
 1024X for multithreading + vectors!


* My estimate of feasibility based on Moore’s Law


Multithreading & Vectorization:
Who Chooses?

 Hardware companies impose a limited


model on developers
 Sony Cell, NVIDIA CUDA, Apple OpenCL

 Hardware provides general feature;


languages & runtimes make it nice;
users choose!
 Tradeoffs
 Performance
 Productivity
 Familiarity
HARDWARE IMPLICATIONS
The Graphics Hardware of the Future

All else is just computing!


Future Hardware:
A unified architecture for computing and graphics

Hardware Model
 Three performance dimensions
 Clock rate
 Cores
 Vector width
 Executes two kinds of code:
 Scalar code (like x86, PowerPC)
 Vector code (like GPU shaders or SSE/Altivec)
 Some fixed-function hardware
 Texture sampling
 Rasterization?
Vector Instruction Issues

 A future computing device needs…


 Full vector ISA
 Masking & scatter/gather memory access
 64-bit integer ops & memory addressing
 Full scalar ISA
 Dynamic control-flow is essential
 Efficient support for scalar<->vector transitions
 Initiating a vector computation
 Reducing the results
 Repacking vectors
 Must support billions of transitions per second
Memory System Issues

Effective bandwidth demands will be huge


Typically read 1 byte of memory per FLOP

4 TFLOP of computing power


demands
4 TBPS of effective memory bandwidth!

Yes, really!
Memory System Issues

Threads (GPU)
 Hide memory latency
 Lose data locality

Caches (CPU)
 Expose memory latency
 Exploit data locality to minimize main memory bandwidth
Memory System Issues

 Cache coherency is vital


 It should be the default
Revisiting REYES

 “Dice” all objects in scene down into
sub-pixel-sized triangles
 Tile-based setup

 Rendering with
 Flat Shading
 No texture sampling
 Analytic antialiasing
 Per-pixel occlusion (A-Buffer/BSP)

Requires no artificial software threading or pipelining.
LESSONS LEARNED
Lessons learned:
Productivity is vital!
Hardware will become 20X faster, but:
 Game budgets will increase less than 2X.

Therefore...
 Developers must be willing to sacrifice performance
in order to gain productivity.
 High-level programming beats
low-level programming.
 Easier hardware beats faster hardware!
 We need great tools: compilers, engines, middleware
libraries...
Lessons learned:
Today’s hardware is too hard!
 If it costs X (time, money, pain) to develop an efficient
single-threaded algorithm, then…
 Multithreaded version costs 2X
 PlayStation 3 Cell version costs 5X
 Current “GPGPU” version costs 10X or more

 Over 2X is uneconomical for most software companies!

 This is an argument against:


 Hardware that requires difficult programming techniques
 Non-unified memory architectures
 Limited “GPGPU” programming models
Lessons learned:
Plan Ahead
Previous Generation:
 Lead-time for engine development was 3 years
 Unreal Engine 3:
 2003: development started
 2006: first game shipped

Next Generation:
 Lead-time for engine development is 5 years
 Start in 2009, ship in 2014!

So, let’s get started!


CONCLUSION
END
