The End of the GPU Roadmap: Tim Sweeney, CEO and Founder, Epic Games
PlayStation 2, Xbox, PC
DirectX 7 graphics
Single-threaded
40 games shipped
Unreal Engine 3
2006-2012
PlayStation 3, Xbox 360, PC
DirectX 9 graphics
Pixel shaders
Advanced lighting & shadowing
Multithreading (6 threads)
Advanced physics
More visual tools
Game Scripting
Materials
Animation
Cinematics…
150 games in development
Unreal Engine 3 Games
Gears of War 2
Gameplay Code
~250,000 lines C++, script code
Unreal Engine 3
Middleware Game Engine
~2,000,000 lines C++ code
Middleware libraries:
DirectX (Graphics)
OpenAL (Audio)
SpeedTree (Rendering)
FaceFX (Face Animation)
Bink (Movie Codec)
ZLib (Data Compression)
…
Hardware:
History
Computing History
CONCLUSION
CPU, GPU architectures are getting closer
THE GPU TODAY
The GPU Today
Features
Real-time colored lighting
Volumetric Fog
Tiled Rendering
Occlusion Detection
Software Rendering in 1998 vs 2012
Assumption: Using 50% of computing power for graphics, 50% for gameplay
Future Graphics:
Raytracing
Consider
Less efficient than pure rendering
Can use for reflections in traditional render
Future Graphics:
The REYES Rendering Model
“Dice” all objects in the scene down into sub-pixel-sized triangles
Rendering with
Flat Shading (!)
Analytic antialiasing
Per-pixel occlusion (A-Buffer/BSP)
Benefits
Displacement maps for free
Analytic Antialiasing
Advanced filtering (Gaussian)
Eliminates texture sampling
Future Graphics:
The REYES Rendering Model
Consider
Cache efficiency
Deep frame buffers, antialiasing
Hybrid Graphics Algorithms
Analytic Antialiasing
– Analytic solution, better than 1024x MSAA
Sort-independent translucency
– A sorted linked list of fragments per pixel, requiring per-pixel memory allocation, pointer-following, and conditional branching (A-Buffer).
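The A-Buffer idea above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the names `Fragment`, `ABuffer`, `add`, and `resolve` are hypothetical, color is reduced to one channel, and a `std::vector` per pixel stands in for the sorted linked list named on the slide. It still shows the three costs listed: per-pixel allocation, indirection, and data-dependent branching.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One translucent fragment covering a pixel (hypothetical layout).
struct Fragment {
    float depth;  // distance from the camera; larger means farther away
    float color;  // single channel, to keep the sketch short
    float alpha;  // opacity in [0, 1]
};

// A-Buffer sketch: every pixel owns a variable-length list of fragments.
class ABuffer {
public:
    ABuffer(std::size_t width, std::size_t height)
        : width_(width), pixels_(width * height) {}

    // Fragments may arrive in any order; push_back is the per-pixel allocation.
    void add(std::size_t x, std::size_t y, Fragment f) {
        assert(x < width_ && y * width_ + x < pixels_.size());
        pixels_[y * width_ + x].push_back(f);
    }

    // Sort far-to-near, then composite each fragment over the result so far.
    float resolve(std::size_t x, std::size_t y, float background) {
        std::vector<Fragment>& list = pixels_[y * width_ + x];
        std::sort(list.begin(), list.end(),
                  [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });
        float result = background;
        for (const Fragment& f : list)
            result = f.color * f.alpha + result * (1.0f - f.alpha);
        return result;
    }

private:
    std::size_t width_;
    std::vector<std::vector<Fragment>> pixels_;
};
```

Because the sort happens at resolve time, submitting the same fragments in a different order gives the same pixel value, which is what makes the translucency sort-independent from the caller's point of view.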
Achieve movie-quality:
Antialiasing
Direct Lighting
Shadowing
Particle Effects
Reflections
Significantly improve:
Character animation
Object counts
Indirect lighting
SOFTWARE IMPLICATIONS
Software Implications
Programming Models
• Shared State Concurrency
• Message Passing
• Pure Functional Programming
• Software Transactional Memory
Multithreading in Unreal Engine 3:
“Task Parallelism”
Gameplay thread
AI, scripting
Thousands of interacting objects
Rendering thread
Scene traversal, occlusion
Direct3D command submission
Idea:
Update objects in multiple threads
Each object contains a lock
“Just lock an object before using it”
Problems:
“Deadlocks”
“Data Races”
Debugging is difficult/expensive
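A minimal sketch of why “just lock an object before using it” leads to deadlocks, assuming a hypothetical `GameObject` with a `gold` field. `give_gold_naive` follows the slide's idea literally; `give_gold` uses the standard `std::lock` call, which acquires several mutexes with a deadlock-avoidance algorithm. Neither function is from Unreal Engine.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical game object: one lock per object, as in the idea above.
struct GameObject {
    std::mutex lock;
    int gold = 0;
};

// Naive version: lock "from", then "to". If one thread runs
// give_gold_naive(a, b, ..) while another runs give_gold_naive(b, a, ..),
// each can hold one lock and wait forever for the other: a deadlock.
void give_gold_naive(GameObject& from, GameObject& to, int amount) {
    std::lock_guard<std::mutex> hold_from(from.lock);
    std::lock_guard<std::mutex> hold_to(to.lock);
    from.gold -= amount;
    to.gold += amount;
}

// Safer version: std::lock acquires both mutexes without deadlocking,
// whatever order the arguments arrive in.
void give_gold(GameObject& from, GameObject& to, int amount) {
    assert(&from != &to);  // locking the same mutex twice is undefined behavior
    std::lock(from.lock, to.lock);
    std::lock_guard<std::mutex> hold_from(from.lock, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.lock, std::adopt_lock);
    from.gold -= amount;
    to.gold += amount;
}
```

Even the safer version only helps if every piece of code that touches two objects uses it. Any code that reads `gold` without taking the lock is still a data race, which is part of why debugging is expensive.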
Multithreaded Gameplay Simulation:
“Message Passing”
Idea:
Update objects in multiple threads
Each object can only modify itself
Communicate with other objects by sending messages
Problems:
Requires writing thousands of message protocols
Still need synchronization
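The message-passing idea can be sketched as follows. All names here (`Message`, `Actor`, `World`, `send`, `update`) are hypothetical, and delivery is single-threaded to keep the sketch short. Each actor changes only its own state, and other code can affect it only by posting to its inbox.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical message protocol: every interaction needs its own message kind.
enum class Kind { Damage, Heal };

struct Message {
    Kind kind;
    int amount;
};

// An actor modifies only itself, and only in response to its own messages.
struct Actor {
    int health = 100;
    std::queue<Message> inbox;

    void handle(const Message& m) {
        if (m.kind == Kind::Damage) health -= m.amount;
        else health += m.amount;
    }

    void drain() {
        while (!inbox.empty()) {
            handle(inbox.front());
            inbox.pop();
        }
    }
};

class World {
public:
    explicit World(std::size_t count) : actors_(count) {}

    // The only way to affect another object: post a message to it.
    void send(std::size_t target, Message m) {
        assert(target < actors_.size());
        actors_[target].inbox.push(m);
    }

    // Each drain() touches one actor only, so actors could be updated on
    // different threads once the inboxes themselves are made thread-safe.
    void update() {
        for (Actor& a : actors_) a.drain();
    }

    int health(std::size_t i) const { return actors_[i].health; }

private:
    std::vector<Actor> actors_;
};
```

Both problems from the slide are visible even at this size. Every new interaction needs a new `Kind` and handler, and in a threaded build the inbox queue is shared state that still needs synchronization.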
Pure Functional Programming
Examples:
• Collision Detection
• Physics Solver
• Pixel Shading
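As an illustration of the first example, here is a hedged sketch of a side-effect-free collision test. `Sphere`, `overlaps`, and `hits_against` are hypothetical names. The point is that each result depends only on its inputs, so the loop iterations are independent of one another.

```cpp
#include <cstddef>
#include <vector>

struct Sphere {
    float x, y, z, radius;
};

// Pure function: the result depends only on the arguments, and nothing
// outside the function is read or written.
bool overlaps(const Sphere& a, const Sphere& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    const float reach = a.radius + b.radius;
    return dx * dx + dy * dy + dz * dz <= reach * reach;
}

// Each output slot is computed from the inputs alone, so the iterations
// could be split across threads or vector lanes without any locking.
std::vector<int> hits_against(const Sphere& probe, const std::vector<Sphere>& scene) {
    std::vector<int> hits(scene.size());
    for (std::size_t i = 0; i < scene.size(); ++i)
        hits[i] = overlaps(probe, scene[i]) ? 1 : 0;
    return hits;
}
```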
Pure Functional Programming
Problems:
“Object update” code must be free of side-effects
Requires C++ runtime support
Cost around 30% performance
See: Tim Harris, Simon Marlow, Simon Peyton Jones, and Maurice Herlihy, “Composable Memory Transactions,” ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2005.
Vectorization
Supporting “Vector Instruction Sets” efficiently
NVIDIA GeForce 8:
• 8 to 15 cores
• 16-wide vectors
Vectorization
“Old Vectors”
Intel SSE, Motorola Altivec
vec4 x,y,z;
...
z = x+y;

x0 x1 x2 x3
 +  +  +  +
y0 y1 y2 y3
 =  =  =  =
z0 z1 z2 z3
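The lane-by-lane addition in the diagram can be written as a portable sketch. The `vec4` alias and `add` function are assumptions for illustration. On SSE hardware the whole loop corresponds to one packed add (the `addps` instruction, exposed as the `_mm_add_ps` intrinsic), but plain C++ is used here so the example compiles anywhere.

```cpp
#include <array>
#include <cstddef>

// Four floats treated as one value, as in the "old vectors" model.
using vec4 = std::array<float, 4>;

// z = x + y, one lane at a time: z0 = x0 + y0, ..., z3 = x3 + y3.
// The lanes never interact, which is what lets hardware do all four at once.
vec4 add(const vec4& x, const vec4& y) {
    vec4 z{};
    for (std::size_t lane = 0; lane < z.size(); ++lane)
        z[lane] = x[lane] + y[lane];
    return z;
}
```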
Vectorization: “New Vectors”
This code…
is free of sequential dependencies
has a statically known call graph
Therefore, we can mechanically transform it into an equivalent data parallel code fragment.
“New Vectors” Translation
Scalar loop:
    for(int i=0; i<n; i++) {
        …
    }
Vectorized loop, N iterations at a time:
    for(int i=0; i<n; i+=N) {
        i_vector={i,i+1,..i+N-1}                 // Vectorized loop variable
        i_mask={i<n,i+1<n,i+2<n,..i+N-1<n}       // Loop mask vector
        complx[N] c_vector={cmplx(0,0),..}
        while(1) {
            // Vectorized conditional: propagates loop mask to local condition
            bool[N] while_vector={
                i_mask[0] && mag(c_vector[0])<2,
                ..
            }
            if(all_false(while_vector))
                break;
            c_vector=c_vector*c_vector + coords[i..i+N-1 : i_mask]   // Mask-predicated vector read
        }
        colors[i..i+N-1 : i_mask] = c_vector;    // Mask-predicated vector write
    }
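The translation above can be emulated in ordinary C++ to check that the masked loop computes the same thing as the scalar one. This is a sketch under several assumptions, not the slide's code. The width `N = 4` and the cap `kMaxIter` are arbitrary, and the cap is added so the loop always terminates. The output is an iteration count rather than the final complex value. The inner update is predicated on `while_vector`, so finished lanes stop changing.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4;   // emulated vector width (assumption)
constexpr int kMaxIter = 50;   // cap so every lane terminates (assumption)

using Complex = std::complex<float>;

// Scalar reference: iterate c = c*c + coord while mag(c) < 2.
std::vector<int> iterate_all_scalar(const std::vector<Complex>& coords) {
    std::vector<int> iters(coords.size(), 0);
    for (std::size_t i = 0; i < coords.size(); ++i) {
        Complex c(0.0f, 0.0f);
        // mag(c) < 2 is the same test as norm(c) < 4, without the square root.
        while (std::norm(c) < 4.0f && iters[i] < kMaxIter) {
            c = c * c + coords[i];
            ++iters[i];
        }
    }
    return iters;
}

// Masked version: N loop iterations advance together, one per lane.
std::vector<int> iterate_all_masked(const std::vector<Complex>& coords) {
    const std::size_t n = coords.size();
    std::vector<int> iters(n, 0);
    for (std::size_t i = 0; i < n; i += N) {
        std::array<bool, N> i_mask{};        // loop mask: lanes past n stay off
        std::array<Complex, N> c_vector{};   // all lanes start at (0, 0)
        std::array<int, N> iter_vector{};
        for (std::size_t lane = 0; lane < N; ++lane)
            i_mask[lane] = (i + lane) < n;

        while (true) {
            // Vectorized conditional: loop mask AND the per-lane condition.
            std::array<bool, N> while_vector{};
            bool any_active = false;
            for (std::size_t lane = 0; lane < N; ++lane) {
                while_vector[lane] = i_mask[lane] &&
                                     std::norm(c_vector[lane]) < 4.0f &&
                                     iter_vector[lane] < kMaxIter;
                any_active = any_active || while_vector[lane];
            }
            if (!any_active) break;          // all_false(while_vector)

            // Mask-predicated read of coords and update of active lanes only.
            for (std::size_t lane = 0; lane < N; ++lane) {
                if (!while_vector[lane]) continue;
                c_vector[lane] = c_vector[lane] * c_vector[lane] + coords[i + lane];
                ++iter_vector[lane];
            }
        }

        // Mask-predicated write: never store past the end of the array.
        for (std::size_t lane = 0; lane < N; ++lane)
            if (i_mask[lane]) iters[i + lane] = iter_vector[lane];
    }
    return iters;
}
```

The vector loop runs until its slowest lane finishes, so a group of N points costs as much as the most expensive point in it. That is the usual price of masked execution.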
Vectorization Tricks
Vectorization of loops
Subexpressions independent of loop variable are scalar and can be lifted out of loop
Subexpressions dependent on loop variable are vectorized
Each loop iteration computes an “active mask” enabling operation on some subset of the N components
Vectorization of function calls
For every scalar function, generate an N-wide vector version of the function taking an N-wide “active mask”
Vectorization of conditionals
Evaluate N-wide conditional and combine it with the current active mask
Execute “true” branch if any masked conditions true
Execute “false” branch if any masked conditions false
Will often execute both branches
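The function-call and conditional rules above can be sketched together. The scalar function `f` and its N-wide counterpart `f_vector` are hypothetical examples, with `N = 4` chosen arbitrarily. The vector version takes an active mask, combines it with the per-lane condition, and runs each branch only if at least one lane needs it.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;  // emulated vector width (assumption)
using FloatVec = std::array<float, N>;
using Mask = std::array<bool, N>;

// Scalar original (hypothetical example function).
float f(float x) {
    if (x < 0.0f) return -x;
    return x * x;
}

// N-wide version of f, generated by hand here, taking an active mask.
// Inactive lanes are left at zero.
FloatVec f_vector(const FloatVec& x, const Mask& active) {
    FloatVec result{};
    Mask take_true{}, take_false{};
    bool any_true = false, any_false = false;

    // Evaluate the N-wide conditional and combine it with the active mask.
    for (std::size_t lane = 0; lane < N; ++lane) {
        const bool cond = x[lane] < 0.0f;
        take_true[lane] = active[lane] && cond;
        take_false[lane] = active[lane] && !cond;
        any_true = any_true || take_true[lane];
        any_false = any_false || take_false[lane];
    }

    // Execute the "true" branch if any masked condition is true.
    if (any_true)
        for (std::size_t lane = 0; lane < N; ++lane)
            if (take_true[lane]) result[lane] = -x[lane];

    // Execute the "false" branch if any masked condition is false.
    if (any_false)
        for (std::size_t lane = 0; lane < N; ++lane)
            if (take_false[lane]) result[lane] = x[lane] * x[lane];

    return result;
}
```

With mixed inputs such as `{-2, 3, -1, 4}` both branches run, which is the “will often execute both branches” cost noted above.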
Vectorization Paradigms
Hand-coded vector operations
Current approach to SSE/Altivec
Loop vectorization
See: Vectorizing compilers
Sequential Execution
Hardware I/O
Potential Performance Gains*: 2012-2020
Up to...
64X for multithreading
1024X for multithreading + vectors!
Hardware Model
Three performance dimensions
Clock rate
Cores
Vector width
Executes two kinds of code:
Scalar code (like x86, PowerPC)
Vector code (like GPU shaders or SSE/Altivec)
Some fixed-function hardware
Texture sampling
Rasterization?
Vector Instruction Issues
Yes, really!
Memory System Issues
Rendering with
Flat Shading
No texture sampling
Analytic antialiasing
Per-pixel occlusion (A-Buffer/BSP)
Requires no artificial software threading or pipelining.
LESSONS LEARNED
Lessons learned:
Productivity is vital!
Hardware will become 20X faster, but:
Game budgets will increase less than 2X.
Therefore...
Developers must be willing to sacrifice performance in order to gain productivity.
High-level programming beats low-level programming.
Easier hardware beats faster hardware!
We need great tools: compilers, engines, middleware libraries...
Lessons learned:
Today’s hardware is too hard!
If it costs X (time, money, pain) to develop an efficient single-threaded algorithm, then…
Multithreaded version costs 2X
PlayStation 3 Cell version costs 5X
Current “GPGPU” version costs 10X or more
Next Generation:
Lead-time for engine development is 5 years
Start in 2009, ship in 2014!