GPU Pro 2
Advanced Rendering Techniques
A K Peters, Ltd.
Natick, Massachusetts
A K Peters, Ltd.
5 Commonwealth Road, Suite 2C
Natick, MA 01760
[Link]
All rights reserved. No part of the material protected by this copyright notice
may be reproduced or utilized in any form, electronic or mechanical, including
photocopying, recording, or by any information storage and retrieval system,
without written permission from the copyright owner.
T385.G6885 2011
006.6–dc22
2010040134
Front cover art courtesy of Lionhead Studios. All Fable® artwork appears courtesy of Lionhead
Studios and Microsoft Corporation. “Fable” is a registered trademark or trademark of Microsoft
Corporation in the United States and/or other countries. © 2010 Microsoft Corporation. All
Rights Reserved. Microsoft, Fable, Lionhead, the Lionhead Logo, Xbox, and the Xbox logo are
trademarks of the Microsoft group of companies.
Back cover images are generated with CryENGINE 3 by Anton Kaplanyan. The Crytek
Sponza model is publicly available courtesy of Frank Meinl.
Printed in India
14 13 12 11 10 9 8 7 6 5 4 3 2 1
Contents
Acknowledgments xv
I Geometry Manipulation 1
Wolfgang Engel, editor
II Rendering 39
Christopher Oat, editor
IV Shadows 205
Wolfgang Engel, editor
Contributors 473
Acknowledgments
The GPU Pro: Advanced Rendering Techniques book series covers ready-to-use
ideas and procedures that can solve many of your daily graphics-programming
challenges.
The second book in the series wouldn’t have been possible without the help
of many people. First, I would like to thank the section editors for the fantastic
job they did. The work of Wessam Bahnassi, Sebastien St Laurent, Carsten
Dachsbacher, Christopher Oat, and Kristof Beets ensured that the quality of the
series meets the expectations of our readers.
The great cover screenshots were taken from Fable III. I would like to thank
Fernando Navarro from Microsoft Game Studios for helping us get the permissions
to use those shots. You will find the Fable III-related article about morphological
antialiasing in this book.
The team at A K Peters made the whole project happen. I want to thank
Alice and Klaus Peters, Sarah Cutler, and the entire production team, who took
the articles and made them into a book. Special thanks go out to our families
and friends, who spent many evenings and weekends without us during the long
book production cycle.
I hope you have as much fun reading the book as we had creating it.
—Wolfgang Engel
P.S. Plans for an upcoming GPU Pro 3 are already in progress. Any comments,
proposals, and suggestions are highly welcome ([Link]@[Link]).
Web Materials
Example programs and source code to accompany some of the chapters are avail-
able at [Link] The directory structure closely follows
the book structure by using the chapter number as the name of the subdirectory.
You will need to download the DirectX August 2009 SDK.
Updates
Updates of the example programs will be periodically posted.
I
Geometry Manipulation
The “Geometry Manipulation” section of the book focuses on the ability of graph-
ics processing units (GPUs) to process and generate geometry in exciting ways.
The article “Terrain and Ocean Rendering” looks at the tessellation-related
stages of DirectX 11, explains a simple implementation of terrain rendering, and
implements the techniques from the ShaderX6 article “Procedural Ocean Effects”
by László Szécsi and Khashayar Arman.
Jorge Jimenez, Jose I. Echevarria, Christopher Oat, and Diego Gutierrez
present a method to add expressive and animated wrinkles to characters in the
article “Practical and Realistic Facial Wrinkles Animation.” Their system allows
the animator to independently blend multiple wrinkle maps across regions of a
character’s face. When combined with traditional blend-target morphing for fa-
cial animation, this technique can produce very compelling results that enable
virtual characters to be much more expressive in both their actions and dialog.
The article “Procedural Content Generation on GPU,” by Aleksander Netzel
and Pawel Rohleder, demonstrates how to generate and render infinite,
deterministic, heightmap-based terrain utilizing fractal Brownian noise calculated
in real time on the GPU. Additionally, it proposes a random tree-distribution
scheme that exploits previously generated terrain information. The authors use
spectral synthesis to accumulate several layers of approximated fractal Brownian
motion. They also show how to simulate erosion in real time.
—Wolfgang Engel
1
Terrain and Ocean Rendering
To calculate the output data there are two functions. The first is executed
once per patch; there you calculate the tessellation factor for every
edge of the patch and for its interior. It is defined in high-level shader language
(HLSL) code, and the attribute [patchconstantfunc("func name")] must be
specified.
The other function, the main one, is executed for every control point in
the patch, and there you can manipulate this control point. Both functions
have access to the information of all the control points in the patch; in addition,
the main function receives the ID of the control point that you are processing.
An example of a hull shader header is as follows:
[domain("quad")]
[partitioning("integer")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(OUTPUT_PATCH_SIZE)]
[patchconstantfunc("TerrainConstantHS")]
HS_OUTPUT hsTerrain(
    InputPatch<VS_CONTROL_POINT_OUTPUT, INPUT_PATCH_SIZE> p,
    uint i : SV_OutputControlPointID,
    uint PatchID : SV_PrimitiveID)
• INPUT_PATCH_SIZE is an integer and must match the number of control points
in the input primitive.
• OUTPUT_PATCH_SIZE determines how many times the main function will
be executed.
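As a sketch of how these pieces fit together, the patch-constant function named in the header above might look as follows. The struct fields use the standard quad-domain semantics, and the constant factors are placeholders for the distance-based computation described later; both are assumptions rather than the chapter's actual code.

struct HS_CONSTANT_DATA_OUTPUT
{
    float Edges[4]  : SV_TessFactor;        // one factor per quad edge
    float Inside[2] : SV_InsideTessFactor;  // horizontal and vertical factors
};

HS_CONSTANT_DATA_OUTPUT TerrainConstantHS(
    InputPatch<VS_CONTROL_POINT_OUTPUT, INPUT_PATCH_SIZE> p,
    uint PatchID : SV_PrimitiveID)
{
    HS_CONSTANT_DATA_OUTPUT output;
    // Constant factors for brevity; the chapter derives them per edge
    // and per interior direction from the distance to the camera.
    output.Edges[0] = output.Edges[1] = output.Edges[2] = output.Edges[3] = 16.0f;
    output.Inside[0] = output.Inside[1] = 16.0f;
    return output;
}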
Figure 1.2. Tessellation in a quad domain: to define the tessellation on the edges, we
start on the left edge and then we continue counterclockwise. To define the tessellation
inside, first we input the horizontal and then the vertical tessellation factor.
[domain("quad")]
DS_OUTPUT dsTerrain(
    HS_CONSTANT_DATA_OUTPUT input,
    float2 UV : SV_DomainLocation,
    const OutputPatch<HS_OUTPUT, OUTPUT_PATCH_SIZE> patch)
Figure 1.3. Division of the terrain into a grid: Vx and Ex represent vertices and edges,
respectively, in every patch where x is the index to access. Inside 0 and Inside 1 represent
the directions of the tessellation inside a patch.
Figure 1.4. Lines between patches when tessellation factors are wrong.
applied to a lot of shapes, but we will use the most intuitive shape: a patch with
four points. We will divide the terrain into patches of the same size, like a grid
(Figure 1.3). For every patch, we will have to decide the tessellation factor for
every edge and for the interior. It is very important that two patches sharing an
edge use the same tessellation factor on that edge; otherwise, you will see lines
between the patches (Figure 1.4).
$$te(d) = \begin{cases}
2^{\max(te_{\log 2})}, & \text{for } d \le \min(d),\\
2^{\operatorname{round}\left(\operatorname{diff}(te_{\log 2})\left(1 - \frac{d - \min(d)}{\operatorname{diff}(d)}\right) + \min(te_{\log 2})\right)}, & \text{for } \min(d) < d < \max(d),\\
2^{\min(te_{\log 2})}, & \text{for } d \ge \max(d).
\end{cases} \tag{1.1}$$

$$te(d) = \begin{cases}
2^{\max(te_{\log 2})}, & \text{for } d \le \min(d),\\
2^{\operatorname{diff}(te_{\log 2})\left(1 - \frac{d - \min(d)}{\operatorname{diff}(d)}\right) + \min(te_{\log 2})}, & \text{for } \min(d) < d < \max(d),\\
2^{\min(te_{\log 2})}, & \text{for } d \ge \max(d),
\end{cases} \tag{1.2}$$
where diff(x) = max(x) − min(x) and d is the distance from the point where we
want to calculate the tessellation factor to the camera. The values min(d) and
max(d) are the distances defined by the user, and min(te_log2) and max(te_log2)
are the tessellation factors defined by the user. For the tessellation factors we use
log2 values in order to work in a range from 0 to 6 instead of from 1 to 64. The
final value te(d) is calculated five times for every patch, using different distances:
four for the edges and one for the interior.
As we said before, when an edge is shared by two patches the tessellation
factor must be the same. To do this we will calculate five different distances in
every patch, one for each edge and one inside. To calculate the tessellation factor
for each edge, we calculate the distance between the camera and the central point
of the edge. This way, in two adjacent patches with the same edge, the distance
at the middle point of this edge will be the same because they share the two
vertices that we use to calculate it. To calculate the tessellation factor inside
the patch in U and V directions, we calculate the distance between the camera
position and the middle point of the patch.
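The interpolation in log2 space maps naturally to a few lines of HLSL. The following is a minimal sketch, assuming user-set constants for the distance range and the log2 tessellation range (the names are not from the chapter):

// Assumed user-tunable constants.
static const float minD = 10.0f, maxD = 500.0f;
static const float minTeLog2 = 0.0f, maxTeLog2 = 6.0f;

// Tessellation factor for one edge or for the patch interior,
// following Equation (1.2): interpolate in log2 space, then exponentiate.
float ComputeTessFactor(float3 midPoint, float3 cameraPos)
{
    float d = distance(midPoint, cameraPos);
    float t = saturate((d - minD) / (maxD - minD)); // 0 near the camera, 1 far away
    float teLog2 = lerp(maxTeLog2, minTeLog2, t);   // max factor near, min factor far
    return exp2(teLog2);                            // 2^teLog2, i.e., 1 to 64
}

For Equation (1.1) the result would additionally be rounded before the exp2, matching integer partitioning.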
1.3.1 Terrain
In terrain rendering (see Figure 1.5) we can easily calculate the x- and z-coordinates
with a single interpolation between the positions of the patch vertices, but
we also need the y-coordinate, which represents the height of the terrain at every
point, and the texture coordinates. Since we have defined the terrain ourselves,
to calculate the texture coordinates we only have to take the final x- and z-positions
and divide them by the size of the terrain. This works because the terrain positions
range from 0 to the size of the terrain, and we want values from 0 to 1 that map
the texture over the whole terrain.
Once we have the texture coordinates, to get the height and the normal of the
terrain at a vertex, we read the information from a heightmap and a world-space
normal map combined in one texture. When applying this information we have
to use mipmap levels, or we will see popping when new vertices appear. To
reduce this popping, we read the value from a mipmap level whose texel density
matches the vertex density of the area where the vertex is located. To do this,
we linearly interpolate between the minimum and the maximum mipmap levels
depending on the distance (see Equation (1.3)).
Four patches that share a vertex have to use the same mipmap level in that
vertex to be coherent; for this reason, we calculate one mipmap level for each
vertex in a patch. Then, to calculate the mipmap level for the other vertices, we
have only to interpolate between the mipmap levels of the vertices of the patch,
where diff(x) = max(x) − min(x), M = MipmapLevel, and d is the distance from
the point to the camera:
$$\operatorname{Mipmap}(d) = \begin{cases}
\min(M), & \text{for } d \le \min(d),\\
\operatorname{diff}(M)\,\frac{d - \min(d)}{\operatorname{diff}(d)} + \min(M), & \text{for } \min(d) < d < \max(d),\\
\max(M), & \text{for } d \ge \max(d).
\end{cases} \tag{1.3}$$
To calculate the minimum and the maximum values for the mipmap level
variables, we use the following equations, where textSize is the size of the texture
that we use for the terrain:
We have to keep in mind that we use only squared textures with a power-of-two
size. If the minimum value is less than 0, we use 0.
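A hedged sketch of Equation (1.3) in HLSL, using assumed names for the user-set distance range and the mipmap-level range derived from the texture size:

// Assumed constants (illustrative values, not from the chapter).
static const float minD = 10.0f, maxD = 500.0f;
static const float minMip = 0.0f, maxMip = 6.0f;

float MipmapLevelForVertex(float d)
{
    float t = saturate((d - minD) / (maxD - minD));
    return lerp(minMip, maxMip, t); // min(M) near the camera, max(M) far away
}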
1.3.2 Ocean
Calculating the final vertex position in ocean rendering (see Figure 1.6) is more
difficult than in terrain rendering. Here we do not have a heightmap; we have to
calculate the final position depending on the waves and the position in world-coordinate
space. To get realistic motion, we will use the technique explained in ShaderX6,
developed by Szécsi and Arman [Szécsi and Arman 08].
First, we have to imagine a single wave with a wavelength (λ), an amplitude
(a), and a direction (k). Its velocity (v) can be represented by
$$v = \sqrt{\frac{g\lambda}{2\pi}}.$$
Then the phase (ϕ) at time (t) in a point (p) is
$$\varphi = \frac{2\pi}{\lambda}\left(p \cdot k + v t\right).$$
Finally, the displacement (s) to apply at that point is
An ocean is not a simple wave, and we have to combine all the waves to get
a realistic motion:
$$p_\Sigma = p + \sum_{i=0}^{n} s(p, a_i, \lambda_i, k_i).$$
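The following sketch combines the velocity and phase formulas above with the summation. The displacement itself is simplified here to a vertical sine offset; the actual s() of [Szécsi and Arman 08] also displaces points horizontally, so this illustrates the structure rather than their technique.

static const float PI = 3.14159265f;
static const float GRAVITY = 9.81f; // g

float3 SimpleWaveOffset(float3 p, float amplitude, float lambda, float2 k, float t)
{
    float v = sqrt(GRAVITY * lambda / (2.0 * PI));               // wave velocity
    float phase = (2.0 * PI / lambda) * (dot(p.xz, k) + v * t);  // phi
    return float3(0.0, amplitude * sin(phase), 0.0);             // simplified displacement
}

// p_sigma = p + sum_i s(p, a_i, lambda_i, k_i)
float3 OceanPosition(float3 p, float t,
                     float amplitude[4], float lambda[4], float2 k[4])
{
    float3 pSum = p;
    [unroll]
    for (int i = 0; i < 4; ++i)
        pSum += SimpleWaveOffset(p, amplitude[i], lambda[i], k[i], t);
    return pSum;
}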
$$N = \frac{\partial p_\Sigma}{\partial z} \times \frac{\partial p_\Sigma}{\partial x}.$$
The rank value is used to decide how important the angle is compared with
the distance. The programmer can decide the value, but values from 0 to 1 are
advisable; they reduce or increase the tessellation factor by up to 50%. If you
decide to use a rank of 0.4, then the tessellation factor will be multiplied by a
value between 0.8 and 1.2, depending on the angle.
To be consistent with this modification, we have to apply the correction to the
value that we use to access the mipmap level in a texture. It is very important to
understand that four patches that share the same vertex have the same mipmap
value at that vertex. To calculate the angle of the camera at this point, we
calculate the mean of the angles between the camera vector ($\hat{c}$) and every vector over
the edges ($\hat{v}_0$, $\hat{v}_1$, $\hat{v}_2$, $\hat{v}_3$) that share the point (see Figure 1.8):

$$\frac{\dfrac{\pi}{2} - \dfrac{\arccos(|\hat{c}\cdot\hat{v}_0|) + \arccos(|\hat{c}\cdot\hat{v}_1|) + \arccos(|\hat{c}\cdot\hat{v}_2|) + \arccos(|\hat{c}\cdot\hat{v}_3|)}{4}}{\dfrac{\pi}{2}}\,\mathrm{rank} + 1 - \frac{\mathrm{rank}}{2}.$$
In the hull shader we do not have information about the vertices of the other
patches, but these vectors can be calculated in the vertex shader because we
know the size of the terrain and the number of patches.
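A sketch of the correction described above, assuming the normalized camera vector and the four edge vectors are already available (the function name and layout are illustrative):

static const float HALF_PI = 1.5707963f;

// Returns a multiplier in [1 - rank/2, 1 + rank/2] for the tessellation factor.
float AngleCorrection(float3 c, float3 v0, float3 v1, float3 v2, float3 v3, float rank)
{
    float meanAngle = (acos(abs(dot(c, v0))) + acos(abs(dot(c, v1))) +
                       acos(abs(dot(c, v2))) + acos(abs(dot(c, v3)))) * 0.25;
    // 1 when facing the camera (mean angle 0), 0 at grazing angles (mean angle pi/2).
    float facing = (HALF_PI - meanAngle) / HALF_PI;
    return facing * rank + 1.0 - rank * 0.5;
}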
1.5 Conclusions
As we have shown in this article, hardware tessellation is a powerful tool that
reduces the amount of information transferred from the CPU to the GPU. The
three new stages added to the graphics pipeline allow great flexibility in using
hardware tessellation advantageously. We have seen its application in two fields,
terrain and water rendering, but it can be used for similar meshes.
The main point to bear in mind is that we can use other techniques to calculate
the tessellation factor, but the tessellation factors and mipmap levels must always
be kept consistent across all the patches to avoid lines between them.
In addition, we have seen that it is better if we can use functions to represent
a mesh like the ocean, because the resolution can be as high as the tessellation
factor sets. If we use a heightmap, as we do for the terrain, it would be possible
to not have enough information in the texture, and we would have to interpolate
between texels.
Bibliography
[Microsoft] Microsoft. “Windows DirectX Graphics Documentation.”
[Szécsi and Arman 08] László Szécsi and Khashayar Arman. “Procedural Ocean Effects.”
In ShaderX6: Advanced Rendering Techniques, edited by Wolfgang Engel. Hingham, MA:
Charles River Media, 2008.
2
Practical and Realistic Facial Wrinkles Animation
Virtual characters in games are becoming more and more realistic, with recent
advances, for instance, in the techniques of skin rendering [d’Eon and Luebke 07,
Hable et al. 09, Jimenez and Gutierrez 10] or behavior-based animation.1 To
avoid lifeless representations and to make the action more engaging, increasingly
sophisticated algorithms are being devised that capture subtle aspects of the
appearance and motion of these characters. Unfortunately, facial animation and
the emotional aspect of the interaction have not been traditionally pursued with
the same intensity. We believe this is an important aspect missing in games,
especially given the current trend toward story-driven AAA games and their
movie-like, real-time cut scenes.
Without even realizing it, we often depend on the subtleties of facial ex-
pression to give us important contextual cues about what someone is saying,
thinking, or feeling. For example, a wrinkled brow can indicate surprise, while
a furrowed brow may indicate confusion or inquisitiveness. In the mid-1800s, a
French neurologist named Guillaume Duchenne performed experiments that in-
volved applying electric stimulation to his subjects’ facial muscles. Duchenne’s
experiments allowed him to map which facial muscles were used for different fa-
cial expressions. One interesting fact that he discovered was that smiles resulting
from true happiness utilize not only the muscles of the mouth, but also those of
the eyes. It is this subtle but important additional muscle movement that dis-
tinguishes a genuine, happy smile from an inauthentic or sarcastic smile. What
we learn from this is that facial expressions are complex and sometimes subtle,
but extraordinarily important in conveying meaning and intent. In order to allow
artists to create realistic, compelling characters, we must allow them to harness
the power of subtle facial expression.
1 Euphoria NaturalMotion technology
Figure 2.1. This figure shows our wrinkle system for a complex facial expression com-
posed of multiple, simultaneous blend shapes.
Figure 2.2. The same scene (a) without and (b) with animated facial wrinkles. Adding
them helps to increase visual realism and conveys the mood of the character.
2.1 Background
Bump maps and normal maps are well-known techniques for adding the illusion
of surface features to otherwise coarse, undetailed surfaces. The use of nor-
mal maps to capture the facial detail of human characters has been considered
standard practice for the past several generations of real-time rendering appli-
cations. However, using static normal maps unfortunately does not accurately
represent the dynamic surface of an animated human face. In order to simulate
dynamic wrinkles, one option is to use length-preserving geometric constraints
along with artist-placed wrinkle features to dynamically create wrinkles on ani-
mated meshes [Larboulette and Cani 04]. Since this method actually displaces
geometry, the underlying mesh must be sufficiently tessellated to represent the
finest level of wrinkle detail. A dynamic facial-wrinkle animation scheme pre-
sented recently [Oat 07] employs two wrinkle maps (one for stretch poses and
one for compress poses), and allows them to be blended to independent regions
of the face using artist-animated weights along with a mask texture. We build
upon this technique, demonstrating how to dramatically optimize the memory
requirements. Furthermore, our technique allows us to easily include more than
two wrinkle maps when needed, because we no longer map negative and positive
values to different textures.
Table 2.1. Weights used for each expression and zone (see color meaning in the mask
map of Figure 2.3).
#ifdef WRINKLES
    float2 wrinkles = wrinkleTex.Sample(LinearSampler, texcoord).gr;
    wrinkles = -1.0 + 2.0 * wrinkles;

    base.xy += mask1.r * wrinkles;
    base.xy += mask1.g * wrinkles;
    base.xy += mask1.b * wrinkles;
    base.xy += mask1.a * wrinkles;
    base.xy += mask2.r * wrinkles;
    base.xy += mask2.g * wrinkles;
    base.xy += mask2.b * wrinkles;
    base.xy += mask2.a * wrinkles;
#endif
    return normalize(base);
}
Listing 2.1. HLSL code of our technique. We are using a linear instead of an anisotropic
sampler for the wrinkle and mask maps because the low-frequency nature of their infor-
mation does not require higher quality filtering. This code is a more readable version
of the optimized code found in the web material.
Figure 2.4. The net result of applying both surprise and anger expressions on top of the
neutral pose is an unwrinkled forehead. In order to accomplish this, we use positive and
negative weights in the forehead wrinkle zones, for the surprise and angry expressions,
respectively.
where w(x, y) is the normal at pixel (x, y) of the wrinkle map, and b(x, y) is
the corresponding value from the base normal map. When DXT compression
is used for storing the differences map, it is recommended that the resulting
normal be renormalized after adding the delta, in order to alleviate the arti-
facts caused by the compression scheme (see web material for the corresponding
listing).
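For illustration, here is a minimal sketch of adding a wrinkle-difference map on top of the base normal map and renormalizing; the texture and parameter names are assumptions, and this is not the listing from the web material:

Texture2D baseTex          : register(t0);
Texture2D wrinkleDiffTex   : register(t1);
SamplerState LinearSampler : register(s0);

float3 WrinkledNormalFromDifference(float2 uv, float maskWeight)
{
    // Unpack [0,1] texture data to [-1,1] vectors (the difference map is
    // stored with the bias and scale mentioned in Figure 2.5).
    float3 normal = -1.0 + 2.0 * baseTex.Sample(LinearSampler, uv).xyz;
    float3 delta  = -1.0 + 2.0 * wrinkleDiffTex.Sample(LinearSampler, uv).xyz;

    normal += maskWeight * delta; // selectively add the difference
    return normalize(normal);     // renormalize to hide DXT compression artifacts
}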
Figure 2.5. We calculate a wrinkle-difference map by subtracting the base normal map
from the wrinkle map. At runtime, the wrinkle-difference map is selectively added on
top of the base normal map by using a wrinkle mask (see Figure 2.3 (right) for the
mask). The grey color of the image on the right is due to the bias and scale introduced
when computing the difference map.
Partial-derivative normal mapping has the following advantages over the dif-
ferences approach:
• It can be a little bit faster because it saves one GPU cycle when reconstructing
the normal, and it also allows us to add only two-component normal
derivatives instead of a full (x, y, z) difference; these two-component
additions can be done two at a time, in only one cycle. This translates to
a measured performance improvement of 1.12x on the GeForce 8600GT,
whereas we have not observed any performance gain on either the GeForce
9800GTX+ or the GeForce 295GTX.
• It requires only two channels to be stored vs. the three channels required
for the differences approach. This provides higher quality because 3Dc can
be used to compress the wrinkle map for the same memory cost.
On the other hand, the differences approach has the following advantages over
the partial-derivative normal mapping approach:
The suitability of each approach will depend on both the constraints of the
pipeline and the characteristics of the art assets.
2.3 Results
For our implementation we used DirectX 10, but the wrinkle-animation shader
itself could be easily ported to DirectX 9. However, to circumvent the limitation
that only four blend shapes can be packed into per-vertex attributes at once, we
used the DirectX 10 stream-out feature, which allows us to apply an unlimited
number of blend shapes using multiple passes [Lorach 07]. The base normal map
has a resolution of 2048 × 2048, whereas the difference wrinkle and mask maps
have a resolution of 256 × 256 and 64 × 64, respectively, as they contain only
low-frequency information. We use 3Dc compression for the base and wrinkle
maps, and DXT for the color and mask maps. The high-quality scanned head
model and textures were kindly provided by XYZRGB, Inc., with the wrinkle
maps created manually, adding the missing touch to the photorealistic look of
the images. We used a mesh resolution of 13063 triangles, mouth included, which
is a little step ahead of current generation games; however, as current high-end
systems become mainstream, it will be more common to see such high polygon
counts, especially in cinematics.
To simulate the subsurface scattering of the skin, we use the recently devel-
oped screen-space approach [Jimenez and Gutierrez 10,Jimenez et al. 10b], which
transfers computations from texture space to screen space by modulating a convo-
lution kernel according to depth information. This way, the simulation is reduced
to a simple post-process, independent of the number of objects in the scene and
easy to integrate in any existing pipeline. Facial-color animation is achieved us-
ing a recently proposed technique [Jimenez et al. 10a], which is based on in vivo
melanin and hemoglobin measurements of real subjects. Another crucial part of
our rendering system is the Kelemen/Szirmay-Kalos model, which provides real-
istic specular reflections in real time [d’Eon and Luebke 07]. Additionally, we use
the recently introduced filmic tone mapper [Hable 10], which yields really crisp
blacks.
Figure 2.6. Closeups showing the wrinkles produced by nasalis (nose), frontalis (fore-
head), and mentalis (chin) muscles.
Figure 2.7. Transition between various expressions. Having multiple mask zones for
the forehead wrinkles allows their shape to change according to the animation.
Table 2.2. Performance measurements for different GPUs. The times shown correspond
specifically to the execution of the code of the wrinkles shader.
For the head shown in the images, we have not created wrinkles for the
zones corresponding to the cheeks because the model is tessellated enough in
this zone, allowing us to produce geometric deformations directly on the blend
shapes.
Figure 2.6 shows different close-ups in which the added wrinkles can be appreciated
in detail. Figure 2.7 depicts a sequential blending between compound expressions,
illustrating that adding facial-wrinkle animation boosts realism and adds
mood to the character (frames taken from the movie are included in the web
material).
Table 2.2 shows the performance of our shader using different GPUs, from
the low-end GeForce 8600GT to the high-end GeForce 295GTX. An in-depth
examination of the compiled shader code reveals that the wrinkle shader adds a
per-pixel arithmetic instruction/memory access count of 9/3. Note that animat-
ing wrinkles is useful mostly for near-to-medium distances; for far distances it
can be progressively disabled to save GPU cycles. Besides, when similar charac-
ters share the same (u, v) arrangement, we can reuse the same wrinkles, further
improving the use of memory resources.
2.4 Discussion
From direct observation of real wrinkles, it may be natural to assume that shad-
ing could be enhanced by using techniques like ambient occlusion or parallax
occlusion mapping [Tatarchuk 07]. However, we have found that wrinkles exhibit
very little to no ambient occlusion, unless the parameters used for its generation
are pushed beyond its natural values. Similarly, self-occlusion and self-shadowing
can be thought to be an important feature when dealing with wrinkles, but in
practice we have found that the use of parallax occlusion mapping is most often
unnoticeable in the specific case of facial wrinkles.
Furthermore, our technique allows the incorporation of additional wrinkle
maps, like the lemon pose used in [Oat 07], which allows stretching wrinkles
already found in the neutral pose. However, we have not included them because
they have little effect on the expressions we selected for this particular character
model.
2.5 Conclusion
Compelling facial animation is an extremely important and challenging aspect of
computer graphics. Both games and animated feature films rely on convincing
characters to help tell a story, and a critical part of character animation is the
character’s ability to use facial expression. We have presented an efficient tech-
nique for achieving animated facial wrinkles for real-time character rendering.
When combined with traditional blend-target morphing for facial animation, our
technique can produce very compelling results that enable virtual characters to
accompany both their actions and dialog with increased facial expression. Our
system requires very little texture memory and is extremely efficient, enabling
true emotional and realistic character renderings using technology available in
widely adopted PC graphics hardware and current generation game consoles.
2.6 Acknowledgments
Jorge would like to dedicate this work to his eternal and most loyal friend Kazán. We
would like to thank Belen Masia for her very detailed review and support, Wolfgang
Engel for his editorial efforts and ideas to improve the technique, and Xenxo Alvarez
for helping to create the different poses. This research has been funded by a Marie
Curie grant from the Seventh Framework Programme (grant agreement no.: 251415),
the Spanish Ministry of Science and Technology (TIN2010-21543), and the Gobierno de
Aragón (projects OTRI 2009/0411 and CTPP05/09). Jorge Jimenez was additionally
funded by a grant from the Gobierno de Aragón. The authors would also like to thank
XYZRGB Inc. for the high-quality head scan.
Bibliography
[Acton 08] Mike Acton. “Ratchet and Clank Future: Tools of Destruction Technical
Debriefing.” Technical report, Insomniac Games, 2008.
[d’Eon and Luebke 07] Eugene d’Eon and David Luebke. “Advanced Techniques for
Realistic Real-Time Skin Rendering.” In GPU Gems 3, edited by Hubert Nguyen,
Chapter 14, pp. 293–347. Reading, MA: Addison Wesley, 2007.
[Hable et al. 09] John Hable, George Borshukov, and Jim Hejl. “Fast Skin Shading.”
In ShaderX 7 , edited by Wolfgang Engel, Chapter II.4, pp. 161–173. Hingham, MA:
Charles River Media, 2009.
[Hable 10] John Hable. “Uncharted 2: HDR Lighting.” Game Developers Conference,
2010.
[Jimenez and Gutierrez 10] Jorge Jimenez and Diego Gutierrez. “Screen-Space Sub-
surface Scattering.” In GPU Pro, edited by Wolfgang Engel, Chapter V.7. Natick,
MA: A K Peters, 2010.
[Jimenez et al. 10a] Jorge Jimenez, Timothy Scully, Nuno Barbosa, Craig Donner,
Xenxo Alvarez, Teresa Vieira, Paul Matts, Veronica Orvalho, Diego Gutierrez,
and Tim Weyrich. “A Practical Appearance Model for Dynamic Facial Color.”
ACM Transactions on Graphics 29:6 (2010), Article 141.
[Jimenez et al. 10b] Jorge Jimenez, David Whelan, Veronica Sundstedt, and Diego
Gutierrez. “Real-Time Realistic Skin Translucency.” IEEE Computer Graphics
and Applications 30:4 (2010), 32–41.
[Larboulette and Cani 04] C. Larboulette and M. Cani. “Real-Time Dynamic Wrin-
kles.” In Proc. of the Computer Graphics International, pp. 522–525. Washington,
DC: IEEE Computer Society, 2004.
[Lorach 07] T. Lorach. “DirectX 10 Blend Shapes: Breaking the Limits.” In GPU
Gems 3, edited by Hubert Nguyen, Chapter 3, pp. 53–67. Reading, MA: Addison
Wesley, 2007.
[Oat 07] Christopher Oat. “Animated Wrinkle Maps.” In SIGGRAPH ’07: ACM
SIGGRAPH 2007 courses, pp. 33–37. New York: ACM, 2007.
[Tatarchuk 07] Natalya Tatarchuk. “Practical Parallax Occlusion Mapping.” In
ShaderX 5 , edited by Wolfgang Engel, Chapter II.3, pp. 75–105. Hingham, MA:
Charles River Media, 2007.
3
Procedural Content Generation on GPU
3.1 Abstract
This article emphasizes on-the-fly procedural creation of content related to the
video games industry. We demonstrate the generating and rendering of infi-
nite and deterministic heightmap-based terrain utilizing fractal Brownian noise
calculated in real time on a GPU. We take advantage of a thermal erosion algo-
rithm proposed by David Cappola, which greatly improves the level of realism
in heightmap generation. In addition, we propose a random tree distribution al-
gorithm that exploits previously generated terrain information. Combined with
the natural-looking sky model based on Rayleigh and Mie scattering, we achieved
very promising quality results at real-time frame rates. The entire process can
be seen in our DirectX10-based demo application.
3.2 Introduction
Procedural content generation (PCG) refers to the wide process of generating
media algorithmically. Many existing games use PCG techniques to generate
a variety of content, from simple, random object placement over procedurally
generated landscapes to fully automatic creation of weapons, buildings, or AI
enemies. Game worlds tend to be increasingly rich, which requires a lot of ef-
fort that we can minimize by utilizing PCG techniques. One of the basic PCG
techniques in real-time computer graphics applications is the heightmap-based
terrain generation [Olsen 04].
Figure 3.2. Red: camera, yellow: AABB, green: different patches in the grid. (a) Cam-
era in the middle. (b) Collision detected with AABB. (c) New generation of procedural
content, new AABB. (d) New row that is generated.
float p = 0.0f;

// Calculate slope.
float f_slope_range = 0.17;
float f_slope_min = 1.0 - f_slope_range;
float3 v_normal = g_Normalmap.Sample(samClamp, IN.UV).xyz * 2.0 - 1.0;
float f_height = g_Heightmap.Sample(samClamp, IN.UV).x * 2.0 - 1.0;
float f_slope = dot(v_normal, float3(0, 1, 0));
f_slope = saturate(f_slope - f_slope_min);
float f_slope_val = smoothstep(0.0, f_slope_range, f_slope);

// Get relative height.
float f_rel_height_threshold = 0.002;
float4 v_heights = 0;

return p;
generated forest will be too sparse. To solve this issue, we have to increase the
trees-per-texel ratio; therefore, we need one more placement technique.
We assign each tree type a different radius that determines how much space
this type of tree owns (in world space). It can be compared to the situation in
a real environment when bigger trees take more resources and prevent smaller
trees from growing in their neighborhood. Also, we want our trees to be evenly
but randomly distributed across a patch corresponding to one density map texel.
Our solution is to divide the current patch into a grid wherein each cell size
is determined by the biggest tree radius in the whole patch. The total number
of cells is a mix of the density of the current texel and the space that the texel
encloses in world space. In the center of every grid cell, we place one tree and
move its position using pseudorandom offset within the grid to remove repetitive
patterns.
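A sketch of this jittered-grid placement; the hash function and names are hypothetical stand-ins for whatever pseudorandom source the application uses:

// Cheap 2D hash, used only to jitter positions (not from the article).
float2 hash21(float2 p)
{
    return frac(sin(float2(dot(p, float2(127.1, 311.7)),
                           dot(p, float2(269.5, 183.3)))) * 43758.5453);
}

// One tree per grid cell; the cell size is set by the biggest tree radius in the patch.
float2 TreePosition(float2 patchOrigin, float maxRadius, uint2 cellIndex)
{
    float cellSize = 2.0 * maxRadius;
    float2 cellCenter = patchOrigin + (float2(cellIndex) + 0.5) * cellSize;
    // Jitter inside the cell to remove the repetitive grid pattern.
    float2 jitter = (hash21(cellCenter) - 0.5) * cellSize;
    return cellCenter + jitter;
}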
Figure 3.3. (a) Heightmap, (b) normal map, and (c) tree-density map.
Using camera frustum, we process only visible texels to form a density map.
Based on the distance to the camera, the tree’s LOD is determined so that only
trees close to the camera will be rendered with a complete list of faces.
After all these steps, we have a second stream prepared and we are ready to
render all the trees. We also input the maximum number of trees to render,
because there can easily be too many, especially when we have a large
texel-to-world-space ratio. Care has to be taken with this approach
since we cannot simply process a tree-density map row by row; we have to process
texels close to the camera first. If we don’t, we might use all “available” trees
for the farthest patches, and there won’t be any close to the camera. In this
situation, we may use billboards for the farthest trees, or use a smooth transition
into the fog color.
The last part of our procedural generation is the Rayleigh-Mie atmospheric
scattering simulation [West 08, O’Neil 05]. Our implementation follows the tech-
nique described in GPU Gems 2. We first calculate optical depth to use it as a
lookup table for further generating Mie and Rayleigh textures. Mie and Rayleigh
textures are updated in every frame (using rendering to multiple render targets
(MRT)) and are then sampled during sky-dome rendering. This method is effi-
cient, fast, and gives visually pleasing results.
The tree-position stream is calculated with each frame since it depends on the
camera orientation (see Figure 3.5).
Figure 3.5. (a) Heightmap, (b) erosion map, (c) normal map, and (d) tree-density map.
Table 3.1. Minimum, maximum, and average number of frames per second.
Bibliography
[Capolla 08] D. Capolla. “GPU Terrain.” Available at [Link]
blog, 2008.
[Dudask 07] B. Dudask. “Texture Arrays for Terrain Rendering.” Avail-
able at [Link]
TextureArrayTerrain/doc/[Link], 2007.
[Green 05] S. Green. “Implementing Improved Perlin Noise.” In GPU Gems 2, edited
by Hubert Nguyen, pp. 73–85. Reading, MA: Addison-Wesley, 2005.
[Marak 97] I. Marak. “Thermal Erosion.” Available at [Link]
hostings/cescg/CESCG97/marak/[Link], 1997.
[Olsen 04] J. Olsen. “Real-Time Procedural Terrain Generation.” Available at http://
[Link]/download/terrain [Link], 2004.
[O’Neil 05] S. O’Neil. Accurate Atmospheric Scattering. Reading, MA: Addison-Wesley,
2005.
[West 08] M. West. “Random Scattering: Creating Realistic Landscapes.”
Available at [Link] scattering
creating .php, 2008.
II
Rendering
In this section we cover new techniques in the field of real-time rendering. Every
new generation of game or interactive application must push the boundaries of
what is possible to render and simulate in real time in order to remain compet-
itive and engaging. The articles presented here demonstrate some of the latest
advancements in real-time rendering that are being employed in the newest games
and interactive rendering applications.
The first article in the rendering section is “Pre-Integrated Skin Shading,” by
Eric Penner and George Borshukov. This article presents an interesting and very
efficient shading model for rendering realistic skin. It can be evaluated entirely
in a pixel shader and does not require extra rendering passes for blurring, thus
making it a very scalable skin-rendering technique.
Our next article is “Implementing Fur in Deferred Shading,” by Donald Revie.
The popularity of deferred shading has increased dramatically in recent years.
One of the limitations of working in a deferred-rendering engine is that techniques
involving alpha blending, such as fur rendering, become difficult to implement.
In this article we learn a number of tricks that enable fur to be rendered in a
deferred-shading environment.
The third article in the rendering section is “Large-Scale Terrain Rendering
for Outdoor Games,” by Ferenc Pintér. This article presents a host of production-
proven techniques that allow for large, high-quality terrains to be rendered on
resource-constrained platforms such as current-generation consoles. This arti-
cle provides practical tips for all areas of real-time terrain rendering, from the
content-creation pipeline to final rendering.
The fourth article in this section is “Practical Morphological Antialiasing,”
by Jorge Jimenez, Belen Masia, Jose I. Echevarria, Fernando Navarro, and Diego
Gutierrez. The authors take a new, high-quality, antialiasing algorithm and
demonstrate a highly optimized GPU implementation. This implementation is
so efficient that it competes quite successfully with hardware-based antialiasing
schemes in both performance and quality. This technique is particularly power-
ful because it provides a natural way to add antialiasing to a deferred-shading
engine.
We conclude the section with Emil Persson’s “Volume Decals” article. This
is a practical technique to render surface decals without the need to generate
special geometry for every decal. Instead, the GPU performs the entire projection
operation. The author shows how to use volume textures to render decals on
arbitrary surfaces while avoiding texture stretching and shearing artifacts.
The diversity of the rendering methods described in this section represents the
wide breadth of new work being generated by the real-time rendering community.
As a fan of new and clever interactive rendering algorithms, reading and editing
these articles has been a great joy. I hope you will enjoy reading them and will
find them as useful and relevant as I do.
—Christopher Oat
1
Pre-Integrated Skin Shading
1.1 Introduction
Rendering realistic skin has always been a challenge in computer graphics. Human
observers are particularly sensitive to the appearance of faces and skin, and skin
exhibits several complex visual characteristics that are difficult to capture with
simple shading models. One of the defining characteristics of skin is the way
light bounces around in the dermis and epidermis layers. When rendering using
a simple diffuse model, the light is assumed to immediately bounce equally in all
directions after striking the surface. While this is very fast to compute, it gives
surfaces a very “thin” and “hard” appearance. In order to make skin look more
“soft” it is necessary to take into account the way light bounces around inside
a surface. This phenomenon is known as subsurface scattering, and substantial
recent effort has been spent on the problem of realistic, real-time rendering with
accurate subsurface scattering.
Current skin-shading techniques usually simulate subsurface scattering during
rendering by either simulating light as it travels through skin, or by gathering
incident light from neighboring locations. In this chapter we discuss a differ-
ent approach to skin shading: rather than gathering neighboring light, we pre-
integrate the effects of scattered light. Pre-integrating allows us to achieve the
nonlocal effects of subsurface scattering using only locally stored information and
a custom shading model. What this means is that our skin shader becomes just
that: a simple pixel shader. No extra passes and no blurring are required, in
either texture space or screen space. Therefore, the cost of our algorithm
scales directly with the number of pixels shaded, just like simple shading models
such as Blinn-Phong, and it can be implemented on any hardware, with minimal
programmable shading support.
Figure 1.1. Our pre-integrated skin-shading approach uses the same diffusion profiles
as texture-space diffusion, but uses a local shading model. Note how light bleeds over
lighting boundaries and into shadows. (Mesh and textures courtesy of XYZRGB.)
the surrounding mesh curvature, bumps in the normal map, and occluded light
(shadows). We deal with each of these phenomena separately.
given curvature from one direction and measure the accumulated light at each
angle with respect to the light (see Figures 1.2 and 1.3). This results in a two-
dimensional lookup texture that we can use at runtime. More formally, for each
skin curvature and for all angles θ between N and L, we perform the integration
in Equation (1.1):
$$D(\theta, r) = \frac{\displaystyle\int_{-\pi}^{\pi} \cos(\theta + x)\, R\!\left(2r\sin(x/2)\right) dx}{\displaystyle\int_{-\pi}^{\pi} R\!\left(2r\sin(x/2)\right) dx} \tag{1.1}$$
Table 1.1. The weights used by [d’Eon and Luebke 07] for texture-space diffusion.
Although we aren’t limited to the sum of Gaussians approximations, we use the same
profile for comparison.
Figure 1.2. The graph (left) illustrates the diffusion profile of red, green, and blue light
in skin, using the sum of Gaussians from Table 1.1. The diagram (right) illustrates
how we pre-integrate the effect of scattering into a diffuse BRDF lookup. The diffusion
profile for skin (overlaid radially for one angle) is used to blur a simple diffuse BRDF
for all curvatures of skin.
[Figure: computing the curvature used for the lookup from the rate of change of the unit normal, $\frac{1}{r} = \frac{|\Delta N|}{|\Delta p|}$ with $|N| = 1$; the lookup is indexed by $N \cdot L$ and $1/r$.]
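The curvature itself can be estimated per pixel with screen-space derivatives; the following is a minimal sketch of that idea (the scale of the result depends on the units of the position, and any remapping into the lookup's curvature range is left out):

// 1/r = |dN| / |dp|, with N the unit normal and p the surface position.
float CurvatureFromDerivatives(float3 worldNormal, float3 worldPos)
{
    float deltaN = length(fwidth(worldNormal));
    float deltaP = length(fwidth(worldPos));
    return deltaN / deltaP;
}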
details represented in a normal map. We take advantage of this and let the
measured diffuse falloff be chosen at the smooth geometry level, while adding
another approach to deal with creases and small bumps in normal maps, which
are responsible for quick changes in curvature.
$$\frac{1}{n}\sum_{i=1}^{n} \left(K_{\text{diffuse}}\, L \cdot N_i\right) = K_{\text{diffuse}}\, L \cdot \left(\frac{1}{n}\sum_{i=1}^{n} N_i\right).$$
The reason this isn’t always the case is that diffuse lighting incorporates a self-
shadowing term max(0, N · L) instead of simply N · L. This means back-facing
bumps will actually contribute negative light when linearly filtered. Nonetheless,
using the unnormalized normal will still be valid when all bumps are unshadowed
or completely shadowed, and provides a better approximation than the normal-
ized normal in all situations, according to [Kilgard 00].
Although we would prefer a completely robust method of pre-integrating nor-
mal maps that supports even changes in incident/scattered light over the filtering
region, we found that blurring, using diffusion profiles, provided surprisingly good
results (whether or not we renormalize). In addition, since using four normals
would require four transformations into tangent space and four times the mem-
ory, we investigated an approximation using only one mipmapped normal map.
When using this optimization, we sample the specular normal as usual, but also
sample a red normal clamped below a tunable miplevel in another sampler. We
then transform those two normals into tangent space and blend between them
to get green and blue normals. The resulting diffuse-lighting calculations must
then be performed three times instead of once. The geometry normal can even
be used in place of the second normal map sample, if the normal map contains
small details exclusively. If larger curves are present, blue/green artifacts will
appear where the normal map and geometry normal deviate, thus the second
mipmapped sample is required.
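As an illustration of this single-map approach, the sketch below samples the detailed normal for specular, resamples the same map at a blurrier mip for the red channel, and blends toward it for green and blue. The mip level, the blend weights, and the omission of the tangent-space transform are all assumptions made for brevity:

Texture2D NormalTex        : register(t0);
SamplerState AnisoSampler  : register(s0);
SamplerState LinearSampler : register(s1);

void SampleSkinNormals(float2 uv, float redMipLevel,
                       out float3 specN, out float3 redN,
                       out float3 greenN, out float3 blueN)
{
    specN = normalize(NormalTex.Sample(AnisoSampler, uv).xyz * 2.0 - 1.0);
    redN  = normalize(NormalTex.SampleLevel(LinearSampler, uv, redMipLevel).xyz * 2.0 - 1.0);
    // Blue scatters least, so it stays closest to the detailed normal
    // (assumed blend weights).
    greenN = normalize(lerp(specN, redN, 0.5));
    blueN  = normalize(lerp(specN, redN, 0.25));
}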
We found that this approach to handling normal maps complements our cus-
tom diffuse falloff very well. Since the red normal becomes more heavily blurred,
the surface represented by the blurred normal becomes much more smooth, which
is the primary assumption made in our custom diffuse falloff. Unfortunately, there
is one caveat to using these two approaches together. Since we have separate nor-
mals for each color, we need to perform three diffuse lookups resulting in three
texture fetches per light. We discuss a few approaches to optimizing this in
Section 1.7.
[Figure 1.4 panels: original penumbra, pre-integrated penumbrae, and new penumbra; w denotes the new penumbra width and s the scattering falloff.]
a small trick, we found we could pre-integrate the effect of scattering over shadow
boundaries in the same way we represent scattering in our lighting model.
The trick we use for shadows is to think of the results of our shadow map-
ping algorithm as a falloff function rather than directly as a penumbra. When
the falloff is completely black or white, we know we are completely occluded or
unoccluded, respectively. However, we can choose to reinterpret what happens
between those two values. Specifically, if we ensure the penumbra size created
by our shadow map filter is of adequate width to contain most of the diffusion
profile, we can choose a different (smaller) size for the penumbra and use the
rest of the falloff to represent scattering according to the diffusion profile (see
Figure 1.4).
To calculate an accurate falloff, we begin by using the knowledge of the shape
of our shadow mapping blur kernel to pre-integrate a representative shadow
penumbra against the diffusion profile for skin. We define the representative
shadow penumbra P () as a one-dimensional falloff from filtering a straight shadow
edge (a step function) against the shadow mapping blur kernel. Assuming a mono-
tonically decreasing shadow mapping kernel, the representative shadow falloff is
also a monotonically decreasing function and is thus invertible within the penum-
bra. Thus, for a given shadow value we can find the position within the repre-
sentative penumbra using the inverse P −1 (). As an example, for the simple case
of a box filter, the shadow will be a linear ramp, for which the inverse is also a
linear ramp. More complicated filters have more complicated inverses and need
to be derived by hand or by using software like Mathematica. Using the inverse,
we can create a lookup texture that maps the original falloff back to its location
Figure 1.6. Comparison of our approach with texture-space diffusion using an optimized
blur kernel from [Hable et al. 09]. (Mesh and textures courtesy of XYZRGB.)
We would also like to look at the effect of using more than one principal axis
of curvature. For models where curvature discontinuities occur, we generate a
curvature map that can be blurred and further edited by hand, similar to a
stretch map in TSD.
Another challenge we would like to meet is to efficiently combine our normal
map and diffuse-lighting approaches. When using three diffuse normals, we cur-
rently need three diffuse-texture lookups. We found we could use fewer lookups
depending on the number of lights and the importance of each light. We have also
found it promising to approximate the diffuse and shadow falloffs using analytical
approximations that can be evaluated without texture lookups.
We would also like to apply our technique to environment mapping. It
should be straightforward to support diffuse-environment mapping via an ar-
ray of diffuse-environment maps that are blurred based on curvature, in the same
manner as our diffuse-falloff texture.
float Gaussian(float v, float r)
{
    return 1.0 / sqrt(2.0 * PI * v) * exp(-(r * r) / (2 * v));
}

float3 Scatter(float r)
{
    // Coefficients from GPU Gems 3 - "Advanced Skin Rendering."
    return Gaussian(0.0064 * 1.414, r) * float3(0.233, 0.455, 0.649) +
           Gaussian(0.0484 * 1.414, r) * float3(0.100, 0.336, 0.344) +
           Gaussian(0.1870 * 1.414, r) * float3(0.118, 0.198, 0.000) +
           Gaussian(0.5670 * 1.414, r) * float3(0.113, 0.007, 0.007) +
           Gaussian(1.9900 * 1.414, r) * float3(0.358, 0.004, 0.000) +
           Gaussian(7.4100 * 1.414, r) * float3(0.078, 0.000, 0.000);
}

float3 integrateShadowScattering(float penumbraLocation,
                                 float penumbraWidth)
{
    float3 totalWeights = 0;
    float3 totalLight = 0;
    float a = -PROFILE_WIDTH;
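    // The integration loop is not present in this excerpt; the following is a
    // hedged reconstruction sketch (assumed step size and falloff remapping),
    // mirroring integrateDiffuseScatteringOnRing below: march across the
    // diffusion profile, evaluate the remapped shadow falloff, and weight it
    // by the profile.
    const float inc = 0.001; // assumed integration step
    while (a <= PROFILE_WIDTH)
    {
        float light = saturate(penumbraLocation + a / penumbraWidth);
        float3 weights = Scatter(abs(a)); // profile weight at this offset
        totalWeights += weights;
        totalLight += light * weights;
        a += inc;
    }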
    return totalLight / totalWeights;
}
float3 integrateDiffuseScatteringOnRing(float cosTheta, float skinRadius)
{
    // Angle from lighting direction.
    float theta = acos(cosTheta);
    float3 totalWeights = 0;
    float3 totalLight = 0;
    float a = -(PI / 2);
    const float inc = 0.001; // integration step (value assumed; not in the excerpt)
    while (a <= (PI / 2))
    {
        float sampleAngle = theta + a;
        float diffuse = saturate(cos(sampleAngle));
        float sampleDist = abs(2.0 * skinRadius * sin(a * 0.5)); // Distance.
        float3 weights = Scatter(sampleDist); // Profile weight.
        totalWeights += weights;
        totalLight += diffuse * weights;
        a += inc;
    }
    return totalLight / totalWeights;
}
float3 SkinDiffuse(float curv, float3 NdotL)
{
    float3 lookup = NdotL * 0.5 + 0.5;
    float3 diffuse;
    diffuse.r = tex2D(SkinDiffuseSampler, float2(lookup.r, curv)).r;
    diffuse.g = tex2D(SkinDiffuseSampler, float2(lookup.g, curv)).g;
    diffuse.b = tex2D(SkinDiffuseSampler, float2(lookup.b, curv)).b;
    return diffuse;
}
Bibliography
[Borshukov and Lewis 03] George Borshukov and J.P. Lewis. “Realistic Human Face
Rendering for The Matrix Reloaded.” In ACM Siggraph Sketches and Applications.
New York: ACM, 2003.
[Borshukov and Lewis 05] George Borshukov and J.P. Lewis. “Fast Subsurface Scatter-
ing.” In ACM Siggraph Course on Digital Face Cloning. New York: ACM, 2005.
[d’Eon and Luebke 07] E. d’Eon and D. Luebke. “Advanced Techniques for Realistic
Real-Time Skin Rendering.” In GPU Gems 3, Chapter 14. Reading, MA: Addison
Wesley, 2007.
i i
i i
i i
i i
[Donner and Jensen 05] Craig Donner and Henrik Wann Jensen. “Light Diffusion in
Multi-Layered Translucent Materials.” ACM Trans. Graph. 24 (2005), 1032–1039.
[Gosselin et al. 04] D. Gosselin, P.V. Sander, and J.L. Mitchell. “Real-Time Texture-
Space Skin Rendering.” In ShaderX3: Advanced Rendering with DirectX and
OpenGL. Hingham, MA: Charles River Media, 2004.
[Green 04] Simon Green. “Real-Time Approximations to Subsurface Scattering.” In
GPU Gems, pp. 263–278. Reading, MA: Addison-Wesley, 2004.
[Hable et al. 09] John Hable, George Borshukov, and Jim Hejl. “Fast Skin Shading.” In
ShaderX7 : Advanced Rendering Techniques, Chapter II.4. Hingham, MA: Charles
River Media, 2009.
[Jensen et al. 01] Henrik Jensen, Stephen Marschner, Mark Levoy, and Pat Hanrahan.
“A Practical Model for Subsurface Light Transport.” In Proceedings of the 28th
Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH
’01, pp. 511–518. New York: ACM, 2001.
[Jimenez et al. 09] Jorge Jimenez, Veronica Sundstedt, and Diego Gutierrez. “Screen-
Space Perceptual Rendering of Human Skin.” ACM Transactions on Applied Per-
ception 6:4 (2009), 23:1–23:15.
[Kilgard 00] Mark J. Kilgard. “A Practical and Robust Bump-Mapping Technique for
Today’s GPUs.” In GDC 2000, 2000.
[Ma et al. 07] W.C. Ma, T. Hawkins, P. Peers, C.F. Chabert, M. Weiss, and P. Debevec.
“Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical
Gradient Illumination.” In Eurographics Symposium on Rendering. Aire-la-Ville,
Switzerland: Eurographics Association, 2007.
[Olano and Baker 10] Marc Olano and Dan Baker. “LEAN mapping.” In ACM Siggraph
Symposium on Interactive 3D Graphics and Games, pp. 181–188. New York: ACM,
2010.
2
Implementing Fur in Deferred Shading
In deferred rendering the format of the G-buffer (Figure 2.2) defines a standard
interface between all light-receiving materials and all light sources. Each object
assigned a light-receiving material writes a uniform set of data into the G-buffer,
which is then interpreted by each light source with no direct information regarding
the original object. One key advantage to maintaining this interface is that
geometric complexity is decoupled from lighting complexity.
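To make the interface concrete, here is a hypothetical example of what such a G-buffer layout might contain; the exact channels vary by engine and are not prescribed by this article:

// Illustrative layout only: the geometry phase writes these targets, and the
// lighting phase reads them with no knowledge of the objects that wrote them.
struct GBufferOutput
{
    float4 albedoOcclusion : SV_Target0; // rgb albedo, a = ambient occlusion
    float4 normalGloss     : SV_Target1; // xyz packed normal, w = glossiness
    float4 depthMisc       : SV_Target2; // linear depth and material flags
};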
This creates a defined pipeline (Figure 2.3) in which we render all geome-
try to the G-buffer, removing the connection between the geometric data and
individual objects, unless we store this information in the G-buffer. We then
calculate lighting from all sources in the scene using this information, creating
a light-accumulation buffer that again discards all information about individual
lights. We can revisit this information in a material pass, rendering individual
meshes again and using the screen-space coordinates to identify the area of the
light-accumulation buffer and G-buffer representing a specific object. This ma-
terial phase is required in deferred lighting, inferred lighting, and light pre-pass
rendering to complete the lighting process since the G-buffer for these techniques
does not include surface color. After this, a post-processing phase acts upon the
contents of the composition buffer, again without direct knowledge of individual
lights or objects.
This stratification of the deferred rendering pipeline allows for easy extensi-
bility in the combination of different materials and lights. However, adherence to
the interfaces involved also imposes strict limitations on the types of materials
and lights that can be represented. In particular, deferred rendering solutions
have difficulty representing transparent materials, because information regard-
ing surfaces seen through the material would be discarded. Solutions may also
struggle with materials that reflect light in a nontypical manner, potentially in-
creasing the complexity of all lighting calculations and the amount of information
required within the G-buffer. Choosing the right phases and buffer formats are
key to maximizing the power of deferred rendering solutions.
We describe techniques that address the limitations of rendering such materi-
als while continuing to respect the interfaces imposed by deferred rendering. To
illustrate these techniques and demonstrate ways in which they might be com-
bined to form complex materials, we outline in detail a solution for implementing
fur in deferred shading.
2.2 Fur
Fur has a number of characteristics that make it difficult to represent using the
same information format commonly used to represent geometry in deferred ren-
dering solutions.
Fur is a structured material composed of many fine strands forming a complex
volume rather than a single continuous surface. This structure is far too fine to
describe each strand within the G-buffer on current hardware; the resolution
required would be prohibitive in both video memory and fragment processing.
As this volumetric information cannot be stored in the G-buffer, the fur must
be approximated as a continuous surface when receiving light. We achieve this
by ensuring that the surface exhibits the observed lighting properties that would
normally be created by the structure.
Figure 2.4. Fur receiving light from behind (a) without scattering and (b) with scat-
tering approximation.
The diffuse nature of fur causes subsurface scattering; light passing into the
volume of the fur is reflected and partially absorbed before leaving the medium
at a different point. Individual strands are also transparent, allowing light to
pass through them. This is often seen as a halo effect; fur is silhouetted against
a light source that illuminates the fur layer from within, effectively bending light
around the horizon of the surface toward the viewer. This is best seen in fur with
a loose, “fluffy” structure (see Figure 2.4).
The often-uniform, directional nature of fur in combination with the struc-
ture of individual strands creates a natural grain to the surface being lit. The
reflectance properties of the surface are anisotropic, dependent on the grain di-
rection. Anisotropy occurs on surfaces characterized by fine ridges following the
grain of the surface, such as brushed metal, and causes light to reflect according
to the direction of the grain. This anisotropy is most apparent in fur that is
“sleek” with a strong direction and a relatively unbroken surface (see Figure 2.5).
Figure 2.5. Fur receiving light (a) without anisotropy and (b) with anisotropy approx-
imation.
2.3 Techniques
We look at each of the characteristics of fur separately so that the solutions dis-
cussed can be reused to represent other materials that share these characteristics
and difficulties when implemented within deferred rendering.
// Get offset direction vector.
float3 dir = IN.normal.xyz + (IN.direction.xyz * shellDepth);
dir.xyz = normalize(dir.xyz);

// Offset vertex position along fur direction.
OUT.position = IN.position;
OUT.position.xyz += (dir.xyz * shellDepth * furDepth
                     * IN.furLength);
OUT.position = mul(worldViewProjection, OUT.position);
This method of fur rendering can be further augmented with the addition of
fins, slices perpendicular to the surface of the mesh, which improve the quality of
silhouette edges. However, fin geometry cannot be generated from the base mesh
as part of a vertex program and is therefore omitted here (details on generating
fin geometry can be found in [Lengyel 01]).
This technique cannot be applied in the geometry phase because the structure
of fur is constructed from a large amount of subpixel detail that cannot be stored
in the G-buffer where each pixel must contain values for a discrete surface point.
Therefore, in deferred shading we must apply the concentric shell method in the
material phase, sampling the lighting and color information for each hair from a
single point in the light-accumulation buffer. The coordinates for this point can
be found by transforming the vertex position of the base mesh into screen space
in the same way it was transformed originally in the geometry phase (Listing 2.2).
// Vertex shader.
// See the previous listing for omitted content.

// Output screen position of base mesh vertex.
OUT.screenPos = mul(worldViewProjection, IN.position);

// ----------------------------
// Pixel shader.

IN.screenPos /= IN.screenPos.w;

// Bring values into range (0,1) from (-1,1).
float2 screenCoord = (IN.screenPos.xy + 1.f.xx) * 0.5f.xx;

// Sample lit mesh color.
color = tex2D(lightAccumulationTexture, screenCoord).rgb;
This sampling of lighting values can cause an issue specific to rendering the
fur. As fur pixels are offset from the surface being sampled, it is possible for the
original point to have been occluded by other geometry and thus be missing from
the G-buffer. In this case the occluding geometry, rather than the base mesh,
would contribute to the coloring of the fur, leading to visual artifacts
(Figure 2.8). We explore a solution to this in Sections 2.3.4 and 2.4.4 of this
article.
2.3.3 Anisotropy
Anisotropic light reflection occurs on surfaces where the distribution of surface
normals is dependent on the surface direction; such surfaces are often character-
ized by fine ridges running in a uniform direction across the surface, forming a
“grain.” The individual strands in fur and hair can create a surface that exhibits
this type of lighting [Scheuermann 04].
This distinctive lighting is created because in anisotropic surfaces the ridges
or, in this case, the strands are considered to be infinitely fine lines running
parallel to the grain. These lines do not have a defined surface normal but instead
have an infinite number of possible normals radiating out perpendicularly to their
direction (see Figure 2.11). Therefore, the lighting calculation at any point on the
surface must integrate the lighting for all the normals around the strand. This
is not practical in a pixel shader; the best solution is to choose a single normal
that best represents the lighting at this point [Wyatt 07].
In forward shading, anisotropy is often implemented using a different lighting
calculation from those used to describe other surfaces (Listing 2.4) [Heidrich 98].
This algorithm calculates lighting based on the grain direction of the surface
rather than the normal.
Diffuse  = sqrt(1 - <L, T>^2)
Specular = sqrt(1 - <L, T>^2) * sqrt(1 - <V, T>^2)
           - <L, T> * <V, T>
In deferred shading we cannot know in the geometry phase the nature of any
light sources that might contribute to the lighting on the surface and are bound
by the interface of the G-buffer to provide a surface normal. Therefore, we define
the most significant normal as the normal that is coplanar with the grain direction
and the eye vector at that point (see Figure 2.12). We calculate the normal of
this plane as the cross product of the eye vector and the grain direction; the
normal for lighting is then the cross product of the plane’s normal and the grain
direction (see Listing 2.5).
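In shader terms this construction amounts to two cross products. The following sketch assumes eyeVec and grainDir are normalized vectors in the same space; the names are illustrative rather than taken from the original listing.

// Sketch of the normal selection described above (eyeVec is the normalized
// view direction, grainDir the normalized grain/combing direction).
float3 planeNormal    = normalize(cross(eyeVec, grainDir));
float3 lightingNormal = normalize(cross(planeNormal, grainDir));
// lightingNormal lies in the plane spanned by the eye vector and the grain,
// and is perpendicular to the strand, so it is the single normal that best
// represents the anisotropic response toward the viewer.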
Figure 2.13. (a) Isotropic highlight and (b) anisotropic highlight.
this characteristic of stretched lighting for all light sources, including image-based
lights.
// Get screen coordinates in range (0,1).
float2 screenCoord = ((IN.screenPos.xy / IN.screenPos.w)
                      + 1.f.xx) * 0.5h.xx;

// Convert coordinates into pixels.
int2 sample = screenCoord.xy * float2(1280.f, 720.f);

// If pixel is not the top left in a 2x2 tile, discard it.
int2 tileIndices = int2(sample.x % 2, sample.y % 2);
if ((tileIndices.x != 0) || (tileIndices.y != 0))
    discard;
In deferred shading, transparent objects are written into the G-buffer using
a stipple pattern. During the material phase, the values of pixels containing
data for the transparent surface are blended with neighboring pixels containing
information on the scene behind. By varying the density of the stipple pattern,
different resolutions of data can be interleaved, allowing for multiple layers of
transparency. This technique is similar to various interlaced rendering methods
for transparent surfaces [Pangerl 09, Kircher 09].
The concept of stippled rendering can be extended further to blend multiple
definitions of a single surface together. By rendering the same mesh multiple
times but writing distinctly different data in alternating pixels, we can assign
multiple lighting values for each point on the object at a reduced resolution.
During the material phase the object is rendered again, and this information
is deinterlaced and combined to allow for more complex lighting models. For
example, a skin material could write separate values for a subsurface scattering
pass and a specular layer, as interleaved samples. The material pass would then
additively blend the specular values over the blurred result of the diffuse lighting
values.
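As an illustration, a material-phase shader might deinterlace a 2x2 stipple along these lines; the texture and constant names here (lightAccumulationTexture, viewportSize) are assumptions, and the diffuse sample is presumed to have been blurred by a separate pass.

// Fetch the two interleaved lighting values written for this surface.
float2 pixel    = floor(screenCoord * viewportSize);
float2 tileBase = pixel - fmod(pixel, 2.0);   // top-left pixel of the 2x2 tile
float2 invSize  = 1.0 / viewportSize;
float3 diffuse  = tex2D(lightAccumulationTexture,
                        (tileBase + 0.5) * invSize).rgb;              // e.g., subsurface term
float3 specular = tex2D(lightAccumulationTexture,
                        (tileBase + float2(1.5, 0.5)) * invSize).rgb; // e.g., specular layer
float3 color    = diffuse + specular;  // blend the specular layer over the diffuse result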
Figure 2.15. Fur length (left) and direction (right) encoded as color.
A color set was chosen to encode this data for a number of reasons. First,
many asset-creation tools allow for easy “painting” of vertex colors while viewing
the shaded mesh in real time. This effectively allows the author to choose a
direction represented as a color and then comb sections of the fur appropriately
using the tool; setting the alpha component of the color trims the length of the fur
locally. Second, the approach of encoding direction as color is already familiar to
most authors through the process of generating world- and tangent-space normal
maps. The process has proven to be quite intuitive and easy to use.
As part of the loading process, we transform the vectors encoded in this
color channel from the local space of the mesh into its tangent space and at the
same time orthonormalize them, making them tangential to the mesh surface.
Thus when the mesh is deformed during animation, the combing direction of the
fur will remain constant in relation to the surface orientation. This is the only
engine-side code that was required to fully support the fur-rendering technique
(see Listing 2.7).
// Build local to tangent space matrix.
Matrix tangentSpace;
tangentSpace.LoadIdentity();
tangentSpace.SetCol(0, tangent);
tangentSpace.SetCol(1, binormal);
tangentSpace.SetCol(2, normal);
tangentSpace.Transpose();

// Convert color into vector.
// Gram-Schmidt orthonormalization.
dir = dir - (dot(dir, normal) * normal);
dir.Normalise();

// Transform vector into tangent space.
tangentSpace.TransformInPlace(dir);
Texture data. To provide the G-buffer with the necessary surface information, the
material is assigned an RGB albedo map and a lighting map containing per-pixel
normal information together with specular intensity and exponent. In
addition to this, a second albedo map is provided to describe the changes applied
to lighting as it passes deeper into the fur; over the length of the strands, the
albedo color that is used is blended from this map to the surface color. This
gives the author a high degree of control over how the ambient occlusion term is
applied to fur across the whole surface, allowing for greater variation.
To represent the fur volume required for the concentric shell rendering, a
heightfield was chosen as an alternative to generating a volumetric data set.
While this solution restricts the types of volume that can be described, it requires
considerably less texture information to be stored and accessed in order to render
the shells. It is more flexible in that it can be sampled using an arbitrary number
of slices without the need to composite slices when undersampling the volume,
and it is far simpler to generate with general-image authoring tools.
global fur length scaled by the vertex color alpha. The pixel shader identifies
likely silhouette edges using the dot product of the view vector and the surface
normals; the normals at these points are adjusted by adding the view vector
scaled by this weight value. The unmodified normals are recalculated as
anisotropic normals, like those of the first pass (Figure 2.17).
This second pass solves the occlusion issue when constructing concentric fur
shells from the light-accumulation buffer, since both samples are unlikely to be
occluded simultaneously while any part of the strand is still visible. The second
pass allows light calculations to be performed for both the surface of the mesh
and also the back faces where light entering the reverse faces may be visible.
In order to avoid incorrect results from screen-space ambient occlusion (SSAO),
edge detection, and similar techniques that rely on discontinuities in the G-buffer,
these should be calculated before the second pass since the stipple pattern will
create artifacts.
In the pixel shader these two samples are retrieved from the light-accumulation
buffer, their respective linear depths in the G-buffer are also sampled to compare
against the depth of the sample coordinates and thus correct for occlusion errors.
If both samples are valid, the maximum of the two is chosen to allow for the halo
effect of scattering around the edges of the object without darkening edges where
there is no back lighting. The contribution of the albedo map to the accumulated
light values is removed by division and then reapplied as a linear interpolation
of the base and top albedo maps to account for ambient occlusion by the fur.
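A minimal sketch of this combination step follows; every name in it (coordA, coordB, the expected depth values, depthTolerance, baseAlbedo, topAlbedo, shellFactor) is an assumption used for illustration.

// Reject samples whose G-buffer depth does not match the base mesh.
float3 litA   = tex2D(lightAccumulationTexture, coordA).rgb;
float3 litB   = tex2D(lightAccumulationTexture, coordB).rgb;
bool   validA = abs(tex2D(depthTexture, coordA).r - expectedDepthA) < depthTolerance;
bool   validB = abs(tex2D(depthTexture, coordB).r - expectedDepthB) < depthTolerance;

// Take the maximum of the two valid samples to keep the back-lighting halo
// without darkening edges that receive no back lighting.
float3 lit = (validA && validB) ? max(litA, litB) : (validA ? litA : litB);

// Remove the albedo baked into the accumulated lighting, then reapply it as
// a blend of base and top albedo to approximate occlusion within the fur.
float3 light = lit / max(baseAlbedo, 0.0001);
float3 color = light * lerp(baseAlbedo, topAlbedo, shellFactor);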
The heightfield for the fur volume is sampled at a high frequency by applying
an arbitrary scale to the mesh UVs in the material. The smoothstep function is
used to fade out pixels in the current shell as the interpolation factor equals and
exceeds the values stored in the heightfield, thus individual strands of fur fade
out at different rates, creating the impression of subpixel detail (see Figure 2.20).
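The per-shell fade can be sketched as follows, with furHeightfield, furTileScale, and fadeWidth as hypothetical names and shellFactor as the current shell's interpolation factor in [0, 1].

// Strands whose stored height falls below the current shell fade out.
float height = tex2D(furHeightfield, IN.uv * furTileScale).r;
float alpha  = 1.0 - smoothstep(height - fadeWidth, height, shellFactor);
// Because each texel fades at a different shell, individual strands vanish
// at different rates, creating the impression of subpixel detail.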
2.5 Conclusion
This article has described a series of techniques used to extend the range of
materials that can be represented in a deferred rendering environment, and in
particular a combination of these techniques that can be used to render aesthetically pleasing
fur at real-time speeds.
2.6 Acknowledgments
Thanks to everyone at Cohort Studios for showing their support, interest, and enthu-
siasm during the development of this technique, especially Bruce McNeish and Gordon
Bell, without whom there would be no article.
Special thanks to Steve Ions for patiently providing excellent artwork and feedback
while this technique was still very much in development, to Baldur Karlsson and Gordon
McLean for integrating the fur, helping track down the (often humorous) bugs, and
bringing everything to life, and to Shaun Simpson for all the sanity checks.
Bibliography
[Engel 09] W. Engel. “The Light Pre-Pass Renderer.” In ShaderX7 , pp. 655–666. Hing-
ham, MA: Charles River Media, 2009.
[Green 04] S. Green. “Real-Time Approximations to Sub-Surface Scattering.” In GPU
Gems, pp. 263–278. Reading, MA: Addison Wesley, 2004.
[Hable 09] J. Hable, G. Borshakov, and J. Heil. “Fast Skin Shading.” In ShaderX7 , pp.
161–173. Hingham, MA: Charles River Media, 2009.
[Heidrich 98] W. Heidrich and Hans-Peter Seidel. “Efficient Rendering of Anisotropic
Surfaces Using Computer Graphics Hardware.” In Proceedings of Image and Multi-
Dimensional Digital Signal Processing Workshop. Washington, DC: IEEE, 1998.
[Kircher 09] S. Kircher and A. Lawrance. “Inferred Lighting: Fast Dynamic Lighting
and Shadows for Opaque and Translucent Objects.” In Proceedings of the 2009
ACM SIGGRAPH Symposium on Video Games, Sandbox ’09, pp. 39–45. New
York: ACM, 2009.
[Lengyel 01] J. Lengyel, E. Praun, A. Finkelstein, and H. Hoppe. “Real-Time Fur Over
Arbitrary Surfaces.” In I3D ’01 Proceedings of the 2001 Symposium on Interactive
3D Graphics, pp. 227–232. New York: ACM Press, 2001.
[Mittring 09] M. Mittring. “A Bit More Deferred - CryEngine3.” Triangle Game Con-
ference, 2009.
[Mulder 98] J. D. Mulder, F. C. A. Groen, and J. J. van Wijk. “Pixel Masks for Screen-
Door Transparency.” In Proceedings of the Conference on Visualization ’98, pp.
351–358. Los Alamitos, CA: IEEE Computer Society Press, 1998.
[Valient 07] M. Valient. “Deferred Rendering in Killzone 2.” Develop Conference, July
2007.
3
II
3.1 Introduction
Visualizing large-scale (above 10 km²) landscapes on current-generation consoles
is a challenging task, because of the restricted amount of memory and processing
power compared with high-end PCs. This article describes in detail how we
approach both limitations and deal with the different content creation, rendering,
and performance issues. After a short explanation of the decisions and trade-
offs we made, terrain-content generation and rendering methods will follow. We
conclude by listing the pros and cons of our technique, measured implementation
performance, as well as possible extensions.
Figure 3.1. In-game screenshot of a 10 km² canyon area. (© 2010 Digital Reality.)
Figure 3.2. Screenshot from Battlestations: Pacific, released on XBOX 360 and PC.
(© 2008 Square Enix Europe.)
We found these drawbacks too restricting, and chose to go with the approaches
listed below.
Figure 3.4. Alpha channel of the painted 512² tilemap, containing terrain-type info.
(© 2010 Digital Reality.)
Figure 3.5. Editor-artist interface for painting the tilemap, red ring indicating brush
position. (© 2010 Digital Reality.)
Creating the tilemap is very intuitive. Its terrain-index part can be based on
either soil-type maps exported from many commercial terrain-generation/erosion
software programs (though you might need to convert the world-space type values
to relaxed UV space), or global terrain height- and slope-based rules, enhanced
by noise.
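For example, a simple height/slope rule for the index could look like the sketch below, whether evaluated offline or in a generation pass; all thresholds, the type constants, and the Y-up convention are assumptions.

// Hypothetical rule for picking a terrain-type index per tilemap texel.
float slope = 1.0 - worldNormal.y;                 // 0 = flat ground, 1 = vertical wall
float n     = tex2D(NoiseTex, uv * noiseScale).r;  // noise breaks up hard transitions
float type  = (height < seaLevel + n)   ? TYPE_SAND :
              (slope  > rockSlopeLimit) ? TYPE_ROCK : TYPE_GRASS;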
The base for the color part can again originate from DCC tools, from shadow
maps, or can be simply desaturated color noise. Over this base, artists can
easily paint or modify the chosen terrain-type values using simple brushes, and
temporary layers. Ray casting is used to determine which tilemap texels the
brush touches. The editor interface (see Figure 3.5) can also support multiple
undo-levels (by caching paint commands), soft brushes, or paint limitations (to
allow painting only over regions within desired height/slope limits).
At runtime, hardware bilinear filtering of the tilemap indices automatically
solves type-blending problems present in the LUT method, and different nearby
tile-index values will get smoothly interpolated over the terrain pixels. We can
also compress the tilemap using DXT5 texture compression. Since this format
compresses the three color channels independently from the alpha channel, we
can keep most of the index resolution while shrinking memory usage.
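The runtime fetch could then look roughly like this; the assumption here is that the type index is stored normalized in the alpha channel, so that the 0..1 range maps to the 16 types of the 4 × 4 atlas.

// Tilemap: RGB holds the color tint, alpha the terrain-type index.
float4 tilemap   = tex2D(TileTable, In.UV);  // bilinear filtering blends nearby indices
float3 ColorTint = tilemap.rgb;
float  TileIndex = tilemap.a * 15.0;         // assumed normalized storage of 16 types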
Note: Suppose we have sand encoded as tile-index value 0, grass as 1, and rock
as 2. Now, due to filtering, rock can never be visible near sand, but only through
an intermediate grass area. This can be avoided by duplicating types with
Figure 3.6. Editor-artist interface for painting mesh objects, red sphere indicating brush
volume. (© 2010 Digital Reality.)
Figure 3.7. A packed 2048² diffuse-texture atlas, containing 16 different terrain types.
(© 2010 Digital Reality.)
we have a 4 × 4 atlas. This also means that we cannot rely on hardware tex-
ture wrapping; this must be performed manually in the shader. As we will see,
this causes problems in hardware miplevel selection (texture arrays do not need
these corrections, however). For this to work correctly, one must know how the
hardware calculates which mip levels to use. GPUs use the first derivatives of
screen-space texture coordinates in any given 2×2 pixel block and the dimensions
of the texture itself to determine the pixel-to-texel projection ratio, and thus find
the appropriate miplevel to use. To access a tile from the atlas for any pixel, we
need to emulate the hardware wrapping for the tile. By using the frac() HLSL
intrinsic, we break the screen-space UV derivative continuity for the pixel quads
at tile borders. Since the derivatives will be very large, the hardware will pick the
largest miplevel from the chain, which in turn results in a one-pixel-wide seam
whenever the subtiling wraps around. Fortunately, we have many options here:
we can balance the GPU load between texture filtering, arithmetic logic unit
(ALU) cost, shader thread counts, texture stalls, and overall artifact visibility.
The safest but slowest option is to calculate the mip level manually in the
shader, right before sampling from the atlas [Wloka 03, Brebion 08]. This produces
the correct result, but the extra ALU cost is high since we need to issue gradient
instructions that require extra GPU cycles, and textures need to be sampled
with manually specified mip levels, which reduces the sampling rate on many
architectures. As a side effect, texture stalls begin to appear in the pipeline. We
can use multiple methods to shorten these stalls. Some compilers and platforms
allow for explicitly setting the maximum number of general purpose GPU registers
(GPRs) a compiled shader can use. (They try to optimize the shader code to meet
the specified limit, sometimes by issuing more ALUs to move temporary shader
data around with fewer registers.) If a shader uses fewer GPRs, more shading
cores can run it in parallel, thus the number of simultaneous threads increases.
Using more threads means that stalling all of them is less probable. On systems
using unified shader architectures, one can also increase pixel shader GPR count
by reducing the available GPRs for the vertex shader. Some platforms also have
shader instructions that explicitly return the mip level the hardware would use
at a given pixel, thus saving you from having to compute it yourself in a shader.
Using dynamic branching and regular hardware mipmapping on pixel quads far
from the frac() region as speedup might also prove useful.
Branch performance might degrade for faraway fragments though, where tiling
UV values and derivatives vary fast, and pixels in the same GPU-pixel-processing
vector must take different branches. Branching might be disabled for distant
fragments, since stall problems are also most relevant on up close terrain, which
fills most screen area and uses the first few mip levels.
One option for estimating the correct mip level is to construct a texture that
encodes the mip index in the texture itself (for example, the first mip level encodes
“0” in all its texels, the second mip level encodes “1” in all its texels, etc.). This
texture should have the same dimensions as the atlas tile. You can then use a
normal texture fetch to sample this texture and allow the hardware to choose the
appropriate mip level. The value of the texel that is fetched will indicate which
mip level was chosen by the hardware and then can be used to issue a tex2dlod
instruction on the atlas tile. Dynamic branching is a viable option here too.
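A sketch of this lookup, with MipIndexTex as an assumed 8-bit texture whose mip level i stores the value i in every texel, might be:

// Let the hardware pick the mip level for a tile-sized texture, then read
// back which level it chose and sample the atlas explicitly at that level.
float  mip  = tex2D(MipIndexTex, fracUV).r * 255.0;   // assumes 8-bit unorm storage
float4 colA = tex2Dlod(DiffAtlas, float4(UV_A, 0.0, mip));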
We chose to go with a third option, which is the fastest, but does result
in some minor artifacts which we deemed acceptable. We simply sample using
regular tex2D, but we generate only the first four mipmaps of the mip chain. This
means that the GPU has to filter a bit more, but our measurements have shown
that only 7–10% of texels fall into the larger miplevels, thus the performance
Figure 3.8. Screenshot from the canyon, with 100x UV tiling factor (note the lack of
UV distortions on walls). (© 2010 Digital Reality.)
struct sVSInput
{
    float4 Position : POSITION;
    float3 Normal   : NORMAL;
    float2 UV       : TEXCOORD0;
    float3 Tangent  : TEXCOORD1;
};

struct sVSOutput
{
    float4 ProjPos    : POSITION;
    float2 UV         : TEXCOORD0;
    float3 Normal     : TEXCOORD1;
    float3 TgLightVec : TEXCOORD2;
};

float4x3 cWorldMatrix;
float4x4 cViewProjMatrix;
float    cUVmultiplier;   // Terrain-texture tiling factor.
float3   cCameraPos;
float3   cSunDirection;
float3   cSunColor;

// Lighting is in tangent space.
float3x3 MakeWorldToTangent(float3 iTangent, float3 iNormal)
{
    float3x3 TangentToLocal =
        float3x3(iTangent, cross(iNormal, iTangent), iNormal);
    float3x3 TangentToWorld =
        mul(TangentToLocal, (float3x3)cWorldMatrix);
    float3x3 WorldToTangent = transpose(TangentToWorld);
    return WorldToTangent;
}
    float3x3 WorldToTangent =
        MakeWorldToTangent(In.Tangent, In.Normal);

    return Out;
}

sampler2D DiffAtlas;
sampler2D NormAtlas;
sampler2D TileTable;

float GetMipLevel(float2 iUV, float2 iTextureSize)
{
    float2 dx = ddx(iUV * iTextureSize.x);
    float2 dy = ddy(iUV * iTextureSize.y);
    float  d  = max(dot(dx, dx), dot(dy, dy));
    return 0.5 * log2(d);
}

    float2 fracUV = frac(In.UV);
    float2 DiffCorrectUV = fracUV / 4.0f;

    // Blend types and blend ratio.
    float typeA  = floor(TileIndex);
    float typeB  = ceil(TileIndex);
    float factor = TileIndex - typeA;

    float tmp   = floor(typeA / 4);
    float2 UV_A = DiffCorrectUV + float2(typeA - tmp * 4, tmp) / 4;
    tmp         = floor(typeB / 4);
    float2 UV_B = DiffCorrectUV + float2(typeB - tmp * 4, tmp) / 4;

    float4 DiffuseColor = lerp(colA, colB, factor);

    // Extract normal map.
    float3 norm = 2 * (normtex.rgb - 0.5);
    float3 tgnormal = normalize(norm);

    float NdotL =
        saturate(dot(tgnormal, normalize(In.TgLightVec)));
    float3 SunDiffuseColor = NdotL * cSunColor;

    float3 Albedo = DiffuseColor.xyz * ColorTint * 2;
    float3 AmbientColor = 0.5;
    float3 LitAlbedo = Albedo * (AmbientColor + SunDiffuseColor);

    return float4(LitAlbedo, 1);
}
3.4 Performance
The system load for rendering the base terrain mesh, spanning a 10 km² area
and consisting of 600 k triangles, is 14 MB of memory and 6 ms of frame time on
the Xbox 360. This includes geometry vertex and index data, LODs, diffuse, normal,
shadow, and terrain type maps, while not using any streaming bandwidth. To
reach the desired rendering time and memory footprint, a couple of optimizations
are required.
To allow using 16-bit indices in the triangle list, the terrain had to be sliced
into blocks no larger than 65 k vertices during COLLADA import. Using blocks
with <1 km² area also helps our static KD-tree based culling. Balancing between
better culling and fewer draw calls can be done by adjusting block count and
block size. LODs are also generated at import time, since they are essential to
reduce vertex load, and also keep the pixel quad efficiency high. If the LODs
are generated by skipping vertices, only index data needs to be extended by
a small amount (say, 33%, if each consecutive level contains a quarter of the
previous version, with regard to triangle count), and each new LOD can refer
to the original, untouched vertex buffer. Keeping silhouettes and vertices at
block boundaries regardless of LOD is important in order to prevent holes from
appearing in places where different LODs meet. To help pre- and post-transform
vertex and index caches, vertex and index reordering is also done at the import
phase.
Vertex compression is also heavily used to reduce the required amount of
memory and transfer bandwidth. We can store additional information in the
vertices. AO and soil-type indices are valid choices, but color-tint values, shadow
terms, or bent normals for better shading are still possible.
Texture atlases are also compressed. We found atlases of 4 × 4 tiles of 512²
texels to contain enough variation and resolution. The diffuse atlas can use the
DXT1 format, while the normal atlas can use better, platform-specific 2-channel
Figure 3.9. Screenshot of a canyon wall, with distance-dependent atlas tiling factor.
(© 2010 Digital Reality.)
3.6 Acknowledgments
First, I would like to thank Balázs Török, my friend and colleague, for his helpful
suggestions and for the initial idea of contributing to the book. I would also like
to thank the art team at Digital Reality for their excellent work, on which I based
my illustrations. Also, special thanks to my love Tímea Kapui for her endless
patience and support.
Bibliography
[Andersson 07] Johan Andersson. “Terrain Rendering in Frostbite Using Proce-
dural Shader Splatting.” In ACM SIGGRAPH 2007 courses. New York:
ACM, 2007. Available online ([Link]
Chapter5-Andersson-Terrain Rendering in [Link]).
[Barrett 08] Sean Barrett. “Sparse Virtual Textures blog.” 2008. Available online (http:
//[Link]/src/svt/).
[Brebion 08] Flavien Brebion. “Journal of Ysaneya, [Link].” 2008. Available on-
line ([Link]
263350).
[Dudash 07] Bryan Dudash. “DX10 Texture Arrays, nVidia SDK.” 2007.
Available online ([Link]
[Link]).
[Eidos 08] Eidos, Square Enix Europe. “Battlestations: Pacific.” 2008. Available online
([Link]
[Geiss 07] Ryan Geiss. “Generating Complex Procedural Terrains Using the GPU.” In
GPU Gems 3. Reading, MA: Addison Wesley, 2007. Available online ([Link]
[Link]/GPUGems3/gpugems3 [Link]).
[Hardy 09] Alexandre Hardy. “Blend Maps: Enhanced Terrain Texturing.” Available
online ([Link]
[Herzog et al. 10] Robert Herzog, Elmar Eisemann, Karol Myszkowski, and H-P. Seidel.
“Spatio-Temporal Upsampling on the GPU.”
[Kemen 08] Brano Kemen. “Journal of Lethargic Programming, Cameni’s [Link]
blog, July 2008.” 2008. Available online ([Link]
forums/mod/journal/[Link]?jn=503094&cmonth=7&cyear=2008).
[Mittring 08] Martin Mittring. “Advanced Virtual Texturing Topics.” In ACM SIG-
GRAPH 2008 Classes SIGGRAPH ’08, pp. 23–51. New York, NY, USA: ACM,
2008.
[Pangilinan and Ruppel 10] Erick Pangilinan and Robh Ruppel. “Uncharted 2 Art di-
rection.” GDC 2010. Available online ([Link]
280/conference/).
[Persson 06] Emil Persson. “Humus, Ambient Aperture Lighting.” Available online
([Link]
[Quilez 09] Inigo Quilez. “Behind Elevated.” Function 2009. Available online (http://
[Link]/www/material/function2009/[Link]).
[van Rossen 08] Sander van Rossen. “Sander’s Blog on Virtual Texturing.” 2008.
Available online ([Link]
-[Link]).
[van Waveren 09] J. M. P. van Waveren. “idTech5 Challenges: From Texture Virtualiza-
tion to Massive Parallelization.” Siggraph 2009. Available online ([Link]
[Link]/talks/05-JP id Tech 5 [Link]).
[Wetzel 07] Mikey Wetzel. “Under The Hood: Revving Up Shader Perfor-
mance.” Microsoft Gamefest Unplugged Europe, 2007. Available online
([Link]
-44E9-A43E-6F1615D9FCE0&displaylang=en).
[Wloka 03] Matthias Wloka. “Improved Batching Via Texture Atlases.” In ShaderX3 .
Hingham, MA: Charles River Media, 2003. Available online ([Link]
com/Tables%20of%[Link]).
4
II
Practical Morphological
Antialiasing
Jorge Jimenez, Belen Masia, Jose I. Echevarria,
Fernando Navarro, and Diego Gutierrez
The use of antialiasing techniques is crucial when producing high quality graphics.
Up to now, multisampling antialiasing (MSAA) has remained the most advanced
solution, offering superior results in real time. However, there are important
drawbacks to the use of MSAA in certain scenarios. First, the increase in pro-
cessing time it consumes is not negligible at all. Further, limitations of MSAA
include the impossibility, on a wide range of platforms, of activating multisam-
pling when using multiple render targets (MRT), on which fundamental techniques
such as deferred shading [Shishkovtsov 05, Koonce 07] rely. Even on platforms
where MRT and MSAA can be simultaneously activated (i.e., DirectX 10), imple-
mentation of MSAA is neither trivial nor cost free [Thibieroz 09]. Additionally,
MSAA poses a problem for the current generation of consoles. In the case of the
Xbox 360, memory constraints force the use of CPU-based tiling techniques in
case high-resolution frame buffers need to be used in conjunction with MSAA;
whereas on the PS3 multisampling is usually not even applied. Another drawback
of MSAA is its inability to smooth nongeometric edges, such as those resulting
from the use of alpha testing, frequent when rendering vegetation. As a result,
when using MSAA, vegetation can be antialiased only if alpha to coverage is
used. Finally, multisampling requires extra memory, which is always a valuable
resource, especially on consoles.
In response to the limitations described above, a series of techniques have im-
plemented antialiasing solutions in shader units, the vast majority of them being
based on edge detection and blurring. In S.T.A.L.K.E.R. [Shishkovtsov 05], edge
detection is performed by calculating differences in the eight-neighborhood depth
values and the four-neighborhood normal angles; then, edges are blurred using
a cross-shaped sampling pattern. A similar, improved scheme is used in Tabula
Rasa [Koonce 07], where edge detection uses threshold values that are resolution
independent, and the full eight-neighborhood of the pixel is considered for dif-
ferences in the normal angles. In Crysis [Sousa 07], edges are detected by using
depth values, and rotated triangle samples are used to perform texture lookups
using bilinear filtering. These solutions alleviate the aliasing problem but do not
mitigate it completely. Finally, in Killzone 2, samples are rendered into a double
horizontal resolution G-buffer. Then, in the lighting pass, two samples of the
G-buffer are queried for each pixel of the final buffer. The resulting samples are
then averaged and stored in the final buffer. However, this necessitates executing
the lighting shader twice per final pixel.
In this article we present an alternative technique that avoids most of the prob-
lems described above. The quality of our results lies between 4x and 8x MSAA
at a fraction of the time and memory consumption. It is based on morphological
antialiasing [Reshetov 09], which relies on detecting certain image patterns to
reduce aliasing. However, the original implementation is designed to be run in a
CPU and requires the use of list structures that are not GPU-amenable.
Since our goal is to achieve real-time practicality in games with current main-
stream hardware, our algorithm implements aggressive optimizations that provide
an optimal trade-off between quality and execution times. Reshetov searches for
specific patterns (U-shaped, Z-shaped, and L-shaped patterns), which are then
decomposed into simpler ones, an approach that would be impractical on a GPU.
We realize that the pattern type, and thus the antialiasing to be performed, de-
pends on only four values, which can be obtained for each edge pixel (edgel) with
only two memory accesses. This way, the original algorithm is transformed so
that it uses texture structures instead of lists (see Figure 4.1). Furthermore, this
approach allows handling of all pattern types in a symmetric way, thus avoiding
the need to decompose them into simpler ones. In addition, precomputation of
Figure 4.1. Starting from an aliased image (left), edges are detected and stored in the
edges texture (center left). The color of each pixel depicts where edges are: green pixels
have an edge at their top boundary, red pixels at their left boundary, and yellow pixels
have edges at both boundaries. The edges texture is then used in conjunction with the
precomputed area texture to produce the blending weights texture (center right) in the
second pass. This texture stores the weights for the pixels at each side of an edgel in the
RGBA channels. In the third pass, blending is performed to obtain the final antialiased
image (right).
certain values into textures allows for an even faster implementation. Finally,
in order to accelerate calculations, we make extensive use of hardware bilinear
interpolation for smartly fetching multiple values in a single query and provide
means of decoding the fetched values into the original unfiltered values. As a
result, our algorithm can be efficiently executed by a GPU, has a moderate mem-
ory footprint, and can be integrated as part of the standard rendering pipeline of
any game architecture.
Some of the optimizations presented in this work may seem to add complexity
at a conceptual level, but as our results show, their overall contribution makes
them worth including. Our technique yields image quality between 4x and 8x
MSAA, with a typical execution time of 3.79 ms on Xbox 360 and 0.44 ms on a
NVIDIA GeForce 9800 GTX+, for a resolution of 720p. Memory footprint is 2x
the size of the backbuffer on Xbox 360 and 1.5x on the 9800 GTX+. According
to our measurements, 8x MSAA takes an average of 5 ms per image on the same
GPU at the same resolution, that is, our algorithm is 11.80x faster.
In order to show the versatility of our algorithm, we have implemented the
shader both for Xbox 360 and PC, using DirectX 9 and 10 respectively. The code
presented in this article is that of the DirectX 10 version.
4.1 Overview
The algorithm searches for patterns in edges which then allow us to reconstruct
the antialiased lines. This can, in general terms, be seen as a revectorization of
edges. In the following we give a brief overview of our algorithm.
First, edge detection is performed using depth values (alternatively, lumi-
nances can be used to detect edges; this will be further discussed in Section 4.2.1).
We then compute, for each pixel belonging to an edge, the distances in pixels from
it to both ends of the edge to which the edgel belongs. These distances define
the position of the pixel with respect to the line. Depending on the location of
the edgel within the line, it will or will not be affected by the antialiasing pro-
cess. In those edges which have to be modified (those which contain yellow or
green areas in Figure 4.2 (left)) a blending operation is performed according to
Equation (4.1):
c_new = (1 − a) · c_old + a · c_opp,    (4.1)
where c_old is the original color of the pixel, c_opp is the color of the pixel on the
other side of the line, c_new is the new color of the pixel, and a is the area shown
in yellow in Figure 4.2 (left). The value of a is a function of both the pattern
type of the line and the distances to both ends of the line. The pattern type is
defined by the crossing edges of the line, i.e., edges which are perpendicular to
the line and thus define the ends of it (vertical green lines in Figure 4.2). In order
to save processing time, we precompute this area and store it as a two-channel
texture that can be seen in Figure 4.2 (right) (see Section 4.3.3 for details).
Figure 4.2. Antialiasing process (left). Color c_opp bleeds into c_old according to the area
a below the blue line. Texture containing the precomputed areas (right). The texture
uses two channels to store areas at each side of the edge, i.e., for a pixel and its opposite
(pixels (1, 1) and (1, 2) on the left). Each 9 × 9 subtexture corresponds to a pattern
type. Inside each of these subtextures, (u, v) coordinates encode distances to the left
and to the right, respectively.
for all the pixels in the image this way, given the fact that two adjacent pixels
have a common boundary. This difference is thresholded to obtain a binary value,
which indicates whether an edge exists in a pixel boundary. This threshold, which
varies with resolution, can be made resolution independent [Koonce 07]. Then,
the left and top edges are stored, respectively, in the red and green channels of
the edges texture, which will be used as input for the next pass.
Whenever using depth-based edge detection, a problem may arise in places
where two planes at different angles meet: the edge will not be detected because
of samples having the same depth. A common solution to this is the addition of
information from normals. However, in our case we found that the improvement
in quality obtained when using normals was not worth the increase in execution
time it implied.
Then, for each pixel, the difference in luminance with respect to the pixel on top
and on the left is obtained, the implementation being equivalent to that of depth-
based detection. When thresholding to obtain a binary value, we found 0.1 to be
an adequate threshold for most cases. It is important to note that using either
luminance- or depth-based edge detection does not affect the following passes.
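A sketch of the luminance-based variant is given below; it follows the structure of Listing 4.1, with colorTex assumed to be the bound color buffer, and the particular luma weights being an assumption.

// Luma of the current pixel and of its left and top neighbors.
float3 W     = float3(0.2126, 0.7152, 0.0722);
float  L     = dot(colorTex.SampleLevel(PointSampler, texcoord, 0).rgb, W);
float  Lleft = dot(colorTex.SampleLevel(PointSampler, texcoord, 0, int2(-1, 0)).rgb, W);
float  Ltop  = dot(colorTex.SampleLevel(PointSampler, texcoord, 0, int2(0, -1)).rgb, W);

// Threshold the differences (0.1 as suggested above) to get binary edges.
float2 delta = abs(float2(L, L) - float2(Lleft, Ltop));
float2 edges = step(0.1, delta);
if (dot(edges, 1.0) == 0.0) discard;
return float4(edges, 0.0, 0.0);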
Although qualitywise both methods offer similar results, depth-based detec-
tion is more robust, yielding a more reliable edges texture. And, our technique
takes, on average, 10% less time when using depth than when using luminance
values. Luminance values are useful when depth information cannot be accessed
and thus offer a more universal approach. Further, when depth-based detection
is performed, edges in shading will not be detected, whereas luminance-based
detection allows for antialias shading and specular highlights. In general terms,
one could say that luminance-based detection works in a more perceptual way be-
cause it smoothes visible edges. As an example, when dense vegetation is present,
using luminance values is faster than using depth values (around 12% faster for
the particular case shown in Figure 4.5 (bottom row)), since a greater number of
edges are detected when using depth values. Optimal results in terms of quality,
at the cost of a higher execution time, can be obtained by combining luminance,
depth, and normal values.
Listing 4.1 shows the source code of this pass, using depth-based edge detec-
tion. Figure 4.1 (center left) is the resulting image of the edge-detection pass,
in this particular case, using luminance-based detection, as depth information is
not available.
float4 EdgeDetectionPS(float4 position : SV_POSITION,
                       float2 texcoord : TEXCOORD0) : SV_TARGET {

    // We need these for updating the stencil buffer.
    float Dright  = depthTex.SampleLevel(PointSampler,
                                         texcoord, 0, int2(1, 0));
    float Dbottom = depthTex.SampleLevel(PointSampler,
                                         texcoord, 0, int2(0, 1));

    if (dot(edges, 1.0) == 0.0) {
        discard;
    }

    return edges;
}
float4 BlendingWeightCalculationPS(
        float4 position : SV_POSITION,
        float2 texcoord : TEXCOORD0) : SV_TARGET {

    float4 weights = 0.0;
    float2 e = edgesTex.SampleLevel(PointSampler, texcoord, 0).rg;

    [branch]
    if (e.g) { // Edge at north.
        float2 d = float2(SearchXLeft(texcoord),
                          SearchXRight(texcoord));
        float4 coords = mad(float4(d.x, -0.25, d.y + 1.0, -0.25),
                            PIXEL_SIZE.xyxy, texcoord.xyxy);
        float e1 = edgesTex.SampleLevel(LinearSampler,
                                        coords.xy, 0).r;
        float e2 = edgesTex.SampleLevel(LinearSampler,
                                        coords.zw, 0).r;
        weights.rg = Area(abs(d), e1, e2);
    }

    [branch]
    if (e.r) { // Edge at west.
        float2 d = float2(SearchYUp(texcoord),
                          SearchYDown(texcoord));
        float4 coords = mad(float4(-0.25, d.x, -0.25, d.y + 1.0),
                            PIXEL_SIZE.xyxy, texcoord.xyxy);
        float e1 = edgesTex.SampleLevel(LinearSampler,
                                        coords.xy, 0).g;
        float e2 = edgesTex.SampleLevel(LinearSampler,
                                        coords.zw, 0).g;
        weights.ba = Area(abs(d), e1, e2);
    }

    return weights;
}
2. When the returned value is 0.5 we cannot distinguish which of the two
pixels contains an edgel.
Figure 4.3. Hardware bilinear filtering is used when searching for distances from each
pixel to the end of the line. The color of the dot at the center of each pixel represents
the value of that pixel in the edges texture. In the case shown here, distance search of
the left end of the line is performed for the pixel marked with a star. Positions where
the edges texture is accessed, fetching pairs of pixels, are marked with rhombuses. This
allows us to travel twice the distance with the same number of accesses.
In the particular case of the Xbox 360 implementation, we make use of the
tfetch2D assembler instruction, which allows us to specify an offset in pixel
units with respect to the original texture coordinates of the query. This in-
struction is limited to offsets between −8 and 7.5, which constrains the maximum
distance that can be searched. When searching for distances greater than eight
pixels, we cannot use the hardware as efficiently and the performance is affected
negatively.
Listing 4.3. Distance search function (search in the left direction case).
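A sketch of such a search function, consistent with the description above, is shown here; maxSearchSteps is an assumed iteration limit, and PIXEL_SIZE, edgesTex, and LinearSampler are used as in the other listings.

float SearchXLeft(float2 texcoord) {
    // Offset by -1.5 pixels so the bilinear fetch reads two edgels at once.
    texcoord -= float2(1.5, 0.0) * PIXEL_SIZE;
    float e = 0.0;
    int i;
    for (i = 0; i < maxSearchSteps; i++) {
        e = edgesTex.SampleLevel(LinearSampler, texcoord, 0).g;
        // Compare against 0.9 rather than 1.0 to sidestep bilinear precision issues.
        if (e < 0.9) break;
        texcoord -= float2(2.0, 0.0) * PIXEL_SIZE;  // advance two pixels per fetch
    }
    // Distance is negative to the left; the last fetched value accounts for a
    // partial pair, and the result is clamped to the maximum search range.
    return max(-2.0 * i - 2.0 * e, -2.0 * maxSearchSteps);
}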
Figure 4.4. Examples of the four possible types of crossing edge and corresponding
value returned by the bilinear query of the edges texture. The color of the dot at
the center of each pixel represents the value of that pixel in the edges texture. The
rhombuses, at a distance of 0.25 from the center of the pixel, indicate the sampling
position, while their color represents the value returned by the bilinear access.
float2 Area(float2 distance, float e1, float e2) {
    // * By dividing by AREA_SIZE - 1.0 below we are
    //   implicitly offsetting to always fall inside a pixel.
    // * Rounding prevents bilinear access precision problems.
    float2 pixcoord = NUM_DISTANCES *
                      round(4.0 * float2(e1, e2)) + distance;
    float2 texcoord = pixcoord / (AREA_SIZE - 1.0);
    return areaTex.SampleLevel(PointSampler, texcoord, 0).rg;
}
allows for simpler and faster indexing. The round instruction is used to avoid
possible precision problems caused by the bilinear filtering.
Following the same reasoning (explained at the beginning of the section) by
which we store area values for two adjacent pixels in the same pixel of the final
blending weights texture, the precomputed area texture needs to be built on a
per-edgel basis. Thus, each pixel of the texture stores two a values, one for a
pixel and one for its opposite. (Again, a will be zero for one of them in all cases
with the exception of those pixels centered on lines of odd length.)
1. the current pixel, which gives us the north and west blending weights;
Once more, to exploit hardware capabilities, we use four bilinear filtered accesses
to blend the current pixel with each of its four neighbors. Finally, as one pixel can
belong to four different lines, we find an average of the contributing lines. List-
ing 4.5 shows the source code of this pass; Figure 4.1 (right) shows the resulting
image.
float4 NeighborhoodBlendingPS(
        float4 position : SV_POSITION,
        float2 texcoord : TEXCOORD0) : SV_TARGET {

    float4 topLeft = blendTex.SampleLevel(PointSampler,
                                          texcoord, 0);
    float bottom = blendTex.SampleLevel(PointSampler,
                                        texcoord, 0,
                                        int2(0, 1)).g;
    float right = blendTex.SampleLevel(PointSampler,
                                       texcoord, 0,
                                       int2(1, 0)).a;
    float4 a = float4(topLeft.r, bottom, topLeft.b, right);

    float sum = dot(a, 1.0);

    [branch]
    if (sum > 0.0) {
        float4 o = a * PIXEL_SIZE.yyxx;
        float4 color = 0.0;
        color = mad(colorTex.SampleLevel(LinearSampler,
                    texcoord + float2(0.0, -o.r), 0), a.r, color);
        color = mad(colorTex.SampleLevel(LinearSampler,
                    texcoord + float2(0.0,  o.g), 0), a.g, color);
        color = mad(colorTex.SampleLevel(LinearSampler,
                    texcoord + float2(-o.b, 0.0), 0), a.b, color);
        color = mad(colorTex.SampleLevel(LinearSampler,
                    texcoord + float2( o.a, 0.0), 0), a.a, color);
        return color / sum;
    } else {
        return colorTex.SampleLevel(LinearSampler, texcoord, 0);
    }
}
4.5 Results
Qualitywise, our algorithm lies between 4x and 8x MSAA, requiring a memory
consumption of only 1.5x the size of the backbuffer on a PC and of 2x on Xbox
360.4 Figure 4.5 shows a comparison between our algorithm, 8x MSAA, and
no antialiasing at all on images from Unigine Heaven Benchmark. A limitation
of our algorithm with respect to MSAA is the impossibility of recovering subpixel
4 The increased memory cost in the Xbox 360 is due to the fact that two-channel render
targets with 8-bit precision cannot be created in the framework we used for that platform,
forcing the usage of a four-channel render target for storing the edges texture.
Figure 4.5. Examples of images without antialiasing, processed with our algorithm, and
with 8x MSAA. Our algorithm offers similar results to those of 8x MSAA. A special
case is the handling of alpha textures (bottom row). Note that in the grass shown
here, alpha to coverage is used when MSAA is activated, which provides additional
detail, hence the different look. As the scene is animated, there might be slight changes
in appearance from one image to another. (Images from Unigine Heaven Benchmark
courtesy of Unigine Corporation.)
Figure 4.6. Images obtained with our algorithm. Insets show close-ups with no
antialiasing at all (left) and processed with our technique (right). (Images from Fable
III courtesy of Lionhead Studios.)
Figure 4.7. More images showing our technique in action. Insets show close-ups with
no antialiasing at all (left) and processed with our technique (right). (Images from Fable
III courtesy of Lionhead Studios.)
Table 4.1. Average times and standard deviations for a set of well-known commercial
games. A column showing the speed-up factor of our algorithm with respect to 8x
MSAA is also included for the PC/DirectX 10 implementation. Values marked with *
indicate 4x MSAA, since 8x was not available; the grand average includes
values only for 8x MSAA.
features. Further results of our technique, on images from Fable III, are shown
in Figures 4.6 and 4.7. Results of our algorithm in-game are available in the web
material.
As our algorithm works as a post-process, we have run it on a batch of screen-
shots of several commercial games in order to gain insight about its performance
in different scenarios. Given the dependency of the edge detection on image
content, processing times are variable. We have noticed that each game has a
more or less unique “look-and-feel,” so we have taken a representative sample of
five screenshots per game. Screenshots were taken at 1280 × 720 as the typical
case in the current generation of games. We used the slightly more expensive
luminance-based edge detection, since we did not have access to depth informa-
tion. Table 4.1 shows the average time and standard deviation of our algorithm
on different games and platforms (Xbox 360/DirectX 9 and PC/DirectX 10),
as well as the speed-up factor with respect to MSAA. On average, our method
implies a speed-up factor of 11.80x with respect to 8x MSAA.
4.6 Discussion
This section includes a brief compilation of possible alternatives that we tried,
in the hope that it would be useful for programmers employing this algorithm in
the future.
Edges texture compression. This is perhaps the most obvious possible optimiza-
tion, saving memory consumption and bandwidth. We tried two different alterna-
tives: a) using 1 bit per edgel, and b) separating the algorithm into a vertical and
a horizontal pass and storing the edges of four consecutive pixels in the RGBA
i i
i i
i i
i i
channels of each pixel of the edges texture (vertical and horizontal edges sepa-
rately). This has two advantages: first, the texture uses less memory; second, the
number of texture accesses is lower since several edges are fetched in each query.
However, storing the values and—to a greater extent—querying them later, be-
comes much more complex and time consuming, given that bitwise operations
are not available in all platforms. Nevertheless, the use of bitwise operations in
conjunction with edges texture compression could further optimize our technique
in platforms where they are available, such as DirectX 10.
Storing crossing edges in the edges texture. Instead of storing just the north and
west edges of the actual pixel, we tried storing the crossing edges situated at the
left and at the top of the pixel. The main reason for doing this was that we
could spare one texture access when detecting patterns; but we realized that by
using bilinear filtering we could also spare the access, without the need to store
those additional edges. The other reason for storing the crossing edges was that,
by doing so, when we searched for distances to the ends of the line, we could
stop the search when we encountered a line perpendicular to the one we were
following, which is an inaccuracy of our approach. However, the current solution
yields similar results, requires less memory, and processing time is lower.
Storing distances instead of areas. Our first implementation calculated and stored
only the distances to the ends of the line in the second pass, and they were then
used in the final pass to calculate the corresponding blending weights. However,
directly storing areas in the intermediate pass allows us to spare calculations,
reducing execution time.
4.7 Conclusion
In this chapter, we have presented an algorithm crafted for the computation of
antialiasing. Our method is based on three passes that detect edges, determine
the position of each pixel inside those image features, and produce an antialiased
result that selectively blends the pixel with its neighborhood according to its
relative position within the line it belongs to. We also take advantage of hardware
texture filtering, which allows us to reduce the number of texture fetches by half.
Our technique features execution times that make it usable in actual game
environments, and that are far shorter than those needed for MSAA. The method
presented has a minimal impact on existing rendering pipelines and is entirely
implemented as an image post-process. Resulting images are between 4x and
8x MSAA in quality, while requiring a fraction of their time and memory con-
sumption. Furthermore, it can antialias transparent textures such as the ones
used in alpha testing for rendering vegetation, whereas MSAA can smooth vege-
tation only when using alpha to coverage. Finally, when using luminance values
to detect edges, our technique can also handle aliasing belonging to shading and
specular highlights.
The method we are presenting solves most of the drawbacks of MSAA, which
is currently the most widely used solution to the problem of aliasing; the pro-
cessing time of our method is one order of magnitude below that of 8x MSAA.
We believe that the quality of the images produced by our algorithm, its speed,
efficiency, and pluggability, make it a good choice for rendering high quality im-
ages in today’s game architectures, including platforms where benefiting from
antialiasing, together with outstanding techniques like deferred shading, was dif-
ficult to achieve. In summary, we present an algorithm which challenges the
current gold standard for solving the aliasing problem in real time.
4.8 Acknowledgments
Jorge would like to dedicate this work to his eternal and most loyal friend Kazán. The
authors would like to thank the colleagues at the lab for their valuable comments, and
Christopher Oat and Wolfgang Engel for their editing efforts and help in obtaining
images. Thanks also to Lionhead Studios and Microsoft Games Studios for granting
permission to use images from Fable III. We are very grateful for the support and
useful suggestions provided by the Fable team during the production of this work. We
would also like to express our gratitude to Unigine Corporation, and Denis Shergin in
particular, for providing us with images and material for the video (available in the web
material) from their Unigine Heaven Benchmark. This research has been funded by a
Marie Curie grant from the 7th Framework Programme (grant agreement no.: 251415),
the Spanish Ministry of Science and Technology (TIN2010-21543) and the Gobierno de
Aragón (projects OTRI 2009/0411 and CTPP05/09). Jorge Jimenez and Belen Masia
are also funded by grants from the Gobierno de Aragón.
Bibliography
[Koonce 07] Rusty Koonce. “Deferred Shading in Tabula Rasa.” In GPU Gems 3,
pp. 429–457. Reading, MA: Addison Wesley, 2007.
[Reshetov 09] Alexander Reshetov. “Morphological Antialiasing.” In HPG ’09: Pro-
ceedings of the Conference on High Performance Graphics 2009, pp. 109–116. New
York: ACM, 2009. Available online ([Link]
publications/papers/2009/mlaa/[Link]).
[Shishkovtsov 05] Oles Shishkovtsov. “Deferred Shading in S.T.A.L.K.E.R.” In GPU
Gems 2, pp. 143–166. Reading, MA: Addison Wesley, 2005.
[Sousa 07] Tiago Sousa. “Vegetation Procedural Animation and Shading in Crysis.” In
GPU Gems 3, pp. 373–385. Reading, MA: Addison Wesley, 2007.
[Thibieroz 09] Nicolas Thibieroz. “Deferred Shading with Multisampling Anti-Aliasing
in DirectX 10.” In ShaderX7 , pp. 225–242. Hingham, MA: Charles River Media,
2009.
5
II
Volume Decals
Emil Persson
5.1 Introduction
Decals are often implemented as textured quads that are placed on top of the
scene geometry. While this implementation works well enough in many cases,
it can also provide some challenges. Using decals as textured quads can cause
Z-fighting problems. The underlying geometry may not be flat, causing the decal
to cut into the geometry below it. The decal may also overhang an edge, com-
pletely ruining its effect. Dealing with this problem often involves clipping the
decal to the geometry or discarding it entirely upon detecting the issue. Alterna-
tively, very complex code is needed to properly wrap the decal around arbitrary
meshes, and access to vertex data is required. On a PC this could mean that
system-memory copies of geometry are needed to maintain good performance.
Furthermore, disturbing discontinuities can occur, as in the typical case of shoot-
ing a rocket into a corner and finding that only one of the walls got a decal or
that the decals do not match up across the corner. This article proposes a tech-
nique that overcomes all of these challenges by projecting a decal volume onto
the underlying scene geometry, using the depth buffer.
What we are really interested in, though, is the local position relative to
the decal volume. The local position is used as a texture coordinate to
sample a volume texture containing a volumetric decal (see Figure 5.1). Since
the decal is a volumetric texture, it properly wraps around nontrivial geometry
with no discontinuities (see Figure 5.2). To give each decal a unique appearance,
a random rotation can also be baked into the matrix for each decal. Since we
do a matrix transformation we do not need to change the shader code other
than to name the matrix more appropriately as ScreenToLocal, which is then
constructed as follows:
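As a rough sketch of how such a ScreenToLocal matrix might be assembled (the matrix names below are illustrative and not taken from the accompanying sample), the transform chains the screen-to-world reconstruction with the decal's inverse placement. With the row-vector convention used by mul() in the shader below, the matrices are concatenated left to right:

// Hedged sketch; all matrices except the result are assumed inputs.
// texToClip   : scale-bias from [0..1] screen coordinates and depth to clip space
//               (x,y mapped to [-1..1], y flipped for Direct3D).
// invViewProj : inverse of the camera view-projection matrix.
// worldToDecal: inverse of the decal placement (translation, half-size scale,
//               and the optional random rotation).
// localToTex  : scale-bias from the [-1..1] decal box to the [0..1] range used
//               to address the volume texture.
float4x4 screenToLocal = mul(mul(mul(texToClip, invViewProj), worldToDecal), localToTex);
// screenToLocal is the value uploaded to the ScreenToLocal shader constant.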
The full fragment shader for this technique is listed below and a sample with
full source code is available in the web materials.
cbuffer Constants
{
    float4x4 ScreenToLocal;
    float2   PixelSize;
};

// Resource declarations implied by the shader body; the decal is a volume texture.
Texture2D    DepthTex;
SamplerState DepthFilter;
Texture3D    DecalTex;
SamplerState DecalFilter;

// PsIn is assumed to provide the pixel position as float4 Position : SV_Position.
float4 main(PsIn In) : SV_Target
{
    // Compute normalized screen position
    float2 texCoord = In.Position.xy * PixelSize;

    // Compute local position of scene geometry
    float  depth  = DepthTex.Sample(DepthFilter, texCoord).x;
    float4 scrPos = float4(texCoord, depth, 1.0f);
    float4 wPos   = mul(scrPos, ScreenToLocal);

    // Sample decal
    float3 coord = wPos.xyz / wPos.w;
    return DecalTex.Sample(DecalFilter, coord);
}
volume at the time it was added to the scene and cull for discarded pixels that
do not belong to any of those objects.
5.2.3 Optimizations
On platforms where the depth-bounds test is supported, the depth-bounds test
can be used to improve performance. On other platforms, dynamic branching can
be used to emulate this functionality by comparing the sample depth to the depth
bounds. However, given that the shader is relatively short and typically a fairly
large number of fragments survive the test, it is recommended to benchmark to
verify that it actually improves performance. In some cases it may in fact be
faster to not attempt to cull anything.
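As an illustration only, the shader-side emulation could look like the following fragment, where DecalDepthBounds is an assumed constant (added to the constant buffer) holding the decal volume's near and far depth in the same space as the depth buffer:

// Sketch: early-out for pixels whose scene depth lies outside the decal volume.
float depth = DepthTex.Sample(DepthFilter, texCoord).x;
if (depth < DecalDepthBounds.x || depth > DecalDepthBounds.y)
{
    discard;   // outside the decal's depth range; skip the decal lookup
}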
5.2.4 Variations
In some cases it is desirable to use a two-dimensional texture instead of a volume
decal. Volume textures are difficult to author and consume more memory. Not
all cases translate well from a two-dimensional case to three dimensions. A bullet
hole decal can be swept around to a spherical shape in the three-dimensional
case and can then be used in any orientation, but this is not possible for many
kinds of decals; an obvious example is a decal containing text, such as a logo or
graffiti tag.
An alternate technique is to sample a two-dimensional texture using just the
x, y components of the final coordinates. The z component can be used for fading.
When a volume texture is used, you can get an automatic fade in all directions
by letting the texture alpha fade to zero toward the edges and using a border
color with an alpha of zero. In the 2D case you will have to handle the z direction
yourself.
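A minimal sketch of this two-dimensional variant is shown below; DecalTex2D is an assumed 2D decal texture, and the linear fade along z is only one possible choice:

// Sample the decal with x,y only and fade manually along z (coord is in [0..1]).
float3 coord = wPos.xyz / wPos.w;
float4 decal = DecalTex2D.Sample(DecalFilter, coord.xy);
decal.a *= saturate(1.0f - abs(coord.z * 2.0f - 1.0f));   // 1 at the center slice, 0 at the ends
return decal;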
Two-dimensional decals are not rotation invariant so when placing them in
the scene they must be oriented such that they are projected sensibly over the
underlying geometry. The simplest approach would be to just align the decal
plane with the normal of the geometry at the decal’s center point. Some prob-
lematic cases exist though, such as when wrapping over a corner of a wall. If it is
placed flat against the wall you will get a perpendicular projection on the other
side of the corner with undesirable texture-stretching as a result.
An interesting use of the two-dimensional case is to simulate a blast in a
certain direction. This can be accomplished by using a pyramid or frustum shape
from the point of the blast. When the game hero shoots a monster you place a
frustum from the bullet-impact point on the monster to the wall behind it in the
direction of the bullet and you will get the effect of blood and slime smearing onto
the wall. The projection matrix of this frustum will have to be baked into the
ScreenToLocal matrix to get the proper projection of the texture coordinates.
The blast technique can also be varied for a cube decal scenario. This would
better simulate the effect of a grenade blast. In this case a cube or sphere would be
rendered around the site of the blast and a cubemap lookup is performed with the
final coordinates. Fading can be effected using the length of the coordinate vector.
To improve the blast effect you can use the normals of underlying geometry
to eliminate the decal on back-facing geometry. For the best results, a shad-
owmapesque technique can be used to make sure only the surfaces closest to the
front get smeared with the decal. This “blast-shadow map” typically has to be
generated only once at the time of the blast and can then be used for the rest
of the life of the decal. Using the blast-shadow map can ensure splatter happens
only in the blast shadow of monsters and other explodable figures, whereas areas
in the blast-shadow map that contain static geometry only get scorched. This
requires storing a tag in the shadow buffer for pixels belonging to monsters, how-
ever. Creative use of the shadow map information also can be used to vary the
blood-splatter intensity over the distance from the blast to the monster and from
the monster to the smeared wall.
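For the cube-decal variation described above, the lookup could be sketched as follows, assuming the ScreenToLocal matrix maps into a space centered on the blast and BlastDecalCube is an assumed cube map:

// Cube-map blast decal: the local position doubles as the lookup vector,
// and its length drives the fade away from the blast center.
float3 coord = wPos.xyz / wPos.w;
float4 decal = BlastDecalCube.Sample(DecalFilter, coord);
decal.a *= saturate(1.0f - length(coord));   // fade out toward the volume boundary
return decal;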
5.3 Conclusions
An alternate approach for decal rendering has been shown that suggests solu-
tions to many problems of traditional decal-rendering techniques. Using vol-
umes instead of flat decal geometry allows for continuous decals across nontrivial
geometry. It also eliminates potentially expensive buffer locks or the need for
system-memory buffer copies.
III
Global Illumination
Effects
The good news: global illumination effects (exceeding ambient occlusion) have
found their way into production! The advances in graphics hardware capabilities
in recent years, combined with smart algorithms and phenomenologically well-
motivated—and also well-understood—approximations of light transport, allow
the rendering of ever-increasing numbers of phenomena in real time. This sec-
tion includes four articles describing rendering techniques of global illumination
effects, and all of them are suited for direct rendering applications in real time.
Reprojection caching techniques, introduced about three years ago, can ex-
ploit temporal coherence in rendering and, by this, reduce computation for costly
pixel shaders by reusing results from previous frames. In “Temporal Coherence to
Improve Screen-Space Ambient Occlusion,” Oliver Mattausch, Daniel Scherzer,
and Michael Wimmer adapt temporal coherence for improving the performance
of screen-space ambient occlusion (SSAO) techniques. Their algorithm reuses am-
bient occlusion (AO) sample information from previous frames if available, and
adaptively generates more AO samples as needed. Spatial filtering is applied only
to regions where the AO computation has not yet converged. This improves the
overall quality as well as performance of SSAO.
In “Level-of-Detail and Streaming Optimized Irradiance Normal Mapping,”
Ralf Habel, Anders Nilsson, and Michael Wimmer describe a clever technique for
irradiance normal mapping, which has been successfully used in various games.
They introduce a modified hemispherical basis (hierarchical, in the spirit of spher-
ical harmonics) to represent low-frequency directional irradiance. The key to this
basis is that it contains the traditional light map as one of its coefficients, and
further basis functions provide additional directional information. This enables
shader level-of-detail (LOD) (in which the light map is the lowest LOD), and
streaming of irradiance textures.
“Real-Time One-Bounce Indirect Illumination and Indirect Shadows Using
Ray Tracing,” by Holger Gruen, describes an easy-to-implement technique to
achieve one-bounce indirect illumination, including shadowing of indirect light, in
real time, which is often neglected for fully dynamic scenes. His method consists
of three phases: rendering of indirect light with reflective shadow maps (RSMs),
creating a three-dimensional grid as acceleration structure for ray-triangle inter-
section using the capabilities of Direct3D 11 hardware, and finally computing
the blocked light using RSMs and ray casting, which is then subtracted from the
result of the first phase.
In their article, “Real-Time Approximation of Light Transport in Translu-
cent Homogenous Media,” Colin Barré-Brisebois and Marc Bouchard describe an
amazingly simple method to render plausible translucency effects for a wide range
of objects made of homogeneous materials. Their technique combines precom-
puted, screen-space thickness of objects with local surface variation into a shader
requiring only very few instructions and running in real time on a PC and console
hardware. The authors also discuss scalability issues and the artist friendliness
of their shading technique.
“Real-Time Diffuse Global Illumination with Temporally Coherent Light Prop-
agation Volumes,” by Anton Kaplanyan, Wolfgang Engel, and Carsten Dachs-
bacher, describes the global-illumination approach used in the upcoming game
Crysis 2. The technique consists of four stages: in the first stage all lit surfaces
of the scene are rendered into RSMs. Then a sparse three-dimensional grid of
radiance distribution is initialized with the generated surfels from the first stage.
In the next step, the authors propagate the light in this grid using an iterative
propagation scheme and, in the last stage, the resulting grid is used to illumi-
nate the scene similarly to the irradiance volumes technique described by Natalya
Tatarchuk in the article “Irradiance Volumes for Games.”
—Carsten Dachsbacher
1
III
Temporal Screen-Space
Ambient Occlusion
Oliver Mattausch, Daniel Scherzer,
and Michael Wimmer
1.1 Introduction
Ambient occlusion (AO) is a shading technique that computes how much of the
hemisphere around a surface point is blocked, and modulates the surface color
accordingly. It is heavily used in production and real-time rendering, because it
produces plausible global-illumination effects with relatively low computational
cost. Recently it became feasible to compute AO in real time, mostly in the form
of screen-space ambient occlusion (SSAO). SSAO techniques use the depth buffer
as a discrete scene approximation, thus have a constant overhead and are simple
to implement.
However, to keep the computation feasible in real time, concessions have to be
made regarding the quality of the SSAO solution, and the SSAO evaluation has
to be restricted to a relatively low number of samples. Therefore, the generated
AO is usually prone to surface noise, which can be reduced in a post-processing
step with a discontinuity filter. Depending on the chosen filter settings, we can
either keep sharp features and accept some noise, or get a smooth but blurry
solution due to filtering over the edges (as can be seen in Figure 1.1). Also, for
dynamically moving objects, the noise patterns will sometimes appear to float on
the surfaces, which is a rather distracting effect. To get a solution that is neither
noisy nor blurry, many more samples have to be used. This is where temporal
coherence comes into play.
Figure 1.1. SSAO without temporal coherence (23 FPS) with 32 samples per pixel,
with (a) a weak blur, (b) a strong blur. (c) TSSAO (45 FPS), using 8–32 samples per
pixel (initially 32, 8 in a converged state). (d) Reference solution using 480 samples per
frame (2.5 FPS). All images at 1024 × 768 resolution and using 32-bit precision render
targets. The scene has 7 M vertices and runs at 62 FPS without SSAO.
Temporal coherence allows us to keep the number of samples that are computed in a single frame low, while effectively
accumulating hundreds of samples in a short amount of time. Note that ambient
occlusion has many beneficial properties that make it well suited for temporal
coherence: there is no directional dependence on the light source or the viewer,
AO techniques consider only the geometry in a local neighborhood, and only the
SSAO in a pixel neighborhood is affected by a change in the scene configuration.
In this article, we focus specifically on how to use reverse reprojection to im-
prove the quality of SSAO techniques in a deferred shading pipeline. In particular,
we show how to detect and handle changes to the SSAO caused by moving en-
tities, animated characters, and deformable objects. We demonstrate that these
cases, which are notoriously difficult for temporal coherence methods, can be
significantly improved as well. A comparison of our temporal SSAO (TSSAO)
technique with conventional SSAO and a reference solution in a static scene con-
figuration can be seen in Figure 1.1.
Note that this algorithm is complementary to the method described in the
“Fast Soft Shadows With Temporal Coherence” chapter of this book, which also
provides code fragments that describe the reprojection process.
Figure 1.2. This figure compares rendering without (left) and with (right) AO, and
shows that AO allows much better depth perception and feature recognition, without
requiring any additional lighting.
ao(p, n_p) = \frac{1}{\pi} \int_{\Omega} V(p, \omega)\, D(|p - \xi|)\, (n_p \cdot \omega)\, d\omega, \qquad (1.1)
where ω denotes all directions on the hemisphere and V is the (inverse) binary
visibility function, with V (p, ω) = 1 if the visibility in this direction is blocked
by an obstacle, 0 otherwise. D is a monotonic decreasing function between 1
and 0 of the distance from p to ξ, the intersection point with the nearest surface.
In the simplest case, D is a step function, considering obstacles within a certain
sampling radius only, although a smooth falloff provides better results, (e.g., as
given by an exp(.) function).
Figure 1.2 demonstrates the visual impact of SSAO for the depth perception
of a scene.
In our deferred shading pipeline, we store eye-linear depth values for the
current frame and the previous frame, and use them to reconstruct the world-space
positions p. In our implementation, because we already store the world-space
positions, we have only to transform the current world-space position p_f
with the previous view-projection matrix P_{f-1} V_{f-1} to get t_{f-1}. From t_{f-1} we
calculate the correct lookup coordinates \text{tex}_{f-1} into the AO buffer by applying
the perspective division and scaling the result to the range [0..1] (i.e.,
\text{tex}_{f-1} = \frac{t_{f-1} + 1}{2}).
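As an illustration only (the names below are made up, and the column-vector mul order follows the OpenGL/Cg setup used for the results in Section 1.6), the reprojection step can be written as:

// Sketch of reverse reprojection for one pixel; prevViewProj = P(f-1) V(f-1).
float4 tPrev   = mul(prevViewProj, float4(worldPos, 1.0f));  // previous-frame clip space
tPrev.xyz     /= tPrev.w;                                    // perspective division
float2 texPrev = tPrev.xy * 0.5f + 0.5f;                     // scale to [0..1]
float4 oldAO   = tex2D(PrevAOBuffer, texPrev);               // accumulated AO solution and weight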
C_f(p) = \frac{1}{k} \sum_{i=j_f(p)+1}^{j_f(p)+k} C(p, s_i), \qquad (1.5)
where j_f(p) counts the number of unique samples that have already been used
in this solution. We combine the new solution with the previously computed one:

AO_f(p) = \frac{w_{f-1}(p)\, AO_{f-1}(p_{f-1}) + k\, C_f(p)}{w_{f-1}(p) + k}, \qquad w_f(p) = \min(w_{f-1}(p) + k,\; w_{\max}), \qquad (1.6)

where the weight w_{f-1} is the number of samples that have already been accumulated
in the solution, or a predefined maximum after convergence has been
reached.
The current index position is propagated to the next frame by means of reverse
reprojection as with the SSAO values. In order to prevent the index position from
being interpolated by the hardware and introducing a bias into the sequence, it
is important to always fetch the index value from the nearest pixel center in the
AO buffer. The pixel center can be found from the reprojected coordinates \text{tex}_{f-1} using Equation (1.8):

\text{tex}'_{f-1} = \frac{\lfloor \text{tex}_{f-1} \cdot \text{res} \rfloor + 0.5}{\text{res}}, \qquad (1.8)

where res denotes the resolution of the AO buffer.
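In shader code, this nearest-pixel-center fetch could be a small helper (the function name is ours; aoBufferSize is the AO-buffer resolution):

// Snap reprojected coordinates to the nearest pixel center before fetching the index.
float2 SnapToPixelCenter(float2 texPrev, float2 aoBufferSize)
{
    return (floor(texPrev * aoBufferSize) + 0.5f) / aoBufferSize;
}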
\left| 1 - \frac{d_f}{d_{f-1}} \right| < \epsilon. \qquad (1.9)
Equation (1.9) gives stable results for large scenes with a wide depth range,
which are not oversensitive at the near plane and are sufficiently sensitive when
approaching the far-plane regions. In case of a disocclusion, we always discard
the previous solution by resetting wf −1 to 0 and we compute a completely new
AO solution.
Figure 1.3. The distance of p to sample point s_2 in the current frame (right) differs
significantly from the distance of p_{f-1} to s_{2,f-1} in the previous frame (left), so we
assume that a local change of geometry occurred, which affects the shading of p.
Consider, for example, a scenario wherein a box is lifted from the floor. The SSAO values of
pixels in the contact-shadow area surrounding the box change, even if there is no
disocclusion of the pixel itself.
The size of the neighborhood to be checked is equivalent to the size of the
sampling kernel used for SSAO. Checking the complete neighborhood of a pixel
would be prohibitively expensive, and therefore we use sampling. Actually, it
turns out that we already have a set of samples, namely the ones used for AO
generation. That means that we effectively use our AO sampling kernel for two
purposes: for computing the current contribution Cf (p), and to test for validity.
Our invalidation scheme is visualized in Figure 1.3. The validity of a sample
s_i for shading a pixel p can be estimated by computing the change in the relative
positions of sample and pixel:

\delta(s_i) = \Big|\, |s_i - p| - |s_{i,f-1} - p_{f-1}| \,\Big|. \qquad (1.10)
The reprojected position s_{i,f-1} is computed from the offset vector stored for
s_i (recall that the first rendering pass stores the offset vectors for all pixels in
the frame buffer for later access by the SSAO-shading pass). Note that, for the
neighborhood test, we use only those samples that lie in front of the tangent
plane of p, since only those samples actually modify the shadow term.
Theoretically, we could also check whether the angle between the surface normal and
the vector to the sample point has changed by a significant amount from one frame
to the next, and practical cases are imaginable in which comparing the vector
lengths alone is not sufficient. However, this would require more information to
be stored (the surface normal of every pixel in the previous frame), and in all our
tests we found it sufficient to evaluate Equation (1.10).
Note that in order to avoid one costly texture lookup when fetching pf , the
required values for this test and for the AO computation should be stored in a
single render target.
Figure 1.4. (a) Confidence function depending on the distance difference δ for smooth-
ing factor S = 5, 15, 30, 50. (b) Visualization of the confidence values computed by
our smooth invalidation technique, showing a rotation (left), a translation (middle),
and an animated (walking) character (right). We use a continuous scale from red
(confidence=0) to white (confidence=1).
The parameter S controls the smoothness of the invalidation, and is set to a value
(15 ≤ S ≤ 30) in our current implementation. As can be seen in Figure 1.4, for
different values of S, the confidence is 1 if the relative distance has not changed
(δ(x) = 0), and approaches 0 for large values of δ(si ). The overall confidence of
the previous AO solution is given by
We multiply it with wt to modify the weight of the old solution in Equation (1.6).
Also, in order to prevent flickering artifacts in regions with large changes, we do
not increase the index into the array of samples if the convergence is smaller than
a threshold (e.g., for conv(p) < 0.5) to reuse the same samples.
Figure 1.5 shows the effect of our novel invalidation scheme on a scene with
a translational movement. Checking only for disocclusions causes the artifacts visible in the left image (no invalidation).
Figure 1.5. Rotating dragon model using different values for smooth invalidation factor
S. (left) S = 0 (i.e., no invalidation), (middle) S = 100, (right) S = 15. Note that no
invalidation causes a wrong shadow (left), while a too high value causes unwanted noise
in the shadow (middle).
Handling of frame-buffer borders. Samples that fall outside of the frame buffer
carry incorrect information that should not be propagated. Hence we check for
each pixel if one or more of the samples have been outside the frame buffer in the
previous frame. In this case, we do not use smooth invalidation, but discard the
previous values completely since they are undefined. In the same spirit, we do
not use samples that fall outside of the frame buffer to compute our confidence
values.
where x denotes the individual filter samples in the screen-space support F of the filter
(e.g., a 9 × 9 pixel region), K(p) is the normalization factor, and g is a spatial-filter
kernel (e.g., a Gaussian). As a pixel becomes more converged, we shrink
the screen-space filter support smoothly, using the shrinking factor s:

s(p) = \frac{\max(c_{\text{adaptive}} - \text{conv}(p),\; 0)}{c_{\text{adaptive}}},
Figure 1.6. Rotating dragon, closeups of the marked region are shown. TSSAO without
filter (middle) and with our filter (right). Note that the filter is applied only in the noisy
regions, while the rest stays crisp.
so that when convergence has reached cadaptive , we turn off spatial filtering com-
pletely. We found the setting of cadaptive to be perceptually uncritical (e.g., a
value of 0.2 leads to unnoticeable transitions).
The influence of the adaptive convergence-aware filter on the quality of the
TSSAO solution is shown in Figure 1.6.
Adaptive sampling. Though spatial filtering can reduce noise, it is more effec-
tive to provide additional input samples in undersampled regions. Or, to put
it differently, once the AO has reached sufficient convergence, we can just reuse
the computed solution, thus using fewer samples in regions that are not under-
sampled. We adapt the number k of new AO samples per frame as a function
of convergence. Note that these AO samples are completely unrelated to the
screen-space samples used for spatial filtering in the previous section, where the
kernel size is adapted instead of changing the number of samples.
It is necessary to generate at least a minimum number of samples for the same
reasons that we clamp wf (p) in Equation (1.6) (i.e., to avoid blurring artifacts
introduced by bilinear filtering). Furthermore, a certain number of samples is re-
quired for detecting invalid pixels due to changing neighborhoods (Section 1.4.2).
In order to introduce a minimum amount of branching, we chose a simple two-
stage scheme, with k1 samples if conv(p) < cspatial and k2 samples otherwise
(refer to Table 1.1 for a list of parameters actually used in our implementation).
This requires a variable number of iterations of the AO loop in the shader.
Since disoccluded regions are often spatially coherent (as can be seen in Fig-
ure 1.4(b)), the dynamic branching operations in the shader are quite efficient on
today’s graphics hardware.
1.4.4 Optimizations
These optimizations of the core algorithm allow for faster frame rates and better
image quality due to greater precision.
Local space. In order to avoid precision errors in large scenes, we store our po-
sition values in a local space that is centered at the current view point. These
values can be easily transformed into world-space values by passing the previous
and the current view point as parameters to the shader.
Algorithm of Ritschel et al. The SSAO method of Ritschel et al. uses a 3D sampling
kernel, and a depth test to query the sample visibility, thus implementing the
contribution function in Equation (1.3). This is in contrast to the original Crytek
implementation [Mittring 07], which does not use the incident angle to weight the
sample. In order to get a linear SSAO falloff, we use a sample distribution that is
linear in the sampling-sphere radius. Note that we use a constant falloff function
D(x) = 1, in this case—the falloff is caused only by the distribution of the sample
points. The differences to the ray-traced AO are mainly caused by the screen-
space discretization of the scene.
Algorithm of Fox and Compton. The algorithm of Fox and Compton samples the
depth buffer around the pixel to be shaded and interprets these samples as small
patches, similar to radiosity. While not physically accurate, it often gives a
pleasing visual result because it preserves more small details when using large
kernels for capturing low-frequency AO. On the downside, this method is prone
to reveal the underlying tessellation. As a remedy, we do not count samples at
grazing angles of the hemisphere (i.e., where the cosine is smaller than a given
ε). We used Equation (1.11) to implement the algorithm:

C(p, s_i) = \frac{\max(\cos(s_i - p,\, n_p),\; 0)}{\max(\epsilon,\; |s_i - p|)}. \qquad (1.11)
The main difference from other SSAO methods is that each sample is con-
structed on a visible surface, and interpreted as a patch, whereas in Equa-
tion (1.11), samples are used to evaluate the visibility function. The denom-
inator represents a linear falloff function D(.), where we also guard against zero
sample distance.
The screen-space sampling radius is defined by projecting a user-specified
world-space sampling radius onto the screen, so that samples always cover roughly
similar regions in world space. When the user adjusts the world-space radius, the
intensity of each sample needs to be scaled accordingly in order to maintain a
consistent brightness.
Sample generation. In order to obtain the samples used for SSAO, we first com-
pute a number of random samples ζ in the range [0 . . . 1] using a Halton sequence.
For the method of Fox and Compton, we use 2D (screen-space) samples uniformly
distributed on a disc of user-specified size. These samples are then projected from
screen space into world space by intersecting the corresponding viewing rays with
the depth buffer and computing their world space position. We generate the
screen-space samples s_i from 2D Halton samples ζ_{x,y} using Equation (1.12):

\alpha = 2\pi\zeta_x, \qquad r = \sqrt{\zeta_y}, \qquad s_i = (r\cos\alpha,\; r\sin\alpha). \qquad (1.12)

For the method of Ritschel et al., the samples are generated from 3D Halton samples ζ_{x,y,z} using Equation (1.13):

\alpha = 2\pi\zeta_x, \qquad r = \zeta_y, \qquad s_i = \left(r\cos\alpha\,\sqrt{1-\zeta_z},\; r\sin\alpha\,\sqrt{1-\zeta_z},\; r\sqrt{\zeta_z}\right). \qquad (1.13)
This formula uses the variables ζx and ζz to compute a point on the unit sphere
(i.e., a random direction), and ζy to set the sample distance r. Note that in order
to use variance-reducing importance sampling (rather than uniform sampling),
this formula creates a distribution proportional to the cosine-weighted solid angle.
Furthermore, Szirmay-Kalos et al. [Szirmay-Kalos et al. 10] have shown that a
uniform distribution of samples along the distance r corresponds to a linear falloff
function D(.) (refer to Equation (1.1)) of the occluder influence with respect to
r. Using this importance sampling scheme, we simply have to count the numbers
of samples that pass the depth test during SSAO shading.
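For illustration, Equation (1.13) maps directly to a small helper like the following (the function name is ours; zeta holds one 3D Halton sample in [0, 1]^3):

// Map a 3D Halton sample to a tangent-space hemisphere sample (Equation (1.13)):
// cosine-weighted direction, linear distribution along the distance r.
float3 GenerateHemisphereSample(float3 zeta)
{
    float alpha = 2.0f * 3.14159265f * zeta.x;
    float r     = zeta.y;
    float s     = sqrt(1.0f - zeta.z);
    return float3(r * cos(alpha) * s,
                  r * sin(alpha) * s,
                  r * sqrt(zeta.z));
}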
Maximum allowed sample distance. If the distance from the pixel center to the
intersection point with the depth buffer is too large, a sample is very likely to
cause wrong occlusion (refer to Figure 1.7(a)). However, introducing a maximum
allowed sample radius and setting it to a value that is too small can cause valid
occlusion to be missed (refer to Figure 1.7(b)), because the corresponding samples
are projected to a location outside the allowed sample radius. We set the maximum
(a) If the allowed sample distance is unrestricted, shadows are cast from disconnected objects
(1). If it is set equal to the sampling radius, some valid samples are not counted, resulting
in overbright SSAO (2). Allowing 2 times the radius (3) is a good trade-off and closest to a
ray-traced solution (REF).
(b) 2D illustration of the issue arising when setting the maximum allowed
sample distance (shown in blue, sampling radius shown in red) too small.
While the samples (shown in black) which are projected to the disconnected
surface are correctly rejected (left), this configuration also rejects valid sam-
ples (right).
Figure 1.7. The effect of the maximum allowed sample distance (from the shaded pixel
to the depth buffer intersection).
allowed sample radius to reject those samples where the distance is more than
two times larger than the sampling radius of the SSAO kernel. This trade-off
largely prevents incorrect shadowing of distant and disconnected surfaces caused
by objects in the foreground, while still accounting for correct occlusion in the
vicinity of the current pixel.
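A sketch of this rejection test (helper names are illustrative):

// Count a sample only if its depth-buffer hit point lies within twice the
// SSAO sampling radius of the shaded pixel position.
float3 hitPos = ReconstructWorldPos(sampleTexCoord);   // hypothetical depth-buffer lookup
if (distance(hitPos, pixelPos) < 2.0f * samplingRadius)
{
    ao += ComputeContribution(pixelPos, hitPos);       // hypothetical contribution term
}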
Frame buffer borders. A problem inherent in SSAO is the handling of samples that
fall outside the frame buffer borders (requested by fragments near the border).
The simplest solution is to settle for reusing the values at the border by using
clamp-to-edge. To avoid artifacts on the edges of the screen due to the missing
depth information, we can optionally compute a slightly larger image than we
finally display on the screen. It is sufficient to extend about 5–10% on each side
of the screen depending on the size of the SSAO kernel and the near plane.
1.6 Results
We implemented the proposed algorithm in OpenGL using the Cg shading lan-
guage. As test scenes, we used two models of different characteristics (shown
in Figure 1.8): (a) the Sibenik cathedral and (b) the Vienna city model. Both
scenes were populated with several dynamic objects. The walk-through sequences
taken for the performance experiments are shown in the accompanying videos.
Note that most SSAO artifacts caused by image noise are more distracting in an-
imated sequences, hence we point interested readers to these videos which can be
downloaded from [Link] For all of our tests
we used an Intel Core 2 processor at 2.66 GHz (using one core) and an NVIDIA
GeForce GTX 280 graphics board. To achieve sufficient accuracy in large-scale
scenes like Vienna, we use 32-bit depth precision. Both the ambient occlusion
buffer and the SSAO texture are 32-bit RGBA render targets.
Generally TSSAO provides finer details and fewer noise artifacts. This can
be seen in Figure 1.1 for a static scene (using the method of Fox and Compton),
Figure 1.8. Used test scenes: (a) Sibenik cathedral (7,013,932 vertices) and (b) Vienna
(21,934,980 vertices) in the streets and (c) from above.
Figure 1.9. Close-up of a distant dragon in Sibenik cathedral at 1600 × 1200 resolution:
SSAO using 32 samples at 12 FPS (left close-up); TSSAO using 8–32 samples per frame
at 26 FPS (right close-up).
where we compare TSSAO to SSAO with a weak and a strong blur filter, which
gives a high or low weight, respectively, to discontinuities. Furthermore, we
compare TSSAO to a reference solution using 480 samples per frame, which was
the highest number of samples our shader could compute in a single frame. Notice
that the TSSAO method is visually very close to the reference solution, to which
it converges after a short time.
Figure 1.9 shows that the method also works for high-resolution images. The
TSSAO algorithm provides good quality even for fine details in the background.
Figure 1.10 shows a capture of a deforming cloak of an animated character. Al-
though deforming objects are difficult to handle with temporal coherence, it can
be seen that TSSAO significantly reduces the surface noise. We used the method
of Fox and Compton for Figure 1.9, and the method of Ritschel et al. for Fig-
ure 1.10.
In terms of visual image-quality, TSSAO performs better than SSAO in all
our tests. It corresponds to at least a 32-sample SSAO solution (since 32 samples
are always used for disocclusions), while the converged state takes up to several
hundred samples into account. Note that a similar quality SSAO solution would
Figure 1.10. Close-up of a deforming cloak: SSAO using 32 samples (middle) and
TSSAO using 8–32 samples (right). Notice that the surface noise (causing severe flick-
ering artifacts when animated) is reduced with TSSAO.
Table 1.2. Average timings for the two walk-through sequences shown in the videos
for 32-bit precision render targets using full and half resolution. We compare standard
SSAO, our method (TSSAO), and deferred shading without SSAO as a baseline. For
SSAO we used 32 samples in all scenes. For TSSAO we used 8 (16)–32 samples in
Vienna (Sibenik cathedral).
Timings. Table 1.2 shows average timings of our walk-throughs, comparing our
method (TSSAO) with SSAO without temporal coherence and with the performance-baseline
method, deferred shading without SSAO. When converged, TSSAO uses 8 samples
in Vienna and 16 in Sibenik cathedral, and 32 samples otherwise, whereas SSAO
always uses 32 samples. In our tests TSSAO was always faster than SSAO, for
both full- and half-resolution SSAO computation. Note that, after convergence
has been reached, TSSAO applies neither spatial filtering nor the random
rotations of the sampling-filter kernel (refer to Section 1.4.4).
Figure 1.11 shows the frame time variations for both walk-throughs. Note
that online occlusion culling [Mattausch et al. 08] is enabled for the large-scale
Vienna model, and thus the frame rate for the baseline deferred shading is quite
high for such a complex model. The frame-rate variations for TSSAO stem from
the fact that the method adaptively generates more samples for recently disoccluded regions.
Figure 1.11. Frame times of the Vienna (left) and the Sibenik cathedral walk-through
(right) at resolution 1024 × 768, using 32-bit precision render targets.
The frame times of TSSAO are similar to those of SSAO for frames
where dynamic objects are large in screen space. For more static parts of the
walk-throughs, TSSAO is significantly faster.
1.8 Conclusions
We have presented a screen-space ambient-occlusion algorithm that utilizes re-
projection and temporal coherence to produce high-quality ambient occlusion for
dynamic scenes. Our algorithm reuses sample information from previous frames
where available, while adaptively generating more samples and applying spatial
filtering only in the regions where insufficient samples have been accumulated.
We have shown an efficient new pixel-validity test for shading algorithms
that access only the affected pixel neighborhood. Using our method, such shading
methods can also benefit from temporal reprojection in dynamic scenes with
animated objects.
Bibliography
[Bavoil et al. 08] Louis Bavoil, Miguel Sainz, and Rouslan Dimitrov. “Image-Space
Horizon-Based Ambient Occlusion.” In ACM SIGGRAPH 2008 Talks, SIGGRAPH
’08, pp. 22:1–22:1. New York: ACM, 2008.
[Cook and Torrance 82] Robert L. Cook and Kenneth E. Torrance. “A Reflectance
Model for Computer Graphics.” ACM Trans. Graph. 1 (1982), 7–24.
[Eisemann and Durand 04] Elmar Eisemann and Frédo Durand. “Flash Photography
Enhancement via Intrinsic Relighting.” ACM Trans. Graph. 23 (2004), 673–678.
[Fox and Compton 08] Megan Fox and Stuart Compton. “Ambient Occlusive Crease
Shading.” Game Developer Magazine, 2008.
[Landis 02] Hayden Landis. “Production-Ready Global Illumination.” In Siggraph
Course Notes, Vol. 16. New York: ACM, 2002.
[Mattausch et al. 08] Oliver Mattausch, Jiří Bittner, and Michael Wimmer. “CHC++:
Coherent Hierarchical Culling Revisited.” Computer Graphics Forum (Proceedings
of Eurographics 2008) 27:2 (2008), 221–230.
[Mittring 07] M. Mittring. “Finding Next Gen - CryEngine 2.” In ACM SIGGRAPH
2007 courses, SIGGRAPH ’07, pp. 97–121. New York: ACM, 2007.
[Nehab et al. 07] Diego Nehab, Pedro V. Sander, Jason Lawrence, Natalya Tatarchuk,
and John R. Isidoro. “Accelerating Real-Time Shading with Reverse Reprojection
Caching.” In Proceedings of the Eurographics Symposium on Graphics Hardware
2007, pp. 25–35. Aire-la-Ville, Switzerland: Eurographics Association, 2007.
[Ritschel et al. 09] Tobias Ritschel, Thorsten Grosch, and Hans-Peter Seidel. “Approx-
imating Dynamic Global Illumination in Image Space.” In Proceedings of the 2009
Symposium on Interactive 3D Graphics and Games, pp. 75–82. New York: ACM,
2009.
[Rosado 07] Gilberto Rosado. “Motion Blur as a Post-Processing Effect.” In GPU
Gems 3, pp. 575–576. Reading, MA: Addison Wesley, 2007.
[Scherzer et al. 07] Daniel Scherzer, Stefan Jeschke, and Michael Wimmer. “Pixel-
Correct Shadow Maps with Temporal Reprojection and Shadow Test Confidence.”
In Proceedings of the Eurographics Symposium on Rendering 2007, pp. 45–50. Aire-
la-ville, Switzerland: Eurographics Association, 2007.
[Smedberg and Wright 09] Niklas Smedberg and Daniel Wright. “Rendering Tech-
niques in Gears of War 2.” In Proceedings of the Game Developers Conference
’09, 2009.
[Szirmay-Kalos et al. 10] Laszlo Szirmay-Kalos, Tamas Umenhoffer, Balazs Toth, Las-
zlo Szecsi, and Mateu Sbert. “Volumetric Ambient Occlusion for Real-Time Ren-
dering and Games.” IEEE Computer Graphics and Applications 30 (2010), 70–79.
[Wang and Hickernell 00] Xiaoqun Wang and Fred J. Hickernell. “Randomized Halton
Sequences.” Mathematical and Computer Modelling 32 (2000), 887–899.
2
III
Level-of-Detail and
Streaming Optimized
Irradiance Normal Mapping
Ralf Habel, Anders Nilsson,
and Michael Wimmer
2.1 Introduction
Light mapping and normal mapping are the most successful shading techniques
used in commercial games and applications today, because they require few re-
sources and result in a significant increase in the quality of the rendered image.
While light mapping stores global, low-frequency illumination at sparsely sampled
points in a scene, normal maps provide local, high-frequency shading variation
at a far higher resolution (see Figure 2.1).
The problem with combining the two methods is that light maps store irradi-
ance information for only one normal direction—the geometric surface normal—
and therefore cannot be evaluated using the normals stored in a normal map. To
overcome this problem, the irradiance (i.e., the incoming radiance integrated over
the hemisphere) has to be precalculated for all possible normal map directions
at every sample point. At runtime, this (sparse) directional irradiance signal can
be reconstructed at the (dense) sampling positions of the normal map through
interpolation. The final irradiance is calculated by evaluating the interpolated
directional irradiance using the normal vector from the normal map.
Because such a (hemispherical) directional irradiance signal is low-frequency
in its directionality, it can be well represented by smooth lower-order basis func-
tions. Several bases have been successfully used in games such as spherical har-
monics in Halo 3 [Chen and Liu 08], or the hemispherical Half-Life 2 basis [Mc-
Taggart 04]. It has been shown [Habel and Wimmer 10] that the hemispheri-
cal H-basis provides the overall best representation compared to all other bases
Figure 2.1. A scene without albedo maps showing the difference between light mapping
(left) and irradiance normal mapping (right).
The (hemispherical) directional irradiance at a surface point \vec{x} is defined as

E(\vec{x}, \vec{n}) = \int_{\Omega^+} L(\vec{x}, \omega) \max(\vec{n} \cdot \omega,\; 0)\, d\omega \qquad (2.1)

and covers all possible surface normals \vec{n} in the upper hemisphere \Omega^+, with
L being the incoming radiance. Because we do not want to keep track of the
orientation of the hemispheres at every point, the hemispheres are defined in
tangent space, (i.e., around the interpolated surface normal and (bi)tangent),
which is also the space where the normal maps are defined.
To determine the incoming directional irradiance, we need to calculate the ra-
diance L(~x, ω) in a precomputation step similar to traditional light mapping. Any
method that creates a radiance estimate such as shadow mapping, standard ray
tracing, photon mapping [Jensen 96], final gathering, or path tracing [Kajiya 86]
can be used, and existing renderers or baking software can be applied.
Given the radiance L(~x, ω), calculating E(~x, ~n) (Equation (2.1)) for a point
~x corresponds to filtering L with a diffuse (cut cosine) kernel. Doing this in
Euclidean or spherical coordinates is prohibitively expensive, because we have to
filter a large number of surface points. Instead, we use spherical harmonics as
an intermediate basis in which the filtering can be done much more efficiently.
Spherical harmonics are orthonormal basis functions that can approximate any
spherical function. A comprehensive discussion of spherical harmonics can be
found in [Green 03] and [Sloan 08]. Unfortunately, different definitions exist; to
avoid confusion, we use the definition without the Condon-Shortley phase, as
shown in Appendix A for our calculations.
As shown by [Ramamoorthi and Hanrahan 01], a spherical directional irradi-
ance signal is faithfully represented with three spherical harmonics bands (nine
coefficients per color channel). Therefore, we need only to use spherical harmonics
up to the quadratic band.
First, we rotate the sampled radiance of a surface point into the tangent space
and expand it into spherical harmonics coefficients s_m^l by integrating against the
spherical harmonics basis functions Y_m^l over the upper hemisphere \Omega^+:

s_m^l = \int_{\Omega^+} L(\omega)\, Y_m^l(\omega)\, d\omega.
In almost all cases, the coefficients are calculated using Monte Carlo integration
[Szirmay-Kalos]:

s_m^l \approx \frac{2\pi}{N} \sum_{i=1}^{N} L(\omega_i)\, Y_m^l(\omega_i),

where N is the number of hemispherical, equally distributed radiance samples
L(\omega_i). More advanced methods, such as importance sampling, can be applied,
as long as a radiance estimate represented in spherical harmonics is calculated.
The diffuse convolution reduces to an almost trivial step in this representation.
Following [Ramamoorthi and Hanrahan 01], applying the Funk-Hecke-Theorem,
integrating with a cut cosine kernel corresponds to multiplying the coefficients of
each band with a corresponding factor a_l:

a_0 = 1, \qquad a_1 = \frac{2}{3}, \qquad a_2 = \frac{1}{4},
We have built the necessary division by π for the exitant radiance into the diffuse
kernel so we do not have to perform a division at runtime.
By storing nine coefficients (respectively, 27 in the trichromatic case) at each
surface point, in either textures or vertex colors, we can calculate the final
irradiance at runtime by looking up the normal from the normal map and
evaluating Equation (2.2). However, we are not making
the most efficient use of the coefficients since the functions are evaluated only
on the upper hemisphere Ω+ , and not the full sphere. The created directional
irradiance signals can be better represented in a hemispherical basis.
2.3 H-Basis
The H-basis was introduced in [Habel and Wimmer 10] and forms an orthonormal
hemispherical basis. Compared to all other orthonormal hemispherical bases,
such as [Gautron et al. 04] or [Koenderink et al. 96], the H-basis consist of only
polynomial basis functions up to a quadratic degree and therefore shares many
properties with spherical harmonics. Some of the basis functions are actually
the same basis functions as those used in spherical harmonics, but re-normalized
on the hemisphere, which is why the H-basis can be seen as the counterpart of
spherical harmonics on the hemisphere up to the quadratic band.
The basis is explicitly constructed to carry hemispherical directional irradi-
ance signals and can provide a similar accuracy with only six basis functions
compared to nine needed by spherical harmonics, and a higher accuracy than
any other hemispherical basis [Habel and Wimmer 10]. These basis functions
are:
H^1 = \frac{1}{\sqrt{2\pi}},

H^2 = \sqrt{\frac{3}{2\pi}}\, \sin\phi \sin\theta = \sqrt{\frac{3}{2\pi}}\, y,

H^3 = \sqrt{\frac{3}{2\pi}}\, (2\cos\theta - 1) = \sqrt{\frac{3}{2\pi}}\, (2z - 1),

H^4 = \sqrt{\frac{3}{2\pi}}\, \cos\phi \sin\theta = \sqrt{\frac{3}{2\pi}}\, x,

H^5 = \frac{1}{2}\sqrt{\frac{15}{2\pi}}\, \sin 2\phi \sin^2\theta = \sqrt{\frac{15}{2\pi}}\, xy,

H^6 = \frac{1}{2}\sqrt{\frac{15}{2\pi}}\, \cos 2\phi \sin^2\theta = \frac{1}{2}\sqrt{\frac{15}{2\pi}}\, (x^2 - y^2).

Figure 2.2. Spherical Harmonics basis functions (left) compared to the H-basis functions
(right). Green are positive and red are negative values.
Please note that compared to [Habel and Wimmer 10], the negative signs caused
by the Condon-Shortley phase have been removed in the basis functions H^2 and
H^4 for simplicity and for consistency with the spherical harmonics definitions. A
visual comparison of the spherical harmonics basis functions to the H-basis can
be seen in Figure 2.2.
Figure 2.3. The original basis function H^3 (left), and its replacement H^3_mod (right). In
contrast to H^3, H^3_mod is always 0 for \vec{n}_g = (0, 0, 1).
are required. But this is still less efficient than standard light mapping, which
requires only one coefficient (i.e., texture). Thus, using the H-basis, we have to
load twice the amount of data in order to arrive at the same result we would
achieve with light mapping. The question arises, can we combine both light
mapping and irradiance normal mapping in one framework?
First, we note that while the H-basis is an orthonormal basis, orthonormality
is not required for representing directional irradiance. Thus we can sacrifice this
property for the sake of another advantage. As we have seen, all but two of the
basis functions evaluate to 0 in the direction of \vec{n}_g, but
the optimal case would be to have only one contributing function in this case.
We can achieve this goal by redefining the basis function H^3 as

H^3_{\text{mod}} = \sqrt{\frac{3}{2\pi}}\, (1 - \cos\theta) = \sqrt{\frac{3}{2\pi}}\, (1 - z),
which is a function that is still linearly independent of the other basis functions,
and depends only on z (cos θ in spherical coordinates) like H^3. However, this
function also evaluates to 0 in the direction of the geometric normal \vec{n}_g (see
Figure 2.3). Through this modification, the constant basis function H^1 now
represents the traditional light map, while we still maintain the high accuracy of
the original H-basis when using all coefficients. Therefore we have constructed a
set of basis functions that extends the standard light map approach rather than
replacing it.
We arrive at the coefficient vector for the modified H-basis \vec{h} = (h_1, h_2, \ldots, h_6)
by multiplying the serialized spherical harmonics coefficient vector \vec{s} =
(s_0^0, s_{-1}^1, s_0^1, \ldots, s_2^2) with the matrix T_{H_{\text{mod}}}. This matrix multiplication also
automatically extracts the light map into the coefficient for the constant basis function
h_1.
2.4 Implementation
In the previous sections, we derived all necessary expressions and equations to
precompute the data and evaluate it at runtime. In the following sections, we
will discuss practical issues that occur in an application of irradiance normal
mapping, such as proper generation, representation, and compression of the data.
In general, more sophisticated methods can be applied to compress and distribute
the coefficient maps [Chen and Liu 08, Sloan et al. 03], but this usually requires
the use of proprietary formats that are beyond this discussion.
Since both tangent spaces are aligned, when evaluating the directional irradi-
ance at runtime, we need neither the normals nor the (bi)tangents of any tangent
space but solely the texture coordinate sets to define the texture lookups.
Figure 2.4. A set of coefficient textures. The coefficient h1 is the standard light map.
for each surface point do {
    // Monte Carlo integration in tangent space over
    // the hemisphere to project into SH[] basis functions
    for each radiance sample L(direction) in N do {
        for each SHc do {
            SHc[] += L(direction) * SH[](direction)
        }
    }
    SHc[] = (2*PI/N) * SHc[]

    // Diffuse convolution (band factors a_0 = 1, a_1 = 2/3, a_2 = 1/4)
    SHc[0]         = SHc[0]
    SHc[1,2,3]     = 2.0/3.0 * SHc[1,2,3]
    SHc[4,5,6,7,8] = 1.0/4.0 * SHc[4,5,6,7,8]

    // Projection into modified H-basis
    color modHc[6]   // modified H-basis coefficients

    // Transform matrix
    for each color in modHc[] do {
        modHc[0] =  0.70711*SHc[0] + 1.2247*SHc[2] + 1.1859*SHc[6]
        modHc[1] =  0.70711*SHc[1] + 0.59293*SHc[5]
        modHc[2] = -0.70711*SHc[2] - 1.3693*SHc[6]
        modHc[3] =  0.70711*SHc[3] + 0.59293*SHc[7]
        modHc[4] =  0.70711*SHc[4]
        modHc[5] =  0.70711*SHc[8]
    }

    // Convert first coefficient to sRGB
    modHc[0] = pow(modHc[0], 1/2.2)

    write modHc[]
}
Listing 2.1. Pseudo-code for calculating the coefficients for the modified H-basis includ-
ing the texture optimizations.
Figure 2.5. Detail of a game scene without albedo texturing using four coefficients (left)
and for more accuracy six coefficients (right). The differences can be marginal, though
six coefficients show more detail where the normal map normal is perpendicular to the
geometric normal.
consisting of the constant and the linear basis functions, can be used. This
LOD uses the normal map, so the appearance of a material compared with the
lowest LOD may cause popping artifacts. A small distance-based region where
the middle LOD is blended with the lowest can suppress those artifacts, as can
be seen in the demo.
As highest LOD, a full evaluation using all six coefficient maps can be ap-
plied, though the difference between using four or six coefficients can be marginal,
and four coefficients may already deliver a perceptually accurate result (see Fig-
ure 2.5). Depending on the available resources, the highest LOD can therefore
be skipped if the quality of the middle LOD is sufficient. The reason for this
behavior is that the quadratic basis functions H^5_mod and H^6_mod mostly contribute
if the normal map normal is perpendicular to the geometric normal.
All levels of detail are cumulative, so if textures are streamed, the next higher
LOD uses the same textures as the lower one. Also, due to the linearity of the
basis functions, the coefficient textures can be used simultaneously even if they
are available at different resolutions, switching to an appropriate LOD as soon as
some mip-level of a coefficient texture is available. The HLSL code for evaluat-
ing the modified H-basis is given in Listing 2.2, including range decompression,
gamma-correct color-space lookups, and levels of detail.
// n is the tangent-space normal fetched from the normal map (fetch assumed above).
float3 irr =
      0.39894 * tex2D(h1, lightUV)                                  // is sRGB lookup (x^2.2)
    // stop here for lowest LOD (light map)
    + (2 * 0.75 * tex2D(h2, lightUV) - 0.75) * 0.69099 * n.y        // not sRGB lookup
    + (2 * 0.75 * tex2D(h3, lightUV) - 0.75) * 0.69099 * (1 - n.z)
    + (2 * 0.75 * tex2D(h4, lightUV) - 0.75) * 0.69099 * n.x
    // stop here for middle LOD
    + (2 * 0.75 * tex2D(h5, lightUV) - 0.75) * 1.54509 * n.x * n.y
    + (2 * 0.75 * tex2D(h6, lightUV) - 0.75) * 0.77255 * (n.x * n.x - n.y * n.y);
    // full evaluation

// write color to sRGB frame buffer (x^(1/2.2))
Listing 2.2. HLSL code for evaluating the modified H-basis, including a modulation
with an albedo map. The different levels of detail are created by stopping the irradiance
calculation at the shown points.
2.5 Results
We have implemented the described approach in the graphics engine OGRE 1.6.5
[OGRE 10]. The accompanying web materials contain the binaries as well as the
full source code of the demo and the Turtle script to bake out coefficient maps
with the described optimizations. All levels of detail and texture formats can
be directly compared and viewed in both low-dynamic as well as high-dynamic
range rendering pipelines with or without albedo textures (see Figure 2.6).
This can act as a reference implementation since any game engine or applica-
tion that supports light mapping and shaders can be easily modified to support
irradiance normal mapping. Besides the precomputation, only several additional
textures need to be exposed to the shader compared to a single texture when
using light mapping. The shader calculations consist only of a few multiply-adds
for range decompression of the textures and to add up the contributions of the
basis functions. Both the data as well as the evaluation are lightweight and sim-
ple, and are therefore also applicable to devices and platforms that have only a
limited set of resources and calculation power.
2.6 Conclusion
We have derived a modification of the H-basis that allows formulating irradiance
normal mapping as an extension of light mapping rather than as a replacement
by containing the light map of the basis function coefficients. We discussed the
efficient calculation and representation of directional irradiance signals in the
modified H-basis using spherical harmonics as an intermediate representation for
efficient filtering. A description of the accompanying implementation was given,
showing the different levels of detail and optimizations for 8-bit textures, such as
optimal color spaces and range compression.
Figure 2.6. A full scene without (top) and with (bottom) albedo mapping.
Bibliography
[Autodesk 10] Autodesk. “Maya. Maya is a registered trademark or trademark of Au-
todesk, Inc. in the USA and other countries.” Available at [Link]
com, 2010.
[Chen and Liu 08] Hao Chen and Xinguo Liu. “Lighting and Material of Halo 3.” In
SIGGRAPH ’08: ACM SIGGRAPH 2008 Classes, pp. 1–22. New York: ACM,
2008.
[Gautron et al. 04] Pascal Gautron, Jaroslav Krivánek, Sumanta N. Pattanaik, and
Kadi Bouatouch. “A Novel Hemispherical Basis for Accurate and Efficient Ren-
dering.” In Rendering Techniques, pp. 321–330. Aire-la-Ville, Switzerland: Euro-
graphics Association, 2004.
[Green 03] Robin Green. “Spherical Harmonic Lighting: The Gritty Details.” Available
at [Link] 2003.
[Habel and Wimmer 10] Ralf Habel and Michael Wimmer. “Efficient Irradiance Normal
Mapping.” In I3D ’10: Proceedings of the 2010 ACM SIGGRAPH Symposium on
Interactive 3D Graphics and Games, pp. 189–195. New York: ACM, 2010.
[Illuminate Labs 10] Illuminate Labs. “Turtle for Maya.” Available at [Link]
[Link], 2010.
[Jensen 96] Henrik Wann Jensen. “Global Illumination using Photon Maps.” In Pro-
ceedings of the Eurographics Workshop on Rendering Techniques ’96, pp. 21–30.
London: Springer-Verlag, 1996.
[Kajiya 86] James T. Kajiya. “The Rendering Equation.” In SIGGRAPH ’86: Pro-
ceedings of the 13th Annual Conference on Computer Graphics and Interactive
Techniques, pp. 143–150. New York: ACM, 1986.
[Koenderink et al. 96] Jan J. Koenderink, Andrea J. van Doorn, and Marigo Stavridi.
“Bidirectional Reflection Distribution Function Expressed in Terms of Surface Scat-
tering Modes.” In ECCV ’96: Proceedings of the 4th European Conference on
Computer Vision-Volume II, pp. 28–39. London, UK: Springer-Verlag, 1996.
[McTaggart 04] G. McTaggart. “Half-Life 2/Valve Source Shading.” Technical report,
Valve Corporation, 2004.
[OGRE 10] OGRE. “OGRE Graphics Engine.” Available at [Link]
2010.
[Ramamoorthi and Hanrahan 01] Ravi Ramamoorthi and Pat Hanrahan. “An Efficient
Representation for Irradiance Environment Maps.” In SIGGRAPH ’01: Proceed-
ings of the 28th Annual Conference on Computer Graphics and Interactive Tech-
niques, pp. 497–500. New York: ACM, 2001.
[Sloan et al. 03] Peter-Pike Sloan, Jesse Hall, John Hart, and John Snyder. “Clustered
Principal Components for Precomputed Radiance Transfer.” ACM Trans. Graph.
22:3 (2003), 382–391.
[Sloan 08] Peter-Pike Sloan. “Stupid Spherical Harmonics (SH) Tricks.” Available at
[Link] 2008.
[Szirmay-Kalos ] Laszlo Szirmay-Kalos. Monte Carlo Methods in Global Illumination
- Photo-Realistic Rendering with Randomization. Saarbrücken, Germany: VDM
Verlag Dr. Mueller e.K.
3
III
Real-Time One-Bounce Indirect
Illumination and Indirect Shadows
Using Ray Tracing
Holger Gruen
3.1 Overview
This chapter presents an easily implemented technique for real-time, one-bounce
indirect illumination with support for indirect shadows. Determining if dynamic
scene elements occlude some indirect light and thus cast indirect shadows is a
hard problem to solve. It amounts to being able to answer many point-to-point
or region-to-region visibility queries in real time. The method described in this
chapter separates the computation of the full one-bounce indirect illumination
solution into three phases. The first phase is based on reflective shadow maps
(RSM) [Dachsbacher and Stamminger 05] and is fully Direct3D 9 compliant. It
generates the one-bounce indirect lighting from a kernel of RSM texels without
considering blockers of indirect light. The second phase requires Direct3D 11–
capable hardware and dynamically creates a three-dimensional grid that contains
lists of triangles of the geometry that should act as blockers of indirect light. The
third phase traverses the 3D grid built in phase 2, tracing rays to calculate an
approximation of the indirect light from RSM texels that are blocked by geometry.
Finally, the result of the third phase is subtracted from the result of the first phase
to produce the full indirect illumination approximation.
3.2 Introduction
Real-time indirect illumination techniques for fully dynamic scenes are an active
research topic. There are a number of publications (e.g., [Ritschel et al. 09a,
Wyman and Nichols 09, Kaplanyan 09, Dachsbacher and Stamminger 06, Dachs-
bacher and Stamminger 05]) that describe methods for indirect one-bounce
illumination for fully dynamic scenes, but they do not account for indirect shad-
ows. Only a handful of methods for indirect illumination have been described
to date that also include support for indirect shadows in the context of fully
dynamic scenes and interactive frame rates (e.g., [Ritschel et al. 08, Ritschel
et al. 09a, Ritschel et al. 09b, Kaplanyan and Dachsbacher 10, Yang et al. 09,
Thibieroz and Gruen 10]).
Direct3D 11-capable GPUs allow the concurrent construction of linked lists
using scattering writes and atomic operations (see [Yang et al. 09]). This ca-
pability is used as the basic building block for the solution to real-time indirect
shadowing described in this chapter. Linked lists open the door for a new class
of real-time algorithms to compute indirect shadows for fully dynamic scenes
using ray-triangle intersections. The basic idea behind these techniques is to
dynamically build data structures on the GPU that contain lists of triangles
that represent low level-of-detail (LOD) versions of potential blockers of indi-
rect light. Most game engines already rely on having low LOD versions of
game objects for rendering or simulation purposes. These low LOD objects
can readily be used as the approximate blockers of indirect light, as long as
the LOD is good enough to capture the full topology of objects for proper self-
shadowing.
The data structures containing lists of triangles are traversed using ray trac-
ing to detect if some amount of the indirect light is blocked. Although this
approach could probably be used to implement ray tracing of dynamic scenes
in general, the following discussion considers only the application of linked lists
in the context of the computation of indirect shadows and for low LOD-blocker
geometry.
[Thibieroz and Gruen 10] discuss some of the implementation details of
a proof-of-concept application for the indirect shadowing technique presented
in [Yang et al. 09]. However, the scene used to test the indirect illumination
solver did not contain any dynamic objects. Tests with more complicated dy-
namic scenes and rapidly changing lighting conditions revealed flickering artifacts
that are not acceptable for high-quality interactive applications. The algorithms
presented below address these issues and are able to deliver real-time frame rates
for more complicated dynamic scenes that include moving objects and changing
lighting conditions.
As described in the overview, the techniques explained below separate the
process of computing indirect illumination into three phases. The reason for
specifically separating the computation of indirect light and blocked indirect light
is that it makes it easier to generate the blocked indirect light at a different fidelity
or even at a different resolution than the indirect light. Furthermore, game
developers will find it easier to add just the indirect light part of the technique
if they can’t rely on Direct3D 11-capable hardware for indirect shadowing. The
three phases for computing the full one-bounce illumination approximation are
explained in detail in the following sections.
similar in spirit to [Segovia et al. 09] but does not actually split the G-buffer into
sub-buffers for better memory coherence.
The shader in Listing 3.1 demonstrates how this can be implemented for a
pattern that uses only one out of every 6 × 6 VPLs.
// This function evaluates the weighting factor of a VPL.
float evaluateVPLWeightingFac( RSM_data d,       // data for VPL
                               float3   f3CPos,  // pos of pix
                               float3   f3CN )   // normal of pix
{
   // compute indirect light contribution weight
   float3 f3D     = d.f3Pos.xyz - f3CPos.xyz;
   float  fLen    = length( f3D );
   float  fInvLen = rcp( fLen );
   float  fDot1   = dot( f3CN, f3D );
   float  fDot2   = dot( d.f3N, -f3D );
   // form-factor-style weight: cosines at receiver and VPL over the
   // squared distance (assumed completion; the remainder of this
   // function is not reproduced in the listing)
   return saturate( fDot1 * fInvLen ) *
          saturate( fDot2 * fInvLen ) *
          ( fInvLen * fInvLen );
}

// This function computes the indirect light from a kernel of
// VPLs inside the RSM. A repetitive screen-space pattern is used
// to do interleaved shading to reduce the number of samples to
// look at.
float3 computeIndirectLight(
   float2 tc,       // RSM coords of g-buffer pix
   int2   i2Off,    // kernel start offset
   float3 f3CPos,   // pos of pix
   float3 f3CN )    // normal of pix
{
   float3 f3IL = ( 0.0f ).xxx;
   // loop over VPL kernel
   for( float row = -LFS; row <= LFS; row += 6.0f )
   {
      for( float col = -LFS; col <= LFS; col += 6.0f )
      {
         // unpack RSM g-buffer data for VPL
         RSM_data d = LoadRSMData( tc, i2Off, row, col );
         // accumulate weighted indirect light
         f3IL += d.f3Col *
                 evaluateVPLWeightingFac( d, f3CPos, f3CN ) *
                 d.PixelArea;
      }
   }
   return f3IL;
}

// Indirect light is computed for a half width/half height image.
float4 PS_RenderIndirectLight( PS_SIMPLE_INPUT I ) : SV_TARGET
{
   // compute screen pos for an RT that is 2*w, 2*h
   int3 tc = int3( int2( I.vPos.xy ) << 1, 0 );
   // start offset of the VPL kernel repeats every 6x6 pixels
   int2 i2Off = int2( I.vPos.xy ) % ( 0x5 ).xx;
   // load g-buffer data at the current pixel
   GBuf_data d = LoadGBufData( tc );
   // compute indirect light
   float3 f3IL = computeIndirectLight( tc, i2Off,
                                       d.f3CPos, d.f3CN );
   return float4( f3IL, 0.0f );
}
Note that the shader assumes that the indirect light is computed at a resolu-
tion that is half as high and half as wide as the screen resolution.
As the dithered result from Listing 3.1 is not smooth, a bilateral blurring step
(see, e.g., [Tomasi and Manduchi 98]) is performed, and then the image is upsampled
to the full-screen resolution using bilateral upsampling [Sloan et al. 09]. Both
bilateral filter operations use the differences in normal and light-space depth
between the central G-buffer pixel and the samples in the filter footprint.
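As an illustration, the following is a minimal sketch of a bilateral weight built from these two differences. The falloff constants g_fDepthSharpness and g_fNormalPower are assumptions and would be tuned per application; the demo's exact weighting may differ.

// possible bilateral weight for the blur and upsampling passes
static const float g_fDepthSharpness = 100.0f;  // assumed tuning constant
static const float g_fNormalPower    = 8.0f;    // assumed tuning constant

float bilateralWeight( float3 f3CenterN, float fCenterDepth,
                       float3 f3SampleN, float fSampleDepth )
{
   // penalize differences in light-space depth
   float fDepthDiff = fCenterDepth - fSampleDepth;
   float fDepthW    = exp( -g_fDepthSharpness * fDepthDiff * fDepthDiff );
   // penalize differences in surface orientation
   float fNormalW   = pow( saturate( dot( f3CenterN, f3SampleN ) ),
                           g_fNormalPower );
   return fDepthW * fNormalW;
}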
Figures 3.3 and 3.4 show screenshots of a demo implementing the algorithm
described above. The demo uses a 512 × 512 RSM and a dithered 81 × 81 kernel
of VPLs. The frame rate of the demo is usually above 250 frames per second
on an AMD HD5970 at 1280 × 800, which shows that the technique is fast
enough to be used in interactive applications and computer games.
//
// Add triangles defined by an index and a vertex buffer into
// the linked list of each 3D grid cell they touch.
//
// Note: This is a simplified 3D rasterization loop as it
// touches all grid cells that are touched by the bounding box
// of the triangle and adds the triangle to the list of all
// these grid cells.
//
[numthreads( GROUPSIZE, 1, 1 )]
void CS_AddTrisToGrid( uint3 Gid  : SV_GroupID,
                       uint3 GTid : SV_GroupThreadID,
                       uint  GI   : SV_GroupIndex )
{
   // compute offset into index buffer from the thread and
   // group ids
   uint uOffset = GROUPSIZE * Gid.y + GTid.x;

   uint3 indices;
   LinkedTriangle t;
   uint3 start, stop;

   // only process valid indices
   if( uOffset < g_IndexCount ) {
      // add start index to the offset
      uOffset += g_StartIndexLocation;

      // ... fetch three indices for the triangle into 'indices' ...

      // add base vertex location
      indices += g_BaseVertexLocation.xxx;

      // compute offset for vertices into the vertex buffer
      uint3 voffset = indices * g_VertexStride.xxx + g_VertexStart.xxx;

      // load vertex data of triangle -- prepare triangle
      float3 v0 = g_bufVertices.Load( voffset.x ).xyz;
      float3 v1 = g_bufVertices.Load( voffset.y ).xyz;
      float3 v2 = g_bufVertices.Load( voffset.z ).xyz;

      // now call, e.g., skinning code for the vertices
      // if the vertices belong to a skinned object

      t.v0    = v0;
      t.edge1 = v1 - t.v0;
      t.edge2 = v2 - t.v0;

      // ... compute start/stop: the range of grid cells covered by the
      //     bounding box of the triangle ...

      // iterate over cells
      for( uint zi = start.z; zi <= stop.z; ++zi ) {
         for( uint yi = start.y; yi <= stop.y; ++yi ) {
            for( uint xi = start.x; xi <= stop.x; ++xi ) {
               // alloc new offset
               uint newOffset = LinkedTriGridBuffer.IncrementCounter();
               uint oldOffset;
               // update grid offset buffer
               StartOffsetBuffer.InterlockedExchange(
                  4 * ( xi + yi * CELLS_XYZ +
                        zi * CELLS_XYZ * CELLS_XYZ ),
                  newOffset, oldOffset );
               // store old offset
               t.prev = oldOffset;
               // add triangle to the grid
               LinkedTriGridBuffer[newOffset] = t;
} } } } }
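The start and stop cell indices used in the loop above come from mapping the triangle's bounding box into the grid. A minimal sketch of such a mapping is given below; the constants g_f3GridMin and g_f3GridCellSize describing the grid placement are hypothetical and not part of the original listing.

// hypothetical grid placement constants
static const float3 g_f3GridMin      = float3( -16.0f, -16.0f, -16.0f );
static const float3 g_f3GridCellSize = float3( 1.0f, 1.0f, 1.0f );

uint3 worldToCell( float3 f3WorldPos )
{
   float3 f3Cell = ( f3WorldPos - g_f3GridMin ) / g_f3GridCellSize;
   return (uint3)clamp( f3Cell, 0.0f, (float)( CELLS_XYZ - 1 ) );
}

// the cell range of a triangle's bounding box then becomes
//   start = min( worldToCell( v0 ), min( worldToCell( v1 ), worldToCell( v2 ) ) );
//   stop  = max( worldToCell( v0 ), max( worldToCell( v1 ), worldToCell( v2 ) ) );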
//
// This function walks the 3D grid to check for intersections of
// triangles and the given edge.
//
float traceRayLinkedTris( float3 f3OrgP, float3 f3D )
{
   float  fIntersection = ( 0.0f ), fI, fLen;
   float3 f3Inc, f3P;
   // set up the march along the ray through the grid cells
   setupRay( fLen, f3P, f3Inc );
   // do the march
   for( fI = 0.0f;
        fI <= fLen;
        fI += 1.0f, f3P += f3Inc )
   {
      // check_for_intersection walks through the list
      // of tris in the current grid cell and computes
      // ray-triangle intersections
      if( check_for_intersection( int3( f3P ), f3P,
                                  f3OrgP, f3D ) != 0.0f )
      {
         fIntersection = 1.0f;
         break;
      }
   }
   return fIntersection;
}

float3 computeBlockedIndirectLight( float2 tc, float2 fc,
                                    int2 i2Off, float3 f3CPos,
                                    float3 f3CN )
{
   float3 f3IL = ( 0.0f ).xxx;
   // loop over VPL kernel
   for( float row = -SFS; row <= SFS; row += 6.0f ) {
      for( float col = -SFS; col <= SFS; col += 6.0f ) {
         // unpack RSM g-buffer data for VPL
         // (adr: RSM texel address of the current VPL)
         RSM_data d = LoadRSMData( adr );
         // compute weighting factor for VPL
         float f = evaluateVPLWeightingFac( d, f3CPos, f3CN );
         if( f > 0.0f ) {
            // f3D: direction from the pixel towards the VPL
            float3 f3D = d.f3Pos.xyz - f3CPos.xyz;
            f3IL += traceRayLinkedTris( f3CPos.xyz, f3D ) *
                    d.f3Col * f;
         }
      }
   }
   // amplify the accumulated blocked indirect light a bit
   // to make indirect shadows more prominent
   return 16.0f * f3IL;
}

// Renders the accumulated color of blocked light using a 3D
// grid of triangle lists to detect occluded VPLs.
float4
PS_RenderBlockedIndirectLightLinkedTris( PS_SIMPLE_INPUT I ) :
   SV_TARGET
{
   int3 tc    = int3( int2( I.vPos.xy ) << 1, 0 );
   int2 i2Off = int2( I.vPos.xy ) % ( 0x5 ).xx;
   GBuf_data d = LoadGBufData( tc );
   // phase-1 indirect light for this pixel (g_texIndirectLight is an
   // assumed name for the phase-1 render target)
   float3 f3IL = g_texIndirectLight.Load( tc ).xyz;
   float3 f3IS = ( 0.0f ).xxx;
   // is any indirect light (phase 1) reaching this pixel?
   [branch] if( dot( f3IL, f3IL ) > 0.0f )
      f3IS = computeBlockedIndirectLight( tc, i2Off,
                                          d.f3CPos, d.f3CN );
   return float4( f3IS, 0.0f );
}
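The listing only references check_for_intersection. Below is a possible sketch of it, assuming the per-cell linked-list layout built by CS_AddTrisToGrid, a Möller–Trumbore-style ray/triangle test [Möller and Trumbore 97], an END_OF_LIST sentinel, read-only buffer bindings, and that f3D spans the full segment from the pixel to the VPL; all of these details beyond the listings above are assumptions.

struct LinkedTriangle { float3 v0; float3 edge1; float3 edge2; uint prev; };

ByteAddressBuffer                StartOffsetBufferRO;   // read-only view of the cell heads
StructuredBuffer<LinkedTriangle> LinkedTriGridBufferRO; // read-only view of the triangles
#define END_OF_LIST 0xFFFFFFFF

float check_for_intersection( int3 i3Cell, float3 f3P,
                              float3 f3OrgP, float3 f3D )
{
   // head of the linked list for the current grid cell
   uint uOffset = StartOffsetBufferRO.Load(
      4 * ( i3Cell.x + i3Cell.y * CELLS_XYZ +
            i3Cell.z * CELLS_XYZ * CELLS_XYZ ) );
   [loop] while( uOffset != END_OF_LIST )
   {
      LinkedTriangle t = LinkedTriGridBufferRO[ uOffset ];
      // Moeller-Trumbore style ray/triangle test against v0, edge1, edge2
      float3 p   = cross( f3D, t.edge2 );
      float  det = dot( t.edge1, p );
      if( abs( det ) > 1e-6f )
      {
         float  fInvDet = rcp( det );
         float3 f3T  = f3OrgP - t.v0;
         float  u    = dot( f3T, p ) * fInvDet;
         float3 q    = cross( f3T, t.edge1 );
         float  v    = dot( f3D, q ) * fInvDet;
         float  fHit = dot( t.edge2, q ) * fInvDet;
         // a hit with 0 < fHit < 1 lies between the pixel and the VPL,
         // so the VPL is occluded
         if( u >= 0.0f && v >= 0.0f && u + v <= 1.0f &&
             fHit > 0.0f && fHit < 1.0f )
            return 1.0f;
      }
      uOffset = t.prev;  // follow the per-cell linked list
   }
   return 0.0f;
}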
Figure 3.6. Demo scene with indirect illumination and indirect shadows.
Again, the dithered, blocked indirect light is blurred and upsampled using
a bilateral filter. After that, the blocked indirect light is subtracted from the
indirect light, and the result is clamped to make sure that indirect illumination
doesn’t become negative. This generates the full indirect illumination approxi-
mation with indirect shadowing. Finally, indirect illumination is combined with
direct illumination and shadowing to produce the final image as shown in Fig-
ure 3.6.
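A minimal sketch of this subtract-and-clamp combination; all names are placeholders.

float3 combineIndirectLight( float3 f3Direct, float3 f3Albedo,
                             float3 f3IndirectLight,     // result of phase 1
                             float3 f3BlockedIndirect )  // result of phase 3
{
   // subtract the blocked indirect light and clamp so that indirect
   // illumination never becomes negative
   float3 f3Indirect = max( f3IndirectLight - f3BlockedIndirect,
                            ( 0.0f ).xxx );
   // combine with direct illumination and shadowing
   return f3Direct + f3Albedo * f3Indirect;
}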
The performance for rendering the full one-bounce indirect illumination, including
indirect shadows with nine rays traced per pixel, is 70–110 fps for a
32 × 32 × 32 grid and a resolution of 1280 × 800 on an AMD HD5970. The number
of blocker triangles that get inserted into the 3D grid every frame is on the
order of 6,000.
Bibliography
[Dachsbacher and Stamminger 05] Carsten Dachsbacher and Marc Stamminger. “Reflective Shadow Maps.” In Proceedings of the 2005 Symposium on Interactive 3D Graphics and Games, I3D ’05, pp. 203–231. New York: ACM, 2005.
[Dachsbacher and Stamminger 06] Carsten Dachsbacher and Marc Stamminger. “Splatting Indirect Illumination.” In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, I3D ’06, pp. 93–100. New York: ACM, 2006.
[Eisemann and Décoret 06] Elmar Eisemann and Xavier Décoret. “Fast Scene Voxelization and Applications.” In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, I3D ’06, pp. 71–78. New York: ACM, 2006.
[Kaplanyan 09] Anton Kaplanyan. “Light Propagation Volumes in CryEngine 3.” In Advances in Real-Time Rendering in 3D Graphics and Games Course - SIGGRAPH 2009. Available online at [Link]presentations/light-propagation-volumes-in-cryengine-3, 2009.
[Kaplanyan and Dachsbacher 10] Anton Kaplanyan and Carsten Dachsbacher. “Cascaded Light Propagation Volumes for Real-Time Indirect Illumination.” In Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D ’10, pp. 99–107. New York: ACM, 2010.
[Möller and Trumbore 97] Thomas Möller and Ben Trumbore. “Fast, Minimum Storage Ray/Triangle Intersection.” Journal of Graphics Tools 2:1 (1997), 21–28.
[Ritschel et al. 08] T. Ritschel, T. Grosch, M. H. Kim, H.-P. Seidel, C. Dachsbacher, and J. Kautz. “Imperfect Shadow Maps for Efficient Computation of Indirect Illumination.” ACM Trans. Graph. 27:5 (2008), 129:1–129:8.
[Ritschel et al. 09a] T. Ritschel, T. Grosch, and H.-P. Seidel. “Approximating Dynamic Global Illumination in Image Space.” In I3D ’09: Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, pp. 75–82. New York: ACM, 2009.
[Ritschel et al. 09b] T. Ritschel, T. Engelhardt, T. Grosch, H.-P. Seidel, J. Kautz, and C. Dachsbacher. “Micro-Rendering for Scalable, Parallel Final Gathering.” ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2009) 28:5 (2009), 132:1–132:8.
[Sloan et al. 09] P.-P. Sloan, N. Govindaraju, D. Nowrouzezahrai, and J. Snyder. “Image-Based Proxy Accumulation for Real-Time Soft Global Illumination.” In Proceedings of the 15th Pacific Conference on Computer Graphics, pp. 97–105. Washington, DC: IEEE Computer Society, 2009.
[Segovia et al. 09] B. Segovia, J. C. Iehl, R. Mitanchey, and B. Péroche. “Non-Interleaved Deferred Shading of Interleaved Sample Patterns.” In Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pp. 53–60. Aire-la-Ville, Switzerland: Eurographics Association, 2009.
[Tomasi and Manduchi 98] C. Tomasi and R. Manduchi. “Bilateral Filtering for Gray and Color Images.” In Proceedings of the Sixth International Conference on Computer Vision, pp. 839–846. Washington, DC: IEEE Computer Society, 1998.
[Thibieroz and Gruen 10] N. Thibieroz and H. Gruen. “OIT and Indirect Illumination Using DX11 Linked Lists.” In Proceedings of Game Developers Conference, 2010.
[Wald et al. 09] I. Wald, T. Ize, A. Kensler, A. Knoll, and S. G. Parker. “Ray Tracing Animated Scenes Using Coherent Grid Traversal.” ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 25:3 (2006), 485–493.
[Wyman and Nichols 09] C. Wyman and G. Nichols. “Multiresolution Splatting for Indirect Illumination.” In I3D ’09: Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, pp. 83–90. New York: ACM, 2009.
[Yang et al. 09] J. Yang, J. Hensley, H. Gruen, and N. Thibieroz. “Dynamic Construction of Concurrent Linked-Lists for Real-Time Rendering.” Computer Graphics Forum (Eurographics Symposium on Rendering 2010) 29:4 (2010), 1297–1304.
4
III
Real-Time Approximation of
Light Transport in Translucent
Homogenous Media
Colin Barré-Brisebois and Marc Bouchard
4.1 Introduction
When reproducing visual elements found in nature, it is crucial to have a mathe-
matical theory that models real-world light transport. In real-time graphics, the
interaction of light and matter is often reduced to local reflection described by
bidirectional reflectance distribution functions (BRDFs), for example, describing
reflectance at the surface of opaque objects [Kurt 09]. In nature, however, many
objects are (partly) translucent: light transport also happens within the surface
(as shown in Figure 4.1).
To simulate light transport inside objects in real time, developers rely on
various complex shading models and algorithms, (e.g., to replicate the intricate
subsurface scattering found in human skin [d’Eon 09,Hable 09].) Conversely, this
article presents a fast real-time approximation of light transport in translucent
homogeneous media, which can be easily implemented on programmable graphics
hardware (PC, as well as video game consoles). In addition, the technique scales
well on fixed and semi-fixed pipeline systems, it is artist friendly, and it provides
results that fool the eyes of most users of real-time graphics products. This
technique’s simplicity also permits fast iteration, a key criterion for achieving
visual success and quality in the competitive video game industry. We discuss
the developmental challenge, its origins, and how it was resolved through an
initial and basic implementation. We then present several scalable variations, or
improvements to the original technique, all of which enhance the final result (see
Figure 4.2).
the shape was influenced by the varying thickness of that same shape. In mathe-
matical terms, this means that the amount of light that penetrates the surface of
the shape (but also the amount of light exiting at the opposite hemisphere to the
BRDF) can be defined with a bidirectional transmittance distribution function
(BTDF). Our current method attempts to phenomenologically replicate inner-
surface diffusion, in which light traveling inside the object is scattered based on
material properties. This phenomenon can be described using a bidirectional
surface scattering reflectance distribution function (BSSRDF) [Jensen 01]. With this
technique, we wanted to approximate the BSSRDF using minimal available GPU
resources. After careful analysis, we discovered that using distance-attenuated
regular diffuse lighting, combined with the distance-attenuated dot product of
the view vector and an inverted light vector, would simulate basic light transport
inside an object (see Figure 4.3).
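A minimal sketch of this initial approximation is given below; all names and the attenuation model are assumptions for illustration only, not the shipped shader.

float3 approximateTranslucency( float3 vNormal, float3 vLight, float3 vView,
                                float  fLightDistance,
                                float3 cDiffuse, float3 cTranslucent )
{
   // simple distance attenuation (placeholder model)
   float fAtten = rcp( 1.0f + fLightDistance * fLightDistance );

   // regular diffuse lighting
   float fNdotL = saturate( dot( vNormal, vLight ) );

   // light exiting on the opposite hemisphere towards the viewer:
   // dot product of the view vector and the inverted light vector
   float fVdotNegL = saturate( dot( vView, -vLight ) );

   return fAtten * ( cDiffuse * fNdotL + cTranslucent * fVdotNegL );
}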
Through this research, executed in prototype levels not included with the
final product, we discovered that the technique worked well for numerous simple
shapes. However, the technique did not take thickness into account and we had
to return to the drawing board, seeking a solution that was more effective and
complete.
// fLightAttenuation == Light attenuation from direct lighting
// cBRDF             == Computed color for BRDF lighting
// cDiffuse          == Light diffuse color
// cTranslucent      == Light translucency color

// Compute the BSSRDF
float fLTDot = pow( saturate( dot( vCamera, vLight ) ), fLTPower );
float fLT    = fLightAttenuation * tex2D( texInvAO, input.vUV );
fLT         += fLTDot * fLightAttenuation * fLTScale;

// Compute the final color for the BSSRDF (translucency)
float3 cBSSRDF = lerp( cDiffuse, cTranslucent, fLT ) * fLT;

// The final result
return float4( cBRDF + cBSSRDF, 1.0f );
surrounding the objects. Unfortunately, a still image does not do this variation
justice; therefore, we recommend viewing the demo in the accompanying web materials.
4.4 Performance
The performance numbers for all techniques illustrated in this chapter are given
in Table 4.1. Timings are given in frames per second (FPS), and the complexity
is presented in terms of added arithmetic logic unit (ALU) and texture unit
(TEX) shader instructions. These benchmarks were established using an NVIDIA
GeForce GTX 260, at a resolution of 1280 × 720, in a scene comprised of several
instances of our Hebe statue, with approximately 200,000 triangles.
Overall, these numbers illustrate that our technique approximates light trans-
port inside translucent homogenous media at a very reasonable cost. This method
also provides significant benefits in cases when developers want a quick impres-
sion of subsurface scattering. In fact, the technique requires only 17 additional
instructions to achieve a significant and convincing effect. Finally, if the option is
financially viable, adding screen-space thickness to the computation will further
improve the final result. The cost of this computation can be managed through
the use of a quarter-sized buffer, with filtering in order to prevent artifacts at
the boundaries. Realistically, one could rely on a lightweight separable-Gaussian
blur, and even use a bilateral filter when up-sampling. A blurred result would
definitely complement our diffuse computation.
4.5 Discussion
In the following section we discuss various elements and potential concerns that
were not forgotten but were put aside during the development of this technique
and its variations. These concerns will now be addressed.
differ from one project to the next, it is pertinent to provide general hints re-
garding the adaptation of the technique within a deferred context. These hints
are an adequate starting point, leaving clear room for improvement. An example
implementation is provided in the web materials.
The implementation is dependent on available space on the G-buffer. In
cases where a single channel is available, the local thickness can be stored as a
grayscale value: this is done in the same way that it is stored for specular maps.
Subsequently, the light and view-dependent part of the BSSRDF computation
can be processed at the same time as the deferred lighting pass. The final result
can then be combined with the scene at the same time that the diffuse lighting
is computed. In cases where there is no space left for storing the local thickness,
this technique will have to be treated in an additional pass and the translucent
objects will have to be rendered once again. Fortunately, the z-buffer will already
be full and this minimizes the number of affected pixels.
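A hedged sketch of such a deferred evaluation follows. The G-buffer layout, the texture and constant names, and the output semantics are assumptions for illustration; an actual engine integration will differ.

sampler2D sThicknessMap;   // assumed: local thickness stored in a spare G-buffer channel
float3 g_vLightPos;        // assumed light position
float3 g_vCameraPos;       // assumed camera position
float3 g_cLightColor;      // assumed light color
float  g_fLTPower;         // assumed translucency power
float  g_fLTScale;         // assumed translucency scale

float4 PS_DeferredTranslucency( float2 vUV       : TEXCOORD0,
                                float3 vWorldPos : TEXCOORD1 ) : COLOR0
{
   // local thickness stored as a grayscale value, like a specular map
   float fLocalThickness = tex2D( sThicknessMap, vUV ).r;

   float3 vLight = normalize( g_vLightPos - vWorldPos );
   float3 vView  = normalize( g_vCameraPos - vWorldPos );

   // light- and view-dependent part of the translucency term,
   // evaluated together with the deferred lighting pass
   float fLTDot = pow( saturate( dot( vView, -vLight ) ), g_fLTPower );
   float fLT    = fLTDot * g_fLTScale * fLocalThickness;

   // the regular deferred diffuse term would be accumulated here as well
   return float4( g_cLightColor * fLT, 0.0f );
}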
4.6 Conclusion
This article illustrates an artist-friendly, fast and scalable real-time approximation
of light transport in translucent homogenous media. Our technique allows devel-
opers to improve their games’ visuals by simulating translucency with reasonable
and scalable impact on the runtime. Providing such convincing results through
4.7 Demo
The web materials accompanying this book contain an implementation of our
technique and its variations in the form of an AMD RenderMonkey sample. HLSL
code is also provided in a text file and a short video demonstration of the technique
is included.
4.8 Acknowledgments
We would like to thank the following individuals at Electronic Arts for reviewing this
paper: Sergei Savchenko, Johan Andersson and Dominik Bauset. We are also grate-
ful for the support of EA’s Wessam Bahnassi, Frédéric O’Reilly, David “Mojette” Gi-
raud, Gabriel Lassonde, Stéphane Lévesque, Christina Coffin and John White. Fur-
ther, we would like to thank Sandie Jensen from Concordia University for her genuine
and meticulous style editing. We would also like to thank our section editor, Carsten
Dachsbacher, for his thorough reviews and inspiring savoir-faire in search of techni-
cal excellence through simplicity. Finally, Marc Bouchard would like to thank Ling
Wen Kong and Elliott Bouchard, and Colin Barré-Brisebois would like to thank Simon
Barré-Brisebois, Geneviève Barré-Brisebois, and Robert Brisebois.
Bibliography
[Hable 09] J. Hable, G. Borshukov, and J. Hejl. “Fast Skin Shading.” In ShaderX7, pp. 161–173. Hingham, MA: Charles River Media, 2009.
[d’Eon 09] Eugene d’Eon and David Luebke. “Advanced Techniques for Realistic Real-Time Skin Rendering.” In GPU Gems 3, edited by Hubert Nguyen, pp. 293–347. Reading, MA: Addison-Wesley, 2008.
[Hejl 09] Jim Hejl. “Fast Skin Shading.” In ShaderX7: Advanced Rendering Techniques, edited by Wolfgang Engel, pp. 161–173. Hingham, MA: Charles River Media, 2009.
[Jensen 01] Henrik Wann Jensen, Stephen R. Marschner, Marc Levoy, and Pat Hanrahan. “A Practical Model for Subsurface Light Transport.” In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 511–518. New York: ACM, 2001.
[Ki 09] Hyunwoo Ki. “Real-Time Subsurface Scattering Using Shadow Maps.” In ShaderX7: Advanced Rendering Techniques, edited by Wolfgang Engel, pp. 467–478. Hingham, MA: Charles River Media, 2009.
[Kurt 09] Murat Kurt and Dave Edwards. “A Survey of BRDF Models for Computer Graphics.” ACM SIGGRAPH Computer Graphics 43:2 (2009), 4:1–4:7.
[Oat 08] Christopher Oat and Thorsten Scheuermann. “Computing Per-Pixel Object Thickness in a Single Render Pass.” In ShaderX6: Advanced Rendering Techniques, edited by Wolfgang Engel, pp. 57–62. Hingham, MA: Charles River Media, 2008.
[Sousa 08] Tiago Sousa. “Vegetation Procedural Animation and Shading in Crysis.” In GPU Gems 3, edited by Hubert Nguyen, pp. 373–385. Reading, MA: Addison-Wesley, 2008.
5
III
Diffuse Global Illumination with
Temporally Coherent Light
Propagation Volumes
5.1 Introduction
Figure 5.1. Example of indirect lighting with Light Propagation Volumes in the
upcoming blockbuster Crysis 2.
The elusive goal of real-time global illumination in games has been pursued for
more than a decade. The most often applied solution to this problem is to use
precomputed data in lightmaps (e.g., Unreal Engine 3) or precomputed radiance
transfer (e.g., Halo 3). Both techniques increase the complexity and decrease the
efficiency of a game production pipeline and require an expensive infrastructure
(e.g., setting up a cloud for precomputation and incorporating the result into a
build process).
In this chapter we describe light propagation volumes, a scalable real-time
technique that requires neither a preprocess nor the storage of additional data. The
basic idea is to use a lattice storing the light and the geometry in a scene. The
directional distribution of light is represented using low-order spherical harmon-
ics. The surfaces of the scene are sampled using reflective shadow maps and
this information is then used to initialize the lattice for both light propagation
and blocking. A data-parallel light propagation scheme allows us to quickly, and
plausibly, approximate low-frequency direct and indirect lighting including fuzzy
occlusion for indirect light. Our technique is capable of approximating indirect
illumination on a vast majority of existing GPUs and is battle-tested in the pro-
duction process of an AAA game. We also describe recent improvements to the
technique such as improved temporal and spatial coherence. These improvements
enabled us to achieve a time budget of 1 millisecond per frame on average on both
Microsoft Xbox 360 and Sony PlayStation 3 game consoles.1
5.2 Overview
The light propagation volume technique consists of four stages:
• At the first stage we render all directly lit surfaces of the scene into reflective
shadow maps [Dachsbacher and Stamminger 05] (see Figure 5.2).
Figure 5.2. Reflective Shadow Maps store not only depth, but also information about
surfaces’ normals and reflected flux.
1 Crysis 2, Halo 3, and CryENGINE 3 are trademarked. PlayStation 3, Microsoft Xbox 360,
Figure 5.3. The basic steps of our method: surfaces causing one-bounce indirect illu-
mination are sampled using RSMs, next this information is used to initialize the light
propagation volumes where light is propagated, and finally this information is used to
light the surfaces in the scene.
Our light propagation volumes (LPV) technique works completely on the GPU
and has very modest GPU requirements.
Figure 5.4. A reflective shadow map captures the directly lit surfaces of the scene with
a regular sampling from the light’s origin.
Temporally stable rasterization. Temporal flickering can occur when an RSM frus-
tum moves, and obviously becomes visible when the resolution of the RSM is
rather low and thus scene surfaces are sampled at a coarser level. A simple
solution to this problem is to move the frustum of the RSM with world-space
positions snapped to the size of one texel in world space. This leads to con-
sistently sampled points during rasterization and largely removes the sampling
problems [Dimitrov 07]. If this is not possible, e.g., for perspective projections,
then higher-resolution RSMs are required to sample surfaces consistently. Kaplanyan
et al. [Kaplanyan 09] proposed downsampling RSMs to reduce the number
of surfels for the injection stage. However, additional experiments showed
that the downsampling is sometimes even slower than injecting the same number
of surfels directly into the LPV.
Figure 5.5. The reflected intensity distribution of a surfel obtained from an RSM is
approximated using SH and snapped to the center of the closest LPV cell.
Figure 5.6. Illustration of the approximation error with 4 SH coefficients and a coarse
lattice. Note that both the analytical result (green) and propagated result (red) are
represented as a final convolved irradiance.
These coefficients are scaled according to the corresponding world-space size and
reflected flux of the surfel, yielding four coefficients per color channel. In order
to simplify the following description, we show only one set of SH coefficients.
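As an illustration, a sketch of building one such set of coefficients: the clamped cosine lobe around the surfel normal, scaled by the reflected flux. The constants are the standard two-band SH projection of a clamped cosine lobe; the function and variable names are assumptions.

// sqrt(pi)/2 and sqrt(pi/3): SH projection of a clamped cosine lobe
#define SH_COSLOBE_C0 0.886226925f
#define SH_COSLOBE_C1 1.023326708f

// SH basis order used here: (Y00, Y1-1, Y10, Y11) ~ (1, y, z, x)
float4 shCosineLobe( float3 n )
{
   return float4( SH_COSLOBE_C0,
                  SH_COSLOBE_C1 * n.y,
                  SH_COSLOBE_C1 * n.z,
                  SH_COSLOBE_C1 * n.x );
}

// one set of SH coefficients for a surfel (repeated per color channel)
float4 buildSurfelSH( float3 f3SurfelNormal, float fReflectedFlux )
{
   return shCosineLobe( f3SurfelNormal ) * fReflectedFlux;
}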
Offsetting surfels. As the LPV is a coarse grid, we have to take care when injecting
surfels, as their exact position inside a cell is no longer available after injection. If
a surfel’s normal points away from the cell’s center, its contribution should not be
added to this cell, but rather to the next cell, in order to avoid self-illumination
(see Figure 5.7). For this, we virtually move each VPL by half the cell-size in
the direction of its normal before determining the cell. Note that this shifting
of the surfels still does not guarantee completely avoiding self-illumination and
light bleeding, but largely removes artifacts from the injection stage.
Figure 5.7. Example of a VPL injection causing self-illumination of the thin geometry.
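A minimal sketch of the half-cell offset described above; g_f3LPVMin and g_fCellSize describe the LPV placement and are assumptions for illustration.

float3 g_f3LPVMin;   // assumed: world-space origin of the LPV grid
float  g_fCellSize;  // assumed: world-space size of one LPV cell

int3 selectInjectionCell( float3 f3SurfelPos, float3 f3SurfelNormal )
{
   // virtually move the surfel by half a cell along its normal
   float3 f3P = f3SurfelPos + 0.5f * g_fCellSize * f3SurfelNormal;
   // map the offset position to an LPV cell
   return (int3)floor( ( f3P - g_f3LPVMin ) / g_fCellSize );
}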
5.4.1 Propagation
The propagation stage consists of several iterations, where every iteration repre-
sents one step of light propagation. The propagation stage has some similarity to
the SH discrete ordinate method [Chandrasekhar 50,Evans 98]. These techniques
are typically employed for light-propagation simulation in scattering media. Es-
sentially we use the same process; however, we use a different cell-to-cell prop-
agation scheme. The application of this method to light propagation through
vacuum instead of scattering media suffers from the fact that propagation direc-
tions are blurred out. Fortunately, our results show that this is an acceptable
artifact in many application scenarios.
• The input for the first iteration step is the initial LPV obtained from the
injection stage. Each cell stores the intensity as an SH-vector and we prop-
agate the energy to the six neighbors along the axial directions.
• All subsequent propagation steps take the LPV from the previous iteration
as input and propagate as in the first iteration.
The main difference from SHDOM methods is the propagation scheme. In-
stead of transferring energy from a source cell to its 26 neighbor cells in a regular
grid, we propagate to its 6 neighbors only, thus reducing the memory footprint.
To preserve as much directional information as possible, we compute the transfer
to the faces of these neighbor cells and reproject the intensities to the cells’ cen-
ter (see Figure 5.8). This mimics, but is of course not identical to, the use of 30
unique propagation directions. Please see [Kaplanyan and Dachsbacher 10] for
the details. There are two ways to implement this propagation process: scatter-
ing and gathering light. The gathering scheme is more efficient in this case due
to its cache-friendliness.
Figure 5.8. Propagation from one source cell (center) to its neighbor cells. Note that
we compute the light propagation according to the intensity distribution I(ω) to all
yellow-tagged faces of the destination cells (blueish).
Figure 5.9. Left: Light propagation from the source cell (gray) to the bottom face of
the destination cell (blueish). Right: during propagation we can account for occlusion
by looking up the blocking potential from the geometry volume.
When propagating the light from a source cell to one face of the destination
cell, we compute the incoming flux onto the face using the solid angle ∆ω of the
face and the central direction ωc of the propagation cone (see Figure 5.9). The
flux reaching the face is then computed as
\Phi_f = \frac{\Delta\omega}{4\pi}\, I(\omega_c),
where I(ωc) is the intensity of the source cell towards the center of the face, obtained
by evaluating the SH approximation. Here we assume that the intensity does not
vary over the solid angle of the face.
Reprojection. The flux incident on a face is then reprojected back into the in-
tensity distribution of the destination cell. The reprojection is accomplished by
creating a new virtual surfel at the destination cell’s center, pointing toward the
face and emitting exactly as much flux as the face received from the propaga-
tion (Φf):
\Phi_l = \Phi_f \Big/ \int_{\Omega} \langle n_l, \omega \rangle \, d\omega = \frac{\Phi_f}{\pi}.
Similar to the light injection stage, the corresponding clamped cosine lobe is
scaled by Φl and accumulated into SH coefficients of the destination cell. In
other words, we compute the incoming flux for each face of the destination cell
and transform it back into an intensity distribution.
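As an illustration, a sketch of this per-face propagation math: evaluate the source intensity towards the face center, weight it by the face's solid angle, and reproject the received flux as a cosine lobe. The SH constants are the standard two-band values; function and variable names are assumptions.

#define PI 3.14159265f

// 2-band SH basis evaluated for direction d: (Y00, Y1-1, Y10, Y11)
float4 shBasis( float3 d )
{
   return float4( 0.282094792f,
                  0.488602512f * d.y,
                  0.488602512f * d.z,
                  0.488602512f * d.x );
}

// clamped cosine lobe around direction n, projected into the same basis
float4 shCosineLobe( float3 n )
{
   return float4( 0.886226925f,
                  1.023326708f * n.y,
                  1.023326708f * n.z,
                  1.023326708f * n.x );
}

// Propagate from a source cell towards one face of a destination cell.
//   f4SrcIntensity : SH intensity of the source cell
//   f3ConeDir      : central direction w_c of the propagation cone
//   fSolidAngle    : solid angle dw of the destination face
//   f3LobeDir      : direction of the new virtual surfel (towards the face)
float4 propagateToFace( float4 f4SrcIntensity, float3 f3ConeDir,
                        float  fSolidAngle,    float3 f3LobeDir )
{
   // flux arriving at the face: (dw / 4pi) * I(w_c)
   float fI    = max( 0.0f, dot( f4SrcIntensity, shBasis( f3ConeDir ) ) );
   float fFlux = fSolidAngle / ( 4.0f * PI ) * fI;
   // reproject the flux as a cosine lobe scaled by Phi_f / pi and
   // accumulate it into the destination cell's SH coefficients
   return shCosineLobe( f3LobeDir ) * ( fFlux / PI );
}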
Blocking for indirect shadows. Indirect shadows, i.e., the blocking of indirect light
due to scene geometry, can also be incorporated into the LPVs. In order to add
indirect shadows, we construct a volumetric representation of the scene’s surfaces
(see Section 5.4.2). This so-called geometry volume (GV) is a grid of the same
resolution as the LPV and stores the blocking potential (also represented as SH)
for every grid cell. The GV is displaced by half the grid size with respect to
the LPV. That is, the cell centers of the GV reside on the corners of the LPV
cells. Whenever we propagate from a source to a destination cell, we obtain the
bilinearly interpolated SH coefficients—at the center of the face through which
we propagate—from the GV, and evaluate the blocking potential for the propa-
gation direction to attenuate the intensity. Note that this occlusion should not be
considered for the very first propagation step after injection, in order to prevent
immediate self-shadowing.
Iterations. The sum of all intermediate results is the final light distribution in the
scene. Thus we accumulate the results of every propagation in the LPV into a
separate 3D grid. The number of required iterations depends on the resolution
of the volume. We recommend using two times the longest dimension of the grid
(when not using a cascaded approach). For example, if the volume has dimensions
of 32 × 32 × 8, then the light can travel the whole volume in 64 iterations in the
worst case (which is a diagonal of the volume). However, when using a cascaded
approach it is typically sufficient to use a significantly lower number of iterations.
Illustrative example of the light propagation process in a Cornell box–like scene. The
light propagation is shown in Figure 5.10. The top-left image shows the coarse
LPV initialized from the RSM in the injection stage. The noticeable band of
reflected blue and red colors has a width of one LPV cell. Note that after four
iterations the indirect light is propagated and touches the small white cube. After
eight iterations the indirect light has reached the opposite wall.
Figure 5.11. Light from a small area light source (smaller than the cell size). The
analytical result shows the ground truth solution (left). The light propagation illustrates
the ray effect and an unwanted propagation behind the light source (right).
Limitations. The iterative propagation has three main limitations. First, the
coarse resolution of the grid does not capture fine details and might even cause
light bleeding. Second, due to the SH representation of the intensity distribution
and the propagation process itself, the light strongly diffuses and a strictly di-
rected light propagation is not possible. Consequently, there is also no reasonable
chance of handling glossy surfaces during propagation. Lastly, the propagation
together with the reprojection introduces spatial and directional discretization;
this is called the ray effect and is common to all lattice-based methods (see Fig-
ure 5.11). Note that some of these artifacts can be suppressed to an acceptable
level using cascaded grids to provide finer grids closer to the camera and smart
filtering when looking up the LPV.
Reusing G-buffers from RSMs and cameras. Aiming at fully dynamic scenes with-
out precomputation requires the creation of the GV—and thus the surface sampling—
on the fly. First of all, we can reuse the sampling of the scene’s surfaces that is
stored in the depth and normal buffers of the camera view (when using a de-
ferred renderer), and in the RSMs that have been created for the light sources.
RSMs are typically created for numerous light sources and thus already represent
a dense sampling of large portions of the scene. It is, at any time, possible to
gather more information about the scene geometry by using depth-peeling for the
RSMs or the camera view.
The injection (i.e., determining the blocking potential of a GV’s cell) has
to be done using separate GVs for every RSM or G-buffer in order to make
sure that the same occluder (surfel) is not injected multiple times from different
inputs. Afterwards, we combine all intermediate GVs into a single final GV. We
experimented with using the maximum operator on the SH coefficients. Although this is
not correct, it yields plausible results when using only two SH bands and clamping
the evaluation to zero, and it provides more complete information about light-
blocking surfaces.
Figure 5.12. The interpolation of the intensity stored in the grid induced by the orange
surface between two cells. Note the wrong result of linear interpolation behind the
emitting object itself. Also note that the gradient is opposing in this case.
LPVs and light prepass rendering. In light prepass architecture [Engel 09] of the
CryENGINE 3, the LPV lighting is directly rendered into the diffuse-light-
accumulation buffer on top of multiple ambient passes. This allows us to use
optimizations, such as stencil prepass and depth-bound tests, for this pass as
well. Moreover, it is also possible to compose complex layered lighting.
During the final lighting we additionally estimate the directional derivative of the
LPV lookup c along the surface normal n (with ||n|| = 1):
\nabla c(x) = \frac{c(x) - c(x + n)}{\lVert n \rVert} = c(x) - c(x + n).
Whenever the derivative is large, and c and ∇c are deviating, we damp c before
computing the lighting.
This additional filtering during the final rendering phase yields sharper edges
between lit and shadowed areas.
Figure 5.13. Cascaded approach. Cascades are nested and attached to the camera; note
that cascades are slightly shifted toward the view direction.
Cascaded LPVs. In the spirit of the cascaded shadow map (CSM) approach [Engel 05,
Dimitrov 07], we use multiple nested grids, or cascades, for light propagation that
move with the camera. For every cascade, we not only store an LPV, but we also
create a separate RSM for every light source, where the RSM resolution is chosen
proportional to the grid cell size as described above. However, unlike CSMs, the
indirect lighting can also be propagated from behind the camera to surfaces in front of
it (see Figure 5.14). In practice we use a 20/80 ratio for cascade shifting, i.e.,
the cascades are not centered around the camera but are shifted in the view direction
such that 20% of their extent is located behind the camera. Usually it is sufficient
to use three nested grids, each twice the size of the previous one.
Figure 5.14. Indirect light propagated from objects of different sizes in the respective
resolution.
Light propagation with cascades. So far we have detailed light propagation within a single
grid, but handling multiple cascades simultaneously raises new questions:
How do we propagate across different cascades? How do we combine the contributions
of different cascades for the final rendering? In this section we propose
two options, depending on whether indirect shadows are a required feature or not.
If no indirect shadows are required, we can handle the cascades independently
of each other, and the multi-resolution approach is straightforward to implement.
With indirect shadows, we have to correctly deal with light propagating
across the edges of cascades and, of course, with blocking.
Cascaded LPVs without indirect shadows. Assuming that light propagates without
blocking, we can completely decouple the cascades and compute an LPV solution
with the following steps:
• Every RSM for each cascade should contain unique objects causing in-
direct light. Objects are normally rendered into the RSM for the cascade
for which they have been selected; in RSMs for other cascades they are
rendered with black albedo in order to prevent indirect light contribution,
but correctly rendering direct shadows.
In this case, we determine the respective cascade for every object by estimating
its contribution to the indirect illumination, which in turn heavily depends on its
surface area and the distance to the camera. To this end, we account for large
and distant objects in the coarser cascades while injecting smaller, close objects
into the finer grids. Note that this means that distant, small objects might not
be considered during the computation (if they lie outside the finer cascades).
However, the indirect illumination stemming from such objects typically has a
significant contribution only within a certain (small) proximity of the object.
Cascades with indirect shadows. When accounting for light blocking we cannot
decouple the propagation process of the cascades. In this case every object is
injected into the finest grid at its respective location. This also means that
those parts of coarser grids that are overlapped by finer grids are empty and not
used during the propagation. Although we apply the propagation steps to each
cascade separately, we have to make sure that light leaving one grid can further
propagate in the next grid. This is simple when the grid cell sizes of two cascades
differ by a power-of-two factor. Note that intensity has to be redistributed when
propagating light from a single cell of a coarse grid to multiple cells of a fine grid,
and accumulated when propagating light from multiple fine grid cells to a coarse
grid cell (as intensity already accounts for the surface area).
Stable lighting through snapping. The LPV cascades are oriented along the world
space axes and do not rotate, only translate, with the camera. It is important
to snap the cascade positions in world-space to multiples of the cell sizes. This
ensures that the distribution of surfels in the injection stage is stable and inde-
pendent of the translation of the LPVs.
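A minimal sketch of this snapping; the names are placeholders.

// Snap a cascade's world-space origin to multiples of its cell size so
// that surfels are injected consistently while the camera moves.
float3 snapCascadeOrigin( float3 f3DesiredOrigin, float fCellSize )
{
   return floor( f3DesiredOrigin / fCellSize ) * fCellSize;
}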
5.5 Optimizations
5.5.1 General Optimizations
The scene illumination stage with a final LPV is usually a very expensive pass
(see the timing table in Section 5.6.1). It is important to note that the hardware
capability to render into a volume texture tremendously improves performance
at this stage. This simplifies the shader workload as the emulation of a trilinear
filtering in the pixel shader is not necessary. Instead, the hardware unit is utilized
and cache coherency is highly improved due to the optimized memory layout of
a swizzled 3D texture.
Unfortunately, not every platform supports rendering into a 3D texture. How-
ever, it is possible to inject the VPLs directly into the volume texture on consoles.
To do so, the 3D texture should be treated as a horizontally unwrapped 2D render
target; note that this is not possible with the Microsoft DirectX 9.0 API.
Using an 8 bit/channel texture format for the LPV has proven to be sufficient
for diffuse indirect lighting stored in two bands of SH. Note that this detail is
very important because it speeds up the final rendering pass significantly due to
decreased bandwidth and texture-cache misses.
5.6 Results
In this section we provide several example screenshots showing diffuse global
illumination in Crysis 2 (see Figure 5.15). In Crysis 2 we use one to three cascades
depending on the graphics settings, platform, and level specifics.
Figure 5.15. In Crysis 2 we use from one to three cascades depending on graphics
settings, platform, and level specifics.
• Per-object indirect color. This parameter affects only the color and the
intensity of indirectly bounced lighting from a specific object. It is mostly
used to amplify or attenuate the contribution of some particular objects
into indirect lighting.
These parameters are frequently used by artists to tweak some particular places
and moments in the game. They proved to be particularly important for cut
scenes and some in-game movies.
5.6.2 Timings
Detailed timings for a single cascade for the Crytek Sponza2 scene are provided
in Table 5.1. Note that the timings for multiple cascades can be estimated by
multiplying the timings for a single cascade by the number of cascades when using
the cascaded approach, as the work is spread across several RSMs.
For all screenshots in this chapter we used the same settings: the size of the
LPV grid is 32 × 32 × 32, the propagation uses 12 iterations and one cascade,
and the rendering was at 1280 × 720 resolution (no MSAA). The cost of the
final illumination pass obviously depends on the target buffer resolution. The
RSM size is 2562 for the timings with the NVIDIA GTX285 and 1282 for both
consoles. Of course the cost of the RSM rendering also depends on the scene
complexity.
Table 5.1. Detailed timings for a single cascade for the Crytek Sponza scene. All
measurements are in milliseconds for the individual stages.
5.7 Conclusion
In this chapter we described a highly parallel, production-ready, and battle-tested
(diffuse) global illumination method for real-time applications without any pre-
computation. To our knowledge, it is the first technique that employs light
propagation for diffuse global illumination. It features a very consistent and scal-
able performance, which is crucial for real-time applications such as games. We
also described how to achieve indirect illumination in large-scale scenes using a
multi-resolution light propagation.
We demonstrated our method in various scenes in combination with compre-
hensive real-time rendering techniques. In the future we would like to reduce the
limitations of our method by investigating other grid structures and other adap-
tive schemes, where the compute shaders of DirectX 11 will probably be of great
help.
2 The Crytek Sponza scene is the original Sponza Atrium scene improved for global illumi-
nation experiments. This scene is granted to the rendering community and can be downloaded
from this link: [Link]
5.8 Acknowledgments
Thanks to Sarah Tariq and Miguel Sainz for implementing this technique as a sample
of the NVIDIA SDK3 . And of course the whole Crytek R&D team as well as all others
who helped and discussed the real-time graphics with us!
Bibliography
[Chandrasekhar 50] S. Chandrasekhar. Radiative Transfer. New York: Dover, 1950.
[Dachsbacher and Stamminger 05] C. Dachsbacher and M. Stamminger. “Reflective
Shadow Maps.” In Proc. of the Symposium on Interactive 3D Graphics and Games,
pp. 203–213. Washington, DC: IEEE Computer Society, 2005.
[Dimitrov 07] R. Dimitrov. “Cascaded Shadow Maps.” Technical report, NVIDIA
Corporation, 2007.
[Engel 05] W. Engel. “Cascaded Shadow Maps.” In Shader X5 , pp. 129–205. Hingham,
MA: Charles River Media, 2005.
[Engel 09] W. Engel. “Light Prepass.” In Advances in Real-Time Rendering in 3D
Graphics and Games Course—SIGGRAPH 2009. New York: ACM Press, 2009.
[Evans 98] K. F. Evans. “The Spherical Harmonic Discrete Ordinate Method for Three-
Dimensional Atmospheric Radiative Transfer.” In Journal of Atmospheric Sciences,
pp. 429–446, 1998.
[Greger et al. 97] G. Greger, P. Shirley, P. Hubbard, and D. Greenberg. “The Irradiance
Volume.” IEEE Computer Graphics & Applications 18 (1997), 32–43.
[Kaplanyan and Dachsbacher 10] Anton Kaplanyan and Carsten Dachsbacher. “Cas-
caded Light Propagation Volumes for Real-Time Indirect Illumination.” In Pro-
ceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and
Games, pp. 99–107, 2010.
[Kaplanyan 09] A. Kaplanyan. “Light Propagation Volumes in CryEngine 3.” In Ad-
vances in Real-Time Rendering in 3D Graphics and Games Course—SIGGRAPH
2009. New York: ACM Press, 2009.
[Sillion 95] F. Sillion. “A Unified Hierarchical Algorithm for Global Illumination with
Scattering Volumes and Object Clusters.” IEEE Trans. on Visualization and Com-
puter Graphics 1:3 (1995), 240–254.
[Sloan 08] P.-P. Sloan. “Stupid Spherical Harmonics (SH) Tricks.” In GDC’08, 2008.
[Tatarchuk 04] N. Tatarchuk. “Irradiance Volumes for Games.” Technical report, ATI
Research, Inc., 2004.
IV
Shadows
In Part IV we will cover various algorithms that are used to generate shadow data.
Shadows are the dark companions of lights and although both can exist on their
own, they shouldn’t exist without each other in games. Achieving good visual
results in rendering shadows is still considered one of the particularly difficult
tasks of graphics programmers. One of the trends in this shadow section is the
description of implementations that achieve perceptually correct looking shadows;
in real-time graphics terminology, called soft shadows. Three articles build on
Randy Fernando’s work, “Percentage-Closer Soft Shadows,” which is now—five
years after it was published—still an efficient way to implement soft shadows on
the current generation of hardware.
The first article in this section, “Variance Shadow Maps Light-Bleeding Re-
duction Tricks,” by Wojciech Sterna, covers techniques to reduce light bleeding.
There is also an example application that shows the technique.
Pavlo Turchyn covers fast soft shadows with adaptive shadow maps—as used
in Age of Conan—in his article “Fast Soft Shadows via Adaptive Shadow Maps.”
The article describes the extension of percentage-closer filtering to adaptive shadow
maps that was implemented in the game. Turchyn proposes a multiresolution
filtering method in which three additional, smaller shadow maps with sizes of
1024 × 1024, 512 × 512 and 256 × 256 are created from a 2048 × 2048 shadow
map. The key observation is that the result of the PCF kernel over a 3 × 3 area
of a 1024 × 1024 shadow map is a reasonably accurate approximation for filtering
over a 6 × 6 area of a 2048 × 2048 shadow map. Similarly, a 3 × 3 filter kernel of
a 256 × 256 shadow map approximates a 24 × 24 area of a 2048 × 2048 shadow
map.
The article “Adaptive Volumetric Shadow Maps” by Marco Salvi et al. de-
scribes a new approach for real-time shadows that supports high-quality shad-
owing from dynamic volumetric media such as hair, smoke, and fog. Adaptive
volumetric shadow maps (AVSM) encode the fraction of visible light from the
light source over the interval [0, 1] as a function of depth at each texel. This
transmittance function and the depth value are then stored for each texel and
sorted front-to-back. This is called the AVSM representation. This AVSM rep-
resentation is generated by first rendering all visible transparent fragments in
a linked list (see the article “Order-Independent Transparency Using Per-Pixel
Linked Lists in DirectX 11” for the description of per-pixel linked lists). In a
subsequent pass, those linked lists are compressed into the AVSM representation,
consisting of the transmittance value and the depth value.
Another article that describes the fast generation of soft shadows is “Fast Soft
Shadows With Temporal Coherence” by Daniel Scherzer et al. The light source
is sampled over multiple frames instead of a single frame, creating only a single
shadow map with each frame. The individual shadow test results are then stored
in a screen-space shadow buffer. This buffer is recreated in each frame using
the shadow buffer from the previous frame as input. This previous frame holds
only shadowing information for pixels that were visible in the previous frame.
Pixels that become newly visible in this frame due to camera or object movement
have no shadowing information stored in this buffer. For these pixels the article
describes a spatial-filtering method to estimate the soft shadow results. In other
words the main idea of the algorithm described in the article is to formulate light-
source area sampling in an iterative manner, evaluating only a single shadow map
per frame.
The last article in the section, “MipMapped Screen Space Soft Shadows,”
by Alberto Aguado et al. uses similar ideas as the other two soft-shadow arti-
cles. Soft shadows are generated with the help of mipmaps to represent multi-
frequency shadows for screen-space filtering. The mipmap has two channels; the
first channel stores the shadow-intensity values and the second channel stores
screen-space penumbra widths. Shadow values are obtained by filtering while
penumbrae widths are propagated by flood filling. After the mipmap is gener-
ated, the penumbrae values are used as indices to the mipmap levels. Thus, we
transform the problem of shadow generation into the problem of selecting levels
in a mipmap. This approach is extended by including layered shadow maps to
improve shadows with multiple occlusions.
—Wolfgang Engel
1
IV
Variance Shadow Maps Light-Bleeding Reduction Tricks
Wojciech Sterna
1.1 Introduction
Variance Shadow Maps (VSMs) were first introduced in [Donnelly and Lau-
ritzen 06] as an alternative to bilinear percentage closer filtering (PCF) to speed
up rendering of smoothed shadows. The algorithm is relatively inexpensive, easy
to implement, and very effective in rendering shadows with large penumbra re-
gions. However, VSM has one major drawback—apparent light-bleeding—which
occurs when two or more shadow casters cover each other in light-space. This ar-
ticle will show techniques that help to reduce the light-bleeding artifacts in VSM.
P(O ≥ R) ≤ p_max(R) ≡ σ² / (σ² + (R − µ)²),   where µ < R.   (1.1)
M_1 = E(O),   M_2 = E(O²),
µ = M_1 = E(O),   σ² = M_2 − M_1² = E(O²) − E(O)².
In fact, the first moment is what is actually stored in the first channel of the
shadow map, and the second moment in the second channel of the shadow map.
That’s why the shadow map can be additionally prefiltered before its use—the
moments are defined by the expectation operator which is linear and can thus be
linearly filtered.
A sample implementation of standard VSM is shown in Listing 1.1.
    return saturate(p_max);
}
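For reference, the following is a minimal sketch of a Chebyshev-based VSM lookup along the lines of Listing 1.1; the names shadowMapSampler, shadowCoord, receiverDepth, and MIN_VARIANCE are illustrative assumptions, not identifiers taken from the original listing.

#define MIN_VARIANCE 0.00001f

sampler2D shadowMapSampler;   // RG channels store E(O) and E(O^2)

float ChebyshevUpperBound(float2 moments, float receiverDepth)
{
    // Fully lit if the receiver is not farther away than the mean occluder depth.
    if (receiverDepth <= moments.x)
        return 1.0f;

    // sigma^2 = E(O^2) - E(O)^2, clamped to avoid numerical problems.
    float variance = max(moments.y - moments.x * moments.x, MIN_VARIANCE);

    // One-tailed Chebyshev inequality, Equation (1.1).
    float d = receiverDepth - moments.x;
    float p_max = variance / (variance + d * d);
    return saturate(p_max);
}

float VSMShadow(float2 shadowCoord, float receiverDepth)
{
    float2 moments = tex2D(shadowMapSampler, shadowCoord).rg;
    return ChebyshevUpperBound(moments, receiverDepth);
}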
Note that the given Chebyshev’s inequality (its one-tailed version) is undefined
for cases in which µ ≥ R. Of course, in such a case a point is fully lit, so the
function returns 1.0; otherwise, pmax is returned.
1.3 Light-Bleeding
Light-bleeding (see Figure 1.1) occurs when two or more shadow casters cover
each other in light-space, causing light (actually the soft shadow edges of the
objects closest to the light) to bleed onto the shadows of objects that are farther
from the light. Figure 1.2 shows this in detail.
As can be seen from the picture in Figure 1.2, object C is lit over a filter
region. The problem is that when estimating the shadow contribution for pixels
of object C over this region, the variance and mean that are used are actually
based on the samples from object A (red line) and visible samples from object B
(green line) (the shadow map consists of pixels colored by red and green lines),
whereas they should be based on samples from object B only (green and blue
lines). Moreover, the greater the ratio of distances Δx/Δy (see Figure 1.2), the
more apparent the light-bleeding is on object C.
The VSM is not capable of storing more information (about occluded pixels
of object B for instance). This is not even desirable since we want to keep the
algorithm simple and don’t want to raise its memory requirements. Fortunately,
the following sections in this chapter will present a few very simple tricks that
can greatly reduce the problem.
Figure 1.3. VSM and VSM with pmax cut off by 0.15.
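One common way to realize the cut-off shown in Figure 1.3 is to treat all pmax values below a chosen threshold as fully shadowed and rescale the remaining range back to [0, 1]. The sketch below assumes a linstep-style helper and a hypothetical cutoff parameter (0.15 in the figure); it is an illustration, not the chapter’s exact implementation.

// Remaps p_max so that values below 'cutoff' become 0 and the
// remaining range [cutoff, 1] is stretched back to [0, 1].
float LinStep(float minVal, float maxVal, float v)
{
    return saturate((v - minVal) / (maxVal - minVal));
}

float ReduceLightBleeding(float p_max, float cutoff)   // e.g., cutoff = 0.15
{
    return LinStep(cutoff, 1.0f, p_max);
}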
Figure 1.5. ESM with the following constants: 10, 50, 90, 200.
ESMs work similarly to VSMs. A very nice feature of ESM is that it requires
only a single-channel shadow map, which stores the first moment. The algorithm
also uses statistical methods to estimate the shadowing term and it does so by
using the so-called Markov inequality (as opposed to VSM which uses Chebyshev’s
inequality):
P(O ≥ R) ≤ E(O) / R.   (1.2)
Using the Markov inequality as given in Equation (1.2) doesn’t provide a good shadow
approximation; shadows suffer from a sort of global light-bleeding. However,
[Salvi 08] shows that it can be transformed to the following representation:
P(e^{kO} ≥ e^{kR}) ≤ E(e^{kO}) / e^{kR}.   (1.3)
Constant k determines how good the approximation is—the greater the value, the
better the approximation. Unfortunately, large values cause precision loss, and
shadow boundaries become sharper, so a compromise must be found. Figure 1.5
shows a comparison of ESM with different values of constant k.
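In code, a shadow test based on Equation (1.3) can look roughly like the sketch below; it assumes a single-channel map that stores exp(k · occluderDepth), and all names are illustrative.

sampler2D expShadowMapSampler;   // single channel storing exp(k * occluderDepth)
static const float k = 50.0f;    // tweakable light-bleeding constant

float ESMShadow(float2 shadowCoord, float receiverDepth)
{
    float expOccluder = tex2D(expShadowMapSampler, shadowCoord).r;
    // E(exp(k*O)) / exp(k*R), clamped to [0, 1].
    return saturate(expOccluder * exp(-k * receiverDepth));
}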
ESM is simpler, faster, and has a tweakable light-bleeding parameter that
makes it more practical than VSM in many cases. However, a very promising
idea is the combination of these two techniques—EVSM. Instead of storing depth
and a square of depth in the shadow map, we store an exponential of depth
and a square of exponential of depth. The exponential function has the effect of
decreasing the ratio Δx/Δy and thus reduces the VSM-like light-bleeding.
Figure 1.6. VSM and EVSM with pmax cut off by 0.05 and k = 50.
EVSM suffers from light-bleeding only in cases when both VSM and ESM
fail, which rarely happens. A very important feature of EVSM is that the light-
bleeding can be controlled by two factors: VSM with the tail cut off and ESM
with k constant. Careful adjustment of these two will lead to very pleasing and
accurate soft shadows.
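A minimal sketch of this EVSM combination, assuming a single positive exponential warp; the names and constants are illustrative, not the chapter’s reference implementation.

sampler2D evsmSampler;             // RG: exp(k*depth) and its square
static const float k = 50.0f;      // ESM-style constant
static const float cutoff = 0.05f; // VSM-style tail cut-off (cf. Figure 1.6)

// Values written into the shadow map during the depth pass:
float2 EVSMWarp(float occluderDepth)
{
    float e = exp(k * occluderDepth);
    return float2(e, e * e);
}

float EVSMShadow(float2 shadowCoord, float receiverDepth)
{
    float2 m = tex2D(evsmSampler, shadowCoord).rg;
    float r = exp(k * receiverDepth);           // warped receiver depth
    if (r <= m.x)
        return 1.0f;                            // fully lit
    float variance = max(m.y - m.x * m.x, 1e-5f);
    float d = r - m.x;
    float p_max = variance / (variance + d * d);
    // cut off the tail to further reduce light-bleeding
    return saturate((p_max - cutoff) / (1.0f - cutoff));
}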
Figure 1.6 shows a side-by-side comparison of standard VSM and EVSM.
1.6 Conclusion
Variance shadow mapping has already proven to be a great way of generating
soft shadows. The algorithm is easy to implement, fast, and utilizes hardware
features of modern GPUs. Despite its advantages, VSM also introduces some
problems. The worst one is light-bleeding, which was the subject of discussion in
this chapter.
Bibliography
[Donnelly and Lauritzen 06] William Donnelly and Andrew Lauritzen. “Variance
Shadow Maps.”, 2006. Available online ([Link]
[Lauritzen 07] Andrew Lauritzen. “Summed-Area Variance Shadow Maps.” In GPU
Gems 3, Chapter II.8. Reading, MA: Addison-Wesley, 2007.
[Lauritzen 08] Andrew Lauritzen. Master’s thesis, University of Waterloo, 2008.
[Salvi 08] Marco Salvi. “Rendering Filtered Shadows with Exponential Shadow Maps.”
In ShaderX6 , Chapter IV.3. Hingham, MA: Charles River Media, 2008.
2
IV
Fast Soft Shadows via Adaptive Shadow Maps
Pavlo Turchyn
The tables Weight and Offset hold the filter weights and the texture-coordinate offsets,
respectively. The choice of Weight and Offset defines the performance and quality
of rendering. The computationally fastest way is to use constant tables. In more
elaborate schemes, the tables are constructed based on the coordinates uv.
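As an illustration, a table-driven PCF loop of the kind described here might look as follows; the sample count, the constant 3 × 3 tables, and the shadowMapTexelSize parameter are assumptions made for this sketch.

#define N_SAMPLES 9

// Illustrative constant 3x3 tables; in more elaborate schemes they
// would be derived from the sampling coordinates uv.
static const float2 Offset[N_SAMPLES] = {
    float2(-1, -1), float2(0, -1), float2(1, -1),
    float2(-1,  0), float2(0,  0), float2(1,  0),
    float2(-1,  1), float2(0,  1), float2(1,  1)
};
static const float Weight[N_SAMPLES] = {
    1.0 / 9, 1.0 / 9, 1.0 / 9,
    1.0 / 9, 1.0 / 9, 1.0 / 9,
    1.0 / 9, 1.0 / 9, 1.0 / 9
};

sampler2D shadowMap;
float2 shadowMapTexelSize;   // 1 / shadow-map resolution

float PCF(float2 uv, float receiverDepth)
{
    float shadow = 0;
    for (int i = 0; i < N_SAMPLES; ++i)
    {
        float occluderDepth =
            tex2D(shadowMap, uv + Offset[i] * shadowMapTexelSize).r;
        shadow += Weight[i] * (receiverDepth <= occluderDepth ? 1.0 : 0.0);
    }
    return shadow;
}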
The number of samples n is chosen depending on the size of the area over
which the filtering is performed. Summing over the area of m × m shadow map
texels would require at least n = m² samples if we want to account for all texels.
However, such a quadratic complexity of the filter makes PCF impractical for
filtering over large areas. For instance, if the size of a shadow map texel’s pro-
jection onto a scene’s geometry is 0.1 m, and the desired penumbra size is about
1.5 meters, then we need to apply PCF over (1.5/0.1) × (1.5/0.1) = 15 × 15 area,
which in turn gives n = 225.
It is possible to use PCF with large kernels in time-sensitive applications
by decreasing the number of samples, so that n ≪ m², and distributing the
samples pseudorandomly over the summation area. Such an approach produces
penumbra with noise artifacts, as shown in Figure 2.2. A screen-space low-pass
filter can be used to suppress the noise, but such post-filtering removes all
high-frequency features within the penumbra indiscriminately. Moreover, preventing
shadows from bleeding into undesired areas in screen space may require the use
of relatively expensive filters (e.g., a bilateral filter).
We propose a multiresolution filtering (MRF) method that attempts to al-
leviate the PCF problem described above. The idea is as follows: When we
create the standard shadow map for the scene, we also create versions of it with
progressively lower resolutions. For example, if the shadow map’s resolution is
2048 × 2048, we create three additional shadow maps: 1024 × 1024, 512 × 512,
and 256 × 256. The key observation is that the result of PCF over a 3 × 3 area of
a 1024 × 1024 shadow map is a reasonably accurate approximation for filtering
over 6×6 area of 2048×2048 map. Similarly, PCF over a 3×3 area of a 256×256
shadow map approximates a 6 × 6 filter for a 512 × 512 map, a 12 × 12 filter for
a 1024 × 1024 map, and a 24 × 24 filter for a 2048 × 2048 map. Thus, in order to
Figure 2.2. Filtering comparison. From top to bottom: bilinear PCF; 24 × 24 PCF
filter with 32 samples, frame rendering time on Radeon 4870 is 3.1 ms; 24 × 24 PCF
filter with 64 samples, 6.7 ms; 24 × 24 MRF filter (3 × 3 PCF), 3.1 ms.
approximate PCF with large kernels, we apply PCF with a small kernel size to a
shadow map with reduced resolution.
Approximating arbitrary kernel size. Let us number shadow maps starting with
zero index assigned to the shadow map with the highest resolution
sampler2D shadowMaps[4] = { shadowMap2048x2048,
shadowMap1024x1024, shadowMap512x512, shadowMap256x256 };
samples of a 3 × 3 PCF. The index of the shadow map that yields an adequate
approximation is then computed from the ratio between the desired kernel size and
this 3 × 3 base kernel.
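A sketch of one way to compute such an index, assuming the four shadow maps declared above (the 2048 × 2048 base map plus its three downsampled versions); the function and parameter names are illustrative.

// Chooses which downsampled shadow map to use so that its 3x3 PCF
// approximates a desiredKernel x desiredKernel filter on the full-
// resolution (2048x2048) map. Each level halves the resolution, so a
// 3x3 kernel on level i covers roughly a (3 * 2^i)^2 area of level 0.
int SelectShadowMapIndex(float desiredKernel)
{
    float level = log2(max(desiredKernel, 3.0f) / 3.0f);
    return clamp((int)ceil(level), 0, 3);   // index into shadowMaps[4]
}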
Pros and cons. MRF enables creating large penumbrae with only a few depth-texture
samples. Since it is based on a small number of samples, it allows the computation
of shadow intensities using relatively complex filter kernels, which can produce
continuous values without the noise inherent to plain PCF kernels with a low number
of samples. As a result, MRF does not require any type of postfiltering. Compared
to prefiltering methods (e.g., [Donnelly and Lauritzen 06], [Annen et al. 07],
[Annen et al. 08]), MRF does not introduce approximation-specific artifacts, such as
light leaking, ringing, or precision deficiencies. Moreover, since MRF is based on a
regular PCF, it is possible to utilize existing hardware features, for example
hardware bilinear PCF or double-speed depth-only rendering.
Creating tiles hierarchy. We start by projecting the view frustum onto the near
plane of the light’s frustum. The projected frustum is clipped against a grid
defined on this plane. This grid is view independent and static (does not change
from frame to frame). Each grid cell intersecting with the frustum is called a
tile. If a tile is closer to the projected frustum’s top than a certain threshold
distance, we subdivide it into four equal cells (in a quadtree-like fashion). Each
resulting cell is tested against the frustum, and the cells intersecting with it are
called child tiles. We continue the subdivision process to obtain a hierarchy of
tiles as shown in Figure 2.3. Note that, unlike, for example, the cascaded shadow
maps algorithm, here we do not subdivide the frustum itself; instead we use the
frustum to determine which grid cells it intersects.
Scene rendering. Since all shadow maps are equal in size, the shadow map sam-
pling function requires knowledge only of the shadow map’s offset within the atlas.
These offsets are stored in a small dynamic index texture (we use a 128 × 128
texture). One can think of index texture as a standard shadow map that covers
the entire view range, but instead of depth values, each texel contains the off-
sets, which are used to locate the actual tile shadow map in the atlas. Shadow
map sampling code is given in Listing 2.1. As one can see, the only difference is
index-texture indirection.
Pros and cons. ASM enables rendering of highly detailed shadows. Similar shadow
mapping quality can be achieved only with standard shadow mapping when using
a shadow map of very high resolution (for example, the resolution of an equiva-
lent shadow map in Age of Conan is 16384 × 16384). Unlike projection-modifying
approaches, such as [Stamminger and Drettakis 02] or [Martin and Tan 04], ASM
does not suffer from temporal aliasing (since the tile’s shadow map projection
float standardShadowMapSampling(float4 samplePosition)
{
    float4 shadowMapCoords =
        mul(samplePosition, shadowProjectionMatrix);
    return PCF(shadowMapTexture, shadowMapCoords);
}

float shadowMapSamplingASM(float4 samplePosition)
{
    float4 indexCoords =
        mul(samplePosition, shadowProjectionMatrix);
    float3 offset = tex2D(indexTexture, indexCoords.xy).xyz;
    float2 C = float2(tileShadowMapSize / atlasSize, 1);
    float3 shadowMapCoords = indexCoords.xyz * C.xxy + offset;
    return PCF(atlasTexture, shadowMapCoords);
}
matrices are view independent) and offers an intuitive control over the shadow
map’s texel distribution.
However, ASM imposes certain restrictions on a scene’s granularity. Even
though such situations do not occur frequently, in some cases we might need to
render a number of shadow maps (we note that in Age of Conan we typically
render one tile per several frames). As an extreme example, consider a scene that
consists of just one huge object; the cost for rendering N shadow maps will be
N times the cost of rendering such a scene. On the other hand, imagine a scene
that consists of objects so small that they always overlap with just one tile; in
this case the cost for rendering N tile shadow maps will be less or equal to the
cost of whole scene. Therefore, a preferred scene should consist of a large number
of lightweight, spatially compact objects rather than a few big and expensive-to-
render geometry chunks.
Provided that granularity of the scene is reasonably fine and there is a certain
frame-to-frame coherency, ASM significantly reduces shadow map rendering costs
compared to standard shadow mapping. In this regard, one can view ASM as a
method for distributing the cost of rendering a single shadow map over multiple
frames.
A fundamental shortcoming of ASM is its inability to handle animated objects
because such objects require updating the shadow map with every frame, while
ASM relies on a shadow maps cache that holds data created over many frames.
Similarly, light’s position or direction cannot be changed on a per-frame basis
because such a change invalidates the cache. In Age of Conan we use a separate
1024 × 1024 shadow map for rendering dynamic objects. Such a shadow map
normally contains only a few objects, so it is inexpensive to render. Moreover,
one can apply prefiltering techniques, (e.g., variance shadow maps), which may
otherwise be problematic from the viewpoint of performance or quality. MRF
naturally applies to the hierarchy of shadow maps produced with ASM.
Occluders fusion. The most outstanding defect of PCF-based soft shadows is in-
correct occluder fusion. The larger the penumbra size is, the more the artifacts
stand out (see e.g., Figure 2.5(a)). The main source of the problem, illustrated in
Figure 2.4, is the inability of a single shadow map to capture information needed
to create shadows from area light. Each texel of a shadow map contains visibil-
ity information for a single light direction only, though light propagates along a
range of directions.
This problem can be reduced relatively easily within the ASM framework.
For a small set of tiles, which are closer to the viewer than a certain threshold
distance, we render two shadow maps instead of one. First we create a regular
shadow map and its corresponding DEM. Let dmax be the shadow map’s range.
Then, we create a layer shadow map, which is identical to the regular one, except
that it is constructed using the fragments with depths within the range
[d + dc, dmax] (fragments with depth outside this range are discarded), where d is the
corresponding minimum depth value from the DEM constructed for the regular
shadow map, and dc is a constant chosen by the user. The penumbra over the scene’s
objects located within the range [0, d + dc] will be created using a regular shadow
map only; thus occluder fusion will not be correct. However, one can use a layer
shadow map to correct the penumbra on the objects located beyond d + dc , as
shown in Figure 2.5(b).
Shadow map layering significantly improved image quality in Age of Conan,
removing a vast majority of occluder fusion artifacts. While theoretically one may
utilize more than one layer, in Age of Conan one layer appeared to be sufficient.
Adding more layers did not lead to any noticeable improvements.
(a) PCSS (from NVIDIA Direct3D SDK 10 Code Samples). (b) ASM + MRF + layering.
Figure 2.5. Occluders fusion: PCSS filters out penumbra details, ASM allows keeping
them.
2.4 Results
We implemented our soft shadows algorithm in Funcom’s MMORPG Age of Conan.
Figure 2.6 shows an in-game benchmark of two shadow mapping methods
tested on an Intel Core i7 at 2.66 GHz and an AMD Radeon 4850.
Originally, the shadowing method in the released version of Age of Conan
was standard shadow mapping, and cascaded shadow maps were added a year
later with a patch. As shown in Figure 2.6, standard shadow mapping resulted
in approximately 30% frame rate drop. The cascaded shadow map performance
(not shown here) was worse. Implementing ASM-based soft shadows provided not
only a substantial increase in image quality, but also a significant performance
boost. We use ASM (with MRF and layering) to create shadows from static
objects, and a separate 1024 × 1024 shadow map for dynamic objects, which is
filtered with a fixed-size 3 × 3 PCF kernel.
Bibliography
[Annen et al. 07] Thomas Annen, Tom Mertens, P. Bekaert, Hans-Peter Seidel, and
Jan Kautz. “Convolution Shadow Maps.” In European Symposium on Rendering,
pp. 51–60. Aire-la-Ville, Switzerland: Eurographics Association, 2007.
[Annen et al. 08] Thomas Annen, Tom Mertens, Hans-Peter Seidel, Eddy Flerackers,
and Jan Kautz. “Exponential Shadow Maps.” In GI ’08: Proceedings of Graphics
Interface 2008, pp. 155–161. Toronto, Canada: Canadian Information Processing
Society, 2008.
[Donnelly and Lauritzen 06] William Donnelly and Andrew Lauritzen. “Variance
Shadow Maps.” In I3D ’06: Proceedings of the 2006 Symposium on Interactive
3D Graphics and Games, pp. 161–165. New York: ACM, 2006.
[Fernando et al. 01] Randima Fernando, Sebastian Fernandez, Kavita Bala, and Don-
ald P. Greenberg. “Adaptive Shadow Maps.” In SIGGRAPH ’01: Proceedings of
the 28th Annual Conference on Computer Graphics and Interactive Techniques,
pp. 387–390. New York: ACM, 2001.
[Fernando 05] Randima Fernando. “Percentage-Closer Soft Shadows.” In SIGGRAPH
’05: ACM SIGGRAPH 2005 Sketches, p. 35. New York: ACM, 2005.
[Isidoro and Sander 06] John R. Isidoro and Pedro V. Sander. “Edge Masking and
Per-Texel Depth Extent Propagation For Computation Culling During Shadow
Mapping.” In ShaderX5: Advanced Rendering Techniques. Hingham, MA: Charles
River Media, 2006.
[Martin and Tan 04] Tobias Martin and Tiow-Seng Tan. “Anti-Aliasing and Continu-
ity with Trapezoidal Shadow Maps.” In Proceedings of Eurographics Symposium
on Rendering, pp. 153–160. Aire-la-Ville, Switzerland: Eurographics Association,
2004.
[Stamminger and Drettakis 02] Marc Stamminger and George Drettakis. “Perspective
Shadow Maps.” In SIGGRAPH ’02: Proceedings of the 29th Annual Conference
on Computer Graphics and Interactive Techniques, pp. 557–562. New York: ACM,
2002.
3
IV
Adaptive Volumetric
Shadow Maps
Marco Salvi, Kiril Vidimče, Andrew Lauritzen,
Aaron Lefohn, and Matt Pharr
This chapter describes adaptive volumetric shadow maps (AVSM), a new ap-
proach for real-time shadows that supports high-quality shadowing from dynamic
volumetric media such as hair and smoke. AVSMs compute approximate volu-
metric shadows for real-time applications such as games, for which predictable
performance and a fixed, small memory footprint are required (and for which
approximate solutions are acceptable).
We first introduced AVSM in a paper at the 2010 Eurographics Symposium on
Rendering [Salvi et al. 10]; this chapter reviews the main ideas in the paper and
details how to efficiently implement AVSMs on DX11-class graphics hardware.
AVSMs are practical on today’s high-end GPUs; for example, rendering Figure 3.4
requires 8.6 ms with opacity shadow maps (OSMs) and 12.1 ms with AVSMs—an
incremental cost of 3.5 ms to both build the AVSM data structure and to use it
for final rendering.
Figure 3.1. This image shows self-shadowing smoke and hair, both seamlessly rendered
into the same adaptive volumetric shadow map. (Hair model courtesy of Cem Yuksel).
media have limited their use in real-time applications. Existing solutions for real-
time volumetric shadowing exhibit slicing artifacts due to nonadaptive sampling,
cover only a limited depth range, or are limited to one type of media (e.g., only
hair, only smoke, etc.).
Adaptive shadow representations such as deep shadow maps have been used
widely in offline rendering [Lokovic and Veach 00, Xie et al. 07]. Deep shadow
maps store an adaptive, lossy-compressed representation of the visibility function
for each light-space texel, though it is not clear how they can be implemented
efficiently enough for real-time performance on today’s graphics hardware, due
to their high costs in terms of storage and memory bandwidth.
Many volumetric shadowing techniques have been developed for interactive
rendering. See our paper [Salvi et al. 10] for a detailed discussion of previous
approaches; here we will highlight the most widely known alternatives. A number of
approaches discretize space into regularly spaced slices,
for example opacity shadow maps [Kim and Neumann 01]. These methods typ-
ically suffer from aliasing, with variations specialized to handle small particles
that can display view-dependent shadow popping artifacts even with static vol-
umes [Green 08]. Deep opacity maps improve upon opacity shadow maps specif-
ically for hair rendering by warping the sampling positions in the first depth
layer [Yuksel and Keyser 08]. Occupancy maps also target hair rendering and
use regular sampling, but capture many more depth layers than opacity or deep
opacity shadow maps by using only one bit per layer. However, they are limited
to volumes composed of occluders with identical opacity [Sintorn and Assar-
son 09]. Mertens et al. describe a fixed-memory shadow algorithm for hair that
adaptively places samples based on a k-means clustering estimate of the transmit-
tance function, assuming density is uniformly distributed within a small number
of clusters [Mertens et al. 04]. Recently, Jansen and Bavoil introduced Fourier
opacity mapping, which addresses the problem of banding artifacts, but where
the detail in shadows is limited by the depth range of volume samples along a
ray and may exhibit ringing artifacts [Jansen and Bavoil 10]. Finally, Enderton
et al. [Enderton et al. 10] have introduced a technique for handling all types of
transparent occluders in a fixed amount of storage for both shadow and primary
visibility, generating a stochastically sampled visibility function, though their
approach requires a large number of samples for good results.
AVSM generates an adaptively sampled representation of the volumetric trans-
mittance in a shadow-map-like data structure, where each texel stores a com-
pact approximation of the transmittance curve along the corresponding light
ray. AVSM can capture and combine transmittance data from arbitrary dynamic
occluders, including combining soft media like smoke and well-localized denser
media such as hair. It is thus both a versatile and a robust approach, suitable
for handling volumetric shadows in a variety of situations in practice. The main
innovation introduced by AVSM is a new, streaming lossy compression algorithm
that is capable of building a constant-storage, variable-error representation of
visibility for later use in shadow lookups.
[Plot: two transmittance-versus-depth curves, A and B.]
Each AVSM texel stores a fixed number of nodes; more nodes allow for a better approximation of transmittance
and higher quality shadows, but at the expense of increased storage and compu-
tational costs. We have found that, in practice, 8–12 nodes (a cost of 64–96 bytes
per texel in the AVSM when full precision is used) give excellent results.
In HLSL code, our AVSM nodes are implemented with a simple structure:
#define AVSM_NODE_COUNT 8
#define AVSM_RT_COUNT (AVSM_NODE_COUNT / 4)
struct AVSMData
{
float4 depth[AVSM_RT_COUNT];
float4 trans[AVSM_RT_COUNT];
};
requires support for read-modify-write framebuffer operations in the pixel shader. DirectX11
adds the ability to perform unordered read-modify-write operations on certain buffer types in
the pixel shader; however, for AVSM’s transmittance-curve-simplification algorithm we need
to ensure that each pixel’s framebuffer memory is modified by only one fragment at a time
(per-pixel lock). Because current DX11 HLSL compilers forbid per-pixel locks, we implement
AVSM with a variable-memory version that uses the current DX11 rendering pipeline to first
capture all transparent fragments seen from the light and then compress them into the
AVSM representation.
RWStructuredBuffer<ListTexNode> gListTexSegmentNodesUAV;
RWTexture2D<uint>               gListTexFirstNodeAddrUAV;

// Allocate a generic node
bool AllocNode(out uint newNodeAddress)
{
    // alloc a new node
    newNodeAddress = gListTexNodesUAV.IncrementCounter();
    // running out of memory?
    return newNodeAddress < MAX_BUFFER_NODES;
}

    newNode.next = oldNodeAddress;
    gListTexNodesUAV[newNodeAddress] = newNode;
}
Listing 3.1. Node allocation and insertion code of a generic list (DirectX11/Shader
Model 5).
This process can also be used with transparent objects with finite extent
and uniform density in depth—not just billboards. Each object’s fragment can
store in the list start and end points along the corresponding ray from the light
source to define a segment, along with exit transmittance (entry transmittance
is implicitly assumed to be set to 1). For example, given billboards representing
spherical particles, we insert a segment representing the ray’s traversal through
the particle; for hair we insert a short segment where the light enters and exits
the hair strand; for opaque blockers, we insert a short, dense segment that takes
the transmittance to zero at the exit point.
Doing so not only saves a considerable amount of storage and bandwidth, but also makes
lookups very efficient. In general, the number of transparent fragments at a pixel
will be much larger than the number of AVSM nodes (e.g., Figure 3.6 shows
a transmittance curve with 238 nodes and its 12-node counterpart compressed
with AVSM); therefore, we use a streaming compression algorithm that in a single
(post-list-creation) rendering pass approximates the original curve.
Each node of our piecewise transmittance curve maps to an ordered sequence
of pairs (di , ti ) that encode node position (depth) along a light ray and its associ-
ated transmittance. AVSMs store the transmittance curve as an array of depth-
transmittance pairs (di , ti ) using two single-precision floating-point values2 . An
important ramification of our decision to use a fixed, small number of nodes is
that the entire compressed transmittance curve can fit in on-chip memory during
compression. As with classic shadow maps we clear depth to the far plane value,
while transmittance is set to 1 in order to represent empty space.
We insert each new occluding segment by viewing it as a compositing oper-
ation between two transmittance curves, respectively representing the incoming
blocker and the current transmittance curve. Given two light blockers, A and
B, located along the same light ray, we write the density function fAB (x) as
a sum of their density functions fA(x) and fB(x). By simply applying Equation (3.1)
we can compute their total transmittance:
t_tot(z) = e^{−∫_0^z f_AB(x) dx} = e^{−∫_0^z f_A(x) dx} · e^{−∫_0^z f_B(x) dx} = t_A(z) t_B(z).   (3.2)
interest in many fields and has been studied in cartography, computer graphics, and elsewhere;
see our EGSR paper for more references on this topic.
The removal of an internal ith node affects only the area of the two trapezoids
that share it. Since the rest of the curve is unaffected, we compute the variation
of its integral Δt_i with a simple, geometrically derived formula:
Δt_i = |(d_{i+1} − d_{i−1})(t_{i+1} − t_i) − (d_{i+1} − d_i)(t_{i+1} − t_{i−1})|.
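To illustrate how this metric typically drives the compression step, the sketch below removes the interior node whose removal changes the curve integral the least, from a curve that temporarily holds one node too many after a segment insertion. The array-based formulation and the names are illustrative simplifications; as noted with Listing 3.2, the actual implementation is written so that the curve stays in registers.

// Removes the interior node with the smallest area change from a
// curve that temporarily holds AVSM_NODE_COUNT + 1 nodes.
void RemoveLeastImportantNode(inout float depth[AVSM_NODE_COUNT + 1],
                              inout float trans[AVSM_NODE_COUNT + 1])
{
    int   bestIdx  = 1;
    float bestCost = 3.402823466e38f;   // FLT_MAX

    // The first and last nodes are never removed.
    for (int i = 1; i < AVSM_NODE_COUNT; ++i)
    {
        // Delta t_i from the formula above.
        float cost = abs((depth[i + 1] - depth[i - 1]) * (trans[i + 1] - trans[i])
                       - (depth[i + 1] - depth[i])     * (trans[i + 1] - trans[i - 1]));
        if (cost < bestCost) { bestCost = cost; bestIdx = i; }
    }

    // Shift the remaining nodes over the removed one.
    for (int j = bestIdx; j < AVSM_NODE_COUNT; ++j)
    {
        depth[j] = depth[j + 1];
        trans[j] = trans[j + 1];
    }
}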
In practice, due to the lossy compression, the order in which segments are
inserted can affect the results. In particular, when generating the per-pixel linked
lists in the previous pass, the parallel execution of pixel shaders inserts segments
into the linked lists in an order that may vary per-frame even if the scene and
view are static. Inconsistent ordering can result in visible temporal artifacts,
although they are mostly imperceptible in practice when using eight or more
AVSM nodes or when the volumetric media is moving quickly (e.g., billowing
smoke). In those rare cases when a consistent ordering cannot be preserved and
the number of nodes is not sufficient to hide these artifacts, it is also possible to
sort the captured segments by depth via an insertion sort before inserting them.
We discuss the cost of this sort in Section 3.3.3.
the two nodes that bound the shadow receiver of depth d, we then interpolate
the bounding nodes’ transmittance (tl , tr ) to intercept the shadow receiver.
In order to locate the two nodes that bound the receiver depth (i.e., a seg-
ment), we use a fast two-level search; since our representation stores a fixed
number of nodes, memory accesses tend to be coherent and local, unlike with
variable-length linked-list traversals necessary with techniques like deep shadow
maps [Lokovic and Veach 00]. In fact, the lookups can be implemented entirely
with compile-time (static) branching and array indexing, allowing the compiler
to keep the entire transmittance curve in registers. Listing 3.3 shows an imple-
mentation of our AVSM segment-finding algorithm specialized for an eight node
visibility curve, which is also used for both segment insertion and sampling/fil-
tering5 .
As we do at segment-insertion time, we again assume space between two nodes
to exhibit uniform density, which implies that transmittance varies exponentially
between each depth interval (see Equation (3.1)), although we have found linear
interpolation to be a faster and visually acceptable alternative:
T(d) = t_l + (d − d_l) · (t_r − t_l) / (d_r − d_l).
This simple procedure is the basis for point filtering. Bilinear filtering is straight-
forward; the transmittance T (d) is evaluated over four neighboring texels and
linearly weighted.
struct AVSMSegment
{
    int   index;
    float depthA;
    float depthB;
    float transA;
    float transB;
};
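A sketch of how the segment data can be turned into a point-filtered (and then bilinearly filtered) transmittance lookup; FindSegmentAVSM8 stands in for the segment-finding routine of Listing 3.3, and all names here are illustrative.

// Point-filtered AVSM lookup: find the two nodes that bound the
// receiver depth and interpolate their transmittance linearly.
float AVSMPointSample(AVSMData data, float receiverDepth)
{
    AVSMSegment seg = FindSegmentAVSM8(data, receiverDepth); // cf. Listing 3.3
    float t = saturate((receiverDepth - seg.depthA) /
                       max(seg.depthB - seg.depthA, 1e-7f));
    return lerp(seg.transA, seg.transB, t);
}

// Bilinear filtering: evaluate T(d) at the four neighboring texels
// and weight the results with the usual bilinear weights.
float AVSMBilinearSample(AVSMData d00, AVSMData d10, AVSMData d01, AVSMData d11,
                         float2 fracUV, float receiverDepth)
{
    float t00 = AVSMPointSample(d00, receiverDepth);
    float t10 = AVSMPointSample(d10, receiverDepth);
    float t01 = AVSMPointSample(d01, receiverDepth);
    float t11 = AVSMPointSample(d11, receiverDepth);
    return lerp(lerp(t00, t10, fracUV.x), lerp(t01, t11, fracUV.x), fracUV.y);
}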
5 Please see the accompanying demo source code for a generalized implementation that sup-
    // We start by identifying the render target that..
    // ..contains the nodes we are looking for..
    if (receiverDepth > data.depth[0][3]) {
        depth      = data.depth[1];
        trans      = data.trans[1];
        leftDepth  = data.depth[0][3];
        leftTrans  = data.trans[0][3];
        rightDepth = data.depth[1][3];
        rightTrans = data.trans[1][3];
        Output.index = 4;
    } else {
        depth      = data.depth[0];
        trans      = data.trans[0];
        leftDepth  = data.depth[0][0];
        leftTrans  = data.trans[0][0];
        rightDepth = data.depth[1][0];
        rightTrans = data.trans[1][0];
        Output.index = 0;
    }

    // ..we then look for the exact nodes that wrap..
    // ..around the shadow receiver.
    if (receiverDepth <= depth[0]) {
        Output.depthA = leftDepth;
        Output.depthB = depth[0];
        Output.transA = leftTrans;
        Output.transB = trans[0];
    } else if (receiverDepth <= depth[1]) {
        ....
        ....
    } else {
        Output.index += 4;
        Output.depthA = depth[3];
        Output.depthB = rightDepth;
        Output.transA = trans[3];
        Output.transB = rightTrans;
    }

    return Output;
}
3.3 Comparisons
We have compared AVSM to a ground-truth result, deep shadow maps (DSM),
Fourier opacity maps (FOM), and opacity shadow maps (OSM). All techniques
were implemented using the DirectX11 rendering and compute APIs.
All results were gathered on an Intel Core i7 quad-core CPU running at 3.33 GHz
under Windows 7 (64-bit), with an ATI Radeon 5870 GPU.
    float tempDepth, tempTrans;

    // Insert last segment node
    [flatten] if (i == postMoveSegmentEndIdx) {
        tempDepth = segmentDepth[1];
        tempTrans = newNodesTransOffset[1];
    // Insert first segment node
    } else if (i == postMoveSegmentStartIdx) {
        tempDepth = segmentDepth[0];
        tempTrans = newNodesTransOffset[0];
    // Update all nodes in between the two new nodes
    } else if ((i > postMoveSegmentStartIdx) &&
               (i < postMoveSegmentEndIdx)) {
        tempDepth = depth[i - 1];
        tempTrans = trans[i - 1];
    // Update all nodes located behind the two new nodes
    } else if ((i > 1) && (i > postMoveSegmentEndIdx)) {
        tempDepth = depth[i - 2];
        tempTrans = trans[i - 2];
    // Update all nodes located in front of the two new nodes
    } else {
        tempDepth = depth[i];
        tempTrans = trans[i];
    }

    // Linearly interpolate transmittance along the incoming..
    // ..segment and composite it with the current curve
    tempTrans *= Interp(segmentDepth[0], segmentDepth[1],
                        FIRST_NODE_TRANS_VALUE,
                        segmentTransmittance, tempDepth);
Listing 3.2. Segment insertion code for AVSMs. Note that there is no dynamic branch-
ing nor dynamic indexing in this implementation, which makes it possible for interme-
diate values to be stored in registers and for efficient GPU execution.
Figure 3.3. A comparison of smoke added to a scene from a recent game title with AVSM
with 12 nodes (left) and deep shadow maps (right). Rendering the complete frame takes
approximately 32 ms, with AVSM generation and lookups consuming approximately 11
ms of that time. AVSM is 1–2 orders of magnitude faster than a GPU implementation
of deep shadow maps and the uncompressed algorithm, yet produces a nearly identical
result. (Thanks to Valve Corporation for the game scene.)
Figure 3.4. Comparison of AVSM, Fourier opacity maps, and opacity shadow maps to
the ground-truth uncompressed result in a scene with three separate smoke columns
casting shadows on each other: AVSM with eight nodes (top left), ground-truth uncom-
pressed (top right), Fourier opacity maps with 16 expansion terms (bottom left), and
opacity shadow maps with 32 slices (bottom right). Note how closely AVSM matches
the ground-truth image. While the artifacts of the other methods do not appear prob-
lematic in these still images, the artifacts are more apparent when animated. Note that
the difference images have been enhanced by 4× to make the comparison more clear.
Figure 3.5. This scene compares (from left to right) AVSM (12 nodes), uncompressed,
opacity shadow maps (32 slices), and Fourier opacity maps (16 expansion terms). Note
that AVSM-12 and uncompressed are nearly identical and the other methods show
substantial artifacts. In particular FOM suffers from severe over-darkening/ringing
problems generated by high-frequency light blockers like hair and by less-than-optimal
depth bounds. Also note that these images use only bilinear shadow filtering. Using a
higher-quality filtering kernel substantially improves the shadow quality.
[Plot: transmittance (0–1, vertical axis) versus depth (0–300, horizontal axis) for
Uncompressed (238 nodes), Fourier Opacity Maps (16 terms), Opacity Shadow Maps
(32 slices), Adaptive Volumetric Shadow Maps (12 nodes), and Deep Shadow Maps.]
Figure 3.6. Transmittance curves computed for a scene with a mix of smoke and hair
for AVSM (12 nodes) and the ground-truth uncompressed data (238 nodes). The hairs
generate sharp reductions in transmittance, whereas the smoke generates gradually
decreasing transmittance. AVSM matches the ground-truth data much more closely
than the other real-time methods.
transitions in the smoky regions. Note that the 12-node AVSM matches the
ground-truth data much more closely than the opacity or Fourier shadow map
(both of which use more memory than AVSM to represent shadow data) and
is similar to the deep shadow map but uses less memory and is 1–2 orders of
magnitude faster.
Table 3.1. Performance and memory results for 2562 resolution, adaptive volumetric
shadow maps (AVSM) with 4, 8 and 16 nodes, opacity shadow maps (OSM) with 32
slices, Fourier opacity maps (FOM) with 16 expansion terms, deep shadow maps (DSM),
and the ground-truth uncompressed data for the scene shown in Figure 3.4. The AVSM
compression algorithm takes 0.5–1.6 ms to build our representation of the transmittance
curve even when there are hundreds of occluders per light ray. The total memory
required for AVSM and DSM implementations on current graphics hardware is the size
of the buffer used to capture the occluding segments plus the size of the compressed
shadow map (shown in parentheses).
the segments (via insertion sort) before compression takes 3 ms6 . As discussed
earlier, the errors arising from not sorting are often imperceptible so sorting can
usually be skipped—reducing the AVSM render-time to be nearly identical to
that of opacity and Fourier opacity maps.
There are two key sources to AVSM performance. First is the use of a stream-
ing compression algorithm that permits direct construction of a compressed trans-
mittance representation without first building the full uncompressed transmit-
tance curve. The second is the use of a fixed, small number of nodes such that
the entire representation can fit into on-chip memory. While it may be possible
to create a faster deep shadow map implementation than ours, sampling deep
shadow maps’ variable-length linked lists is costly on today’s GPUs, and it may
result in low SIMD efficiency. In addition, during deep shadow map compression,
it is especially challenging to keep the working set entirely in on-chip memory.
Table 3.1 also shows the memory usage for AVSM, deep shadow maps, and the
uncompressed approach for the smoke scene shown in Figure 3.4. Note that the
memory usage for the variable-memory algorithms shows the amount of memory
allocated, not the amount actually used per frame by the dynamically generated
linked lists.
We don’t currently have an optimized list-sorting implementation, but we expect it is
possible to do significantly better than our current method.
3.5 Acknowledgments
We thank Jason Mitchell and Wade Schin from Valve Software for the Left 4 Dead 2
scene and their valuable feedback; and Natasha Tatarchuk and Hao Chen from Bungie
and Johan Andersson from DICE for feedback on early versions of the algorithm.
Thanks to the entire Advanced Rendering Technology team, Nico Galoppo and
Doug McNabb at Intel for their contributions and support. We also thank others at In-
tel: Jeffery Williams and Artem Brizitsky for help with art assets; and Craig Kolb, Jay
Connelly, Elliot Garbus, Pete Baker, and Mike Burrows for supporting the research.
Bibliography
[Enderton et al. 10] Eric Enderton, Erik Sintorn, Peter Shirley, and David Luebke.
“Stochastic Transparency.” In I3D ’10: Proceedings of the 2010 Symposium on
Interactive 3D Graphics and Games, pp. 157–164. New York: ACM, 2010.
[Green 08] Simon Green. “Volumetric Particle Shadows.” [Link]
[Link]/compute/cuda/sdk/website/C/src/smokeParticles/doc/
[Link], 2008.
[Jansen and Bavoil 10] Jon Jansen and Louis Bavoil. “Fourier Opacity Mapping.” In
I3D ’10: Proceedings of the 2010 Symposium on Interactive 3D Graphics and
Games, pp. 165–172. New York: ACM, 2010.
[Kim and Neumann 01] Tae-Yong Kim and Ulrich Neumann. “Opacity Shadow Maps.”
In Rendering Techniques 2001: 12th Eurographics Workshop on Rendering,
pp. 177–182. Aire-la-Ville, Switzerland: Eurographics Association, 2001.
[Lokovic and Veach 00] Tom Lokovic and Eric Veach. “Deep Shadow Maps.” In
Proceedings of ACM SIGGRAPH 2000, Computer Graphics Proceedings, ACS,
pp. 385–392. New York: ACM, 2000.
[Mertens et al. 04] Tom Mertens, Jan Kautz, Philippe Bekaert, and F. van Reeth.
“A Self-Shadowing Algorithm for Dynamic Hair using Clustered Densities.” In
Rendering Techniques 2004: Eurographics Symposium on Rendering. Aire-la-Ville,
Switzerland: Eurographics, 2004.
[Salvi et al. 10] Marco Salvi, Kiril Vidimče, Andrew Lauritzen, and Aaron Lefohn.
“Adaptive Volumetric Shadow Maps.” In Eurographics Symposium on Rendering,
pp. 1289–1296. Aire-la-Ville, Switzerland: Eurographics Association, 2010.
[Sintorn and Assarson 09] Erik Sintorn and Ulf Assarson. “Hair Self Shadowing and
Transparency Depth Ordering Using Occupancy Maps.” In I3D ’09: Proceedings
of the 2009 Symposium on Interactive 3D Graphics and Games, pp. 67–74. New
York: ACM, 2009.
[Williams 78] Lance Williams. “Casting Curved Shadows on Curved Surfaces.” Com-
puter Graphics (Proceedings of SIGGRAPH 78) 12:3 (1978), 270–274.
[Xie et al. 07] Feng Xie, Eric Tabellion, and Andrew Pearce. “Soft Shadows by Ray
Tracing Multilayer Transparent Shadow Maps.” In Rendering Techniques 2007:
18th Eurographics Workshop on Rendering, pp. 265–276. Aire-la-Ville, Switzerland:
Eurographics Association, 2007.
[Yang et al. 10] Jason Yang, Justin Hensley, Holger Grün, and Nicolas Thibieroz.
“Real-Time Concurrent Linked List Construction on the GPU.” In Rendering
Techniques 2010: Eurographics Symposium on Rendering, pp. 51–60. Aire-la-Ville,
Switzerland: Eurographics Association, 2010.
[Yuksel and Keyser 08] Cem Yuksel and John Keyser. “Deep Opacity Maps.” Computer
Graphics Forum 27:2 (2008), 675–680.
4
IV
Fast Soft Shadows With Temporal Coherence
Daniel Scherzer et al.
4.1 Introduction
In computer graphics applications, soft shadows are usually generated using either
a single shadow map together with some clever filtering method (which is fast,
but inaccurate), or by calculating physically correct soft shadows with light-source
area sampling [Heckbert and Herf 97]. Many shadow maps from random positions
on the light source are created (which is slow) and the average of the resulting
shadow tests is taken (see Figure 4.1).
Figure 4.1. Light sampling with one, two, three and 256 shadow maps (left to right).
We present a soft shadow algorithm that combines the benefits of these two
approaches by employing temporal coherence: the light source is sampled over
multiple frames instead of a single frame, creating only a single shadow map with
each frame. The individual shadow test results are stored in a screen-space (of the
camera) shadow buffer (see Figure 4.2). Each shadow map can be focused on the
currently visible part of the scene at creation time, because only fragments in the
screen space of the camera remain stored in the shadow buffer. This buffer is
recreated with each frame using
the shadow buffer from the previous frame Bprev as input (ping-pong style).
The input Bprev holds shadowing information only for pixels that were visible
in the previous frame. Pixels that become newly visible in the current frame
due to camera (or object) movement (so-called disocclusions) have no shadowing
information stored in this buffer. For these pixels we use spatial filtering to
estimate the soft shadow results.
Our approach is faster than typical single-sample soft shadow approaches like
PCSS, but provides physically accurate results and does not suffer from typical
single-sample artifacts. It also works on moving objects by marking them in the
shadow map and falling back to a standard single-sample approach in these areas.
4.2 Algorithm
The main idea of our algorithm is to formulate light-source area sampling in an
iterative manner, evaluating only a single shadow map per frame. We start by
looking at the math for light-source area sampling: given n shadow maps, we
can calculate the soft-shadow result for a given pixel p by averaging over the
hard-shadow results si calculated for each shadow map. This is given by
ψ_n(p) = (1/n) Σ_{i=1}^{n} s_i(p).   (4.1)
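Written out, the iterative formulation is the usual running average; this recurrence follows directly from Equation (4.1):

ψ_n(p) = ((n − 1) ψ_{n−1}(p) + s_n(p)) / n = ψ_{n−1}(p) + (s_n(p) − ψ_{n−1}(p)) / n,

so each frame only the newest hard-shadow result s_n has to be evaluated and blended into the stored average.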
(b) Check if the pixel was visible in the last frame and therefore has as-
sociated shadowing information stored in the shadow buffer (see Sec-
tion 4.2.1):
Yes: Combine information from the shadow buffer with SM (see Sec-
tion 4.2.2).
[Flowchart: Disoccluded? yes/no; use the soft shadow for scene lighting; store updated
values in Bcur (α-channel: average blocker depth, used by the neighborhood filter).]
// shadow map sampling coordinates
const float2 smCoord = texSpace(input.LightPos);
// linear depth of current pixel in light space
const float fragDepth = input.LightPos.z;
// sample depth in shadow map
const float Depth = getSMtexel(smCoord);
// store hard shadow test result as initial sum
float ShadowSum = shadowTest(Depth, fragDepth);
To detect pixels that were not visible in the previous frame we first check if
pprev is inside Bprev in the x- and y-direction and then we check the z (i.e., the
depth) difference between pprev and the corresponding entry in Bprev at position
pprev . If this difference exceeds a certain threshold, we conclude that this pixel
was not visible in the previous frame (see Listings 4.2 and 4.3).
Figure 4.3. Back-projection of a single pixel (left). If we do this for every pixel we
virtually transform the previous frame into the current, except for the pixels that were
not visible in the previous frame (shown in red (right)).
bool outsideTexture(float2 Tex) {
    return any(bool2(Tex.x < 0.0f, Tex.y < 0.0f))
        || any(bool2(Tex.x > 1.0f, Tex.y > 1.0f));
}
Listing 4.2. Helper function for checking the validity of texture coordinates.
The obtained position will normally not be at an exact pixel center in Bprev
except in the special case that no movement has occurred. Consequently, texture
filtering should be applied during the lookup in the shadow buffer Bprev . In
practice, the bilinear filtering offered by graphics hardware shows good results.
// previous shadow buffer sampling coordinates:
const float2 shadowBuffTexC = texSpace(input.BufferPos);
// check if the pixel is inside the previous shadow buffer:
if (!outsideTexture(shadowBuffTexC)) {
    // inside of previous data -> we can try to re-use information!
    float4 oldData = getShadowBufferTexel(shadowBuffTexC);
    const float oldDepth = oldData.x;
    // check if depths are alike, so we can re-use information
    if (abs(1 - input.BufferPos.z / oldDepth) < EPSILON_DEPTH) {
        // old data available -> use it, see next section
        ...
    }
}
Listing 4.3. Test if the data for the current pixel was available in the previous shadow
buffer.
Listing 4.4. Combination of a hard shadow and the data from the shadow buffer.
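A sketch of what such a combination can look like, assuming the shadow-buffer layout described with Listing 4.8 (x = depth, y = accumulated shadow-test sum, z = sample count, w = average blocker depth); the helper below is illustrative, not the chapter’s exact listing.

// Combine the current hard shadow test with the data reused from the
// previous shadow buffer (layout assumed: depth, sum, n, avgBlockerDepth).
void combineWithHistory(float4 oldData, float hardShadow,
                        out float shadowSum, out float n,
                        out float softShadow, out float avgBlockerDepth)
{
    shadowSum       = oldData.y + hardShadow;
    n               = oldData.z + 1.0f;
    softShadow      = shadowSum / n;   // running average, Equation (4.1)
    avgBlockerDepth = oldData.w;       // reused by the penumbra estimation
}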
float neighborhoodFilter(const float2 uv,
                         const float2 filterRadiusUV,
                         const float currentDepth) {
    float sampleSum = 0, numSamples = 0;
    for (int i = 0; i < NUM_POISSON_SAMPLES; ++i) {
        const float2 offset = poissonDisk[i] * filterRadiusUV;
        const float3 data = getShadowBufferTexel(uv + offset).xyz;
Listing 4.5. Soft shadow estimation by filtering the shadow buffer neighborhood.
If these neighboring pixels have not been recently disoccluded, they are very
likely to provide a good approximation of the correct soft-shadow value and will
help to avoid discontinuities between the shadowed pixels.
The filter radius is calculated using the same penumbra estimation as in the
PCSS algorithm [Fernando 05]. The idea is to approximate all occluders in a
search area around the pixel by one planar occluder at depth z_avg. Using the
intercept theorem and the relations between the pixel depth z_receiver and the
light-source size w_light, an estimate of the penumbra width w_penumbra (see
Figure 4.4) is given by
w_penumbra = w_light · (z_receiver − z_avg) / z_avg.
Figure 4.4. The sampling radius of the neighborhood filter depends on the scene depth
and an estimated penumbra size (left). The penumbra width can be approximated by
using the intercept theorem (right).
The calculation of the average occluder depth is done by searching for potential
blockers in the shadow map, and is therefore a computationally costly step—but
in contrast to PCSS, we have to do this step only in the case of a disocclusion.
Otherwise, we store it in the shadow buffer for further use in consecutive frames
(see Section 4.2.4).
In practice, it has been found useful to assign a weight larger than 1 to this
approximation (for one hard shadow map evaluation), to avoid jittering artifacts
in the first few frames after a disocclusion has occurred. Therefore, we use the
number of Poisson samples from the neighborhood filter as weight.
In order to avoid visible discontinuities when switching from the estimate gener-
ated after a disocclusion and the correct result obtained from the shadow buffer
Bprev , the two shadow values are blended. This blended shadow is only used to
improve the visual quality in the first few frames and is not stored in the shadow
buffer. Note that we do not have to estimate the average blocker depth for the
neighborhood filter again, as it has been evaluated and stored in the shadow
buffer directly after the disocclusion! Additionally, this average blocker depth is
refined every frame by adding the additional depth value from the current shadow
map SM (see Listing 4.6).
// calculate standard error with binomial variance estimator
const float error =
    n == 1.0 ? 1.0 : sqrt(softShadow * (1 - softShadow) / (n - 1));

// if we have recently disoccluded samples or a large error,
// support the shadow information with an approximation
if (error >= err_min && avgBlockerDepth > 0) {
    // penumbra estimation like in PCSS, but with the average
    // occluder depth from the history buffer
    const float penumbraEstimation = vLightDimensions[0] *
        ((fragDepth - avgBlockerDepth) / avgBlockerDepth);

    // do spatial filtering in the shadow buffer (screen space):
    const float depthFactor = (nearPlaneDist / input.Depth);
    const float shadowEstimate = neighborhoodFilter(
        shadowBuffTexC, vAspectRatio * depthFactor * penumbraEstimation,
        input.Depth);

    // if shadow estimate valid calculate new soft shadow
    if (shadowEstimate > 0.0f) {
        if (inDisoccludedRegion) {
            // disoccluded sample: only estimated shadow
            // define weight for estimate
            const float estimateWeight = NUM_POISSON_SAMPLES;
            ShadowSum = shadowEstimate * estimateWeight;
            n = estimateWeight;
            softShadow = shadowEstimate;
        } else {
            // blend estimated shadow with accumulated shadow
            // using the error as blending weight
            const float weight = (err_max - error) / (err_max - err_min);
            softShadow =
                shadowEstimate * (1 - weight) + softShadow * weight;
        }
    }
}
After having the soft-shadow result evaluated for each displayed pixel, the
final steps are to
• use the calculated result to modify the scene illumination, and output the
shadowed scene on the corresponding render target, and to
• store the current depth, the number of successful shadow tests sum, the
number of samples n, and the average blocker depth in the new shadow
buffer render target Bcur .
Listing 4.8. Store values in shadow buffer and output rendered image.
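A sketch of what this output step can look like when writing to multiple render targets; the PSOutput structure, the semantics, and the channel order are illustrative assumptions consistent with the list above.

struct PSOutput
{
    float4 color        : SV_Target0;   // shadowed scene color
    float4 shadowBuffer : SV_Target1;   // new shadow buffer Bcur
};

PSOutput outputAndStore(float3 litColor, float softShadow, float pixelDepth,
                        float shadowSum, float n, float avgBlockerDepth)
{
    PSOutput o;
    // modulate the scene lighting with the soft shadow term
    o.color = float4(litColor * softShadow, 1.0f);
    // depth, accumulated shadow tests, sample count, average blocker depth
    o.shadowBuffer = float4(pixelDepth, shadowSum, n, avgBlockerDepth);
    return o;
}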
Figure 4.6. The “age” (i.e., the number of reusable shadow tests) of the fragments in
our walkthrough scene.
Figure 4.7. Known problem cases for single-sample approaches, left to right: overlapping
occluders (our method, PCSS 16/16) and bands in big penumbrae (our method,
PCSS 16/16).
Our algorithm tends to have a slower frame rate in cases of numerous disocclu-
sions, because it has to perform the additional blocker search for the penumbra
estimation. Due to its higher complexity (more ifs), our shader can be slower
than PCSS in such circumstances. As soon as the shadow buffer is exploited and
its values can be reused, our approach can unfold its strength and deliver higher
frame rates, while PCSS still has to do the shadow map lookups. As can be seen
in Figure 4.6, the number of fragments for which buffer data can be reused is
usually high enough to obtain frame rates exceeding those that can be obtained
with PCSS.
In static scenes, the soft shadows generated with our method are physically
accurate and of a significantly better quality than is produced by PCSS, which
suffers from typical single-sample artifacts (see Figure 4.7). For moving objects,
the shadow buffer can hardly be exploited, and we therefore provide a fallback
solution in which spatial neighborhood filtering is applied. Though humans can
hardly perceive the physical incorrectness in such cases, there is room for im-
provement, since some flickering artifacts may remain when dynamic shadows
overlap with static shadows that have large penumbrae.
Bibliography
[Fernando 05] Randima Fernando. “Percentage-Closer Soft Shadows.” In ACM SIG-
GRAPH Sketches, p. 35. New York: ACM, 2005.
[Heckbert and Herf 97] Paul S. Heckbert and Michael Herf. “Simulating Soft Shadows
with Graphics Hardware.” Technical Report CMU-CS-97-104, CS Dept., Carnegie
Mellon University, 1997.
[Scherzer et al. 07] Daniel Scherzer, Stefan Jeschke, and Michael Wimmer. “Pixel-
Correct Shadow Maps with Temporal Reprojection and Shadow Test Confidence.”
In Proceedings Eurographics Symposium on Rendering, pp. 45–50. Aire-la-Ville,
Switzerland: Eurographics Association, 2007.
[Scherzer et al. 09] Daniel Scherzer, Michael Schwärzler, Oliver Mattausch, and Michael
Wimmer. “Real-Time Soft Shadows Using Temporal Coherence.” In Proceedings of
the 5th International Symposium on Advances in Visual Computing: Part II, Lec-
ture Notes in Computer Science (LNCS), pp. 13–24. Berlin, Heidelberg: Springer-
Verlag, 2009.
[Sen et al. 03] Pradeep Sen, Mike Cammarano, and Pat Hanrahan. “Shadow Silhouette
Maps.” ACM Transactions on Graphics (Proceedings of SIGGRAPH) 22:3 (2003),
521–526.
[Wimmer et al. 04] Michael Wimmer, Daniel Scherzer, and Werner Purgathofer. “Light
Space Perspective Shadow Maps.” In Proceedings of Eurographics Symposium on
Rendering 2004. Aire-la-Ville, Switzerland: Eurographics Association, 2004.
5
IV
Mipmapped Screen-Space
Soft Shadows
Alberto Aguado and Eugenia Montiel
This chapter presents a technique for generating soft shadows based on shadow
maps and screen space image filtering. The main idea is to use a mipmap to rep-
resent multifrequency shadows in screen space. The mipmap has two channels:
the first channel stores the shadow intensity values and the second channel stores
screen-space penumbrae widths. Shadow values are obtained by filtering while
penumbrae widths are propagated by flood filling. After the mipmap is gener-
ated, the penumbrae values are used as indices to the mipmap levels. Thus, we
transform the problem of shadow generation into the problem of selecting levels
in a mipmap. This approach is extended by including layered shadow maps to
improve shadows with multiple occlusions.
As with the standard shadow-map technique, the computations in the tech-
nique presented in this chapter are almost independent of the complexity of the
scene. The use of the shadow’s levels of detail in screen space and flood filling
make this approach computationally attractive for real-time applications. The
overhead computation compared to the standard shadow map is about 0.3 ms
per shadow map on a GeForce 8800GTX.
into the camera frame, so its distance to the camera can be compared to the
shadow-map value. This comparison determines if a point is occluded or not;
thus points are either fully shadowed or fully illuminated. The binary nature of
the comparison produces hard shadows, reducing the realism of the scene. As
such, previous works have extended the shadow-map technique to produce soft
shadows.
The technique presented in this chapter filters the result of the shadow map
test. This approach was introduced in the percentage closer filtering (PCF)
[Reeves et al. 87] technique. PCF determines the shadow value of a pixel by
projecting its area into the shadow map. The shadow intensity is defined by the
number of values in the shadow map that are lower than the value at the center
of the projected area. Percentage closer soft shadows (PCSS) [Fernando 05] ex-
tended the PCF technique to include shadows of different widths by replacing the
pixel’s area with a sampling region whose area depends on the distance between
the occluder and the receiver.
The PCSS technique is fast and it provides perceptually accurate soft shadows,
so it has become one of the most often used methods in real-time applications.
However, it has two main issues. First, since it requires sampling a region per
pixel, it can require an excessive number of computations for large penumbrae.
The number of computations can be reduced by using stochastic sampling, but
it requires careful selection of the sampling region in order to avoid artifacts.
Second, to determine the region’s size, it is necessary to estimate an area in
which to search for the blockers. In general, it is difficult to set an optimal
size, since large regions lead to many computations and small regions reduce the
shadows far away from the umbra.
Figure 5.1. The mipmap in image space is built by using the result of the shadow map
test and the distances between the occluder and the receiver.
In [Gumbau et al. 10], instead of filtering by sampling the shadow map, soft
shadows are obtained by filtering the result of the shadow-map comparison in
screen space. The technique in this chapter follows this approach and it intro-
duces a mipmap to represent multiple-frequency shadow details per pixel. As
such, the problem of filtering is solved by selecting a mipmap level for each pixel.
Filtering via mipmapping has been used in previous shadow-map techniques such
as convolution shadow maps [Annen et al. 08] and variance shadow maps [Don-
nelly and Lauritzen 06]. In the technique presented in this chapter, mipmapping
is used to filter screen-space soft shadows. In addition to storing multi-frequency
shadows, the mipmap is exploited to propagate occlusion information obtained
from the occluder and the shadowed surface. Occlusion information is used to
select the shadow frequency as an index to a level in the mipmap. The efficiency
of the mipmap filtering and the screen-space computations make it possible to
create penumbrae covering large areas with little computational overhead and
without stratified sampling.
w(p) = ((d2(p) − d1(p)) / d1(p)) · L.
Here, the value of L represents the size of the light and it is used to control
the penumbrae; by increasing L, shadows in the scene become softer. Once the
penumbra width is computed, the shadow can be filtered by considering a region
k(p) whose size is proportional to it. That is, PCSS uses a proportionality constant
to map the penumbra width to the shadow-map region k(p).
The technique presented in this chapter is based on the PCSS penumbra
estimation, but the values are computed only for the points that pass the shadow-
map test. This is because, for these points, it is not necessary to compute averages
to obtain the occluders’ distances; they can be obtained by fetching the value for
the point p0 in the shadow map. That is, d1 is simply defined by the shadow map
entry. Another difference from PCSS is that the penumbra estimation is not used
to define a filter of the values in the shadow map, but it is used to define a filter
in image space.
// Shadow map value
float shadow_map_val = tex2D(shadowmap_texture0, light_pos);
// Shadow map test
output.color.r = shadow_map_val < light_pos.z - shadow_bias;
// Penumbra width
output.color.b = 0xFFFFFFFF;
if (output.color.r == 1.0f) {
    float distance_factor =
        (light_pos.z / shadow_map_val) - 1.0f;
    output.color.b = distance_factor * L * f * A / pos.z;
}
// Tag the region for region filling
output.color.g = output.color.r;
}
Listing 5.1. Performing the shadow test and computing the penumbrae widths in two
different texture channels.
Figure 5.2 shows examples of the penumbrae values. The image on the left
shows an example scene. The image in the middle shows the penumbrae widths.
The white pixels are points without penumbrae widths. That is, they are points
without occlusions. The pixels’ intensities represent penumbrae widths and they
show the dependency between the occluder and the receiver positions. Occluders
that are far away from the receiver have lighter values than occluders that are
closer. Lighter values indicate large smooth shadows while dark values indicate
that shadows should have well-delineated borders. The image on the right shows
the result of the shadow-map test computed in Listing 5.1. The pixels in this image
have just two intensities that define hard shadows.
Figure 5.2. Example scene (left). Penumbra widths; light values indicate large penumbrae (middle). Hard shadows (right).
Notice that the code in Listing 5.1 computes penumbrae estimations only for
pixels in the hard-shadow region. However, in order to compute soft shadows, it
is necessary to obtain estimates for the points that will define the umbra of the
shadow. The penumbrae widths for these pixels can be obtained by searching
for the closest occluder. Here, the search is performed by a flood-filling technique
implemented in a mipmap. In this approach, each level of the mipmap is manually
created by rendering a single quad. The pixel shader sets as render target the
mipmap level we are computing, and it takes as resource texture the previous
mipmap level. The advantage of this approach is that the reduction in resolution
at each level causes an increase in the search region. Thus, large regions can be
searched with a small number of computations.
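To make this concrete, the following small C++ calculation (an illustration of the idea, not code from the chapter's implementation) prints the approximate full-resolution footprint covered by the fixed 5 × 5 window at successive mipmap levels:

    #include <cstdio>

    int main()
    {
        // A fixed 5x5 window is applied at every mipmap level.  Since each level
        // halves the resolution, the window's footprint measured in full-resolution
        // pixels doubles from one level to the next.
        const int windowSize = 5;
        for (int level = 0; level < 6; ++level)
        {
            int footprint = windowSize << level;   // 5, 10, 20, 40, 80, 160
            std::printf("level %d: ~%dx%d full-resolution pixels\n",
                        level, footprint, footprint);
        }
        return 0;
    }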
The flood-filling implementation requires distinguishing between pixels that
have been filled and pixels that need to be filled. This can be efficiently im-
plemented by using a single bit in a texture. To simplify the presentation, the
implementations in Listing 5.1 and Listing 5.2 use an extra channel on the tex-
ture. In Listing 5.1 the pixels that define the filled region are tagged by setting
the green channel to one. This channel is used, when the mipmap level is created,
as shown in Listing 5.2 to distinguish pixels that should be used during the flood
fill. In Listing 5.2, the value of a pixel is obtained by averaging the values that
form the fill region using a 5 × 5 window in the source texture. The implemen-
tation averages distances so that pixels that are close to several occluders do not
produce sharp shadow discontinuities.
// Number of points in the region
float num = 0.0f;
// Accumulated penumbra widths in the region
float sum = 0.0f;
// Flood fill using a 5x5 window
for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
        float4 t = float4(uv.x - target_shift.x * (-2.5f + i),
                          uv.y - target_shift.y * (-2.5f + j),
                          0, previous_level);
        // Read input level
        float4 val = tex2Dlod(samMipMap, t);
        // Flood fill averaging region pixels only
        if (val.g == 1.0f) {
            sum += val.b;
            num++;
        }
    }
}
// Output flood fill value
if (num > 0.0f) {
    // Pixel should be flooded
    output.color.b = sum / num;
    output.color.g = 1.0f;
}
else {
    // Pixel is not in the flood region
    output.color.b = 0xFFFFFFFF;
    output.color.g = 0.0f;
}
}
Listing 5.2. The mipmap level generation. Flood filling for occlusion values.
Figure 5.3. Mipmap levels obtained by flood filling. The flood fill propagates informa-
tion from hard shadows to outer regions.
// Stores the result of the shadow filter (red channel, see the listing caption)
output.color.r = 0.0f;
// Number of points in the flood-fill region and their accumulated widths
float num = 0.0f;
float sum = 0.0f;
// Evaluate using a 5x5 window
for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
        float4 t = float4(uv.x - target_shift.x * (-2.5f + i),
                          uv.y - target_shift.y * (-2.5f + j),
                          0, previous_level);
        // Read input level
        float4 val = tex2Dlod(samMipMap, t);
        // Flood fill averaging region pixels only
        if (val.g == 1.0f) {
            sum += val.b;
            num++;
        }
        // Gaussian filter of the shadow-test values
        output.color.r += val.r * kernel[i][j];
    }
}
// Output flood fill value
if (num > 0.0f) {
    output.color.b = sum / num;
    output.color.g = 1.0f;
}
else {
    output.color.b = 0xFFFFFFFF;
    output.color.g = 0.0f;
}
}
Listing 5.3. Mipmap level generation including the filter of the shadow map test values.
The red channel stores the shadow intensity, the blue channel stores the penumbra width
and the green channel is used to tag the filled region.
Figure 5.4. Mipmap levels obtained by filtering the result of the shadow map test (top).
Shadows defined at each mipmap level (bottom).
i = log2(q(p)).
Notice that this equation does not give integer levels; we should not be limited by
the values in the mipmap levels, and we can generate shadows for intermediate
values. In the implementation, we use bilinear interpolation to compute the shadow intensity for in-between levels, as shown in Listing 5.4.
// Fetch mipmap levels
float4 val[8];
for (int level = 0; level < 8; level++) {
    val[level] = tex2Dlod(g_samMipMap, float4(uv, 0, level));
}
// Find q(p)
float q = 0;
for (int level = 0; level < MAX_LEVELS && q == 0; level++) {
    if (val[level].y != 0) q = val[level].z;
}
if (q > 0.0f)
{
    // Selected level
    if (q < 1) q = 1;
    float l = log2(q);
    if (l > MAX_LEVELS) l = MAX_LEVELS - 0.1f;
    // Interpolate levels
    int down = floor(l);
    int up = down + 1;
    float interp = l - down;
    // Shadow intensity
    shadow = (1.0f - lerp(val[down].x, val[up].x, interp));
}
output.color = shadow;
}
Listing 5.4. The mipmap lookup during shadow blending: the penumbra width q(p) selects the level, and two adjacent levels are interpolated.
Figure 5.5. Mipmapped shadows implemented using the main scene rendering. Forward
rendering (left). Deferred rendering (right).
how shadows can be added to the scene in forward and deferred rendering. In
forward rendering, as shown in Figure 5.5 (left), hard shadows are computed
during the main rendering of the scene, and they are stored in the bottom level
of the mipmap. Thus, the scene buffer does not contain any shadows. Afterward
the mipmap is constructed and subsequently the scene is shadowed by a shadow
blending post-processing. The post-processing computes the shadow intensities
and it combines the scene buffer and the shadow intensity by rendering a single
quad. In the deferred rendering illustrated in Figure 5.5 (right), hard shadows
are stored in a G-buffer and the shadow map can be used during the lighting
pass.
Figure 5.6 shows some examples of soft shadows generated by using the
mipmap technique. The first two images were obtained by changing the light’s
area, so shadows become smoother. The third image shows a close-up view to
highlight contact shadows. As with other shadow-map techniques, accurate
shadows at contact points and self-shadows require an appropriate bias in the
shadow-map test. The first image in the bottom row shows how shadows change
depending on the distance between the occluder and the receiver. Shadows are
blurred and smooth for distant points whilst they are well delineated close to
contact points. The final two images show the result on textured surfaces.
Layered shadow maps define an array of shadow maps that store the distances
to the closest point for slices parallel to the light. In our implementation, each
shadow map is obtained by rendering the scene multiple times, changing the near
clip of the camera to cover the regions illustrated in Figure 5.7 (right). That is,
the first shadow map covers a small region far from the light and the next shadow
maps cover regions that increase in size approaching the camera.
In order to use layered shadow maps, the shadow-map test in Listing 5.1
should be changed to search for occluders in an array of textures. This is imple-
mented in Listing 5.5. Here, the test is performed on each of the layer textures
and the first layer that passes the shadow-map test is used to compute the oc-
cluder distance. It is important to mention that if multiple occluders are close to
each other, then they will be located in the same shadow map. However, if they
are close, then the distance error is low and the shadows are similar. That is,
layers will not guarantee the correct distance computation, but they will mitigate
problems caused by multiple occluders at far distances.
Figure 5.8 shows two examples that compare renderings with and without
layered shadow maps. The images in the top row were obtained with a single
shadow map while the images on the bottom row have eight layers. In the example
shown in the images on the left, there are two regions of multiple occlusion caused by
the brick blocks and the containers. Since the distance between the bricks and the
container on the left is small relative to their distance to the light, shadows caused
by both objects merge without causing artifacts. However, the large distance
between the bricks and the container at the right causes a light shadow under
the container. As shown in the image on the bottom row, these problems are
reduced by using layered shadow maps.
float4 light_pos = mul(pos4, light_matrix);
// For each layer
float2 shadow_map_val[8];
shadow_map_val[0] = GetShadowMap(0, light_pos);
// ...
shadow_map_val[7] = GetShadowMap(7, light_pos);
Listing 5.5. Performing the shadow test and computing the penumbrae widths for
layered shadow maps and transparency.
The example in the top-right image in Figure 5.8 shows light bleeding caused
by a semitransparent object. Shadows for semitransparent objects can be created
by changing the intensity of the shadows according to the transparency value
of the albedo texture of the occluder. This modification can be implemented
during the shadow-map test, so it does not add any significant computational
overhead; it requires changing the shadow-map generation and the shadow-map
test. The shadow-map creation should be modified so that the shadow map keeps
the distance to the closest object and the alpha value of the albedo texture. The
alpha value can then be used as shown in Listing 5.5 to determine the intensity
of the shadow.
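As a rough sketch of this idea (our illustration, not the chapter's actual shader; the structure layout and names are assumptions), the stored alpha can simply scale the result of the shadow-map test:

    // Per-texel data assumed to be stored in the shadow map for this sketch:
    // the depth of the closest occluder and the alpha of its albedo texture.
    struct ShadowSample
    {
        float occluderDepth;
        float occluderAlpha;
    };

    // Returns the shadow intensity for a receiver point at 'receiverDepth'
    // (light space).  An opaque occluder (alpha = 1) casts a full shadow; a
    // semitransparent occluder only attenuates the light by its alpha.
    float ShadowIntensity(const ShadowSample& s, float receiverDepth, float bias)
    {
        bool occluded = s.occluderDepth < receiverDepth - bias;
        return occluded ? s.occluderAlpha : 0.0f;
    }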
The computation of transparency, using values in the shadow map, is compu-
tationally attractive; however it can produce light bleeding for multiple occluders.
This is illustrated in Figure 5.8 (top right). Here, the bin is causing a multiple
occlusion with the bus stop glass. Thus, the shadow cast by the bin uses the
transparency of glass and produces a very weak shadow. This is because a single
shadow map stores only the alpha of the closest object to the light. As shown
in Figure 5.8 (bottom right), layered shadow maps can alleviate this problem.
However, if the occluders are moved close to each other, the layer strategy may
fail to store multiple values and objects can produce incorrect shadows.
Figure 5.8. Examples of layered shadows. Light bleeding caused by incorrect computa-
tion of the distance between the occluder and the receiver (top). Layered shadow maps
can reduce occluder problems (bottom).
5.7 Discussion
Compared with standard shadow maps, the technique presented in this chapter
uses an extra texture to store the mipmap and one texture for each layer for the
layered version. In terms of processing, it adds a computational overhead caused
by: (i) the computations of the penumbra width during the shadow-map test
(Listing 5.1); (ii) the creation of the mipmap (Listing 5.3); and (iii) the mipmap
lookup during shadow blending (Listing 5.4).
The rendering times of the technique are shown in Table 5.1. The columns in
the table show the frames per second when: rendering without shadows, render-
ing using standard shadow maps, rendering using the mipmap, and when using
layers. The results were obtained for a test scene with 13K faces and by using
a GeForce 8800GTX with a 720 × 480 display resolution. The implementation
used a 512 × 512 shadow map and six mipmap levels. The frame time increases
about 0.3 ms when the mipmap is used to generate soft shadows. This increase
is mainly because of the time spent during the generation of the mipmap. In
the layered version, the increase in rendering time is mainly due to the multiple
renderings required to create the shadow map for each layer. The time shown in
Table 5.1 was obtained by considering six layers.
The computational load is adequate for real-time applications and the results
show compelling smooth shadows. However, multiple occlusions can produce light
bleeding. This is more evident as the light’s area increases, since shadows with
significantly different intensities can be created. This problem can be mitigated
by saturating the intensity of the shadows or by using layered shadow maps. Nev-
ertheless, when dealing with complex scenes and large-area lights, there still may
be variations of intensities on multiple occlusion zones. As such, the technique
could benefit from more elaborate layer placements or peeling layer strategies.
Finally, it is important to mention that this technique relies on shadow maps,
so it inherits their computational advantages, but it is also prone to their inherent
problems, such as z-fighting when the depth bias is incorrect.
Bibliography
[Annen et al. 08] Thomas Annen, Zhao Dong, Tom Mertens, Philippe Bekaert, Hans-Peter Seidel, and Jan Kautz. “Real-Time All-Frequency Shadows in Dynamic Scenes.” ACM Transactions on Graphics (Proc. SIGGRAPH) 27:3 (2008), 34:1–34:8.
[Bracewell 00] Ronald Bracewell. The Fourier Transform and its Applications. Singa-
pore: McGraw-Hill International Editions, 2000.
[Donnelly and Lauritzen 06] William Donnelly and Andrew Lauritzen. “Variance
Shadow Maps.” In Symposium on Interactive 3D Graphics and Games, pp. 161–165.
New York: ACM, 2006.
[Gumbau et al. 10] Jesus Gumbau, Miguel Chover, and Mateu Sbert. “Screen Space Soft
Shadows.” In GPU Pro Advanced Rendering Techniques, edited by Wolfgang Engel,
pp. 477–491. Natick, MA: A K Peters, 2010.
[Reeves et al. 87] William Reeves, David Salesin, and Robert Cook. “Rendering Antialiased Shadows with Depth Maps.” Computer Graphics (Proc. SIGGRAPH) 21:4 (1987), 283–291.
[Williams 78] Lance Williams. “Casting Curved Shadows on Curved Surfaces.” Com-
puter Graphics (Proc. SIGGRAPH) 12:3 (1978), 270–274.
V
Handheld
Devices
This part covers the latest developments in programming the GPUs of portable
devices, such as mobile phones, personal organizers, and portable game consoles.
The latest generation of GPUs for handheld devices comes with a feature set that
is comparable to PC and console GPUs.
The first article, “A Shader-Based eBook Renderer,” by Andrea Bizzotto illus-
trates a vertex-shader-based implementation of the page-peeling effect of a typical
eBook renderer. It covers high-quality procedural antialiasing of the page edges,
as well as some tricks that achieve a polished look. Two pages can be combined
side-by-side to simulate a real book, and additional techniques are introduced
to illustrate how to satisfy additional constraints and meet power-consumption
requirements.
The second article of this part, “Post-Processing Effects on Mobile Devices,”
by Marco Weber and Peter Quayle describes a general approach to implement
post-processing on handheld devices by showing how to implement a bloom effect
with efficient convolution.
Joe Davis and Ken Catterall show in “Shader-Based Water Effects,” how to
render high-quality water effects at a low computational cost. Although there are
many examples of water effects using shaders that are readily available, they are
designed mainly for high-performance graphics chips on desktop platforms. This
article shows how to tailor a technique discussed by Kurt Pelzer (in ShaderX2 ,
“Advanced Water Effects,” 2004) to mobile platforms.
—Kristof Beets
1
V
A Shader-Based eBook Renderer
Andrea Bizzotto
eBook readers are becoming increasingly popular. Touch screens and pro-
grammable GPUs, such as the POWERVR SGX Series from Imagination Tech-
nologies, can be combined to implement user-friendly navigation and page flipping
functionality. This chapter illustrates a vertex-shader-based implementation of
the page-peeling effect, and details some techniques that allow high-quality proce-
dural antialiasing of the page edges, as well as some tricks that achieve a polished
look. Two pages can be combined side-by-side to simulate a real book, and addi-
tional techniques are introduced to illustrate how to satisfy additional constraints
and meet power consumption requirements.
1.1 Overview
The chapter is organized as follows: Section 1.2 introduces the mathematical
model which is the basis of a page-peeling simulation, and shows how to use
the vertex shader to render the effect with a tessellated grid on OpenGL ES
2.0 hardware. Section 1.3 discusses the additional constraints that need to be
considered when rendering two pages side-by-side, and Section 1.4 illustrates some
techniques that improve the visual look and deal with antialiasing. An approach
that doesn’t require a tessellated grid is illustrated in Section 1.5 to show how
the technique can be adapted to work on OpenGL ES 1.1 hardware with minimal
vertex overhead. Section 1.6 mentions some practical considerations regarding
performance and power consumption. Finally, Sections 1.7, 1.8, and 1.9 discuss
some aspects that have not been considered or that can be improved.
Throughout the article, points will be represented with a capital bold letter,
vectors with a small bold letter, and scalars in italic. Page, quad, and plane will
be used interchangeably to describe the same entity.
Figure 1.2. Page-folding effect: positions of the input vertices on the right of the bending
axis are modified according to the model.
Figure 1.3. Internal representation: a tessellated grid is used as the input. When the
page is not folded, two triangles can be used to render it.
C = −sign(d). (1.1)
Then, the axis where the page starts to fold is determined (dotted line in
Figure 1.2). This can be represented geometrically by a point1 and a direction,
which can be calculated as in Equations (1.2) and (1.3):
B = C + F · d, (1.2)
t = (−dy, dx). (1.3)
The calculated values B and t are then passed to the vertex shader as uniforms,
together with the direction d and radius R. All the remaining steps of the
algorithm are performed in the vertex shader.
Since t and v both lie on the same plane z = 0, the scalar value w = sz , which
satisfies |w| = |sz | = |s|, gives all the required information. In fact, the sign of
w represents the direction of the resulting vector s and tells which semiplane the
vertex belongs to, and its absolute value is the distance from the axis.
It can be noticed in Figure 1.2 that the relation w = −d · v also holds since
the dot product calculates the projection of the vector v into d, which can be on
either side of the bending axis. To summarize, both the cross and dot product
can be used to get the required information as Equation (1.5) shows:
Figure 1.4. Model for bending effect: For vertices with w > 0 the position needs to
be recalculated by wrapping it along the curved path.
E = B + (v · t) t (1.6)
Figure 1.4 shows that the x-, y- and z-components of the input vertex are up-
dated differently. All the remaining calculations are based on the known distance
w and the direction −d, labeled u for convenience. The angle α = w/R is used,
as well, to update the final position.
If α < π, then the final position of the vertex lies on the semicircle and can
be calculated as in Equation (1.7):
gl_Position.w = 1.0;
TexCoord = inTexCoord;
}
Listing 1.1. OpenGL ES 2.0 vertex shader for basic page-peeling effect.
recommended that the relative tessellation of the input grid matches the screen
aspect ratio, so that the vertices will be spaced equally in the two dimensions,
once stretched to the screen.
F dx − λ dy = x,
F dy + λ dx = 0,     (1.11)

x = F dx − (dy/dx)(−F dy) = F dx + F dy^2/dx = F (dx^2 + dy^2)/dx = F/dx     (1.12)

(since d is a unit vector, dx^2 + dy^2 = 1).
If x > a, then F exceeds the maximum value given the direction d, as shown in
Figure 1.5(a). If a different angle is chosen, the page can fold in a direction that
satisfies the constraint, while preserving the same folding value F .
Let θ be the angle corresponding to the direction d. Values of |θ| close to π/2
cause x to approach the limit a quickly, since the page folds almost vertically,3
therefore when x > a it is appropriate to use a new value θ′, where |θ′| < |θ|.
Once the new angle is calculated, all other dependent variables can be updated
and passed to the vertex shader.
2 Here the corner is assumed to have coordinates (0, 0) and applying the two vectors F d and
Figure 1.5. Angle correction: to prevent page tearing (left), the direction is modified
in order to meet the constraint x ≤ a (right).
θc = −sign(dx dy) F (x − a)/x. (1.13)
The term −sign(dx dy) takes into account the fact that θ can be positive or
negative, ensuring that |θ′| < |θ|.4
The absolute value of θc needs to be proportional to F and to the difference
(x − a), normalized by a factor 1/x to ensure small corrections for large x values
(this is a heuristic that works well in practice).
4 The term dx in Equation (1.13) is necessary to correctly handle the right page, where the
angle correction needs to be inverted. The full example can be found in the code.
If F is large enough, it means that the user is dragging a finger across the
whole screen, and the page should fold completely. In order to do this it is
necessary to modify the current model so that once θ′ reaches 0 (the condition in
which the page is parallel to and almost completely covers the one underneath), the
radius decreases to 0 as well to complete the peel effect. Listing 1.2 illustrates
the final update stage.
Once the angle and radius are updated, all the required uniforms are calcu-
lated and passed to the vertex shader.
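Listing 1.2 itself is not reproduced here, but the following C++ sketch (our reconstruction of the update described by Equations (1.12) and (1.13), with hypothetical names and the reading θ′ = θ + θc) outlines the correction step:

    #include <cmath>

    // Correct the folding angle so that the fold axis does not leave the page.
    // theta: current folding angle, with d = (cos(theta), sin(theta)) the fold
    // direction; F: folding amount; a: horizontal page size.
    float CorrectFoldAngle(float theta, float F, float a)
    {
        float dx = std::cos(theta);
        float dy = std::sin(theta);

        // Equation (1.12): x = F/dx is where the bending axis meets the page edge.
        float x = F / dx;

        if (x > a)
        {
            // Equation (1.13): theta_c = -sign(dx*dy) * F * (x - a) / x.
            float s = (dx * dy > 0.0f) ? 1.0f : ((dx * dy < 0.0f) ? -1.0f : 0.0f);
            float thetaC = -s * F * (x - a) / x;
            theta += thetaC;   // |theta'| < |theta|: the page folds less vertically
        }

        // After this, d, B, t (and, once theta' reaches 0, the radius R) are
        // recomputed on the CPU and passed to the vertex shader as uniforms.
        return theta;
    }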
Figure 1.6. Antialiasing edges: simple shader (left), internal gradient border (center),
internal and external border (right). The intensity values and widths of the borders can
be tweaked to obtain the desired look.
In addition to the physical borders of the page, the area where the page
bends includes an edge that can be perceived visually and could be improved by
some form of shading. A gradient-based technique can be used here too, and
the gradient can be calculated as a simple function of the angle α, available in
the vertex shader. Since the folding direction is arbitrary, in this case it’s not
possible to split the original mesh in two, and this results in a more expensive
fragment shader being applied to the whole page. The next section introduces
an alternative rendering technique that, as a side effect, addresses this issue.
Figure 1.7. Direction-aligned strip: by using tessellation on only the curled part of the
page, the overall number of vertices can be sensibly reduced. Additionally, the grid can
be simplified to a triangle strip.
axis and create the two trapezoids. The triangle strip and the trapezoids can be
generated procedurally on the CPU every time the input changes and can then
be passed to OpenGL for rendering.
If an internal border is used as described in the previous section, the number of
meshes and intersections to be determined further increases. Many subcases can
be envisioned depending on how the bending axis partitions the page, and some
extensive coding is necessary to cover all of them. If needed, the whole geometry
can be preprocessed on the CPU and the page can be submitted already folded
to the graphics renderer, allowing this technique to run on OpenGL ES 1.1-class
hardware. For the purposes of this article only the general idea is presented and
the implementation is left to the reader (though code for the tessellation-grid
approach is included).
Note, separating the input quad into individual meshes further reduces the
overall number of computations performed on the GPU, since a simple lookup
shader can be used for most of the area (flat trapezoids), while the application
of fancier shaders with gradients can be limited to the bent part of the page.
represented by two different textures, and the page can be rendered twice, with
front- and back-face cull. Figure 1.1 shows the final effect rendered on screen.
The available code features a simple eBook application that allows the user to
browse through a predefined set of pages and takes into account some additional
practical conditions not considered in this article.
1.9 Conclusion
Graphics hardware capable of vertex processing can tackle the problem of folding
a plane in any arbitrary direction by means of a highly tessellated grid. A sample
application for iPad (available in the source code release) has been developed,
and approximately 17,000 faces have been used per-page without any noticeable
degrade in visual quality or performance at 1024 × 768 resolution.
The cost per vertex is relatively low, given the optimizations introduced in
this paper, and the fragment shaders are quite simple, generally consisting of a
texture lookup and a sum or mix with a gradient color, making low-end OpenGL
ES 2.0 devices a good target for this kind of application.
1.10 Acknowledgments
The author would like to thank Kristof Beets, Sergio Tota, and the whole BizDev and
DevTech teams for their work, contribution and feedback.
2
V
Post-Processing Effects on
Mobile Devices
Marco Weber and Peter Quayle
2.1 Overview
Figure 2.1. Radial twist, sepia color transformation, depth of field, depth of field and sepia color transformation combined (clockwise from top left).
Figure 2.2. Post-processing effect on the Samsung Galaxy S. The background of the user interface is blurred and darkened to draw attention to items in the foreground.
The general idea behind post-processing is to take an image as input and generate
an image as output (see Figure 2.3). You are not limited to only using the
provided input image data, since you can use any available data source (such as
depth and stencil buffers), as additional input for the post-processing step. The
only requirement is that the end result has to be an image.
One direct advantage of this approach is that, due to the identical basic for-
mat of input and output, it is possible to chain post-processing techniques. As
illustrated in Figure 2.4, the output of the first post-processing step is reused as
input for the second step. This can be repeated with as many post-processing
2. Render a full screen pass using a custom pixel shader, with the texture from
the previous stage as input, to apply the effect.
The first step is the most straightforward, because it simply requires setting
a different render target. Instead of rendering to the back buffer, you can create
an FBO of the same dimensions as your frame buffer. In the case that the frame
buffer dimensions are not a power of two (e.g., 128×128, 256×256 etc.), you must
check that the graphics hardware supports non-power-of-two (NPOT) textures.
If there is no support for NPOT textures, you could allocate a power-of-two FBO
that approximates the dimensions of the frame buffer. For some effects it may
be possible to use an FBO far smaller than the frame buffer, as discussed in
Section 2.4.
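As a minimal sketch of step one (our own example, not code from the accompanying SDK), a texture-backed frame buffer object can be created as follows in OpenGL ES 2.0:

    #include <GLES2/gl2.h>

    // Create a texture-backed FBO to capture the scene for post-processing.
    // Returns the FBO handle; outTexture receives the color texture.
    GLuint CreateSceneFBO(GLsizei width, GLsizei height, GLuint* outTexture)
    {
        GLuint texture, fbo;

        // Color texture the scene will be rendered into.
        glGenTextures(1, &texture);
        glBindTexture(GL_TEXTURE_2D, texture);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, 0);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

        // Frame buffer object with the texture as its color attachment.
        // (A production version would also attach a depth buffer and check
        // glCheckFramebufferStatus before use.)
        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, texture, 0);

        // Rendering to the FBO: glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        // Rendering to the back buffer again: glBindFramebuffer(GL_FRAMEBUFFER, 0);
        glBindFramebuffer(GL_FRAMEBUFFER, 0);

        *outTexture = texture;
        return fbo;
    }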
In step two, the texture acquired during step one can be used as input for the
post-processing. In order to apply the effect, a full-screen quad is drawn using a
post-processing pixel shader to apply the effect to each pixel of the final image.
All of the post-processing is executed within the pixel shader. For example, in
order to apply an image convolution filter, neighboring texels have to be sampled
and modulated to calculate the resulting pixel. Figure 2.5 illustrates the kernel,
which can be seen as a window sliding over each line of the image and evaluating
each pixel at its center by fetching neighboring pixels and combining them.
• The second step renders to the frame buffer object with ID 2, using the
previous frame buffer object as input.
• The whole procedure is repeated for steps three and four, but instead of
using frame buffer object 2 again for the last step, the back buffer is used
since the final result will be displayed on the screen.
2.4 Implementation
The bloom algorithm presented in the previous section describes the general
approach one might implement when processing resources are vast. Two full
screen passes for intensity filtering and final blending and several passes for the
blur filter in the most naive implementation are very demanding even for the
fastest graphics cards.
Due to the nature of mobile platforms, adjustments to the original algorithm
have to be made in order to get it running, even when the hardware is equipped
with a highly efficient POWERVR SGX core.
The end result has to look convincing and must run at interactive frame rates.
Thus, most of the steps illustrated in Section 2.3 have to be modified in order to
meet the resource constraints.
2.4.1 Resolution
One of the realities of mobile graphics hardware is a need for low power and long
battery life, which demand lower clock frequencies. Although the POWERVR
SGX cores implement a very efficient tile-based deferred rendering approach, it
is still essential to optimize aggressively when implementing full screen post-
processing effects.
In our implementation the resolution of the frame buffer object for the blurred
texture was set to 128 × 128, which has shown to be sufficient for VGA (640 ×
480) displays. Depending on the target device’s screen and the content being
rendered, even 64 × 64 may be adequate; the trade-off between visual quality and
performance should be inspected by regularly testing the target device. It should
be kept in mind that using half the resolution (e.g., 64 × 64 instead of 128 × 128)
means a 75% reduction in the number of pixels being processed.
Since the original image data is not being reused because of the reduced reso-
lution, the objects using the bloom effect have to be redrawn. This circumstance
can be exploited for another optimization. As we are drawing only the objects
which are affected by the bloom, it is possible to calculate a bounding box en-
closing these objects that in turn will be reused in the following processing steps
as a kind of scissor mechanism.
Figure 2.11. Difference between nontextured (left) and textured (right) bloom.
Figure 2.12. Transforming the brightness scalar by doing a texture lookup in an intensity map.
When initially drawing the objects to the frame buffer object, one could take
the optimization even further and completely omit texture mapping (see Figure 2.11).
This would mean that the vertex shader would calculate only the
vertex transformation and the lighting equation, which would reduce the amount
of data being processed in the fragment shader even further, at the expense of
some detail. Each different case should be evaluated to judge whether the per-
formance gain is worth the visual sacrifice.
When omitting texture mapping, the scalar output of the lighting equation
represents the input data for the blur stage, but if we simply used scalar output
for the following steps, the resulting image would be too bright, even in the darker
regions of the input image, which is why the intensity filter has to be applied.
Applying the intensity filter can be achieved by doing a texture lookup into a
1D texture, representing a transformation of the luminance (see Figure 2.12).
This texture can be generated procedurally by specifying a mapping function, or
manually, whereby the amount of bloom can be stylized to meet artistic needs.
The lookup to achieve the intensity filtering can potentially be merged into the
lighting equation. Other parts of the lighting equation could also be computed
via the lookup table, for example, by premultiplying the values in it by a constant
color.
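For illustration only (not the demo's code), such an intensity map could be generated procedurally with a simple threshold-and-ramp mapping and uploaded as a width × 1 luminance texture (OpenGL ES 2.0 has no 1D texture target):

    #include <vector>
    #include <GLES2/gl2.h>

    // Build a width x 1 luminance texture that cuts off values below 'threshold'
    // and ramps the remaining range up to full brightness.
    GLuint CreateIntensityMap(int width, float threshold)
    {
        std::vector<unsigned char> data(width);
        for (int i = 0; i < width; ++i)
        {
            float x = i / float(width - 1);
            float y = (x <= threshold) ? 0.0f : (x - threshold) / (1.0f - threshold);
            data[i] = (unsigned char)(y * 255.0f + 0.5f);
        }

        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, width, 1, 0,
                     GL_LUMINANCE, GL_UNSIGNED_BYTE, &data[0]);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
        return tex;
    }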
2.4.2 Convolution
Once we’ve rendered our intensity-filtered objects to the frame buffer object, the
resulting image can be used as input for the blur-filtering steps. This section
explains the blur-filtering methods which are depicted in Figure 2.13.
Image convolution is a common operation and can be executed very efficiently
on the GPU. The naive approach is to calculate the texture-coordinate offsets
(e.g., 1/width and 1/height of texture image) and sample the surrounding tex-
els. The next step is to combine these samples by applying either linear filters
(Gaussian, median, etc.) or morphologic operations (dilation, erosion, etc.).
Figure 2.13. Blurring the input image in a low-resolution render target with a separated
blur filter kernel.
In this case we will apply a Gaussian blur to smooth the image. Depending
on the size of the filter kernel, we have to read a certain amount of texture values,
multiply each of them by a weight, sum the results, and divide by a normalization
factor. In the case of a 3 × 3 kernel this results in nine texture lookups, nine
multiplications, eight additions, and one divide operation, which is a total of 27
operations to filter a single texture element. The normalization can be included
in the weights, reducing the total operation count to 26.
Fortunately, the Gaussian blur is a separable filter, which means that the filter
kernel can be expressed as the outer product of two vectors:
| 1 2 1 |
| 2 4 2 |  =  (1 2 1) ⊗ (1 2 1).
| 1 2 1 |
Making use of the associativity,
t · (v · h) = (t · v) · h,
where t represents the texel, v the column, and h the row vector, we can first
apply the vertical filter and, in a second pass, the horizontal filter, or vice versa.
This results in three texture lookups, three multiplications, and two additions per
pass, giving a total of 16 operations when applying both passes. This reduction
in the number of operations is even more dramatic when increasing the kernel
size (e.g., 5 × 5, 7 × 7, etc.)(see Table 2.1):
Table 2.1. Operations per texture element for the naive 2D filter versus the separated (two-pass) filter, counted as above with the normalization folded into the weights: 3 × 3: 26 versus 16; 5 × 5: 74 versus 28; 7 × 7: 146 versus 40.
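The following CPU-side reference (a sketch written for clarity, not the shader used on the device) shows the two-pass idea with the normalized 1-2-1 weights:

    #include <vector>

    // Separable 3x3 blur with the 1-2-1 kernel applied as two 1D passes.
    // 'img' is a w*h grayscale image; edges are clamped.
    std::vector<float> SeparableBlur(const std::vector<float>& img, int w, int h)
    {
        const float k[3] = { 0.25f, 0.5f, 0.25f };   // normalized 1-2-1 weights
        auto clampi = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };

        std::vector<float> tmp(img.size()), out(img.size());

        // Vertical pass.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
            {
                float s = 0.0f;
                for (int i = -1; i <= 1; ++i)
                    s += k[i + 1] * img[clampi(y + i, 0, h - 1) * w + x];
                tmp[y * w + x] = s;
            }

        // Horizontal pass.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
            {
                float s = 0.0f;
                for (int i = -1; i <= 1; ++i)
                    s += k[i + 1] * tmp[y * w + clampi(x + i, 0, w - 1)];
                out[y * w + x] = s;
            }
        return out;
    }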
Figure 2.14. Reducing the number of texture lookups by using hardware texture filter-
ing.
In most cases, separating the filter into horizontal and vertical passes results
in a large performance increase. However, the naive single-pass version may be
faster in situations in which bandwidth is severely limited. It is always worth
benchmarking to ensure the best solution for a given platform or scenario.
The number of texture lookups can be decreased again by exploiting hardware
texture filtering. The trick is to replace the texture lookup for the outer weights
with one which is between the outer texels, as shown in Figure 2.14.
The way this works is as follows: when summing the contribution, s, of the
outer texels, t0 and t1 , in the unoptimized version, we use
s = t0 w0 + t1 w1. (2.1)
When we sample between the outer texels with linear filtering enabled we
have
s = t0 (1 − u) + t1 u, (2.2)
where u is the normalized position of the sample point in relation to the two
texels. So by adjusting u we can blend between the two texel values. We want
to blend the texels with a value for u such that the ratio of (1 − u) to u is the
same as the ratio of w0 to w1 . We can calculate u using the texel weights
u = w1 / (w0 + w1 ) . (2.3)
Although this appears to contain more operations than Equation (2.1), the
cost of the term in the first set of brackets is negligible because linear texture
filtering is effectively a free operation. In the case of the 5 × 5 filter kernel, the
number of texture lookups can be reduced from ten to six, yielding the same
number of texture lookups as is needed for the 3 × 3 kernel.
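A small sketch of how the merged weight and sample position could be derived (our illustration; names are hypothetical):

    // Fold two adjacent taps, located at texel offsets x0 and x0 + 1 with weights
    // w0 and w1, into a single linearly filtered fetch (Equations (2.2) and (2.3)).
    struct FoldedTap
    {
        float weight;   // combined weight w0 + w1
        float offset;   // sample position in texels, measured from the kernel center
    };

    FoldedTap FoldTaps(float x0, float w0, float w1)
    {
        FoldedTap t;
        t.weight = w0 + w1;
        t.offset = x0 + w1 / (w0 + w1);   // u from Equation (2.3)
        return t;
    }

    // Example: for the 1-4-6-4-1 row (weights divided by 16), FoldTaps(1.0f,
    // 4.0f/16.0f, 1.0f/16.0f) gives weight 5/16 at offset 1.2, so the whole row
    // needs only three fetches: at -1.2, 0.0, and +1.2 texels.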
It is important that the texture coordinates are calculated in the vertex shader
and passed to the pixel shader as varyings, rather than being calculated in the
pixel shader. This will prevent dependent texture reads. Although these are
supported, they incur a potentially substantial performance hit. Avoiding de-
pendent texture reads means that the texture-sampling hardware can fetch the
texels sooner and hide the latency of accessing memory.
2.4.3 Blending
The last step is to blend the blurred image over the original image to produce
the final result, as shown in Figure 2.15.
Therefore, the blending modes have to be configured and blending enabled so
that the blurred image is copied on top of the original one. Alternatively, you
could set up an alpha-value-based modulation scheme to control the amount of
bloom in the final image.
In order to increase performance and minimize power consumption, which is
crucial in mobile platforms, it is best that redundant drawing be avoided as much
as possible. The single most important optimization in this stage is to minimize
the blended area as far as possible. Blending is a fill-rate intensive operation,
especially when being done over the whole screen. When the bloom effect is
applied only to a subset of the visible objects, it is possible to optimize the final
blending stage:
• In the initialization stage, calculate a bounding volume for the objects which
are affected.
• During runtime, transform the bounding volume into clip space and cal-
culate a 2D bounding volume, which encompasses the projected bounding
volume. Add a small margin to the bounding box for the glow (a sketch of
this projection step follows Figure 2.16).
Figure 2.16. Bounding box derived rectangle (red) used for final blending.
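A sketch of the runtime step (minimal hand-rolled math types for illustration; this is not code from the SDK and it assumes the box lies entirely in front of the camera):

    #include <algorithm>
    #include <cfloat>

    struct Vec4 { float x, y, z, w; };
    struct Mat4 { float m[16]; };          // column-major, OpenGL style
    struct Rect { int x, y, width, height; };

    static Vec4 Transform(const Mat4& M, const Vec4& v)
    {
        return { M.m[0]*v.x + M.m[4]*v.y + M.m[8]*v.z  + M.m[12]*v.w,
                 M.m[1]*v.x + M.m[5]*v.y + M.m[9]*v.z  + M.m[13]*v.w,
                 M.m[2]*v.x + M.m[6]*v.y + M.m[10]*v.z + M.m[14]*v.w,
                 M.m[3]*v.x + M.m[7]*v.y + M.m[11]*v.z + M.m[15]*v.w };
    }

    // Project the 8 corners of a world-space bounding box and return the
    // enclosing screen rectangle, enlarged by a small margin for the glow.
    Rect BloomBlendRect(const Mat4& mvp, const Vec4 corners[8],
                        int screenW, int screenH, int marginPx)
    {
        float minX = FLT_MAX, minY = FLT_MAX, maxX = -FLT_MAX, maxY = -FLT_MAX;
        for (int i = 0; i < 8; ++i)
        {
            Vec4 c = Transform(mvp, corners[i]);
            float ndcX = c.x / c.w, ndcY = c.y / c.w;    // clip space -> NDC
            minX = std::min(minX, ndcX); maxX = std::max(maxX, ndcX);
            minY = std::min(minY, ndcY); maxY = std::max(maxY, ndcY);
        }
        // NDC [-1,1] -> pixels, plus the margin.
        Rect r;
        r.x      = int((minX * 0.5f + 0.5f) * screenW) - marginPx;
        r.y      = int((minY * 0.5f + 0.5f) * screenH) - marginPx;
        r.width  = int((maxX - minX) * 0.5f * screenW) + 2 * marginPx;
        r.height = int((maxY - minY) * 0.5f * screenH) + 2 * marginPx;
        return r;
    }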
2.5 Conclusion
We have presented a brief introduction to post-processing, followed by a detailed
case study of a well-known post-processing effect. We have illustrated optimiza-
tion techniques that make it possible to use the effect while achieving interactive
3
V
Shader-Based Water Effects
Joe Davis and Ken Catterall
3.1 Introduction
Generating efficient and detailed water effects can add a great deal of realism
to 3D graphics applications. In this chapter, we highlight techniques that can
be used in software running on POWERVR SGX-enabled platforms to render
high-quality water effects at a relatively low computational cost.
Such effects can be achieved in a variety of ways, but we will focus on the use
of vertex and fragment shaders in OpenGL ES 2.0 to alter the appearance of a
plane to simulate a water effect.
Although there are many examples of water effects using shaders that are
readily available, they are designed mainly for high-performance graphics chips
on desktop platforms. The following sections of this chapter describe how the gen-
eral concepts presented in desktop implementations, in particular the technique
discussed by K. Pelzer [Pelzer 04], can be tailored to run at an interactive frame
rate on even low-cost POWERVR SGX platforms, including the optimizations
that were made to achieve the required performance.
We refer to an example application OGLES2Water that is part of the freely
available POWERVR OpenGL ES 2.0 SDK, which is included in the example
code with this article. Up-to-date SDKs are available from the Imagination Tech-
nologies website.1 Specific performance numbers cited refer to tests run on an
OMAP3530 BeagleBoard2 platform at VGA resolution.
3.2 Techniques
3.2.1 Geometry
In the demonstration, the plane that the water effect is applied to is a horizontal
plane in world space, extending to the boundaries of the view frustum—this is
1 [Link]
2 [Link]
constructed from only four or five points, as a high level of tessellation is not
required. The PVRTools library from the POWERVR SDK contains a function,
PVRTMiscCalculateInfinitePlane(), which obtains these points for a given set
of view parameters. Because the plane is horizontal, certain calculations can be
simplified by assuming the normal of the plane will always lie along the positive
y-axis.
A skybox is used to encapsulate the scene. This is textured with a PVRTC-
compressed 4 bits per-pixel format cubemap using bilinear filtering with near-
est mipmapping to provide a good balance between performance and quality
(for more information, see S. Fenney’s white paper on texture compression [Fe-
neny 03]).
// Mirror the view matrix about the plane.
PVRTMat4 mMirrorCam(PVRTMat4::Identity());
mMirrorCam.ptr()[ 1] = -m_vPlaneWater.x;
mMirrorCam.ptr()[ 5] = -m_vPlaneWater.y;
mMirrorCam.ptr()[ 9] = -m_vPlaneWater.z;
mMirrorCam.ptr()[13] = -(2.0f * m_vPlaneWater.w);
As the diagram in Figure 3.2 shows, mirroring the camera is not enough by
itself, because it results in the inclusion of objects below the water’s surface,
which spoils the reflection. This issue can be avoided by utilizing a user-defined
clip plane along the surface of the water to remove all objects below the water
from the render (See Section 3.3.1 for information on how this can be achieved
in OpenGL ES 2.0). Using this inverted camera, the entire reflected scene can
be rendered. Figure 3.3 shows the clipped reflection scene rendered to texture in
the demo.
In the main render pass, where the water plane is drawn, the reflected scene is
sampled using screen-space coordinates, and then is distorted using an animated
bump map normal.
The gl_FragCoord variable can be used to retrieve the current fragment’s
coordinate within the viewport, which is then normalized by the viewport dimensions.
Figure 3.5. Water effect using a perturbed reflection texture and alpha blending.
intersecting the plane since these are the only objects that will be needed
during the render. If this still proves to be too expensive, the pass can be
reduced to just drawing the key objects in the scene, such as a skybox and
terrain.
2. Favor FBO use over reading from the frame buffer. Rather than using
a copy function such as glReadPixels(), a frame buffer object with a
texture bound to it should be used to store the output of the render pass
in a texture [PowerVR 10]. This avoids the need to copy data from one
memory location (the frame buffer) to another (texture memory), which
would cost valuable cycles and bandwidth within the system. Even more
important, it avoids direct access of the frame buffer that can result in a
loss of parallelism, as the CPU would often be stalled, waiting for the GPU
to render.
4. Avoid using discard to perform clipping. Although using the discard key-
word works for clipping techniques, its use decreases the efficiency of early
order-independent depth rejection performance advantages that the POW-
ERVR architecture offers (See Section 3.3.1 for more information).
Figure 3.7. Water effect using only the refraction texture (without fogging).
mixing of color, based on the current viewing angle. Figure 3.8 shows the full
effect of the water, where reflection and refraction textures are mixed using the
Fresnel term.
Figure 3.9. Water effect using only the refraction texture (with fogging).
Fresnel term. The Fresnel term is used to determine how much light is reflected
at the boundaries of two semitransparent materials (the rest of which is absorbed
through refraction into the second material). The strongest reflection occurs
when the angle of incidence of the light ray is large, and, conversely, reflection
decreases as the angle of incidence decreases (Figures 3.10 and 3.11). The Fresnel
term provides a ratio of transmitted-to-reflected light for a given incident light
ray.
In practice, this is used to determine the correct mix between the reflected
and refracted textures for any point on the water’s surface from the current view
position. This is the Fresnel principle in reverse, and the ratio can be obtained
using an approximation derived from the same equations.
The approximation of the Fresnel term used in the demo is determined using
the following formulae, where n1 and n2 are the indices of refraction for each
material [Pelzer 04]:
R(0) = (n1 − n2)^2 / (n1 + n2)^2,
R(α) = (1 − R(0)) (1 − cos α)^5 + R(0).
To save computation time, the result of the equation above is calculated outside
of the application, using the values in Table 3.1.
Using these numbers, the constant terms in the formula can be precalculated
(see Table 3.2).
Table 3.1. Indices of refraction.
n1 (Air): 1.000293
n2 (Water at room temperature): 1.333333

Table 3.2. Fresnel approximation constants.
R(0): 0.02037
1 − R(0): 0.97963
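These constants can be reproduced with a few lines of C++ (a reference calculation rather than code from the demo):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        // Indices of refraction from Table 3.1.
        const double n1 = 1.000293;   // air
        const double n2 = 1.333333;   // water at room temperature

        // R(0) = (n1 - n2)^2 / (n1 + n2)^2: reflectance at normal incidence.
        const double r0 = ((n1 - n2) * (n1 - n2)) / ((n1 + n2) * (n1 + n2));
        std::printf("R(0)     = %.5f\n", r0);         // ~0.02037
        std::printf("1 - R(0) = %.5f\n", 1.0 - r0);   // ~0.97963

        // R(alpha) = (1 - R(0)) * (1 - cos(alpha))^5 + R(0), evaluated at
        // 60 degrees as an example.
        const double pi    = 3.14159265358979323846;
        const double alpha = 60.0 * pi / 180.0;
        const double r     = (1.0 - r0) * std::pow(1.0 - std::cos(alpha), 5.0) + r0;
        std::printf("R(60 deg) = %.5f\n", r);
        return 0;
    }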
The shader code in Listing 3.2 shows this principle in practice; first calculate
the angle between the water normal (obtained from the bump map technique) and
then the water-to-eye vector. These results are then used to calculate the Fresnel
term, which is in turn used to mix the samples of the reflected and refracted
scenes.
The normalization cube map used in the code is an optimization discussed
later in Section 3.3.1. On some hardware this may achieve faster results than
using the built-in normalize() functionality. The water normal here is assumed
to be already normalized, though this may not always be the case.
Using the Fresnel calculation instead of a constant mix on the development
hardware reduces the performance by 22%, but gives a much more realistic out-
put.
mediump float fEyeToNormalAngle =
    clamp(dot(vWaterToEyeCube, vAccumulatedNormal), 0.0, 1.0);

// Use the approximations:
// 1 - R(0) ~= 0.98
// R(0)     ~= 0.02
mediump float fAirWaterFresnel = 1.0 - fEyeToNormalAngle;
fAirWaterFresnel = pow(fAirWaterFresnel, 5.0);
fAirWaterFresnel = (0.98 * fAirWaterFresnel) + 0.02;

// Blend reflection and refraction
lowp float fTemp = fAirWaterFresnel;
gl_FragColor = mix(vRefractionColour, vReflectionColour, fTemp);

Listing 3.2. Mixing the reflection and refraction samples using the Fresnel term.
3.3 Optimizations
3.3.1 User Defined Clip Planes in OpenGL ES 2.0
Although the programmability of the OpenGL ES 2.0 pipeline provides the flexi-
bility to implement this water effect, there is a drawback in that the API does not
have user-defined clip plane support, which is required to produce good quality
reflections and refractions. Many OpenGL ES 2.0 text books suggest performing
Figure 3.12. View frustum with n and f planes from original projection.
a render pass that uses a discard in the fragment shader so that fragments beyond
the user-defined clip plane will be ignored. Although this method works and will
produce the required output, using the discard keyword is highly inadvisable
because it means the hardware is unable to perform early depth testing, and in-
stead is forced to perform the full fragment shader pass. This cancels out specific
performance advantages offered by some GPUs, such as those using early Z mech-
anisms or tile-based deferred rendering (TBDR) which include the POWERVR
SGX architecture [PowerVR 10].
To solve this problem, a projection matrix modifying technique can be used
[Lengyel 04]. The projection matrix (M ) is used to convert all of the objects in
the scene from view-space coordinates into normalized device coordinates (NDC),
and part of this process is to clip objects that do not fall between the near, far, left,
right, top, and bottom planes of the view frustum (Figure 3.12). By considering
the function of the projection matrix in this way, it becomes apparent that there
is already a built-in mechanism for clipping. Clipping along a user-defined plane
can be achieved by altering the contents of the projection matrix, but this does
introduce a number of problems (which will be discussed later).
The first stage of this technique requires the user-defined clip plane (P~ ) to
be converted into view space. This can be done by multiplying the row vector
representing the plane’s coefficients (expressed in world space) by the inverse of
the view matrix:
C = P × M_view^(-1) = [Cx  Cy  Cz  Cw].
For this technique to work, the clipping plane must be facing away from the
camera, which requires the Cw component to be negative. This does restrict the
flexibility of the clipping method, but does not pose a problem for the clipping
required for the water effect.
Altering the clipping planes requires operations on the rows of the projection
matrix, which can be defined as
      | R1 |
M  =  | R2 | .
      | R3 |
      | R4 |
The near clipping plane (n) is defined from the projection matrix M as the third
row plus the fourth row, so these are the values that need to be altered:
n = R3 + R4.
For perspective correction to work, the fourth row must keep the values (0, 0, −1, 0).
For this reason, the third row has to be
R3 = [Cx  Cy  Cz + 1  Cw].
On the other hand, the far plane (f) is calculated using the projection matrix by
subtracting the third row from the fourth:
f = R4 − R3.
Unfortunately, changing the near plane from its default direction along the pos-
itive z-axis results in a skewed far plane that no longer remains parallel with
the near plane (Figure 3.13). This is due to the way in which the far plane is
calculated in the above formula.
Although this problem cannot be corrected completely, the effect can be min-
imized by scaling the clip plane before the third row is set, which causes the
orientation of the far clipping plane to change. Ideally, this scaling should result
in an optimized far plane that produces the smallest possible view frustum that
can still encapsulate the conventional view frustum (Figure 3.14).
To do this, the point Q that lies furthest opposite the near plane within
NDC must be calculated, using the following equation:
Q = M^(-1) [sgn(Cx)  sgn(Cy)  1  1].
The result of this calculation can then be used to determine the scaling factor (a)
that should be applied to the camera-space clip plane before it is used to alter
the projection matrix:
a = (2 R4 · Q) / (C · Q).
The camera-space plane can now be scaled before it is used to alter the projection
matrix, using the following calculation:
C = aC.
Although this technique may seem more difficult to understand than the dis-
card method of clipping, it is significantly faster because it allows the graphics
hardware to perform clipping at almost no additional cost.
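Putting the pieces together, a compact C++ sketch in the spirit of [Lengyel 04] might look as follows; it assumes a column-major, OpenGL-style perspective projection whose fourth row is (0, 0, −1, 0), and should be read as an illustration rather than the demo's exact code:

    static float Sgn(float a)
    {
        return (a > 0.0f) ? 1.0f : ((a < 0.0f) ? -1.0f : 0.0f);
    }

    // Modify a column-major projection matrix 'm' (16 floats) so that its near
    // plane becomes the camera-space clip plane C = (cx, cy, cz, cw), which must
    // face away from the camera (cw < 0).
    void SetObliqueNearPlane(float m[16], float cx, float cy, float cz, float cw)
    {
        // Q = M^-1 * (sgn(Cx), sgn(Cy), 1, 1): the frustum corner furthest
        // opposite the near plane.  For a standard perspective matrix this
        // inverse has the closed form below.
        float qx = (Sgn(cx) + m[8]) / m[0];
        float qy = (Sgn(cy) + m[9]) / m[5];
        float qz = -1.0f;
        float qw = (1.0f + m[10]) / m[14];

        // a = 2 (R4 . Q) / (C . Q); with R4 = (0, 0, -1, 0), R4 . Q = 1.
        float a = 2.0f / (cx * qx + cy * qy + cz * qz + cw * qw);

        // The third row becomes (a*Cx, a*Cy, a*Cz + 1, a*Cw).
        m[2]  = a * cx;
        m[6]  = a * cy;
        m[10] = a * cz + 1.0f;
        m[14] = a * cw;
    }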
Scale water distortion. Without scaling the amount of distortion that is applied
to each fragment, water in the distance can ultimately sample the reflection and
refraction textures at too large an offset, which gives water in the distance an un-
realistic degree of distortion. Additionally, the bigger offset for distant fragments
results in a higher amount of texture-read cache misses.
By scaling the amount of distortion that is applied to a given fragment, the
visual quality of the effect can be improved and the number of stall cycles caused
by texture cache misses can be reduced. This is done in the demo by dividing the
wave’s distortion value by the distance between the camera and the fragment’s
position (so fragments further from the camera are distorted less). The extra cycle
cost has a minimal impact on performance (less than 1% on the test hardware)
because, even though the texture-read stalls are reduced, they still account for
the main bottleneck.
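Expressed as a formula (the symbols are illustrative rather than taken from the demo's shader), the scaling applied to each fragment's distortion offset is

$$\text{offset}' = \frac{\text{offset}}{\lVert p_{\text{fragment}} - p_{\text{camera}} \rVert}.$$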
Render the water effect to a texture. Because of the heavy use of the fragment
shader to produce the effect, the demo tends to be fragment limited on most
hardware. To reduce this bottleneck, the water effect can be rendered to a texture
at a lower resolution and then applied to the water plane during the final render
pass. This technique benefits the speed of the demonstration by reducing the
number of fragments that are rendered using the water effect. This can be further
reduced (especially on a TBDR) by rendering objects that will obscure areas of
the water in the final render pass, such as the demo’s terrain. Although the
introduction of numerous objects to the render can improve the speed of the
water effect, the inaccuracies caused by mapping the texture to the final water
plane can result in artifacts around the edges of models that were used during
the low-resolution pass. Such artifacts are generally not that noticeable, provided
that the shaders used for the additional objects in the low-resolution pass are the
same as those used in the final render (i.e., rendering geometry without lighting
during the low-resolution pass will cause highlights around dark edges of models
in the final pass, so this should be avoided). One of the best ways to steer
clear of the problems caused by the scaling is to avoid drawing objects that are
very detailed around their edges that overlap the water because this reduces the
likelihood of artifacts occurring. In the demo, the boat is omitted from the water’s
render pass because it is too detailed to be rendered without causing artifacts
and does not afford as great a benefit as the terrain when covering areas of the
water.
When rendering to a texture at a 256 × 256 resolution and performing the
final render pass to a 640 × 480 screen, the reduction in quality is only slightly
noticeable, but on the test hardware the performance level is increased by ∼18%.
Removing artifacts at the water’s edge. One of the biggest problems with shader
effects that perturb texture coordinates is the lack of control over the end texel
that is chosen. Due to the clipping that is implemented in the reflection and
refraction render passes, it is very easy for artifacts to appear along the edges
of objects intersecting the water plane. These artifacts occur when
the sampled texel is taken from behind the object intersecting the water, which
results in the texture sample being either the clear color or geometry that ought
to be obscured, producing visible seams near the water's edge. The edge artifact
can be seen in Figure 3.8. To compensate, the clip-plane location is set slightly
above the water surface (a small offset along the positive y-axis). In the case of
the refraction render pass, such an offset will cause some of the geometry above
the water to be included in the rendered image, which helps to hide the visible
seams by sampling from this above-the-water geometry.
Although another inaccuracy is introduced because of the deliberately im-
perfect clipping, it is barely noticeable, and the effect of the original artifact is
effectively removed for very little additional computation. The same benefit ap-
plies to the reflected scene, although in this case the offset direction is reversed,
and clipping occurs slightly below the water. Figure 3.15 shows the scene with
the artifact fix in place.
Another way to compensate for the artifacts, and improve the aesthetics of
the effect, is to use fins or particle effects along the edges of objects intersecting
the water to give the appearance of a wake where the water is colliding with the
objects. The drawback of these techniques is that they both require the program
to know where in the scene objects are intersecting the water, which can be very
expensive if the water height is changing or objects in the water are moving
dynamically.
3.4 Conclusion
We have presented a technique that allows a water effect to be augmented onto
a simple plane, using several render passes, some simple distortion, and texture
mixing in the fragment shader. Additionally, we have presented optimal tech-
niques for user-defined clip planes, normalization, removing artifacts caused by
texture-coordinate perturbation, and have also highlighted the benefits of utiliz-
ing low-resolution render passes to reduce the fragment shader workload. The
result is a high-performance example with extremely compelling visual results.
Though this example targets the current low-cost OpenGL ES 2.0 capable de-
vices, it can be correspondingly scaled to take advantage of higher resolution
displays and increased GPU power.
Bibliography
[PowerVR 10] Imagination Technologies. “POWERVR SGX OpenGL ES 2.0 Application Development Recommendations.” [Link]/.../POWERVR%20SGX.OpenGL%20ES%202.0%20Application%20Development%20Recommendations, 2010.
[Fenney 03] Simon Fenney. “Texture Compression using Low-Frequency Signal Modulation.” In Proceedings Graphics Hardware, pp. 84–91. New York: ACM, 2003.
[Lengyel 04] Eric Lengyel. “Modifying the Projection Matrix to Perform Oblique Near-plane Clipping.” Terathon Software 3D Graphics Library, 2004. Available at http://[Link]/code/[Link].
[Pelzer 04] Kurt Pelzer. “Advanced Water Effects.” In ShaderX2. Plano, TX: Wordware Publishing, Inc., 2004.
VI
3D Engine
Design
—Wessam Bahnassi
1
VI
1.1 Introduction
With the complexity and interactivity of game worlds on the rise, the need for
efficient dynamic visibility is becoming increasingly important.
This article covers two complementary approaches to visibility determination
that have shipped in recent AAA titles across Xbox 360, PS3, and PC: Splinter
Cell Conviction and Battlefield: Bad Company 1 & 2.
These solutions should be of broad interest, since they are capable of handling
completely dynamic environments consisting of a large number of objects, with
low overhead, straightforward implementations, and only a modest impact on
asset authoring.
Before we describe our approaches in detail, it is important to understand
what motivated their development, through the lens of existing techniques that
are more commonly employed in games.
mentioned titles. We will now outline the problems encountered with occlusion
queries.
1.3.1 Batching
First, though OQs can be batched in the sense that more than one can be issued
at a time [Soininen 08]—thereby avoiding lock-step CPU-GPU synchronization—
one cannot batch several bounds into a single draw call with individual query
counters. This is a pity, since CPU overhead alone can limit the number of tests
to several hundred per frame on current-generation consoles, which may be fine
if OQs are used to supplement another visibility approach [Hastings 07], but is
less than ideal otherwise.
1.3.2 Latency
To overcome latency, and as a general means of scaling OQs up to large envi-
ronments, a hierarchy can be employed [Bittner et al. 09]. By grouping, via a
bounding volume hierarchy (BVH) or octree for instance, tests can be performed
progressively, based on parent results, with sets of objects typically rejected ear-
lier.
However, this dependency chain generally implies more CPU-GPU synchro-
nization within a frame since, at the time of this writing, only the CPU can issue
queries.2 Hiding latency perfectly in this instance can be tricky and may require
overlapping query and real rendering work, which implies redundant state changes
in addition to a more complicated renderer design.
1.3.3 Popping
By compromising on correctness, one can opt instead to defer checking the results
of OQs until the next frame—so called latent queries [Soininen 08]—which prac-
tically eliminates synchronization penalties, while avoiding the potential added
burden of interleaved rendering. Unfortunately, the major downside of this strat-
egy is that it typically leads to objects “popping” due to incorrect visibility clas-
sification [Soininen 08]. Figure 1.1 shows two cases where this can occur. First,
the camera tracks back to reveal object A in Frame 1, but A was classified as
outside of the frustum in Frame 0. Second, object B moves out from behind an
occluder in Frame 1 but was previously occluded in Frame 0.
Such artifacts can be reduced by extruding object-bounding volumes,3 simi-
larly padding the view frustum, or even eroding occluders. However, these fixes
come with their own processing overhead, which can make eliminating all sources
of artifacts practically impossible.
2 Predicated rendering is one indirect and limited alternative on Xbox 360.
3 A more accurate extrusion should take into account rotational as well as spatial velocity, as with continuous collision detection [Redon et al. 02].
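As a small sketch of the first of those fixes, the object's bounding box can be grown along its expected displacement before the latent query is issued, so that a result that is one frame old still covers the object. The structures and the single-frame assumption below are illustrative only:

struct Aabb
{
    float min[3];
    float max[3];
};

// Grow the box in the direction of travel by one frame's worth of movement,
// so that a query result consumed a frame later still encloses the object.
// A more accurate extrusion would also account for rotation [Redon et al. 02].
Aabb ExtrudeForLatency(const Aabb& bounds, const float velocity[3], float frameDt)
{
    Aabb extruded = bounds;
    for (int i = 0; i < 3; ++i)
    {
        const float displacement = velocity[i] * frameDt;
        if (displacement > 0.0f)
            extruded.max[i] += displacement;
        else
            extruded.min[i] += displacement;
    }
    return extruded;
}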
Figure 1.1. Camera or object movement can lead to popping with latent queries.
Sudden changes such as camera cuts are also problematic with latent queries,
potentially leading to either a visibility or processing spike [Hastings 07] to avoid
rampant popping. As such, it may be preferable to simply skip rendering for a
frame and only process visibility updates behind the scenes.
Render occluder depth. As with OQs, we first render the depth of a subset of the
scene, this time to a render target texture, which will later be used for visibility
testing, but in a slightly different way than before.
For Conviction, these occluders were typically artist authored4 for perfor-
mance reasons, although any object could be optionally flagged as an occluder
by an artist.
Create a depth hierarchy. The resulting depth buffer is then used to create a
depth hierarchy or z-pyramid, as in [Greene et al. 93]. This step is analogous
to generating a mipmap chain for a texture, but instead of successive, weighted
down-sampling from each level to the next, we take the maximum depth of sets
of four texels to form each new texel, as in Figure 1.2.
This step also takes place on the GPU, as a series of quad passes, reading
from one level and writing to the next. To simplify the process, we restrict
the visibility resolution to a power of two, in order to avoid the additional logic
of [Shopf et al. 08]. Figure 1.3 shows an example HZB generated in this way.
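The reduction performed by each of those passes is simply a maximum over 2 × 2 texels of the previous level. A CPU-style sketch of one reduction step is shown below for reference (the article performs this as GPU quad passes; names are illustrative, and the power-of-two restriction keeps the halving exact):

#include <algorithm>
#include <vector>

// Builds the next mip level (srcW/2 x srcH/2) from the previous one by taking
// the maximum depth of each 2x2 block of source texels.
std::vector<float> BuildNextHzbLevel(const std::vector<float>& src,
                                     int srcW, int srcH)
{
    const int dstW = srcW / 2;
    const int dstH = srcH / 2;
    std::vector<float> dst(dstW * dstH);

    for (int y = 0; y < dstH; ++y)
    {
        for (int x = 0; x < dstW; ++x)
        {
            const float d00 = src[(2 * y)     * srcW + 2 * x];
            const float d10 = src[(2 * y)     * srcW + 2 * x + 1];
            const float d01 = src[(2 * y + 1) * srcW + 2 * x];
            const float d11 = src[(2 * y + 1) * srcW + 2 * x + 1];
            dst[y * dstW + x] = std::max(std::max(d00, d10), std::max(d01, d11));
        }
    }
    return dst;
}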
In practice, we render at 512 × 256,5 since this seems to strike a good balance
between accuracy and speed. This could theoretically result in false occlusion for
objects of 2 × 2 pixels or less at native resolution, but since we contribution-cull
small objects anyway, this has not proven to be a problem for us.
4 These are often a simplified version of the union of several adjoining, structural meshes.
5 This is approximately a quarter of the resolution of our main camera in single-player mode.
Figure 1.3. The resulting depth hierarchy. Note that the sky in the distance increasingly
dominates at coarser levels.
Test object bounds. We pack object bounds (world-space AABBs) into a dynamic
point-list vertex buffer and issue the tests as a single draw call. For each point,
we determine, in the vertex shader, the screen-space extents of the object by
transforming and projecting the bounds (see Figure 1.4). From this, we calculate
the finest mip level of the hierarchy that covers these extents with a fixed number
of texels or fewer and also the minimum, projected depth of the object (see
Listing 1.1).
// Contains the dimensions of the viewport.
// In this case x = 512, y = 256
float2 cViewport;

bool visible = !FrustumCull(input.center, input.extents);

// Transform/project AABB to screen-space
float min_z;
float4 sbox;
GetScreenBounds(input.center, input.extents, min_z, sbox);

// Calculate HZB level
float4 sbox_vp = sbox * cViewport.xyxy;
float2 size = sbox_vp.zw - sbox_vp.xy;
float level = ceil(log2(max(size.x, size.y)));
return output ;
}
Figure 1.4. The object’s world-space AABB (blue), screen extents (green) and overlap-
ping HZB texels (orange).
This depth, plus the UVs (sbox is the screen-space AABB) and mip level for
HZB lookup are then passed to the pixel shader. Here we test for visibility by
comparing the depth against the overlapping HZB texels and write out 1 or 0 as
appropriate (see Listing 1.2).
sampler2D sHZB : register(s0);

float4 samples;
samples.x = tex2Dlod(sHZB, float4(sbox.xy, 0, level)).x;
samples.y = tex2Dlod(sHZB, float4(sbox.zy, 0, level)).x;
float max_z = max4(samples);

return visible;
}
Process the results. Finally, the results are read back to the CPU via MemExport
on Xbox 360. On PC, under DX9, we instead emulate DX10 stream-out by
rendering with a point size of one to an off-screen render-target, followed by a
copy to system memory via GetRenderTargetData.
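For reference, a hedged sketch of that PC read-back path is given below. Resource creation is omitted, and the one-byte-per-object layout of the results surface is purely an assumption of the sketch:

#include <d3d9.h>

// resultsRT:  render target the visibility results were rendered into.
// sysMemCopy: off-screen plain surface in D3DPOOL_SYSTEMMEM with matching
//             size and format (created elsewhere).
HRESULT ReadBackVisibility(IDirect3DDevice9* device,
                           IDirect3DSurface9* resultsRT,
                           IDirect3DSurface9* sysMemCopy,
                           unsigned char* outFlags, UINT numObjects)
{
    // GPU -> system memory copy; this call synchronizes with the GPU, so it is
    // typically issued as late as possible (or a frame later).
    HRESULT hr = device->GetRenderTargetData(resultsRT, sysMemCopy);
    if (FAILED(hr))
        return hr;

    D3DLOCKED_RECT lr;
    hr = sysMemCopy->LockRect(&lr, NULL, D3DLOCK_READONLY);
    if (FAILED(hr))
        return hr;

    // Assumes one 8-bit visibility value per object, laid out along one row.
    const unsigned char* src = static_cast<const unsigned char*>(lr.pBits);
    for (UINT i = 0; i < numObjects; ++i)
        outFlags[i] = src[i];

    sysMemCopy->UnlockRect();
    return S_OK;
}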
1.5.2 Tradeoffs
By using a fixed number of lookups instead of rasterization, the performance of
the visibility tests is highly predictable for a given number of objects. That said,
this bounded performance comes at the cost of reduced accuracy for objects that
are large on screen.
On the other hand, this approach can be viewed as probabilistic: large objects
are, on average, more likely to be visible anyway, so performing more work (in
the form of rasterization with OQs) is counter-productive. Instead, with HZB
testing, accuracy is distributed proportionally. This proved to be a particularly
good fit for us, given that we wanted a lot of relatively small clutter objects, for
which instancing was not appropriate for various reasons.
We also benefited from the high granularity afforded by a query per object,
whereas wholly OQ-based methods require some degree of aggregation in order
to be efficient, leading to reduced accuracy and more variable performance. This
became clear in our own analysis when we switched to HZB visibility from OQs.
We started off with a 2 × 2 depth-test configuration, and even that out-performed
1.5.3 Performance
Table 1.1 represents typical numbers seen in PIX on Xbox 360, for a single camera
with around 22000 objects, all of which are processed in each frame.
1.5.4 Extensions
Once you have a system like this in place, it becomes easy to piggy-back related
work that could otherwise take up significant CPU time compared with the GPU,
which barely breaks a sweat. Contribution fading/culling, texture streaming and
LOD selection, for instance, can all be determined based on each object’s screen
extents,6 with results returned in additional bits.
On Xbox 360, we can also bin objects into multiple tiles ourselves, thereby
avoiding the added complexity and restrictions that come with using the predi-
cated tiling API, not to mention the extra latency and memory overhead when
double-buffering the command buffer.
Finally, there is no reason to limit visibility processing to meshes. We also
test and cull lights, particle systems, ambient occlusion volumes [Hill 10], and
dynamic decals.
Figure 1.5 shows this in action for a parallel light source. Here, caster C is
fully behind an occluder,8 so it can be culled away since it will not contribute to
the shadow map.
In the second pass, we transform these shafts into camera space and test their
visibility from the player’s point of view via the existing player camera HZB—
again just like regular objects. Here, since the shafts of A and B have been
clamped to the occluder underneath, they are not visible either.
float level_new = max(level - 1, 0);
float2 scale = pow(2, -level_new);
float2 a = floor(sbox_vp.xy * scale);
float2 b = ceil(sbox_vp.zw * scale);
float2 dims = b - a;
7 But we could adapt this type of testing to cull more. See Section 1.7.
8 Occluders used for shadow culling always cast shadows.
1.5.6 Summary
To reiterate, this entire process takes place as a series of GPU passes; the CPU is
involved only in dispatching the draw calls and processing the results at the end.
In retrospect, a CPU solution could have also worked well as an alternative,
but we found the small amount of extra GPU processing to be well within our bud-
get. Additionally, we were able to leverage fixed-function rasterization hardware,
stream processing, and a mature HLSL compiler, all with literally man-years of
optimization effort behind them. In contrast to the simple shaders listed earlier,
a hand-optimized VMX software rasterizer would have taken significantly longer
to develop and would have been harder to extend.
If you already have a PVS or portal visibility system, there can still be sig-
nificant benefits to performing HZB processing as an additional step. In the first
place, either system can act as an initial high-level cull, thus reducing the num-
ber of HZB queries. In the case of portals, the “narrow-phase” subfrusta testing
could also be shifted to the GPU. Indeed, from our own experience, moving basic
frustum testing to the GPU alone was a significant performance improvement
over VMX tests on the CPU. Finally, in the case of BSP-based PVS, the faces
could be preconverted to a number of large-scale occluders for direct rendering.
Terrain triangle setup. This is effectively the same as the previous stage, except
that it generates and adds conservative triangles for the terrain9 to the array.
Occluder render. This is the stage that actually rasterizes the triangles. Each
SPU job has its own z-buffer (256 × 114) and grabs 16 triangles at a time from
the triangle array generated previously.
When the jobs are finished getting triangles from the triangle array, they will
each try to lock a shared mutex. The first one will simply DMA its z-buffer to
main memory, unlock the mutex, and exit so that the next job can start running.
As the mutex gets unlocked, the next job will now merge its own buffer with
the one in main memory and send back the result, and so on. (Note: There are
9 As the terrain can deform, these must be regenerated.
several ways to improve on this and make it faster. We could, for example, DMA
directly from each SPU.)
Frustum cull. This stage performs frustum versus sphere/bounding box (BB)
checks on all meshes in the world—typically between 10,000 and 15,000—and
builds an array for the next stage. The implementation traverses a tree of spheres
(prebuilt by our pipeline) and at each leaf we do bounding-box testing if the
sphere is not fully inside.
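A sketch of that traversal is given below, with placeholder structures and standard plane-based tests standing in for whatever the engine already provides; only the control flow (sphere tree first, box test only at straddling leaves) is the point being illustrated.

#include <cmath>
#include <vector>

struct Plane   { float n[3]; float d; };   // n.x + d, normal pointing inward
struct Frustum { Plane planes[6]; };

enum CullResult { OUTSIDE, INTERSECTS, INSIDE };

struct Mesh { float center[3]; float extent[3]; };   // AABB as center/half-extents

struct SphereNode
{
    float center[3];
    float radius;
    std::vector<SphereNode*> children;   // empty for leaves
    std::vector<int>         meshes;     // mesh indices stored at leaves
};

static float PlaneDist(const Plane& p, const float c[3])
{
    return p.n[0]*c[0] + p.n[1]*c[1] + p.n[2]*c[2] + p.d;
}

static CullResult ClassifySphere(const Frustum& f, const float c[3], float r)
{
    CullResult res = INSIDE;
    for (const Plane& p : f.planes)
    {
        const float d = PlaneDist(p, c);
        if (d < -r) return OUTSIDE;      // completely behind one plane
        if (d <  r) res = INTERSECTS;    // straddles this plane
    }
    return res;
}

static bool AabbVisible(const Frustum& f, const Mesh& m)
{
    for (const Plane& p : f.planes)
    {
        const float r = m.extent[0]*std::fabs(p.n[0]) +
                        m.extent[1]*std::fabs(p.n[1]) +
                        m.extent[2]*std::fabs(p.n[2]);
        if (PlaneDist(p, m.center) < -r)
            return false;
    }
    return true;
}

void CullSphereTree(const Frustum& f, const SphereNode& node,
                    const std::vector<Mesh>& meshes, std::vector<int>& visible)
{
    const CullResult r = ClassifySphere(f, node.center, node.radius);
    if (r == OUTSIDE)
        return;                          // reject the whole subtree

    if (node.children.empty())
    {
        for (int i : node.meshes)
        {
            // Box test only when the sphere is not fully inside.
            if (r == INSIDE || AabbVisible(f, meshes[i]))
                visible.push_back(i);
        }
        return;
    }

    for (const SphereNode* child : node.children)
        CullSphereTree(f, *child, meshes, visible);
}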
Occlusion cull. Finally, this is where visibility testing against the z-buffer hap-
pens. We first project the bounding box of the mesh to screen-space and calculate
its 2D area. If this is smaller than a certain value—determined on a per-mesh
basis—it will be immediately discarded (i.e., contribution culled).
Then, for the actual test against the z-buffer, we take the minimum distance
from the camera to the bounding box and compare it against the z-buffer over
the whole screen-space rectangle. This falls somewhere between the approach
of [Woodard 07]—which actually rasterizes occluders—and that of Conviction in
terms of accuracy.
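Structurally, the per-mesh test then looks something like the following sketch. The screen-rectangle computation, the per-mesh threshold, and the depth convention (larger values are farther here) are placeholders of this sketch rather than the shipped code:

struct ScreenRect { int x0, y0, x1, y1; };   // inclusive texel bounds, pre-clamped

bool IsMeshVisible(const float* zBuffer, int zbWidth,
                   const ScreenRect& rect, float minDepth,
                   float screenArea, float contributionThreshold)
{
    // Contribution cull: tiny on-screen meshes are discarded outright.
    if (screenArea < contributionThreshold)
        return false;

    // Visible if the closest point of the bounding box is nearer than the
    // stored occluder depth anywhere inside the screen-space rectangle.
    for (int y = rect.y0; y <= rect.y1; ++y)
        for (int x = rect.x0; x <= rect.x1; ++x)
            if (minDepth < zBuffer[y * zbWidth + x])
                return true;

    return false;   // every covered texel is closer than the mesh: occluded
}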
Performance. The timings reflect best-case parallelism over five SPUs and were
measured in a typical scene (see Table 1.2). In practice, workloads between SPU
jobs will vary slightly and may be intermixed with other jobs, so the overall time
for visibility processing will be higher in practice.
In this case we rasterized around 6000 occluder triangles (we normally observe
3000 to 5000), and performed around 3000 occlusion tests after frustum and
extent culling.
1.7.2 Optimizations
One trivial optimization for the GPU solution would be to add a pre-pass, testing
a coarse subdivision of the scene (e.g., regular grid) to perform an earlier, high-
level cull—just like in Battlefield, but using the occlusion system too. We chose
not to do this since performance was already within our budget, but it would cer-
tainly allow the approach to scale up to larger environments (e.g., “open world”).
Additionally, a less accurate object-level pre-pass (for instance, four HZB sam-
ples using the bounding sphere, as with [Shopf et al. 08]) could lead to a speed
up wherever there is a reasonable amount of occlusion (which by necessity is a
common case). Equally, a finer-grained final pass (e.g., 8 × 8 HZB samples) could
improve culling of larger occluders.
In a similar vein, another easy win for the SPU version would be using a
hierarchical z-buffer either for early rejection or as a replacement for a complete
loop over the screen bounds. As earlier numbers showed, however, the main
hotspot performance-wise is occluder rasterization. In that instance we might
Figure 1.6. A single screen-space z-value for occluders can lead to conservative accep-
tance in some cases.
gain again, this time from hierarchical rasterization as in [Abrash 09], although
at the cost of increased implementation complexity. Frustum culling could also
be sped up by switching to a different data structure (e.g., grid) to improve load
balancing on SPUs as well as memory access patterns.
Although the accuracy-performance trade-off from the HZB was almost always
beneficial for Conviction, we did encounter a couple of instances where we could
have profited from better culling of large, structural geometry. We believe that
the biggest factor here was the lack of varying z over the occluder (see Figure 1.6)
when testing against the HZB, not the number of tests (beyond 4×4) or the base
resolution.
On Xbox 360, we investigated hardware-rasterizing occluder bounds as a
proof-of-concept for overcoming this, but we ran out of time and there were
some performance pitfalls with MemExport. We hope to pick up where we left
off in the future.
Conviction’s shadow-caster culling proved to be a significant optimization
for cascaded shadow maps. One potential avenue of future development would
be to try to adapt the idea of frustum subdivision coupled with caster-receiver
intersection testing, as presented in [Diamand 10], with similarities to [Lloyd
et al. 04]. [Eisemann and Décoret 06] and [Décoret 05] also build on the latter.
We would also like to extend culling to local shadow lights. As we already
cache casters per shadow light (the cache is updated on object or light movement),
we could directly evaluate shadow visibility for this subset of the scene. This
would avoid the higher fixed overhead of processing all objects in the map as we
do for the main view or shadow cascades, which is important since we can have
up to eight active shadow lights per camera. These updates could happen either
every frame or whenever the list changes.
1.8 Conclusion
Whatever the future, experimenting with solutions like these is a good invest-
ment; in our experience, we gained significantly from employing these fast yet
straightforward visibility systems, both in development and production terms.
The GPU implementation in particular is trivial to add (demonstrated by the
fact that our initial version was developed and integrated in a matter of days)
and comes with a very reasonable overhead.
1.9 Acknowledgments
We would like to thank Don Williamson, Steven Tovey, Nick Darnell, Christian Desau-
tels, and Brian Karis, for their insightful feedback and correspondence, as well as the
authors of all cited papers and presentations, for considerable inspiration.
Bibliography
[Abrash 96] Michael Abrash. “Inside Quake: Visible-Surface Determination.” Dr.
Dobb’s Sourcebook Jan/Feb (1996), 41–45.
[Abrash 09] Michael Abrash. “Rasterization on Larrabee.” In Game Developer’s Conference, 2009.
[Akenine-Möller et al. 08] Tomas Akenine-Möller, Eric Haines, and Naty Hoffman.
Real-Time Rendering, Third edition. Natick, MA: A K Peters, 2008.
[Andersson 10] Johan Andersson. “Parallel Futures of a Game Engine v2.0.” In STHLM
Game Developer Forum, 2010.
[Bittner et al. 09] Jiřı́ Bittner, Oliver Mattausch, and Michael Wimmer. “Game Engine
Friendly Occlusion Culling.” In ShaderX7 , pp. 637–653. Hingham, MA: Charles
River Media, 2009.
[Blinn 96] Jim Blinn. “Calculating Screen Coverage.” IEEE CG&A 16:3 (1996), 84–88.
[Décoret 05] Xavier Décoret. “N-Buffers for Efficient Depth Map Query.” Computer
Graphics Forum (Eurographics) 24:3 (2005), 8 pp.
[Diamand 10] Ben Diamand. “Shadows in God of War III.” In Game Developer’s
Conference, 2010.
[Eisemann and Décoret 06] Elmar Eisemann and Xavier Décoret. “Fast Scene Voxeliza-
tion and Applications.” In Proceedings of the 2006 Symposium on Interactive 3D
Graphics and Games, I3D ’06, pp. 71–78. New York: ACM, 2006.
[Greene et al. 93] Ned Greene, Michael Kass, and Gavin Miller. “Hierarchical Z-buffer
Visibility.” In Proceedings of the 20th Annual Conference on Computer Graphics
and Interactive Techniques, SIGGRAPH ’93, pp. 231–238. New York: ACM, 1993.
[Hastings 07] Al Hastings. “Occlusion Systems.” [Link]
research dev/articles/2007/1500779, 2007.
[Hill 10] Stephen Hill. “Rendering with Conviction.” In Game Developer’s Conference,
2010.
[Lloyd et al. 04] Brandon Lloyd, Jeremy Wendt, Naga Govindaraju, and Dinesh
Manocha. “CC Shadow Volumes.” In ACM SIGGRAPH 2004 Sketches, SIG-
GRAPH ’04, p. 146. New York: ACM, 2004.
[Redon et al. 02] Stephane Redon, Abderrahmane Kheddar, and Sabine Coquillart.
“Fast Continuous Collision Detection between Rigid Bodies.” Computer Graph-
ics Forum 21:3 (2002), 279–288.
i i
i i
i i
i i
[Shopf et al. 08] Jeremy Shopf, Joshua Barczak, Christopher Oat, and Natalya
Tatarchuk. In ACM SIGGRAPH 2008 Classes, SIGGRAPH ’08, pp. 52–101. New
York: ACM, 2008.
[Soininen 08] Teppo Soininen. “Visibility Optimization for Games.” Gamefest, 2008.
Microsoft Download Center, Available at [Link]
/[Link]?FamilyId=B9B33C7D-5CFE-4893-A877-5F0880322AA0&displaylang
=en, 2008.
[Woodard 07] Bruce Woodard. “SPU Occlusion Culling.” In SCEA PS3 Graphics
Seminar, 2007.
2
VI
2.1 Introduction
Algorithmic optimization and level of detail are very pervasive topics in real-time
rendering. With each rendering problem comes the question of the acceptable
amount of approximation error and the quality vs. performance trade-off of in-
creasing or decreasing approximation error. Programmable hardware pipelines
play one of the largest roles in how we optimize rendering algorithms because they
dictate where we can add algorithmic modification via programmable shaders.
In this article we analyze one particular aspect of modern programmable
hardware—the pixel derivative instructions and pixel quad rasterization—and
we identify a new level at which optimizations can be performed. Our work
demonstrates how values calculated in one pixel can be passed to neighboring
pixels in the frame buffer allowing us to amortize the cost of expensive shading
operations. By amortizing costs in this manner we can reduce texture fetches
and/or arithmetic operations by factors of two to sixteen times. Examples in this
article include 4×4 percentage closer filtering (PCF) using only one texture fetch,
and 2 × 2 bilateral upsampling using only one or two texture fetches. Our ap-
proach works using a technique we call pixel quad amortization (PQA). Although
our approach already works on a large set of existing hardware, we propose some
standards and extensions for future hardware pipelines, or software pipelines, to
make it ubiquitous and more efficient.
Figure 2.1. Hybrid derivatives: each pixel in the quad (a, b, c, d).
frame buffer; they are essentially double width pixels. Second, we found only
two approaches to computing derivatives within these quads, as illustrated in
Figure 2.1.
Interestingly, Shader Model 5 in Direct3D 11 has both coarse and fine ver-
sions of the derivative instructions, likely exposing the trade-offs of these two
approaches to the developer. Half-resolution derivatives always return the same
value within a quad, allowing for optimized texture sampling in some cases, while
hybrid derivatives have the potential to provide slightly more accurate results.
It is important to note at this point that although derivative instructions
were created to assist with texture mapping, they are not reserved for computing
derivatives of texture coordinates. You can use the derivative instructions to
calculate the derivative of any value in a shader. One obvious question that
arises is, what happens when a triangle does not cover all the pixels in a given
quad, or if some pixels in a quad are rejected by the depth test? Another question
is, how does graphics hardware synchronize all the seemingly independent shader
programs such that derivatives can be calculated anywhere? The answer is that
in the real shader processing core the “loop” over all the pixels in a triangle
is unrolled into blocks of at least four pixels. So all quad pixels are always
calculated in lockstep and in parallel, likely even sharing the same set of real
hardware registers. The shader program will execute for all the pixels in a quad
even if only one pixel is actually needed. In the event that a quad pixel falls
outside of a triangle, the values passed down from the vertices are extrapolated
using the triangle’s homogeneous coordinates.
a + ddx(a) = a + (b − a) = b.
b − ddx(b) = b − (b − a) = a.
So to generically pass a value v horizontally within a quad and get the hori-
zontal neighbor h, we compute
h = v − signx ∗ ddx(v),
where signx denotes the sign of x in the quadrant of the current pixel within a
quad. Although we can not access the pixel diagonally across from the current
pixel directly, we can determine the horizontal neighbor followed by the vertical
neighbor of that value. An example that computes all three neighbors is as
follows:
// Gather four float4s
void QuadGather2x2(float4 value,
                   out float4 horz,
                   out float4 vert,
                   out float4 diag)
{
    horz = value + ddx(value) * QuadVector.z;  // Horizontal
    vert = value + ddy(value) * QuadVector.w;  // Vertical
    diag = vert  + ddx(vert)  * QuadVector.z;  // Diagonal
}
If we need to gather only one or two values instead of a full float4 vector,
we can optimize this calculation down to as little as two MAD instructions and
two derivative instructions:
// Gather four floats into one float4
float4 QuadGather2x2(float value)
{
    float4 r = value;
    r.y  = r.x  + ddx(r.x)  * QuadVector.z;  // Horizontal
    r.zw = r.xy + ddy(r.xy) * QuadVector.w;  // Vertical / Diagonal
    return r;
}
In both of these examples we used the variable QuadVector. Figure 2.2 illus-
trates the value of QuadVector for each pixel in a quad. Most of the optimiza-
tions we perform in this chapter rely on this vector and one other variable called
QuadSelect. QuadVector is used to divide two-dimensional symmetric problems
into four parts, while QuadSelect is used to choose between two values based on
the current pixel’s quadrant.
The following code demonstrates one way to calculate QuadVector and
QuadSelect from a pixel’s screen coordinates. The negated/flipped values are
also useful and are stored in z/w components.
void InitQuad(float2 screenCoord)
{
    // This assumes screenCoord contains an integer pixel coordinate
    ScreenCoord = screenCoord;
    QuadVector  = frac(screenCoord.xy * 0.5).xyxy;
    QuadVector  = QuadVector * float4(4, 4, -4, -4) + float4(-1, -1, 1, 1);
    QuadSelect  = saturate(QuadVector);
}
Figure 2.2. To initialize PQA we calculate two simple values for each pixel. QuadVector
contains the x/y sign of the pixel within its quad and is used to perform symmetric
operations, while QuadSelect is used to choose between values based on the pixel's
location in the quad.
provides a list of hardware that supports hybrid derivatives at the time this ar-
ticle was written. It is also possible that a hardware vendor could change the
way the derivative instructions work, breaking this functionality. Although this
seems very unlikely, it is easy enough to write a detection routine to test which
type of derivatives are used.
The second problem that becomes immediately apparent is that there is no in-
terpolation between quads as there would be from a pre-rendered half-resolution
buffer. Thus, if we output the same value for an entire quad, it will resemble
unfiltered point sampling from a half-resolution frame buffer. This may be ac-
ceptable in certain situations, but if we want higher quality results, we still need
to compute unique values for each pixel. Our ability to produce pleasing results
really depends on the specific problem.
The third problem is that quad-level calculations work effectively only in the
current triangle’s domain. For example, we can use pixel quad amortization
to accelerate PCF shadow-map sampling in forward rendering, but not nearly
as easily in deferred rendering. This is because in the deferred case the quads
being rendered are not in object space; thus, a pixel quad may straddle a depth
discontinuity, creating a large gap in shadow space. In forward rendering, the
entire quad will project into a contiguous location in shadow space, which is
what we rely on to amortize costs effectively.
Although there are a number of drawbacks to PQA, we found we could solve
these issues for several common graphics problems and still achieve large per-
formance gains. In the following sections we will discuss how to optimize PCF,
bilateral upsampling, and basic convolution and blurring with PQA.
// Gather quad horizontal/vertical/diagonal samples
float2 AO_D, AO_D_H, AO_D_V, AO_D_D;
AO_D.x = tex2D(lowResDepthSampler, coord).x;
AO_D.y = tex2D(lowResAOSampler,    coord).x;
QuadGather2x2(AO_D, AO_D_H, AO_D_V, AO_D_D);
The bilateral upsample can then be performed as usual for each pixel, with
the caveat that tent weights will need to flip to compensate for the samples being
flipped in each pixel. A similar approach can be taken for a 4X upsample, or for
bilateral blurring operations at any resolution. One extra thing to note is that
the low-resolution buffer is shifted half a pixel (see Figure 2.3).
// Populate messages for neighbors
float4 m = 0;
m.rgba += tex2D(imageSampler, coord).x;
m.rb   += tex2D(imageSampler, coord + QuadVector.xy * float2(TEXEL_SIZE.x, 0)).x;
m.rg   += tex2D(imageSampler, coord + QuadVector.xy * float2(0, TEXEL_SIZE.y)).x;
m.r    += tex2D(imageSampler, coord + QuadVector.xy * float2(TEXEL_SIZE.xy)).x;
Figure 2.4. Illustration of a 5 × 5 blur using PQA. The blur kernel footprint of four
pixels in a quad (left). Samples taken by each pixel in the quad (middle). Uniquely
weighted messages from the red pixel to other pixels in the quad (right).
// Gather messages
float4 h, v, d;
QuadGather2x2(m, h, v, d);

// Weight results for 3x3 blur
float4 result = dot(float4(4, 2, 2, 1) / 9.0,
                    float4(m.x, h.g, v.b, d.w));
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
    {
        shad += ShadowSample(Map, Coord,
                             SM_TEXEL * float2(i - (N / 2.0 - 0.5),
                                               j - (N / 2.0 - 0.5)));
    }
shad /= (N * N);
Many graphics cards support native bilinear PCF filtering, and this section
assumes we have at least bilinear PCF support. Some more recent graphics cards
support fetching four depth values at once, allowing the user to arbitrarily filter
them in the shader. Since utilizing bilinear PCF is more difficult in our case, but
is supported on a much wider set of hardware, we will focus on using bilinear
PCF. Extensions to Gather instructions can further improve results.
Since we cannot access the result of each pixel when using bilinear PCF, we
start by applying an approach from [Sigg and Hadwiger 05, Gruen 10], which
uses bilinear samples to build efficient larger filters. This involves using sample
offsets such that each bilinear sample fetches four uniquely weighted samples. In
the most simple case, where we want equal weights, this simply means placing a
bilinear PCF sample in the middle of the four texels we want:
// Fraction of a pixel
float2 a = frac(Coord.xy * SM_SIZE - 0.5);

// Negative/positive offsets to compute equal weights
float4 Offset = a.xyxy * -(SM_TEXEL) +
                float4(-0.5, -0.5, 1.5, 1.5) * SM_TEXEL;

float4 taps;
taps.x = ShadowSample(Map, Coord, Offset.xw);
taps.y = ShadowSample(Map, Coord, Offset.zw);
taps.z = ShadowSample(Map, Coord, Offset.xy);
taps.w = ShadowSample(Map, Coord, Offset.zy);
This approach can apply to arbitrary separable filters as we will see later,
but for now we will keep things simple. To apply PQA, we replace the offset
calculation with one that uses the quadrant vector, and then take one sample at
each pixel, followed by a quad average:
Figure 2.5. Half-resolution 4 × 4 PCF using quad LOD. The colored pixels correspond
to the projection of one pixel quad into shadow space. Each pixel performs only one
texture fetch, followed by a pixel quad average. The close-up illustrates half-resolution
point sampling artifacts.
// Average coordinate for quad
Coord.xy = QuadAve2x2(Coord.xy);

// Fraction of a pixel
float2 a = frac(Coord.xy * SM_SIZE - 0.5);

// Negative or positive offset to compute equal weights
float2 Offset = (-a + 0.5 + QuadVector.xy) * SM_TEXEL;

float tap = ShadowSample(Map, Coord, Offset);
float shadow = QuadAve2x2(tap);
We first compute the average texture coordinate for the quad. We then use
the quadrant vector to select only the offset we need. Last, we take one sample
in each pixel and then average the results. This is illustrated in Figure 2.5. Note
that the offset calculation was also reduced from a float4 to float2 calculation.
At this point we are doing a lot of extra work to save only three samples,
but once we extend this to larger kernels it starts to become quite effective. For
example, if we use four samples per pixel we can now achieve 8 × 8 PCF (64 total
texels) with only four bilinear samples, for a 16X improvement over the naive
approach. The layout of these samples is illustrated in Figure 2.6 (right).
// Low and high offsets for this pixel
float4 lOhO = (-a.xyxy + QuadVector.xyxy + 0.5 +
               float4(-2, -2, 2, 2)) * SM_TEXEL;

float4 t;
t.x = ShadowSample(Map, Coord, lOhO.xy);
t.y = ShadowSample(Map, Coord, lOhO.xw);
t.z = ShadowSample(Map, Coord, lOhO.zy);
t.w = ShadowSample(Map, Coord, lOhO.zw);

float shadow = PixelAve2x2(dot(t, 0.25));
Figure 2.6. Different sample placements color coded by quad pixel. Simply mirroring
samples is easier but results in samples that are very far apart, which can degrade
cache performance (left). Local mirroring results in samples that are closer together
but offsets can be more difficult to calculate symmetrically (right).
Although we can now sample very large kernels, we are outputting the same
value for each pixel in the quad, resulting in quad-sized point sampling artifacts.
Noncontinuous PCF is also quite undesirable, so it is important to add at least
first-order continuity to our filter. We will now tackle both of these issues.
Higher-order filtering is more complicated since shadow texels are not located
at fixed distances from the sampling location, thus weights need to be calculated
dynamically. The most recent approach [Gruen 10] to achieving higher-order
PCF filtering involves solving a small linear system for each sample to find the
correct weights and offsets. The linear system is based on all the bilinear samples
that would have touched the same texels.
We note that this can be largely simplified by using the work from [Sigg and
Hadwiger 05]. Instead of replicating the weights produced by several bilinear
samples and a grid of weights, we determine the weight for each texel using
an analytic filter kernel. Because the kernel is separable, we can compute the
sample offsets and weights separately for each axis. This is demonstrated using
a full-sampled Gaussian kernel below.
#define SIGMA (SM_TEXEL * 2)
#define ONE_OVER_TWO_SIGMA_SQ (1.0 / (2.0 * SIGMA * SIGMA))
#define GAUSSIAN(v) (exp(-(v * v) * ONE_OVER_TWO_SIGMA_SQ))

float4 linstep(float4 min, float4 max, float4 v)
{
    return saturate((v - min) / (max - min));
}
float4 FilterWeight(float4 offset, const int filterType)
{
    switch (filterType)
    {
        case 0:
            return LinearStepFilterWeight(offset, texelWidth);
        case 1:
            return SmoothStepFilterWeight(offset, texelWidth);
        case 2:
            return GaussianFilterWeight(offset, texelWidth);
    }
}

// Low and high pixel center offsets (local mirrored)
float4 offsets0 = (-a.xyxy + QuadVector.xyxy + float4(-2, -2, 2, 2)) * SM_TEXEL;
float4 offsets1 = offsets0 + SM_TEXEL;

// Filter weights and offsets
float4 g0 = FilterWeight(offsets0, filterType);
float4 g1 = FilterWeight(offsets1, filterType);
float4 g01 = g0 + g1;
float4 bilinearOffsets = offsets0 + (g1 / g01) * SM_TEXEL;
shadow_weight.xy = QuadAve(shadow_weight.xy);

// Normalize our sample weight
float shadow = shadow_weight.x / shadow_weight.y;
return shadow;
}
}
We have shown a few Gaussian filters for both simplicity and readability; in
practice, we prefer to use linear, quadratic, or cubic B-spline kernels. Note that
we do not need to calculate weights for each texel, but rather for each row and
column of texels. Bilinear offsets can then similarly be computed separately and
weights simplified to the product between the sum of X and Y weights. The
same approach can be applied for piecewise polynomial filters such as B-Splines,
or using arbitrary filters with the offsets and weights stored in lookup textures
as in [Sigg and Hadwiger 05].
At this point we now have very smooth shadows but still have the same value
for all pixels in a quad. To smooth the point-sampled look, it would be optimal to
bound all quad texels in shadow space and create a uniquely weighted kernel for
each pixel, but without Gather() capability that would involve performing four
times the samples.

Figure 2.7. Gradient estimation for 8 × 8 bilinear PCF (four samples). These images
are magnified to illustrate how even very naive gradient estimation can hide most quad
artifacts. If using Shader Model 4 or 5, Gather samples can be used to avoid these
artifacts altogether.

We found that a good compromise when using bilinear PCF
is to compute a simple gradient approximation along with the shadow value.
Rather than using every texel to compute the gradient, we simply reuse the
bilinear-filtered samples as if they had come from a lower-resolution shadow map.
Although this is somewhat of a hack, thankfully it actually works quite well (see
Figure 2.7). The weights for the derivative calculation will depend on the kernel
itself (see Figure 2.8). The following code calculates a 4 × 4 Prewitt gradient
which works well for low-order B-spline filters:
// Gradient estimation using Prewitt 4x4 gradient operator
float4 s_dxdy;
s_dxdy.xy = dot(taps, weights);

// Prewitt (x)
s_dxdy.z = dot(taps, float4(3, 3, 1, 1) * QuadVector.x);
// Prewitt (y)
s_dxdy.w = dot(taps, float4(3, 1, 3, 1) * QuadVector.y);

s_dxdy = QuadAve(s_dxdy) * 4;
float shadow = s_dxdy.x;
Figure 2.8. All of the images in Part II, Chapter 1 also make use of PQA for PCF
shadows. This image uses 8 × 8 bilinear PCF filtering (four samples) and half the
original ALU operations. Part of the shadow penumbra is used to mimic scattering
fall-off, thus an inexpensive wide PCF kernel is crucial. Mesh and textures courtesy of
XYZRGB.
The last issue that needs to be mentioned is handling anisotropy and mini-
fication. Our gradient estimate works well on close-ups and will handle minifi-
cation up to the size of the kernel used. However, under extreme minification
the distance between the quad pixels in shadow space increases, and the linear
gradient estimate breaks down. There are a number of ways we can deal with
this. Firstly, if using a technique like cascading shadow maps (CSMs) we are
unlikely to experience extreme minification since shadow resolution should be
distributed somewhat equally in screen space. In other cases, one option is to
generate mipmaps of the shadow map, allowing us to increase the kernel size to
fit the footprint of the quad in shadow space. Alternatively, we can also forgo
generating mipmaps and just sparsely sample a larger footprint in the shadow
map. We have found that both of these solutions work adequately. Again, having
Gather() support opens up several more options.
2.10 Discussion
We have demonstrated a new approach for optimizing shaders, by amortizing
costly operations across pixel quads, that is natively supported by a large set of
existing hardware. Our approach has the advantage of not requiring additional
passes over the scene unlike other frame buffer LOD approaches. It also poten-
tially allows for sharing redundant calculations and temporary registers between
pixels, while still performing the final calculation at full resolution. We have
also demonstrated how gradients can be used to generate smooth results within
a quad while still supporting bilinear texture fetches. The primary drawback of
our approach remains the lack of interpolation between neighboring quads that
would be provided with something like bilateral upsampling. Interestingly, how-
ever, our technique can help in either case, since our technique can also accelerate
the bilateral upsampling operation itself.
Should PQA become a popular technique, hardware or software pipelines
could make it much more efficient by exposing the registers of neighboring pixels
directly in the pixel shader. Native API support for sharing registers between
pixels would greatly simplify writing amortized shaders. The current cost of
sharing results via derivative instructions makes it prohibitive in some cases.
We have found that our approach can also be applied to other rendering
problems, such as shadow-contact hardening, ambient occlusion, and global illu-
mination. Although we cannot verify this at the time of writing, it also appears
that all future hardware that supports Direct3D 11’s fine derivatives will sup-
port PQA.
Bibliography
[Gruen 10] Holger Gruen. “Fast Conventional Shadow Filtering.” In GPU Pro: Ad-
vanced Rendering Techniques, pp. 415–445. Natick, MA: A K Peters, 2010.
[Montrym et al. 97] J.S. Montrym, D.R. Baum, D.L. Dignam, and C.J. Migdal. “In-
finiteReality: A Real-Time Graphics System.” In Proceedings of the 24th Annual
Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’97,
pp. 293–302. New York: ACM Press/Addison-Wesley Publishing Co., 1997.
[Nehab et al. 07] Diego Nehab, Pedro V. Sander, Jason Lawrence, Natalya Tatarchuk,
and John R. Isidoro. “Accelerating Real-Time Shading with Reverse Reprojec-
tion Caching.” In ACM Siggraph/Eurographics Symposium on Graphics Hardware,
pp. 25–35. Aire-la-Ville, Switzerland: Eurographics Association, 2007.
[Olano et al. 03] Marc Olano, Bob Kuehne, and Maryann Simmons. “Automatic shader
level of detail.” In ACM Siggraph/Eurographics Conference on Graphics Hardware,
pp. 7–14. Aire-la-Ville, Switzerland: Eurographics Association, 2003.
[Ren et al. 06] Zhong Ren, Rui Wang, John Snyder, Kun Zhou, Xinguo Liu, Bo Sun,
Peter-Pike Sloan, Hujun Bao, Qunsheng Peng, and Baining Guo. “Real-Time Soft
Shadows in Dynamic Scenes using Spherical Harmonic Exponentiation.” ACM
Siggraph Transactions on Graphics 25:3 (2006), 977–986.
[Sigg and Hadwiger 05] Christian Sigg and Markus Hadwiger. “Fast Third-Order Fil-
tering.” In GPU Gems 2, Chapter 20. Reading, MA: Addison-Wesley Professional,
2005.
[Tomasi and Manduchi 98] C. Tomasi and R. Manduchi. “Bilateral Filtering for Gray
and Color Images.” In Proceedings of the Sixth International Conference on Com-
puter Vision, ICCV ’98, pp. 839–. Washington, DC: IEEE Computer Society, 1998.
i i
i i
i i
i i
[Yang et al. 08] Lei Yang, Pedro V. Sander, and Jason Lawrence. “Geometry-Aware Framebuffer Level of Detail.” Computer Graphics Forum 27:4 (2008), 1183–1188.
[Zhu et al. 05] T. Zhu, R. Wang, and D. Luebke. “A GPU Accelerated Render Cache.”
In Pacific Graphics, 2005.
3
VI
Figure 3.1. Rendering pipeline for crowd visualization. Dashed arrows correspond to
data transferred from main memory to GPU memory only once at initialization.
• View-frustum culling and LOD Assignment. In this stage we use the charac-
ters’ positions to identify those that will be culled. Additionally, we assign
a proper LOD identifier to the characters’ positions inside the view frustum
according to their distance to the camera.
• LOD sorting. The output of the view-frustum culling and LOD assignment
stage is a mixture of positions with different LODs. In the LOD sorting
stage we sort each position, according to its LOD identifier, into appropriate
buffers so that all the characters’ positions in any one buffer have the same
level of detail.
• Animation and draw instancing. In this stage we will use each sorted buffer
to draw the appropriate LOD character mesh, using instancing. Instancing
allows us to translate the characters across the virtual environment and add
visual and geometrical variety to the individuals that form the crowd.
given agent i
  state = agent[i].s;  x = agent[i].x;  z = agent[i].z;
  label = world[x, z];
  agent[i].s  = fsm[state, label];
  agent[i].x += fsm[state, label].delta_x;
  agent[i].z += fsm[state, label].delta_z;
that these methods can simulate the behavior of tens of thousands of characters efficiently, is that
approaches using the GPU eliminate the overhead of transferring the new characters' positions
between the CPU and the GPU on every frame.
method called radar VFC [Puig Placeres 05]. Radar VFC is based on the camera’s
referential points. The method tests the objects for being in the view range or
not, thus there is no need to calculate the six view-frustum plane equations.
On the other hand, objects tested against the view frustum are usually sim-
plified using points or bounding volumes such as bounding boxes (oriented or
axis-aligned) or spheres. In our case, we use points (the characters’ positions)
together with radar VFC to perform only three tests to determine the characters’
visibility. In addition, to avoid the culling of characters that are partially inside
the view frustum, we increase the view frustum size by ∆ units2 (Figure 3.2).
As mentioned earlier, radar VFC is based on camera referential points. In
other words, the camera has a referential based on the three unit vectors x̂, ŷ,
and ẑ as shown in Figure 3.3, where c is the position of the camera, n is the
center of the near plane, and f is the center of the far plane.
The idea behind radar VFC is that once we have the character’s position p to
be tested against the view frustum, we find the coordinates of p in the referential
and then use this information to find out if the point is inside or outside the view
frustum.
The first step is to find the camera’s referential. Let d be the camera’s view
direction, û the camera’s up vector, then unit vectors x̂, ŷ, and ẑ that form the
referential are calculated using Equations 3.1, 3.2, and 3.3.
$$\hat{z} = \frac{d}{\lVert d \rVert} = \frac{d}{\sqrt{d_x^2 + d_y^2 + d_z^2}} \qquad (3.1)$$

$$\hat{x} = \frac{\hat{z} \otimes \hat{u}}{\lVert \hat{z} \otimes \hat{u} \rVert} \qquad (3.2)$$

$$\hat{y} = \frac{\hat{x} \otimes \hat{z}}{\lVert \hat{x} \otimes \hat{z} \rVert} \qquad (3.3)$$
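On the host side this referential is built with two cross products and normalizations and is then passed to the shader (Listing 3.2 receives it through the X, Y, and Z uniforms). A small sketch, with an illustrative vector type, is given below; ⊗ in the equations denotes the cross product.

#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 Cross(const Vec3& a, const Vec3& b)
{
    return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
}

static Vec3 Normalize(const Vec3& v)
{
    const float len = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    return { v.x/len, v.y/len, v.z/len };
}

// d: camera view direction, u: camera up vector.
void BuildCameraReferential(const Vec3& d, const Vec3& u,
                            Vec3& xAxis, Vec3& yAxis, Vec3& zAxis)
{
    zAxis = Normalize(d);                    // Equation (3.1)
    xAxis = Normalize(Cross(zAxis, u));      // Equation (3.2)
    yAxis = Normalize(Cross(xAxis, zAxis));  // Equation (3.3)
}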
Once we have calculated the referential, the next step is to compute the vector v
that goes from the camera center c to the agent’s position p using Equation 3.4:
v = p − c. (3.4)
Next, the vector v is projected onto the camera referential, i.e., onto the x̂, ŷ,
and ẑ unit vectors.
Radar VFC first tests vector v against ẑ; v is outside the view frustum if
its projection projẑ v ∉ (nearPlane, farPlane). Notice that the projection of a
vector a onto a unit vector b̂ is given by the dot product of both vectors, i.e.,
projb̂ a = a · b̂.
If projẑ v ∈ [nearPlane, farPlane], then vector v is tested against ŷ; v will be
outside the view frustum if its projection projŷ v ∉ (−(h/2 + ∆), h/2 + ∆),
2 The value of ∆ is obtained by visually adjusting the view frustum.
where h is the height of the view frustum at position v and ∆ is the value used to
increase the view-frustum size as shown in Figure 3.2. The height h is calculated
using Equation 3.5, where fov is the field-of-view angle:
$$h = \operatorname{proj}_{\hat{z}} v \times 2 \times \tan\!\left(\frac{\mathrm{fov}}{2}\right), \quad \mathrm{fov} \in [0, 2\pi] \qquad (3.5)$$
If projŷ v ∈ (−(h/2 + ∆), h/2 + ∆), then vector v is tested against x̂ (i.e., v
is outside the view frustum if its projection projx̂ v ∉ (−(w/2 + ∆), w/2 + ∆)),
where w is the width of the view frustum, given in Equation 3.6, and ratio is the
aspect-ratio value of the view frustum:
$$w = h \times \mathrm{ratio} \qquad (3.6)$$
Figure 3.3. Camera's referential based on the three unit vectors x̂, ŷ, and ẑ.

VFC and LOD assignment stages are performed using a geometry shader.
This shader receives as input the agent texture that was updated in the behavior
stage (Section 3.2), and it will emit the positions (x, z) that are inside the view
frustum and a LODid. The resultant triplets (x, z, LODid) are stored in a vertex
buffer object using the OpenGL transform feedback feature. Listing 3.2 shows
the code that performs radar VFC in GLSL.
[vertex program]
void main(void)
{
    gl_TexCoord[0] = gl_MultiTexCoord0;
    gl_Position    = gl_Vertex;
}

[geometry program]
#define INSIDE  true
#define OUTSIDE false

uniform sampler2DRect position;
uniform float nearPlane, farPlane, tang, ratio, delta;
uniform vec3 camPos, X, Y, Z;

bool pointInFrustum(vec3 point)
{
    // Vector from the camera center to the agent's position (Equation 3.4)
    // and its projection onto the Z unit vector.
    vec3  v   = point - camPos;
    float pcz = dot(v, Z);

    // First test: test against Z unit vector
    if (pcz > farPlane || pcz < nearPlane)
        return OUTSIDE;

    // Calculating the projection of v onto the Y unit vector
    float pcy = dot(v, Y);
    float h = pcz * tang;
    h = h + delta;

    // Second test: test against Y unit vector
    if (pcy > h || pcy < -h)
        return OUTSIDE;

    // Calculating the projection of v onto the X unit vector
    float pcx = dot(v, X);
    float w = h * ratio;
    w = w + delta;

    // Third test: test against X unit vector
    if (pcx > w || pcx < -w)
        return OUTSIDE;

    return INSIDE;
}
3 It has been shown in [Millán et al. 06] that 2D representations, such as impostors, make it
possible to render tens of thousands of similar animated characters, but 2D-representation ap-
proaches need manual tuning and generate a huge amount of data if several animation sequences
are present and/or geometrical variety is considered.
...
if projZ v <= range0 then
    LODid = 0
else if projZ v > range0 & projZ v <= range1 then
    LODid = 1
else if projZ v > range1 & projZ v <= range2 then
    LODid = 2
...
These range comparisons can be written compactly as

LODid = Σ_{i=0..n−1} U(projẑ v − τi × farPlane),

where n is the number of LOD meshes per character, τi ∈ (0, 1) is a threshold
that isotropically or anisotropically divides the view range (visually calibrated to
reduce popping effects), and U is the unit step function given by

U(t − t0) = { 1 if t ≥ t0,
              0 if t < t0.
Notice that if n = 3 (three LOD meshes per character), then LODid can
take three values: 0 when the characters are near the camera (full detail), 1
when the characters are at medium distances from the camera (medium detail),
and 2 when the characters are far from the camera (low detail).
Listing 3.4 shows the changes made in Listing 3.2 to add LODid calculation.
[geometry shader]
...
bool pointInFrustum(vec3 point, out float lod)
{
    ...
    // Calculating the projection of v onto the Z unit vector
    float pcz = dot(v, Z);
    ...
    // For 3 LOD meshes:
    lod = step(farPlane * tao0, pcz) +
          step(farPlane * tao1, pcz) +
          step(farPlane * tao2, pcz);

    return INSIDE;
}
void main()
{
    float lod;
    vec4 pos = texture2DRect(position, gl_TexCoordIn[0][0].st);

    if (pointInFrustum(pos.xyz, lod))
    {
        gl_Position = pos;
        gl_Position.w = lod;
        EmitVertex();
        EndPrimitive();
    }
}
Figure 3.4. (a) Output VBO from VFC and LOD assignment stage. (b) Output of
LOD sorting stage.
For each transform feedback pass, a geometry shader will emit only the ver-
tices of the same LODid . This is shown in Listing 3.5. Notice that the uniform
variable lod is updated each pass. In our case it will be set to 0 for a full-resolution
mesh, 1 for a medium-resolution mesh, and 2 for a low-resolution mesh.
[geometry shader]
uniform float lod;    // this variable is updated each pass

void main()
{
    vec4 pos = gl_PositionIn[0];

    if (lod == pos.w)
    {
        gl_Position = pos;
        EmitVertex();
        EndPrimitive();
    }
}
Figure 3.5 shows the output of this stage and the VFC and LOD assignment
stage. The characters’ positions are rendered as points. We have assigned a
specific color for each VBOLOD . In this case, red was assigned to VBOLOD0 ,
green to VBOLOD1 , and blue to VBOLOD2 .
Figure 3.5. Output of LOD sorting stage, 4096 characters rendered as points. LOD0 is
shown in red, LOD1 in green, and LOD2 in blue. Main camera view (left). Auxiliary
camera view; notice that only positions inside the view frustum are visible (right).
3.6 Results
We designed two tests to verify the performance of our pipeline. These tests were
performed on Windows Vista using an NVIDIA 9800GX2 card with SLI disabled
and a viewport size of 900 × 900 pixels.
The goal of the first test is to determine the execution time of the behav-
ior, VFC and LOD assignments, and LOD sorting stages.4 The goal of the
second test is to determine the execution time of the complete pipeline. The
first test consisted of incrementing the number of characters from 1K to 1M,
each character with three LODs. Timing information was obtained using timer
queries (GL_EXT_timer_query), which provide a mechanism to determine
the amount of time (in nanoseconds) it takes to fully complete a set of OpenGL
commands without stalling the rendering pipeline.
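As a rough illustration of how such a query brackets the commands being measured, assuming the GL_EXT_timer_query entry points are available (error handling omitted):

#include <GL/gl.h>
#include <GL/glext.h>

/* Measure the GPU time, in milliseconds, taken by a block of GL commands
 * issued by drawFunc. */
double measureGpuTimeMs(void (*drawFunc)(void))
{
    GLuint query;
    GLuint64EXT elapsedNs = 0;

    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED_EXT, query);
    drawFunc();                        /* the commands to be timed */
    glEndQuery(GL_TIME_ELAPSED_EXT);

    /* In production the result would be read back a frame later to avoid
       stalling; here it is read immediately for simplicity. */
    glGetQueryObjectui64vEXT(query, GL_QUERY_RESULT, &elapsedNs);
    glDeleteQueries(1, &query);

    return (double)elapsedNs / 1.0e6;
}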
Results of this test are shown in the graph in Figure 3.6 (timing values are
in milliseconds). In addition, Figure 3.5 shows a rendering snapshot for 4096
characters rendered as points. Notice that the elapsed time for VFC and LOD
assignments and LOD sorting stages remains almost constant. When performing
transform feedback, we do not need any subsequent pipeline stages, thus rasteri-
zation is disabled.
4 We do not provide the execution time of the animation and draw instanced stage, since

Figure 3.6. Test 1 results. Notice that timing results are in milliseconds.

The second test consists of rendering a crowd of different characters. Each
character has three LODs: the character's LOD0 mesh is made of 2500 vertices,
the LOD1 mesh of 1000, and the LOD2 mesh of 300. The goal of this test is to determine
the execution time of all the stages of our pipeline using two different camera
perspectives. In Perspective A (Figure 3.7), almost all characters are visible,
while in Perspective B (Figure 3.8) almost all characters are culled.
Figure 3.7. Perspective A (8192 characters). Most of the characters are visible.
Figure 3.8. Perspective B (8192 characters). Most of the characters are culled.
These results are shown in Table 3.9 for Perspective A, and in Table 3.10 for
Perspective B. The first column of both tables shows the number of rendered
characters, the second one shows the time, in milliseconds, measured for each
case. Columns three to five show how many characters per level of detail are ren-
dered, and column six shows the total number of vertices, in millions, transformed
by our animation shader. Finally, the last two columns show the percentage of
characters that are visible or culled.
3.8 Acknowledgments
We wish to thank NVIDIA for its kind donation of the GPU used in the experiments.
Bibliography
[Bahnassi 06] Wessam Bahnassi. “AniTextures.” In ShaderX4 : Advanced Rendering
Techniques. Hingham, MA: Charles River Media, 2006.
[Dudash 07] B. Dudash. Skinned Instancing. NVIDIA Technical Report, 2007.
[Millán et al. 06] Erik Millán, Benjamín Hernández, and Isaac Rudomín. “Large
Crowds of Autonomous Animated Characters Using Fragment Shaders and Level
of Detail.” In ShaderX5: Advanced Rendering Techniques. Hingham, MA: Charles
River Media, 2006.
[Park and Han 09] Hunki Park and Junghyun Han. “Fast Rendering of Large Crowds
Using GPU.” In ICEC ’08: Proceedings of the 7th International Conference on
Entertainment Computing, pp. 197–202. Berlin, Heidelberg: Springer-Verlag, 2009.
[Puig Placeres 05] Frank Puig Placeres. “Improved Frustum Culling.” In Game Pro-
gramming Gems V. Hingham, MA: Charles River Media, Inc., 2005.
[Rudomín et al. 05] Isaac Rudomín, Erik Millán, and Benjamín Hernández. “Fragment
Shaders for Agent Animation Using Finite State Machines.” Simulation Modelling
Practice and Theory 13:8 (2005), 741–751.
VII
GPGPU
—Sebastien St-Laurent
1
VII
Figure 1.1. Fleur-de-lis seed image (left) and its resulting distance field (right).
1.1 Vocabulary
In the context of distance fields, the definition of distance (also known as the
metric) need not be “distance” in the physical sense that we’re all accustomed to.
• Chessboard metric. Rather than summing the horizontal and vertical dis-
tances, take their maximum. This is the minimum number of moves a king
needs when traveling between two points on a chessboard. Much like the
Manhattan metric, chessboard distance tends to be easier to compute than
true Euclidean distance.
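As a small illustration of how these metrics differ on a grid, the following C helpers (names are illustrative) compute the distance between two points under each definition:

#include <math.h>
#include <stdlib.h>

/* Distance between grid points (x0,y0) and (x1,y1) under three metrics. */
int manhattanDistance(int x0, int y0, int x1, int y1)
{
    return abs(x1 - x0) + abs(y1 - y0);          /* sum of the two axes     */
}

int chessboardDistance(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), dy = abs(y1 - y0);
    return dx > dy ? dx : dy;                    /* maximum of the two axes */
}

float euclideanDistance(int x0, int y0, int x1, int y1)
{
    float dx = (float)(x1 - x0), dy = (float)(y1 - y0);
    return sqrtf(dx * dx + dy * dy);             /* true straight-line distance */
}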
Figure 1.2. Seed image (left), Manhattan distance (middle), Euclidean distance (right).
See Figure 1.2 for an example of a seed image and a comparison of Manhattan
and Euclidean metrics.
It also helps to classify the generation algorithms that are amenable to the
GPU:
The fragment shader (for OpenGL 3.0 and above) used in each rendering pass
is shown in Listing 1.1. This shader assumes that the render target and source
texture have formats that are single-component, 8-bit unsigned integers.
out uint FragColor;
uniform usampler2D Sampler;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    uint color = texelFetch(Sampler, coord, 0).r;

    if (color != 255u)
        discard;
    if (color != 255u) {
        FragColor = color;
        return;
    }

bool done = false;
int pass = 0;
while (!done) {
    // Copy the entire source image to the target
    glUseProgram(BlitProgram);
    glDrawArrays(GL_TRIANGLE_FAN, 0, 4);

    // If all pixels were discarded, we're done
    GLuint count = 0;
Note that Listing 1.2 also checks against the MaxPassCount constant for loop
termination. This protects against an infinite loop in case an error occurs in the
fragment shader or occlusion query.
Figure 1.3. Horizontal erosion. (Panels: Tex A, d = 0; Tex B, d = 1, β = 1; Tex A, d = 2, β = 3; Tex B, d = 3, β = 5; Tex A, d = 4, β = 7.)
Figure 1.4. Vertical erosion. (Panels: Tex A, d = 0; Tex B, d = 1, β = 1; Tex A, d = 2, β = 3; Tex B, d = 3, β = 5.)
You might be wondering why this is erosion rather than grassfire, according
to our terminology. In Figure 1.4, notice that a 9 changes into an 8 during the
second pass. Since a nonbackground pixel changes its value, this is an erosion-
based method.
The fragment shader for the horizontal stage of processing is shown next.
out uint FragColor;
uniform usampler2D Sampler;
uniform uint Beta;
uniform uint MaxDistance;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // The neighbor fetches below are assumed (they are not shown in this
    // excerpt): A is the current texel and B is the smaller of the two
    // East-West neighbors, at offsets (+1,0) and (-1,0), plus Beta.
    uint A = texelFetch(Sampler, coord, 0).r;
    uint E = texelFetch(Sampler, coord + ivec2(+1, 0), 0).r;
    uint W = texelFetch(Sampler, coord + ivec2(-1, 0), 0).r;
    uint B = min(A, min(E, W) + Beta);

    if (A == B || B > MaxDistance)
        discard;

    FragColor = B;
}
Background pixels are initially set to “infinity” (i.e., the largest possible value
allowed by the bit depth). Since the shader discards pixels greater than the
application-defined MaxDistance constant, it effectively clamps the distance val-
ues. We’ll discuss the implications of clamping later in this chapter.
To create the shader for the vertical pass, simply replace the two East-West
offsets (+1,0) and (-1,0) in Listing 1.3 with North-South offsets (0,+1) and (0,-1).
To give the erosion shaders some context, the C code is shown next.
GLuint program = HorizontalProgram;
for (int d = 1; d < MaxPassCount; d++) {
    // Copy the entire source image to the destination surface
    glUseProgram(BlitProgram);
Applying odd-numbered offsets at each pass might seem less intuitive than
the Manhattan technique, but the mathematical explanation is simple. Recall
that the image contains squared distance, d². At every iteration, the algorithm
fills in new distance values by adding an offset value β to the previous distance.
Expressing this in the form of an equation, we have

d² = (d − 1)² + β.

Solving for β is simple algebra:

β = 2d − 1;

therefore, β iterates through the odd integers.
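For example, starting from a seed pixel, the squared distances 1, 4, 9, and 16 stored after successive passes are obtained by adding β = 1, 3, 5, and 7, which is exactly the progression of offsets labeled in Figures 1.3 and 1.4.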
1. Find the one-dimensional distance field of each row. This can be performed
efficiently in two passes as follows:
• First, crawl rightward and increment a counter along the way, resetting
the counter every time you cross a contour line. Write the counter’s
value into the destination image along the way. After the entire row
is processed, the seed image can be discarded.
• Next, crawl leftward, similarly incrementing a counter along the way
and writing the values into the destination image. When encountering
an existing value in the destination image that’s less than the current
counter, reset the counter to that value.
// Rightward Pass
d = 0;
for (x = 0; x < Width; x++) {
    d = seed[x] ? 0 : d + 1;
    destination[x] = d;
}

// Leftward Pass
d = 0;
for (x = Width - 1; x >= 0; x--) {
    d = min(d + 1, destination[x]);
    destination[x] = d;
}
2. In each vertical column, find the minimum squared distance of each pixel,
using only the values computed in Step 1 as input. A brute-force way of
doing this would be as follows:
for (y1 = 0; y1 < height; y1++) {
    minDist = INFINITY;
    for (y2 = 0; y2 < height; y2++) {
        d = destination[y2];
        d = (y1 - y2) * (y1 - y2) + d * d;
        minDist = min(minDist, d);
    }
    destination[y1] = minDist;
}
Note the expensive multiplications in the vertical pass. They can be optimized
in several ways:
• The (y1 − y2)² operation can be replaced with a lookup table because
|y1 − y2| is a member of a relatively small set of integers.
In practice, we found that these multiplications were not very damaging since
GPUs tend to be extremely fast at multiplication.
For us, the most fruitful optimization to the vertical pass was splitting it
into downward and upward passes. Saito and Toriwaki describe this in detail,
showing how it limits the range of the inner loop to a small region of interest.
Figure 1.5. Seed image, rightward, leftward, downward, upward (top to bottom).
// Create the OpenCL context
clGetDeviceIDs(platformId, CL_DEVICE_TYPE_GPU, 1, &deviceId, 0);
context = clCreateContext(0, 1, &deviceId, 0, 0, 0);

// Create memory objects
inBuffer = clCreateBuffer(context,
                          CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          IMAGE_WIDTH * IMAGE_HEIGHT, inputImage, 0);
outBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                           IMAGE_WIDTH * IMAGE_HEIGHT * 2, 0, 0);

// Load and compile the kernel source
program = clCreateProgramWithSource(context, 1, &source, 0, 0);
clBuildProgram(program, 0, 0, "-cl-fast-relaxed-math", 0, 0);

// Set up the kernel object for the horizontal pass
horizKernel = clCreateKernel(program, "horizontal", 0);
clSetKernelArg(horizKernel, 0, sizeof(cl_mem), &inBuffer);
clSetKernelArg(horizKernel, 1, sizeof(cl_mem), &outBuffer);
// Set up the kernel object for the vertical pass
vertKernel = clCreateKernel(program, "vertical", 0);
clSetKernelArg(vertKernel, 0, sizeof(cl_mem), &outBuffer);
Listing 1.5 uses OpenCL memory buffers rather than OpenCL image objects.
This makes the kernel code a bit easier to follow for someone coming from a CPU
background. Since we’re not leveraging the texture filtering capabilities of GPUs
anyway, this is probably fine in practice.
Also note that we’re using a seed image that consists of 8-bit unsigned integers,
but our target image is 16 bits. Since we’re generating squared distance, using
only 8 bits would result in very poor precision. If desired, a final pass could be
tacked on that takes the square root of each pixel and generates an 8-bit image
from that.
ushort nextDistance = min(254u, distance) + 1u;
Even though the target is 16 bit, we clamp it to 255 during the horizontal
scan because it gets squared in a later step. Note that distance clamping results
in an interesting property:
If distances are clamped to a maximum value of x, then any two seed pixels
further apart than x have no effect on each other in the final distance field.
We’ll leverage this property later. For some applications, it’s perfectly fine
to clamp distances to a very small value. This can dramatically speed up the
generation algorithm, as we’ll see later.
Listing 1.6 is the complete listing of the horizontal kernel used in our naive
implementation. For simplicity, this kernel operates on a fixed-width image. In
practice, you’ll want to pass in the width as an argument to the kernel.
// The kernel signature and per-row indexing below are assumed from the
// host-side setup shown earlier (one work item per row).
kernel void horizontal(global const uchar *seedImage,
                       global ushort *targetImage)
{
    int y = get_global_id(0);
    global const uchar *source = seedImage + y * IMAGE_WIDTH;
    global ushort *target = targetImage + y * IMAGE_WIDTH;
    uint d;

    // Rightward pass
    d = 0;
    for (int x = 0; x < IMAGE_WIDTH; x++) {
        ushort next = min(254u, d) + 1u;
        d = source[x] ? 0u : next;
        target[x] = d;
    }

    // Leftward pass
    d = 0;
    for (int x = IMAGE_WIDTH - 1; x >= 0; x--) {
        ushort next = min(254u, d) + 1u;
        d = min(next, target[x]);
        target[x] = d;
    }

    // Square the distances
    for (int x = 0; x < IMAGE_WIDTH; x++) {
        target[x] = target[x] * target[x];
    }
}
Figure 1.6. OpenCL work item for the horizontal pass (blue), with two overlapping
neighbors (yellow).
To reduce the need for generous amounts of local storage, we can break up
each row into multiple sections, thus shrinking the size of each OpenCL work
item. Unfortunately, operating on a narrow section of the image can produce
incorrect results, since contour lines outside the section are ignored.
This is where we leverage the fact that far-apart pixels have no effect on each
other when clamping the distance field. The middle part of each work item will
produce correct results since it’s far away from neighboring work items. We’ll use
the term margin to label the incorrect regions of each work item. By overlapping
the work items and skipping writes for the values in the margin, the incorrect
regions of each work item are effectively ignored (see Figures 1.6 and 1.7). Note
that tighter distance clamping allows for smaller margin size, resulting in better
parallelization.
Figure 1.7. OpenCL topology for the horizontal and vertical kernels.
We now need to set up the work groups for the horizontal pass using a two-
dimensional arrangement (see Listing 1.7); changed lines are highlighted. The
SPAN constant refers to the length of each work item, not including the throw-
away margins.
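As a rough sketch of such a two-dimensional launch (this is not the chapter's Listing 1.7; SPAN and the image dimensions are illustrative constants), one work item can be enqueued per section and per row:

#include <CL/cl.h>

/* Illustrative constants: each work item writes SPAN pixels of one row and
 * additionally reads a throw-away margin on each side inside the kernel. */
#define IMAGE_WIDTH  1024
#define IMAGE_HEIGHT 1024
#define SPAN         256

void enqueueHorizontalPass(cl_command_queue queue, cl_kernel horizKernel)
{
    /* One work item per (section, row) pair: a 2D global work size. */
    size_t global[2] = { IMAGE_WIDTH / SPAN, IMAGE_HEIGHT };

    clEnqueueNDRangeKernel(queue, horizKernel,
                           2,      /* work_dim                          */
                           NULL,   /* global offset                     */
                           global, /* global work size                  */
                           NULL,   /* local size: let the driver choose */
                           0, NULL, NULL);
}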
Listing 1.8 is the listing for the new kernel code. Instead of looping between
0 and WIDTH, we now perform processing between Left and Right, which are
determined from the value returned by get global id(0). You’ll also notice the
InnerLeft and InnerRight constants; these mark the portion of the work item
that actually gets written out.
// Rightward pass
d = 0;
for (int x = Left; x < Right; x++) {
    ushort next = min(254u, d) + 1u;
    d = source[x] ? 0u : next;
    scratch[x - Left] = d;
}

// Leftward pass
d = 0;
for (int x = Right - 1; x >= Left; x--) {
    ushort next = min(254u, d) + 1u;
    d = min(next, (ushort) scratch[x - Left]);
    scratch[x - Left] = d;
The only remaining piece is the kernel for the vertical pass. Recall the code
snippet we presented earlier that described an O(n2 ) algorithm to find the min-
imum distances in a column. By splitting the algorithm into downward and
upward passes, Saito and Toriwaki show that the search area of the inner loop
can be narrowed to a small region of interest, thus greatly improving the best-
case efficiency. See this book’s companion source code for the full listing of the
vertical kernel.
Due to the high variability from one type of GPU to the next, we recommend
that readers experiment to find the optimal OpenCL kernel code and topology
for their particular hardware.
Readers may also want to experiment with the image’s data type (floats versus
integers). We chose integers for this article because squared distance in a grid
is, intuitively speaking, never fractional. However, keep in mind that GPUs are
floating-point monsters! Floats and half-floats may provide better results with
certain architectures. It suffices to say that the implementation presented in this
article is by no means the best approach in every circumstance.
A signed distance field differs from an unsigned one in that the former generates
distances for object pixels in addition to background pixels. Object pixels have
negative distance while background pixels have positive distance.
It’s easy to extend any technique to account for signed distance by simply
inverting the seed image and applying the algorithm a second time. We found it
convenient to use an unsigned, integer-based texture format, and added a second
color channel to the image for the negative values. In Figure 1.8, we depict a
signed distance field where the red channel contains positive distance and the
green channel contains negative distance.
In the case of the horizontal-vertical erosion technique presented earlier, we
can modify the fragment shader to operate on two color channels simultaneously,
thus avoiding a second set of passes through the image for negative distance.
Listing 1.9 shows the new fragment shader.
out uvec2 FragColor;
uniform usampler2D Sampler;
uniform uint Beta;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // The fetches below are assumed (they are not shown in this excerpt);
    // both distance channels are eroded at once.
    uvec2 A = texelFetch(Sampler, coord, 0).rg;
    uvec2 E = texelFetch(Sampler, coord + ivec2(+1, 0), 0).rg;
    uvec2 W = texelFetch(Sampler, coord + ivec2(-1, 0), 0).rg;
    uvec2 B = min(A, min(E, W) + Beta);

    if (A == B)
        discard;

    FragColor = B;
}
For most applications it’s desirable to make a final transformation that nor-
malizes the two-channel distance into a simple grayscale image. The final square-
root transformation can also be performed at this time. The following fragment
shader makes this final pass:
varying vec2 TexCoord;
uniform sampler2D Sampler;
uniform float Scale;

void main()
{
    vec2 D = sqrt(texture2D(Sampler, TexCoord).rg);
    float L = 0.5 + Scale * (D.r - D.g);
    gl_FragColor = vec4(L);
}
If a distance field is normalized in this way, a value of 0.5 indicates the center
of a contour in the seed image.
1.6.1 Antialiasing
Both of the source textures in Figure 1.9 are only 128 × 32; it’s obvious that
rendering with the aid of a distance field can provide much better results.
Because the gradient vector at a certain pixel in the distance field gives the
direction of maximum change, it can be used as the basis for antialiasing. The
fwidth function in GLSL provides a convenient way to obtain the rate of change
of the input value at the current pixel. In our case, large values returned from
fwidth indicate a far-off camera, while small values indicate large magnification.
Recall that a lookup value of 0.5 represents the location of the contour line.
We compute the best alpha value for smoothing by testing how far the current
pixel is from the contour line. See Listing 1.10 for our antialiasing shader.
Figure 1.9. Bilinear filtering (left). Magnification using a distance field (right).
in vec2 TexCoord;
out vec4 FragColor;
uniform sampler2D Sampler;

void main()
{
    float D = texture(Sampler, TexCoord).x;
    float width = fwidth(D);
    float A = 1.0 - smoothstep(0.5 - width, 0.5 + width, D);
    FragColor = vec4(0, 0, 0, A);
}
1.6.2 Outlining
Creating an outline effect such as the one depicted in Figure 1.10 is quite simple
when using a signed distance field for input. Note that there are two color
transitions that we now wish to antialias: the transition from the fill color to
the outline color, and the transition from the outline color to the background
color. The following fragment shader shows how to achieve this; the Thickness
uniform is the desired width of the outline.
in vec2 TexCoord;
out vec4 FragColor;
uniform sampler2D Sampler;
uniform float Thickness;

void main()
{
    float D = texture(Sampler, TexCoord).x;
    float W = fwidth(D);
    float T0 = 0.5 - Thickness;
    float T1 = 0.5 + Thickness;
    if (D < T0) {
        float A = 1.0 - smoothstep(T0 - W, T0, D);
        FragColor = vec4(A, A, A, 1);
    } else if (D < T1) {
        FragColor = vec4(0, 0, 0, 1);
    } else {
        float A = 1.0 - smoothstep(T1, T1 + W, D);
        FragColor = vec4(0, 0, 0, A);
    }
}
in vec2 TexCoord;
out vec4 FragColor;
uniform sampler2D Sampler;
uniform float Animation;

void main()
{
    float D = texture(Sampler, TexCoord).x;
    float W = fwidth(D);
    float H = 2.0 * float(D - 0.5);
    float A = smoothstep(0.5 - W, 0.5 + W, D);
    float hue = fract(H + Animation);
    FragColor = vec4(A * HsvToRgb(hue, 1.0, 1.0), 1.0);
}
Bibliography
[Cao et al. 10] Thanh-Tung Cao, Ke Tang, Anis Mohamed, and Tiow-Seng Tan. “Par-
allel Banding Algorithm to Compute Exact Distance Transform with the GPU.”
In I3D ’10: Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive
3D Graphics and Games, pp. 83–90. New York: ACM, 2010.
[Danielsson 80] P. E. Danielsson. “Euclidean Distance Mapping.” Computer Graphics
and Image Processing 14:3 (1980), 227–248.
[Fabbri et al. 08] Ricardo Fabbri, Luciano da F. Costa, Julio C. Torelli, and Odemir M.
Bruno. “2D Euclidean Distance Transform Algorithms: A Comparative Survey.”
ACM Computing Surveys 40:1 (2008), 2:1–2:44.
[Green 07] Chris Green. “Improved Alpha-Tested Magnification for Vector Textures
and Special Effects.” In SIGGRAPH ’07: ACM SIGGRAPH 2007 Courses, pp. 9–
18. New York: ACM, 2007.
[Lotufo and Zampirolli 01] Roberto de Alencar Lotufo and Francisco A. Zampirolli.
“Fast Multidimensional Parallel Euclidean Distance Transform Based on Math-
ematical Morphology.” In SIBGRAPI ’01: Proceedings of the 14th Brazilian Sym-
posium on Computer Graphics and Image Processing, pp. 100–105. Washington,
DC: IEEE Computer Society, 2001.
[Qin et al. 06] Zheng Qin, Michael D. McCool, and Craig S. Kaplan. “Real-Time
Texturemapped Vector Glyphs.” In Symposium on Interactive 3D Graphics and
Games, pp. 125–132. New York: ACM Press, 2006.
[Rong and Tan 06] Guodong Rong and Tiow-Seng Tan. “Jump Flooding: An Efficient
and Effective Communication on GPU.” pp. 185–192. Hingham, MA: Charles River
Media, 2006.
[Saito and Toriwaki 94] Toyofumi Saito and Jun-Ichiro Toriwaki. “New Algorithms for
Euclidean Distance Transformation of an n-Dimensional Digitized Picture with
Applications.” Pattern Recognition 27:11 (1994), 1551–1565.
2
VII
Order-Independent Transparency
using Per-Pixel Linked Lists
Nicolas Thibieroz
2.1 Introduction
Order-independent transparency (OIT) has been an active area of research in
real-time computer graphics for a number of years. The main area of research
has focused on how to effect fast and efficient back-to-front sorting and render-
ing of translucent fragments to ensure visual correctness when order-dependent
blending modes are employed. The complexity of the task is such that many real-
time applications have chosen to forfeit this step altogether in favor of simpler and
faster alternative methods such as sorting per object or per triangle, or simply
falling back to order-independent blending modes (e.g., additive blending) that
don’t require any sorting [Thibieroz 08]. Different OIT techniques have previously
been described (e.g., [Everitt 01], [Myers 07]) and although those techniques suc-
ceed in achieving correct ordering of translucent fragments, they usually come
with performance, complexity, or compatibility shortcomings that make their use
difficult for real-time gaming scenarios.
This chapter presents an OIT implementation relying on the new features
of the DirectX 11 API from Microsoft.1 An overview of the algorithm will be
presented first, after which each phase of the method will be explained in detail.
Sorting, blending, multisampling, anti-aliasing support, and optimizations will
be treated in separate sections before concluding with remarks concerning the
attractiveness and future of the technique.
buffers for later rendering. Our method uses linked lists [Yang 10] to store a list
of translucent fragments for each pixel. Therefore, every screen coordinate in the
render viewport will contain an entry to a unique per-pixel linked list containing
all translucent fragments at that particular location.
Prior to rendering translucent geometry, all opaque and transparent (alpha-
tested) models are rendered onto the render target viewport as desired. Then the
OIT algorithm can be invoked to render correctly ordered translucent fragments.
The algorithm relies on a two-step process.
1. The first step is the creation of per-pixel linked lists whereby the translucent
contribution to the scene is entirely captured into a pair of buffers containing
the head pointers and linked lists nodes for all translucent pixels.
2. The second step is the traversal of per-pixel linked lists to fetch, sort, blend
and finally render all pixels in the correct order onto the destination render
viewport.
buffer. The programmer is given control of the counter via the following two
Shader Model 5.0 methods:
uint <Buffer>.IncrementCounter();
uint <Buffer>.DecrementCounter();
Hardware counter support is used to keep track of the offset at which to store
the next linked list node.
While hardware counter support is not strictly required for the algorithm to
work, it enables considerable performance improvement compared to manually
keeping track of a counter via a single-element buffer UAV.
linked list. The linked list nodes are the individual elements of the linked list
that contain fragment data as well as the address of the next node.
uint uLinearAddressInBytes = 4 * (ScreenPos.y * RENDERWIDTH + ScreenPos.x);
Figure 2.1. Head pointer and nodes buffers contents after rendering five translucent
pixels (3 triangles) for a 6 × 6 render target viewport.
// Pixel shader input structure
struct PS_INPUT
{
    float3 vNormal : NORMAL;      // Pixel normal
    float2 vTex    : TEXCOORD;    // Texture coordinates
    float4 vPos    : SV_POSITION; // Screen coordinates
};

// Node data structure
struct NodeData_STRUCT
{
    uint uColor; // Fragment color packed as RGBA
    uint uDepth; // Fragment depth
    uint uNext;  // Address of next linked list node
};

// UAV declarations
RWByteAddressBuffer HeadPointerBuffer : register(u1);
RWStructuredBuffer<NodeData_STRUCT> NodesBuffer : register(u2);

// Pixel shader for writing per-pixel linked lists
[earlydepthstencil]
float4 PS_StoreFragments(PS_INPUT input) : SV_Target
{
    NodeData_STRUCT Node;
    // Calculate fragment color from pixel input data
    Node.uColor = PackFloat4IntoUint(ComputeFragmentColor(input));

    // Store pixel depth in packed format
    Node.uDepth = PackDepthIntoUint(input.vPos.z);

    // Retrieve current pixel count and increase counter
    uint uPixelCount = NodesBuffer.IncrementCounter();

    // Convert pixel 2D coordinates to byte linear address
    uint2 vScreenPos = uint2(input.vPos.xy);
    uint uLinearAddressInBytes = 4 * (vScreenPos.y * RENDERWIDTH + vScreenPos.x);

    // Exchange offsets in Head Pointer buffer;
    // Node.uNext will receive the previous head pointer
    HeadPointerBuffer.InterlockedExchange(
        uLinearAddressInBytes, uPixelCount, Node.uNext);

    // Store the new node (this write is implied by the algorithm
    // but not shown in the excerpt)
    NodesBuffer[uPixelCount] = Node;

    // No RT bound so this will have no effect
    return float4(0, 0, 0, 0);
}
been rendered. This copy is needed to start the manual blending operation in
the pixel shader.
The current render target viewport (onto which opaque and alpha-tested ge-
ometry has previously been rendered) is set as output, and the depth buffer that
was used to render previous geometry is bound with Z-Writes disabled (the use of
the depth buffer for the traversal step is explained in the Optimizations section).
underblended result via actual hardware blending). However, this method im-
poses restrictions on the variety of per-fragment blend modes that can be used,
and did not noticeably affect performance.
It is quite straightforward to modify the algorithm so that per-fragment blend
modes are specified instead of adopting a blend mode common to all fragments.
This modification allows translucent geometry of different types (particles, win-
dows, smoke etc.) to be stored and processed together. In this case a bit field
containing the blend mode id of each fragment is stored in the node structure
(along with pixel color and depth) in the per-pixel linked list creation step. Only
a few bits are required (this depends on how many different blend modes are
specified— typically this shouldn’t be more than a handful) and therefore the bit
field could be appended to the existing color or depth member of the node struc-
ture by modifying the packing function accordingly. When the per-pixel linked
lists are parsed for rendering, the blending part of the algorithm is modified so
that a different code path (ideally based on pure arithmetic instructions to avoid
the need for actual code divergence) is executed based on the fragment’s blend
mode id.
// Pixel shader input structure for fullscreen quad rendering
struct PS_SIMPLE_INPUT
{
    float2 vTex : TEXCOORD;    // Texture coordinates
    float4 vPos : SV_POSITION; // Screen coordinates
};

// Fragment sorting array
#define MAX_SORTED_FRAGMENTS 18
static uint2 SortedFragments[MAX_SORTED_FRAGMENTS + 1];

// SRV declarations
Buffer<uint> HeadPointerBufferSRV : register(t0);
StructuredBuffer<NodeData_STRUCT> NodesBufferSRV : register(t1);
Texture2D BackgroundTexture : register(t3);

// Pixel shader for parsing per-pixel linked lists
float4 PS_RenderFragments(PS_SIMPLE_INPUT input) : SV_Target
{
    // Convert pixel 2D coordinates to linear address
    uint2 vScreenPos = uint2(input.vPos.xy);
    uint uLinearAddress = vScreenPos.y * RENDERWIDTH + vScreenPos.x;
    // Fetch offset of first fragment for current pixel
    uint uOffset = HeadPointerBufferSRV[uLinearAddress];

    // Traverse the linked list; 0xFFFFFFFF is assumed here as the
    // "end of list" marker.
    int nNumFragments = 0;
    [loop] while (uOffset != 0xFFFFFFFF)
    {
        // Retrieve fragment at current offset
        NodeData_STRUCT Node = NodesBufferSRV[uOffset];

        // Copy fragment color and depth into sorting array
        SortedFragments[nNumFragments] = uint2(Node.uColor, Node.uDepth);

        // Sort fragments front to back using insertion sorting
        int j = nNumFragments;
        [loop] while ((j > 0) &&
                      (SortedFragments[max(j - 1, 0)].y >
                       SortedFragments[j].y))
        {
            // Swap required
            int jminusone = max(j - 1, 0);
            uint2 Tmp = SortedFragments[j];
            SortedFragments[j] = SortedFragments[jminusone];
            SortedFragments[jminusone] = Tmp;
            j--;
        }

        // Increase number of fragments if under the limit
        nNumFragments = min(nNumFragments + 1, MAX_SORTED_FRAGMENTS);

        // Retrieve next offset
        uOffset = Node.uNext;
    }

    // Retrieve current color from background color
    float4 vCurrentColor =
        BackgroundTexture.Load(int3(input.vPos.xy, 0));

    // Render sorted fragments using SRCALPHA-INVSRCALPHA
    // blending
    for (int k = nNumFragments - 1; k >= 0; k--)
    {
        float4 vColor = UnpackUintIntoFloat4(SortedFragments[k].x);
        vCurrentColor.xyz = lerp(vCurrentColor.xyz, vColor.xyz, vColor.w);
    }
    // Return manually-blended color
    return vCurrentColor;
}

// Pixel shader input structure
struct PS_INPUT
{
    float3 vNormal   : NORMAL;      // Pixel normal
    float2 vTex      : TEXCOORD;    // Texture coordinates
    float4 vPos      : SV_POSITION; // Screen coordinates
    uint   uCoverage : SV_COVERAGE; // Pixel coverage
};
Only a few bits are required for sample coverage; we therefore pack it onto
the depth member of the node structure using a 24:8 bit arrangement (24 bits
for depth, 8 bits for sample coverage). This avoids the need for extra stor-
age and leaves enough precision for encoding depth. The node structure thus
becomes:
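A C-side sketch of that arrangement is given below; the packing helper and the exact bit layout are ours rather than the chapter's, and only the uDepth member changes meaning:

#include <stdint.h>

/* One linked-list node with depth and coverage sharing a single 32-bit
 * word (24 bits of depth, 8 bits of coverage). */
typedef struct
{
    uint32_t uColor; /* fragment color packed as RGBA                  */
    uint32_t uDepth; /* (depth24 << 8) | coverage8  (24:8 arrangement) */
    uint32_t uNext;  /* address of the next linked-list node           */
} NodeData;

/* Pack a [0,1] depth and an MSAA coverage mask into the shared word. */
static inline uint32_t packDepthAndCoverage(float depth, uint32_t coverage)
{
    uint32_t depth24 = (uint32_t)(depth * 16777215.0f); /* 2^24 - 1 */
    return (depth24 << 8) | (coverage & 0xFFu);
}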
Figure 2.2. Sample coverage example on a single pixel. The blue triangle covers the
third sample in a standard MSAA 4x arrangement. The input coverage to the pixel
shader will therefore be equal to 0x04 (0100 in binary).
The pixel shader to create per-pixel linked lists is modified so that depth and
coverage are now packed together and stored in the node structure:
When parsing per-pixel linked lists for rendering, the sample coverage in the
current node is compared with the index of the sample being shaded: if the
index is included in the sample coverage, then this pixel node contributes to the
current sample and is therefore copied to the temporary array for sorting and
later blending.
One further modification to the blending portion of the code is that the back-
ground texture representing the scene data prior to any translucent contribution
is multisampled, thus the declaration and the fetch instruction are modified ac-
cordingly.
After the fullscreen quad is rendered, the multisampled render target will
contain the sample-accurate translucent contribution to the scene.
// Pixel shader input structure for fullscreen quad rendering
// with MSAA enabled
struct PS_SIMPLE_INPUT
{
    float2 vTex    : TEXCOORD;       // Texture coordinates
    float4 vPos    : SV_POSITION;    // Screen coordinates
    uint   uSample : SV_SAMPLEINDEX; // Sample index
};

// Fragment sorting array
#define MAX_SORTED_FRAGMENTS 18
static uint2 SortedFragments[MAX_SORTED_FRAGMENTS + 1];

// SRV declarations
Buffer<uint> HeadPointerBufferSRV : register(t0);
StructuredBuffer<NodeData_STRUCT> NodesBufferSRV : register(t1);
Texture2DMS<float4, NUM_SAMPLES> BackgroundTexture : register(t3);

// Pixel shader for parsing per-pixel linked lists
float4 PS_RenderFragments(PS_SIMPLE_INPUT input) : SV_Target
{
    // Convert pixel 2D coordinates to linear address
    uint2 vScreenPos = uint2(input.vPos.xy);
    uint uLinearAddress = vScreenPos.y * RENDERWIDTH + vScreenPos.x;

    // Fetch offset of first fragment for current pixel
    uint uOffset = HeadPointerBufferSRV[uLinearAddress];

    // Traverse the linked list; 0xFFFFFFFF is assumed here as the
    // "end of list" marker.
    int nNumFragments = 0;
    [loop] while (uOffset != 0xFFFFFFFF)
    {
        // Retrieve fragment at current offset
        NodeData_STRUCT Node = NodesBufferSRV[uOffset];

        // Only include fragment in sorted list if coverage mask
        // includes the sample currently being rendered
        uint uCoverage = UnpackCoverageIntoUint(Node.uDepth);
        if (uCoverage & (1 << input.uSample))
        {
            // Copy fragment color and depth into sorting array
            SortedFragments[nNumFragments] =
                uint2(Node.uColor, Node.uDepth);

            // Sort fragments front to back using
            // insertion sorting
            int j = nNumFragments;
            [loop] while ((j > 0) &&
                          (SortedFragments[max(j - 1, 0)].y >
                           SortedFragments[j].y))
            {
                // Swap required
                int jminusone = max(j - 1, 0);
                uint2 Tmp = SortedFragments[j];
                SortedFragments[j] = SortedFragments[jminusone];
                SortedFragments[jminusone] = Tmp;
                j--;
            }

            // Increase number of fragments if under the limit
            nNumFragments = min(nNumFragments + 1,
                                MAX_SORTED_FRAGMENTS);
        }

        // Retrieve next offset
        uOffset = Node.uNext;
    }

    // Retrieve current color from the multisampled background texture
    float4 vCurrentColor =
        BackgroundTexture.Load(int3(input.vPos.xy, 0), input.uSample);

    // Render sorted fragments using SRCALPHA-INVSRCALPHA
    // blending
    for (int k = nNumFragments - 1; k >= 0; k--)
    {
        float4 vColor =
            UnpackUintIntoFloat4(SortedFragments[k].x);
        vCurrentColor.xyz = lerp(vCurrentColor.xyz, vColor.xyz, vColor.w);
    }
    // Return manually-blended color
    return vCurrentColor;
}

Listing 2.3. Pixel shader for parsing per-pixel linked lists when MSAA is enabled.
2.8 Optimizations
2.8.1 Node Structure Packing
As previously mentioned in this chapter, the size of the node structure has a direct
impact on the amount of memory declared for the nodes buffer. Incidentally,
the smaller the size of the node structure, the better the performance, since
fewer memory accesses will be performed. It therefore pays to aggressively pack
data inside the node structure, even if it adds to the cost of packing/unpacking
instructions in the shaders used.
The default node structure presented in the previous paragraphs is three uints in
size: one uint for the packed RGBA color, one uint for depth and coverage, and
one uint for the “next” pointer. Some circumstances may allow further reduction
of the structure for a performance/memory benefit; for instance, color and depth
could be packed into a single uint, e.g., by encoding the RGB color as 565 and
depth as a 16-bit value (such a reduction in depth precision may require some
scaling and biasing to avoid precision issues). The “next” pointer could be encoded
with 24 bits, leaving 8 bits for a combination of sample coverage and/or blend id.
Such a scheme would reduce the node structure size to two uints (8 bytes), which
is a desirable goal if the scene circumstances allow it.
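For illustration, one possible two-uint encoding could be packed on the CPU as follows; the exact bit layout and names here are assumptions, not the chapter's:

#include <stdint.h>

/* Two-word (8-byte) node: color and depth share the first word,
 * the next pointer and coverage/blend-id bits share the second. */
typedef struct
{
    uint32_t uColorDepth; /* RGB565 color in the low 16 bits, 16-bit depth above */
    uint32_t uNextFlags;  /* 24-bit next pointer, 8 bits of coverage/blend id    */
} PackedNode;

static inline uint32_t packColorDepth(uint32_t r8, uint32_t g8, uint32_t b8,
                                      float depth)
{
    uint32_t color565 = ((r8 >> 3) << 11) | ((g8 >> 2) << 5) | (b8 >> 3);
    uint32_t depth16  = (uint32_t)(depth * 65535.0f);
    return (depth16 << 16) | color565;
}

static inline uint32_t packNextFlags(uint32_t next, uint32_t flags8)
{
    return ((next & 0x00FFFFFFu) << 8) | (flags8 & 0xFFu);
}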
When per-pixel linked lists are parsed for rendering, the stencil buffer is set
up to pass if the stencil value is above 0. Early stencil rejection ensures that only
pixel locations that have been touched by translucent fragments will be processed,
saving on performance in the process.
        vCurrentColorSample[uSample] =
            BackgroundTexture.Load(int3(input.vPos.xy, 0), uSample);
    }

    // Render fragments using SRCALPHA-INVSRCALPHA blending
    for (int k = nNumFragments - 1; k >= 0; k--)
    {
        // Retrieve fragment color
        float4 vFragmentColor =
            UnpackUintIntoFloat4(SortedFragments[k].x);

        // Retrieve sample coverage
        uint uCoverage =
            UnpackCoverageIntoUint(SortedFragments[k].y);

    // Resolve samples into a single color
    float4 vCurrentColor = float4(0, 0, 0, 1);
    [unroll] for (uint uSample = 0; uSample < NUM_SAMPLES; uSample++)
    {
        vCurrentColor.xyz += vCurrentColorSample[uSample];
    }
    vCurrentColor.xyz *= (1.0 / NUM_SAMPLES);

    // Return manually-blended color
    return vCurrentColor;
2.9 Tiling
2.9.1 Tiling as a Memory Optimization
Tiling is a pure memory optimization that considerably reduces the amount of
video memory required for the nodes buffer (and to a lesser extent the head
Figure 2.3. Opaque contents of render target prior to any OIT contribution.
pointer buffer). Without tiling, the memory occupied by both buffers can rapidly
become huge when fullscreen render target resolutions are used. As an example,
a standard HD resolution of 1280 × 720 with an estimated average translucent
overdraw of eight would occupy a total of 1280 × 720 × 8 × sizeof(node structure)
bytes for the nodes buffer only, which equates to more than 168 megabytes with
a standard node structure containing three uints (color, depth, and next pointer).
Instead of allocating buffers for the full-size render target, a single, smaller
rectangular region (the “tile”) is used. This tile represents the extent of the
Figure 2.4. Translucent contribution to the scene is added to the render target via
regular tiling. Each rectangular area stores fragments in the tile-sized head pointer
and nodes buffers and then parses those buffers to add correctly ordered translucency
information to the same rectangle area. In this example the tile size is 1/15 of the
render target size, and a total of 15 rectangles are processed.
area being processed for OIT in a given pass. Since the tile is typically smaller
than the render size, this means multiple passes are needed to calculate the
full-size translucent contributions to the scene. The creation of per-pixel linked
lists is therefore performed on a per-tile basis, after which the traversal phase
fetches nodes from the tile-sized head pointer and nodes buffers to finally output
the resulting color values onto the rectangular region corresponding to the tile
being processed in the destination render target. As an example, a standard HD
resolution of 1280 × 720 would take 15 passes with a tile size of 256 × 240 for the
screen to be fully covered. Figure 2.3 shows a scene with translucent contributions
yet to be factored in. Figure 2.4 shows the same scene with translucency rendered
on top using a set of tile-sized rectangular regions covering the whole render area.
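The pass count and the per-tile memory footprint follow directly from the tile size; a small C helper along these lines (names are ours) makes the trade-off explicit:

#include <stddef.h>

/* Number of tile-sized passes needed to cover the full render target. */
static int tilePassCount(int width, int height, int tileW, int tileH)
{
    int tilesX = (width  + tileW - 1) / tileW;
    int tilesY = (height + tileH - 1) / tileH;
    return tilesX * tilesY;   /* e.g., 1280 x 720 with 256 x 240 tiles -> 15 */
}

/* Nodes-buffer size for one tile, given an average translucent overdraw. */
static size_t tileNodesBufferSize(int tileW, int tileH,
                                  int avgOverdraw, size_t nodeSize)
{
    return (size_t)tileW * tileH * avgOverdraw * nodeSize;
}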
Figure 2.5. Translucent contribution to the scene is added to the render target via op-
timized tiling. Only tiles enclosing the bounding geometry of the translucent characters
are processed. The bounding boxes of translucent geometry are transformed to screen
space and the combined extents of the resulting coordinates define the minimum rect-
angle area that will be processed. This minimum rectangle area is covered by as many
tile-sized rectangular regions as required (six in this example). Each of those regions
performs fragment storing and rendering using a single pair of tile-sized head pointers
and nodes buffers.
the minimum and maximum dimensions (in X and Y ) of the combined set will
define the rectangle area of translucent contributions. The minimum rectangle
optimization typically allows a reduction in the number of tiles to process when
parsing and rendering fragments from linked lists. In order to render a minimum
number of tiles, it is desirable to ensure that the bounding geometry used is as
tight as possible; for example, axis-aligned bounding boxes are likely to be less
effective than arbitrary-aligned bounding boxes or a set of bounding volumes with
a close fit to the meshes involved.
Because this optimization covers only a portion of the screen, the previous
contents of the render target will need to be copied to the destination render
target, at least for those regions that do not include translucent contribution.
This copy can be a full-size copy performed before the OIT step, or stencil-based
marking can be used to transfer only the rectangle regions that did not contain
any translucency.
Figure 2.5 illustrates the optimized covering of tiles to cover only the 2D
extents of translucent contributions to the scene.
2.10 Conclusion
The OIT algorithm presented in this chapter allows significant performance sav-
ings compared to other existing techniques. The technique is also robust, allowing
2.11 Acknowledgments
I would like to thank Holger Gruen and Jakub Klarowicz for coming up with the original
concept of creating per-pixel linked lists in a DirectX 11-class GPU environment.
Bibliography
[Bavoil 08] Louis Bavoil and Kevin Myers. “Order-Independent Transparency with Dual
Depth Peeling.” White paper available online at [Link]
com/SDK/10/opengl/src/dual depth peeling/doc/[Link], 2008.
[Myers 07] Kevin Myers and Louis Bavoil. “Stencil Routed A-Buffer.” SIGGRAPH ’07:
ACM SIGGRAPH 2007 Sketches, Article 21. New York: ACM, 2007.
[Thibieroz 10] Nicolas Thibieroz and Holger Gruen. “OIT and Indirect Illumina-
tion using DX11 Linked Lists.” GDC 2010 Presentation from the Advanced
D3D Day Tutorial. Available online at [Link]
presentations/Pages/[Link], 2010.
[Yang 10] Jason Yang, Justin Hensley, Holger Gruen, and Nicolas Thibieroz. “Real-
Time Concurrent Linked List Construction on the GPU.” Computer Graphics Fo-
rum, Eurographics Symposium on Rendering 2010 29:4 (2010), 1297–1304.
3
VII
3.1 Introduction
In this chapter, we present a simple and efficient algorithm for the simulation
of fluid flow directly on the GPU using a single pixel shader. By temporarily
relaxing the incompressibility condition, we are able to solve the full Navier-
Stokes equations over the domain in a single pass.
this algorithm on the GPU is at least 100 times faster than on the CPU.1 In this
chapter, we show how to couple the two equations of the classical Navier-Stokes
equations into a single-phase process; a detailed explanation of the algorithm
along with example code follows.
∇ · u = 0. (3.3)
1 The simulation runs on the CPU at 8.5 fps with 4 threads on an Intel Core 2 Quad at 2.66
GHz simulating only the velocity field over a 256 × 256 grid. Keeping the same grid size, the
simulation of both velocity and density fields runs at more than 2500 fps on a Geforce 9800
GT using 32-bit floating point render targets. Note that 16-bit floating point is sufficient to
represent velocities.
One of the main features of the formulation of the Navier-Stokes equations
illustrated here is the possibility, when working with regular grid domains, of
using classical finite difference schemes.
Before jumping directly to the code, we first need to discretize the formulation.
1. Solve the mass conservation equation for density by computing the differ-
ential operators with central finite differences and integrating the solution
with the forward Euler method.
Indeed, there exist other finite difference schemes. For instance, one could use
upwinding for the transport term or literally semi-Lagrangian advection. Unfor-
tunately, the latter results in much numerical dissipation; an issue covered in
Section 3.3.2.
φ^{n+1}_{i,j,k}(x_{i,j,k}) = φ^n(x_{i,j,k} − ∆t u^n_{i,j,k}).   (3.9)
The idea is to solve the transport equation from a Lagrangian viewpoint where
the spatial discretization element holding quantities (e.g., a particle) moves along
the flow of the fluid, and answer the following question: where was this element
at the previous time step if it has been transported by a field u and ends up
at the current texel’s center at the present time? Finally, we use the sampled
quantity to set it as the new value of the current texel.
Now when solving the transport equation for velocity, Equation (3.8) becomes
∂u/∂t = −u · ∇u and is solved with

u^{n+1}_{i,j,k}(x_{i,j,k}) = u^n(x_{i,j,k} − ∆t u^n_{i,j,k}).
This method is not only straightforward to implement on the GPU with linear
samplers, but is also unconditionally stable. Unfortunately, quantities advected
in this fashion suffer from dramatic numerical diffusion and higher-order schemes
exist to avoid this issue, such as McCormack schemes discussed in [Selle et al. 08].
These schemes are especially useful when advecting visual densities as mentioned
in Section 3.5.
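For reference, Equation (3.9) applied to a scalar field on a 2D grid can be written out on the CPU as follows; the grid layout and helper names are ours, and the chapter performs this step in a pixel shader with a linear sampler:

static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Bilinearly sample a w x h grid of floats at (x, y). */
static float bilerp(const float *f, int w, int h, float x, float y)
{
    x = clampf(x, 0.0f, (float)w - 1.001f);
    y = clampf(y, 0.0f, (float)h - 1.001f);
    int   i = (int)x,  j = (int)y;
    float s = x - i,   t = y - j;
    const float *r0 = f + j * w;
    const float *r1 = f + (j + 1) * w;
    return (1 - s) * ((1 - t) * r0[i]     + t * r1[i]) +
                s  * ((1 - t) * r0[i + 1] + t * r1[i + 1]);
}

/* First-order semi-Lagrangian advection of phi0 into phi1 by the velocity
 * field (u, v) on a w x h grid with dx = dy = 1. */
void advectScalar(const float *phi0, float *phi1,
                  const float *u, const float *v,
                  int w, int h, float dt)
{
    for (int j = 0; j < h; ++j) {
        for (int i = 0; i < w; ++i) {
            int idx = j * w + i;
            /* Trace the grid point backward along the flow ...          */
            float x = (float)i - dt * u[idx];
            float y = (float)j - dt * v[idx];
            /* ... and sample the previous field there (Equation (3.9)). */
            phi1[idx] = bilerp(phi0, w, h, x, y);
        }
    }
}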
Density invariance and diffusion forces. After solving the transport term, the rest
of the momentum conservation equation is solved with central finite differences.
Here is a quick reminder of the part of Equation (3.1) which is not yet solved:

∂u/∂t = −∇P/ρ + g + (µ/ρ)∇²u.

As mentioned earlier, the gradient of the density-invariant field ∇P is equivalent
to the density gradient ∇ρ, i.e., ∇P ≈ K∇ρ. Since we already computed the den-
sity gradient ∇ρ^n_{i,j,k} when solving the mass conservation equation with Equation
(3.3.1), we need only to scale by K in order to compute the “pressure” gradient
∇P. As for the diffusion term, a Laplacian ∇² must be computed. This operator
is now expressed using a second-order central finite difference scheme:
∇²f ≈ L(f),   where

L(f) = (f^n_{i+1,j,k} − 2f^n_{i,j,k} + f^n_{i−1,j,k}) / (∆x)²
     + (f^n_{i,j+1,k} − 2f^n_{i,j,k} + f^n_{i,j−1,k}) / (∆y)²
     + (f^n_{i,j,k+1} − 2f^n_{i,j,k} + f^n_{i,j,k−1}) / (∆z)².

The velocity is then integrated with

u^{n+1}_{i,j,k} = u^n_{i,j,k} + ∆t(−S∇ρ^n_{i,j,k} + g + ν∇²u^n_{i,j,k}).
Since the density ρ should not vary much from ρ0, we can interpret the 1/ρ scale
as a constant and hold it in the coefficients ν := µ/ρ0 and S := K(∆x)²/(∆t ρ0).
One can see how we also scale by (∆x)²/∆t, which seems to give better results
(we found (∆x)² while testing over a 2D simulation); a sound justification has
still to be found and will be the subject of future work.
Up to now we solved the equations without considering boundary conditions
(obstacles) or numerical stability. These two topics will be covered in the next
two sections.
trivial when assuming that the walls of an obstacle are always coincident with the
face of a computational cell (i.e., obstacles would completely fill computational
cells). With this assumption, the derivative ∂f/∂n is either given by ±∂u/∂x,
±∂v/∂y, or ±∂w/∂z for u, v, and w, respectively. As an example, one can see
how the fluid cell F_{i−1,j} is adjusted according to the derivative of the boundary
cell B_{i,j} in Figure 3.2.
The true difficulty is the actual tracking of obstacles, specifically when working
with dynamic 3D scenes in which objects must first be voxelized in order to be
treated by the algorithm. See [Crane et al. 07] for a possible voxelization method.
3.4 Code
Short and simple code for the 2D solver is presented in this section. In two
dimensions, the x- and y-components hold velocities and the z-component holds
the density. Setting ∆x = ∆y = 1 greatly simplifies the code. A 3D demo is also
available in the accompanying web materials (K ≈ 0.2, ∆t = 0.15).
///< Central Finite Differences Scale.
float2 CScale = 1.0f / 2.0f;
float S = K / dt;

// du/dx, du/dy
float3 UdX = float3(FieldMat[0] - FieldMat[1]) * CScale;
float3 UdY = float3(FieldMat[2] - FieldMat[3]) * CScale;
///<
///< Solve for density.
///<
FC.z -= dt * dot(float3(DdX, Udiv), FC.xyz);

///< Related to stability.
FC.z = clamp(FC.z, 0.5f, 3.0f);

///<
///< Solve for velocity.
///<
float2 PdX = S * DdX;
float2 Laplacian = mul((float4)1, (float4x2)FieldMat) - 4.0f * FC.xy;
float2 ViscosityForce = v * Laplacian;

///< Semi-Lagrangian advection.
float2 Was = UV - dt * FC.xy * Step;
FC.xy = tex2D(FieldLinearSampler, Was).xy;

FC.xy += dt * (ViscosityForce - PdX + ExternalForces);

///< Boundary conditions.
for (int i = 0; i < 4; ++i)
{
    if (IsBoundary(UV + Step * Directions[i]))
    {
        float2 SetToZero = (1 - abs(Directions[i]));
        FC.xy *= SetToZero;
    }
}

return FC;
3.5 Visualization
One of the disadvantages of the Eulerian formulation is the lack of geometric
information about the fluid. So far, we have captured its motion with the velocity
field, but we still don’t know its shape. Nevertheless, there are many ways to
visualize a fluid. In this section we briefly discuss two simple techniques. The first
consists of advecting particles under the computed velocity field and the second,
of advecting a scalar density field.
3.5.1 Particles
Using particles in a one-way interaction with the fluid is by far the simplest
and most efficient technique for visualizing a fluid. Since the velocity field is computed
on the GPU, the whole system can run independently with very few interactions
with the CPU. Once the particles are initialized, we sample only the velocity field
in order to update their positions as illustrated in the following code:
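A minimal CPU-side sketch of this update, with sampleU/sampleV standing in for the bilinear velocity fetch done in the actual shader (all names here are illustrative), is:

typedef struct { float x, y; } Particle;

/* Forward-advect each particle by the velocity sampled at its position. */
void advectParticles(Particle *particles, int count, float dt,
                     float (*sampleU)(float x, float y),
                     float (*sampleV)(float x, float y))
{
    for (int i = 0; i < count; ++i) {
        float u = sampleU(particles[i].x, particles[i].y);
        float v = sampleV(particles[i].x, particles[i].y);
        particles[i].x += dt * u;
        particles[i].y += dt * v;
    }
}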
∂φ/∂t = u · ∇φ + k∇²φ,   (3.13)

∂φ/∂t = u · ∇φ + k∇²φ − c,   (3.14)

where φ is a scalar density, k a diffusion coefficient, and c a reaction constant for
fire.
Numerical schemes for solving this equation are abundant; the one that maps best to the GPU is the semi-Lagrangian method (see Equation (3.9)). Unfortunately, this scheme is numerically dissipative, so the solution loses much of its detail (which, in turn, allows the diffusion term to be omitted from the equation). To address this problem and achieve more compelling visual results, we strongly suggest using the three-pass MacCormack method described in [Selle et al. 08]. This scheme has second-order precision in both space and time, so it keeps the density from losing its small-scale features and does not dissipate its quantity as drastically as the first-order semi-Lagrangian method.
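In outline (a sketch following [Selle et al. 08]; the paper also describes the limiter needed to keep the corrected result stable), the three passes are a forward semi-Lagrangian advection, a backward advection of that result, and a correction by half the difference:

φ̂^{n+1} = A(φ^n),   φ̂^n = A^R(φ̂^{n+1}),   φ^{n+1} = φ̂^{n+1} + (φ^n − φ̂^n)/2,

where A denotes the semi-Lagrangian advection step of Equation (3.9) and A^R the same step performed with the velocity reversed.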
To add more detail to the simulation, one could also amplify the vorticity of the flow (the velocity field) with vorticity confinement, a method discussed in the context of visual smoke simulation in [Fedkiw et al. 01].
3.6 Conclusion
Many algorithms could be generated from this one-phase coupling of both equa-
tions through a density-invariant field. We hope the one illustrated here serves
as a good basis for developers seeking to make use of interactive fluids in their
applications.
Figure 3.3. Smoke density over a 512 × 512 density and 256 × 256 fluid simulation grid.
Figure 3.4. Fire density over a 512 × 512 density and 256 × 256 fluid simulation grid.
Bibliography
[Colin et al. 06] F. Colin, R. Egli, and F.Y. Lin. “Computing a Null Divergence Velocity
Field using Smoothed Particle Hydrodynamics.” Journal of Computational Physics
217:2 (2006), 680–692.
[Crane et al. 07] K. Crane, I. Llamas, and S. Tariq. “Real-Time Simulation and Ren-
dering of 3D Fluids.” In GPU Gems 3, pp. 633–675. Reading, MA: Addison-Wesley,
2007.
[Desbrun and Cani 96] M. Desbrun and M.P. Cani. “Smoothed Particles: A New
Paradigm for Animating Highly Deformable Bodies.” In Proceedings of EG Work-
shop on Animation and Simulation, pp. 61–76. Berlin-Heidelberg: Springer-Verlag,
1996.
[Fedkiw et al. 01] R. Fedkiw, J. Stam, and H. W. Jensen. “Visual Simulation of Smoke.” In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 15–22. New York: ACM, 2001.
[Li et al. 03] W. Li, X. Wei, and A. Kaufman. “Implementing Lattice Boltzmann Com-
putation on Graphics Hardware.” The Visual Computer 19:7–8 (2003), 444–456.
[Muller et al. 03] M. Müller, D. Charypar, and M. Gross. “Particle-Based Fluid Simulation for Interactive Applications.” In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 154–159. New York: ACM, 2003.
[Premože et al. 03] S. Premože, T. Tasdizen, J. Bigler, A. Lefohn, and R.T. Whitaker.
“Particle-Based Simulation of Fluids.” Computer Graphics Forum (Proceedings of
Eurographics) 22:3 (2003), 401–410.
[Selle et al. 08] A. Selle, R. Fedkiw, B. Kim, Y. Liu, and J. Rossignac. “An
Unconditionally Stable MacCormack Method.” Journal of Scientific Computing
35:2–3 (2008), 350–371.
[Stam 99] J. Stam. “Stable Fluids.” In Proceedings of the 26th Annual Conference on
Computer Graphics and Interactive Techniques, pp. 121–128. New York: ACM,
1999.
[Xu et al. 09] R. Xu, P. Stansby, and D. Laurence. “Accuracy and Stability in Incom-
pressible SPH (ISPH) Based on the Projection Method and a New Approach.”
Journal of Computational Physics 228:18 (2009), 6703–6725.
4
4.1 Introduction
Many techniques in computer graphics are based on mathematical models to
realistically simulate physical phenomena, such as fluid dynamics [Stam 99], or
to deform and merge complex object meshes together [Yu et al. 04].
Many of these mathematical models involve the solution of Poisson partial
differential equations, or more general elliptic equations, making the availabil-
ity of efficient Poisson solvers crucial, particularly for real-time simulation. For
example, the simulation of incompressible fluid flows often relies on projection-
correction techniques where the pressure fields are solutions of a Poisson equation.
Solving this equation is important, not only to obtain realistic flow dynamics, but
also for the stability of the simulation. In fact, many efforts have been dedicated
to the development of fast and stable fluid solvers [Stam 99]; the solution of
the Poisson pressure equation constitutes the most time-consuming part of these
algorithms.
This chapter presents an implementation of various iterative methods for the
resolution of Poisson equations on heterogeneous parallel computers. Currently,
most fast Poisson solvers implement the simple Jacobi method. After reviewing
Jacobi iterative methods and variants in Section 4.3, we introduce more advanced
iterative techniques based on multiscale iterations on a set of embedded grids,
namely the so-called multigrid methods in Section 4.4. For all of these methods,
we provide some theoretical background and discuss their efficiency and com-
plexity with regard to their implementation. We particularly detail the multi-
grid method which involves several operators whose implementation is critical
to efficiency. In Section 4.5, we provide a tutorial for the OpenCL implementa-
tion of the various algorithms, which are subsequently tested and compared in
Section 4.6. We end the chapter with a discussion of the efficiency of the methods
in Section 4.7. Specifically, we show that although more complex to implement,
the multigrid method allows for a significant reduction of both the number of
iterations and the computation time compared with the simpler fixed-grid iterative methods. Hasty developers can skip directly to the implementation in Section 4.5 and refer later to the theoretical background in Section 4.2.
∇ · (∇u) = ∇²u = −f  on Ω ⊂ R^d, (4.1)
The elliptic equation (Equation (4.1)) calls for boundary conditions which can be
of Neumann type,

∂u/∂n = g(x),  x ∈ ∂ΩN, (4.3)

where ∂u/∂n := ∇u · n is the normal derivative at the boundary ∂ΩN, with n pointing outside of the domain, or of Dirichlet type, where the value of u itself is prescribed on ∂ΩD. Here ∂ΩN and ∂ΩD are distinct portions of ∂Ω such that ∂Ω = ∂ΩN ∪ ∂ΩD.
This problem is represented schematically in Figure 4.1. In this chapter, we focus
on Neumann-type boundary conditions. Other types of boundary conditions are
discussed in Section 4.7.
Figure 4.1. Poisson problem (Equation 4.1) on a 2D domain Ω with a boundary ∂Ω.
The vector n is normal to the boundary and points outside of the domain.
Note that for well-posed problems, when ∂ΩD = ∅, the data f needs to satisfy the compatibility condition

∫_Ω f(x) dx = ∫_{∂Ω} g(x) dx.
Figure 4.2. Cell values are computed by averaging the 1D function u with an FV method (left). The reconstruction of u^h by linear interpolation over the cell averages (right).
and we denote by u^h_i the computed average of u over the cell with index i:

u^h_i ≈ (1/|C_i|) ∫_{C_i} u dV. (4.6)
The superscript h refers here to the discretized nature of the solution, h being
related to the size of the cells.
Different types of meshes can be considered; we restrict ourselves to structured Cartesian grids made of cells with equal edge size h in all directions (the cell volume is h^d). For a sufficiently refined grid (i.e., small enough h), u^h_i can be identified with the value of u at the center x_i of cell C_i. In addition, such structured grids greatly simplify the reconstruction of the smooth approximation u^h from the averages, using multilinear interpolation between the cell centers. This reconstruction is schematically illustrated in Figure 4.2 (right).
Once the computational domain has been discretized into finite volumes, the objective is to derive a system of equations relating the values u^h_i for i ∈ T. This is achieved by making use of Stokes' theorem [Spivak 71], which consists in replacing the integral of the Poisson equation over a cell by the integral of the normal flux (∂u/∂n) over the cell's boundaries. Specifically,

∫_{C_i} ∇ · (∇u) dx = ∫_{∂C_i} ∇u · n dx = −∫_{C_i} f dx ≈ −f(x_i) |C_i|. (4.7)
∂u/∂x|_{x_i−h/2} − ∂u/∂x|_{x_i+h/2} = h f^h_i, (4.8)

where x_i ± h/2 are the locations of the cell interfaces and we have denoted f^h_i = f(x_i). The FV system is finally obtained by substituting the fluxes ∇u · n by their reconstructions from the set of averaged values {u^h_i; i ∈ T}. Again, different reconstruction strategies can be used; we adopt the second-order reconstruction, where the normal flux is based on the difference between the averages at the two cells C_i and C_j having in common an interface ∂C_ij:

∂u/∂n|_{∂C_ij} ≈ (u^h_i − u^h_j) / |x_i − x_j|. (4.9)
The discrete equation for cell Ci,j,k involves the (unknown) averages over the
cell and its six neighbors having a face in common (see Figure 4.3).
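Summing the flux reconstruction (4.9) over the six faces of cell C_{i,j,k} gives the interior form of this discrete equation (a sketch, up to the sign convention chosen for f):

6u^h_{i,j,k} − (u^h_{i−1,j,k} + u^h_{i+1,j,k} + u^h_{i,j−1,k} + u^h_{i,j+1,k} + u^h_{i,j,k−1} + u^h_{i,j,k+1}) = h² f^h_{i,j,k},

the seven-point stencil of Figure 4.3, which is also the origin of the 6 and −1 entries in the matrix of Figure 4.4.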
Writing this equation for all cells of the mesh, using modified flux reconstructions for the cells neighboring ∂Ω where necessary (see the discussion below), one ends up with a system of N = Nx × Ny × Nz equations for the cell averages u^h_{i,j,k}, 1 ≤ i ≤ Nx, 1 ≤ j ≤ Ny, and 1 ≤ k ≤ Nz. This system can be rewritten in matrix form as

Au = f, (4.12)
where A ∈ R^{N×N} is a sparse matrix, u ∈ R^N is the vector containing the cell averages, and f gathers the corresponding right-hand sides of Equation (4.11).
Figure 4.3. Enlarged view of the 3D Laplace stencil, the left-hand side of Equation (4.11).
The sparsity of A arises from the fact that the fluxes are reconstructed from the
immediate neighboring cells, such that each row of A has only seven nonzero
entries as seen from Equation (4.11). As a result, the memory allocation for A is
less than 7N (due to the treatment of the boundary conditions). An example of
matrix A is shown in Figure 4.4 for Nx = Ny = Nz = 3, before reduction, due to
boundary conditions.
Figure 4.4. The 27 × 27 matrix A obtained for Nx = Ny = Nz = 3: each row contains the value 6 on the diagonal and −1 for each of the cell's (up to six) face neighbors, all other entries being zero.
∂u/∂n = 0,  for x ∈ ∂Ω. (4.13)
This condition states that the flux (or normal derivative) of u is zero everywhere
on ∂Ω. In the context of potential flows, where u is the flow potential, this
corresponds to a no-through-flow BC. For instance, in the classical projection-correction methods for solving the incompressible Navier-Stokes equations (see [Chorin 68]), such a potential flow is used to enforce the divergence-free constraint on the velocity field.
In practice, ghost-cell techniques are commonly used to implement the homogeneous Neumann BC. They consist of creating a virtual layer of cells along the boundary, with values that mirror the inside of the domain. One of the interesting features of this ghost-cell approach is that it immediately extends to other types of BC (nonhomogeneous, Dirichlet, Fourier, periodic domains, etc.), making it very attractive in terms of general code implementation. Indeed, after defining the ghost-cell values (possibly updated at each iteration), the same stencil can be used for all the inner cells of the computational domain. In the case of the homogeneous Neumann BC, a ghost cell is set equal to the inner-domain cell sharing a face with it, so that the flux between the two cells is zero (see Equation (4.9)). Other types of BC follow a similar procedure, whereby the ghost-cell values are defined from their respective (inner-domain) neighboring cell values (see the discussion in [Patankar 80]).
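For illustration (a common set of choices, not specific to this chapter's code), for an inner cell value u_i adjacent to the boundary, the paired ghost value u_g can be chosen as u_g = u_i for a homogeneous Neumann face (zero flux), u_g = u_i + h·g for a prescribed normal flux g, and u_g = 2u_D − u_i for a Dirichlet value u_D imposed at the face, so that the face average (u_g + u_i)/2 equals u_D.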
Figure 4.5. Decomposition of the matrix A into its lower-triangular part L, diagonal
D, and upper-triangular part U , used in the construction of the preconditioners.
In the case of the 3D Laplace matrix (Figure 4.4), we observe that the diagonal is equal to 6 for the inner domain; therefore we can write D = 6I. Using this simplified form, D⁻¹ = (1/6)I, the Jacobi iteration (4.20) becomes

v^{α+1} = (I − (1/6)A) v^α + (1/6) f. (4.21)
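In component form, this amounts to recomputing each interior cell from the average of its six face neighbors, corrected by the local right-hand side; with the sign convention used by the OpenCL kernels of Section 4.5, the update reads

v^{α+1}_{i,j,k} = ( v^α_{i−1,j,k} + v^α_{i+1,j,k} + v^α_{i,j−1,k} + v^α_{i,j+1,k} + v^α_{i,j,k−1} + v^α_{i,j,k+1} − h² f_{i,j,k} ) / 6.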
This method is also called the method of simultaneous displacements: as observed in the visual representation of the Jacobi iteration in Figure 4.6, all cell values are updated simultaneously, using only values from the previous iteration.
Figure 4.6. Visual representation of the Jacobi iteration. Note that the central cell of the stencil on v^α is provided only as a visual cue and does not intervene in the update, because the diagonal of the iteration matrix is zero (the diagonal entries of I − (1/6)A are 1 − 6/6 = 0).
Figure 4.7. Result of several Jacobi iterations on a 1D error function. The high-spatial-frequency components (the sharp edges of e^0) are efficiently smoothed out after a few iterations, while the low-frequency components remain almost unchanged.
Figure 4.8. Red-black labeling of the grid cells for the Gauss-Seidel method. We observe
that all neighbors of a red cell are painted in black and vice-versa, in order to prevent
concurrent read and write access by two threads on the same cell.
Figure 4.9. Visual representation of the second half of the Gauss-Seidel iteration. Please
note that the black neighbors of the red cell have already been computed during the
first half iteration on black cells.
The Gauss-Seidel iteration is then split into two steps where the black cells
are updated first because they do not have any face in common, and the red cells
are updated using the newly computed values of the black cells, as illustrated in
Figure 4.9.
Section 4.5.3 covers the implementation of the Gauss-Seidel method using
red-black ordering with OpenCL.
Each multigrid cycle combines the following operations: a smoothing iteration, a residual computation, a fine-to-coarse projection, a coarse-to-fine interpolation, and an approximation correction.
Knowing that we have almost suppressed all the high frequencies from r^h_α, we observe that the remaining smooth function can be represented on a coarser grid
(where the edge size becomes 2h) without losing important information. The projection operator transfers r^h_α from a fine, h-spaced grid to a coarser, 2h-spaced grid, producing a rougher function f^{2h}_α. This projection actually leads to a new Poisson problem defined on a coarser grid, where all remaining frequencies are a bit higher. The approximate solution to this problem, v^{2h}_α, is not a projection of v^h_α, but the correction term required to reduce the higher error frequencies on v^h_α:

A^{2h} v^{2h}_α = f^{2h}_α. (4.26)
Figure 4.12. Trilinear interpolation process approximating the FV average, used in the
OpenCL projection kernel.
the same way as the traditional OpenGL/GLSL shader and texture setup. Second, compute kernels, written in the OpenCL C language, are programs executed in parallel on the devices in much the same way as traditional vertex or pixel shaders, but with greater flexibility to address broader problems.
The OpenCL API specifies two kinds of memory objects: buffers and images.
Buffers are contiguous arrays of memory indexed by 1D coordinates and com-
posed of any available type (int, float, half, double, int2, int3, int4,
float2, ...). These can either be allocated in global, local or constant memory.
Global memory can be shared between the host and the devices by enqueuing
read or write commands to exchange these data; it is abundant but it has a high
latency and must be used with caution, while local memory is faster but has a
very limited size. Coalesced memory accesses can reduce the latency, but when
memory read and write patterns are random, images can be used to mitigate
this latency. Images share many similarities with textures: they support an au-
tomatic caching mechanism, their access can be filtered through a sampler with
multilinear interpolation, and out-of-bounds access behavior can be configured. We make use of images whenever possible for two reasons. First, our
memory access patterns are mostly random. Second, the sampler filtering can
greatly reduce the computational costs of certain operations such as multilinear
interpolation in the projection operator, or automatic clamping of image coordi-
nates to handle ghost cells with a homogeneous boundary condition.
Finally, parallelization is achieved by enqueueing the execution of compute
kernels over work-groups or ranges of threads organized in 1, 2, or 3D. Each
work-group is composed of work-items, or threads indexed by a unique 1, 2, or
3D identifier inside the global work-group range.
Initializing the compute devices is straightforward using the C++ host bind-
ings. We first initialize the platform to access the underlying compute devices
and select either CPU devices, GPU devices, or both types. Then we create a
context and a command queue in order to execute compute kernels and enqueue
memory transfers.
// fetch all GPU devices on the first OpenCL platform
std::vector<cl::Platform> platforms;
std::vector<cl::Device>   devices;
cl::Platform::get(&platforms);
platforms.at(0).getDevices(CL_DEVICE_TYPE_GPU, &devices);
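The context and command queue mentioned above are then created from the selected devices; a minimal sketch (assuming, as in the rest of the listings, that the first GPU device is used) is:

// create a context over the selected devices and a command queue
// on the first device, used to enqueue kernels and memory transfers
cl::Context      context(devices);
cl::CommandQueue queue(context, devices.at(0));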
Finally, compute kernels are loaded and compiled from OpenCL source:
// load an OpenCL source file into a std::string
std::ifstream srcfile("kernels.cl");
std::string src(std::istreambuf_iterator<char>(srcfile),
                std::istreambuf_iterator<char>(0));

// compile the device program and load the "zero" kernel
cl::Program program(context, cl::Program::Sources(1,
    std::make_pair(src.c_str(), src.size())));
program.build(devices, "-Werror");
cl::Kernel kzero = cl::Kernel(program, "zero");
This zero compute kernel is later used to clear global memory buffers on the device. In the OpenCL C language, it is written as follows:
// sz contains the buffer stride along each 3D axis
const float4 sz = (float4)(1, get_global_size(0),
    get_global_size(0) * get_global_size(1), 0); // strides

// vh is a global buffer used to write the zero output
vh[(int)dot(id, sz)] = 0; // vh[id.x + id.y*sz.y + id.z*sz.z]
}
This kernel must be launched with a global work-size equal to the number
of elements in the input buffer vh so that each element is written by exactly
one work-item (or thread). The id and sz variables are initialized with the 3D
work-item identifier and work-size stride so that their dot product directly gives
the corresponding memory location in the 1D buffer (i.e., id.x + id.y*sz.y +
id.z*sz.z).
In order to run this compute kernel, we need to allocate a memory buffer on
the device and create the global (computational domain) and local (concurrent
work-items) work-size ranges:
// initialize a read-only buffer for 64^3 4-byte floats;
// this buffer cannot be updated by the host, only read from
cl::Buffer buffer(context, CL_MEM_READ_ONLY, 64*64*64 * 4);

// enqueue the kernel and wait for it to finish its task
queue.enqueueNDRangeKernel(zero, cl::NullRange, gndr, lndr);
queue.enqueueBarrier();
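The gndr and lndr ranges used in the call above are not part of this listing; for the 64³ example they could be built as follows (a sketch, using the cubic (8, 8, 8) local size recommended later in the chapter):

cl::NDRange gndr(64, 64, 64); // global range: one work-item per cell
cl::NDRange lndr(8, 8, 8);    // local range: an 8x8x8 work-group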
float data[64 * 64 * 64];

// one-component float image format
cl::ImageFormat fmt(CL_R, CL_FLOAT);
cl::Image3D img(context, CL_MEM_READ_ONLY, fmt, 64, 64, 64);

typedef struct {
    cl::size_t<3> size; // 3D size of the problem
    cl::Image3D   fh;   // discretized f function
    cl::Image3D   vh;   // solution approximation
    cl::Image3D   rh;   // residual
} Problem;
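Host data such as the array above can be uploaded into one of these 3D images with enqueueWriteImage; a minimal sketch for filling the fh image of a Problem p with the 64³ volume (the origin and region variables are illustrative) is:

cl::size_t<3> origin, region;          // upload the full 64^3 volume
origin[0] = origin[1] = origin[2] = 0;
region[0] = region[1] = region[2] = 64;
queue.enqueueWriteImage(p.fh, CL_TRUE, origin, region, 0, 0, data);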
#define dx (float4)(1, 0, 0, 0)
#define dy (float4)(0, 1, 0, 0)
#define dz (float4)(0, 0, 1, 0)
The fourth component of these float4 vectors is never used here but is required to specify image sampling coordinates. The Jacobi compute kernel for the device is then implemented as follows:
// sampler for accessing the vh and fh images;
// out-of-bounds accesses are clamped to the domain edges
const sampler_t sampler = CLK_ADDRESS_CLAMP_TO_EDGE;

const float s =
    (read_imagef(vh, sampler, id - dx).x +
     read_imagef(vh, sampler, id + dx).x +
     read_imagef(vh, sampler, id - dy).x +
     read_imagef(vh, sampler, id + dy).x +
     read_imagef(vh, sampler, id - dz).x +
     read_imagef(vh, sampler, id + dz).x -
     h2 * read_imagef(fh, sampler, id).x) / 6.0f;

vvh[(int)dot(id, sz)] = s;
}
The function read_imagef is a built-in function that accesses a read-only image through a sampler at a specific coordinate, passed as a float4 vector, and returns a float4 vector containing the result. Since we initialize fh and vh as one-component images, only the first component (x) of the result is meaningful.
This kernel is launched with a global work size equal to the 3D extents of the domain grid. The local work size depends on the capabilities of the OpenCL compute device and must be a divisor of the global work size along each dimension. Experience shows that a cubic size (and in particular (8, 8, 8) for current GPUs) is an optimal work-group configuration because it leads to minimal spatial scattering of memory accesses, thus fully exploiting the image cache. After each iteration, the output buffer is copied back to the vh image to be reused,
// offset and sz are size_t[3]; offset contains zeros
// and sz contains the Problem size or 3D image extents
queue.enqueueCopyBufferToImage(buffer, image, 0, offset, sz);
queue.enqueueBarrier();
Once every few iterations, the approximation error of v^α is tested on the host, to decide whether or not to continue refining, by computing the L2 norm of the residual and comparing it against an ε value:
// compute the residual for the current Problem p
residual(p.fh, p.vh, p.rh, p.size, h2);
queue.enqueueReadImage(p.rh, CL_TRUE, null_sz,
                       p.size, 0, 0, &r[0]);
float rnorm = L2Norm(r, fine.size); // sqrt(sum(r*r))

// break the solver loop
if (rnorm < epsilon) break;
// the initial x cell identifier offset depends on the
// parity of id.y + id.z and on the current pass color
id.x += ((int)(id.y + id.z + red) & 1);

... // compute s (see Jacobi)
vvh[(int)dot(id, sz)] = s;
}
kernel void rbsor(..., float w) // weighting factor w
{
    ... // compute id and sz (see red-black Gauss-Seidel)
    vvh[(int)dot(id, sz)] =
        (1 - w) * read_imagef(vh, sampler, id).x + w * s;
}
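In other words, the kernel blends the previous value with the Gauss-Seidel update s: v^{α+1} = (1 − w) v^α + w s. Choosing w = 1 recovers plain Gauss-Seidel, w > 1 over-relaxes the correction (the usual SOR setting), and w < 1 damps it; this is how the weights 0.75, 1.5, and 1.25 are used in the V-cycle below.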
// V-cycle descending step, from finest to coarsest level
for (k = 0; k < int(p.size() - 1); ++k)
{
    rbsor(p[k].fh, p[k].vh, p[k].size, h2, 0.75f, preSteps);
    residual(p[k].fh, p[k].vh, p[k].rh, p[k].size, h2);
    project(p[k].rh, p[k+1].fh, p[k+1].size);
    zero(p[k+1].vh, p[k+1].size);
}
// "direct" solving on the coarsest level
rbsor(p[k].fh, p[k].vh, p[k].size, h2, 1.5f, directSteps);

// V-cycle ascending step, from coarsest to finest level
for (--k; k >= 0; --k)
{
    interpolate_correct(p[k+1].vh, p[k].vh, p[k].size);
    rbsor(p[k].fh, p[k].vh, p[k].size, h2, 1.25f, postSteps);
}
The W-cycle is a direct extension of this code, adding an inner loop for kmax > k > 0 in order to repeat the subcycle several times before finally reaching the finest level again. Experiments show that choosing two pre-smoothing passes and four post-smoothing passes at each level, with 32 direct-smoothing iterations on the coarsest level, leads to the best measured convergence rate. Additionally, using four subcycles greatly reduces the overall computation time and seems to be the best configuration for medium to large grid sizes (≥ 64³).
rh[(int)dot(id, sz)] =
    -read_imagef(fh, sampler, id).x -
    (6 * read_imagef(vh, sampler, id).x -
     (read_imagef(vh, sampler, id - dx).x +
      read_imagef(vh, sampler, id + dx).x +
      read_imagef(vh, sampler, id - dy).x +
      read_imagef(vh, sampler, id + dy).x +
      read_imagef(vh, sampler, id - dz).x +
      read_imagef(vh, sampler, id + dz).x)) / h2;
}
The projection kernel uses trilinear interpolation (Figure 4.12) to average eight cells on the fine level into one cell on the coarser level, by enabling linear filtering and taking a sample at the center of the eight fine cells:
// filter images with trilinear interpolation:
// cell-center indexing begins at 0.5 so that
// integer values are automatically interpolated
const sampler_t sampler = CLK_FILTER_LINEAR;

// make the image coordinate at the vertex shared
// between the eight finer grid cells (see Figure 4.12),
// then multiply by 4 for coarsening: (2h)^2 = 4h^2
f2h[(int)dot(id, sz)] =
    read_imagef(rh, sampler, id * 2 + dx+dy+dz).x * 4;

vh[(int)dot(id, sz)] = read_imagef(vh, sampler, id).x -
    read_imagef(v2h, sampler, id / 2).x;
}
4.6 Benchmarks
As expected, we can observe in Figure 4.13 that the number of iterations required by the Jacobi method grows very quickly with the domain size, its overall complexity reaching O(N³). The Gauss-Seidel method has the same complexity but requires fewer iterations, as the constant complexity factor is halved.
The SOR method dramatically reduces this factor by over-correcting the local error, but its growth with the problem size remains superlinear. Fortunately, the CSMG method is confirmed to have a linear complexity of O(N), where N is the number of unknowns or cells.
Figure 4.13. Iterations per method for cubic domains until |e_α| < 10⁻³. The x-axis represents the domain size (cubed) and the y-axis shows the number of iterations required to converge to ε = 10⁻³.
Although its iterations have a higher computational cost, the multigrid cor-
rection scheme method shows a clear advantage over the pure iterative methods
in terms of computation time per unknown in Figure 4.14. The setup cost of
the CSMG method makes it more efficient for large problems than for smaller ones (< 32³), where the SOR method should be preferred.
Figure 4.14. Computation time (µs) per cell for cubic domains on a GPU. The x-axis represents the domain size (cubed) and the y-axis shows the computation time per cell to converge to ε = 10⁻³.
Figure 4.15. Time profiling of the execution of four CSMG 4W-cycles for a 1283 com-
putational domain, running on a Nvidia GTX-285 GPU. More than half of the time is
spent either copying buffers into images (memcpyDtoAasync) or transferring the residual to the host (memcpyAtoHasync) to test the convergence.
4.7 Discussion
We introduced the theoretical background and implementation framework for a
fast OpenCL solver for the 3D Poisson equation with Neumann external boundary conditions. This is by no means a generic solver, but it can be extended to address other problems, such as different boundary conditions or discretization methods.
In particular, writing compute results directly to OpenCL 3D images would yield a significant computation-time decrease: in the current implementation, half of the time is spent copying output buffers back into images (see Figure 4.15). Unfortunately, this would alienate most of the current OpenCL hardware, because writing to 3D images is an extension supported by very few devices as of the writing of this book.
Finally, using a parallel reduction on the OpenCL device to compute the
residual norm would also result in a significant performance boost. Indeed, it
would require transferring only one float value instead of the whole residual grid
to test convergence on the host and decide whether or not to continue refining
the solution approximation.
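As a sketch of the host side of such a scheme (assuming a hypothetical device-side reduction, wrapped here in a host function reduce_norm2 in the style of the other wrappers, that sums r·r into a single float), only four bytes would then cross the bus per convergence test:

// hypothetical reduction: sums the squared residual into result[0]
cl::Buffer result(context, CL_MEM_WRITE_ONLY, sizeof(float));
reduce_norm2(p.rh, result, p.size);

float norm2 = 0.0f; // read back a single float instead of the whole grid
queue.enqueueReadBuffer(result, CL_TRUE, 0, sizeof(float), &norm2);
if (sqrt(norm2) < epsilon) { /* converged: stop refining */ }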
Bibliography
[Briggs et al. 00] William L. Briggs, Van Emden Henson, and Stephen F. McCormick.
A Multigrid Tutorial, Second edition. Philadelphia: SIAM Books, 2000.
[Cormen et al. 01] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E.
Leiserson. Introduction to Algorithms. New York: McGraw-Hill Higher Education,
2001.
[Crane et al. 07] Keenan Crane, Ignacio Llamas, and Sarah Tariq. “Real-Time Simu-
lation and Rendering of 3D Fluids.” In GPU Gems 3, edited by Hubert Nguyen,
Chapter 30. Reading, MA: Addison Wesley Professional, 2007.
[Khronos Group 09] Khronos Group. The OpenCL Specification, version 1.0.48.
Khronos OpenCL Working Group, 2009. Available online ([Link]
registry/cl/specs/[Link]).
[McCormick 88] Stephen F. McCormick. Multigrid Methods: Theory, Applications, and
Supercomputing. New York: Marcel Dekker, 1988.
[Patankar 80] Suhas V. Patankar. Numerical Heat Transfer and Fluid Flow. New York:
Hemisphere Publishing Corporation, 1980.
[Saad 03] Yousef Saad. Iterative Methods for Sparse Linear Systems, Second edition.
Philadelphia: Society for Industrial and Applied Mathematics, 2003.
[Southwell 35] Richard V. Southwell. “Stress-Calculation in Frameworks by the Method
of Systematic Relaxation of Constraints. I and II.” In Proceedings of the Royal So-
ciety of London. Series A, Mathematical and Physical Sciences, pp. 56–96. London,
1935.
[Spivak 71] Michael Spivak. Calculus on Manifolds: A Modern Approach to Classical
Theorems of Advanced Calculus. New York: HarperCollins Publishers, 1971.
[Stam 99] Jos Stam. “Stable Fluids.” In SIGGRAPH ’99: Proceedings of the 26th
Annual Conference on Computer Graphics and Interactive Techniques, pp. 121–
128. New York: ACM Press/Addison-Wesley Publishing Co., 1999.
[Strang 07] Gilbert Strang. Computational Science and Engineering. Wellesley, MA:
Wellesley-Cambridge Press, 2007.
[Yu et al. 04] Yizhou Yu, Kun Zhou, Dong Xu, Xiaohan Shi, Hujun Bao, Baining Guo,
and Heung-Yeung Shum. “Mesh Editing with Poisson-Based Gradient Field Ma-
nipulation.” ACM Transactions on Graphics 23:3 (2004), 644–651.
Contributors
Kristof Beets is the business development manager for POWERVR Graphics at Imag-
ination Technologies. In this role he leads the overall graphics business promotion and technical marketing efforts, as well as the in-house demo development team. Previously he man-
aged the POWERVR Insider ecosystem and started work as a development engineer
on SDKs and tools for both PC and mobile products as a member of the POWERVR
Developer Relations Team. Kristof has a first degree in electrical engineering and a
master’s degree in artificial intelligence, both from the University of Leuven, Belgium.
He has spoken at numerous industry events including SIGGRAPH, GDC, and Euro-
Graphics, and has had articles published in ShaderX2, 5, 6, and 7 and GPU Pro books
as well as online by the Khronos Group, Beyond3D, and 3Dfx Interactive.
Andrea Bizzotto received his BS and MS degrees in computer engineering from the Uni-
versity of Padua, Italy. After completing college he joined Imagination Technologies,
where he developed a range of 3D demos for Imagination’s POWERVR Insider ecosys-
tem and published an article in GPU Pro. His research interests include 3D graphics,
computer vision, image processing, algorithm theory, and software design. More info at
[Link].
Samuel Boivin is a research scientist at INRIA, currently on leave of absence. He is now the head
of Research and Development at SolidAnim, a company specializing in Visual Effects
for movies and video games. He earned a PhD in computer graphics in 2001 from Ecole
Polytechnique in Palaiseau (France). He has published several papers about computer
graphics in many conferences, including SIGGRAPH. His research topics are photo-
realistic real-time rendering, real-time augmented reality, fluid dynamics, and inverse
techniques for acquiring material properties (photometric, mechanical) from videos.
Xavier Bonaventura is currently a PhD student at the Graphics & Imaging Laboratory
of the University of Girona, researching viewpoint selection. He developed his master's
thesis on hardware tessellation at Budapest University of Technology and Economics,
within the Erasmus program.
George Borshukov is CTO of embodee, a technology company that helps anyone find,
select, and personalize apparel using a fun yet useful process. He holds an MS from
the University of California, Berkeley, where he was one of the creators of The Cam-
panile Movie and real-time demo (1997). He was technical designer for the “bullet
time” sequences in The Matrix (1999) and received an Academy Scientific and Technical
Achievement Award for the image-based rendering technology used in the film. Bor-
shukov led the development of photoreal digital actors for The Matrix sequels (2003)
and received a Visual Effects Society Award for the design and application of the uni-
versal capture system in those films. He is also a co-inventor of the UV pelting approach
for parameterization and seamless texturing of polygonal or subdivision surfaces. He
joined Electronic Arts in 2004 to focus on setting a new standard for facial capture,
animation, and rendering in next-generation interactive entertainment. As director of
creative R&D for EA SPORTS he led a team focused on expanding the label’s brand
and its interactive entertainment offerings through innovation in hi-fidelity graphics and
new forms of interactivity, including 3D camera devices.
Ken Catterall graduated from the University of Toronto in 2005 as a specialist in soft-
ware engineering. He is currently a leading engineer for Imagination Technologies’
business development team, where he has been working since 2006 developing and sup-
porting Imagination’s key graphics demos. He has previously contributed to ShaderX6 ,
ShaderX7 , and GPU Pro.
Fabrice Colin earned a PhD in mathematics from the University of Sherbrooke (Canada)
in 2002, under the supervision of Dr. Kaczynski (University of Sherbrooke) and Dr.
Willem (Université catholique de Louvain), with a thesis in the field of partial differen-
tial equations (PDE). In 2005 he was hired by the Mathematics and Computer Science
Department at Laurentian University, where he is currently a professor. His primary
research interests are the variational and topological methods in PDE. But besides his
theoretical work, he started collaborating in 2003 with Professor Egli, from the Depart-
ment of Computer Science (University of Sherbrooke), on the numerical simulation of PDEs related, for instance, to fluid dynamics or computer graphics.
Daniel Collin is a senior software engineer at EA DICE, where for the past five years he has spent most of his time doing CPU and memory optimizations, tricky low-level debug-
ging, and implementing various culling systems. He is passionate about making code
faster and simpler in a field that moves rapidly.
Joe Davis graduated from the University of Hull in 2009 with an MEng degree in computer science, where his studies focused on real-time graphics and physics for games.
Joe currently works as a developer technology engineer for the Imagination Technologies
POWERVR Graphics SDK, where his time is spent developing utilities, demos, and tu-
torials as well as helping developers optimize their graphics applications for POWERVR-
based platforms.
Jose I. Echevarria received his MS degree in computer science from the Universidad
de Zaragoza, Spain, where he is currently doing research in computer graphics. His re-
search fields range from real-time to off-line rendering, including appearance acquisition
techniques.
Richard Egli has been a professor in the Department of Computer Science at University
of Sherbrooke since 2000. He received his BSc and MSc degrees in computer science at the University of Sherbrooke (Québec, Canada). He received his PhD in
computer science from the University of Montréal (Québec, Canada) in 2000. He is the
Holger Gruen ventured into 3D real-time graphics right after his university graduation,
writing fast software rasterizers in 1993. Since then he has held research and also
development positions in the middleware, games, and simulation industries. He moved into developer relations in 2005 and now works for AMD's product
group. Holger, his wife, and his four kids live in Germany close to Munich and near the
Alps.
Martin Guay completed a BSc in mathematics in 2007 and then joined Cyanide's Montreal-based studio, where he worked on the development of several game titles as a graphics programmer. In 2010 he joined the MOIVRE research centre, Université de Sherbrooke, as a graduate student in computer science, where he conducted research in the fields of computational physics and physically based animation of fluids.
Stephen Hill is a 3D technical lead at Ubisoft Montreal, where for the past 6 years he has
been single-mindedly focused on graphics R&D for the Splinter Cell series (Conviction
and Chaos Theory). For him, real-time rendering in games is the perfect storm of
artistic and technical creativity, low-level optimization, and pure showing off, all within
a fast-paced field.
Olivier Le Maître is a member of the research staff at the French National Center for Scientific Research (CNRS), working in the Department of Mechanics and Energetics of the
LIMSI lab. After receiving a PhD in computational fluid dynamics in 1998 he joined the
University of Evry, where he taught scientific computing, numerical methods, and fluid
mechanics. He joined CNRS in 2007 as a full-time researcher, where he directs research
in complex flow simulations, uncertainty quantification techniques, and stochastic mod-
els.
Belen Masia received her MS degree in computer science and systems engineering from
the Universidad de Zaragoza, Spain, where she is currently a PhD student. Her research
interests lie somewhere between the fields of computer graphics and computer vision,
currently focusing on image processing and computational photography. Her work has
already been published in several venues, including Transactions on Graphics.
in 2004 and his PhD in 2010. He coauthored several papers in the fields of real-time ren-
dering and scientific visualization. His current research interests are real-time rendering,
visibility computations, shadow algorithms, and real-time global illumination.
Fernando Navarro works as lead technical artist for Lionhead Studios (Microsoft Games
Studios). Prior to that he directed the R&D departments of different production houses.
His experience covers VFX, feature films, commercials, and games. He has an MSc from the
Universidad de Zaragoza and is currently pursuing a PhD in computer science, focused
on advanced rendering algorithms.
Matt Pharr is a principal engineer at Intel and the lead graphics architect in the Ad-
vanced Rendering Technology group. He previously co-founded Neoptica, worked in
the Software Architecture group at NVIDIA, co-founded Exluna, worked in Pixar’s
Rendering R&D group, and received his PhD from the Stanford Graphics Lab. With
Greg Humphreys, he wrote the textbook Physically Based Rendering: From Theory to
Implementation. He was also the editor of GPU Gems 2.
Peter Quayle graduated from the University of Brighton with a degree in computer
science. His interest in real-time computer graphics originates from a fascination with
the demoscene. Peter currently works at Imagination Technologies as a business devel-
opment engineer.
Donald Revie graduated from the University of Abertay with a BSc (Hons) in computer
games technology before joining Cohort Studios in late 2006. He has worked on Cohort’s
Praetorian Tech platform since its inception, designing and implementing much of its
renderer and core scene representation. He has also had the opportunity to develop
his interest in graphics programming through working on shaders and techniques across
various projects. He has continued to help develop Praetorian and Cohort’s shader
library across four released titles and numerous internal prototypes. He finds writing
about himself in the third person incredibly unsettling.
Pawel Rohleder claims he has been interested in computer graphics and the game industry since he was born. He is keen on knowing how all things (algorithms) work, then trying to improve them and develop new solutions. He started programming games professionally in 2002; he was a 3D graphics programmer on Techland's ChromeEngine team for three years and since 2009 has been working as a Build Process Management Lead / R&D manager. He is also a PhD student in computer graphics (at Wroclaw University of Technology, since 2004).
Isaac Rudomin earned a PhD in computer science from the University of Pennsylvania
in 1990, with a dissertation “Simulating Cloth using a Mixed Geometrical-Physical
Method,” under the guidance of Norman I. Badler. He joined the faculty at Instituto
Tecnológico y de Estudios Superiores Monterrey, Campus Estado de México, in 1990
and from that date on he has been active in teaching and research. He is interested
in many areas of computer graphics and has published a number of papers. Lately his
research has an emphasis on human and crowd modeling, simulation, and rendering.
Marco Salvi is a senior graphics engineer in the Advanced Rendering Technology group
at Intel, where he focuses his research on new interactive rendering algorithms and
software/hardware graphics architectures. Marco previously worked for Ninja Theory and Lu-
casArts as a graphics engineer on multi-platform and PS3-exclusive games where he
was responsible for architecting renderers, developing new rendering techniques and
performing low-level optimizations. Marco received his MSc in physics from the Uni-
versity of Bologna in 2001.
Wojciech Sterna has been interested in computer graphics and games development since 2004. He is keen on implementing graphics algorithms (shadows especially) and never lets go until he understands all the theoretical basics that lie behind them. Wojtek is currently finishing a remake of Grand Theft Auto 2 called Greedy Car Thieves (http://
[Link]) and hopes that players will love that game as much as he does.
Michael Schwärzler is a PhD student, researcher, and project manager in the field
of real-time rendering at the VRVis Research Center in Vienna, Austria. In 2009,
he received his master’s degree in computer graphics and digital image processing at
the Vienna University of Technology, and his master’s degree in computer science and
management at the University of Vienna. His current research efforts are concentrated
on GPU lighting simulations, real-time shadow algorithms, image-based reconstruction,
and semantics-based 3D modeling techniques.
Nicolas Thibieroz has more than 12 years of experience working in developer relations
for graphics hardware companies. He taught himself programming from an early age as
a result of his fascination with the first wave of “real-time” 3D games such as Ultima Underworld. After living in Paris for 22 years, Nicolas decided to pursue his studies in England, where he earned a bachelor's degree in electronic engineering in 1996. Not put
off by the English weather, Nicolas chose to stay and joined PowerVR Technologies to
eventually lead the developer relations group, supporting game developers on a variety
of platforms and contributing to SDK content. He then transitioned to ATI Technologies
and AMD Corporation, where his current role involves helping developers optimize the
performance of their games and educating them about the advanced features found in
cutting-edge graphics hardware.
Kiril Vidimce is a senior software architect and researcher at Intel’s Advanced Render-
ing Technologies group. His research and development interests are in the areas of real-time rendering, cinematic lighting, physically based camera models, and computational
photography. Previously he spent eight years at Pixar as a member of the R&D group
working on the in-house modeling, animation, lighting, and rendering tools, with a brief
stint as a Lighting TD on Pixar’s feature film, Cars. In his previous (academic) life
he did research in the area of multiresolution modeling and remeshing. His research
work has been published at SIGGRAPH, EGSR, IEEE Visualization, IEEE CG&A,
and Graphics Interfaces.