Generalized Decoupled and Object Space Shading System
Figure 1: Rendered castle model (left) and the associated virtualized shadel sheet (right) with level of detail system (similar to a mipmap).
Abstract
We present a generalized decoupled and object space shading system. This system includes a new layer material system, a dynamic hierarchical sparse shade space allocation system, a GPU shade work dispatcher, and a multi-frame shade work distribution system. Together, these new systems create a generalized solution to decoupled shading, solving the visibility problem, where shading samples which are not visible are shaded; the overshading and undershading problem, where parts of the scene are shaded with more or fewer samples than needed; and the shade allocation problem, where shade samples must be efficiently stored in GPU memory. The generalized decoupled shading system introduced here shades and stores only the samples which are actually needed for rendering a frame, with minimal overshading and no undershading. It does so with minimal overhead, and overall performance is competitive with other rendering techniques on a modern GPU.
CCS Concepts
• Computing methodologies → Rasterization;
1. Introduction

Decoupled shading, also known as object space shading, is a method of rendering specifically designed to address well known weaknesses present in both forward and deferred renderers. Decoupled shading helps address mathematical stability problems caused by evaluating shade samples at different locations each frame.

The advantages of decoupling shading are well known. Until recent advances in GPU architecture, decoupling shading remained impractical due to both performance considerations and the lack of generalized GPU architectures. In 2016, the Nitrous game engine demonstrated that limited decoupled and object space rendering was practical for D3D12 and Vulkan class desktop GPUs. The first Nitrous title, Ashes of the Singularity [Bak16], demonstrated that despite wastage in overshading and the general overhead of decoupled shading, the increases in efficiency for shading in a decoupled manner overcame these issues.

However, Nitrous 1.0 and Ashes of the Singularity did not represent a generalized solution for object space and decoupled shading. Because the game is a top-down strategy game, the general problem of samples which were not visible being shaded was not overly severe, the projected texel density was relatively constant for most objects, and the terrain system was built with a limited variable rate stitch map with shading controlled by the CPU.

Furthermore, artists were able to tune art assets to circumvent object space and Nitrous 1.0 issues, with techniques such as not texture mapping the underside of units, or mapping it with limited texel density. This introduced additional cost and time to art assets relative to deferred and forward rendering architectures. Furthermore, decoupled and object space shading produced other problems; the need to overshade required a substantial amount of shade buffer space, potentially more storage space even than a deferred renderer, though the total bandwidth used was typically far less.

Though this solution worked well for Ashes of the Singularity, it was not a generalized solution for decoupled and object space shading. The same rendering technique would perform suboptimally where there is both significant occlusion and large variation in shading samples across the domain of a single object.

In earlier versions of Nitrous, we attempted to solve the overshading and large texel variation by breaking up objects into smaller chunks. While this approach had merits and mitigated some of the aforementioned problems, it still required more memory, and performance was lower than desirable. Additionally, the burden of building art assets broken into small chunks represented a nontrivial cost to production. Part of this cost was related to art assets needing to be built with a methodology which was considerably different from conventional art pipelines and tool support.

For Nitrous 2.0 we therefore desired, based on our experience and the experience of our partners, a solution with a number of constraints and goals:

• Ability to consume art assets with minimal variation from existing tools and art processes. Most art assets have a normal map, albedo map, a color map, and various attribute maps, sometimes stored in vertices. The only extra burden for creation of assets which is acceptable is the creation of a unique UV chart. We note that the creation of the UV chart can be automated in many cases.
• Ability to amortize shading (or parts of the shading) across multiple frames, decoupled temporally from rasterization, allowable on a material by material basis.
• Have near perfect shade sample and shading coverage. Pixels on the screen should be neither overshaded nor undershaded, nor should back facing or occluded shading samples necessarily be rendered.
• Maintain the filtering advantages of decoupled and object space shading. One advantage of object space shading is that most aliasing occurs only once during the capture of the shading attributes and subsequent shading occurs on these captured attribute buffers. This distinct advantage over other rendering technologies, such as forward, deferred, or even the Reyes architecture, is one of the fundamental reasons to use decoupled shading.
• Efficiently store shade samples in GPU memory. Nitrous 1.0’s shade packing scheme required significant wastage due to gutter spaces and other problems. Though the memory was never read or written to, it occupied valuable space in the high performance memory for the GPU.
• Competitive with other real time rendering architectures such as deferred, forward, and forward+ on a modern GPU. We define competitive to be within 10% of the performance of a similar scene implemented via forward or deferred.
• Allow arbitrarily complex materials with relatively strong robustness. Materials should render with high quality without massive effort from shader authors. Capabilities such as complex material layering should be supported. Decoupled shading materials should be a superset of any material authorable in other rendering architectures.
• Be compatible with a variety of rendering technologies such as rasterization or ray tracing.
• Have simple, easy to understand, and real time adjustable performance controls.

With these constraints in mind, we believe that the generalized decoupled shading system solves all these problems and can be used to render most types of scenes created for real time systems today.

2. Related Work

The Reyes Rendering Architecture [CCC87] implemented a technique called object-based shading. Reyes subdivides surfaces into micropolygons that are approximately the size of a pixel, which are inserted into a jittered grid, a super-sampled z-buffer, to introduce noise and avoid aliasing. In object-based shading, shading happens before rasterization, as opposed to forward shading, where shading takes place during rasterization in screen space, and deferred shading [DWS*88], where shading takes place after rasterization. Since shading occurs before occlusion testing in the z-buffer, overdrawing can occur.

Burns et al. [BFM10] built upon Reyes by making two major improvements: surfaces no longer need to be subdivided to micropolygon size, and 2×2 blocks of pixels are shaded on demand after performing visibility testing, reducing the number of samples that are shaded but not used. Fascione et al. [FHL*18] created Manuka, a modern implementation of Reyes. While not developed for real time rendering, it did make many improvements to Reyes including decoupling shading and path sampling.

Ragan-Kelley et al. [RLC*11] proposed a hardware extension based on decoupled sampling, which samples visibility and shading separately, and applied it to depth of field and motion blur. Liktor and Dachsbacher [LD12] used a deferred shading system where shading samples are cached when computed, to speed up the rendering of stochastic supersampling, depth of field, and motion blur. Clarberg et al. [CTH*14] proposed hardware extensions for computing shading in texture space, reducing the overshading problem and allowing for bilinear filtering.
3. Architecture

The generalized decoupled shading engine consists of several complex sub-systems which interact to render a provided scene. Because of the general complexity and latency of CPU and GPU interaction, the systems and control flow remain largely on the GPU. The primary responsibility of the generalized decoupled shading engine is the allocation, generation, processing, and management of shade elements, which we call shadels.

The process works by running two primary loops which may run at different frequencies. For our upcoming title, Ara: History Untold, we always run them at the same temporal frequency. The first loop is the raster frame, which corresponds roughly to what is typically thought of as the entire rendering pipeline of a deferred or forward renderer. The second loop is the shade frame, which performs the actual shading. In the typical configuration, the shade frame runs no more than 30 times per second, while the raster frame runs at a much higher rate, ideally as fast as the display device can display, which may be upwards of 200 times per second.

The raster frame and the shade frame have very different computational needs. The raster frame mostly uses rasterization hardware such as triangle rasterization and the depth buffer, whereas the shade frame requires only compute shaders. Because they can be run in parallel, in one configuration the shade

In the shade frame loop, the process begins by performing a layer space support expander to allocate shadels which are not directly visible but which may contribute to the scene in indirect ways. Once this is done, the shadels themselves are allocated. During the allocation, data is collected on the actual use of shadels, and a global adjustment factor is computed to adjust the shading rate so that the shading fits into the desired shade space and does not exceed the number of shadels allocated.

The next step is for the work queues to be generated, allocated, and filled. Shadels must be dispatched for every object, for each material instance on that object, and for each layer on that object. Once shadels have been processed, they are sent back to the render frame. Figure 3 illustrates these steps and the raster frame interaction with the shade frame.

3.1. Objects, Shade requests and Materials

In Nitrous 2.0, the following are rendered through the decoupled shading system: objects, which consist of triangles; a material instance which goes with the object; dynamic and constant data, which is generated by the application; and a group of arbitrary resources, usually collections of textures.

All objects together represent a scene and anything which may contribute to the final rendering of the screen should be in this
scene. This scene is created implicitly by tracking all objects which have been requested to be rendered. Scene culling and object refinement occur before the decoupled shading system begins its processing. Figure 4 is a scene from Ara: History Untold.

Figure 3: The Raster and Shade frames. The raster frame collects the scene and prepares it for the GPU, then dispatches it to the shadel mark prepass. The shadel mark prepass marks which shadels need to be rasterized and sends this data to the shade frame loop. Once the shadels have been processed in the shade frame loop they are sent back to the render frame for rasterization.

Figure 4: A pre-alpha screenshot of Ara: History Untold. Everything in this screenshot is rendered through the decoupled shading system, except for the trees, people, and special effects such as fire.

Objects have two distinct stages of existence. They may be instantiated, which means that they exist somewhere in the world and will have an object instance ID, and they may also be requested for rendering, which means they will be rendered into the scene if they are visible. Multiple objects may share material instances and virtually any buffer or GPU resource, with the exception that they will never share shadel space.

3.2. Attribute Processing

The triangulated meshes which we use typically have some attributes stored in the vertices, including normals, positions, texture coordinates, specular power, albedo, etc., which are interpolated across a triangle for each pixel. In the decoupled shading system and in object space rendering, triangles are not used for shading; rather, each shadel has a collection of corresponding input attributes.

In our first implementation, triangulated meshes were converted to have shadel input attribute textures. This process involved rendering the model from the 2D texture parameter space into a buffer (thereby capturing the rendered attributes), either repeating this process for each shadel level (analogous to a mip level) or performing a downsampling filter. In Nitrous 2.0, rather than capturing each individual attribute, we follow the approach of Hillesland and Yang [HY16] and only capture the triangle ID. When the shading occurs, the mesh index buffer and vertex buffers are bound as inputs to the shadel shader. The triangle ID is used to look up the vertices, which are used to interpolate the input attributes from the vertex data in nearly exactly the same way as would occur in a forward or deferred renderer. This has the advantage of using far less data, and also keeps attributes more consistent with any corresponding values that might come from rasterized buffers such as shadow maps.

To capture the triangle ID, we rasterize the triangles in texture space, with the output being only the triangle ID. The resolution at which this texture is captured is controlled by a setting for each individual asset. This is an asset cooking step and does not occur while the game is running. The triangle ID is captured for each mip level of the triangle ID texture. One problem that arose was that sometimes a triangle or section of triangles on the mesh resulted in no triangle ID being captured, due to triangles falling in between the coverage rules for rasterization. This can later result in geometry being rendered with no shadels which represent it.

To solve this problem, we make some important modifications to other approaches to capturing triangle IDs. After the attributes are first generated as previously described, we capture them again, but this time switch the rasterization mode to conservative rasterization [AA05][HAO05]. This changes the coverage rule such that all triangles will emit an attribute to any sample they touch. This process is repeated similarly to the aforementioned process.

Next, the two triangle ID maps are merged. Non-conservative rasterization is preferred; however, if a sample exists in the conservative rasterization where no sample exists in the non-conservative version, the merged version uses the conservative rasterization sample. This process means that there is no chance that a triangle, when rasterized, does not have any captured triangle IDs. See Figure 5.

We have found that these improvements greatly increase the robustness of triangulated meshes such that in general the Nitrous engine can consume most assets created for forward or deferred renderers. In addition, by saving only the triangle ID, the only additional memory for an asset is the creation of a triangle ID texture.

3.3. Virtualized shadel allocation

When objects are instantiated and could possibly render (but may not be actually requested to render for a particular frame), they are allocated shade space inside the virtualized shade space system. We call them shadels as they are distinct from both texels and pixels, because they could be implemented in a variety of ways depending
tion to only one dependent read and the remap locations' small size means they are typically inside the L1 cache of the GPU.
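The triangle ID merge described in Section 3.2 can be sketched as follows. This is a minimal illustration, not the Nitrous implementation: the names `merge_triangle_id_maps` and `NO_TRIANGLE`, and the representation of each map as a 2D list with a sentinel for uncovered samples, are assumptions made for the example.

```python
NO_TRIANGLE = -1  # hypothetical sentinel: no triangle covered this sample


def merge_triangle_id_maps(standard, conservative):
    """Merge two triangle ID maps of identical size.

    The standard-rasterization sample wins wherever it exists; the
    conservative sample only fills samples the standard pass left empty.
    """
    if len(standard) != len(conservative):
        raise ValueError("maps must have the same height")
    merged = []
    for std_row, con_row in zip(standard, conservative):
        if len(std_row) != len(con_row):
            raise ValueError("maps must have the same width")
        merged.append([
            std if std != NO_TRIANGLE else con
            for std, con in zip(std_row, con_row)
        ])
    return merged


# Toy 2x4 maps: triangle 7 is a sliver that standard coverage rules missed.
standard = [
    [0, 0, NO_TRIANGLE, NO_TRIANGLE],
    [1, 1, NO_TRIANGLE, 2],
]
conservative = [
    [0, 1, 7, NO_TRIANGLE],
    [1, 1, 7, 2],
]
merged = merge_triangle_id_maps(standard, conservative)
```

In the toy maps, the sliver triangle 7 appears only in the conservative pass and survives the merge, while samples covered by both passes keep the standard-rasterization ID. A sample that neither pass covers stays empty.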
rial. Because any material can read and write to any layer image as it desires, it is up to the individual material how it might reuse any intermediate value.

The retrieval of the previous frame's layer image samples works by making a function call in the material which uses the previous frame shadel’s allocation and storage. If a sample exists, it is returned; otherwise, the function returns false so that the material can decide how to proceed. This feature is exploited extensively in our terrain material, where we cache the expensive blended attributes that need to be generated.

preprocess can be run at lower resolution, often without noticeable issues. Problems can arise from certain types of charting on some high triangle count meshes, but can be mitigated by LOD choices and care in the art preprocess. In our current implementation, we chose to keep the resolutions the same, because we needed a screen resolution Z prepass for other rendering needs.

Our data indicates that the shadel mark preprocess step often takes less than 1ms on modern GPUs even at high resolutions; therefore we typically run the process at full resolution to guarantee watertightness, with lower resolutions reserved for lower performance systems.
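To make the watertightness trade-off concrete, here is a toy sketch of a mark pass run at full and at reduced resolution. It is an illustration under stated assumptions rather than the engine's implementation: the per-pixel buffer of shadel indices, the bitfield layout, and the names `mark_shadels`, `marked_indices`, and `stride` are all hypothetical, and the bitwise OR stands in for the interlocked atomic OR that a GPU version would presumably use.

```python
def mark_shadels(pixel_shadel_ids, num_shadels, stride=1):
    """Return a bitfield (as an int) with one bit set per visible shadel.

    pixel_shadel_ids: 2D list; each entry is the shadel index seen by that
    pixel, or None where no decoupled-shaded surface is visible.
    stride: 1 marks from every pixel; 2 marks from every other pixel in
    each axis, emulating a half-resolution mark pass.
    """
    marks = 0
    for y in range(0, len(pixel_shadel_ids), stride):
        row = pixel_shadel_ids[y]
        for x in range(0, len(row), stride):
            shadel = row[x]
            if shadel is None:
                continue
            if not 0 <= shadel < num_shadels:
                raise ValueError(f"shadel index {shadel} out of range")
            marks |= 1 << shadel  # a GPU version would use an atomic OR
    return marks


def marked_indices(marks):
    """List the shadel indices whose bit is set, in ascending order."""
    return [i for i in range(marks.bit_length()) if (marks >> i) & 1]


# 4x4 screen; shadel 5 is visible through a single pixel only.
screen = [
    [0, 0, 1, 1],
    [0, 5, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, None],
]
full = marked_indices(mark_shadels(screen, num_shadels=8))
half = marked_indices(mark_shadels(screen, num_shadels=8, stride=2))
```

At full resolution every visible shadel is marked. At half resolution shadel 5 is never marked, so the pixel that needs it would have nothing to sample, which is the kind of gap that running the pass at full resolution avoids.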
Table 2: A frame analysis of a typical frame in our Decoupled Shading Engine (DSE) from Ara: History Untold, which featured 8 million
triangles and 6,500 draw calls.
The major difficulty with the shading section is that the number of samples evaluated by a forward renderer and a decoupled renderer is not the same. A decoupled renderer will typically process more samples, though this is not always the case, since small triangles will overshade in a forward renderer because pixels must be evaluated in quads, whereas in a decoupled renderer such as ours, the shading granularity is higher and the renderer takes more samples on anisotropic edges.

However, we can use a couple of observations to estimate that they are likely approximately the same. We have also observed higher GPU occupancy when shading in decoupled shading than with forward rendering, due to better cache use and dispatching, with occupancy often being near perfect if GPU register usage isn’t too high. Therefore, while decoupled shading is likely processing more samples, it can evaluate samples somewhat faster. Additionally, we observe that 30-50% of frame time spent shading is typical for many games, based on a variety of frame captures we have done on our own prior titles and the types of captures we have seen from other games and engines. Therefore, 40% of our time spent in shading is roughly in line with our expectations.

We also ran timings on a similar scene on an AMD Radeon RX 6800 GPU. According to UserBench, this GPU has nearly identical performance to the Nvidia GeForce GTX 1080 Ti. However, we noted that while most of the timings were nearly identical to the 1080 Ti as expected, the shading time was less than half, at around 4ms, with the entire frame time around 18ms. On this GPU, shading would represent only around 25% at most of the frame time. We are still investigating the reason for the performance difference. It could be due to a newer, more efficient architecture for compute shading, or perhaps some missed optimizations in the Nvidia driver. On this GPU architecture, it seems unlikely that any forward rendering of our scene could be significantly faster than our current approach.

There are a few more considerations worth discussing that would further benefit decoupled shading. One is that due to intrinsic temporal and shader anti-aliasing, our renderer does not require temporal anti-aliasing. The second observation is that a detailed analysis revealed that there is likely a large performance gain from rendering foliage simultaneously with shading. This is because foliage is opaque and uses alpha to coverage, so it may render before anything which needs shading. Since the foliage uses little compute resources and mostly uses rasterization resources, on an architecture which allows asynchronous compute it should be possible to co-execute directly rendered opaque objects while shading is occurring. This could yield a performance increase of 2-4ms.

Overall, our analysis indicates that performance should be within 10% of other rendering architectures, and if the other mentioned factors are considered it may be able to outperform other rendering architectures in some cases.

4.2. Quality Feature Improvements

We have already noted that decoupled shading is competitive with other rendering architectures, but performance is only one factor to consider when building a rendering system.

Regardless of performance advantages, decoupled shading has intrinsic quality improvements. One of the biggest problems with shading techniques is shading aliasing and shading instability due to interpolation of non-linear data. A common example of this is shimmering results when a high frequency normal map is used with a low roughness factor. This results in high frequency highlights which shimmer frame to frame as the micro variations in the samples can cause widely different shading results.

There are many techniques designed to help mitigate this problem such as LEAN Mapping [OB10], but there exists no generalized solution to this problem for forward or deferred, nor can there be since real world materials are often too complex to fully solve these issues. Part of our design requirements was shading robustness, where any shader will have some degree of anti-aliasing regardless of the attention spent to anti-alias it.

While decoupled shading still aliases, the samples used per
frame are invariant. This is distinct even from architectures such as Reyes where triangles are generated relative to the current frame. The advantage of this is that while decoupled shading may be incorrect, it is incorrect in the same way each frame, thereby removing one class of rendering artifacts.

Another feature decoupled shading brings is the aforementioned concept of material layers. Simply put, shadels can have precise access to their complete neighborhood, as well as precise access to previous temporal samples. Additionally, similar to forward rendering, decoupled shading has no real limit to the number of different materials and material instances which can be used.

Finally, it is a trivial matter in decoupled shading to adjust the sampling rate for specific objects, areas of objects, materials, etc. Shading sampling and super sampling are controlled by a single number in the shader. Foveated rendering can be implemented with a handful of lines of shader code. Performance can also be easily adjusted by varying total shading samples across the scene independent of resolution.

5. Further Work

Decoupled shading can integrate with ray tracing hardware in a variety of ways. To ray trace inside a material, the scene is also updated and maintained in one or more bounding volume hierarchies (BVH), as is typical for real time ray tracing. At this point, any shadel can request a ray trace in the same manner as a pixel shader could, allowing full integration with ray tracing.

Ray tracing can be integrated more deeply into a decoupled shading system. If various surface properties are collected into different layers, traced rays can look up their values in the populated shadel remap and storage buffer, marking the shadels in the remap buffer so that they become available in future frames.

Additionally, decoupled shading allows for additional interesting modes of operation. Rather than tracing rays directly in the material, the ray origin and direction can be stored into one or more layers. This layer is dispatched to ray tracing hardware which populates another layer with the results of the ray trace shader.

By dispatching large clusters of rays at once, the decoupled shading engine can sort and group the rays for a much faster trace through the scene, avoiding costly shading during the hit shaders.

6. Conclusion

The described process produces scenes which render quickly and efficiently on modern hardware, despite the fact that the hardware tested was not designed for this type of rendering. This rendering architecture can replace many forward rendering architectures for a production title at scale, with approximately the same performance characteristics, and is competitive in overall performance with other rendering techniques. Our current title has over 5,000 assets, ranging from characters, to terrain, to buildings, compatible with this decoupled shading architecture.

In addition, this rendering architecture can co-exist with forward rendering, and is generally a superset. Many of our materials now rely on some of these additional features, especially materials with multi-pass layers, such as when normals are generated from composited heights and human skin.

The primary hardware features which make this feasible for efficient rendering are decent speed interlocked bitwise atomics, good L1 caches, and good write-combiners. On newer GPU architectures such as RDNA 2 the rendering is particularly efficient, with shading accounting for only 25% of a typical frame time.

The estimated cost over a forward renderer of similar complexity is around 10%, with the hope that further optimizations will make the performance nearly the same or better. The primary overhead for decoupled shading is twofold. First, we require an additional prepass on the geometry of the scene; second, there is some amount of fixed overhead in aggregating the results of this prepass to dynamically dispatch GPU work.

The Nitrous 2.0 rendering system provides the first comprehensive and complete decoupled rendering solution. Decoupled shading is practical and efficient on modern GPUs. Decoupled shading can be used with many rendering techniques including ray tracing, point rendering, triangle rasterization, tile rendering, MSAA, alpha blending, and many other GPU features. Additionally, decoupled shading has intrinsic quality and flexibility benefits that cannot be matched by forward or deferred rendering architectures [Bak16]. Finally, we believe with additional GPU modifications, decoupled shading could exceed forward and deferred rendering in almost all workloads.

7. Acknowledgements

Ara: History Untold is published by Xbox Game Studios.

References

[AA05] Akenine-Möller, Tomas and Aila, Timo. “Conservative and Tiled Rasterization Using a Modified Triangle Set-Up”. J. Graphics Tools 10 (Jan. 2005), 1–8. DOI: 10.1080/2151237X.2005.10129198.
[AHTA14] Andersson, M., Hasselgren, J., Toth, R., and Akenine-Möller, T. “Adaptive Texture Space Shading for Stochastic Rendering”. Comput. Graph. Forum 33.2 (May 2014), 341–350. ISSN: 0167-7055. DOI: 10.1111/cgf.12303.
[Bak16] Baker, Dan. “Object Space Lighting”. Mar. 2016. URL: https://www.gdcvault.com/play/1023511/Advanced-Graphics-Techniques-Tutorial-Day.
[BFM10] Burns, Christopher, Fatahalian, Kayvon, and Mark, William. “A Lazy Object-Space Shading Architecture With Decoupled Sampling”. Jan. 2010, 19–28.
[BH13] Burns, Christopher A. and Hunt, Warren A. “The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading”. Journal of Computer Graphics Techniques (JCGT) 2.2 (Aug. 2013), 55–69. ISSN: 2331-7418. URL: http://jcgt.org/published/0002/02/04/.
[CCC87] Cook, Robert L., Carpenter, Loren, and Catmull, Edwin. “The Reyes Image Rendering Architecture”. SIGGRAPH Comput. Graph. 21.4 (Aug. 1987), 95–102. ISSN: 0097-8930. DOI: 10.1145/37402.37414.
[CTH*14] Clarberg, Petrik, Toth, Robert, Hasselgren, Jon, et al. “AMFS: Adaptive Multi-Frequency Shading for Future Graphics Processors”. ACM Trans. Graph. 33.4 (July 2014). ISSN: 0730-0301. DOI: 10.1145/2601097.2601214.