One of the great pastimes of graphics developers and enthusiasts is comparing specifications of GPUs and marveling at the ever-increasing counts of shader cores, RT cores, teraflops, and overall computational power with each new generation. Achieving the maximum theoretical performance represented by those numbers is a major focus in the world of graphics programming. Massive amounts of rendering data, such as triangles, pixels, and rays, flow through the immensely parallel GPU computation pipeline as if on a group of assembly lines in a manufacturing plant. Maximum throughput requires the factory to be humming, with no interrupted work or idle equipment.
This post covers several of the new features in Nsight Graphics 2024.3 to help you understand and manage these virtual assembly lines, and create optimally parallelized workloads for games and graphics applications.
Active Threads per Warp histogram
A warp is a group of 32 threads that forms the fundamental unit of execution for programmable shaders. Ray tracing, compute, vertex, pixel, and other types of shaders written in HLSL or GLSL are compiled down to machine instructions and eventually run on the hardware in warp-sized groups. The threads in the warp run in parallel, and hundreds of warps themselves run in parallel. When programmable shading is the limiting factor of a workload, having warps run efficiently is vital to reaching peak performance.
Warps run on a hardware unit called the Streaming Multiprocessor (SM), where they execute using a computational model called Single Instruction, Multiple Threads (SIMT). Each warp issues one instruction at a time across all of its threads, with each thread having its own operands. For example, if a line of shader code adds two numbers, then within a warp the addition instruction begins running simultaneously in all 32 threads, producing 32 unique sums from 64 inputs.
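The lockstep behavior described above can be sketched in plain Python. This is a conceptual simulation of SIMT, not GPU code: a single add instruction is issued once for the warp, and every lane applies it to its own pair of operands.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def issue_add(a_operands, b_operands):
    """Simulate one SIMT 'add' instruction: issued once for the
    whole warp, executed in lockstep by every lane using that
    lane's own per-thread operands."""
    assert len(a_operands) == len(b_operands) == WARP_SIZE
    return [a + b for a, b in zip(a_operands, b_operands)]

# 64 inputs in, 32 unique sums out, from a single issued instruction.
a = list(range(32))       # lane i holds operand a = i
b = [10] * 32             # lane i holds operand b = 10
sums = issue_add(a, b)
print(sums[0], sums[31])  # 10 41
```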
One way that warps can run suboptimally is thread divergence, caused by if-statements and other control flow in shader code. The compiler might avoid branching entirely by executing every block under an if-statement and discarding the unused results, or by unrolling loops. But when a true dynamic branch is compiled down and encountered at execution time, and the conditional expression is not uniform across all threads in a warp, the warp must execute both the if-body and the else-body, one at a time.
Again, only one instruction at a time can be issued in the SIMT model, so the SM must mask off the threads not applicable to the active side of the branch. For a more complete description of warp instruction scheduling in modern NVIDIA GPUs, see NVIDIA Tesla V100 GPU Architecture (page 26).
Figure 1 shows a simplified visualization of thread divergence in a warp with a hypothetical size of eight threads. Capital letters represent statements in the program pseudocode. Throughput is reduced due to the idle lanes of thread execution at the time each instruction is issued.
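The cost of that serialization can be modeled with a small Python sketch, using the same hypothetical eight-thread warp as Figure 1. The block lengths and predicates here are made up for illustration; the point is that a divergent warp pays for both sides of the branch.

```python
WARP_SIZE = 8  # hypothetical small warp, as in Figure 1

def issue_slots(predicate, if_body_len, else_body_len, common_len):
    """Count instruction issues for one warp executing
    'if (p) { if_body } else { else_body }' plus common code.
    A diverged warp must issue both sides serially, with lanes on
    the inactive side masked off (idle)."""
    active = sum(predicate)        # lanes taking the if-side
    issues = common_len            # code outside the branch
    if 0 < active < WARP_SIZE:     # divergent: both sides issued
        issues += if_body_len + else_body_len
    elif active == WARP_SIZE:      # uniform true: if-side only
        issues += if_body_len
    else:                          # uniform false: else-side only
        issues += else_body_len
    return issues

uniform = issue_slots([True] * 8, if_body_len=4, else_body_len=4, common_len=2)
divergent = issue_slots([True] * 4 + [False] * 4, 4, 4, 2)
print(uniform, divergent)  # 6 10
```

The divergent warp issues 10 instructions to do the same useful work the uniform warp finishes in 6, which is exactly the throughput loss the figure depicts.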
Note that this is different from the branchless execution of both blocks mentioned earlier: with a true branch, the inactive threads produce no side effects. More importantly, a dynamic branch can be a win when the conditional expression has a favorable distribution at the warp level. The more warps whose threads agree on the condition, the less divergence there is, and the smaller the impact on throughput. In vertex and pixel shaders, warps are grouped based on the locality of vertices and pixels, respectively. In ray tracing and compute shaders, the user has more explicit control over how the work is grouped.
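The effect of work grouping on divergence can be seen in a short sketch. The scenario below is hypothetical: 1,024 work items where exactly half take a branch, packed into 32-thread warps either as they arrive or sorted by the branch condition.

```python
WARP_SIZE = 32

def divergent_warps(predicates):
    """Count warps whose threads disagree on a branch condition."""
    count = 0
    for i in range(0, len(predicates), WARP_SIZE):
        warp = predicates[i:i + WARP_SIZE]
        if 0 < sum(warp) < len(warp):  # mixed true/false lanes
            count += 1
    return count

# Half the items take the branch: interleaved vs grouped by condition.
interleaved = [i % 2 == 0 for i in range(1024)]
grouped = sorted(interleaved)  # same work, reordered into warps
print(divergent_warps(interleaved), divergent_warps(grouped))  # 32 0
```

The same workload goes from every warp diverging to none, which is the kind of reordering that techniques like SER perform in hardware for ray tracing shaders.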
Other factors in the performance impact of thread divergence include the overall percentage of shader instructions under any branch, and the specific machine instructions and their operand types. The only way to know the true impact is to track warp thread efficiency while measuring overall performance and correlate changes in one with the other.
This is where the new Active Threads per Warp histogram comes in. This compact graphic is now available throughout the Shader Profiler views within the GPU Trace tool in Nsight Graphics, including Shader Pipelines, Top-Down, Bottom-Up, Hot Spots, and Source/Disassembly. It illustrates the aggregate impact of thread divergence for any given shader, function, or individual line of source code.
As shown in Figure 2, values on the right of the histogram (closer to 32) indicate more efficient instruction execution. The values shown are approximated from the sampling of performance counters at the time of the execution of each code block. A popup tooltip shows the histogram in greater detail. When launching GPU Trace, the Timeline Metrics setting must be set to either Top-Level Triage or Ray Tracing Triage (if available), and the Real-Time Shader Profiler enabled.
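Purely as an illustration of what the graphic aggregates (this is not how Nsight Graphics computes it internally), a bucketed histogram of sampled per-warp active-thread counts might be built like this:

```python
from collections import Counter

def active_threads_histogram(samples, bucket_size=4):
    """Bucket sampled active-thread counts (1..32) into a compact
    histogram. 'samples' stand in for the per-warp active lane
    counts captured when performance counters were sampled."""
    hist = Counter((s - 1) // bucket_size for s in samples)
    return [hist.get(b, 0) for b in range(32 // bucket_size)]

# Mostly coherent warps with a tail of divergent ones:
samples = [32] * 90 + [17] * 6 + [3] * 4
print(active_threads_histogram(samples))  # [4, 0, 0, 0, 6, 0, 0, 90]
```

Mass concentrated in the rightmost bucket (near 32 active threads) indicates efficient execution; mass on the left flags divergence worth investigating.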
If a function is a performance bottleneck and has poor Active Threads per Warp, you should consider strategies to improve warp coherence, or to reduce branching. For ray tracing workloads, look at Shader Execution Reordering (SER), which was designed specifically to address thread and data divergence issues in ray tracing shaders. Other algorithmic changes may improve thread execution coherence; for example, using a different ray sampling pattern. For any type of shader, it may also be possible to improve efficiency by converting branches into warp-aware shader code in D3D12 or Vulkan.
The spread of the histogram reveals whether the behavior of the code block was consistent. Seeing Active Threads per Warp at or near 100% can also confirm that thread divergence is not a limiting factor when it was originally suspected to be.
Figure 3 illustrates how advanced lighting techniques such as path tracing cause shader divergence as secondary rays bounce off objects in the scene. SER improves execution coherence because rays using the same hit shader make better use of SIMT at the warp level. When SER is working, you should see Active Threads per Warp improve.
At a higher level, improving overall shading time involves understanding and reducing warp stalls. Warps stall when they reach long latency operations, especially things like memory accesses and texture fetches. When one warp stalls, another warp can be scheduled for the next instruction. However, this can only buy so much parallelism. If too many warps are stalled, then the SM sits underutilized or even idle. The length of a stall can depend on many factors such as which level of cache is hit, if any, for a memory lookup.
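A toy model makes the limits of this latency hiding concrete. The numbers below are invented for illustration: each warp issues for one cycle, then stalls for a fixed number of cycles, and the scheduler fills stalls with other resident warps' issue cycles.

```python
def issue_utilization(resident_warps, stall_cycles, issue_cycles=1):
    """Toy latency-hiding model: each warp issues for 'issue_cycles',
    then stalls for 'stall_cycles' (e.g. waiting on a memory load).
    The SM stays fully busy only once enough warps are resident to
    cover one warp's stall with the others' issue cycles."""
    per_warp = issue_cycles / (issue_cycles + stall_cycles)
    return min(1.0, resident_warps * per_warp)

# A hypothetical 40-cycle stall needs ~41 resident warps to hide fully;
# with only 8 warps, the SM issues less than 20% of the time.
print(issue_utilization(8, 40))   # ~0.195
print(issue_utilization(48, 40))  # 1.0
```

Longer stalls, such as a miss all the way to DRAM instead of a cache hit, raise the warp count needed to keep the SM busy, which is why the source of a stall matters as much as its frequency.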
GPU Trace has always provided tools for analyzing these stalls, but a new arrow in the quiver is the Warp Latency histogram, which expands what was previously a single Average Warp Latency cycle count into a full distribution. Seeing the distribution of warp latency provides greater insight into the variability of shader timings, hinting at whether early exits were taken and whether different arguments across shader invocations resulted in different behaviors. Note that the histogram currently only contains separate latency data points for disjoint regions in the timeline calling into the same shader.
For more detailed information about optimizing GPU workloads, check out these resources:
- The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload describes a performance triage method used internally at NVIDIA to determine the main performance limiters of any given GPU workload using NVIDIA-specific hardware metrics.
- Powerful Shader Insights: Using Shader Debug Info with NVIDIA Nsight Graphics explains the steps to take while compiling your shaders, to see flame graphs and source correlation in GPU Trace.
- The CUDA Programming Guide and CUDA Refresher: The CUDA Programming Model explain how compute and SIMT architecture apply to compute shaders in graphics APIs.
- Introduction to NVIDIA Nsight Compute – A CUDA Kernel Profiler discusses warp scheduling in a way that applies to graphics workloads.
D3D12 Work Graphs
Intercommunication between the CPU and GPU is another common bottleneck in graphics pipelines. Even when bulk data is left resident on the GPU, the act of issuing rendering instructions from the CPU can create bubbles where the GPU sits idle. Work Graphs are a new feature in D3D12 that aims to decrease the dependency on the CPU to schedule GPU work. GPU-driven scheduling has been around for a while, but Work Graphs introduce more advanced capabilities than existing methods such as ExecuteIndirect. For an overview of Work Graphs, see Advancing GPU-Driven Rendering with Work Graphs in Direct3D 12.
Initial support for profiling Work Graph nodes as a whole in the Shader Profiler was introduced in Nsight Graphics 2024.2. In 2024.3, the Shader Profiler now supports source correlation for Work Graphs, enabling the full functionality of line-by-line analysis in the Shader Source view and Hot Spots list. As Work Graphs are a new feature in D3D12, this capability should help developers explore and better understand Work Graph performance characteristics. Note that source correlation requires the newest R565 series driver.
Likewise, the Nsight Aftermath SDK 2024.3 adds support for tracking shaders used for Work Graphs and providing contextual information to aid in narrowing down related GPU faults originating in Work Graph workloads.
Vulkan updates
The recently released Vulkan 1.4 standard promotes over a dozen previously optional extensions into the required extension set and introduces increased minimum hardware limits. For more information, see Khronos Streamlines Development and Deployment of GPU-Accelerated Applications with Vulkan 1.4. Nsight Graphics 2024.3 is shipping with Vulkan 1.4 support in the Frame Debugger. For beta drivers supporting 1.4, visit Vulkan Driver Support.
Even if you’re not using Vulkan 1.4 directly, all of the newly promoted extensions are now supported in Nsight Graphics. Support for many other extensions has been added as well, including VK_NV_inherited_viewport_scissor and VK_NV_device_generated_commands_compute. For the complete list, see the NVIDIA Nsight Graphics User Guide.
This release also adds support for Frame Debugging and GPU Trace of applications that use Vulkan SC on the Windows and Linux desktop platforms. For more information about Vulkan SC and driver support, visit Vulkan Driver Support.
Conclusion
You’ll want to have a foundational understanding of the GPU computing model before you start optimizing for parallelism. Yet, how the theory translates to practice can be hard to predict due to varying data patterns, compiler and hardware optimizations, and many other second-order influences. Have a strategy based on hypothesis testing and records of measurements in the tools. Understand what the metrics mean and then track how they change as you make adjustments. While it’s not practical to achieve 100% utilization of every hardware unit on the GPU simultaneously, incremental improvements can help you reach the performance requirements of your application.
Nsight Graphics 2024.3 is now available. Tell us about your experience with these new features using the Feedback button located at the top right of the Nsight Graphics window.
Learn more about Nsight Developer Tools and explore tutorials for Nsight Tools. Ask questions, provide feedback, and engage with the graphics developer community on the Nsight Graphics Developer forums.
Acknowledgments
For their contributions to this post, I’d like to thank Avinash Baliga, Jeff Kiel, Axel Mamode, Aurelio Reis, and Louis Bavoil.