The release of NVIDIA OptiX 9.0 introduces a new feature called cooperative vectors that enables AI workflows as part of ray tracing kernels. The feature leverages NVIDIA RTX Tensor Cores for hardware-accelerated matrix operations and neural net computations during shading. This unlocks AI rendering techniques such as NVIDIA RTX Neural Shaders and NVIDIA RTX Neural Texture Compression (NTC) and moves further toward movie-quality photoreal materials in real-time rendering.
Cooperative vector APIs are being introduced in OptiX, DirectX, NVAPI, Slang, and Vulkan. This post explores the concepts behind cooperative vectors that apply across all APIs, and works through an example using the OptiX API.
Why matrix operations?
A multilayer perceptron (MLP) is the basic building block of many neural network algorithms. Research has shown that MLPs are capable of faithfully reproducing the effects they're trained on. Even when they are small enough to run in real time, MLPs can handle interesting effects such as physically based shading, sometimes faster and with a smaller memory footprint than traditional shading networks.
The MLP typically consists of a vector of inputs, several fully connected layers, and an output vector. The sizes of the various layer vectors need not be the same.

Each layer of MLP evaluation (inference) has two phases: a weighted and biased linear combination of the previous layer’s values, and an optional nonlinear activation function. The weighted and biased linear combination boils down to a matrix-vector multiply followed by the addition of a bias vector, also known as an affine transform.
The composition of any two affine transforms is itself an affine transform. This means that if each layer performed only the affine phase, the entire MLP could always be reduced to a single affine transform. MLPs are much more expressive than that because of the nonlinear activation functions applied after each layer's affine result.
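In symbols, each layer computes the following, where W is the layer's weights matrix, b is its bias vector, and σ is the activation function:

output = σ(input × W + b)

Dropping σ, two stacked layers reduce to (x × W1 + b1) × W2 + b2 = x × (W1 × W2) + (b1 × W2 + b2), which is again a single affine transform of the input x.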
In a fully-connected MLP, each neuron is a function of all the neurons in the prior layer, as shown in Figure 1. The complete conceptual computation of a single layer is shown in Figure 2.

Why cooperative vectors?
One of the goals of cooperative vectors is to enable the use of NVIDIA Tensor Cores to accelerate matrix operations. Normally, the CUDA SIMT programming model requires a full warp of active threads to do this, but the ray tracing programming model treats threads independently and does not guarantee full warps. In addition, the Tensor Cores provide matrix-matrix multiplication, but each ray tracing thread only needs vector-matrix multiplication, which would under-utilize the Tensor Cores.
Also, the CUDA API requires targeting specific hardware versions, and is not guaranteed to be forward compatible from architecture to architecture. For an introduction to the CUDA multithreaded approach to matrix multiplication, check out the Matrix Multiplication Background User’s Guide.
Cooperative vectors address these limitations by providing an API that:
- Allows matrix operations with warps that have some inactive threads
- Provides cross-architecture forward and backward compatibility
- Enables users to specify vector data in a single thread while remapping the operation to utilize the Tensor Cores more efficiently
Cooperative vectors can handle data and execution divergence in a warp, albeit with some degradation in performance. The best performance will be obtained when the MLP weights across a warp are the same, and when you have a full complement of threads in a warp. Using Shader Execution Reordering (SER) can help achieve both of those goals.
Since evaluating an MLP is a series of vector-matrix multiplies, when all the threads in a warp evaluate the same MLP side by side, the cooperative vector API can treat the combined warp’s affine operation as a matrix-matrix multiply plus a bias. This is what cooperative means: threads band together to turn several vector-matrix operations into matrix-matrix operations.
outputMatrix = inputMatrix × weightsMatrix + biasMatrix

Here, all matrices except the weights matrix are 32 rows tall, and each row of the input, output, and bias matrices represents the data for a separate thread.
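For concreteness, if each thread consumes a K-element input vector and produces an N-element output, the warp-combined operation multiplies a 32×K input matrix by a K×N weights matrix and adds a 32×N bias matrix, yielding a 32×N output matrix.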
Using cooperative vectors in OptiX
A cooperative vector is an opaque vector type, essentially an array class, that can have arbitrary length. OptiX provides an implementation of cooperative vectors called OptixCoopVec. Cooperative vectors in OptiX support a specific and limited set of operations intended to help accelerate the evaluation of MLPs and small neural networks.
In the cooperative vector API, the vector-matrix multiply with bias is done with the function optixCoopVecMatMul, which performs the affine portion of layer evaluation. Since using different activation functions at different stages is often desirable, activation is applied separately after the vector-matrix multiply, and can be constructed from the set of vector functions supplied by the cooperative vector API.
outputVector = inputVector × matrix + bias

Cooperative vectors are supported in OptiX on all RTX devices and select server-class GPUs. You can query device support using optixDeviceContextGetProperty with OPTIX_DEVICE_PROPERTY_COOP_VEC. Using cooperative vectors on unsupported devices will generate errors, as no fallback support is available.
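For example, a host-side check might look like the following minimal sketch. Here we assume the property fills an unsigned int that is nonzero when cooperative vectors are supported; see the OptiX Programming Guide for the exact semantics. OPTIX_CHECK is the error-checking macro used by the SDK samples.

// Query cooperative vector support before building a neural pipeline.
unsigned int coopVecSupport = 0;  // assumed: nonzero means supported
OPTIX_CHECK( optixDeviceContextGetProperty(
    context, OPTIX_DEVICE_PROPERTY_COOP_VEC,
    &coopVecSupport, sizeof( coopVecSupport ) ) );
if( coopVecSupport == 0 )
{
    // No fallback is available; take a non-neural code path on this device.
}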
Example implementation
This section explores the cooperative vectors API for doing inference, or MLP layer evaluation, in an OptiX shader. The example we'll work through is adapted from the optixNeuralTexture sample in the OptiX SDK. This sample uses the NTC SDK, which handles training (compressing) textures, storing weights and biases in a specified file format, and demonstrating inference (decompression) in various situations with different shading languages. You can use the NTC SDK to compress your own textures, and then view them using optixNeuralTexture.
Code that uses cooperative vectors makes heavy use of C++ templating. Templates help the compiler produce high performance code by providing static array sizes and static data types that are known at compile time. Define commonly used types in advance to make the code easier to read.
using T_OUT = OptixCoopVec<float, 16 /*output channels*/>;
...
T_OUT texel = inferTexel<T_OUT>( latents, weights, x, y, ... );
This way, you can use the shortcut type T_OUT, for example, in place of a templated OptixCoopVec<> type. The function evalMLP evaluates a complete MLP for a given pixel on screen. In pseudo-code terms, it sets up the inputs to the MLP, then evaluates each layer of the MLP, and finally returns the last layer's output:
template <class T_OUT>
__device__ bool evalMLP( T_OUT& outLayer, uint8_t* latents, uint8_t* mlpWeights, int x, int y )
{
    using T_IN  = OptixCoopVec<half, 48 /* input vec size */>;
    using T_HID = OptixCoopVec<half, 64 /* hidden layer size */>;

    T_IN networkInputs = prepareNetworkInputs_FP16<T_IN>( x, y, latents );

    // weightOffset1..3 and scaleBiasOffset locate each layer's data in the
    // packed NTC weights block (setup elided).
    T_HID hiddenOut1, hiddenOut2, hiddenOut3;
    evalLayer<T_IN, T_HID>( networkInputs, mlpWeights, 0, scaleBiasOffset, hiddenOut1 );
    evalLayer<T_HID, T_HID>( hiddenOut1, mlpWeights, weightOffset1, scaleBiasOffset, hiddenOut2 );
    evalLayer<T_HID, T_HID>( hiddenOut2, mlpWeights, weightOffset2, scaleBiasOffset, hiddenOut3 );
    evalLayer<T_HID, T_OUT>( hiddenOut3, mlpWeights, weightOffset3, scaleBiasOffset, outLayer );
    return true;
}
Notice how the output of each layer evaluation becomes the input for the next layer evaluation. Take a closer look at the layer evaluation:
template <class T_IN, class T_OUT>
__device__ void evalLayer(
    T_IN&     inputArray,
    uint8_t*  weights,
    uint32_t  weightsOffsetInBytes,
    uint32_t& biasOffsetInBytes,
    T_OUT&    outputArray )
{
    outputArray = optixCoopVecMatMul<
        T_OUT,
        T_IN,
        OPTIX_COOP_VEC_ELEM_TYPE_FLOAT8_E4M3, // inputInterpretation
        MAT_LAYOUT,                           // matrixLayout
        false,                                // transpose
        T_OUT::size,                          // N
        T_IN::size,                           // K
        OPTIX_COOP_VEC_ELEM_TYPE_FLOAT8_E4M3, // matrixElementType
        OPTIX_COOP_VEC_ELEM_TYPE_FLOAT16      // biasElementType
        >(
        inputArray,           // inputVector
        weights,              // matrix base ptr
        weightsOffsetInBytes, // matrix offset
        weights,              // bias base ptr, same as weights
        biasOffsetInBytes     // bias offset
        );

    // Increment the bias offset to the next layer's data.
    biasOffsetInBytes += T_OUT::size * sizeof( typename T_OUT::value_type );
    outputArray = activate<T_OUT>( outputArray );
}
The single layer evaluation is little more than a wrapper around optixCoopVecMatMul. After the matrix multiply, increment the offset to the bias vector to get it ready for the next layer (notice that the bias offset is passed by reference in this function). Then call the activation function on the layer.
Something you may have noticed in these code examples is that we're passing the same weights base pointer to multiple calls to evalLayer, and also using this base pointer for both weights and biases. Find the correct data at each step by adding constant offset values to the base pointer, in this case either weightsOffsetInBytes or biasOffsetInBytes.
There are two reasons the code is written this way. The first reason is that when reading files in NTC format, the API gives back a block of memory where the weights matrices and bias vectors are all tightly packed, and you can use simple arithmetic to iterate through the data for each layer. The second reason is that the cooperative vectors API takes advantage of successive calls to optixCoopVecMatMul when the same weights base pointers are used repeatedly. The compiler will notice reused base pointers (even when they are used with different constant offsets), and it will optimize your program to prevent unnecessary shuffling and unshuffling operations in between layers.
Finally, take a look at the activation function:
template <class T_IN>
__device__ T_IN activate( const T_IN& x, bool scaleActivation = true )
{
    // tmp = clamp( x/3 + 0.5, 0, 1 )
    T_IN tmp = optixCoopVecFFMA( x, T_IN( 1.0f / 3.0f ), T_IN( 0.5f ) );
    tmp      = optixCoopVecMin( optixCoopVecMax( tmp, 0.0f ), 1.0f );

    // result = min( x, 3 ) * tmp
    T_IN result = optixCoopVecMin( x, 3.0f );
    result      = optixCoopVecMul( result, tmp );

    if( scaleActivation )
        result = optixCoopVecFFMA( result, T_IN( invStep ), T_IN( bias ) ); // NTC scale/bias constants, defined elsewhere

    return result;
}
Because an MLP activation function applies a nonlinear mapping to each element of the layer’s output vector, the cooperative vector functions called in the code above are all vector operations rather than matrix operations. The activation is generally a much smaller and less expensive operation than applying the layer weights matrix. There is a limited set of built-in vector functions available with cooperative vectors that one would normally find in MLP activations, such as tanh, log2, exp2, min, max, ffma, and so on.
Some of the cooperative vector functions have variants that take scalar parameters, but not all of them. In some cases, you will need to create a vector filled with a constant value from a scalar. Here this was done using the OptixCoopVec constructor, for example with the T_IN(0.5f) parameter to the first optixCoopVecFFMA call. This particular activation function comes from the NTC SDK. In general, when designing your own network, activation could be as simple as a call to optixCoopVecMax to simulate the well-known ReLU activation.
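For instance, a minimal ReLU sketch might look like the following, where relu is a hypothetical helper name built on the same scalar variant of optixCoopVecMax used above:

template <class T_VEC>
__device__ T_VEC relu( const T_VEC& x )
{
    return optixCoopVecMax( x, 0.0f ); // element-wise max against zero
}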
Neural graphics
Cooperative vectors are used to implement RTX Neural Shaders and RTX Neural Texture Compression. These are available as part of NVIDIA RTX Kit, a suite of open-source repositories that make it easier to use and integrate these technologies. For a quick start to RTX Kit, including links and resources for each repository, see Get Started with Neural Rendering Using NVIDIA RTX Kit.

The dragon pictured in Figure 5 required texture compression in order to fit in the 16 GB of video memory on a GeForce RTX 5080 GPU. The dragon alone has more than 100 8K UDIM textures with five layers each. If the textures were decompressed from files into memory, they would consume more than 32 GB of VRAM, more than twice the available memory of a 5080.
With NTC, the memory footprint of the dragon's textures becomes much more reasonable at less than 3 GB, about half the size of the equivalent BC-compressed textures. This leaves plenty of room for the rest of the textures in the scene, as well as the geometry, animation, BVH, and shaders, allowing this massive production-scale scene to render in real time on a single 5080 GPU.
Neural shaders are demonstrated in the RTX Neural Shading SDK, which provides examples to help you learn how to train your own neural shading networks and then use them to perform inference as part of normal graphics rendering. Cooperative vectors could be a way to implement Real-Time Neural Appearance Models.
Performance considerations
Consider the following for best performance:
- Shuffling and unshuffling: To make use of the Tensor Cores, data is shuffled across the warp before the call to optixCoopVecMatMul and unshuffled afterwards. If you use only supported vector operations between two such calls, the compiler can remove the unshuffling and reshuffling in between, improving performance.
- Full warps: Performance is best when using full warps. Use SER to coalesce threads, and avoid calling optixCoopVecMatMul inside dynamic conditionals (see the sketch after this list).
- Memory layout: The layout of weight matrices significantly impacts performance. OptiX supports optimal layouts for inference and training, which should be used for best performance. Use optixCoopVecMatrixConvert to convert matrices into optimal layouts.
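As a rough sketch of the full-warps advice, SER can separate traversal from shading so threads are reordered before any neural evaluation runs. The handle, ray values, and payload registers below are placeholders; this is an illustration rather than a drop-in snippet:

// Coalesce threads with SER before neural shading.
unsigned int p0 = 0, p1 = 0;  // example payload registers
optixTraverse( handle, rayOrigin, rayDirection,
               0.0f /*tmin*/, 1e16f /*tmax*/, 0.0f /*rayTime*/,
               OptixVisibilityMask( 255 ), OPTIX_RAY_FLAG_NONE,
               0 /*SBT offset*/, 1 /*SBT stride*/, 0 /*miss index*/,
               p0, p1 );
optixReorder();        // group threads by hit state so warps evaluate the same MLP together
optixInvoke( p0, p1 ); // closest-hit shading, and optixCoopVecMatMul, now run on more coherent warps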
Training with OptiX cooperative vectors
OptiX supports training with cooperative vectors, including forward and backward propagation. For more information, see the OptiX Programming Guide. In particular, the two device-side intrinsics optixCoopVecReduceSumAccumulate and optixCoopVecOuterProductAccumulate help accumulate loss values over bias vectors and weight matrices, respectively.
Get started
Cooperative vectors are an NVIDIA OptiX data type and API for doing high-performance vector and matrix operations as part of OptiX shader programs. These vector and matrix operations are at the heart of common machine learning algorithms, such as multilayer perceptrons (MLPs). Cooperative vectors facilitate the use of Tensor Cores on NVIDIA RTX GPUs, which previously required explicit coordination between threads in a warp. With cooperative vectors, developers no longer need synchronous multithreading techniques; they can perform the efficient matrix-vector multiplication operations essential to these neural network algorithms using an easier, single-threaded programming style.
Cooperative vectors are available starting with NVIDIA OptiX SDK 9.0. Cooperative vector APIs are also being introduced into DirectX (through the Agility SDK preview at the end of April), Vulkan, and Slang, so you can use them anywhere hardware-accelerated ray tracing is supported. Documentation for the OptiX cooperative vectors API is available in the OptiX Programming Guide, both online and in PDF form distributed with the SDK and examples.
The OptiX SDK includes an example of inference for RTX Neural Texture Compression, called optixNeuralTexture, which uses cooperative vectors to decompress neural compressed textures on the fly during shading. This enables a 20x memory savings compared to the popular BC5 or BC6 compressed texture formats, or an 80x texture footprint savings compared to the uncompressed textures used in the optixMeshViewer sample.
It will be exciting to see new and interesting use cases for cooperative vectors emerge over time. Join the conversation in the OptiX NVIDIA Developer Forum to learn more and post about your experiences.