cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication#

NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix:

D = \mathrm{Activation}(\alpha \cdot op(A) \cdot op(B) + \beta \cdot op(C) + \mathrm{bias})

where op(A) and op(B) denote in-place operations such as transpose or non-transpose, and \alpha, \beta are scalars or vectors.

The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.
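
A typical call sequence is: initialize the library, describe the matrices, build a matmul plan, prune and compress the sparse operand, then execute the multiplication. The sketch below shows this flow for FP16 inputs with FP32 compute (one row of the mixed-precision table in the next section). It is a condensed, hedged adaptation of the public cuSPARSELt samples for recent (0.5+) releases: error checking is omitted, the device buffers dA, dB, dC, dD are assumed to be allocated and populated by the caller, and exact signatures can differ between library versions.

```cpp
#include <cusparseLt.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>

// Sketch: D = alpha * A * B + beta * C with A 2:4 structured-sparse,
// FP16 values, FP32 compute, all matrices row-major.
void spmma_fp16(__half* dA, __half* dB, __half* dC, __half* dD,
                int64_t m, int64_t n, int64_t k, float alpha, float beta) {
    cusparseLtHandle_t             handle;
    cusparseLtMatDescriptor_t      matA, matB, matC;
    cusparseLtMatmulDescriptor_t   matmul;
    cusparseLtMatmulAlgSelection_t alg_sel;
    cusparseLtMatmulPlan_t         plan;
    cudaStream_t                   stream    = nullptr;
    unsigned                       alignment = 16;

    cusparseLtInit(&handle);
    // A (m x k) is the structured-sparse operand; B and C/D are dense.
    cusparseLtStructuredDescriptorInit(&handle, &matA, m, k, /*ld=*/k,
                                       alignment, CUDA_R_16F, CUSPARSE_ORDER_ROW,
                                       CUSPARSELT_SPARSITY_50_PERCENT);
    cusparseLtDenseDescriptorInit(&handle, &matB, k, n, /*ld=*/n, alignment,
                                  CUDA_R_16F, CUSPARSE_ORDER_ROW);
    cusparseLtDenseDescriptorInit(&handle, &matC, m, n, /*ld=*/n, alignment,
                                  CUDA_R_16F, CUSPARSE_ORDER_ROW);
    cusparseLtMatmulDescriptorInit(&handle, &matmul,
                                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                                   &matA, &matB, &matC, &matC,
                                   CUSPARSE_COMPUTE_32F);
    cusparseLtMatmulAlgSelectionInit(&handle, &alg_sel, &matmul,
                                     CUSPARSELT_MATMUL_ALG_DEFAULT);
    cusparseLtMatmulPlanInit(&handle, &plan, &matmul, &alg_sel);

    // Prune A in place to the 2:4 pattern, then compress it.
    cusparseLtSpMMAPrune(&handle, &matmul, dA, dA,
                         CUSPARSELT_PRUNE_SPMMA_TILE, stream);
    size_t compressed_size, compress_buffer_size;
    cusparseLtSpMMACompressedSize(&handle, &plan, &compressed_size,
                                  &compress_buffer_size);
    void *dA_compressed, *dA_compress_buffer;
    cudaMalloc(&dA_compressed, compressed_size);
    cudaMalloc(&dA_compress_buffer, compress_buffer_size);
    cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed,
                            dA_compress_buffer, stream);

    // Execute D = alpha * op(A) * op(B) + beta * op(C).
    size_t workspace_size;
    cusparseLtMatmulGetWorkspace(&handle, &plan, &workspace_size);
    void* d_workspace;
    cudaMalloc(&d_workspace, workspace_size);
    cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB, &beta, dC, dD,
                     d_workspace, nullptr, 0);

    cusparseLtMatDescriptorDestroy(&matA);
    cusparseLtMatDescriptorDestroy(&matB);
    cusparseLtMatDescriptorDestroy(&matC);
    cusparseLtMatmulPlanDestroy(&plan);
    cusparseLtDestroy(&handle);
    cudaFree(dA_compressed);
    cudaFree(dA_compress_buffer);
    cudaFree(d_workspace);
}
```

Note that cusparseLtMatmul() consumes the compressed copy of A rather than the original dense buffer, and that the handle, descriptors, and plan can be reused across repeated multiplications.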

Download: developer.nvidia.com/cusparselt/downloads

Provide Feedback: Math-Libs-Feedback@nvidia.com

Examples: cuSPARSELt Example 1, cuSPARSELt Example 2

Key Features#

  • NVIDIA Sparse MMA tensor core support

  • Mixed-precision computation support:

    | Input A/B | Input C | Output D | Compute | Block scaled                                 | Supported SM archs             |
    |-----------|---------|----------|---------|----------------------------------------------|--------------------------------|
    | FP32      | FP32    | FP32     | FP32    | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | BF16      | BF16    | BF16     | FP32    | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | FP16      | FP16    | FP16     | FP32    | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | FP16      | FP16    | FP16     | FP16    | No                                           | 9.0                            |
    | INT8      | INT8    | INT8     | INT32   | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | INT8      | INT32   | INT32    | INT32   | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | INT8      | FP16    | FP16     | INT32   | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | INT8      | BF16    | BF16     | INT32   | No                                           | 8.0, 8.6, 8.7, 9.0, 10.0, 12.0 |
    | E4M3      | FP16    | E4M3     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E4M3      | BF16    | E4M3     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E4M3      | FP16    | FP16     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E4M3      | BF16    | BF16     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E4M3      | FP32    | FP32     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E5M2      | FP16    | E5M2     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E5M2      | BF16    | E5M2     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E5M2      | FP16    | FP16     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E5M2      | BF16    | BF16     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E5M2      | FP32    | FP32     | FP32    | No                                           | 9.0, 10.0, 12.0                |
    | E4M3      | FP16    | E4M3     | FP32    | A/B/D_OUT_SCALE = VEC64_UE8M0, D_SCALE = 32F | 10.0, 12.0                     |
    | E4M3      | BF16    | E4M3     | FP32    | A/B/D_OUT_SCALE = VEC64_UE8M0, D_SCALE = 32F | 10.0, 12.0                     |
    | E4M3      | FP16    | FP16     | FP32    | A/B_SCALE = VEC64_UE8M0                      | 10.0, 12.0                     |
    | E4M3      | BF16    | BF16     | FP32    | A/B_SCALE = VEC64_UE8M0                      | 10.0, 12.0                     |
    | E4M3      | FP32    | FP32     | FP32    | A/B_SCALE = VEC64_UE8M0                      | 10.0, 12.0                     |
    | E2M1      | FP16    | E2M1     | FP32    | A/B/D_OUT_SCALE = VEC32_UE4M3, D_SCALE = 32F | 10.0, 12.0                     |
    | E2M1      | BF16    | E2M1     | FP32    | A/B/D_OUT_SCALE = VEC32_UE4M3, D_SCALE = 32F | 10.0, 12.0                     |
    | E2M1      | FP16    | FP16     | FP32    | A/B_SCALE = VEC32_UE4M3                      | 10.0, 12.0                     |
    | E2M1      | BF16    | BF16     | FP32    | A/B_SCALE = VEC32_UE4M3                      | 10.0, 12.0                     |
    | E2M1      | FP32    | FP32     | FP32    | A/B_SCALE = VEC32_UE4M3                      | 10.0, 12.0                     |

  • Matrix pruning and compression functionalities

  • Activation functions, bias vector, and output scaling (see the sketch after this list)

  • Batched computation (multiple matrices in a single run)

  • GEMM Split-K mode

  • Auto-tuning functionality (see cusparseLtMatmulSearch() and the sketch after this list)

  • NVTX ranging and logging functionalities
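
As referenced above, the epilogue and auto-tuning features hook into the same objects as the basic workflow. Below is a hedged fragment reusing the variables from the earlier sketch (handle, matmul, plan, m, alpha, beta, dA_compressed, dB, dC, dD, d_workspace); the attribute tokens are from the cuSPARSELt headers, and matmul-descriptor attributes are typically set before the plan is created.

```cpp
// Fragment extending the workflow sketch above (same variables assumed).
// (1) Between cusparseLtMatmulDescriptorInit() and cusparseLtMatmulPlanInit():
//     request a ReLU epilogue and attach a bias vector (one value per row of D).
int relu_on = 1;
cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                                 CUSPARSELT_MATMUL_ACTIVATION_RELU,
                                 &relu_on, sizeof(relu_on));
void* d_bias;                       // device vector of length m, same type as D
cudaMalloc(&d_bias, m * sizeof(__half));
cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                                 CUSPARSELT_MATMUL_BIAS_POINTER,
                                 &d_bias, sizeof(d_bias));

// (2) After compression: auto-tune. cusparseLtMatmulSearch() takes the same
//     arguments as cusparseLtMatmul(); the winning algorithm is stored in the
//     plan and reused by subsequent cusparseLtMatmul() calls.
cusparseLtMatmulSearch(&handle, &plan, &alpha, dA_compressed, dB, &beta,
                       dC, dD, d_workspace, nullptr, 0);
```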

Support#

  • Supported SM Architectures: SM 8.0, SM 8.6, SM 8.7, SM 8.9, SM 9.0, SM 10.0, SM 12.0

  • Supported CPU architectures and operating systems:

    | OS      | CPU archs     |
    |---------|---------------|
    | Windows | x86_64        |
    | Linux   | x86_64, Arm64 |
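
Since support varies by SM architecture, a runtime guard can avoid initializing the library on unsupported devices. The helper below is a hypothetical convenience (not part of cuSPARSELt) that uses only the standard CUDA runtime API; the architecture list is copied from this section.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the given device's SM architecture
// appears in the support matrix above (8.0, 8.6, 8.7, 8.9, 9.0, 10.0, 12.0).
static bool deviceSupportedByCusparseLt(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    const int sm = prop.major * 10 + prop.minor;   // e.g. 8.6 -> 86, 10.0 -> 100
    const int supported[] = {80, 86, 87, 89, 90, 100, 120};
    for (int s : supported)
        if (sm == s) return true;
    return false;
}
```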
