NVIDIA Fortran CUDA Interfaces
Preface
This document describes the NVIDIA Fortran interfaces to cuBLAS, cuFFT, cuRAND, cuSPARSE, and other CUDA Libraries used in scientific and engineering applications built upon the CUDA computing architecture.
Intended Audience
This guide is intended for application programmers, scientists and engineers proficient in programming with the Fortran language. This guide assumes some familiarity with either CUDA Fortran or OpenACC.
Organization
The organization of this document is as follows:
- Introduction
contains a general introduction to Fortran interfaces, OpenACC, CUDA Fortran, and CUDA Library functions
- BLAS Runtime Library
describes the Fortran interfaces to the various cuBLAS libraries
- FFT Runtime Library APIs
describes the module types, definitions and Fortran interfaces to the cuFFT library
- Random Number Runtime APIs
describes the Fortran interfaces to the host and device cuRAND libraries
- Sparse Matrix Runtime APIs
describes the module types, definitions and Fortran interfaces to the cuSPARSE Library
- Matrix Solver Runtime APIs
describes the module types, definitions and Fortran interfaces to the cuSOLVER Library
- Tensor Primitives Runtime APIs
describes the module types, definitions and Fortran interfaces to the cuTENSOR Library
- NVIDIA Collective Communications Library APIs
describes the module types, definitions and Fortran interfaces to the NCCL Library
- NVSHMEM Communication Library APIs
describes the module types, definitions and Fortran interfaces to the NVSHMEM Library
- NVTX Profiling Library APIs
describes the module types, definitions and Fortran interfaces to the NVTX API and Library
- Examples
provides sample code and an explanation of each of the simple examples.
Conventions
This guide uses the following conventions:
- italic
is used for emphasis.
- Constant Width
is used for filenames, directories, arguments, options, examples, and for language statements in the text, including assembly language statements.
- Bold
is used for commands.
- [ item1 ]
in general, square brackets indicate optional items. In this case item1 is optional. In the context of p/t-sets, square brackets are required to specify a p/t-set.
- { item2 | item3 }
braces indicate that a selection is required. In this case, you must select either item2 or item3.
- filename …
the ellipsis indicates repetition: zero or more of the preceding item may occur. In this example, multiple filenames are allowed.
- FORTRAN
Fortran language statements are shown in the text of this guide using a reduced fixed point size.
- C++ and C
C++ and C language statements are shown in the text of this guide using a reduced fixed point size.
Terminology
If there are terms in this guide with which you are unfamiliar, see the NVIDIA HPC glossary.
Related Publications
The following documents contain additional information related to OpenACC and CUDA Fortran programming, CUDA, and the CUDA Libraries.
ISO/IEC 1539-1:1997, Information Technology – Programming Languages – FORTRAN, Geneva, 1997 (Fortran 95).
NVIDIA CUDA Programming Guides, NVIDIA. Available online at docs.nvidia.com/cuda.
NVIDIA HPC Compiler User’s Guide, Release 2024. Available online at docs.nvidia.com/hpc-sdk.
1. Introduction
This document provides a reference for calling CUDA Library functions from NVIDIA Fortran. It can be used from Fortran code using the OpenACC or OpenMP programming models, or from NVIDIA CUDA Fortran. Currently, the CUDA libraries which NVIDIA provides pre-built interface modules for, and which are documented here, are:
cuBLAS, an implementation of the BLAS.
cuFFT, a library of Fast Fourier Transform (FFT) routines.
cuRAND, a library for random number generation.
cuSPARSE, a library of linear algebra routines used with sparse matrices.
cuSOLVER, a library of equation solvers used with dense or other matrices.
cuTENSOR, a library for tensor primitive operations.
NCCL, a collective communications library.
NVSHMEM, a library implementation of OpenSHMEM on GPUs.
NVTX, an API for annotating application events, code ranges, and resources.
The OpenACC Application Program Interface is a collection of compiler directives and runtime routines that allows the programmer to specify loops and regions of code for offloading from a host CPU to an attached accelerator, such as a GPU. The OpenACC API was designed and is maintained by an industry consortium. See the OpenACC website for more information about the OpenACC API.
OpenMP is a specification for a set of compiler directives, an applications programming interface (API), and a set of environment variables that can be used to specify parallel execution from Fortran (and other languages). The OpenMP target offload capabilities are similar in many respects to OpenACC. The methods for passing device arrays to library functions from host code differ only in syntax compared to those used in OpenACC. For general information about using OpenMP and to obtain a copy of the OpenMP specification, refer to the OpenMP organization’s website.
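Since only the syntax differs, a minimal OpenMP sketch of passing device addresses to a library routine (an illustration only, assuming nvfortran's OpenMP target offload via -mp=gpu, the cublas module, and the OpenMP 5.0 use_device_addr clause) might look like this:
subroutine hostcall_omp(a, b, n)
use cublas
real a(n), b(n)
!$omp target data map(tofrom: a, b)
!$omp target data use_device_addr(a, b)
call cublasSswap(n, a, 1, b, 1)
!$omp end target data
!$omp end target data
return
end
The OpenACC equivalent, using data and host_data directives, is shown in the sections that follow.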
CUDA Fortran is a small set of extensions to Fortran that supports and is built upon the CUDA computing architecture. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran, and is an analog to NVIDIA’s CUDA C compiler. Compared to the NVIDIA Accelerator and OpenACC directives-based model and compilers, CUDA Fortran is a lower-level explicit programming model with substantial runtime library components that give expert programmers direct control of all aspects of GPGPU programming.
This document does not contain explanations or purposes of the library functions, nor does it contain details of the approach used in the CUDA implementation to target GPUs. For that information, please see the appropriate library document that comes with the NVIDIA CUDA Toolkit. This document does provide the Fortran module contents: derived types, enumerations, and interfaces, to make use of the libraries from Fortran rather than from C or C++.
Many of the examples used in this document are provided in the HPC compiler and tools distribution, along with Makefiles, and are stored in a yearly directory, such as 2020/examples/CUDA-Libraries.
1.1. Fortran Interfaces and Wrappers
Almost all of the function interfaces shown in this document make use of features from the Fortran 2003 iso_c_binding intrinsic module. This module provides a standard way for dealing with issues such as inter-language data types, capitalization, adding underscores to symbol names, or passing arguments by value.
Often, the iso_c_binding module enables Fortran programs containing properly written interfaces to call directly into the C library functions. In some cases, NVIDIA has written small wrappers around the C library function, to make the Fortran call site more “Fortran-like”, hiding some issues exposed in the C interfaces like handle management, host vs. device pointer management, or character and complex data type issues.
In a small number of cases, the C library may contain multiple entry points to handle different data types, perhaps an int in one function and a size_t in another, but which are otherwise identical. In these cases, NVIDIA may provide just one generic Fortran interface, and will call the appropriate C function under the hood.
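As an illustration only (the library and C entry-point names below are hypothetical, not part of any shipped library), such a generic interface might be declared like this, with the two specifics differing only in the integer kind of the count argument:
interface mylibSetWorkSize
integer(c_int) function mylibSetWorkSize_int(handle, nbytes) &
bind(C,name='mylibSetWorkSize')
use iso_c_binding
type(c_ptr), value :: handle
integer(c_int), value :: nbytes
end function
integer(c_int) function mylibSetWorkSize_long(handle, nbytes) &
bind(C,name='mylibSetWorkSize_64')
use iso_c_binding
type(c_ptr), value :: handle
integer(c_size_t), value :: nbytes
end function
end interface
A call with a default-integer actual argument then resolves to the int entry point, while an integer(8) actual argument resolves to the size_t entry point.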
1.2. Using CUDA Libraries from OpenACC Host Code
All of the libraries covered in this document contain functions which are callable from OpenACC host code. Most functions take some arguments which are expected to be device pointers (the address of a variable in device global memory). There are several ways to provide those device addresses in OpenACC.
If the call is lexically nested within an OpenACC data directive, the NVIDIA Fortran compiler, in the presence of an explicit interface such as those provided by the NVIDIA library modules, will default to passing the device pointer when required.
subroutine hostcall(a, b, n)
use cublas
real a(n), b(n)
!$acc data copy(a, b)
call cublasSswap(n, a, 1, b, 1)
!$acc end data
return
end
A Fortran interface is made explicit when you use the module that contains it, as in the line use cublas in the example above. If you look ahead to the actual interface for cublasSswap, you will see that the arrays a and b are declared with the CUDA Fortran device attribute, so they take only device addresses as arguments.
When using OpenACC, it is more portable and general to pass device pointers to subprograms with the host_data construct, since most implementations have no way to mark dummy arguments as device pointers. The host_data construct with the use_device clause makes the device addresses available in host code for passing to the subprogram.
use cufft
use openacc
. . .
!$acc data copyin(a), copyout(b,c)
ierr = cufftPlan2D(iplan1,m,n,CUFFT_C2C)
ierr = ierr + cufftSetStream(iplan1,acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(a,b,c)
ierr = ierr + cufftExecC2C(iplan1,a,b,CUFFT_FORWARD)
ierr = ierr + cufftExecC2C(iplan1,b,c,CUFFT_INVERSE)
!$acc end host_data
! scale c
!$acc kernels
c = c / (m*n)
!$acc end kernels
!$acc end data
This code snippet also shows an example of sharing the stream that OpenACC and the cuFFT library use. Every library in this document has a function for setting the CUDA stream which the library runs on. Usually, when using OpenACC, you want the OpenACC kernels to run on the same stream as the library functions. In the case above, this guarantees that the kernel c = c / (m*n)
does not start until the FFT operations complete. The function acc_get_cuda_stream and the definition for acc_async_sync are in the openacc module.
1.3. Using CUDA Libraries from OpenACC Device Code
Two libraries are currently available from within OpenACC compute regions. Certain functions in both the openacc_curand module and the nvshmem module are marked acc routine seq.
The cuRAND device library is all contained within CUDA header files. In device code, it is designed to return one or a small number of random numbers per thread. The thread’s random generators run independently of each other, and it is usually advised for performance reasons to give each thread a different seed, rather than a different offset.
program t
use openacc_curand
integer, parameter :: n = 500
real a(n,n,4)
type(curandStateXORWOW) :: h
integer(8) :: seed, seq, offset
a = 0.0
!$acc parallel num_gangs(n) vector_length(n) copy(a)
!$acc loop gang
do j = 1, n
!$acc loop vector private(h)
do i = 1, n
seed = 12345_8 + j*n*n + i*2
seq = 0_8
offset = 0_8
call curand_init(seed, seq, offset, h)
!$acc loop seq
do k = 1, 4
a(i,j,k) = curand_uniform(h)
end do
end do
end do
!$acc end parallel
print *,maxval(a),minval(a),sum(a)/(n*n*4)
end
When using the openacc_curand module, since all the code is contained in CUDA header files, you do not need any additional libraries on the link line.
1.4. Using CUDA Libraries from CUDA Fortran Host Code
The predominant usage model for the library functions listed in this document is to call them from CUDA Fortran host code. CUDA Fortran allows some special capabilities in that the compiler is able to recognize the device and managed attributes when resolving generic interfaces. Device actual arguments can only match the interface's device dummy arguments; managed actual arguments, by precedence, match managed dummy arguments first, then device dummies, then host.
program testisamax ! link with -cudalib=cublas -lblas
use cublas
real*4 x(1000)
real*4, device :: xd(1000)
real*4, managed :: xm(1000)
call random_number(x)
! Call host BLAS
j = isamax(1000,x,1)
xd = x
! Call cuBLAS
k = isamax(1000,xd,1)
print *,j.eq.k
xm = x
! Also calls cuBLAS
k = isamax(1000,xm,1)
print *,j.eq.k
end
Using the cudafor module, the full set of CUDA functionality is available to programmers for managing CUDA events, streams, synchronization, and asynchronous behaviors. CUDA Fortran can be used in OpenMP programs, and the CUDA Libraries in this document are thread safe with respect to host CPU threads. Further examples are included in the Examples chapter.
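For example, a minimal sketch (error checking omitted) of creating a stream with the cudafor module and running a cuBLAS v2 call on it might look like this:
program streamdemo
use cudafor
use cublas
real(4) :: x(1000)
real(4), device :: xd(1000)
integer(kind=cuda_stream_kind) :: str
type(cublasHandle) :: h
integer :: istat, k
call random_number(x)
xd = x
istat = cublasCreate(h)
istat = cudaStreamCreate(str)
istat = cublasSetStream(h, str)            ! subsequent cuBLAS work runs on str
istat = cublasIsamax_v2(h, 1000, xd, 1, k) ! result k returned to the host
istat = cudaStreamSynchronize(str)         ! make sure all work on str has completed
print *, 'index of max element = ', k
istat = cublasDestroy(h)
end program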
1.5. Using CUDA Libraries from CUDA Fortran Device Code
The cuRAND and NVSHMEM libraries have functions callable from CUDA Fortran device code, and their interfaces are accessed via the curand_device and nvshmem modules, respectively. The module interfaces are very similar to the modules used in OpenACC device code, but for CUDA Fortran, each subroutine and function is declared attributes([host,]device), and the subroutines and functions do not need to be marked as acc routine seq.
module mrand
use curand_device
integer, parameter :: n = 500
contains
attributes(global) subroutine randsub(a)
real, device :: a(n,n,4)
type(curandStateXORWOW) :: h
integer(8) :: seed, seq, offset
j = blockIdx%x; i = threadIdx%x
seed = 12345_8 + j*n*n + i*2
seq = 0_8
offset = 0_8
call curand_init(seed, seq, offset, h)
do k = 1, 4
a(i,j,k) = curand_uniform(h)
end do
end subroutine
end module
program t ! nvfortran t.cuf
use mrand
use cudafor ! recognize maxval, minval, sum w/managed
real, managed :: a(n,n,4)
a = 0.0
call randsub<<<n,n>>>(a)
print *,maxval(a),minval(a),sum(a)/(n*n*4)
end program
1.6. Pointer Modes in cuBLAS and cuSPARSE
Because the NVIDIA Fortran compiler can distinguish between host and device arguments, the NVIDIA modules for interfacing to cuBLAS and cuSPARSE handle pointer modes differently than CUDA C, which requires setting the mode explicitly for scalar arguments. Examples of scalar arguments which can reside either on the host or device are the alpha and beta scale factors to the *gemm functions.
Typically, when using the normal “non-_v2” interfaces in the cuBLAS and cuSPARSE modules, the runtime wrappers will implicitly add the setting and restoring of the library pointer modes behind the scenes. This adds some negligible but non-zero overhead to the calls.
To avoid the implicit getting and setting of the pointer mode with every invocation of a library function, do the following:
- For the BLAS, use the cublas_v2 module and the v2 entry points, such as cublasIsamax_v2. It is the programmer's responsibility to properly set the pointer mode when needed. Examples of scalar arguments which do require setting the pointer mode are the alpha and beta scale factors passed to the *gemm routines, and the scalar results returned from the v2 versions of the *amax(), *amin(), *asum(), *rotg(), *rotmg(), *nrm2(), and *dot() functions. In the v2 interfaces shown in chapter 2, these scalar arguments carry the comment ! device or host variable. Examples of scalar arguments which do not require setting the pointer mode are increments, extents, and lengths such as incx, incy, n, lda, ldb, and ldc.
- For the cuSPARSE library, each function listed in chapter 5 which contains scalar arguments with the comment ! device or host variable has a corresponding v2 interface, though it is not documented here. For instance, in addition to the interface named cusparseSaxpyi, there is another interface named cusparseSaxpyi_v2 with the exact same argument list which calls into the cuSPARSE library directly and will not implicitly get or set the library pointer mode.
The CUDA default pointer mode is that the scalar arguments reside on the host. The NVIDIA runtime does not change that setting.
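For example, a minimal sketch (assuming the cublas_v2 module and CUDA Fortran) of explicitly managing the pointer mode when the alpha and beta scale factors reside on the device might look like this; with the legacy cublas module interfaces this handling is implicit:
program gemm_device_scalars
use cudafor
use cublas_v2
integer, parameter :: m = 64, n = 64, k = 64
type(cublasHandle) :: h
real(4), device :: alpha_d, beta_d
real(4), device :: a(m,k), b(k,n), c(m,n)
integer :: istat
a = 1.0; b = 2.0; c = 0.0
alpha_d = 1.0; beta_d = 0.0
istat = cublasCreate(h)
! alpha and beta are device-resident, so tell the library
istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE)
istat = cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &
alpha_d, a, m, b, k, beta_d, c, m)
! restore the default before calls that pass host-resident scalars
istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
istat = cublasDestroy(h)
end program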
1.7. Writing Your Own CUDA Interfaces
Despite the large number of interfaces included in the modules described in this document, users will occasionally need to write their own interfaces to new libraries or to their own tuned CUDA code, perhaps written in C/C++. There are some standard techniques to use, and some non-standard NVIDIA extensions which can make creating working interfaces easier.
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
end interface cufftExecC2C
This interface calls the C library function directly. You can deal with Fortran's capitalization issues by putting the properly capitalized C function name in the bind(C) attribute. If the C function expects input arguments passed by value, you can add the value attribute to the dummy declaration as well. A nice feature of Fortran is that the interface can change, but the code at the call site may not have to. The compiler changes the details of the call to fit the interface.
Now suppose a user of this interface would like to call this function with REAL data (F77 code is notorious for mixing REAL and COMPLEX declarations). There are two ways to do this:
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
integer function cufftExecR2R( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
real, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecR2R
end interface cufftExecC2C
Here the C name hasn't changed. The compiler will now accept actual arguments corresponding to idata and odata that are declared REAL. A generic interface is created named cufftExecC2C. If you have problems debugging your generic interface, as a debugging aid you can try calling the specific name, cufftExecR2R in this case, to help diagnose the problem.
A commonly used extension which is supported by NVIDIA is ignore_tkr. A programmer can use it in an interface to instruct the compiler to ignore any combination of the type, kind, and rank during the interface matching process. The previous example using ignore_tkr looks like this:
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
!dir$ ignore_tkr(tr) idata, (tr) odata
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
end interface cufftExecC2C
Now the compiler will ignore both the type and rank (F77 could also be sloppy in its handling of array dimensions) of idata and odata when matching the call site to the interface. An unfortunate side-effect is that the interface will now allow integer, logical, and character data for idata and odata. It is up to the implementor to determine if that is acceptable.
A final aid, specific to NVIDIA, worth mentioning here is ignore_tkr (d), which ignores the device attribute of an actual argument during interface matching.
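For example, a minimal sketch that also ignores the device attribute, so either host- or device-resident arrays match idata and odata, simply adds d to the directive (whether accepting host arrays is appropriate depends on the underlying function):
! cufftExecC2C
interface cufftExecC2C
integer function cufftExecC2C( plan, idata, odata, direction ) &
bind(C,name='cufftExecC2C')
integer, value :: plan
!dir$ ignore_tkr(trd) idata, (trd) odata
complex, device, dimension(*) :: idata, odata
integer, value :: direction
end function cufftExecC2C
end interface cufftExecC2C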
Of course, if you write a wrapper, a narrow strip of code between the Fortran call and your library function, you are not limited by the simple transformations that a compiler can do, such as those listed here. As mentioned earlier, many of the interfaces provided in the cuBLAS and cuSPARSE modules use wrappers.
A common request is a way for Fortran programmers to take advantage of the Thrust library. Explaining Thrust and C++ programming is outside the scope of this document, but this simple example shows how to take advantage of the excellent sort capabilities in Thrust:
// Filename: csort.cu
// nvcc -c -arch sm_35 csort.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
extern "C" {
//Sort for integer arrays
void thrust_int_sort_wrapper( int *data, int N)
{
thrust::device_ptr <int> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}
//Sort for float arrays
void thrust_float_sort_wrapper( float *data, int N)
{
thrust::device_ptr <float> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}
//Sort for double arrays
void thrust_double_sort_wrapper( double *data, int N)
{
thrust::device_ptr <double> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}
}
Setting up the Fortran interface to the sort routine and calling it are simple:
program t
interface sort
subroutine sort_int(array, n) &
bind(C,name='thrust_int_sort_wrapper')
integer(4), device, dimension(*) :: array
integer(4), value :: n
end subroutine
end interface
integer(4), parameter :: n = 100
integer(4), device :: a_d(n)
integer(4) :: a_h(n)
!$cuf kernel do
do i = 1, n
a_d(i) = 1 + mod(47*i,n)
end do
call sort(a_d, n)
a_h = a_d
nres = count(a_h .eq. (/(i,i=1,n)/))
if (nres.eq.n) then
print *,"test PASSED"
else
print *,"test FAILED"
endif
end
1.8. NVIDIA Fortran Compiler Options
The NVIDIA Fortran compiler driver is called nvfortran. General information on the compiler options which can be passed to nvfortran can be obtained by typing nvfortran -help. To enable targeting NVIDIA GPUs using OpenACC, use nvfortran -acc=gpu. To enable targeting NVIDIA GPUs using CUDA Fortran, use nvfortran -cuda. CUDA Fortran is also supported by the NVIDIA Fortran compilers when the filename uses the .cuf extension. Uppercase file extensions, .F90 or .CUF, for example, may also be used, in which case the program is processed by the preprocessor before being compiled.
Other options which are pertinent to the examples in this document are:
-cudalib[=cublas|cufft|cufftw|curand|cusolver|cusparse|cutensor|nvblas|nccl|nvshmem|nvlamath|nvtx]: this option adds the appropriate versions of the CUDA-optimized libraries to the link line. It handles static and dynamic linking, and platform (Linux, Windows) differences unobtrusively.
-gpu=cc70: this option compiles for compute capability 7.0. Certain library functionality may require minimum compute capability of 6.0, 7.0, or higher.
-gpu=cudaX.Y: this option compiles and links with a particular CUDA Toolkit version. Certain library functionality may require a newer (or older, for deprecated functions) CUDA runtime version.
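For example, a CUDA Fortran program that calls cuFFT (hypothetical file name myfft.cuf) could be built for compute capability 7.0 with a command line such as:
nvfortran -cuda -gpu=cc70 -cudalib=cufft myfft.cuf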
2. BLAS Runtime APIs
This section describes the Fortran interfaces to the CUDA BLAS libraries. There are currently four separate collections of function entry points which are commonly referred to as cuBLAS:
The original CUDA implementation of the BLAS routines, referred to as the legacy API, which are callable from the host and expect and operate on device data.
The newer “v2” CUDA implementation of the BLAS routines, plus some extensions for batched operations. These are also callable from the host and operate on device data. In Fortran terms, these entry points have been changed from subroutines to functions which return status.
The cuBLAS XT library which can target multiple GPUs using only host-resident data.
The cuBLAS MP library which can target multiple GPUs using distributed device data, similar to the ScaLAPACK PBLAS functions. The cublasMp and cusolverMp libraries are built, in part, upon a communications library named CAL, which is documented in another section of this document.
NVIDIA currently ships with four Fortran modules which programmers can use to call into this cuBLAS functionality:
cublas, which provides interfaces into the main cuBLAS library. Both the legacy and v2 names are supported. In this module, the cublas names (such as cublasSaxpy) use the legacy calling conventions. Interfaces to a host BLAS library (for instance libblas.a in the NVIDIA distribution) are also included in the cublas module. These interfaces are exposed by adding the line
use cublas
to your program unit.
cublas_v2, which is similar to the cublas module in most ways except the cublas names (such as cublasSaxpy) use the v2 calling conventions. For instance, instead of a subroutine, cublasSaxpy is a function which takes a handle as the first argument and returns an integer containing the status of the call. These interfaces are exposed by adding the line
use cublas_v2
to your program unit.
cublasxt, which interfaces directly to the cublasXT API. These interfaces are exposed by adding the line
use cublasxt
to your program unit.
cublasmp, which provides interfaces into the cublasMp API. These interfaces are exposed by adding the line
use cublasMp
to your program unit.
The v2 routines are integer functions that return an error status code; they return a value of CUBLAS_STATUS_SUCCESS if the call was successful, or other cuBLAS status return value if there was an error.
The interfaces to the traditional BLAS names documented in the subsequent sections, which contain the comment ! device or host variable, should not be confused with the pointer mode issue discussed in section 1.6. The traditional BLAS names are overloaded generic names in the cublas module. For instance, in this interface
subroutine scopy(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
The arrays x and y can either both be device arrays, in which case cublasScopy is called via the generic interface, or they can both be host arrays, in which case scopy from the host BLAS library is called. Using CUDA Fortran managed data as actual arguments to scopy poses an interesting case; cublasScopy is chosen by default. If you wish to call the host library version of scopy with managed data, don't expose the generic scopy interface at the call site.
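For example, a minimal sketch of forcing the host library scopy with managed arguments simply avoids exposing the generic interface in that program unit (this assumes a host BLAS library is on the link line, for instance via -lblas):
subroutine copy_on_host(xm, ym, n)
! note: no "use cublas" here, so scopy is an external host routine
integer :: n
real(4), managed :: xm(n), ym(n)
call scopy(n, xm, 1, ym, 1)
end subroutine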
Unless a specific kind is provided, in the following interfaces the plain integer type implies integer(4) and the plain real type implies real(4).
2.1. CUBLAS Definitions and Helper Functions
This section contains definitions and data types used in the cuBLAS library and interfaces to the cuBLAS Helper Functions.
The cublas module contains the following derived type definitions:
TYPE cublasHandle
TYPE(C_PTR) :: handle
END TYPE
The cuBLAS module contains the following enumerations:
enum, bind(c)
enumerator :: CUBLAS_STATUS_SUCCESS =0
enumerator :: CUBLAS_STATUS_NOT_INITIALIZED =1
enumerator :: CUBLAS_STATUS_ALLOC_FAILED =3
enumerator :: CUBLAS_STATUS_INVALID_VALUE =7
enumerator :: CUBLAS_STATUS_ARCH_MISMATCH =8
enumerator :: CUBLAS_STATUS_MAPPING_ERROR =11
enumerator :: CUBLAS_STATUS_EXECUTION_FAILED=13
enumerator :: CUBLAS_STATUS_INTERNAL_ERROR =14
end enum
enum, bind(c)
enumerator :: CUBLAS_FILL_MODE_LOWER=0
enumerator :: CUBLAS_FILL_MODE_UPPER=1
end enum
enum, bind(c)
enumerator :: CUBLAS_DIAG_NON_UNIT=0
enumerator :: CUBLAS_DIAG_UNIT=1
end enum
enum, bind(c)
enumerator :: CUBLAS_SIDE_LEFT =0
enumerator :: CUBLAS_SIDE_RIGHT=1
end enum
enum, bind(c)
enumerator :: CUBLAS_OP_N=0
enumerator :: CUBLAS_OP_T=1
enumerator :: CUBLAS_OP_C=2
end enum
enum, bind(c)
enumerator :: CUBLAS_POINTER_MODE_HOST = 0
enumerator :: CUBLAS_POINTER_MODE_DEVICE = 1
end enum
2.1.1. cublasCreate
This function initializes the CUBLAS library and creates a handle to an opaque structure holding the CUBLAS library context. It allocates hardware resources on the host and device and must be called prior to making any other CUBLAS library calls. The CUBLAS library context is tied to the current CUDA device. To use the library on multiple devices, one CUBLAS handle needs to be created for each device. Furthermore, for a given device, multiple CUBLAS handles with different configurations can be created. Because cublasCreate allocates some internal resources and the release of those resources by calling cublasDestroy will implicitly call cublasDeviceSynchronize, it is recommended to minimize the number of cublasCreate/cublasDestroy occurrences. For multi-threaded applications that use the same device from different threads, the recommended programming model is to create one CUBLAS handle per thread and use that CUBLAS handle for the entire life of the thread.
integer(4) function cublasCreate(handle)
type(cublasHandle) :: handle
2.1.2. cublasDestroy
This function releases hardware resources used by the CUBLAS library. This function is usually the last call with a particular handle to the CUBLAS library. Because cublasCreate allocates some internal resources and the release of those resources by calling cublasDestroy will implicitly call cublasDeviceSynchronize, it is recommended to minimize the number of cublasCreate/cublasDestroy occurrences.
integer(4) function cublasDestroy(handle)
type(cublasHandle) :: handle
2.1.3. cublasGetVersion
This function returns the version number of the cuBLAS library.
integer(4) function cublasGetVersion(handle, version)
type(cublasHandle) :: handle
integer(4) :: version
2.1.4. cublasSetStream
This function sets the cuBLAS library stream, which will be used to execute all subsequent calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default NULL stream. In particular, this routine can be used to change the stream between kernel launches and then to reset the cuBLAS library stream back to NULL.
integer(4) function cublasSetStream(handle, stream)
type(cublasHandle) :: handle
integer(kind=cuda_stream_kind()) :: stream
2.1.5. cublasGetStream
This function gets the cuBLAS library stream, which is being used to execute all calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default NULL stream.
integer(4) function cublasGetStream(handle, stream)
type(cublasHandle) :: handle
integer(kind=cuda_stream_kind()) :: stream
2.1.6. cublasGetStatusName
This function returns the cuBLAS status name associated with a given status value.
character(128) function cublasGetStatusName(ierr)
integer(4) :: ierr
2.1.7. cublasGetStatusString
This function returns the cuBLAS status string associated with a given status value.
character(128) function cublasGetStatusString(ierr)
integer(4) :: ierr
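For example, a minimal sketch of reporting an error status using these helper functions (assuming a previously declared cublasHandle h and an integer istat) might look like this:
istat = cublasCreate(h)
if (istat .ne. CUBLAS_STATUS_SUCCESS) then
print *, 'cublasCreate failed: ', trim(cublasGetStatusName(istat)), ': ', &
trim(cublasGetStatusString(istat))
end if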
2.1.8. cublasGetPointerMode
This function obtains the pointer mode used by the cuBLAS library. In the cublas module, the pointer mode is set and reset on a call-by-call basis depending on whether the device attribute is set on scalar actual arguments. See section 1.6 for a discussion of pointer modes.
integer(4) function cublasGetPointerMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.9. cublasSetPointerMode
This function sets the pointer mode used by the cuBLAS library. When using the cublas module, the pointer mode is set on a call-by-call basis depending on whether the device attribute is set on scalar actual arguments. When using the cublas_v2 module with v2 interfaces, it is the programmer's responsibility to make calls to cublasSetPointerMode so scalar arguments are handled correctly by the library. See section 1.6 for a discussion of pointer modes.
integer(4) function cublasSetPointerMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.10. cublasGetAtomicsMode
This function obtains the atomics mode used by the cuBLAS library.
integer(4) function cublasGetAtomicsMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.11. cublasSetAtomicsMode
This function sets the atomics mode used by the cuBLAS library. Some routines in the cuBLAS library have alternate implementations that use atomics to accumulate results. These alternate implementations may run faster but may also generate results which are not identical from one run to the other. The default is to not allow atomics in cuBLAS functions.
integer(4) function cublasSetAtomicsMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.12. cublasGetMathMode
This function obtains the math mode used by the cuBLAS library.
integer(4) function cublasGetMathMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.13. cublasSetMathMode
This function sets the math mode used by the cuBLAS library. Some routines in the cuBLAS library allow you to choose the compute precision used to generate results. These alternate approaches may run faster but may also generate different, less accurate results.
integer(4) function cublasSetMathMode(handle, mode)
type(cublasHandle) :: handle
integer(4) :: mode
2.1.14. cublasGetSmCountTarget
This function obtains the SM count target used by the cuBLAS library.
integer(4) function cublasGetSmCountTarget(handle, counttarget)
type(cublasHandle) :: handle
integer(4) :: counttarget
2.1.15. cublasSetSmCountTarget
This function sets the SM count target used by the cuBLAS library.
integer(4) function cublasSetSmCountTarget(handle, counttarget)
type(cublasHandle) :: handle
integer(4) :: counttarget
2.1.16. cublasGetHandle
This function gets the cuBLAS handle currently in use by a thread. The CUDA Fortran runtime keeps track of each CPU thread's current handle, which is useful if you are using the legacy BLAS API or do not wish to pass the handle manually through to low-level functions or subroutines.
type(cublashandle) function cublasGetHandle()
integer(4) function cublasGetHandle(handle)
type(cublasHandle) :: handle
2.1.17. cublasSetVector
This function copies n elements from a vector x in host memory space to a vector y in GPU memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy or array assignment statements.
integer(4) function cublassetvector(n, elemsize, x, incx, y, incy)
integer :: n, elemsize, incx, incy
integer*1, dimension(*) :: x
integer*1, device, dimension(*) :: y
2.1.18. cublasGetVector
This function copies n elements from a vector x in GPU memory space to a vector y in host memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy or array assignment statements.
integer(4) function cublasgetvector(n, elemsize, x, incx, y, incy)
integer :: n, elemsize, incx, incy
integer*1, device, dimension(*) :: x
integer*1, dimension(*) :: y
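For example, a minimal sketch of moving a real(4) vector to the device and back with these functions (assuming declarations for n and istat; note the element size is given in bytes) might look like this:
real(4) :: x(n), y(n)
real(4), device :: xd(n)
istat = cublasSetVector(n, 4, x, 1, xd, 1) ! host x to device xd, 4 bytes per element
istat = cublasGetVector(n, 4, xd, 1, y, 1) ! device xd back to host y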
2.1.19. cublasSetMatrix
This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy, cudaMemcpy2D, or array assignment statements.
integer(4) function cublassetmatrix(rows, cols, elemsize, a, lda, b, ldb)
integer :: rows, cols, elemsize, lda, ldb
integer*1, dimension(lda, *) :: a
integer*1, device, dimension(ldb, *) :: b
2.1.20. cublasGetMatrix
This function copies a tile of rows x cols elements from a matrix A in GPU memory space to a matrix B in host memory space. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpy, cudaMemcpy2D, or array assignment statements.
integer(4) function cublasgetmatrix(rows, cols, elemsize, a, lda, b, ldb)
integer :: rows, cols, elemsize, lda, ldb
integer*1, device, dimension(lda, *) :: a
integer*1, dimension(ldb, *) :: b
2.1.21. cublasSetVectorAsync
This function copies n elements from a vector x in host memory space to a vector y in GPU memory space, asynchronously, on the given CUDA stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync.
integer(4) function cublassetvectorasync(n, elemsize, x, incx, y, incy, stream)
integer :: n, elemsize, incx, incy
integer*1, dimension(*) :: x
integer*1, device, dimension(*) :: y
integer(kind=cuda_stream_kind()) :: stream
2.1.22. cublasGetVectorAsync
This function copies n elements from a vector x in GPU memory space to a vector y in host memory space, asynchronously, on the given CUDA stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of vector x and y is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync.
integer(4) function cublasgetvectorasync(n, elemsize, x, incx, y, incy, stream)
integer :: n, elemsize, incx, incy
integer*1, device, dimension(*) :: x
integer*1, dimension(*) :: y
integer(kind=cuda_stream_kind()) :: stream
2.1.23. cublasSetMatrixAsync
This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space, asynchronously using the specified stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync or cudaMemcpy2DAsync.
integer(4) function cublassetmatrixasync(rows, cols, elemsize, a, lda, b, ldb, stream)
integer :: rows, cols, elemsize, lda, ldb
integer*1, dimension(lda, *) :: a
integer*1, device, dimension(ldb, *) :: b
integer(kind=cuda_stream_kind()) :: stream
2.1.24. cublasGetMatrixAsync
This function copies a tile of rows x cols elements from a matrix A in GPU memory space to a matrix B in host memory space, asynchronously, using the specified stream. It is assumed that each element requires storage of elemSize bytes. In CUDA Fortran, the type of Matrix A and B is overloaded to take any data type, but the size of the data type must still be specified in bytes. This functionality can also be implemented using cudaMemcpyAsync or cudaMemcpy2DAsync.
integer(4) function cublasgetmatrixasync(rows, cols, elemsize, a, lda, b, ldb, stream)
integer :: rows, cols, elemsize, lda, ldb
integer*1, device, dimension(lda, *) :: a
integer*1, dimension(ldb, *) :: b
integer(kind=cuda_stream_kind()) :: stream
2.2. Single Precision Functions and Subroutines
This section contains interfaces to the single precision BLAS and cuBLAS functions and subroutines.
2.2.1. isamax
ISAMAX finds the index of the element having the maximum absolute value.
integer(4) function isamax(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIsamax(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIsamax_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.2.2. isamin
ISAMIN finds the index of the element having the minimum absolute value.
integer(4) function isamin(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIsamin(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIsamin_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.2.3. sasum
SASUM takes the sum of the absolute values.
real(4) function sasum(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
real(4) function cublasSasum(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasSasum_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.2.4. saxpy
SAXPY constant times a vector plus a vector.
subroutine saxpy(n, a, x, incx, y, incy)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSaxpy(n, a, x, incx, y, incy)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSaxpy_v2(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
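For example, a minimal sketch showing the three saxpy entry points side by side (assuming the cublas module, which exposes both the legacy and v2 names) might look like this:
program saxpy_three_ways
use cublas
integer, parameter :: n = 1000
real(4) :: alpha = 2.0
real(4), device :: xd(n), yd(n)
type(cublasHandle) :: h
integer :: istat
xd = 1.0; yd = 3.0
call saxpy(n, alpha, xd, 1, yd, 1)       ! generic name, dispatches to cuBLAS for device arrays
call cublasSaxpy(n, alpha, xd, 1, yd, 1) ! legacy cuBLAS entry point
istat = cublasCreate(h)
istat = cublasSaxpy_v2(h, n, alpha, xd, 1, yd, 1) ! v2: handle argument plus status return
istat = cublasDestroy(h)
end program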
2.2.5. scopy
SCOPY copies a vector, x, to a vector, y.
subroutine scopy(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasScopy(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasScopy_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.6. sdot
SDOT forms the dot product of two vectors.
real(4) function sdot(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
real(4) function cublasSdot(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSdot_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
real(4), device :: res ! device or host variable
2.2.7. snrm2
SNRM2 returns the euclidean norm of a vector via the function name, so that SNRM2 := sqrt( x’*x ).
real(4) function snrm2(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
real(4) function cublasSnrm2(n, x, incx)
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasSnrm2_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.2.8. srot
SROT applies a plane rotation.
subroutine srot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.9. srotg
SROTG constructs a Givens plane rotation.
subroutine srotg(sa, sb, sc, ss)
real(4), device :: sa, sb, sc, ss ! device or host variable
subroutine cublasSrotg(sa, sb, sc, ss)
real(4), device :: sa, sb, sc, ss ! device or host variable
integer(4) function cublasSrotg_v2(h, sa, sb, sc, ss)
type(cublasHandle) :: h
real(4), device :: sa, sb, sc, ss ! device or host variable
2.2.10. srotm
SROTM applies the modified Givens transformation, H, to the 2 by N matrix whose rows are SX**T and SY**T, where **T indicates transpose. The elements of SX are SX(LX+I*INCX), I = 0 to N-1, where LX = 1 if INCX .GE. 0, else LX = (-INCX)*N, and similarly for SY using LY and INCY. With SPARAM(1)=SFLAG, H has one of the following forms (rows separated by semicolons):
SFLAG=-1.E0: H = (SH11 SH12; SH21 SH22)
SFLAG= 0.E0: H = (1.E0 SH12; SH21 1.E0)
SFLAG= 1.E0: H = (SH11 1.E0; -1.E0 SH22)
SFLAG=-2.E0: H = (1.E0 0.E0; 0.E0 1.E0)
See SROTMG for a description of data storage in SPARAM.
subroutine srotm(n, x, incx, y, incy, param)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSrotm(n, x, incx, y, incy, param)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
real(4), device :: param(*) ! device or host variable
integer(4) function cublasSrotm_v2(h, n, x, incx, y, incy, param)
type(cublasHandle) :: h
integer :: n
real(4), device :: param(*) ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.11. srotmg
SROTMG constructs the modified Givens transformation matrix H which zeros the second component of the 2-vector (SQRT(SD1)*SX1, SQRT(SD2)*SY2)**T. With SPARAM(1)=SFLAG, H has one of the following forms (rows separated by semicolons):
SFLAG=-1.E0: H = (SH11 SH12; SH21 SH22)
SFLAG= 0.E0: H = (1.E0 SH12; SH21 1.E0)
SFLAG= 1.E0: H = (SH11 1.E0; -1.E0 SH22)
SFLAG=-2.E0: H = (1.E0 0.E0; 0.E0 1.E0)
Locations 2-4 of SPARAM contain SH11,SH21,SH12, and SH22 respectively. (Values of 1.E0, -1.E0, or 0.E0 implied by the value of SPARAM(1) are not stored in SPARAM.)
subroutine srotmg(d1, d2, x1, y1, param)
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
subroutine cublasSrotmg(d1, d2, x1, y1, param)
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
integer(4) function cublasSrotmg_v2(h, d1, d2, x1, y1, param)
type(cublasHandle) :: h
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
2.2.12. sscal
SSCAL scales a vector by a constant.
subroutine sscal(n, a, x, incx)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasSscal(n, a, x, incx)
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasSscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x
integer :: incx
2.2.13. sswap
SSWAP interchanges two vectors.
subroutine sswap(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasSswap(n, x, incx, y, incy)
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasSswap_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.2.14. sgbmv
SGBMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n band matrix, with kl sub-diagonals and ku super-diagonals.
subroutine sgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSgbmv_v2(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.15. sgemv
SGEMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
subroutine sgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSgemv_v2(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.16. sger
SGER performs the rank 1 operation A := alpha*x*y**T + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine sger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
integer(4) function cublasSger_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
2.2.17. ssbmv
SSBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric band matrix, with k super-diagonals.
subroutine ssbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsbmv_v2(h, t, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.18. sspmv
SSPMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine sspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSspmv_v2(h, t, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha, beta ! device or host variable
2.2.19. sspr
SSPR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix, supplied in packed form.
subroutine sspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
integer(4) function cublasSspr_v2(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.2.20. sspr2
SSPR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine sspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha ! device or host variable
integer(4) function cublasSspr2_v2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha ! device or host variable
2.2.21. ssymv
SSYMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine ssymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsymv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.2.22. ssyr
SSYR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix.
subroutine ssyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
real(4), device :: alpha ! device or host variable
integer(4) function cublasSsyr_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
real(4), device :: alpha ! device or host variable
2.2.23. ssyr2
SSYR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine ssyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x, y ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasSsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
integer(4) function cublasSsyr2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
2.2.24. stbmv
STBMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals.
subroutine stbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStbmv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.25. stbsv
STBSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine stbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStbsv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.26. stpmv
STPMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
subroutine stpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x ! device or host variable
subroutine cublasStpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
integer(4) function cublasStpmv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
2.2.27. stpsv
STPSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine stpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x ! device or host variable
subroutine cublasStpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
integer(4) function cublasStpsv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
2.2.28. strmv
STRMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
subroutine strmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStrmv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.29. strsv
STRSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine strsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(*) :: x ! device or host variable
subroutine cublasStrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
integer(4) function cublasStrsv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.2.30. sgemm
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
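For reference, the following is a minimal sketch of calling the overloaded sgemm name on CUDA Fortran device arrays; the matrix sizes and data used here are illustrative assumptions, not part of the interface.
program sgemm_example
  use cublas    ! provides the overloaded BLAS names for device arrays
  use cudafor
  implicit none
  integer, parameter :: m = 64, n = 64, k = 64   ! illustrative sizes
  real(4), allocatable :: a(:,:), b(:,:), c(:,:)
  real(4), allocatable, device :: a_d(:,:), b_d(:,:), c_d(:,:)
  real(4) :: alpha = 1.0, beta = 0.0
  allocate(a(m,k), b(k,n), c(m,n), a_d(m,k), b_d(k,n), c_d(m,n))
  call random_number(a); call random_number(b); c = 0.0
  a_d = a; b_d = b; c_d = c            ! host-to-device copies
  ! Because a_d, b_d and c_d are device arrays, this call resolves to the cuBLAS version
  call sgemm('N', 'N', m, n, k, alpha, a_d, m, b_d, k, beta, c_d, m)
  c = c_d                              ! copy the result back to the host
  print *, 'c(1,1) =', c(1,1)
end program sgemm_example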
2.2.31. ssymm
SSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
subroutine ssymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsymm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.32. ssyrk
SSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine ssyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsyrk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.33. ssyr2k
SSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine ssyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsyr2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.34. ssyrkx
SSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is chosen such that the result is guaranteed to be symmetric. See the cuBLAS documentation for more details.
subroutine ssyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasSsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasSsyrkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.2.35. strmm
STRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ), where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T.
subroutine strmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasStrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
integer(4) function cublasStrmm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha ! device or host variable
2.2.36. strsm
STRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
subroutine strsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a ! device or host variable
real(4), device, dimension(ldb, *) :: b ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasStrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
integer(4) function cublasStrsm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
2.2.37. cublasSgemvBatched
SGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasSgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasSgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.2.38. cublasSgemmBatched
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasSgemmBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(4), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasSgemmBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(4), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
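As an illustration of how the arrays of device pointers are typically built, the sketch below forms Aarray, Barray and Carray with c_devloc from the cudafor module and then calls cublasSgemmBatched; the batch size, matrix size and data are assumptions made for the example.
program sgemm_batched_example
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 32, nbatch = 100     ! illustrative sizes
  real(4), allocatable, device :: a_d(:,:,:), b_d(:,:,:), c_d(:,:,:)
  type(c_devptr), allocatable :: pa(:), pb(:), pc(:)               ! built on the host
  type(c_devptr), allocatable, device :: pa_d(:), pb_d(:), pc_d(:) ! passed to cuBLAS
  real(4) :: alpha = 1.0, beta = 0.0
  type(cublasHandle) :: h
  integer :: i, istat
  allocate(a_d(n,n,nbatch), b_d(n,n,nbatch), c_d(n,n,nbatch))
  allocate(pa(nbatch), pb(nbatch), pc(nbatch))
  allocate(pa_d(nbatch), pb_d(nbatch), pc_d(nbatch))
  a_d = 1.0; b_d = 2.0; c_d = 0.0
  do i = 1, nbatch
    pa(i) = c_devloc(a_d(1,1,i))       ! device address of the i-th matrix in the batch
    pb(i) = c_devloc(b_d(1,1,i))
    pc(i) = c_devloc(c_d(1,1,i))
  end do
  pa_d = pa; pb_d = pb; pc_d = pc      ! the pointer arrays themselves must reside in device memory
  istat = cublasCreate(h)
  istat = cublasSgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, alpha, &
                             pa_d, n, pb_d, n, beta, pc_d, n, nbatch)
  istat = cublasDestroy(h)
end program sgemm_batched_example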
2.2.39. cublasSgelsBatched
SGELS solves overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided:
1. If TRANS = ‘N’ and m >= n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A*X ||.
2. If TRANS = ‘N’ and m < n: find the minimum norm solution of an underdetermined system A * X = B.
3. If TRANS = ‘T’ and m >= n: find the minimum norm solution of an underdetermined system A**T * X = B.
4. If TRANS = ‘T’ and m < n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A**T * X ||.
Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the M-by-NRHS right hand side matrix B and the N-by-NRHS solution matrix X.
integer(4) function cublasSgelsBatched(h, trans, m, n, nrhs, Aarray, lda, Carray, ldc, info, devinfo, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: info(*)
integer, device :: devinfo(*)
integer :: batchCount
2.2.40. cublasSgeqrfBatched
SGEQRF computes a QR factorization of a real M-by-N matrix A: A = Q * R.
integer(4) function cublasSgeqrfBatched(h, m, n, Aarray, lda, Tau, info, batchCount)
type(cublasHandle) :: h
integer :: m, n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Tau(*)
integer :: info(*)
integer :: batchCount
2.2.41. cublasSgetrfBatched
SGETRF computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges. The factorization has the form A = P * L * U where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is the right-looking Level 3 BLAS version of the algorithm.
integer(4) function cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
integer, device :: info(*)
integer :: batchCount
2.2.42. cublasSgetriBatched
SGETRI computes the inverse of a matrix using the LU factorization computed by SGETRF. This method inverts U and then computes inv(A) by solving the system inv(A)*L = inv(U) for inv(A).
integer(4) function cublasSgetriBatched(h, n, Aarray, lda, ipvt, Carray, ldc, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Carray(*)
integer :: ldc
integer, device :: info(*)
integer :: batchCount
2.2.43. cublasSgetrsBatched
SGETRS solves a system of linear equations A * X = B or A**T * X = B with a general N-by-N matrix A using the LU factorization computed by SGETRF.
integer(4) function cublasSgetrsBatched(h, trans, n, nrhs, Aarray, lda, ipvt, Barray, ldb, info, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Barray(*)
integer :: ldb
integer :: info(*)
integer :: batchCount
2.2.44. cublasSmatinvBatched
cublasSmatinvBatched is a shortcut for calling cublasSgetrfBatched followed by cublasSgetriBatched. However, it only works if n is less than 32. If n is 32 or larger, the user must call cublasSgetrfBatched and cublasSgetriBatched instead.
integer(4) function cublasSmatinvBatched(h, n, Aarray, lda, Ainv, lda_inv, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Ainv(*)
integer :: lda_inv
integer, device :: info(*)
integer :: batchCount
2.2.45. cublasStrsmBatched
STRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
integer(4) function cublasStrsmBatched( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side ! integer or character(1) variable
integer :: uplo ! integer or character(1) variable
integer :: trans ! integer or character(1) variable
integer :: diag ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
integer(4) function cublasStrsmBatched_v2( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side
integer :: uplo
integer :: trans
integer :: diag
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
2.2.46. cublasSgemvStridedBatched
SGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasSgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
real(4), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(4), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasSgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
real(4), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(4), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
2.2.47. cublasSgemmStridedBatched
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasSgemmStridedBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(4), device :: alpha ! device or host variable
real(4), device :: Aarray(*)
integer :: lda
integer :: strideA
real(4), device :: Barray(*)
integer :: ldb
integer :: strideB
real(4), device :: beta ! device or host variable
real(4), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
integer(4) function cublasSgemmStridedBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(4), device :: alpha ! device or host variable
real(4), device :: Aarray(*)
integer :: lda
integer :: strideA
real(4), device :: Barray(*)
integer :: ldb
integer :: strideB
real(4), device :: beta ! device or host variable
real(4), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
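When the matrices in a batch are stored contiguously, the strided form avoids the pointer arrays entirely; a minimal sketch with assumed sizes and data follows.
program sgemm_strided_batched_example
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 32, nbatch = 100     ! illustrative sizes
  real(4), allocatable, device :: a_d(:,:,:), b_d(:,:,:), c_d(:,:,:)
  real(4) :: alpha = 1.0, beta = 0.0
  integer :: stride, istat
  type(cublasHandle) :: h
  allocate(a_d(n,n,nbatch), b_d(n,n,nbatch), c_d(n,n,nbatch))
  a_d = 1.0; b_d = 2.0; c_d = 0.0
  stride = n * n                       ! elements between consecutive matrices in the batch
  istat = cublasCreate(h)
  istat = cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, alpha, &
          a_d, n, stride, b_d, n, stride, beta, c_d, n, stride, nbatch)
  istat = cublasDestroy(h)
end program sgemm_strided_batched_example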
2.3. Double Precision Functions and Subroutines
This section contains interfaces to the double precision BLAS and cuBLAS functions and subroutines.
2.3.1. idamax
IDAMAX finds the index of the element having the maximum absolute value.
integer(4) function idamax(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIdamax(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIdamax_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
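To illustrate the difference between the two calling styles, the sketch below (with assumed vector length and data) obtains the index first from the legacy function result and then through the res argument of the v2 form.
program idamax_example
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 1000       ! illustrative size
  real(8), allocatable :: x(:)
  real(8), allocatable, device :: x_d(:)
  type(cublasHandle) :: h
  integer :: imax, istat
  allocate(x(n), x_d(n))
  call random_number(x)
  x_d = x
  imax = idamax(n, x_d, 1)                      ! legacy form: the index is the function result
  print *, 'legacy idamax:', imax
  istat = cublasCreate(h)
  istat = cublasIdamax_v2(h, n, x_d, 1, imax)   ! v2 form: status is returned, index is written to res
  istat = cublasDestroy(h)
  print *, 'v2 idamax:    ', imax
end program idamax_example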
2.3.2. idamin
IDAMIN finds the index of the element having the minimum absolute value.
integer(4) function idamin(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIdamin(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIdamin_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.3.3. dasum
DASUM takes the sum of the absolute values.
real(8) function dasum(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
real(8) function cublasDasum(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDasum_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.3.4. daxpy
DAXPY computes a constant times a vector plus a vector.
subroutine daxpy(n, a, x, incx, y, incy)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDaxpy(n, a, x, incx, y, incy)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDaxpy_v2(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.5. dcopy
DCOPY copies a vector, x, to a vector, y.
subroutine dcopy(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDcopy(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDcopy_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.6. ddot
DDOT forms the dot product of two vectors.
real(8) function ddot(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
real(8) function cublasDdot(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDdot_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
real(8), device :: res ! device or host variable
2.3.7. dnrm2
DNRM2 returns the Euclidean norm of a vector via the function name, so that DNRM2 := sqrt( x'*x ).
real(8) function dnrm2(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
real(8) function cublasDnrm2(n, x, incx)
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDnrm2_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.3.8. drot
DROT applies a plane rotation.
subroutine drot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.9. drotg
DROTG constructs a Givens plane rotation.
subroutine drotg(sa, sb, sc, ss)
real(8), device :: sa, sb, sc, ss ! device or host variable
subroutine cublasDrotg(sa, sb, sc, ss)
real(8), device :: sa, sb, sc, ss ! device or host variable
integer(4) function cublasDrotg_v2(h, sa, sb, sc, ss)
type(cublasHandle) :: h
real(8), device :: sa, sb, sc, ss ! device or host variable
2.3.10. drotm
DROTM applies the modified Givens transformation, H, to the 2 by N matrix whose rows are DX**T and DY**T, where **T indicates transpose. The elements of DX are in DX(LX+I*INCX), I = 0 to N-1, where LX = 1 if INCX .GE. 0, else LX = (-INCX)*N, and similarly for DY using LY and INCY. With DPARAM(1)=DFLAG, H has one of the following forms (rows separated by ';'):
- DFLAG=-1.D0: H = ( DH11  DH12 ; DH21  DH22 )
- DFLAG= 0.D0: H = ( 1.D0  DH12 ; DH21  1.D0 )
- DFLAG= 1.D0: H = ( DH11  1.D0 ; -1.D0 DH22 )
- DFLAG=-2.D0: H = ( 1.D0  0.D0 ; 0.D0  1.D0 )
See DROTMG for a description of data storage in DPARAM.
subroutine drotm(n, x, incx, y, incy, param)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
real(8), device :: param(*) ! device or host variable
subroutine cublasDrotm(n, x, incx, y, incy, param)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
real(8), device :: param(*) ! device or host variable
integer(4) function cublasDrotm_v2(h, n, x, incx, y, incy, param)
type(cublasHandle) :: h
integer :: n
real(8), device :: param(*) ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.11. drotmg
DROTMG constructs the modified Givens transformation matrix H which zeros the second component of the 2-vector (SQRT(DD1)*DX1, SQRT(DD2)*DY2)**T. With DPARAM(1)=DFLAG, H has one of the following forms (rows separated by ';'):
- DFLAG=-1.D0: H = ( DH11  DH12 ; DH21  DH22 )
- DFLAG= 0.D0: H = ( 1.D0  DH12 ; DH21  1.D0 )
- DFLAG= 1.D0: H = ( DH11  1.D0 ; -1.D0 DH22 )
- DFLAG=-2.D0: H = ( 1.D0  0.D0 ; 0.D0  1.D0 )
Locations 2-4 of DPARAM contain DH11, DH21, DH12, and DH22 respectively. (Values of 1.D0, -1.D0, or 0.D0 implied by the value of DPARAM(1) are not stored in DPARAM.)
subroutine drotmg(d1, d2, x1, y1, param)
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
subroutine cublasDrotmg(d1, d2, x1, y1, param)
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
integer(4) function cublasDrotmg_v2(h, d1, d2, x1, y1, param)
type(cublasHandle) :: h
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
2.3.12. dscal
DSCAL scales a vector by a constant.
subroutine dscal(n, a, x, incx)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasDscal(n, a, x, incx)
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x
integer :: incx
2.3.13. dswap
DSWAP interchanges two vectors.
subroutine dswap(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasDswap(n, x, incx, y, incy)
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasDswap_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.3.14. dgbmv
DGBMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n band matrix, with kl sub-diagonals and ku super-diagonals.
subroutine dgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDgbmv_v2(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.15. dgemv
DGEMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
subroutine dgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDgemv_v2(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
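These interfaces can also be used on OpenACC device data; the sketch below, with assumed array shapes, passes the device copies of a, x and y to cublasDgemv through a host_data region.
subroutine acc_dgemv(m, n, a, x, y)
  use cublas
  implicit none
  integer :: m, n
  real(8) :: a(m,n), x(n), y(m)
  real(8) :: alpha, beta
  alpha = 1.0d0; beta = 0.0d0
  !$acc data copyin(a, x) copy(y)
  !$acc host_data use_device(a, x, y)
  ! Inside the host_data region, a, x and y refer to their device copies
  call cublasDgemv('N', m, n, alpha, a, m, x, 1, beta, y, 1)
  !$acc end host_data
  !$acc end data
end subroutine acc_dgemv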
2.3.16. dger
DGER performs the rank 1 operation A := alpha*x*y**T + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine dger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDger(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
integer(4) function cublasDger_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
2.3.17. dsbmv
DSBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric band matrix, with k super-diagonals.
subroutine dsbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsbmv(t, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsbmv_v2(h, t, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.18. dspmv
DSPMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine dspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDspmv(t, n, alpha, a, x, incx, beta, y, incy)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDspmv_v2(h, t, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha, beta ! device or host variable
2.3.19. dspr
DSPR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix, supplied in packed form.
subroutine dspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDspr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
integer(4) function cublasDspr_v2(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.3.20. dspr2
DSPR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix, supplied in packed form.
subroutine dspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDspr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha ! device or host variable
integer(4) function cublasDspr2_v2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha ! device or host variable
2.3.21. dsymv
DSYMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine dsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsymv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.3.22. dsyr
DSYR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a real scalar, x is an n element vector and A is an n by n symmetric matrix.
subroutine dsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
real(8), device :: alpha ! device or host variable
integer(4) function cublasDsyr_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
real(8), device :: alpha ! device or host variable
2.3.23. dsyr2
DSYR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a scalar, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine dsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x, y ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
integer(4) function cublasDsyr2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
2.3.24. dtbmv
DTBMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals.
subroutine dtbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtbmv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.25. dtbsv
DTBSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine dtbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtbsv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.26. dtpmv
DTPMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
subroutine dtpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x ! device or host variable
subroutine cublasDtpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
integer(4) function cublasDtpmv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
2.3.27. dtpsv
DTPSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine dtpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x ! device or host variable
subroutine cublasDtpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
integer(4) function cublasDtpsv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
2.3.28. dtrmv
DTRMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
subroutine dtrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtrmv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.29. dtrsv
DTRSV solves one of the systems of equations A*x = b, or A**T*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine dtrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(*) :: x ! device or host variable
subroutine cublasDtrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
integer(4) function cublasDtrsv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.3.30. dgemm
DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
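A minimal sketch of the handle-based v2 call follows, using the CUBLAS_OP_* constants from the cublas module; the sizes and data are illustrative assumptions.
program dgemm_v2_example
  use cublas
  use cudafor
  implicit none
  integer, parameter :: m = 64, n = 64, k = 64   ! illustrative sizes
  real(8), allocatable, device :: a_d(:,:), b_d(:,:), c_d(:,:)
  real(8) :: alpha = 1.0d0, beta = 0.0d0
  type(cublasHandle) :: h
  integer :: istat
  allocate(a_d(m,k), b_d(k,n), c_d(m,n))
  a_d = 1.0d0; b_d = 2.0d0; c_d = 0.0d0
  istat = cublasCreate(h)              ! create the cuBLAS context
  istat = cublasDgemm_v2(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &
                         alpha, a_d, m, b_d, k, beta, c_d, m)
  istat = cublasDestroy(h)             ! release the context
end program dgemm_v2_example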
2.3.31. dsymm
DSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
subroutine dsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsymm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.32. dsyrk
DSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine dsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsyrk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.33. dsyr2k
DSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine dsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsyr2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.34. dsyrkx
DSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is chosen such that the result is guaranteed to be symmetric. See the cuBLAS documentation for more details.
subroutine dsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasDsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasDsyrkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.3.35. dtrmm
DTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ), where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T.
subroutine dtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
integer(4) function cublasDtrmm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha ! device or host variable
2.3.36. dtrsm
DTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
subroutine dtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a ! device or host variable
real(8), device, dimension(ldb, *) :: b ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasDtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
integer(4) function cublasDtrsm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
2.3.37. cublasDgemvBatched
DGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasDgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(8), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasDgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(8), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.3.38. cublasDgemmBatched
DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasDgemmBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(8), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasDgemmBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(8), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
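The pointer-array arguments are easiest to build with c_devloc from the cudafor module. The sketch below is illustrative (matrix sizes, batch count, and fill values are made up); it assumes the cublasCreate/cublasDestroy helpers and the CUBLAS_OP_N constant from the cublas module.
program test_dgemm_batched
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 16, batch = 100
  real(8), device, allocatable :: a_d(:,:,:), b_d(:,:,:), c_d(:,:,:)
  type(c_devptr) :: pa(batch), pb(batch), pc(batch)            ! built on the host
  type(c_devptr), device :: pa_d(batch), pb_d(batch), pc_d(batch)
  type(cublasHandle) :: h
  real(8) :: alpha, beta
  integer :: i, istat
  allocate(a_d(n,n,batch), b_d(n,n,batch), c_d(n,n,batch))
  a_d = 1.0d0; b_d = 2.0d0; c_d = 0.0d0
  alpha = 1.0d0; beta = 0.0d0
  do i = 1, batch
    pa(i) = c_devloc(a_d(1,1,i))   ! device address of the i-th matrix in the batch
    pb(i) = c_devloc(b_d(1,1,i))
    pc(i) = c_devloc(c_d(1,1,i))
  end do
  pa_d = pa; pb_d = pb; pc_d = pc  ! the pointer arrays themselves reside on the device
  istat = cublasCreate(h)
  istat = cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, alpha, &
                             pa_d, n, pb_d, n, beta, pc_d, n, batch)
  istat = cublasDestroy(h)
end program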
2.3.39. cublasDgelsBatched
DGELS solves overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided: 1. If TRANS = ‘N’ and m >= n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A*X ||. 2. If TRANS = ‘N’ and m < n: find the minimum norm solution of an underdetermined system A * X = B. 3. If TRANS = ‘T’ and m >= n: find the minimum norm solution of an underdetermined system A**T * X = B. 4. If TRANS = ‘T’ and m < n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A**T * X ||. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the M-by-NRHS right hand side matrix B and the N-by-NRHS solution matrix X.
integer(4) function cublasDgelsBatched(h, trans, m, n, nrhs, Aarray, lda, Carray, ldc, info, devinfo, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: info(*)
integer, device :: devinfo(*)
integer :: batchCount
2.3.40. cublasDgeqrfBatched
DGEQRF computes a QR factorization of a real M-by-N matrix A: A = Q * R.
integer(4) function cublasDgeqrfBatched(h, m, n, Aarray, lda, Tau, info, batchCount)
type(cublasHandle) :: h
integer :: m, n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Tau(*)
integer :: info(*)
integer :: batchCount
2.3.41. cublasDgetrfBatched
DGETRF computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges. The factorization has the form A = P * L * U where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is the right-looking Level 3 BLAS version of the algorithm.
integer(4) function cublasDgetrfBatched(h, n, Aarray, lda, ipvt, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
integer, device :: info(*)
integer :: batchCount
2.3.42. cublasDgetriBatched
DGETRI computes the inverse of a matrix using the LU factorization computed by DGETRF. This method inverts U and then computes inv(A) by solving the system inv(A)*L = inv(U) for inv(A).
integer(4) function cublasDgetriBatched(h, n, Aarray, lda, ipvt, Carray, ldc, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Carray(*)
integer :: ldc
integer, device :: info(*)
integer :: batchCount
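A common use of these two routines is to factor and then invert a batch of small matrices. The sketch below is illustrative (sizes and fill values are made up); note that the pivot array holds n entries per matrix and that the inverses are written to a separate batch of matrices.
program test_batched_inverse
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 8, batch = 50
  real(8) :: a(n,n)
  real(8), device, allocatable :: a_d(:,:,:), c_d(:,:,:)
  type(c_devptr) :: pa(batch), pc(batch)
  type(c_devptr), device :: pa_d(batch), pc_d(batch)
  integer, device :: ipvt_d(n*batch), info_d(batch)
  type(cublasHandle) :: h
  integer :: i, istat
  allocate(a_d(n,n,batch), c_d(n,n,batch))
  call random_number(a)
  do i = 1, n
    a(i,i) = a(i,i) + real(n,8)   ! keep the matrices comfortably nonsingular
  end do
  do i = 1, batch
    a_d(:,:,i) = a
    pa(i) = c_devloc(a_d(1,1,i))
    pc(i) = c_devloc(c_d(1,1,i))
  end do
  pa_d = pa; pc_d = pc
  istat = cublasCreate(h)
  ! LU factorization with partial pivoting; L and U overwrite each A
  istat = cublasDgetrfBatched(h, n, pa_d, n, ipvt_d, info_d, batch)
  ! Inverses computed from the LU factors into the C batch
  istat = cublasDgetriBatched(h, n, pa_d, n, ipvt_d, pc_d, n, info_d, batch)
  istat = cublasDestroy(h)
end program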
2.3.43. cublasDgetrsBatched
DGETRS solves a system of linear equations A * X = B or A**T * X = B with a general N-by-N matrix A using the LU factorization computed by DGETRF.
integer(4) function cublasDgetrsBatched(h, trans, n, nrhs, Aarray, lda, ipvt, Barray, ldb, info, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Barray(*)
integer :: ldb
integer :: info(*)
integer :: batchCount
2.3.44. cublasDmatinvBatched
cublasDmatinvBatched is a shortcut that combines cublasDgetrfBatched and cublasDgetriBatched. It works only when n is less than 32; for larger n, the user must call cublasDgetrfBatched followed by cublasDgetriBatched.
integer(4) function cublasDmatinvBatched(h, n, Aarray, lda, Ainv, lda_inv, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Ainv(*)
integer :: lda_inv
integer, device :: info(*)
integer :: batchCount
2.3.45. cublasDtrsmBatched
DTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
integer(4) function cublasDtrsmBatched( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side ! integer or character(1) variable
integer :: uplo ! integer or character(1) variable
integer :: trans ! integer or character(1) variable
integer :: diag ! integer or character(1) variable
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
integer(4) function cublasDtrsmBatched_v2( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side
integer :: uplo
integer :: trans
integer :: diag
integer :: m, n
real(8), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
2.3.46. cublasDgemvStridedBatched
DGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasDgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(8), device :: alpha ! device or host variable
real(8), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(8), device :: X(*)
integer :: incx
integer(8) :: strideX
real(8), device :: beta ! device or host variable
real(8), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasDgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(8), device :: alpha ! device or host variable
real(8), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(8), device :: X(*)
integer :: incx
integer(8) :: strideX
real(8), device :: beta ! device or host variable
real(8), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
2.3.47. cublasDgemmStridedBatched
DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasDgemmStridedBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(8), device :: alpha ! device or host variable
real(8), device :: Aarray(*)
integer :: lda
integer :: strideA
real(8), device :: Barray(*)
integer :: ldb
integer :: strideB
real(8), device :: beta ! device or host variable
real(8), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
integer(4) function cublasDgemmStridedBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(8), device :: alpha ! device or host variable
real(8), device :: Aarray(*)
integer :: lda
integer :: strideA
real(8), device :: Barray(*)
integer :: ldb
integer :: strideB
real(8), device :: beta ! device or host variable
real(8), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
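When the matrices of a batch are stored contiguously at a fixed stride, the strided form avoids building pointer arrays altogether. The following is a minimal illustrative sketch (sizes and fill values are made up), assuming the cublasCreate/cublasDestroy helpers and the CUBLAS_OP_N constant from the cublas module.
program test_dgemm_strided
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 16, batch = 100
  real(8), device, allocatable :: a_d(:,:,:), b_d(:,:,:), c_d(:,:,:)
  type(cublasHandle) :: h
  real(8) :: alpha, beta
  integer :: istat
  allocate(a_d(n,n,batch), b_d(n,n,batch), c_d(n,n,batch))
  a_d = 1.0d0; b_d = 2.0d0; c_d = 0.0d0
  alpha = 1.0d0; beta = 0.0d0
  istat = cublasCreate(h)
  ! The i-th matrix of each batch starts n*n elements after the previous one
  istat = cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, alpha, &
                                    a_d, n, n*n, b_d, n, n*n, beta, c_d, n, n*n, batch)
  istat = cublasDestroy(h)
end program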
2.4. Single Precision Complex Functions and Subroutines
This section contains interfaces to the single precision complex BLAS and cuBLAS functions and subroutines.
2.4.1. icamax
ICAMAX finds the index of the element having the maximum absolute value.
integer(4) function icamax(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIcamax(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIcamax_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
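The legacy form returns the index as the function result, while the v2 form stores it through the res argument (a host variable in the default pointer mode). A small illustrative sketch, with made-up vector contents:
program test_icamax
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 1000
  real(4) :: re(n), im(n)
  complex(4) :: x(n)
  complex(4), device :: x_d(n)
  type(cublasHandle) :: h
  integer :: idx, istat
  call random_number(re); call random_number(im)
  x = cmplx(re, im)
  x_d = x
  idx = icamax(n, x_d, 1)                     ! legacy: index returned as the function value
  istat = cublasCreate(h)
  istat = cublasIcamax_v2(h, n, x_d, 1, idx)  ! v2: index written into the res argument
  istat = cublasDestroy(h)
end program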
2.4.2. icamin
ICAMIN finds the index of the element having the minimum absolute value.
integer(4) function icamin(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIcamin(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIcamin_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.4.3. scasum
SCASUM takes the sum of the absolute values of a complex vector and returns a single precision result.
real(4) function scasum(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x ! device or host variable
integer :: incx
real(4) function cublasScasum(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasScasum_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.4.4. caxpy
CAXPY constant times a vector plus a vector.
subroutine caxpy(n, a, x, incx, y, incy)
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasCaxpy(n, a, x, incx, y, incy)
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCaxpy_v2(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.4.5. ccopy
CCOPY copies a vector x to a vector y.
subroutine ccopy(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasCcopy(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCcopy_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.4.6. cdotc
CDOTC forms the dot product of two vectors, conjugating the first vector.
complex(4) function cdotc(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
complex(4) function cublasCdotc(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCdotc_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
complex(4), device :: res ! device or host variable
2.4.7. cdotu
CDOTU forms the dot product of two vectors.
complex(4) function cdotu(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
complex(4) function cublasCdotu(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCdotu_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
complex(4), device :: res ! device or host variable
2.4.8. scnrm2
SCNRM2 returns the Euclidean norm of a vector via the function name, so that SCNRM2 := sqrt( x**H*x ).
real(4) function scnrm2(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x ! device or host variable
integer :: incx
real(4) function cublasScnrm2(n, x, incx)
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasScnrm2_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.4.9. crot
CROT applies a plane rotation, where the cos (C) is real and the sin (S) is complex, and the vectors CX and CY are complex.
subroutine crot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc ! device or host variable
complex(4), device :: ss ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasCrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc ! device or host variable
complex(4), device :: ss ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc ! device or host variable
complex(4), device :: ss ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.4.10. csrot
CSROT applies a plane rotation, where the cos and sin (c and s) are real and the vectors cx and cy are complex.
subroutine csrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc, ss ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasCsrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(4), device :: sc, ss ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCsrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc, ss ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.4.11. crotg
CROTG determines a complex Givens rotation.
subroutine crotg(sa, sb, sc, ss)
complex(4), device :: sa, sb, ss ! device or host variable
real(4), device :: sc ! device or host variable
subroutine cublasCrotg(sa, sb, sc, ss)
complex(4), device :: sa, sb, ss ! device or host variable
real(4), device :: sc ! device or host variable
integer(4) function cublasCrotg_v2(h, sa, sb, sc, ss)
type(cublasHandle) :: h
complex(4), device :: sa, sb, ss ! device or host variable
real(4), device :: sc ! device or host variable
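CROTG and CROT are typically used together: the rotation is generated from two scalars and then applied to a pair of vectors. A small illustrative sketch with device vectors and host scalars (values are made up):
program test_crot
  use cublas
  implicit none
  integer, parameter :: n = 256
  complex(4), device :: x_d(n), y_d(n)
  complex(4) :: sa, sb, ss
  real(4) :: sc
  x_d = (1.0, 0.0)
  y_d = (0.0, 1.0)
  sa = (3.0, 0.0); sb = (4.0, 0.0)
  call crotg(sa, sb, sc, ss)             ! generate the Givens rotation (c, s)
  call crot(n, x_d, 1, y_d, 1, sc, ss)   ! apply it to the device vectors
end program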
2.4.12. cscal
CSCAL scales a vector by a constant.
subroutine cscal(n, a, x, incx)
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasCscal(n, a, x, incx)
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasCscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x
integer :: incx
2.4.13. csscal
CSSCAL scales a complex vector by a real constant.
subroutine csscal(n, a, x, incx)
integer :: n
real(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasCsscal(n, a, x, incx)
integer :: n
real(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x
integer :: incx
integer(4) function cublasCsscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x
integer :: incx
2.4.14. cswap
CSWAP interchanges two vectors.
subroutine cswap(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasCswap(n, x, incx, y, incy)
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasCswap_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.4.15. cgbmv
CGBMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, or y := alpha*A**H*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n band matrix, with kl sub-diagonals and ku super-diagonals.
subroutine cgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCgbmv_v2(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.4.16. cgemv
CGEMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, or y := alpha*A**H*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
subroutine cgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCgemv_v2(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.4.17. cgerc
CGERC performs the rank 1 operation A := alpha*x*y**H + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine cgerc(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCgerc(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCgerc_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
2.4.18. cgeru
CGERU performs the rank 1 operation A := alpha*x*y**T + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine cgeru(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCgeru(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCgeru_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
2.4.19. csymv
CSYMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine csymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCsymv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.4.20. csyr
CSYR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a complex scalar, x is an n element vector and A is an n by n symmetric matrix.
subroutine csyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCsyr_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
complex(4), device :: alpha ! device or host variable
2.4.21. csyr2
CSYR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a complex scalar, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine csyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCsyr2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
2.4.22. ctbmv
CTBMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, or x := A**H*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals.
subroutine ctbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
subroutine cublasCtbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
integer(4) function cublasCtbmv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.4.23. ctbsv
CTBSV solves one of the systems of equations A*x = b, or A**T*x = b, or A**H*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine ctbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
subroutine cublasCtbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
integer(4) function cublasCtbsv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.4.24. ctpmv
CTPMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, or x := A**H*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
subroutine ctpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x ! device or host variable
subroutine cublasCtpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x
integer(4) function cublasCtpmv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x
2.4.25. ctpsv
CTPSV solves one of the systems of equations A*x = b, or A**T*x = b, or A**H*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine ctpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x ! device or host variable
subroutine cublasCtpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x
integer(4) function cublasCtpsv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x
2.4.26. ctrmv
CTRMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, or x := A**H*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
subroutine ctrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
subroutine cublasCtrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
integer(4) function cublasCtrmv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.4.27. ctrsv
CTRSV solves one of the systems of equations A*x = b, or A**T*x = b, or A**H*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine ctrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x ! device or host variable
subroutine cublasCtrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
integer(4) function cublasCtrsv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.4.28. chbmv
CHBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian band matrix, with k super-diagonals.
subroutine chbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: k, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasChbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: k, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasChbmv_v2(h, uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: k, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.4.29. chemv
CHEMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian matrix.
subroutine chemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(*) :: x, y ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasChemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasChemv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.4.30. chpmv
CHPMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian matrix, supplied in packed form.
subroutine chpmv(uplo, n, alpha, a, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasChpmv(uplo, n, alpha, a, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasChpmv_v2(h, uplo, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha, beta ! device or host variable
2.4.31. cher
CHER performs the hermitian rank 1 operation A := alpha*x*x**H + A, where alpha is a real scalar, x is an n element vector and A is an n by n hermitian matrix.
subroutine cher(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(4), device, dimension(*) :: a, x ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasCher(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
integer(4) function cublasCher_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.4.32. cher2
CHER2 performs the hermitian rank 2 operation A := alpha*x*y**H + conjg( alpha )*y*x**H + A, where alpha is a scalar, x and y are n element vectors and A is an n by n hermitian matrix.
subroutine cher2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(*) :: a, x, y ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCher2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCher2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha ! device or host variable
2.4.33. chpr
CHPR performs the hermitian rank 1 operation A := alpha*x*x**H + A, where alpha is a real scalar, x is an n element vector and A is an n by n hermitian matrix, supplied in packed form.
subroutine chpr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
complex(4), device, dimension(*) :: a, x ! device or host variable
real(4), device :: alpha ! device or host variable
subroutine cublasChpr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
complex(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
integer(4) function cublasChpr_v2(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
complex(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.4.34. chpr2
CHPR2 performs the hermitian rank 2 operation A := alpha*x*y**H + conjg( alpha )*y*x**H + A, where alpha is a scalar, x and y are n element vectors and A is an n by n hermitian matrix, supplied in packed form.
subroutine chpr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasChpr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha ! device or host variable
integer(4) function cublasChpr2_v2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha ! device or host variable
2.4.35. cgemm
CGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine cgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
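From OpenACC code, the legacy-style overloads can be called on data that is already present on the device by passing device addresses through host_data. The subroutine below is an illustrative sketch (names and data layout are made up), assuming the data fits on the device.
subroutine acc_cgemm(m, n, k, a, b, c)
  use cublas
  implicit none
  integer :: m, n, k
  complex(4) :: a(m,k), b(k,n), c(m,n)
  complex(4) :: alpha, beta
  alpha = (1.0, 0.0); beta = (0.0, 0.0)
  !$acc data copyin(a, b) copy(c)
  !$acc host_data use_device(a, b, c)
  ! Device addresses of a, b, c are passed to the cuBLAS version of cgemm
  call cgemm('N', 'N', m, n, k, alpha, a, m, b, k, beta, c, m)
  !$acc end host_data
  !$acc end data
end subroutine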
2.4.36. csymm
CSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
subroutine csymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCsymm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.4.37. csyrk
CSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine csyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCsyrk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.4.38. csyr2k
CSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine csyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCsyr2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.4.39. csyrkx
CSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is related to A in such a way that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
subroutine csyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasCsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCsyrkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.4.40. ctrmm
CTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ) where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H.
subroutine ctrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCtrmm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
2.4.41. ctrsm
CTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H. The matrix X is overwritten on B.
subroutine ctrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device :: alpha ! device or host variable
subroutine cublasCtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device :: alpha ! device or host variable
integer(4) function cublasCtrsm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device :: alpha ! device or host variable
2.4.42. chemm
CHEMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an hermitian matrix and B and C are m by n matrices.
subroutine chemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha, beta ! device or host variable
subroutine cublasChemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
integer(4) function cublasChemm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.4.43. cherk
CHERK performs one of the hermitian rank k operations C := alpha*A*A**H + beta*C, or C := alpha*A**H*A + beta*C, where alpha and beta are real scalars, C is an n by n hermitian matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine cherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
real(4), device :: alpha, beta ! device or host variable
subroutine cublasCherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
integer(4) function cublasCherk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.4.44. cher2k
CHER2K performs one of the hermitian rank 2k operations C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C, or C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C, where alpha and beta are scalars with beta real, C is an n by n hermitian matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine cher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
subroutine cublasCher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
integer(4) function cublasCher2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
2.4.45. cherkx
CHERKX performs a variation of the hermitian rank k operations C := alpha*A*B**H + beta*C, where alpha is a complex scalar and beta is a real scalar, C is an n by n hermitian matrix stored in lower or upper mode, and A and B are n by k matrices. See the CUBLAS documentation for more details.
subroutine cherkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a ! device or host variable
complex(4), device, dimension(ldb, *) :: b ! device or host variable
complex(4), device, dimension(ldc, *) :: c ! device or host variable
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
subroutine cublasCherkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
integer(4) function cublasCherkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
2.4.46. cublasCgemvBatched
CGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasCgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
complex(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
complex(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasCgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
complex(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
complex(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.4.47. cublasCgemmBatched
CGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasCgemmBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
complex(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
complex(4), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasCgemmBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
complex(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
complex(4), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
2.4.48. cublasCgelsBatched
CGELS solves overdetermined or underdetermined complex linear systems involving an M-by-N matrix A, or its conjugate-transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided: 1. If TRANS = ‘N’ and m >= n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A*X ||. 2. If TRANS = ‘N’ and m < n: find the minimum norm solution of an underdetermined system A * X = B. 3. If TRANS = ‘C’ and m >= n: find the minimum norm solution of an underdetermined system A**H * X = B. 4. If TRANS = ‘C’ and m < n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A**H * X ||. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the M-by-NRHS right hand side matrix B and the N-by-NRHS solution matrix X.
integer(4) function cublasCgelsBatched(h, trans, m, n, nrhs, Aarray, lda, Carray, ldc, info, devinfo, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: info(*)
integer, device :: devinfo(*)
integer :: batchCount
2.4.49. cublasCgeqrfBatched
CGEQRF computes a QR factorization of a complex M-by-N matrix A: A = Q * R.
integer(4) function cublasCgeqrfBatched(h, m, n, Aarray, lda, Tau, info, batchCount)
type(cublasHandle) :: h
integer :: m, n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Tau(*)
integer :: info(*)
integer :: batchCount
2.4.50. cublasCgetrfBatched
CGETRF computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges. The factorization has the form A = P * L * U where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is the right-looking Level 3 BLAS version of the algorithm.
integer(4) function cublasCgetrfBatched(h, n, Aarray, lda, ipvt, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
integer, device :: info(*)
integer :: batchCount
2.4.51. cublasCgetriBatched
CGETRI computes the inverse of a matrix using the LU factorization computed by CGETRF. This method inverts U and then computes inv(A) by solving the system inv(A)*L = inv(U) for inv(A).
integer(4) function cublasCgetriBatched(h, n, Aarray, lda, ipvt, Carray, ldc, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Carray(*)
integer :: ldc
integer, device :: info(*)
integer :: batchCount
2.4.52. cublasCgetrsBatched
CGETRS solves a system of linear equations A * X = B, A**T * X = B, or A**H * X = B with a general N-by-N matrix A using the LU factorization computed by CGETRF.
integer(4) function cublasCgetrsBatched(h, trans, n, nrhs, Aarray, lda, ipvt, Barray, ldb, info, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Barray(*)
integer :: ldb
integer :: info(*)
integer :: batchCount
2.4.53. cublasCmatinvBatched
cublasCmatinvBatched is a shortcut that combines cublasCgetrfBatched and cublasCgetriBatched. It works only when n is less than 32; for larger n, the user must call cublasCgetrfBatched followed by cublasCgetriBatched.
integer(4) function cublasCmatinvBatched(h, n, Aarray, lda, Ainv, lda_inv, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Ainv(*)
integer :: lda_inv
integer, device :: info(*)
integer :: batchCount
2.4.54. cublasCtrsmBatched
CTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H. The matrix X is overwritten on B.
integer(4) function cublasCtrsmBatched( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side ! integer or character(1) variable
integer :: uplo ! integer or character(1) variable
integer :: trans ! integer or character(1) variable
integer :: diag ! integer or character(1) variable
integer :: m, n
complex(4), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
integer(4) function cublasCtrsmBatched_v2( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side
integer :: uplo
integer :: trans
integer :: diag
integer :: m, n
complex(4), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
2.4.55. cublasCgemvStridedBatched
CGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasCgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
complex(4), device :: alpha ! device or host variable
complex(4), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
complex(4), device :: X(*)
integer :: incx
integer(8) :: strideX
complex(4), device :: beta ! device or host variable
complex(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasCgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
complex(4), device :: alpha ! device or host variable
complex(4), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
complex(4), device :: X(*)
integer :: incx
integer(8) :: strideX
complex(4), device :: beta ! device or host variable
complex(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
2.4.56. cublasCgemmStridedBatched
CGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasCgemmStridedBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
complex(4), device :: alpha ! device or host variable
complex(4), device :: Aarray(*)
integer :: lda
integer :: strideA
complex(4), device :: Barray(*)
integer :: ldb
integer :: strideB
complex(4), device :: beta ! device or host variable
complex(4), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
integer(4) function cublasCgemmStridedBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
complex(4), device :: alpha ! device or host variable
complex(4), device :: Aarray(*)
integer :: lda
integer :: strideA
complex(4), device :: Barray(*)
integer :: ldb
integer :: strideB
complex(4), device :: beta ! device or host variable
complex(4), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
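The strided batched routines avoid pointer arrays when the matrices of the batch are stored at a constant spacing, for instance as the planes of a contiguous 3-D array. A minimal sketch, not taken from the manual and assuming the cublas and cudafor modules, follows:
program gemm_strided_batched
  use cudafor
  use cublas
  implicit none
  integer, parameter :: m = 32, n = 32, k = 32, nbatch = 16
  complex(4), device, allocatable :: A(:,:,:), B(:,:,:), C(:,:,:)
  complex(4) :: alpha, beta, c11
  type(cublasHandle) :: h
  integer :: istat

  allocate(A(m,k,nbatch), B(k,n,nbatch), C(m,n,nbatch))
  A = (1.0,0.0); B = (0.5,0.0); C = (0.0,0.0)
  alpha = (1.0,0.0); beta = (0.0,0.0)

  istat = cublasCreate(h)
  ! The strides are the element counts separating consecutive matrices
  istat = cublasCgemmStridedBatched(h, 'N', 'N', m, n, k, alpha, &
            A, m, m*k, B, k, k*n, beta, C, m, m*n, nbatch)
  istat = cublasDestroy(h)
  c11 = C(1,1,1)
  print *, 'C(1,1,1) = ', c11          ! expect (16.0, 0.0) for these data
end program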
2.5. Double Precision Complex Functions and Subroutines
This section contains interfaces to the double precision complex BLAS and cuBLAS functions and subroutines.
2.5.1. izamax
IZAMAX finds the index of the element having the maximum absolute value.
integer(4) function izamax(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIzamax(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIzamax_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
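As a quick illustration, a sketch assuming the cublas module and device-resident data, so that the call dispatches to the cuBLAS implementation, the legacy-style interface is invoked exactly like the host BLAS routine:
program find_max_element
  use cublas
  implicit none
  integer, parameter :: n = 1000
  complex(8) :: x_h(n)
  complex(8), device :: x_d(n)
  real(8) :: xr(n), xi(n)
  integer :: imax

  call random_number(xr); call random_number(xi)
  x_h = cmplx(xr, xi, kind=8)
  x_h(123) = (100.0d0, 100.0d0)      ! plant a known maximum
  x_d = x_h

  imax = izamax(n, x_d, 1)           ! executes on the device via cuBLAS
  print *, 'index of largest element: ', imax   ! expect 123
end program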
2.5.2. izamin
IZAMIN finds the index of the element having the minimum absolute value.
integer(4) function izamin(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x ! device or host variable
integer :: incx
integer(4) function cublasIzamin(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasIzamin_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.5.3. dzasum
DZASUM takes the sum of the absolute values of the real and imaginary parts of a complex vector and returns a double precision result.
real(8) function dzasum(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x ! device or host variable
integer :: incx
real(8) function cublasDzasum(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDzasum_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.5.4. zaxpy
ZAXPY constant times a vector plus a vector.
subroutine zaxpy(n, a, x, incx, y, incy)
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasZaxpy(n, a, x, incx, y, incy)
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZaxpy_v2(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.5.5. zcopy
ZCOPY copies a vector, x, to a vector, y.
subroutine zcopy(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasZcopy(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZcopy_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.5.6. zdotc
ZDOTC forms the dot product of two vectors, conjugating the first vector.
complex(8) function zdotc(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
complex(8) function cublasZdotc(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZdotc_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
complex(8), device :: res ! device or host variable
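A brief sketch of the v2-style call, assuming the cublas module and the default host pointer mode so that res may be a host variable:
program dotc_v2_example
  use cublas
  implicit none
  integer, parameter :: n = 4
  complex(8), device :: x_d(n), y_d(n)
  complex(8) :: res
  type(cublasHandle) :: h
  integer :: istat

  x_d = (1.0d0, 1.0d0)
  y_d = (2.0d0, 0.0d0)

  istat = cublasCreate(h)
  ! The result is returned through res rather than the function value
  istat = cublasZdotc_v2(h, n, x_d, 1, y_d, 1, res)
  istat = cublasDestroy(h)
  print *, 'x**H * y = ', res        ! expect (8.0, -8.0)
end program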
2.5.7. zdotu
ZDOTU forms the dot product of two vectors.
complex(8) function zdotu(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
complex(8) function cublasZdotu(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZdotu_v2(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
complex(8), device :: res ! device or host variable
2.5.8. dznrm2
DZNRM2 returns the euclidean norm of a vector via the function name, so that DZNRM2 := sqrt( x**H*x )
real(8) function dznrm2(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x ! device or host variable
integer :: incx
real(8) function cublasDznrm2(n, x, incx)
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasDznrm2_v2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.5.9. zrot
ZROT applies a plane rotation, where the cos (C) is real and the sin (S) is complex, and the vectors CX and CY are complex.
subroutine zrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc ! device or host variable
complex(8), device :: ss ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasZrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc ! device or host variable
complex(8), device :: ss ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc ! device or host variable
complex(8), device :: ss ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.5.10. zsrot
ZSROT applies a plane rotation, where the cos and sin (c and s) are real and the vectors cx and cy are complex.
subroutine zsrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc, ss ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasZsrot(n, x, incx, y, incy, sc, ss)
integer :: n
real(8), device :: sc, ss ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZsrot_v2(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc, ss ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.5.11. zrotg
ZROTG determines a double complex Givens rotation.
subroutine zrotg(sa, sb, sc, ss)
complex(8), device :: sa, sb, ss ! device or host variable
real(8), device :: sc ! device or host variable
subroutine cublasZrotg(sa, sb, sc, ss)
complex(8), device :: sa, sb, ss ! device or host variable
real(8), device :: sc ! device or host variable
integer(4) function cublasZrotg_v2(h, sa, sb, sc, ss)
type(cublasHandle) :: h
complex(8), device :: sa, sb, ss ! device or host variable
real(8), device :: sc ! device or host variable
2.5.12. zscal
ZSCAL scales a vector by a constant.
subroutine zscal(n, a, x, incx)
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasZscal(n, a, x, incx)
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasZscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x
integer :: incx
2.5.13. zdscal
ZDSCAL scales a vector by a constant.
subroutine zdscal(n, a, x, incx)
integer :: n
real(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
integer :: incx
subroutine cublasZdscal(n, a, x, incx)
integer :: n
real(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x
integer :: incx
integer(4) function cublasZdscal_v2(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x
integer :: incx
2.5.14. zswap
ZSWAP interchanges two vectors.
subroutine zswap(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y ! device or host variable
integer :: incx, incy
subroutine cublasZswap(n, x, incx, y, incy)
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
integer(4) function cublasZswap_v2(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.5.15. zgbmv
ZGBMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, or y := alpha*A**H*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n band matrix, with kl sub-diagonals and ku super-diagonals.
subroutine zgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZgbmv(t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZgbmv_v2(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.5.16. zgemv
ZGEMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, or y := alpha*A**H*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.
subroutine zgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZgemv(t, m, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: t
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZgemv_v2(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
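A short sketch, assuming the cublas module, of the legacy-style call operating on device data:
program zgemv_example
  use cublas
  implicit none
  integer, parameter :: m = 3, n = 2
  complex(8), device :: a_d(m,n), x_d(n), y_d(m)
  complex(8) :: alpha, beta, y_h(m)

  a_d = (1.0d0, 0.0d0)               ! m by n matrix of ones
  x_d = (1.0d0, 0.0d0)
  y_d = (0.0d0, 0.0d0)
  alpha = (2.0d0, 0.0d0)
  beta  = (0.0d0, 0.0d0)

  ! y := alpha*A*x + beta*y, dispatched to cuBLAS because the arrays
  ! are device resident
  call zgemv('N', m, n, alpha, a_d, m, x_d, 1, beta, y_d, 1)
  y_h = y_d
  print *, y_h                       ! each element should be (4.0, 0.0)
end program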
2.5.17. zgerc
ZGERC performs the rank 1 operation A := alpha*x*y**H + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine zgerc(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZgerc(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZgerc_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
2.5.18. zgeru
ZGERU performs the rank 1 operation A := alpha*x*y**T + A, where alpha is a scalar, x is an m element vector, y is an n element vector and A is an m by n matrix.
subroutine zgeru(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZgeru(m, n, alpha, x, incx, y, incy, a, lda)
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZgeru_v2(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
2.5.19. zsymv
ZSYMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine zsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZsymv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.5.20. zsyr
ZSYR performs the symmetric rank 1 operation A := alpha*x*x**T + A, where alpha is a complex scalar, x is an n element vector and A is an n by n symmetric matrix.
subroutine zsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZsyr(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZsyr_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
complex(8), device :: alpha ! device or host variable
2.5.21. zsyr2
ZSYR2 performs the symmetric rank 2 operation A := alpha*x*y**T + alpha*y*x**T + A, where alpha is a complex scalar, x and y are n element vectors and A is an n by n symmetric matrix.
subroutine zsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZsyr2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZsyr2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
2.5.22. ztbmv
ZTBMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, or x := A**H*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals.
subroutine ztbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
subroutine cublasZtbmv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
integer(4) function cublasZtbmv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.5.23. ztbsv
ZTBSV solves one of the systems of equations A*x = b, or A**T*x = b, or A**H*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular band matrix, with ( k + 1 ) diagonals. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine ztbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
subroutine cublasZtbsv(u, t, d, n, k, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
integer(4) function cublasZtbsv_v2(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.5.24. ztpmv
ZTPMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, or x := A**H*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
subroutine ztpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x ! device or host variable
subroutine cublasZtpmv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x
integer(4) function cublasZtpmv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x
2.5.25. ztpsv
ZTPSV solves one of the systems of equations A*x = b, or A**T*x = b, or A**H*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix, supplied in packed form. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine ztpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x ! device or host variable
subroutine cublasZtpsv(u, t, d, n, a, x, incx)
character*1 :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x
integer(4) function cublasZtpsv_v2(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x
2.5.26. ztrmv
ZTRMV performs one of the matrix-vector operations x := A*x, or x := A**T*x, or x := A**H*x, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
subroutine ztrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
subroutine cublasZtrmv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
integer(4) function cublasZtrmv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.5.27. ztrsv
ZTRSV solves one of the systems of equations A*x = b, or A**T*x = b, or A**H*x = b, where b and x are n element vectors and A is an n by n unit, or non-unit, upper or lower triangular matrix. No test for singularity or near-singularity is included in this routine. Such tests must be performed before calling this routine.
subroutine ztrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x ! device or host variable
subroutine cublasZtrsv(u, t, d, n, a, lda, x, incx)
character*1 :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
integer(4) function cublasZtrsv_v2(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.5.28. zhbmv
ZHBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian band matrix, with k super-diagonals.
subroutine zhbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: k, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZhbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: k, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZhbmv_v2(h, uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: k, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.5.29. zhemv
ZHEMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian matrix.
subroutine zhemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(*) :: x, y ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZhemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZhemv_v2(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.5.30. zhpmv
ZHPMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian matrix, supplied in packed form.
subroutine zhpmv(uplo, n, alpha, a, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZhpmv(uplo, n, alpha, a, x, incx, beta, y, incy)
character*1 :: uplo
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZhpmv_v2(h, uplo, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha, beta ! device or host variable
2.5.31. zher
ZHER performs the hermitian rank 1 operation A := alpha*x*x**H + A, where alpha is a real scalar, x is an n element vector and A is an n by n hermitian matrix.
subroutine zher(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(8), device, dimension(*) :: a, x ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasZher(t, n, alpha, x, incx, a, lda)
character*1 :: t
integer :: n, incx, lda
complex(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
integer(4) function cublasZher_v2(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.5.32. zher2
ZHER2 performs the hermitian rank 2 operation A := alpha*x*y**H + conjg( alpha )*y*x**H + A, where alpha is a scalar, x and y are n element vectors and A is an n by n hermitian matrix.
subroutine zher2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(*) :: a, x, y ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZher2(t, n, alpha, x, incx, y, incy, a, lda)
character*1 :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZher2_v2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha ! device or host variable
2.5.33. zhpr
ZHPR performs the hermitian rank 1 operation A := alpha*x*x**H + A, where alpha is a real scalar, x is an n element vector and A is an n by n hermitian matrix, supplied in packed form.
subroutine zhpr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
complex(8), device, dimension(*) :: a, x ! device or host variable
real(8), device :: alpha ! device or host variable
subroutine cublasZhpr(t, n, alpha, x, incx, a)
character*1 :: t
integer :: n, incx
complex(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
integer(4) function cublasZhpr_v2(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
complex(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.5.34. zhpr2
ZHPR2 performs the hermitian rank 2 operation A := alpha*x*y**H + conjg( alpha )*y*x**H + A, where alpha is a scalar, x and y are n element vectors and A is an n by n hermitian matrix, supplied in packed form.
subroutine zhpr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZhpr2(t, n, alpha, x, incx, y, incy, a)
character*1 :: t
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZhpr2_v2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha ! device or host variable
2.5.35. zgemm
ZGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine zgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZgemm_v2(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
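A minimal sketch of the v2-style call with an explicit handle, assuming the cublas module also provides the CUBLAS_OP_N operation constant:
program zgemm_v2_example
  use cublas
  implicit none
  integer, parameter :: m = 2, n = 2, k = 2
  complex(8), device :: a_d(m,k), b_d(k,n), c_d(m,n)
  complex(8) :: alpha, beta, c_h(m,n)
  type(cublasHandle) :: h
  integer :: istat

  a_d = (1.0d0, 0.0d0); b_d = (1.0d0, 0.0d0); c_d = (0.0d0, 0.0d0)
  alpha = (1.0d0, 0.0d0); beta = (0.0d0, 0.0d0)

  istat = cublasCreate(h)
  ! C := alpha*A*B + beta*C with no transposition of A or B
  istat = cublasZgemm_v2(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &
                         alpha, a_d, m, b_d, k, beta, c_d, m)
  istat = cublasDestroy(h)
  c_h = c_d
  print *, c_h                       ! every element should be (2.0, 0.0)
end program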
2.5.36. zsymm
ZSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
subroutine zsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZsymm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.5.37. zsyrk
ZSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine zsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZsyrk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.5.38. zsyr2k
ZSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine zsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZsyr2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.5.39. zsyrkx
ZSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is related to A in such a way that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
subroutine zsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZsyrkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZsyrkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.5.40. ztrmm
ZTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ) where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H.
subroutine ztrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZtrmm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
2.5.41. ztrsm
ZTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H. The matrix X is overwritten on B.
subroutine ztrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device :: alpha ! device or host variable
subroutine cublasZtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
character*1 :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device :: alpha ! device or host variable
integer(4) function cublasZtrsm_v2(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device :: alpha ! device or host variable
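As an illustration, a sketch assuming the cublas module, the legacy-style interface solving A*X = alpha*B in place for a small lower triangular A:
program ztrsm_example
  use cublas
  implicit none
  integer, parameter :: m = 2, n = 1
  complex(8) :: a_h(m,m), b_h(m,n), alpha
  complex(8), device :: a_d(m,m), b_d(m,n)

  ! Column-major lower triangular A = [ 2 0 ; 1 4 ] and B = [ 2 ; 9 ]
  a_h = reshape([(2.0d0,0.0d0), (1.0d0,0.0d0), &
                 (0.0d0,0.0d0), (4.0d0,0.0d0)], [m,m])
  b_h = reshape([(2.0d0,0.0d0), (9.0d0,0.0d0)], [m,n])
  alpha = (1.0d0, 0.0d0)
  a_d = a_h; b_d = b_h

  ! Solve A*X = alpha*B from the left; X overwrites B
  call ztrsm('L', 'L', 'N', 'N', m, n, alpha, a_d, m, b_d, m)
  b_h = b_d
  print *, b_h                       ! expect (1.0,0.0) and (2.0,0.0)
end program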
2.5.42. zhemm
ZHEMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an hermitian matrix and B and C are m by n matrices.
subroutine zhemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha, beta ! device or host variable
subroutine cublasZhemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZhemm_v2(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.5.43. zherk
ZHERK performs one of the hermitian rank k operations C := alpha*A*A**H + beta*C, or C := alpha*A**H*A + beta*C, where alpha and beta are real scalars, C is an n by n hermitian matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
subroutine zherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
real(8), device :: alpha, beta ! device or host variable
subroutine cublasZherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
integer(4) function cublasZherk_v2(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.5.44. zher2k
ZHER2K performs one of the hermitian rank 2k operations C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C, or C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C, where alpha and beta are scalars with beta real, C is an n by n hermitian matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
subroutine zher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
subroutine cublasZher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
integer(4) function cublasZher2k_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
2.5.45. zherkx
ZHERKX performs a variation of the hermitian rank k update C := alpha*A*B**H + beta*C, where alpha is a complex scalar, beta is a real scalar, C is an n by n hermitian matrix stored in lower or upper mode, and A and B are n by k matrices. See the CUBLAS documentation for more details.
subroutine zherkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a ! device or host variable
complex(8), device, dimension(ldb, *) :: b ! device or host variable
complex(8), device, dimension(ldc, *) :: c ! device or host variable
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
subroutine cublasZherkx(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
character*1 :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
integer(4) function cublasZherkx_v2(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
2.5.46. cublasZgemvBatched
ZGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasZgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
complex(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
complex(8), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasZgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
complex(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
complex(8), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.5.47. cublasZgemmBatched
ZGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasZgemmBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
complex(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
complex(8), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasZgemmBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
complex(8), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
complex(8), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
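As with the single precision complex batched routines, the argument arrays hold device pointers to the individual matrices. A minimal sketch, assuming the cublas and cudafor modules and using c_devloc to populate the pointer arrays:
program zgemm_batched_example
  use cudafor
  use cublas
  implicit none
  integer, parameter :: m = 8, n = 8, k = 8, nbatch = 4
  complex(8), device, allocatable :: A(:,:,:), B(:,:,:), C(:,:,:)
  type(c_devptr) :: Ap(nbatch), Bp(nbatch), Cp(nbatch)
  type(c_devptr), device :: Ap_d(nbatch), Bp_d(nbatch), Cp_d(nbatch)
  complex(8) :: alpha, beta, c111
  type(cublasHandle) :: h
  integer :: i, istat

  allocate(A(m,k,nbatch), B(k,n,nbatch), C(m,n,nbatch))
  A = (1.0d0,0.0d0); B = (1.0d0,0.0d0); C = (0.0d0,0.0d0)
  alpha = (1.0d0,0.0d0); beta = (0.0d0,0.0d0)

  ! One device pointer per matrix in the batch
  do i = 1, nbatch
    Ap(i) = c_devloc(A(1,1,i))
    Bp(i) = c_devloc(B(1,1,i))
    Cp(i) = c_devloc(C(1,1,i))
  end do
  Ap_d = Ap; Bp_d = Bp; Cp_d = Cp

  istat = cublasCreate(h)
  istat = cublasZgemmBatched(h, 'N', 'N', m, n, k, alpha, Ap_d, m, Bp_d, k, &
                             beta, Cp_d, m, nbatch)
  istat = cublasDestroy(h)
  c111 = C(1,1,1)
  print *, 'C(1,1,1) = ', c111       ! expect (8.0, 0.0)
end program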
2.5.48. cublasZgelsBatched
ZGELS solves overdetermined or underdetermined complex linear systems involving an M-by-N matrix A, or its conjugate-transpose, using a QR or LQ factorization of A. It is assumed that A has full rank. The following options are provided: 1. If TRANS = 'N' and m >= n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A*X ||. 2. If TRANS = 'N' and m < n: find the minimum norm solution of an underdetermined system A * X = B. 3. If TRANS = 'C' and m >= n: find the minimum norm solution of an underdetermined system A**H * X = B. 4. If TRANS = 'C' and m < n: find the least squares solution of an overdetermined system, i.e., solve the least squares problem minimize || B - A**H * X ||. Several right hand side vectors b and solution vectors x can be handled in a single call; they are stored as the columns of the M-by-NRHS right hand side matrix B and the N-by-NRHS solution matrix X.
integer(4) function cublasZgelsBatched(h, trans, m, n, nrhs, Aarray, lda, Carray, ldc, info, devinfo, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: info(*)
integer, device :: devinfo(*)
integer :: batchCount
2.5.49. cublasZgeqrfBatched
ZGEQRF computes a QR factorization of a complex M-by-N matrix A: A = Q * R.
integer(4) function cublasZgeqrfBatched(h, m, n, Aarray, lda, Tau, info, batchCount)
type(cublasHandle) :: h
integer :: m, n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Tau(*)
integer :: info(*)
integer :: batchCount
2.5.50. cublasZgetrfBatched
ZGETRF computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges. The factorization has the form A = P * L * U where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is the right-looking Level 3 BLAS version of the algorithm.
integer(4) function cublasZgetrfBatched(h, n, Aarray, lda, ipvt, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
integer, device :: info(*)
integer :: batchCount
2.5.51. cublasZgetriBatched
ZGETRI computes the inverse of a matrix using the LU factorization computed by ZGETRF. This method inverts U and then computes inv(A) by solving the system inv(A)*L = inv(U) for inv(A).
integer(4) function cublasZgetriBatched(h, n, Aarray, lda, ipvt, Carray, ldc, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Carray(*)
integer :: ldc
integer, device :: info(*)
integer :: batchCount
2.5.52. cublasZgetrsBatched
ZGETRS solves a system of linear equations A * X = B, A**T * X = B, or A**H * X = B with a general N-by-N matrix A using the LU factorization computed by ZGETRF.
integer(4) function cublasZgetrsBatched(h, trans, n, nrhs, Aarray, lda, ipvt, Barray, ldb, info, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: n, nrhs
type(c_devptr), device :: Aarray(*)
integer :: lda
integer, device :: ipvt(*)
type(c_devptr), device :: Barray(*)
integer :: ldb
integer :: info(*)
integer :: batchCount
2.5.53. cublasZmatinvBatched
cublasZmatinvBatched is a shortcut for calling cublasZgetrfBatched followed by cublasZgetriBatched. However, it only works when n is less than 32; for larger matrices, the user has to go through cublasZgetrfBatched and cublasZgetriBatched.
integer(4) function cublasZmatinvBatched(h, n, Aarray, lda, Ainv, lda_inv, info, batchCount)
type(cublasHandle) :: h
integer :: n
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Ainv(*)
integer :: lda_inv
integer, device :: info(*)
integer :: batchCount
2.5.54. cublasZtrsmBatched
ZTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H. The matrix X is overwritten on B.
integer(4) function cublasZtrsmBatched( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side ! integer or character(1) variable
integer :: uplo ! integer or character(1) variable
integer :: trans ! integer or character(1) variable
integer :: diag ! integer or character(1) variable
integer :: m, n
complex(8), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
integer(4) function cublasZtrsmBatched_v2( h, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb, batchCount)
type(cublasHandle) :: h
integer :: side
integer :: uplo
integer :: trans
integer :: diag
integer :: m, n
complex(8), device :: alpha ! device or host variable
type(c_devptr), device :: A(*)
integer :: lda
type(c_devptr), device :: B(*)
integer :: ldb
integer :: batchCount
2.5.55. cublasZgemvStridedBatched
ZGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
integer(4) function cublasZgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
complex(8), device :: alpha ! device or host variable
complex(8), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
complex(8), device :: X(*)
integer :: incx
integer(8) :: strideX
complex(8), device :: beta ! device or host variable
complex(8), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasZgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
complex(8), device :: alpha ! device or host variable
complex(8), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
complex(8), device :: X(*)
integer :: incx
integer(8) :: strideX
complex(8), device :: beta ! device or host variable
complex(8), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
2.5.56. cublasZgemmStridedBatched
ZGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasZgemmStridedBatched(h, transa, transb, m, n, k, alpha, Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
complex(8), device :: alpha ! device or host variable
complex(8), device :: Aarray(*)
integer :: lda
integer :: strideA
complex(8), device :: Barray(*)
integer :: ldb
integer :: strideB
complex(8), device :: beta ! device or host variable
complex(8), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
integer(4) function cublasZgemmStridedBatched_v2(h, transa, transb, m, n, k, alpha, &
Aarray, lda, strideA, Barray, ldb, strideB, beta, Carray, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
complex(8), device :: alpha ! device or host variable
complex(8), device :: Aarray(*)
integer :: lda
integer :: strideA
complex(8), device :: Barray(*)
integer :: ldb
integer :: strideB
complex(8), device :: beta ! device or host variable
complex(8), device :: Carray(*)
integer :: ldc
integer :: strideC
integer :: batchCount
2.6. Half Precision Functions and Extension Functions
This section contains interfaces to the half precision cuBLAS functions and the BLAS extension functions which allow the user to individually specify the types of the arrays and computation (many or all of which support half precision).
The extension functions can accept one of many supported datatypes. Users should always check the latest cuBLAS documentation for supported combinations. In this document we will use the real(2) datatype since those functions are not otherwise supported by the S, D, C, and Z variants in the libraries. In addition, the user is responsible for properly setting the pointer mode by making calls to cublasSetPointerMode for all extension functions.
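For example, a minimal sketch, assuming the cublas module and its CUBLAS_POINTER_MODE_HOST constant, of selecting the pointer mode on a handle before calling the extension functions with host-resident scalars:
program set_pointer_mode
  use cublas
  implicit none
  type(cublasHandle) :: h
  integer :: istat

  istat = cublasCreate(h)
  ! Host-resident alpha and beta scalars require the host pointer mode
  istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
  ! ... calls to extension functions such as cublasHSHgemvBatched go here ...
  istat = cublasDestroy(h)
end program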
The type(cudaDataType)
is now common to several of the newer library functions covered in this document. Though some functions will accept an appropriately valued integer, the use of type(cudaDataType)
is now recommended going forward.
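As a minimal sketch of the pointer mode requirement just described (the CUBLAS_POINTER_MODE_HOST constant and the handle management calls are assumed to come from the cublas module), an application that keeps its scalar arguments on the host might do the following before calling any of the extension functions below:
program ext_pointer_mode_sketch
  use cublas
  implicit none
  type(cublasHandle) :: h
  integer :: istat
  istat = cublasCreate(h)
  ! scalar arguments (alpha, beta, res) will reside on the host in this program
  istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
  ! ... calls to cublasAxpyEx, cublasNrm2Ex, cublasGemmEx, etc. go here ...
  istat = cublasDestroy(h)
end program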
2.6.1. cublasHgemvBatched
HGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
In the HSH versions, alpha and beta are real(4), and the arrays which are pointed to should all contain real(2) data.
integer(4) function cublasHSHgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasHSHgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
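The batched entry points above take arrays of device pointers rather than the matrices themselves. As a sketch under assumptions made for this example (sizes, handle setup, and the use of c_devloc from the cudafor module to obtain device addresses), the pointer arrays for the HSH form might be assembled as follows:
program hsh_gemv_batched_sketch
  use cublas
  use cudafor
  implicit none
  integer, parameter :: m = 64, n = 64, batchCount = 8
  real(2), device, allocatable, target :: A(:,:,:), x(:,:), y(:,:)
  type(c_devptr) :: hA(batchCount), hx(batchCount), hy(batchCount)
  type(c_devptr), device :: dA(batchCount), dx(batchCount), dy(batchCount)
  real(4) :: alpha, beta
  type(cublasHandle) :: h
  integer :: i, istat
  allocate(A(m,n,batchCount), x(n,batchCount), y(m,batchCount))
  ! build the arrays of device addresses on the host, then copy them to the device
  do i = 1, batchCount
    hA(i) = c_devloc(A(1,1,i))
    hx(i) = c_devloc(x(1,i))
    hy(i) = c_devloc(y(1,i))
  end do
  dA = hA; dx = hx; dy = hy
  alpha = 1.0; beta = 0.0
  istat = cublasCreate(h)
  istat = cublasHSHgemvBatched(h, 'N', m, n, alpha, dA, m, dx, 1, beta, dy, 1, batchCount)
  istat = cublasDestroy(h)
end program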
In the HSS versions, alpha and beta are real(4); the arrays pointed to by Aarray and xarray should contain real(2) data, and the arrays pointed to by yarray should contain real(4) data.
integer(4) function cublasHSSgemvBatched(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
integer(4) function cublasHSSgemvBatched_v2(h, trans, m, n, alpha, &
Aarray, lda, xarray, incx, beta, yarray, incy, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: xarray(*)
integer :: incx
real(4), device :: beta ! device or host variable
type(c_devptr), device :: yarray(*)
integer :: incy
integer :: batchCount
2.6.2. cublasHgemvStridedBatched
HGEMV performs a batch of the matrix-vector operations Y := alpha*op( A ) * X + beta*Y, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars, A is an m by n matrix, and X and Y are vectors.
In the HSH versions, alpha and beta are real(4), and the A, X, and Y arrays all contain real(2) data.
integer(4) function cublasHSHgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(2), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(2), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasHSHgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(2), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(2), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
In the HSS versions, alpha and beta are real(4); the A and X arrays contain real(2) data, and the Y array contains real(4) data.
integer(4) function cublasHSSgemvStridedBatched(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans ! integer or character(1) variable
integer :: m, n
real(4), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(2), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
integer(4) function cublasHSSgemvStridedBatched_v2(h, trans, m, n, alpha, &
A, lda, strideA, X, incx, strideX, beta, Y, incy, strideY, batchCount)
type(cublasHandle) :: h
integer :: trans
integer :: m, n
real(4), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(2), device :: X(*)
integer :: incx
integer(8) :: strideX
real(4), device :: beta ! device or host variable
real(4), device :: Y(*)
integer :: incy
integer(8) :: strideY
integer :: batchCount
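A compact sketch of the strided HSS form follows, with sizes and handle setup assumed for the example; note the real(4) scalars and Y array alongside the real(2) A and X data, and the integer(8) element strides.
program hss_gemv_strided_sketch
  use cublas
  use cudafor
  implicit none
  integer, parameter :: m = 64, n = 64, batchCount = 8
  real(2), device, allocatable :: A(:,:,:), x(:,:)
  real(4), device, allocatable :: y(:,:)
  real(4) :: alpha, beta
  type(cublasHandle) :: h
  integer :: istat
  allocate(A(m,n,batchCount), x(n,batchCount), y(m,batchCount))
  A = 1.0_2; x = 1.0_2; y = 0.0
  alpha = 1.0; beta = 0.0
  istat = cublasCreate(h)
  ! strides are element counts between consecutive operands in the batch
  istat = cublasHSSgemvStridedBatched(h, 'N', m, n, alpha, A, m, int(m*n,8), &
          x, 1, int(n,8), beta, y, 1, int(m,8), batchCount)
  istat = cublasDestroy(h)
end program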
2.6.3. cublasHgemm
HGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
subroutine cublasHgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, &
beta, c, ldc)
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k, lda, ldb, ldc
real(2), device, dimension(lda, *) :: a
real(2), device, dimension(ldb, *) :: b
real(2), device, dimension(ldc, *) :: c
real(2), device :: alpha, beta ! device or host variable
In the v2 version, the user is responsible for setting the pointer mode for the alpha and beta arguments.
integer(4) function cublasHgemm_v2(h, transa, transb, m, n, k, alpha, &
a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(2), device, dimension(lda, *) :: a
real(2), device, dimension(ldb, *) :: b
real(2), device, dimension(ldc, *) :: c
real(2), device :: alpha, beta ! device or host variable
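For example, here is a minimal sketch of the legacy (implicit handle) form, with sizes assumed for the example; the v2 form takes the handle as its first argument and, as noted above, requires the pointer mode to be set for alpha and beta.
program hgemm_sketch
  use cublas
  use cudafor
  implicit none
  integer, parameter :: m = 128, n = 128, k = 128
  real(2), device, allocatable :: a(:,:), b(:,:), c(:,:)
  real(2) :: alpha, beta
  allocate(a(m,k), b(k,n), c(m,n))
  a = 1.0_2; b = 2.0_2; c = 0.0_2
  alpha = 1.0_2; beta = 0.0_2
  ! legacy form: no handle argument; scalars may reside on the host or the device
  call cublasHgemm('N', 'N', m, n, k, alpha, a, m, b, k, beta, c, m)
end program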
2.6.4. cublasHgemmBatched
HGEMM performs a batch of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasHgemmBatched(h, transa, transb, m, n, k, &
alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(2), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(2), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
integer(4) function cublasHgemmBatched_v2(h, transa, transb, m, n, k, &
alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(2), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
integer :: lda
type(c_devptr), device :: Barray(*)
integer :: ldb
real(2), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
integer :: ldc
integer :: batchCount
2.6.5. cublasHgemmStridedBatched
HGEMM performs a batch of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasHgemmStridedBatched(h, transa, transb, m, n, k, &
alpha, A, lda, strideA, B, ldb, strideB, beta, C, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa ! integer or character(1) variable
integer :: transb ! integer or character(1) variable
integer :: m, n, k
real(2), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(2), device :: B(ldb,*)
integer :: ldb
integer(8) :: strideB
real(2), device :: beta ! device or host variable
real(2), device :: C(ldc,*)
integer :: ldc
integer(8) :: strideC
integer :: batchCount
integer(4) function cublasHgemmStridedBatched_v2(h, transa, transb, m, n, k, &
alpha, A, lda, strideA, B, ldb, strideB, beta, C, ldc, strideC, batchCount)
type(cublasHandle) :: h
integer :: transa
integer :: transb
integer :: m, n, k
real(2), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
integer :: lda
integer(8) :: strideA
real(2), device :: B(ldb,*)
integer :: ldb
integer(8) :: strideB
real(2), device :: beta ! device or host variable
real(2), device :: C(ldc,*)
integer :: ldc
integer(8) :: strideC
integer :: batchCount
2.6.6. cublasIamaxEx
IAMAX finds the index of the element having the maximum absolute value.
integer(4) function cublasIamaxEx(h, n, x, xtype, incx, res)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
integer, device :: res ! device or host variable
2.6.7. cublasIaminEx
IAMIN finds the index of the element having the minimum absolute value.
integer(4) function cublasIaminEx(h, n, x, xtype, incx, res)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
integer, device :: res ! device or host variable
2.6.8. cublasAsumEx
ASUM takes the sum of the absolute values.
integer(4) function cublasAsumEx(h, n, x, xtype, incx, res, &
restype, extype)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
real(2), device :: res ! device or host variable
type(cudaDataType) :: restype
type(cudaDataType) :: extype
2.6.9. cublasAxpyEx
AXPY computes a constant times a vector plus a vector.
integer(4) function cublasAxpyEx(h, n, alpha, alphatype, &
x, xtype, incx, y, ytype, incy, extype)
type(cublasHandle) :: h
integer :: n
real(2), device :: alpha
type(cudaDataType) :: alphatype
real(2), device, dimension(*) :: x
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y
type(cudaDataType) :: ytype
integer :: incy
type(cudaDataType) :: extype
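Here is a sketch of a half-precision AXPY carried out with a single-precision execution type, assuming the CUDA_R_16F and CUDA_R_32F data type constants and the pointer mode constants provided by the modules; the sizes and values are illustrative only.
program axpy_ex_sketch
  use cublas
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real(2), device, allocatable :: x(:), y(:)
  real(2) :: alpha
  type(cublasHandle) :: h
  integer :: istat
  allocate(x(n), y(n))
  x = 1.0_2; y = 2.0_2; alpha = 3.0_2
  istat = cublasCreate(h)
  istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)  ! alpha resides on the host
  ! real(2) data with real(4) execution type
  istat = cublasAxpyEx(h, n, alpha, CUDA_R_16F, x, CUDA_R_16F, 1, &
          y, CUDA_R_16F, 1, CUDA_R_32F)
  istat = cublasDestroy(h)
end program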
2.6.10. cublasCopyEx
COPY copies a vector, x, to a vector, y.
integer(4) function cublasCopyEx(h, n, x, xtype, incx, &
y, ytype, incy)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y
type(cudaDataType) :: ytype
integer :: incy
2.6.11. cublasDotEx
DOT forms the dot product of two vectors.
integer(4) function cublasDotEx(h, n, x, xtype, incx, &
y, ytype, incy, res, restype, extype)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y ! Type and kind as specified by ytype
type(cudaDataType) :: ytype
integer :: incy
real(2), device :: res ! device or host variable
type(cudaDataType) :: restype
type(cudaDataType) :: extype
2.6.12. cublasDotcEx
DOTC forms the conjugated dot product of two vectors.
integer(4) function cublasDotcEx(h, n, x, xtype, incx, &
y, ytype, incy, res, restype, extype)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y ! Type and kind as specified by ytype
type(cudaDataType) :: ytype
integer :: incy
real(2), device :: res ! device or host variable
type(cudaDataType) :: restype
type(cudaDataType) :: extype
2.6.13. cublasNrm2Ex
NRM2 computes the Euclidean norm of a vector.
integer(4) function cublasNrm2Ex(h, n, x, xtype, incx, res, &
restype, extype)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
real(2), device :: res ! device or host variable
type(cudaDataType) :: restype
type(cudaDataType) :: extype
2.6.14. cublasRotEx
ROT applies a plane rotation.
integer(4) function cublasRotEx(h, n, x, xtype, incx, &
y, ytype, incy, c, s, cstype, extype)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y ! Type and kind as specified by ytype
type(cudaDataType) :: ytype
integer :: incy
real(2), device :: c, s ! device or host variable
type(cudaDataType) :: cstype
type(cudaDataType) :: extype
2.6.15. cublasRotgEx
ROTG constructs a Givens plane rotation.
integer(4) function cublasRotgEx(h, a, b, abtype, &
c, s, cstype, extype)
type(cublasHandle) :: h
real(2), device :: a, b ! Type and kind as specified by abtype
type(cudaDataType) :: abtype
real(2), device :: c, s ! device or host variable
type(cudaDataType) :: cstype
type(cudaDataType) :: extype
2.6.16. cublasRotmEx
ROTM applies a modified Givens transformation.
integer(4) function cublasRotmEx(h, n, x, xtype, incx, &
y, ytype, incy, param, paramtype, extype)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x ! Type and kind as specified by xtype
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y ! Type and kind as specified by ytype
type(cudaDataType) :: ytype
integer :: incy
real(2), device, dimension(*) :: param
type(cudaDataType) :: paramtype
type(cudaDataType) :: extype
2.6.17. cublasRotmgEx
ROTMG constructs a modified Givens transformation matrix.
integer(4) function cublasRotmgEx(h, d1, d1type, d2, d2type, &
x1, x1type, y1, y1type, param, paramtype, extype)
type(cublasHandle) :: h
real(2), device :: d1 ! Type and kind as specified by d1type
type(cudaDataType) :: d1type
real(2), device :: d2 ! Type and kind as specified by d2type
type(cudaDataType) :: d2type
real(2), device :: x1 ! Type and kind as specified by x1type
type(cudaDataType) :: x1type
real(2), device :: y1 ! Type and kind as specified by y1type
type(cudaDataType) :: y1type
real(2), device, dimension(*) :: param
type(cudaDataType) :: paramtype
type(cudaDataType) :: extype
2.6.18. cublasScalEx
SCAL scales a vector by a constant.
integer(4) function cublasScalEx(h, n, alpha, alphatype, &
x, xtype, incx, extype)
type(cublasHandle) :: h
integer :: n
real(2), device :: alpha
type(cudaDataType) :: alphatype
real(2), device, dimension(*) :: x
type(cudaDataType) :: xtype
integer :: incx
type(cudaDataType) :: extype
2.6.19. cublasSwapEx
SWAP interchanges two vectors.
integer(4) function cublasSwapEx(h, n, x, xtype, incx, &
y, ytype, incy)
type(cublasHandle) :: h
integer :: n
real(2), device, dimension(*) :: x
type(cudaDataType) :: xtype
integer :: incx
real(2), device, dimension(*) :: y
type(cudaDataType) :: ytype
integer :: incy
2.6.20. cublasGemmEx
GEMM performs the matrix-matrix multiply operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C an m by n matrix.
The data type of alpha and beta generally follows the computeType argument. See the cuBLAS documentation for the data type combinations currently supported.
integer(4) function cublasGemmEx(h, transa, transb, m, n, k, alpha, &
A, atype, lda, B, btype, ldb, beta, C, ctype, ldc, computeType, algo)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k
real(2), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
type(cudaDataType) :: atype
integer :: lda
real(2), device :: B(ldb,*)
type(cudaDataType) :: btype
integer :: ldb
real(2), device :: beta ! device or host variable
real(2), device :: C(ldc,*)
type(cudaDataType) :: ctype
integer :: ldc
type(cublasComputeType) :: computeType ! also accept integer
type(cublasGemmAlgoType) :: algo ! also accept integer
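As a sketch of a mixed-precision GEMM, assuming the CUBLAS_OP_N, CUDA_R_16F, CUBLAS_COMPUTE_32F, and CUBLAS_GEMM_DEFAULT constants provided by the modules: the alpha and beta scalars are real(4) here to match the single-precision compute type, and the sizes are illustrative only.
program gemm_ex_sketch
  use cublas
  use cudafor
  implicit none
  integer, parameter :: m = 256, n = 256, k = 256
  real(2), device, allocatable :: a(:,:), b(:,:), c(:,:)
  real(4) :: alpha, beta
  type(cublasHandle) :: h
  integer :: istat
  allocate(a(m,k), b(k,n), c(m,n))
  a = 1.0_2; b = 1.0_2; c = 0.0_2
  alpha = 1.0; beta = 0.0
  istat = cublasCreate(h)
  istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)  ! alpha, beta on the host
  ! real(2) inputs and output, accumulation in real(4)
  istat = cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, &
          a, CUDA_R_16F, m, b, CUDA_R_16F, k, beta, c, CUDA_R_16F, m, &
          CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT)
  istat = cublasDestroy(h)
end program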
2.6.21. cublasGemmBatchedEx
GEMM performs a batch of matrix-matrix multiply operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C an m by n matrix.
The data type of alpha and beta generally follows the computeType argument. See the cuBLAS documentation for the data type combinations currently supported.
integer(4) function cublasGemmBatchedEx(h, transa, transb, m, n, k, &
alpha, Aarray, atype, lda, Barray, btype, ldb, beta, &
Carray, ctype, ldc, batchCount, computeType, algo)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k
real(2), device :: alpha ! device or host variable
type(c_devptr), device :: Aarray(*)
type(cudaDataType) :: atype
integer :: lda
type(c_devptr), device :: Barray(*)
type(cudaDataType) :: btype
integer :: ldb
real(2), device :: beta ! device or host variable
type(c_devptr), device :: Carray(*)
type(cudaDataType) :: ctype
integer :: ldc
integer :: batchCount
type(cublasComputeType) :: computeType ! also accept integer
type(cublasGemmAlgoType) :: algo ! also accept integer
2.6.22. cublasGemmStridedBatchedEx
GEMM performs a batch of matrix-matrix multiply operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C an m by n matrix.
The data type of alpha and beta generally follows the computeType argument. See the cuBLAS documentation for the data type combinations currently supported.
integer(4) function cublasGemmStridedBatchedEx(h, transa, transb, m, n, k, &
alpha, A, atype, lda, strideA, B, btype, ldb, strideB, beta, &
C, ctype, ldc, strideC, batchCount, computeType, algo)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k
real(2), device :: alpha ! device or host variable
real(2), device :: A(lda,*)
type(cudaDataType) :: atype
integer :: lda
integer(8) :: strideA
real(2), device :: B(ldb,*)
type(cudaDataType) :: btype
integer :: ldb
integer(8) :: strideB
real(2), device :: beta ! device or host variable
real(2), device :: C(ldc,*)
type(cudaDataType) :: ctype
integer :: ldc
integer(8) :: strideC
integer :: batchCount
type(cublasComputeType) :: computeType ! also accept integer
type(cublasGemmAlgoType) :: algo ! also accept integer
2.7. CUBLAS V2 Module Functions
This section contains interfaces to the cuBLAS V2 Module Functions. Users can access this module by inserting the line use cublas_v2 into the program unit. One major difference between the cublas_v2 module and the cublas module is that the cublas entry points, such as cublasIsamax, are changed to take the handle as the first argument. The second difference is that the v2 entry points, such as cublasIsamax_v2, do not implicitly handle the pointer modes for the user; it is up to the programmer to make calls to cublasSetPointerMode to tell the library whether scalar arguments reside on the host or on the device. The actual interfaces to the v2 entry points do not change and are not listed in this section.
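For illustration, here is a minimal sketch of using the cublas_v2 module, with sizes and data assumed for the example; note that the handle is now the first argument and the status is returned as the function result.
program v2_isamax_sketch
  use cublas_v2
  use cudafor
  implicit none
  integer, parameter :: n = 1000
  real(4), allocatable :: xh(:)
  real(4), device, allocatable :: x(:)
  type(cublasHandle) :: h
  integer :: imax, istat
  allocate(xh(n), x(n))
  call random_number(xh)
  x = xh                                   ! copy host data to the device
  istat = cublasCreate(h)
  istat = cublasIsamax(h, n, x, 1, imax)   ! handle is the first argument
  print *, 'index of max |x|:', imax
  istat = cublasDestroy(h)
end program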
2.7.1. Single Precision Functions and Subroutines
This section contains the V2 interfaces to the single precision BLAS and cuBLAS functions and subroutines.
2.7.1.1. isamax
If you use the cublas_v2 module, the interface for cublasIsamax is changed to the following:
integer(4) function cublasIsamax(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.1.2. isamin
If you use the cublas_v2 module, the interface for cublasIsamin is changed to the following:
integer(4) function cublasIsamin(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.1.3. sasum
If you use the cublas_v2 module, the interface for cublasSasum is changed to the following:
integer(4) function cublasSasum(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.7.1.4. saxpy
If you use the cublas_v2 module, the interface for cublasSaxpy is changed to the following:
integer(4) function cublasSaxpy(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.1.5. scopy
If you use the cublas_v2 module, the interface for cublasScopy is changed to the following:
integer(4) function cublasScopy(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.1.6. sdot
If you use the cublas_v2 module, the interface for cublasSdot is changed to the following:
integer(4) function cublasSdot(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
real(4), device :: res ! device or host variable
2.7.1.7. snrm2
If you use the cublas_v2 module, the interface for cublasSnrm2 is changed to the following:
integer(4) function cublasSnrm2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.7.1.8. srot
If you use the cublas_v2 module, the interface for cublasSrot is changed to the following:
integer(4) function cublasSrot(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc, ss ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.1.9. srotg
If you use the cublas_v2 module, the interface for cublasSrotg is changed to the following:
integer(4) function cublasSrotg(h, sa, sb, sc, ss)
type(cublasHandle) :: h
real(4), device :: sa, sb, sc, ss ! device or host variable
2.7.1.10. srotm
If you use the cublas_v2 module, the interface for cublasSrotm is changed to the following:
integer(4) function cublasSrotm(h, n, x, incx, y, incy, param)
type(cublasHandle) :: h
integer :: n
real(4), device :: param(*) ! device or host variable
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.1.11. srotmg
If you use the cublas_v2 module, the interface for cublasSrotmg is changed to the following:
integer(4) function cublasSrotmg(h, d1, d2, x1, y1, param)
type(cublasHandle) :: h
real(4), device :: d1, d2, x1, y1, param(*) ! device or host variable
2.7.1.12. sscal
If you use the cublas_v2 module, the interface for cublasSscal is changed to the following:
integer(4) function cublasSscal(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
real(4), device, dimension(*) :: x
integer :: incx
2.7.1.13. sswap
If you use the cublas_v2 module, the interface for cublasSswap is changed to the following:
integer(4) function cublasSswap(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.1.14. sgbmv
If you use the cublas_v2 module, the interface for cublasSgbmv is changed to the following:
integer(4) function cublasSgbmv(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.7.1.15. sgemv
If you use the cublas_v2 module, the interface for cublasSgemv is changed to the following:
integer(4) function cublasSgemv(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.7.1.16. sger
If you use the cublas_v2 module, the interface for cublasSger is changed to the following:
integer(4) function cublasSger(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
2.7.1.17. ssbmv
If you use the cublas_v2 module, the interface for cublasSsbmv is changed to the following:
integer(4) function cublasSsbmv(h, t, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: k, n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.7.1.18. sspmv
If you use the cublas_v2 module, the interface for cublasSspmv is changed to the following:
integer(4) function cublasSspmv(h, t, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha, beta ! device or host variable
2.7.1.19. sspr
If you use the cublas_v2 module, the interface for cublasSspr is changed to the following:
integer(4) function cublasSspr(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
real(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.7.1.20. sspr2
If you use the cublas_v2 module, the interface for cublasSspr2 is changed to the following:
integer(4) function cublasSspr2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(4), device, dimension(*) :: a, x, y
real(4), device :: alpha ! device or host variable
2.7.1.21. ssymv
If you use the cublas_v2 module, the interface for cublasSsymv is changed to the following:
integer(4) function cublasSsymv(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha, beta ! device or host variable
2.7.1.22. ssyr
If you use the cublas_v2 module, the interface for cublasSsyr is changed to the following:
integer(4) function cublasSsyr(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
real(4), device :: alpha ! device or host variable
2.7.1.23. ssyr2
If you use the cublas_v2 module, the interface for cublasSsyr2 is changed to the following:
integer(4) function cublasSsyr2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x, y
real(4), device :: alpha ! device or host variable
2.7.1.24. stbmv
If you use the cublas_v2 module, the interface for cublasStbmv is changed to the following:
integer(4) function cublasStbmv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.7.1.25. stbsv
If you use the cublas_v2 module, the interface for cublasStbsv is changed to the following:
integer(4) function cublasStbsv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.7.1.26. stpmv
If you use the cublas_v2 module, the interface for cublasStpmv is changed to the following:
integer(4) function cublasStpmv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
2.7.1.27. stpsv
If you use the cublas_v2 module, the interface for cublasStpsv is changed to the following:
integer(4) function cublasStpsv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(4), device, dimension(*) :: a, x
2.7.1.28. strmv
If you use the cublas_v2 module, the interface for cublasStrmv is changed to the following:
integer(4) function cublasStrmv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.7.1.29. strsv
If you use the cublas_v2 module, the interface for cublasStrsv is changed to the following:
integer(4) function cublasStrsv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(*) :: x
2.7.1.30. sgemm
If you use the cublas_v2 module, the interface for cublasSgemm is changed to the following:
integer(4) function cublasSgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.7.1.31. ssymm
If you use the cublas_v2 module, the interface for cublasSsymm is changed to the following:
integer(4) function cublasSsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.7.1.32. ssyrk
If you use the cublas_v2 module, the interface for cublasSsyrk is changed to the following:
integer(4) function cublasSsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.7.1.33. ssyr2k
If you use the cublas_v2 module, the interface for cublasSsyr2k is changed to the following:
integer(4) function cublasSsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.7.1.34. ssyrkx
If you use the cublas_v2 module, the interface for cublasSsyrkx is changed to the following:
integer(4) function cublasSsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.7.1.35. strmm
If you use the cublas_v2 module, the interface for cublasStrmm is changed to the following:
integer(4) function cublasStrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device, dimension(ldc, *) :: c
real(4), device :: alpha ! device or host variable
2.7.1.36. strsm
If you use the cublas_v2 module, the interface for cublasStrsm is changed to the following:
integer(4) function cublasStrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(4), device, dimension(lda, *) :: a
real(4), device, dimension(ldb, *) :: b
real(4), device :: alpha ! device or host variable
2.7.2. Double Precision Functions and Subroutines
This section contains the V2 interfaces to the double precision BLAS and cuBLAS functions and subroutines.
2.7.2.1. idamax
If you use the cublas_v2 module, the interface for cublasIdamax is changed to the following:
integer(4) function cublasIdamax(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.2.2. idamin
If you use the cublas_v2 module, the interface for cublasIdamin is changed to the following:
integer(4) function cublasIdamin(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.2.3. dasum
If you use the cublas_v2 module, the interface for cublasDasum is changed to the following:
integer(4) function cublasDasum(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.7.2.4. daxpy
If you use the cublas_v2 module, the interface for cublasDaxpy is changed to the following:
integer(4) function cublasDaxpy(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.2.5. dcopy
If you use the cublas_v2 module, the interface for cublasDcopy is changed to the following:
integer(4) function cublasDcopy(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.2.6. ddot
If you use the cublas_v2 module, the interface for cublasDdot is changed to the following:
integer(4) function cublasDdot(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
real(8), device :: res ! device or host variable
2.7.2.7. dnrm2
If you use the cublas_v2 module, the interface for cublasDnrm2 is changed to the following:
integer(4) function cublasDnrm2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.7.2.8. drot
If you use the cublas_v2 module, the interface for cublasDrot is changed to the following:
integer(4) function cublasDrot(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc, ss ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.2.9. drotg
If you use the cublas_v2 module, the interface for cublasDrotg is changed to the following:
integer(4) function cublasDrotg(h, sa, sb, sc, ss)
type(cublasHandle) :: h
real(8), device :: sa, sb, sc, ss ! device or host variable
2.7.2.10. drotm
If you use the cublas_v2 module, the interface for cublasDrotm is changed to the following:
integer(4) function cublasDrotm(h, n, x, incx, y, incy, param)
type(cublasHandle) :: h
integer :: n
real(8), device :: param(*) ! device or host variable
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.2.11. drotmg
If you use the cublas_v2 module, the interface for cublasDrotmg is changed to the following:
integer(4) function cublasDrotmg(h, d1, d2, x1, y1, param)
type(cublasHandle) :: h
real(8), device :: d1, d2, x1, y1, param(*) ! device or host variable
2.7.2.12. dscal
If you use the cublas_v2 module, the interface for cublasDscal is changed to the following:
integer(4) function cublasDscal(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
real(8), device, dimension(*) :: x
integer :: incx
2.7.2.13. dswap
If you use the cublas_v2 module, the interface for cublasDswap is changed to the following:
integer(4) function cublasDswap(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
real(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.2.14. dgbmv
If you use the cublas_v2 module, the interface for cublasDgbmv is changed to the following:
integer(4) function cublasDgbmv(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.7.2.15. dgemv
If you use the cublas_v2 module, the interface for cublasDgemv is changed to the following:
integer(4) function cublasDgemv(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.7.2.16. dger
If you use the cublas_v2 module, the interface for cublasDger is changed to the following:
integer(4) function cublasDger(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
2.7.2.17. dsbmv
If you use the cublas_v2 module, the interface for cublasDsbmv is changed to the following:
integer(4) function cublasDsbmv(h, t, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: k, n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.7.2.18. dspmv
If you use the cublas_v2 module, the interface for cublasDspmv is changed to the following:
integer(4) function cublasDspmv(h, t, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha, beta ! device or host variable
2.7.2.19. dspr
If you use the cublas_v2 module, the interface for cublasDspr is changed to the following:
integer(4) function cublasDspr(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
real(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.7.2.20. dspr2
If you use the cublas_v2 module, the interface for cublasDspr2 is changed to the following:
integer(4) function cublasDspr2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
real(8), device, dimension(*) :: a, x, y
real(8), device :: alpha ! device or host variable
2.7.2.21. dsymv
If you use the cublas_v2 module, the interface for cublasDsymv is changed to the following:
integer(4) function cublasDsymv(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha, beta ! device or host variable
2.7.2.22. dsyr
If you use the cublas_v2 module, the interface for cublasDsyr is changed to the following:
integer(4) function cublasDsyr(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
real(8), device :: alpha ! device or host variable
2.7.2.23. dsyr2
If you use the cublas_v2 module, the interface for cublasDsyr2 is changed to the following:
integer(4) function cublasDsyr2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x, y
real(8), device :: alpha ! device or host variable
2.7.2.24. dtbmv
If you use the cublas_v2 module, the interface for cublasDtbmv is changed to the following:
integer(4) function cublasDtbmv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.7.2.25. dtbsv
If you use the cublas_v2 module, the interface for cublasDtbsv is changed to the following:
integer(4) function cublasDtbsv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.7.2.26. dtpmv
If you use the cublas_v2 module, the interface for cublasDtpmv is changed to the following:
integer(4) function cublasDtpmv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
2.7.2.27. dtpsv
If you use the cublas_v2 module, the interface for cublasDtpsv is changed to the following:
integer(4) function cublasDtpsv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
real(8), device, dimension(*) :: a, x
2.7.2.28. dtrmv
If you use the cublas_v2 module, the interface for cublasDtrmv is changed to the following:
integer(4) function cublasDtrmv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.7.2.29. dtrsv
If you use the cublas_v2 module, the interface for cublasDtrsv is changed to the following:
integer(4) function cublasDtrsv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(*) :: x
2.7.2.30. dgemm
If you use the cublas_v2 module, the interface for cublasDgemm is changed to the following:
integer(4) function cublasDgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.7.2.31. dsymm
If you use the cublas_v2 module, the interface for cublasDsymm is changed to the following:
integer(4) function cublasDsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.7.2.32. dsyrk
If you use the cublas_v2 module, the interface for cublasDsyrk is changed to the following:
integer(4) function cublasDsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.7.2.33. dsyr2k
If you use the cublas_v2 module, the interface for cublasDsyr2k is changed to the following:
integer(4) function cublasDsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.7.2.34. dsyrkx
If you use the cublas_v2 module, the interface for cublasDsyrkx is changed to the following:
integer(4) function cublasDsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.7.2.35. dtrmm
If you use the cublas_v2 module, the interface for cublasDtrmm is changed to the following:
integer(4) function cublasDtrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device, dimension(ldc, *) :: c
real(8), device :: alpha ! device or host variable
2.7.2.36. dtrsm
If you use the cublas_v2 module, the interface for cublasDtrsm is changed to the following:
integer(4) function cublasDtrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
real(8), device, dimension(lda, *) :: a
real(8), device, dimension(ldb, *) :: b
real(8), device :: alpha ! device or host variable
2.7.3. Single Precision Complex Functions and Subroutines
This section contains the V2 interfaces to the single precision complex BLAS and cuBLAS functions and subroutines.
2.7.3.1. icamax
If you use the cublas_v2 module, the interface for cublasIcamax is changed to the following:
integer(4) function cublasIcamax(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.3.2. icamin
If you use the cublas_v2 module, the interface for cublasIcamin is changed to the following:
integer(4) function cublasIcamin(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.3.3. scasum
If you use the cublas_v2 module, the interface for cublasScasum is changed to the following:
integer(4) function cublasScasum(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.7.3.4. caxpy
If you use the cublas_v2 module, the interface for cublasCaxpy is changed to the following:
integer(4) function cublasCaxpy(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.3.5. ccopy
If you use the cublas_v2 module, the interface for cublasCcopy is changed to the following:
integer(4) function cublasCcopy(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.3.6. cdotc
If you use the cublas_v2 module, the interface for cublasCdotc is changed to the following:
integer(4) function cublasCdotc(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
complex(4), device :: res ! device or host variable
2.7.3.7. cdotu
If you use the cublas_v2 module, the interface for cublasCdotu is changed to the following:
integer(4) function cublasCdotu(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
complex(4), device :: res ! device or host variable
2.7.3.8. scnrm2
If you use the cublas_v2 module, the interface for cublasScnrm2 is changed to the following:
integer(4) function cublasScnrm2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x
integer :: incx
real(4), device :: res ! device or host variable
2.7.3.9. crot
If you use the cublas_v2 module, the interface for cublasCrot is changed to the following:
integer(4) function cublasCrot(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc ! device or host variable
complex(4), device :: ss ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.3.10. csrot
If you use the cublas_v2 module, the interface for cublasCsrot is changed to the following:
integer(4) function cublasCsrot(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(4), device :: sc, ss ! device or host variable
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.3.11. crotg
If you use the cublas_v2 module, the interface for cublasCrotg is changed to the following:
integer(4) function cublasCrotg(h, sa, sb, sc, ss)
type(cublasHandle) :: h
complex(4), device :: sa, sb, ss ! device or host variable
real(4), device :: sc ! device or host variable
2.7.3.12. cscal
If you use the cublas_v2 module, the interface for cublasCscal is changed to the following:
integer(4) function cublasCscal(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
complex(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x
integer :: incx
2.7.3.13. csscal
If you use the cublas_v2 module, the interface for cublasCsscal is changed to the following:
integer(4) function cublasCsscal(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(4), device :: a ! device or host variable
complex(4), device, dimension(*) :: x
integer :: incx
2.7.3.14. cswap
If you use the cublas_v2 module, the interface for cublasCswap is changed to the following:
integer(4) function cublasCswap(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(4), device, dimension(*) :: x, y
integer :: incx, incy
2.7.3.15. cgbmv
If you use the cublas_v2 module, the interface for cublasCgbmv is changed to the following:
integer(4) function cublasCgbmv(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.7.3.16. cgemv
If you use the cublas_v2 module, the interface for cublasCgemv is changed to the following:
integer(4) function cublasCgemv(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.7.3.17. cgerc
If you use the cublas_v2 module, the interface for cublasCgerc is changed to the following:
integer(4) function cublasCgerc(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
2.7.3.18. cgeru
If you use the cublas_v2 module, the interface for cublasCgeru is changed to the following:
integer(4) function cublasCgeru(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
2.7.3.19. csymv
If you use the cublas_v2 module, the interface for cublasCsymv is changed to the following:
integer(4) function cublasCsymv(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.7.3.20. csyr
If you use the cublas_v2 module, the interface for cublasCsyr is changed to the following:
integer(4) function cublasCsyr(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
complex(4), device :: alpha ! device or host variable
2.7.3.21. csyr2
If you use the cublas_v2 module, the interface for cublasCsyr2 is changed to the following:
integer(4) function cublasCsyr2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha ! device or host variable
2.7.3.22. ctbmv
If you use the cublas_v2 module, the interface for cublasCtbmv is changed to the following:
integer(4) function cublasCtbmv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.7.3.23. ctbsv
If you use the cublas_v2 module, the interface for cublasCtbsv is changed to the following:
integer(4) function cublasCtbsv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.7.3.24. ctpmv
If you use the cublas_v2 module, the interface for cublasCtpmv is changed to the following:
integer(4) function cublasCtpmv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x
2.7.3.25. ctpsv
If you use the cublas_v2 module, the interface for cublasCtpsv is changed to the following:
integer(4) function cublasCtpsv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(4), device, dimension(*) :: a, x
2.7.3.26. ctrmv
If you use the cublas_v2 module, the interface for cublasCtrmv is changed to the following:
integer(4) function cublasCtrmv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.7.3.27. ctrsv
If you use the cublas_v2 module, the interface for cublasCtrsv is changed to the following:
integer(4) function cublasCtrsv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x
2.7.3.28. chbmv
If you use the cublas_v2 module, the interface for cublasChbmv is changed to the following:
integer(4) function cublasChbmv(h, uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: k, n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.7.3.29. chemv
If you use the cublas_v2 module, the interface for cublasChemv is changed to the following:
integer(4) function cublasChemv(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(*) :: x, y
complex(4), device :: alpha, beta ! device or host variable
2.7.3.30. chpmv
If you use the cublas_v2 module, the interface for cublasChpmv is changed to the following:
integer(4) function cublasChpmv(h, uplo, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha, beta ! device or host variable
2.7.3.31. cher
If you use the cublas_v2 module, the interface for cublasCher is changed to the following:
integer(4) function cublasCher(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.7.3.32. cher2
If you use the cublas_v2 module, the interface for cublasCher2 is changed to the following:
integer(4) function cublasCher2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha ! device or host variable
2.7.3.33. chpr
If you use the cublas_v2 module, the interface for cublasChpr is changed to the following:
integer(4) function cublasChpr(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
complex(4), device, dimension(*) :: a, x
real(4), device :: alpha ! device or host variable
2.7.3.34. chpr2
If you use the cublas_v2 module, the interface for cublasChpr2 is changed to the following:
integer(4) function cublasChpr2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
complex(4), device, dimension(*) :: a, x, y
complex(4), device :: alpha ! device or host variable
2.7.3.35. cgemm
If you use the cublas_v2 module, the interface for cublasCgemm is changed to the following:
integer(4) function cublasCgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.7.3.36. csymm
If you use the cublas_v2 module, the interface for cublasCsymm is changed to the following:
integer(4) function cublasCsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.7.3.37. csyrk
If you use the cublas_v2 module, the interface for cublasCsyrk is changed to the following:
integer(4) function cublasCsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.7.3.38. csyr2k
If you use the cublas_v2 module, the interface for cublasCsyr2k is changed to the following:
integer(4) function cublasCsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.7.3.39. csyrkx
If you use the cublas_v2 module, the interface for cublasCsyrkx is changed to the following:
integer(4) function cublasCsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.7.3.40. ctrmm
If you use the cublas_v2 module, the interface for cublasCtrmm is changed to the following:
integer(4) function cublasCtrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
2.7.3.41. ctrsm
If you use the cublas_v2 module, the interface for cublasCtrsm is changed to the following:
integer(4) function cublasCtrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device :: alpha ! device or host variable
2.7.3.42. chemm
If you use the cublas_v2 module, the interface for cublasChemm is changed to the following:
integer(4) function cublasChemm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha, beta ! device or host variable
2.7.3.43. cherk
If you use the cublas_v2 module, the interface for cublasCherk is changed to the following:
integer(4) function cublasCherk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldc, *) :: c
real(4), device :: alpha, beta ! device or host variable
2.7.3.44. cher2k
If you use the cublas_v2 module, the interface for cublasCher2k is changed to the following:
integer(4) function cublasCher2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
2.7.3.45. cherkx
If you use the cublas_v2 module, the interface for cublasCherkx is changed to the following:
integer(4) function cublasCherkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(4), device, dimension(lda, *) :: a
complex(4), device, dimension(ldb, *) :: b
complex(4), device, dimension(ldc, *) :: c
complex(4), device :: alpha ! device or host variable
real(4), device :: beta ! device or host variable
2.7.4. Double Precision Complex Functions and Subroutines
This section contains the V2 interfaces to the double precision complex BLAS and cuBLAS functions and subroutines.
2.7.4.1. izamax
If you use the cublas_v2 module, the interface for cublasIzamax is changed to the following:
integer(4) function cublasIzamax(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.4.2. izamin
If you use the cublas_v2 module, the interface for cublasIzamin is changed to the following:
integer(4) function cublasIzamin(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
integer, device :: res ! device or host variable
2.7.4.3. dzasum
If you use the cublas_v2 module, the interface for cublasDzasum is changed to the following:
integer(4) function cublasDzasum(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.7.4.4. zaxpy
If you use the cublas_v2 module, the interface for cublasZaxpy is changed to the following:
integer(4) function cublasZaxpy(h, n, a, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.4.5. zcopy
If you use the cublas_v2 module, the interface for cublasZcopy is changed to the following:
integer(4) function cublasZcopy(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.4.6. zdotc
If you use the cublas_v2 module, the interface for cublasZdotc is changed to the following:
integer(4) function cublasZdotc(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
complex(8), device :: res ! device or host variable
2.7.4.7. zdotu
If you use the cublas_v2 module, the interface for cublasZdotu is changed to the following:
integer(4) function cublasZdotu(h, n, x, incx, y, incy, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
complex(8), device :: res ! device or host variable
2.7.4.8. dznrm2
If you use the cublas_v2 module, the interface for cublasDznrm2 is changed to the following:
integer(4) function cublasDznrm2(h, n, x, incx, res)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x
integer :: incx
real(8), device :: res ! device or host variable
2.7.4.9. zrot
If you use the cublas_v2 module, the interface for cublasZrot is changed to the following:
integer(4) function cublasZrot(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc ! device or host variable
complex(8), device :: ss ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.4.10. zsrot
If you use the cublas_v2 module, the interface for cublasZsrot is changed to the following:
integer(4) function cublasZsrot(h, n, x, incx, y, incy, sc, ss)
type(cublasHandle) :: h
integer :: n
real(8), device :: sc, ss ! device or host variable
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.4.11. zrotg
If you use the cublas_v2 module, the interface for cublasZrotg is changed to the following:
integer(4) function cublasZrotg(h, sa, sb, sc, ss)
type(cublasHandle) :: h
complex(8), device :: sa, sb, ss ! device or host variable
real(8), device :: sc ! device or host variable
2.7.4.12. zscal
If you use the cublas_v2 module, the interface for cublasZscal is changed to the following:
integer(4) function cublasZscal(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
complex(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x
integer :: incx
2.7.4.13. zdscal
If you use the cublas_v2 module, the interface for cublasZdscal is changed to the following:
integer(4) function cublasZdscal(h, n, a, x, incx)
type(cublasHandle) :: h
integer :: n
real(8), device :: a ! device or host variable
complex(8), device, dimension(*) :: x
integer :: incx
2.7.4.14. zswap
If you use the cublas_v2 module, the interface for cublasZswap is changed to the following:
integer(4) function cublasZswap(h, n, x, incx, y, incy)
type(cublasHandle) :: h
integer :: n
complex(8), device, dimension(*) :: x, y
integer :: incx, incy
2.7.4.15. zgbmv
If you use the cublas_v2 module, the interface for cublasZgbmv is changed to the following:
integer(4) function cublasZgbmv(h, t, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, kl, ku, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.7.4.16. zgemv
If you use the cublas_v2 module, the interface for cublasZgemv is changed to the following:
integer(4) function cublasZgemv(h, t, m, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: t
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.7.4.17. zgerc
If you use the cublas_v2 module, the interface for cublasZgerc is changed to the following:
integer(4) function cublasZgerc(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
2.7.4.18. zgeru
If you use the cublas_v2 module, the interface for cublasZgeru is changed to the following:
integer(4) function cublasZgeru(h, m, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: m, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
2.7.4.19. zsymv
If you use the cublas_v2 module, the interface for cublasZsymv is changed to the following:
integer(4) function cublasZsymv(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.7.4.20. zsyr
If you use the cublas_v2 module, the interface for cublasZsyr is changed to the following:
integer(4) function cublasZsyr(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
complex(8), device :: alpha ! device or host variable
2.7.4.21. zsyr2
If you use the cublas_v2 module, the interface for cublasZsyr2 is changed to the following:
integer(4) function cublasZsyr2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha ! device or host variable
2.7.4.22. ztbmv
If you use the cublas_v2 module, the interface for cublasZtbmv is changed to the following:
integer(4) function cublasZtbmv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.7.4.23. ztbsv
If you use the cublas_v2 module, the interface for cublasZtbsv is changed to the following:
integer(4) function cublasZtbsv(h, u, t, d, n, k, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, k, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.7.4.24. ztpmv
If you use the cublas_v2 module, the interface for cublasZtpmv is changed to the following:
integer(4) function cublasZtpmv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x
2.7.4.25. ztpsv
If you use the cublas_v2 module, the interface for cublasZtpsv is changed to the following:
integer(4) function cublasZtpsv(h, u, t, d, n, a, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx
complex(8), device, dimension(*) :: a, x
2.7.4.26. ztrmv
If you use the cublas_v2 module, the interface for cublasZtrmv is changed to the following:
integer(4) function cublasZtrmv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.7.4.27. ztrsv
If you use the cublas_v2 module, the interface for cublasZtrsv is changed to the following:
integer(4) function cublasZtrsv(h, u, t, d, n, a, lda, x, incx)
type(cublasHandle) :: h
integer :: u, t, d
integer :: n, incx, lda
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x
2.7.4.28. zhbmv
If you use the cublas_v2 module, the interface for cublasZhbmv is changed to the following:
integer(4) function cublasZhbmv(h, uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: k, n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.7.4.29. zhemv
If you use the cublas_v2 module, the interface for cublasZhemv is changed to the following:
integer(4) function cublasZhemv(h, uplo, n, alpha, a, lda, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, lda, incx, incy
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(*) :: x, y
complex(8), device :: alpha, beta ! device or host variable
2.7.4.30. zhpmv
If you use the cublas_v2 module, the interface for cublasZhpmv is changed to the following:
integer(4) function cublasZhpmv(h, uplo, n, alpha, a, x, incx, beta, y, incy)
type(cublasHandle) :: h
integer :: uplo
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha, beta ! device or host variable
2.7.4.31. zher
If you use the cublas_v2 module, the interface for cublasZher is changed to the following:
integer(4) function cublasZher(h, t, n, alpha, x, incx, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, lda
complex(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.7.4.32. zher2
If you use the cublas_v2 module, the interface for cublasZher2 is changed to the following:
integer(4) function cublasZher2(h, t, n, alpha, x, incx, y, incy, a, lda)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy, lda
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha ! device or host variable
2.7.4.33. zhpr
If you use the cublas_v2 module, the interface for cublasZhpr is changed to the following:
integer(4) function cublasZhpr(h, t, n, alpha, x, incx, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx
complex(8), device, dimension(*) :: a, x
real(8), device :: alpha ! device or host variable
2.7.4.34. zhpr2
If you use the cublas_v2 module, the interface for cublasZhpr2 is changed to the following:
integer(4) function cublasZhpr2(h, t, n, alpha, x, incx, y, incy, a)
type(cublasHandle) :: h
integer :: t
integer :: n, incx, incy
complex(8), device, dimension(*) :: a, x, y
complex(8), device :: alpha ! device or host variable
2.7.4.35. zgemm
If you use the cublas_v2 module, the interface for cublasZgemm is changed to the following:
integer(4) function cublasZgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: transa, transb
integer :: m, n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.7.4.36. zsymm
If you use the cublas_v2 module, the interface for cublasZsymm is changed to the following:
integer(4) function cublasZsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.7.4.37. zsyrk
If you use the cublas_v2 module, the interface for cublasZsyrk is changed to the following:
integer(4) function cublasZsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.7.4.38. zsyr2k
If you use the cublas_v2 module, the interface for cublasZsyr2k is changed to the following:
integer(4) function cublasZsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.7.4.39. zsyrkx
If you use the cublas_v2 module, the interface for cublasZsyrkx is changed to the following:
integer(4) function cublasZsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.7.4.40. ztrmm
If you use the cublas_v2 module, the interface for cublasZtrmm is changed to the following:
integer(4) function cublasZtrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
2.7.4.41. ztrsm
If you use the cublas_v2 module, the interface for cublasZtrsm is changed to the following:
integer(4) function cublasZtrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasHandle) :: h
integer :: side, uplo, transa, diag
integer :: m, n, lda, ldb
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device :: alpha ! device or host variable
2.7.4.42. zhemm
If you use the cublas_v2 module, the interface for cublasZhemm is changed to the following:
integer(4) function cublasZhemm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: side, uplo
integer :: m, n, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha, beta ! device or host variable
2.7.4.43. zherk
If you use the cublas_v2 module, the interface for cublasZherk is changed to the following:
integer(4) function cublasZherk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldc, *) :: c
real(8), device :: alpha, beta ! device or host variable
2.7.4.44. zher2k
If you use the cublas_v2 module, the interface for cublasZher2k is changed to the following:
integer(4) function cublasZher2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
2.7.4.45. zherkx
If you use the cublas_v2 module, the interface for cublasZherkx is changed to the following:
integer(4) function cublasZherkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasHandle) :: h
integer :: uplo, trans
integer :: n, k, lda, ldb, ldc
complex(8), device, dimension(lda, *) :: a
complex(8), device, dimension(ldb, *) :: b
complex(8), device, dimension(ldc, *) :: c
complex(8), device :: alpha ! device or host variable
real(8), device :: beta ! device or host variable
2.8. CUBLAS XT Module Functions
This section contains interfaces to the cuBLAS XT Module Functions. Users can access this module by inserting the line use cublasXt into the program unit. The cublasXt library is a host-side library, which supports multiple GPUs. Here is an example:
subroutine testxt(n)
  use cublasXt
  complex*16 :: a(n,n), b(n,n), c(n,n), alpha, beta
  type(cublasXtHandle) :: h
  integer ndevices(1)
  a = cmplx(1.0d0,0.0d0)
  b = cmplx(2.0d0,0.0d0)
  c = cmplx(-1.0d0,0.0d0)
  alpha = cmplx(1.0d0,0.0d0)
  beta = cmplx(0.0d0,0.0d0)
  istat = cublasXtCreate(h)
  if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
  ndevices(1) = 0
  istat = cublasXtDeviceSelect(h, 1, ndevices)
  if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
  istat = cublasXtZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, &
          n, n, n, &
          alpha, A, n, B, n, beta, C, n)
  if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
  istat = cublasXtDestroy(h)
  if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
  if (all(dble(c).eq.2.0d0*n)) then
    print *,"Test PASSED"
  else
    print *,"Test FAILED"
  endif
end
The cublasXt module contains all the types and definitions from the cublas module, and these additional types and enumerations:
TYPE cublasXtHandle
TYPE(C_PTR) :: handle
END TYPE
! Pinned memory mode
enum, bind(c)
enumerator :: CUBLASXT_PINNING_DISABLED=0
enumerator :: CUBLASXT_PINNING_ENABLED=1
end enum
! cublasXtOpType
enum, bind(c)
enumerator :: CUBLASXT_FLOAT=0
enumerator :: CUBLASXT_DOUBLE=1
enumerator :: CUBLASXT_COMPLEX=2
enumerator :: CUBLASXT_DOUBLECOMPLEX=3
end enum
! cublasXtBlasOp
enum, bind(c)
enumerator :: CUBLASXT_GEMM=0
enumerator :: CUBLASXT_SYRK=1
enumerator :: CUBLASXT_HERK=2
enumerator :: CUBLASXT_SYMM=3
enumerator :: CUBLASXT_HEMM=4
enumerator :: CUBLASXT_TRSM=5
enumerator :: CUBLASXT_SYR2K=6
enumerator :: CUBLASXT_HER2K=7
enumerator :: CUBLASXT_SPMM=8
enumerator :: CUBLASXT_SYRKX=9
enumerator :: CUBLASXT_HERKX=10
enumerator :: CUBLASXT_TRMM=11
enumerator :: CUBLASXT_ROUTINE_MAX=12
end enum
2.8.1. cublasXtCreate
This function initializes the cublasXt API and creates a handle to an opaque structure holding the cublasXt library context. It allocates hardware resources on the host and device and must be called prior to making any other cublasXt API library calls.
integer(4) function cublasXtcreate(h)
type(cublasXtHandle) :: h
2.8.2. cublasXtDestroy
This function releases hardware resources used by the cublasXt API context. This function is usually the last call with a particular handle to the cublasXt API.
integer(4) function cublasXtdestroy(h)
type(cublasXtHandle) :: h
2.8.3. cublasXtDeviceSelect
This function allows the user to provide the number of GPU devices and their respective IDs that will participate in subsequent cublasXt API math function calls. This function will create a cuBLAS context for every GPU provided in that list. Currently the device configuration is static and cannot be changed between math function calls. In that regard, this function should be called only once after cublasXtCreate. To be able to run multiple configurations, multiple cublasXt API contexts should be created.
integer(4) function cublasXtdeviceselect(h, ndevices, deviceid)
type(cublasXtHandle) :: h
integer :: ndevices
integer, dimension(*) :: deviceid
2.8.4. cublasXtSetBlockDim
This function allows the user to set the block dimension used for the tiling of the matrices in subsequent math function calls. Matrices are split into square tiles of blockDim x blockDim dimension. This function can be called at any time and will take effect for the following math function calls. The block dimension should be chosen so as to optimize the math operation and to ensure that the PCI transfers are well overlapped with the computation.
integer(4) function cublasXtsetblockdim(h, blockdim)
type(cublasXtHandle) :: h
integer :: blockdim
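For illustration, a minimal sketch of choosing a tile size right after creating a handle follows; the value 2048 is purely illustrative and should be tuned for the matrices and system at hand:
! Sketch: choose the tile size used by subsequent cublasXt math calls.
type(cublasXtHandle) :: h
integer :: istat
istat = cublasXtCreate(h)
istat = cublasXtSetBlockDim(h, 2048)
! ... cublasXt math calls using h ...
istat = cublasXtDestroy(h)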
2.8.5. cublasXtGetBlockDim
This function allows the user to query the block dimension used for the tiling of the matrices.
integer(4) function cublasXtgetblockdim(h, blockdim)
type(cublasXtHandle) :: h
integer :: blockdim
2.8.6. cublasXtSetCpuRoutine
This function allows the user to provide a CPU implementation of the corresponding BLAS routine. This function can be used with the function cublasXtSetCpuRatio() to define a hybrid computation between the CPU and the GPUs. Currently the hybrid feature is supported only for the xGEMM routines.
integer(4) function cublasXtsetcpuroutine(h, blasop, blastype)
type(cublasXtHandle) :: h
integer :: blasop, blastype
2.8.7. cublasXtSetCpuRatio
This function allows the user to define the percentage of workload that should be done on a CPU in the context of a hybrid computation. This function can be used with the function cublasXtSetCpuRoutine() to define a hybrid computation between the CPU and the GPUs. Currently the hybrid feature is supported only for the xGEMM routines.
integer(4) function cublasXtsetcpuratio(h, blasop, blastype, ratio)
type(cublasXtHandle) :: h
integer :: blasop, blastype
real(4) :: ratio
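As a sketch of the hybrid feature, the two calls can be combined to route part of a single precision GEMM workload to the CPU. The enumerators are those listed at the start of this section; the value 0.25 is illustrative only, and whether the ratio is expressed as a fraction or a percentage should be confirmed against the cuBLAS documentation:
! Sketch: offload part of the single precision GEMM work to the host CPU.
integer :: istat
istat = cublasXtSetCpuRoutine(h, CUBLASXT_GEMM, CUBLASXT_FLOAT)
istat = cublasXtSetCpuRatio(h, CUBLASXT_GEMM, CUBLASXT_FLOAT, 0.25)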
2.8.8. cublasXtSetPinningMemMode
This function allows the user to enable or disable the Pinning Memory mode. When enabled, the matrices passed in subsequent cublasXt API calls will be pinned and unpinned using the CUDART routines cudaHostRegister and cudaHostUnregister, respectively, if the matrices are not already pinned. If a matrix is only partially pinned, it will not be pinned further. Pinning the memory improves PCI transfer performance and allows PCI memory transfers to be overlapped with computation. However, pinning and unpinning the memory takes some time, and that cost might not be amortized. It is therefore advised that users pin the memory themselves using cudaMallocHost or cudaHostRegister, and unpin it when the computation sequence is completed. By default, the Pinning Memory mode is disabled.
integer(4) function cublasXtsetpinningmemmode(h, mode)
type(cublasXtHandle) :: h
integer :: mode
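A minimal sketch of enabling the mode with the CUBLASXT_PINNING_ENABLED enumerator listed earlier, and confirming the setting with cublasXtGetPinningMemMode (described in the next section):
! Sketch: enable the Pinning Memory mode on an existing handle, then query it.
integer :: istat, mode
istat = cublasXtSetPinningMemMode(h, CUBLASXT_PINNING_ENABLED)
istat = cublasXtGetPinningMemMode(h, mode)
if (mode .eq. CUBLASXT_PINNING_ENABLED) print *, "pinning memory mode enabled"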
2.8.9. cublasXtGetPinningMemMode
This function allows the user to query the Pinning Memory mode. By default, the Pinning Memory mode is disabled.
integer(4) function cublasXtgetpinningmemmode(h, mode)
type(cublasXtHandle) :: h
integer :: mode
2.8.10. cublasXtSgemm
SGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasXtsgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: transa, transb
integer(kind=c_intptr_t) :: m, n, k, lda, ldb, ldc
real(4), dimension(lda, *) :: a
real(4), dimension(ldb, *) :: b
real(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.11. cublasXtSsymm
SSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
integer(4) function cublasXtssymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
real(4), dimension(lda, *) :: a
real(4), dimension(ldb, *) :: b
real(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.12. cublasXtSsyrk
SSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
integer(4) function cublasXtssyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldc
real(4), dimension(lda, *) :: a
real(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.13. cublasXtSsyr2k
SSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
integer(4) function cublasXtssyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
real(4), dimension(lda, *) :: a
real(4), dimension(ldb, *) :: b
real(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.14. cublasXtSsyrkx
SSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is such that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
integer(4) function cublasXtssyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
real(4), dimension(lda, *) :: a
real(4), dimension(ldb, *) :: b
real(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.15. cublasXtStrmm
STRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ), where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T.
integer(4) function cublasXtstrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
real(4), dimension(lda, *) :: a
real(4), dimension(ldb, *) :: b
real(4), dimension(ldc, *) :: c
real(4) :: alpha
2.8.16. cublasXtStrsm
STRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
integer(4) function cublasXtstrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb
real(4), dimension(lda, *) :: a
real(4), dimension(ldb, *) :: b
real(4) :: alpha
2.8.17. cublasXtSspmm
SSPMM performs one of the symmetric packed matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an n by n symmetric matrix stored in packed format, and B and C are m by n matrices.
integer(4) function cublasXtsspmm(h, side, uplo, m, n, alpha, ap, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, ldb, ldc
real(4), dimension(*) :: ap
real(4), dimension(ldb, *) :: b
real(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.18. cublasXtCgemm
CGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasXtcgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: transa, transb
integer(kind=c_intptr_t) :: m, n, k, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.19. cublasXtChemm
CHEMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an hermitian matrix and B and C are m by n matrices.
integer(4) function cublasXtchemm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.20. cublasXtCherk
CHERK performs one of the hermitian rank k operations C := alpha*A*A**H + beta*C, or C := alpha*A**H*A + beta*C, where alpha and beta are real scalars, C is an n by n hermitian matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
integer(4) function cublasXtcherk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldc, *) :: c
real(4) :: alpha, beta
2.8.21. cublasXtCher2k
CHER2K performs one of the hermitian rank 2k operations C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C, or C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C, where alpha and beta are scalars with beta real, C is an n by n hermitian matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
integer(4) function cublasXtcher2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha
real(4) :: beta
2.8.22. cublasXtCherkx
CHERKX performs a variation of the hermitian rank k operations C := alpha*A*B**H + beta*C, where alpha is a complex scalar and beta is a real scalar, C is an n by n hermitian matrix stored in lower or upper mode, and A and B are n by k matrices. See the CUBLAS documentation for more details.
integer(4) function cublasXtcherkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha
real(4) :: beta
2.8.23. cublasXtCsymm
CSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
integer(4) function cublasXtcsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.24. cublasXtCsyrk
CSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
integer(4) function cublasXtcsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.25. cublasXtCsyr2k
CSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
integer(4) function cublasXtcsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.26. cublasXtCsyrkx
CSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is such that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
integer(4) function cublasXtcsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.27. cublasXtCtrmm
CTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ) where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H.
integer(4) function cublasXtctrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha
2.8.28. cublasXtCtrsm
CTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H. The matrix X is overwritten on B.
integer(4) function cublasXtctrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb
complex(4), dimension(lda, *) :: a
complex(4), dimension(ldb, *) :: b
complex(4) :: alpha
2.8.29. cublasXtCspmm
CSPMM performs one of the symmetric packed matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an n by n symmetric matrix stored in packed format, and B and C are m by n matrices.
integer(4) function cublasXtcspmm(h, side, uplo, m, n, alpha, ap, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, ldb, ldc
complex(4), dimension(*) :: ap
complex(4), dimension(ldb, *) :: b
complex(4), dimension(ldc, *) :: c
complex(4) :: alpha, beta
2.8.30. cublasXtDgemm
DGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasXtdgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: transa, transb
integer(kind=c_intptr_t) :: m, n, k, lda, ldb, ldc
real(8), dimension(lda, *) :: a
real(8), dimension(ldb, *) :: b
real(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.31. cublasXtDsymm
DSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
integer(4) function cublasXtdsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
real(8), dimension(lda, *) :: a
real(8), dimension(ldb, *) :: b
real(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.32. cublasXtDsyrk
DSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
integer(4) function cublasXtdsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldc
real(8), dimension(lda, *) :: a
real(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.33. cublasXtDsyr2k
DSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
integer(4) function cublasXtdsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
real(8), dimension(lda, *) :: a
real(8), dimension(ldb, *) :: b
real(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.34. cublasXtDsyrkx
DSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is such that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
integer(4) function cublasXtdsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
real(8), dimension(lda, *) :: a
real(8), dimension(ldb, *) :: b
real(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.35. cublasXtDtrmm
DTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ), where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T.
integer(4) function cublasXtdtrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
real(8), dimension(lda, *) :: a
real(8), dimension(ldb, *) :: b
real(8), dimension(ldc, *) :: c
real(8) :: alpha
2.8.36. cublasXtDtrsm
DTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T. The matrix X is overwritten on B.
integer(4) function cublasXtdtrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb
real(8), dimension(lda, *) :: a
real(8), dimension(ldb, *) :: b
real(8) :: alpha
2.8.37. cublasXtDspmm
DSPMM performs one of the symmetric packed matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an n by n symmetric matrix stored in packed format, and B and C are m by n matrices.
integer(4) function cublasXtdspmm(h, side, uplo, m, n, alpha, ap, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, ldb, ldc
real(8), dimension(*) :: ap
real(8), dimension(ldb, *) :: b
real(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.38. cublasXtZgemm
ZGEMM performs one of the matrix-matrix operations C := alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
integer(4) function cublasXtzgemm(h, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: transa, transb
integer(kind=c_intptr_t) :: m, n, k, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.8.39. cublasXtZhemm
ZHEMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an hermitian matrix and B and C are m by n matrices.
integer(4) function cublasXtzhemm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.8.40. cublasXtZherk
ZHERK performs one of the hermitian rank k operations C := alpha*A*A**H + beta*C, or C := alpha*A**H*A + beta*C, where alpha and beta are real scalars, C is an n by n hermitian matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
integer(4) function cublasXtzherk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldc, *) :: c
real(8) :: alpha, beta
2.8.41. cublasXtZher2k
ZHER2K performs one of the hermitian rank 2k operations C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C, or C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C, where alpha and beta are scalars with beta real, C is an n by n hermitian matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
integer(4) function cublasXtzher2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha
real(8) :: beta
2.8.42. cublasXtZherkx
ZHERKX performs a variation of the hermitian rank k operations C := alpha*A*B**H + beta*C, where alpha is a complex scalar and beta is a real scalar, C is an n by n hermitian matrix stored in lower or upper mode, and A and B are n by k matrices. See the CUBLAS documentation for more details.
integer(4) function cublasXtzherkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha
real(8) :: beta
2.8.43. cublasXtZsymm
ZSYMM performs one of the matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is a symmetric matrix and B and C are m by n matrices.
integer(4) function cublasXtzsymm(h, side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.8.44. cublasXtZsyrk
ZSYRK performs one of the symmetric rank k operations C := alpha*A*A**T + beta*C, or C := alpha*A**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A is an n by k matrix in the first case and a k by n matrix in the second case.
integer(4) function cublasXtzsyrk(h, uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.8.45. cublasXtZsyr2k
ZSYR2K performs one of the symmetric rank 2k operations C := alpha*A*B**T + alpha*B*A**T + beta*C, or C := alpha*A**T*B + alpha*B**T*A + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
integer(4) function cublasXtzsyr2k(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.8.46. cublasXtZsyrkx
ZSYRKX performs a variation of the symmetric rank k update C := alpha*A*B**T + beta*C, where alpha and beta are scalars, C is an n by n symmetric matrix stored in lower or upper mode, and A and B are n by k matrices. This routine can be used when B is such that the result is guaranteed to be symmetric. See the CUBLAS documentation for more details.
integer(4) function cublasXtzsyrkx(h, uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: uplo, trans
integer(kind=c_intptr_t) :: n, k, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.8.47. cublasXtZtrmm
ZTRMM performs one of the matrix-matrix operations B := alpha*op( A )*B, or B := alpha*B*op( A ) where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H.
integer(4) function cublasXtztrmm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb, ldc
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha
2.8.48. cublasXtZtrsm
ZTRSM solves one of the matrix equations op( A )*X = alpha*B, or X*op( A ) = alpha*B, where alpha is a scalar, X and B are m by n matrices, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of op( A ) = A or op( A ) = A**T or op( A ) = A**H. The matrix X is overwritten on B.
integer(4) function cublasXtztrsm(h, side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
type(cublasXtHandle) :: h
integer :: side, uplo, transa, diag
integer(kind=c_intptr_t) :: m, n, lda, ldb
complex(8), dimension(lda, *) :: a
complex(8), dimension(ldb, *) :: b
complex(8) :: alpha
2.8.49. cublasXtZspmm
ZSPMM performs one of the symmetric packed matrix-matrix operations C := alpha*A*B + beta*C, or C := alpha*B*A + beta*C, where alpha and beta are scalars, A is an n by n symmetric matrix stored in packed format, and B and C are m by n matrices.
integer(4) function cublasXtzspmm(h, side, uplo, m, n, alpha, ap, b, ldb, beta, c, ldc)
type(cublasXtHandle) :: h
integer :: side, uplo
integer(kind=c_intptr_t) :: m, n, ldb, ldc
complex(8), dimension(*) :: ap
complex(8), dimension(ldb, *) :: b
complex(8), dimension(ldc, *) :: c
complex(8) :: alpha, beta
2.9. CUBLAS MP Module Functions
This section contains interfaces to the cuBLAS MP Module Functions. Users can access this module by inserting the line use cublasMp into the program unit. The cublasMp library is a host-side library which operates on distributed device data, and which supports multiple processes and GPUs. It is based on the ScaLAPACK PBLAS library.
Beginning with the 25.1 release, the cublasMp library has a newer API for CUDA versions > 12.0, specifically cublasMp version 0.3.0 and higher. For users of CUDA versions <= 11.8, the old module has been renamed, and you can access it by inserting the line use cublasMp02 in the program unit. One major difference in version 0.3.x is that all cublasMp functions now return a type(cublasMpStatus) rather than an integer(4). There are other additions and changes which we will point out in the individual descriptions below. For complete documentation of the Fortran interfaces for cublasMp 0.2.x, please see the documentation from a 2024 NVHPC release.
Some overloaded operations for comparing and assigning type(cublasMpStatus) variables and expressions are provided in the new module.
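For example, the status returned by a cublasMp call (such as cublasMpCreate, described below) can be compared directly against the named status constants. This fragment is a minimal sketch that assumes the default CUDA stream and that cuda_stream_kind is available from the cudafor module:
! Sketch: create a cublasMp handle and check the returned status.
use cublasMp
use cudafor
type(cublasMpStatus) :: stat
type(cublasMpHandle) :: handle
integer(kind=cuda_stream_kind) :: stream
stream = 0    ! assume the default stream here
stat = cublasMpCreate(handle, stream)
if (stat .ne. CUBLASMP_STATUS_SUCCESS) print *, "cublasMpCreate failed"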
The cublasMp module contains all the common types and definitions from the cublas module, types and interfaces from the nvf_cal_comm module, and these additional types and enumerations:
! Version information
integer, parameter :: CUBLASMP_VER_MAJOR = 0
integer, parameter :: CUBLASMP_VER_MINOR = 3
integer, parameter :: CUBLASMP_VER_PATCH = 0
integer, parameter :: CUBLASMP_VERSION = &
(CUBLASMP_VER_MAJOR * 1000 + CUBLASMP_VER_MINOR * 100 + CUBLASMP_VER_PATCH)
! New status type, with version 0.3.0
TYPE cublasMpStatus
integer(4) :: stat
END TYPE
TYPE(cublasMpStatus), parameter :: &
CUBLASMP_STATUS_SUCCESS = cublasMpStatus(0), &
CUBLASMP_STATUS_NOT_INITIALIZED = cublasMpStatus(1), &
CUBLASMP_STATUS_ALLOCATION_FAILED = cublasMpStatus(2), &
CUBLASMP_STATUS_INVALID_VALUE = cublasMpStatus(3), &
CUBLASMP_STATUS_ARCHITECTURE_MISMATCH = cublasMpStatus(4), &
CUBLASMP_STATUS_EXECUTION_FAILED = cublasMpStatus(5), &
CUBLASMP_STATUS_INTERNAL_ERROR = cublasMpStatus(6), &
CUBLASMP_STATUS_NOT_SUPPORTED = cublasMpStatus(7)
! Grid Layout
TYPE cublasMpGridLayout
integer(4) :: grid
END TYPE
TYPE(cublasMpGridLayout), parameter :: &
CUBLASMP_GRID_LAYOUT_COL_MAJOR = cublasMpGridLayout(0), &
CUBLASMP_GRID_LAYOUT_ROW_MAJOR = cublasMpGridLayout(1)
! Matmul Descriptor Attributes
TYPE cublasMpMatmulDescriptorAttribute
integer(4) :: attr
END TYPE
TYPE(cublasMpMatmulDescriptorAttribute), parameter :: &
CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSA = cublasMpMatmulDescriptorAttribute(0), &
CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSB = cublasMpMatmulDescriptorAttribute(1), &
CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_COMPUTE_TYPE = cublasMpMatmulDescriptorAttribute(2), &
CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_ALGO_TYPE = cublasMpMatmulDescriptorAttribute(3)
! Matmul Algorithm Type
TYPE cublasMpMatmulAlgoType
integer(4) :: atyp
END TYPE
TYPE(cublasMpMatmulAlgoType), parameter :: &
CUBLASMP_MATMUL_ALGO_TYPE_DEFAULT = cublasMpMatmulAlgoType(0), &
CUBLASMP_MATMUL_ALGO_TYPE_SPLIT_P2P = cublasMpMatmulAlgoType(1), &
CUBLASMP_MATMUL_ALGO_TYPE_SPLIT_MULTICAST = cublasMpMatmulAlgoType(2), &
CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_P2P = cublasMpMatmulAlgoType(3), &
CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_MULTICAST = cublasMpMatmulAlgoType(4)
TYPE cublasMpHandle
TYPE(C_PTR) :: handle
END TYPE
TYPE cublasMpGrid
TYPE(C_PTR) :: handle
END TYPE
TYPE cublasMpMatrixDescriptor
TYPE(C_PTR) :: handle
END TYPE
TYPE cublasMpMatmulDescriptor
TYPE(C_PTR) :: handle
END TYPE
2.9.1. cublasMpCreate
This function initializes the cublasMp API and creates a handle to an opaque structure holding the cublasMp library context. It allocates hardware resources on the host and device and must be called prior to making any other cublasMp library calls.
type(cublasMpStatus) function cublasMpCreate(handle, stream)
type(cublasMpHandle) :: handle
integer(kind=cuda_stream_kind) :: stream
2.9.2. cublasMpDestroy
This function releases resources used by the cublasMp handle and context.
type(cublasMpStatus) function cublasMpDestroy(handle)
type(cublasMpHandle) :: handle
2.9.3. cublasMpStreamSet
This function sets the CUDA stream to be used in the cublasMp computations.
type(cublasMpStatus) function cublasMpStreamSet(handle, stream)
type(cublasMpHandle) :: handle
integer(kind=cuda_stream_kind) :: stream
2.9.4. cublasMpStreamGet
This function returns the current CUDA stream used in the cublasMp computations.
type(cublasMpStatus) function cublasMpStreamGet(handle, stream)
type(cublasMpHandle) :: handle
integer(kind=cuda_stream_kind) :: stream
2.9.5. cublasMpGetVersion
This function returns the version number of the cublasMp library.
type(cublasMpStatus) function cublasMpGetVersion(handle, version)
type(cublasMpHandle) :: handle
integer(4) :: version
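For instance, the version can be queried from an existing handle and printed as follows; this is a minimal fragment assuming handle was created earlier with cublasMpCreate:
! Sketch: query and print the cublasMp library version.
integer(4) :: ver
type(cublasMpStatus) :: stat
stat = cublasMpGetVersion(handle, ver)
if (stat .eq. CUBLASMP_STATUS_SUCCESS) print *, "cublasMp version: ", ver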
2.9.6. cublasMpGridCreate
This function initializes the grid data structure used in the cublasMp library. It takes a communicator and other information related to the data layout as inputs. Starting in version 0.3.0, it no longer takes a handle argument.
type(cublasMpStatus) function cublasMpGridCreate(nprow, npcol, &
layout, comm, grid)
integer(8) :: nprow, npcol
type(cublasMpGridLayout) :: layout ! usually column major in Fortran
type(cal_comm) :: comm
type(cublasMpGrid), intent(out) :: grid
2.9.7. cublasMpGridDestroy
This function releases the grid data structure used in the cublasMp library. Starting in version 0.3.0, it no longer takes a handle argument.
type(cublasMpStatus) function cublasMpGridDestroy(grid)
type(cublasMpGrid) :: grid
2.9.8. cublasMpMatrixDescriptorCreate
This function initializes the matrix descriptor object used in the cublasMp library. It takes the number of rows (M) and the number of columns (N) in the global array, along with the blocking factor over each dimension. RSRC and CSRC must currently be 0. LLD is the leading dimension of the local matrix, after blocking and distributing the matrix. Starting in version 0.3.0, it no longer takes a handle argument.
type(cublasMpStatus) function cublasMpMatrixDescriptorCreate(M, N, MB, NB, &
RSRC, CSRC, LLD, dataType, grid, descr)
integer(8) :: M, N, MB, NB, RSRC, CSRC, LLD
type(cudaDataType) :: dataType
type(cublasMpGrid) :: grid
type(cublasMpMatrixDescriptor), intent(out) :: descr
2.9.9. cublasMpMatrixDescriptorDestroy
This function frees the matrix descriptor object used in the cublasMp library. Starting in version 0.3.0, it no longer takes a handle argument.
type(cublasMpStatus) function cublasMpMatrixDescriptorDestroy(descr)
type(cublasMpMatrixDescriptor) :: descr
2.9.10. cublasMpMatrixDescriptorInit
This function initializes the values within the matrix descriptor object used in the cublasMp library. It takes the number of rows (M) and the number of columns (N) in the global array, along with the blocking factor over each dimension. RSRC and CSRC must currently be 0. LLD is the leading dimension of the local matrix, after blocking and distributing the matrix.
type(cublasMpStatus) function cublasMpMatrixDescriptorInit(M, N, MB, NB, &
RSRC, CSRC, LLD, dataType, grid, descr)
integer(8) :: M, N, MB, NB, RSRC, CSRC, LLD
type(cudaDataType) :: dataType
type(cublasMpGrid) :: grid
type(cublasMpMatrixDescriptor), intent(out) :: descr
2.9.11. cublasMpNumroc
This function computes (and returns) the local number of rows or columns of a distributed matrix, similar to the ScaLAPACK NUMROC function.
integer(8) function cublasMpNumroc(N, NB, iproc, isrcproc, nprocs)
integer(8) :: N, NB
integer(4) :: iproc, isrcproc, nprocs
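To illustrate how cublasMpNumroc feeds the descriptor arguments, here is a hedged sketch for a 1024 x 1024 matrix distributed in 128 x 128 blocks over a 2 x 2 grid. The process coordinates myrow and mycol, the grid object, and the CUDA_R_64F datatype constant are assumed to be available from earlier setup.
! Sketch: size the local piece of the global matrix, then create its descriptor.
integer(8) :: locRows, locCols, lld
integer(4) :: myrow, mycol               ! this rank's grid coordinates (assumed known)
type(cublasMpStatus) :: st
type(cublasMpGrid) :: grid               ! from cublasMpGridCreate
type(cublasMpMatrixDescriptor) :: descrA
locRows = cublasMpNumroc(1024_8, 128_8, myrow, 0, 2)
locCols = cublasMpNumroc(1024_8, 128_8, mycol, 0, 2)
lld = max(1_8, locRows)
st = cublasMpMatrixDescriptorCreate(1024_8, 1024_8, 128_8, 128_8, &
         0_8, 0_8, lld, CUDA_R_64F, grid, descrA)   ! CUDA_R_64F assumed available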
2.9.12. cublasMpMatmulDescriptorCreate
This function initializes the matmul descriptor object used in the cublasMp library.
type(cublasMpStatus) function cublasMpMatmulDescriptorCreate(descr, computeType)
type(cublasMpMatmulDescriptor) :: descr
type(cublasComputeType) :: computeType
2.9.13. cublasMpMatmulDescriptorDestroy
This function destroys the matmul descriptor object used in the cublasMp library.
type(cublasMpStatus) function cublasMpMatmulDescriptorDestroy(descr)
type(cublasMpMatmulDescriptor) :: descr
2.9.14. cublasMpMatmulDescriptorAttributeSet
This function sets attributes within the matmul descriptor object used in the cublasMp library.
type(cublasMpStatus) function cublasMpMatmulDescriptorAttributeSet(descr, attr, &
buf, sizeInBytes)
type(cublasMpMatmulDescriptor) :: descr
type(cublasMpMatmulDescriptorAttribute) :: attr
integer(1) :: buf(sizeInBytes) ! Any type, kind, or rank allowed
integer(8) :: sizeInBytes
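For example, a hedged sketch of creating a matmul descriptor and marking A as transposed; CUBLAS_COMPUTE_64F and CUBLAS_OP_T are assumed to be available from the cublas module definitions.
! Sketch: create a matmul descriptor and set the TRANSA attribute.
type(cublasMpStatus) :: st
type(cublasMpMatmulDescriptor) :: mmDescr
integer(4) :: opA
st  = cublasMpMatmulDescriptorCreate(mmDescr, CUBLAS_COMPUTE_64F)   ! assumed constant
opA = CUBLAS_OP_T                                                   ! assumed constant
st  = cublasMpMatmulDescriptorAttributeSet(mmDescr, &
          CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSA, opA, 4_8)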
2.9.15. cublasMpMatmulDescriptorAttributeGet
This function retrieves attributes within the matmul descriptor object used in the cublasMp library.
type(cublasMpStatus) function cublasMpMatmulDescriptorAttributeGet(descr, attr, &
buf, sizeInBytes, sizeWritten)
type(cublasMpMatmulDescriptor) :: descr
type(cublasMpMatmulDescriptorAttribute) :: attr
integer(1) :: buf(sizeInBytes) ! Any type, kind, or rank allowed
integer(8) :: sizeInBytes, sizeWritten
2.9.16. cublasMpGemr2D_bufferSize
This function computes the workspace requirements of cublasMpGemr2D.
type(cublasMpStatus) function cublasMpGemr2D_bufferSize(handle, M, N, &
A, IA, JA, descrA, B, IB, JB, descrB, &
devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes, comm)
type(cublasMpHandle) :: handle
integer(8), intent(in) :: M, N, IA, JA, IB, JB
real(4), device, dimension(*) :: A, B ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
type(cal_comm) :: comm
2.9.17. cublasMpGemr2D
This function copies a matrix from one distributed form to another. The layout of each matrix is defined in the matrix descriptor. M and N are the global matrix dimensions. IA, JA, IB, and JB are 1-based, and typically equal to 1 for a full matrix.
type(cublasMpStatus) function cublasMpGemr2D(handle, M, N, &
A, IA, JA, descrA, B, IB, JB, descrB, &
bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes, comm)
type(cublasMpHandle) :: handle
integer(8), intent(in) :: M, N, IA, JA, IB, JB
real(4), device, dimension(*) :: A, B ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
type(cal_comm) :: comm
2.9.18. cublasMpTrmr2D_bufferSize
This function computes the workspace requirements of cublasMpTrmr2D.
type(cublasMpStatus) function cublasMpTrmr2D_bufferSize(handle, uplo, diag, &
M, N, A, IA, JA, descrA, B, IB, JB, descrB, &
devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes, comm)
type(cublasMpHandle) :: handle
integer(4), intent(in) :: uplo, diag
integer(8), intent(in) :: M, N, IA, JA, IB, JB
real(4), device, dimension(*) :: A, B ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
type(cal_comm) :: comm
2.9.19. cublasMpTrmr2D
This function copies a trapezoidal matrix from one distributed form to another. The layout of each matrix is defined in the matrix descriptor. M and N are the global matrix dimensions. IA, JA, IB, and JB are 1-based, and typically equal to 1 for a full matrix.
type(cublasMpStatus) function cublasMpTrmr2D(handle, uplo, diag, &
M, N, A, IA, JA, descrA, B, IB, JB, descrB, &
bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes, comm)
type(cublasMpHandle) :: handle
integer(4), intent(in) :: uplo, diag
integer(8), intent(in) :: M, N, IA, JA, IB, JB
real(4), device, dimension(*) :: A, B ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
type(cal_comm) :: comm
2.9.20. cublasMpGemm_bufferSize
This function computes the workspace requirements of cublasMpGemm.
type(cublasMpStatus) function cublasMpGemm_bufferSize(handle, transA, transB, M, N, K, &
alpha, A, IA, JA, descrA, B, IB, JB, descrB, beta, C, IC, JC, descrC, &
computeType, devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: transA, transB
integer(8), intent(in) :: M, N, K, IA, JA, IB, JB, IC, JC
real(4) :: alpha, beta ! type and kind compatible with computeType
real(4), device, dimension(*) :: A, B, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB, descrC
type(cublasComputeType) :: computeType
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
2.9.21. cublasMpGemm
This is the multi-processor version of the BLAS GEMM operation, similar to the ScaLAPACK PBLAS functions pdgemm, pzgemm, etc.
GEMM performs one of the matrix-matrix operations
C := alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X or op( X ) = X**T,
alpha and beta are scalars, and A, B, and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C an m by n matrix. The data for A, B, and C should be properly distributed over the process grid. That mapping is contained within the descriptors descrA, descrB, and descrC via the cublasMpMatrixDescriptorCreate() function. The datatype is also specified there. M, N, and K are the global matrix dimensions. IA, JA, IB, JB, IC, and JC are 1-based, and typically equal to 1 for a full matrix. Integer(4) input values will be promoted to integer(8) according to the interface.
type(cublasMpStatus) function cublasMpGemm(handle, transA, transB, M, N, K, &
alpha, A, IA, JA, descrA, B, IB, JB, descrB, beta, C, IC, JC, descrC, &
computeType, bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: transA, transB
integer(8), intent(in) :: M, N, K, IA, JA, IB, JB, IC, JC
real(4) :: alpha, beta ! type and kind compatible with computeType
real(4), device, dimension(*) :: A, B, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB, descrC
type(cublasComputeType) :: computeType
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
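For illustration, a hedged sketch of the full call sequence follows. The handle, grid, descriptors, and the distributed local device arrays A, B, and C (sized with cublasMpNumroc) are assumed to have been set up as described above; CUBLAS_OP_N and CUBLAS_COMPUTE_64F are assumed to come from the cublas module.
! Sketch: query workspace sizes, allocate the buffers, and run the distributed GEMM.
type(cublasMpHandle) :: handle                    ! from cublasMpCreate (assumed)
type(cublasMpMatrixDescriptor) :: descrA, descrB, descrC
integer(8) :: M, N, K, devWs, hostWs
real(8) :: alpha, beta
real(8), device, allocatable :: A(:), B(:), C(:)  ! local pieces of the global matrices
integer(1), device, allocatable :: dWork(:)
integer(1), allocatable :: hWork(:)
type(cublasMpStatus) :: st
alpha = 1.0d0 ; beta = 0.0d0
st = cublasMpGemm_bufferSize(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &
        alpha, A, 1_8, 1_8, descrA, B, 1_8, 1_8, descrB, beta, &
        C, 1_8, 1_8, descrC, CUBLAS_COMPUTE_64F, devWs, hostWs)
allocate(dWork(max(devWs,1_8)), hWork(max(hostWs,1_8)))
st = cublasMpGemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &
        alpha, A, 1_8, 1_8, descrA, B, 1_8, 1_8, descrB, beta, &
        C, 1_8, 1_8, descrC, CUBLAS_COMPUTE_64F, &
        dWork, devWs, hWork, hostWs)
deallocate(dWork, hWork)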
2.9.22. cublasMpMatmul_bufferSize
This function computes the workspace requirements of cublasMpMatmul.
type(cublasMpStatus) function cublasMpMatmul_bufferSize(handle, matmulDescr, M, N, K, &
alpha, A, IA, JA, descrA, B, IB, JB, descrB, beta, C, IC, JC, descrC, &
D, ID, JD, descrD, devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
type(cublasMpMatmulDescriptor) :: matmulDescr
integer(8), intent(in) :: M, N, K, IA, JA, IB, JB, IC, JC, ID, JD
real(4) :: alpha, beta ! Any compatible kind
real(4), device, dimension(*) :: A, B, C, D ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB, descrC, descrD
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
2.9.23. cublasMpMatmul
This is the multi-processor version of the matrix multiplication operation.
Matmul performs one of the matrix-matrix operations
D := alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X or op( X ) = X**T, as set by a call to cublasMpMatmulDescriptorAttributeSet().
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C and D m by n matrices. The data for A, B, C, and D should be properly distributed over the process grid. That mapping is contained within the descriptors descrA, descrB, descrC, and descrD via the cublasMpMatrixDescriptorCreate() function. The datatype is also specified there. M, N, and K are the global matrix dimensions. IA, JA, IB, JB, IC, JC, ID, and JD are 1-based, and typically equal to 1 for a full matrix. Integer(4) input values will be promoted to integer(8) according to the interface.
type(cublasMpStatus) function cublasMpMatmul(handle, matmulDescr, M, N, K, &
alpha, A, IA, JA, descrA, B, IB, JB, descrB, beta, C, IC, JC, descrC, &
D, ID, JD, descrD, bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
type(cublasMpMatmulDescriptor) :: matmulDescr
integer(8), intent(in) :: M, N, K, IA, JA, IB, JB, IC, JC, ID, JD
real(4) :: alpha, beta ! Any supported type and kind
real(4), device, dimension(*) :: A, B, C, D ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB, descrC, descrD
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
2.9.24. cublasMpSyrk_bufferSize
This function computes the workspace requirements of cublasMpSyrk.
type(cublasMpStatus) function cublasMpSyrk_bufferSize(handle, uplo, trans, &
N, K, alpha, A, IA, JA, descrA, beta, C, IC, JC, descrC, &
computeType, devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: uplo, trans
integer(8), intent(in) :: N, K, IA, JA, IC, JC
real(4) :: alpha, beta ! type and kind compatible with computeType
real(4), device, dimension(*) :: A, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrC
type(cublasComputeType) :: computeType
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
2.9.25. cublasMpSyrk
This is the multi-processor version of the BLAS SYRK operation, similar to the ScaLAPACK PBLAS functions pdsyrk, pzsyrk, etc.
SYRK performs one of the symmetric rank k operations
C := alpha*A*A**T + beta*C, or
C := alpha*A**T*A + beta*C
alpha and beta are scalars, and A and C are matrices. A is either N x K or K x N depending on the trans argument, and C is N x N. The data for A and C should be properly distributed over the process grid. That mapping is contained within the descriptors descrA and descrC via the cublasMpMatrixDescriptorCreate() function. The datatype is also specified then. N and K are the global matrix dimensions. IA, JA, IC, and JC are 1-based, and typically equal to 1 for a full matrix. Integer(4) input values will be promoted to integer(8) according to the interface.
type(cublasMpStatus) function cublasMpSyrk(handle, uplo, trans, &
N, K, alpha, A, IA, JA, descrA, beta, C, IC, JC, descrC, &
computeType, bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: uplo, trans
integer(8), intent(in) :: N, K, IA, JA, IC, JC
real(4) :: alpha, beta ! type and kind compatible with computeType
real(4), device, dimension(*) :: A, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrC
type(cublasComputeType) :: computeType
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
2.9.26. cublasMpTrsm_bufferSize
This function computes the workspace requirements of cublasMpTrsm.
type(cublasMpStatus) function cublasMpTrsm_bufferSize(handle, side, uplo, trans, diag, &
M, N, alpha, A, IA, JA, descrA, B, IB, JB, descrB, &
computeType, devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: side, uplo, trans, diag
integer(8), intent(in) :: M, N, IA, JA, IB, JB
real(4) :: alpha ! type and kind compatible with computeType
real(4), device, dimension(*) :: A, B ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB
type(cublasComputeType) :: computeType
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
2.9.27. cublasMpTrsm
This is the multi-processor version of the BLAS TRSM operation, similar to the ScaLAPACK PBLAS functions pdtrsm, pztrsm, etc.
TRSM solves one of the matrix equations
op( A )*X = alpha*B, or
X*op( A ) = alpha*B
alpha is a scalar, A and B are matrices whose dimensions are determined by the side argument. The data for A and B should be properly distributed over the process grid. That mapping is contained within the descriptors descrA and descrB via the cublasMpMatrixDescriptorCreate() function. The datatype is also specified then. M and N are the global matrix dimensions. IA, JA, IB, and JB are 1-based, and typically equal to 1 for a full matrix. Integer(4) input values will be promoted to integer(8) according to the interface.
type(cublasMpStatus) function cublasMpTrsm(handle, side, uplo, trans, diag, &
M, N, alpha, A, IA, JA, descrA, B, IB, JB, descrB, &
computeType, bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: side, uplo, trans, diag
integer(8), intent(in) :: M, N, IA, JA, IB, JB
real(4) :: alpha ! type and kind compatible with computeType
real(4), device, dimension(*) :: A, B ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrB
type(cublasComputeType) :: computeType
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
2.9.28. cublasMpGeadd_bufferSize
This function computes the workspace requirements of cublasMpGeadd.
type(cublasMpStatus) function cublasMpGeadd_bufferSize(handle, trans, &
M, N, alpha, A, IA, JA, descrA, beta, C, IC, JC, descrC, &
devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: trans
integer(8), intent(in) :: M, N, IA, JA, IC, JC
real(4) :: alpha, beta ! Any compatible kind
real(4), device, dimension(*) :: A, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrC
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
2.9.29. cublasMpGeadd
This is the multi-processor version of a general matrix addition function.
GEADD performs the matrix-matrix addition operation
C := alpha*A + beta*C
alpha and beta are scalars, and A and C are matrices. A is either M x N or N x M depending on the trans argument, and C is M x N. The data for A and C should be properly distributed over the process grid. That mapping is contained within the descriptors descrA and descrC via the cublasMpMatrixDescriptorCreate() function. The datatype is also specified then. M and N are the global matrix dimensions. IA, JA, IC, and JC are 1-based, and typically equal to 1 for a full matrix. Integer(4) input values will be promoted to integer(8) according to the interface.
type(cublasMpStatus) function cublasMpGeadd(handle, trans, &
M, N, alpha, A, IA, JA, descrA, beta, C, IC, JC, descrC, &
bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: trans
integer(8), intent(in) :: M, N, IA, JA, IC, JC
real(4) :: alpha, beta ! Any compatible type and kind
real(4), device, dimension(*) :: A, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrC
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
2.9.30. cublasMpTradd_bufferSize
This function computes the workspace requirements of cublasMpTradd.
type(cublasMpStatus) function cublasMpTradd_bufferSize(handle, uplo, trans, &
M, N, alpha, A, IA, JA, descrA, beta, C, IC, JC, descrC, &
devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: uplo, trans
integer(8), intent(in) :: M, N, IA, JA, IC, JC
real(4) :: alpha, beta ! Any compatible kind
real(4), device, dimension(*) :: A, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrC
integer(8), intent(out) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
2.9.31. cublasMpTradd
This is the multi-processor version of a trapezoidal matrix addition function.
TRADD performs the trapezoidal matrix-matrix addition operation
C := alpha*A + beta*C
alpha and beta are scalars, and A and C are matrices. A is either M x N or N x M depending on the trans argument, and C is M x N. The data for A and C should be properly distributed over the process grid. That mapping is contained within the descriptors descrA and descrC via the cublasMpMatrixDescriptorCreate() function. The datatype is also specified then. M and N are the global matrix dimensions. IA, JA, IC, and JC are 1-based, and typically equal to 1 for a full matrix. Integer(4) input values will be promoted to integer(8) according to the interface.
type(cublasMpStatus) function cublasMpTradd(handle, uplo, trans, &
M, N, alpha, A, IA, JA, descrA, beta, C, IC, JC, descrC, &
bufferOnDevice, devWorkspaceSizeInBytes, &
bufferOnHost, hostWorkspaceSizeInBytes)
type(cublasMpHandle) :: handle
integer(4) :: uplo, trans
integer(8), intent(in) :: M, N, IA, JA, IC, JC
real(4) :: alpha, beta ! Any compatible type and kind
real(4), device, dimension(*) :: A, C ! Any supported type and kind
type(cublasMpMatrixDescriptor) :: descrA, descrC
integer(8), intent(in) :: devWorkspaceSizeInBytes, hostWorkspaceSizeInBytes
integer(1), device :: bufferOnDevice(devWorkspaceSizeInBytes) ! Any type
integer(1) :: bufferOnHost(hostWorkspaceSizeInBytes) ! Any type
2.9.32. cublasMpLoggerSetFile
This function specifies the Fortran unit to be used as the cublasMp logfile.
type(cublasMpStatus) function cublasMpLoggerSetFile(unit)
integer :: unit
2.9.33. cublasMpLoggerOpenFile
This function specifies a Fortran character string to be opened and used as the cublasMp logfile.
type(cublasMpStatus) function cublasMpLoggerOpenFile(logFile)
character*(*) :: logFile
2.9.34. cublasMpLoggerSetLevel
This function specifies the cublasMp logging level.
type(cublasMpStatus) function cublasMpLoggerSetLevel(level)
integer :: level
2.9.35. cublasMpLoggerSetMask
This function specifies the cublasMp logging mask.
type(cublasMpStatus) function cublasMpLoggerSetMask(mask)
integer :: mask
2.9.36. cublasMpLoggerForceDisable
This function disables cublasMp logging.
type(cublasMpStatus) function cublasMpLoggerForceDisable()
3. FFT Runtime Library APIs
This section describes the Fortran interfaces to the cuFFT library. The FFT functions are only accessible from host code. All of the runtime API routines are integer functions that return an error code; they return a value of CUFFT_SUCCESS if the call was successful, or another cuFFT status return value if there was an error.
Chapter 10 contains examples of accessing the cuFFT library routines from OpenACC and CUDA Fortran. In both cases, the interfaces to the library can be exposed by adding the line
use cufft
to your program unit.
Beginning with our 21.9 release, we also support a cufftXt module, which provides interfaces to the multi-gpu support available in the cuFFT library. These interfaces can be used within any Fortran program by adding the line
use cufftxt
to your program unit. The cufftXt interfaces are documented beginning in section 4 of this chapter.
Unless a specific kind is provided in the following interfaces, the plain integer type implies integer(4) and the plain real type implies real(4).
3.1. CUFFT Definitions and Helper Functions
This section contains definitions and data types used in the cuFFT library and interfaces to the cuFFT helper functions.
The cuFFT module contains the following constants and enumerations:
integer, parameter :: CUFFT_FORWARD = -1
integer, parameter :: CUFFT_INVERSE = 1
! CUFFT Status
enum, bind(C)
enumerator :: CUFFT_SUCCESS = 0
enumerator :: CUFFT_INVALID_PLAN = 1
enumerator :: CUFFT_ALLOC_FAILED = 2
enumerator :: CUFFT_INVALID_TYPE = 3
enumerator :: CUFFT_INVALID_VALUE = 4
enumerator :: CUFFT_INTERNAL_ERROR = 5
enumerator :: CUFFT_EXEC_FAILED = 6
enumerator :: CUFFT_SETUP_FAILED = 7
enumerator :: CUFFT_INVALID_SIZE = 8
enumerator :: CUFFT_UNALIGNED_DATA = 9
end enum
! CUFFT Transform Types
enum, bind(C)
enumerator :: CUFFT_R2C = z'2a' ! Real to Complex (interleaved)
enumerator :: CUFFT_C2R = z'2c' ! Complex (interleaved) to Real
enumerator :: CUFFT_C2C = z'29' ! Complex to Complex, interleaved
enumerator :: CUFFT_D2Z = z'6a' ! Double to Double-Complex
enumerator :: CUFFT_Z2D = z'6c' ! Double-Complex to Double
enumerator :: CUFFT_Z2Z = z'69' ! Double-Complex to Double-Complex
end enum
! CUFFT Data Layouts
enum, bind(C)
enumerator :: CUFFT_COMPATIBILITY_NATIVE = 0
enumerator :: CUFFT_COMPATIBILITY_FFTW_PADDING = 1
enumerator :: CUFFT_COMPATIBILITY_FFTW_ASYMMETRIC = 2
enumerator :: CUFFT_COMPATIBILITY_FFTW_ALL = 3
end enum
integer, parameter :: CUFFT_COMPATIBILITY_DEFAULT = CUFFT_COMPATIBILITY_FFTW_PADDING
3.1.1. cufftSetCompatibilityMode
This function configures the layout of cuFFT output in FFTW-compatible modes.
integer(4) function cufftSetCompatibilityMode( plan, mode )
integer :: plan
integer :: mode
3.1.2. cufftSetStream
This function sets the stream to be used by the cuFFT library to execute its routines.
integer(4) function cufftSetStream(plan, stream)
integer :: plan
integer(kind=cuda_stream_kind) :: stream
3.1.3. cufftGetVersion
This function returns the version number of cuFFT.
integer(4) function cufftGetVersion( version )
integer :: version
3.1.4. cufftSetAutoAllocation
This function indicates that the caller intends to allocate and manage work areas for plans that have been generated. cuFFT default behavior is to allocate the work area at plan generation time. If cufftSetAutoAllocation() has been called with autoAllocate set to 0 prior to one of the cufftMakePlan*() calls, cuFFT does not allocate the work area. This is the preferred sequence for callers wishing to manage work area allocation.
integer(4) function cufftSetAutoAllocation(plan, autoAllocate)
integer(4) :: plan, autoallocate
3.1.5. cufftSetWorkArea
This function overrides the work area pointer associated with a plan. If the work area was auto-allocated, cuFFT frees the auto-allocated space. The cufftExecute*() calls assume that the work area pointer is valid and that it points to a contiguous region in device memory that does not overlap with any other work area. If this is not the case, results are indeterminate.
integer(4) function cufftSetWorkArea(plan, workArea)
integer(4) :: plan
integer, device :: workArea(*) ! Can be integer, real, complex
! or a type(c_devptr)
3.1.6. cufftDestroy
This function frees all GPU resources associated with a cuFFT plan and destroys the internal plan data structure.
integer(4) function cufftDestroy( plan )
integer :: plan
3.2. CUFFT Plans and Estimated Size Functions
This section contains functions from the cuFFT library used to create plans and estimate work buffer size.
3.2.1. cufftPlan1d
This function creates a 1D FFT plan configuration for a specified signal size and data type. Nx is the size of the transform; batch is the number of transforms of size nx.
integer(4) function cufftPlan1d(plan, nx, ffttype, batch)
integer :: plan
integer :: nx
integer :: ffttype
integer :: batch
3.2.2. cufftPlan2d
This function creates a 2D FFT plan configuration according to a specified signal size and data type. For a Fortran array(nx,ny), nx is the size of the 1st dimension in the transform, but the 2nd size argument to the function; ny is the size of the 2nd dimension, and the 1st size argument to the function.
integer(4) function cufftPlan2d( plan, ny, nx, ffttype )
integer :: plan
integer :: ny, nx
integer :: ffttype
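For example, a minimal CUDA Fortran sketch of an in-place forward transform of an array declared as a_d(nx,ny); note the reversed (ny, nx) argument order, and that the execution call (Section 3.3) is shown here for completeness.
program fft2d_sketch
use cufft
use cudafor
implicit none
integer :: plan, ierr, nx, ny
complex(4), device, allocatable :: a_d(:,:)
nx = 256 ; ny = 128
allocate(a_d(nx,ny))
a_d = (1.0, 0.0)                              ! placeholder data
ierr = cufftPlan2d(plan, ny, nx, CUFFT_C2C)   ! ny first, then nx
ierr = cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD)
ierr = cufftDestroy(plan)
end program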
3.2.3. cufftPlan3d
This function creates a 3D FFT plan configuration according to a specified signal size and data type. For a Fortran array(nx,ny,nz), nx is the size of the 1st dimension in the transform, but the 3rd size argument to the function; nz is the size of the 3rd dimension, and the 1st size argument to the function.
integer(4) function cufftPlan3d( plan, nz, ny, nx, ffttype )
integer :: plan
integer :: nz, ny, nx
integer :: ffttype
3.2.4. cufftPlanMany
This function creates an FFT plan configuration of dimension rank, with sizes specified in the array n. Batch is the number of transforms to configure. This function supports more complicated input and output data layouts using the arguments inembed, istride, idist, onembed, ostride, and odist. In the C function, if inembed and onembed are set to NULL, all other stride information is ignored. Fortran programmers can pass NULL when using the NVIDIA cufft module by setting an F90 pointer to null(), either through direct assignment, using c_f_pointer() with c_null_ptr as the first argument, or the nullify statement, then passing the nullified F90 pointer as the actual argument for the inembed and onembed dummies.
integer(4) function cufftPlanMany(plan, rank, n, inembed, istride, idist, onembed, ostride, odist, ffttype, batch )
integer :: plan
integer :: rank
integer :: n(*)
integer :: inembed(*), onembed(*)
integer :: istride, idist, ostride, odist
integer :: ffttype, batch
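As a sketch of the NULL-passing technique described above, nullified pointers can be supplied for inembed and onembed; the sizes and batch count below are illustrative.
! Sketch: 8 batched 1024-point C2C transforms with the default contiguous
! layout; nullified pointers stand in for NULL in C.
integer :: plan, ierr
integer :: n(1)
integer, pointer :: inembed(:), onembed(:)
n(1) = 1024
nullify(inembed, onembed)
ierr = cufftPlanMany(plan, 1, n, inembed, 1, 1024, &
                     onembed, 1, 1024, CUFFT_C2C, 8)
ierr = cufftDestroy(plan)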
3.2.5. cufftCreate
This function creates an opaque handle for further cuFFT calls and allocates some small data structures on the host. In C, the handle type is currently typedef’ed to an int, so in Fortran we use an integer*4 to hold the plan.
integer(4) function cufftCreate(plan)
integer(4) :: plan
3.2.6. cufftMakePlan1d
Following a call to cufftCreate(), this function creates a 1D FFT plan configuration for a specified signal size and data type. Nx is the size of the transform; batch is the number of transforms of size nx. If cufftXtSetGPUs was called prior to this call with multiple GPUs, then workSize is an array containing multiple sizes. The workSize values are in bytes.
integer(4) function cufftMakePlan1d(plan, nx, ffttype, batch, worksize)
integer(4) :: plan
integer(4) :: nx
integer(4) :: ffttype
integer(4) :: batch
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.7. cufftMakePlan2d
Following a call to cufftCreate(), this function creates a 2D FFT plan configuration according to a specified signal size and data type. For a Fortran array(nx,ny), nx is the size of the 1st dimension in the transform, but the 2nd size argument to the function; ny is the size of the 2nd dimension, and the 1st size argument to the function. If cufftXtSetGPUs was called prior to this call with multiple GPUs, then workSize is an array containing multiple sizes. The workSize values are in bytes.
integer(4) function cufftMakePlan2d(plan, ny, nx, ffttype, workSize)
integer(4) :: plan
integer(4) :: ny, nx
integer(4) :: ffttype
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.8. cufftMakePlan3d
Following a call to cufftCreate(), this function creates a 3D FFT plan configuration according to a specified signal size and data type. For a Fortran array(nx,ny,nz), nx is the size of the 1st dimension in the transform, but the 3rd size argument to the function; nz is the size of the 3rd dimension, and the 1st size argument to the function. If cufftXtSetGPUs was called prior to this call with multiple GPUs, then workSize is an array containing multiple sizes. The workSize values are in bytes.
integer(4) function cufftMakePlan3d(plan, nz, ny, nx, ffttype, workSize)
integer(4) :: plan
integer(4) :: nz, ny, nx
integer(4) :: ffttype
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.9. cufftMakePlanMany
Following a call to cufftCreate(), this function creates an FFT plan configuration of dimension rank, with sizes specified in the array n. Batch is the number of transforms to configure. This function supports more complicated input and output data layouts using the arguments inembed, istride, idist, onembed, ostride, and odist.
In the C function, if inembed and onembed are set to NULL, all other stride information is ignored. Fortran programmers can pass NULL when using the NVIDIA cufft module by setting an F90 pointer to null(), either through direct assignment, using c_f_pointer() with c_null_ptr as the first argument, or the nullify statement, then passing the nullified F90 pointer as the actual argument for the inembed and onembed dummies.
If cufftXtSetGPUs was called prior to this call with multiple GPUs, then workSize is an array containing multiple sizes. The workSize values are in bytes.
integer(4) function cufftMakePlanMany(plan, rank, n, inembed, istride, idist, onembed, ostride, odist, ffttype, batch, workSize)
integer(4) :: plan
integer(4) :: rank
integer :: n(rank)
integer :: inembed(rank), onembed(rank)
integer(4) :: istride, idist, ostride, odist
integer(4) :: ffttype, batch
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.10. cufftEstimate1d
This function returns an estimate for the size of the work area required, in bytes, given the specified size and data type, and assuming default plan settings.
integer(4) function cufftEstimate1d(nx, ffttype, batch, workSize)
integer(4) :: nx
integer(4) :: ffttype
integer(4) :: batch
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.11. cufftEstimate2d
This function returns an estimate for the size of the work area required, in bytes, given the specified size and data type, and assuming default plan settings.
integer(4) function cufftEstimate2d(ny, nx, ffttype, workSize)
integer(4) :: ny, nx
integer(4) :: ffttype
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.12. cufftEstimate3d
This function returns an estimate for the size of the work area required, in bytes, given the specified size and data type, and assuming default plan settings.
integer(4) function cufftEstimate3d(nz, ny, nx, ffttype, workSize)
integer(4) :: nz, ny, nx
integer(4) :: ffttype
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.13. cufftEstimateMany
This function returns an estimate for the size of the work area required, in bytes, given the specified size and data type, and assuming default plan settings.
integer(4) function cufftEstimateMany(rank, n, inembed, istride, idist, onembed, ostride, odist, ffttype, batch, workSize)
integer(4) :: rank, istride, idist, ostride, odist
integer(4), dimension(rank) :: n, inembed, onembed
integer(4) :: ffttype
integer(4) :: batch
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.14. cufftGetSize1d
This function gives a more accurate estimate than cufftEstimate1d() of the size of the work area required, in bytes, given the specified plan parameters and taking into account any plan settings which may have been made.
integer(4) function cufftGetSize1d(plan, nx, ffttype, batch, workSize)
integer(4) :: plan, nx, ffttype, batch
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.15. cufftGetSize2d
This function gives a more accurate estimate than cufftEstimate2d() of the size of the work area required, in bytes, given the specified plan parameters and taking into account any plan settings which may have been made.
integer(4) function cufftGetSize2d(plan, ny, nx, ffttype, workSize)
integer(4) :: plan, ny, nx, ffttype
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.16. cufftGetSize3d
This function gives a more accurate estimate than cufftEstimate3d() of the size of the work area required, in bytes, given the specified plan parameters and taking into account any plan settings which may have been made.
integer(4) function cufftGetSize3d(plan, nz, ny, nx, ffttype, workSize)
integer(4) :: plan, nz, ny, nx, ffttype
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.17. cufftGetSizeMany
This function gives a more accurate estimate than cufftEstimateMany() of the size of the work area required, in bytes, given the specified plan parameters and taking into account any plan settings which may have been made.
integer(4) function cufftGetSizeMany(plan, rank, n, inembed, istride, idist, onembed, ostride, odist, ffttype, batch, workSize)
integer(4) :: plan, rank, istride, idist, ostride, odist
integer(4), dimension(rank) :: n, inembed, onembed
integer(4) :: ffttype
integer(4) :: batch
integer(kind=int_ptr_kind()) :: workSize(*)
3.2.18. cufftGetSize
Once plan generation has been done, either with the original API or the extensible API, this call returns the actual size of the work area required, in bytes, to support the plan. Callers who choose to manage work area allocation within their application must use this call after plan generation, and after any cufftSet*() calls subsequent to plan generation, if those calls might alter the required work space size.
integer(4) function cufftGetSize(plan, workSize)
integer(4) :: plan
integer(kind=int_ptr_kind()) :: workSize(*)
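A hedged sketch of that caller-managed sequence, using the extensible API routines described earlier in this section (sizes and types are illustrative):
! Sketch: manage the work area manually for a 1D 1024-point C2C plan.
integer :: plan, ierr
integer(kind=int_ptr_kind()) :: wsize(1)
integer(4), device, allocatable :: work(:)
ierr = cufftCreate(plan)
ierr = cufftSetAutoAllocation(plan, 0)              ! caller will allocate the work area
ierr = cufftMakePlan1d(plan, 1024, CUFFT_C2C, 1, wsize)
ierr = cufftGetSize(plan, wsize)                    ! actual bytes needed
allocate(work((wsize(1) + 3) / 4))                  ! integer(4) elements
ierr = cufftSetWorkArea(plan, work)
! ... cufftExecC2C calls using this plan ...
ierr = cufftDestroy(plan)
deallocate(work)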
3.3. CUFFT Execution Functions
This section contains the execution functions, which perform the actual Fourier transform, in the cuFFT library.
3.3.1. cufftExecC2C
This function executes a single precision complex-to-complex transform plan in the transform direction as specified by the direction parameter. If idata and odata are the same, this function does an in-place transform.
integer(4) function cufftExecC2C( plan, idata, odata, direction )
integer :: plan
complex(4), device, dimension(*) :: idata, odata
integer :: direction
3.3.2. cufftExecR2C
This function executes a single precision real-to-complex, implicitly forward, cuFFT transform plan. If idata and odata are the same, this function does an in-place transform, but note there are data layout differences between in-place and out-of-place transforms for real-to-complex FFTs in cuFFT.
integer(4) function cufftExecR2C( plan, idata, odata )
integer :: plan
real(4), device, dimension(*) :: idata
complex(4), device, dimension(*) :: odata
3.3.3. cufftExecC2R
This function executes a single precision complex-to-real, implicitly inverse, cuFFT transform plan. If idata and odata are the same, this function does an in-place transform.
integer(4) function cufftExecC2R( plan, idata, odata )
integer :: plan
complex(4), device, dimension(*) :: idata
real(4), device, dimension(*) :: odata
3.3.4. cufftExecZ2Z
This function executes a double precision complex-to-complex transform plan in the transform direction as specified by the direction parameter. If idata and odata are the same, this function does an in-place transform.
integer(4) function cufftExecZ2Z( plan, idata, odata, direction )
integer :: plan
complex(8), device, dimension(*) :: idata, odata
integer :: direction
3.3.5. cufftExecD2Z
This function executes a double precision real-to-complex, implicitly forward, cuFFT transform plan. If idata and odata are the same, this function does an in-place transform, but note there are data layout differences between in-place and out-of-place transforms for real-to-complex FFTs in cuFFT.
integer(4) function cufftExecD2Z( plan, idata, odata )
integer :: plan
real(8), device, dimension(*) :: idata
complex(8), device, dimension(*) :: odata
3.3.6. cufftExecZ2D
This function executes a double precision complex-to-real, implicitly inverse, cuFFT transform plan. If idata and odata are the same, this function does an in-place transform.
integer(4) function cufftExecZ2D( plan, idata, odata )
integer :: plan
complex(8), device, dimension(*) :: idata
real(8), device, dimension(*) :: odata
3.4. CUFFTXT Definitions and Helper Functions
This section contains definitions and data types used in the cufftXt library and interfaces to helper functions. Beginning with NVHPC version 22.5, this module also contains some interfaces and definitions used with the cuFFTMp library.
The cufftXt module contains the following constants and enumerations:
integer, parameter :: MAX_CUDA_DESCRIPTOR_GPUS = 64
! libFormat enum is used for the library member of cudaLibXtDesc
enum, bind(C)
enumerator :: LIB_FORMAT_CUFFT = 0
enumerator :: LIB_FORMAT_UNDEFINED = 1
end enum
! cufftXtSubFormat identifies the data layout of a memory descriptor
enum, bind(C)
! by default input is in linear order across GPUs
enumerator :: CUFFT_XT_FORMAT_INPUT = 0
! by default output is in scrambled order depending on transform
enumerator :: CUFFT_XT_FORMAT_OUTPUT = 1
! by default inplace is input order, which is linear across GPUs
enumerator :: CUFFT_XT_FORMAT_INPLACE = 2
! shuffled output order after execution of the transform
enumerator :: CUFFT_XT_FORMAT_INPLACE_SHUFFLED = 3
! shuffled input order prior to execution of 1D transforms
enumerator :: CUFFT_XT_FORMAT_1D_INPUT_SHUFFLED = 4
! distributed input order
enumerator :: CUFFT_XT_FORMAT_DISTRIBUTED_INPUT = 5
! distributed output order
enumerator :: CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT = 6
enumerator :: CUFFT_FORMAT_UNDEFINED = 7
end enum
! cufftXtCopyType specifies the type of copy for cufftXtMemcpy
enum, bind(C)
enumerator :: CUFFT_COPY_HOST_TO_DEVICE = 0
enumerator :: CUFFT_COPY_DEVICE_TO_HOST = 1
enumerator :: CUFFT_COPY_DEVICE_TO_DEVICE = 2
enumerator :: CUFFT_COPY_UNDEFINED = 3
end enum
! cufftXtQueryType specifies the type of query for cufftXtQueryPlan
enum, bind(c)
enumerator :: CUFFT_QUERY_1D_FACTORS = 0
enumerator :: CUFFT_QUERY_UNDEFINED = 1
end enum
! cufftXtWorkAreaPolicy specifies the policy for cufftXtSetWorkAreaPolicy
enum, bind(c)
enumerator :: CUFFT_WORKAREA_MINIMAL = 0 ! maximum reduction
enumerator :: CUFFT_WORKAREA_USER = 1 ! use workSize parameter as limit
enumerator :: CUFFT_WORKAREA_PERFORMANCE = 2 ! default - 1x overhead or more, max perf
end enum
! cufftMpCommType specifies how to initialize cuFFTMp
enum, bind(c)
enumerator :: CUFFT_COMM_MPI = 0
enumerator :: CUFFT_COMM_NVSHMEM = 1
enumerator :: CUFFT_COMM_UNDEFINED = 2
end enum
The cufftXt module contains the following derived type definitions:
! cufftXt1dFactors type
type, bind(c) :: cufftXt1dFactors
integer(8) :: size
integer(8) :: stringCount
integer(8) :: stringLength
integer(8) :: subStringLength
integer(8) :: factor1
integer(8) :: factor2
integer(8) :: stringMask
integer(8) :: subStringMask
integer(8) :: factor1Mask
integer(8) :: factor2Mask
integer(4) :: stringShift
integer(4) :: subStringShift
integer(4) :: factor1Shift
integer(4) :: factor2Shift
end type cufftXt1dFactors
type, bind(C) :: cudaXtDesc
integer(4) :: version
integer(4) :: nGPUs
integer(4) :: GPUs(MAX_CUDA_DESCRIPTOR_GPUS)
type(c_devptr) :: data(MAX_CUDA_DESCRIPTOR_GPUS)
integer(8) :: size(MAX_CUDA_DESCRIPTOR_GPUS)
type(c_ptr) :: cudaXtState
end type cudaXtDesc
type, bind(C) :: cudaLibXtDesc
integer(4) :: version
type(c_ptr) :: descriptor ! cudaXtDesc *descriptor
integer(4) :: library ! libFormat library
integer(4) :: subFormat
type(c_ptr) :: libDescriptor ! void *libDescriptor
end type cudaLibXtDesc
type, bind(C) :: cufftBox3d
integer(8) :: lower(3)
integer(8) :: upper(3)
integer(8) :: strides(3)
end type cufftBox3d
3.4.1. cufftXtSetGPUs
This function identifies which GPUs are to be used with the plan. The call to cufftXtSetGPUs must occur after the call to cufftCreate but before the call to cufftMakePlan*.
integer(4) function cufftXtSetGPUs( plan, nGPUs, whichGPUs )
integer(4) :: plan
integer(4) :: nGPUs
integer(4) :: whichGPUs(*)
3.4.2. cufftXtMalloc
This function allocates a cufftXt descriptor, and memory for data in the GPUs associated with the plan. The value of cufftXtSubFormat determines if the buffer will be used for input or output. Fortran programmers should declare and pass a pointer to a type(cudaLibXtDesc) variable so the entire information can be stored, and also freed in subsequent calls to cufftXtFree. For programmers comfortable with the C interface, a variant of this function can take a type(c_ptr) for the 2nd argument.
integer(4) function cufftXtMalloc( plan, descriptor, format )
integer(4) :: plan
type(cudaLibXtDesc), pointer :: descriptor ! A type(c_ptr) is also accepted.
integer(4) :: format ! cufftXtSubFormat value
3.4.3. cufftXtFree
This function frees the cufftXt descriptor, and all memory associated with it. The descriptor and memory must have been allocated by a previous call to cufftXtMalloc. Fortran programmers should declare and pass a pointer to a type(cudaLibXtDesc) variable. For programmers comfortable with the C interface, a variant of this function can take a type(c_ptr) as the only argument.
integer(4) function cufftXtFree( descriptor )
type(cudaLibXtDesc), pointer :: descriptor ! A type(c_ptr) is also accepted.
3.4.4. cufftXtMemcpy
This function copies data between buffers on the host and GPUs, or between GPUs. The value of the type argument determines the copy direction. In addition, this Fortran function is overloaded to take a type(cudaLibXtDesc) variable for the destination (H2D transfer), for the source (D2H transfer), or for both (D2D transfer), in which case the type argument is not required.
integer(4) function cufftXtMemcpy( plan, dst, src, type )
integer(4) :: plan
type(cudaLibXtDesc) :: dst ! Or any host buffer, depending on the type
type(cudaLibXtDesc) :: src ! Or any host buffer, depending on the type
integer(4) :: type ! optional cufftXtCopyType value
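To illustrate how these helper routines fit together, here is a hedged sketch of a two-GPU, single-precision, in-place 1D transform through the descriptor interface; the device IDs, transform size, and use of the execution routine from Section 3.6 are illustrative assumptions.
! Sketch: two-GPU in-place C2C transform via the cufftXt descriptor flow.
use cufft
use cufftxt
integer :: plan, ierr
integer :: whichgpus(2)
integer(kind=int_ptr_kind()) :: wsizes(2)
complex(4), allocatable :: h(:)
type(cudaLibXtDesc), pointer :: desc
whichgpus = (/ 0, 1 /)
allocate(h(1048576)) ; h = (1.0, 0.0)
ierr = cufftCreate(plan)
ierr = cufftXtSetGPUs(plan, 2, whichgpus)
ierr = cufftMakePlan1d(plan, 1048576, CUFFT_C2C, 1, wsizes)
ierr = cufftXtMalloc(plan, desc, CUFFT_XT_FORMAT_INPLACE)
ierr = cufftXtMemcpy(plan, desc, h, CUFFT_COPY_HOST_TO_DEVICE)
ierr = cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD)
ierr = cufftXtMemcpy(plan, h, desc, CUFFT_COPY_DEVICE_TO_HOST)
ierr = cufftXtFree(desc)
ierr = cufftDestroy(plan)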
3.5. CUFFTXT Plans and Work Area Functions
This section contains functions from the cufftXt library used to create plans and manage work buffers.
3.5.1. cufftXtMakePlanMany
Following a call to cufftCreate(), this function creates an FFT plan configuration of dimension rank, with sizes specified in the array n. Batch is the number of transforms to configure. This function supports more complicated input and output data layouts using the arguments inembed, istride, idist, onembed, ostride, and odist. In the C function, if inembed and onembed are set to NULL, all other stride information is ignored. Fortran programmers can pass NULL when using the NVIDIA cufft module by setting an F90 pointer to null(), either through direct assignment, using c_f_pointer() with c_null_ptr as the first argument, or the nullify statement, then passing the nullified F90 pointer as the actual argument for the inembed and onembed dummies.
integer(4) function cufftXtMakePlanMany(plan, rank, n, inembed, istride, &
idist, inputType, onembed, ostride, odist, outputType, batch, workSize, &
executionType)
integer(4) :: plan
integer(4) :: rank
integer(8) :: n(*)
integer(8) :: inembed(*), onembed(*)
integer(8) :: istride, idist, ostride, odist
type(cudaDataType) :: inputType, outputType, executionType
integer(4) :: batch
integer(8) :: workSize(*)
3.5.2. cufftXtQueryPlan
This function only supports multi-gpu 1D transforms. It returns a derived type, factors, which contains the number of strings, the decomposition of factors, and (in the case of power of 2 sizes) some other useful mask and shift elements, used in converting between permuted and linear indexes.
integer(4) function cufftXtQueryPlan(plan, factors, queryType)
integer(4) :: plan
type(cufftXt1DFactors) :: factors
integer(4) :: queryType
3.5.3. cufftXtSetWorkAreaPolicy
This function overrides the work area associated with a plan. Currently, the workAreaPolicy can be specified as CUFFT_WORKAREA_MINIMAL and cuFFT will attempt to re-plan to use zero bytes of work area memory. See the CUFFT documentation for support of other features.
integer(4) function cufftXtSetWorkAreaPolicy(plan, workAreaPolicy, workSize)
integer(4) :: plan
integer(4) :: workAreaPolicy
integer(8) :: workSize
3.5.4. cufftXtGetSizeMany
This function gives a more accurate estimate than cufftEstimateMany() of the size of the work area required, in bytes, given the specified plan parameters used for cufftXtMakePlanMany and taking into account any plan settings which may have been made.
integer(4) function cufftXtGetSizeMany(plan, rank, n, inembed, istride, &
idist, inputType, onembed, ostride, odist, outputType, batch, workSize, &
executionType)
integer(4) :: plan
integer(4) :: rank
integer(8) :: n(*)
integer(8) :: inembed(*), onembed(*)
integer(8) :: istride, idist, ostride, odist
type(cudaDataType) :: inputType, outputType, executionType
integer(4) :: batch
integer(8) :: workSize(*)
3.5.5. cufftXtSetWorkArea
This function overrides the work areas associated with a plan. If the work area was auto-allocated, cuFFT frees the auto-allocated space. The cufftExecute*() calls assume that the work area pointer is valid and that it points to a contiguous region in device memory that does not overlap with any other work area. If this is not the case, results are indeterminate.
integer(4) function cufftXtSetWorkArea(plan, workArea)
integer(4) :: plan
type(c_devptr) :: workArea(*)
3.5.6. cufftXtSetDistribution
This function registers and describes the data distribution for a subsequent FFT operation. The call to cufftXtSetDistribution must occur after the call to cufftCreate but before the call to cufftMakePlan*.
integer(4) function cufftXtSetDistribution( plan, boxIn, boxOut )
integer(4) :: plan
type(cufftBox3d) :: boxIn
type(cufftBox3d) :: boxOut
3.6. CUFFTXT Execution Functions
This section contains the execution functions, which perform the actual Fourier transform, in the cufftXt library.
3.6.1. cufftXtExec
This function executes any Fourier transform regardless of precision and type. In case of complex-to-real and real-to-complex transforms, the direction argument is ignored. Otherwise, the transform direction is specified by the direction parameter. This function uses the GPU memory pointed to by input as input data, and stores the computed Fourier coefficients in the output array. If those are the same, this method does an in-place transform. Any valid data type for the input and output arrays is accepted.
integer(4) function cufftXtExec( plan, input, output, direction )
integer :: plan
real, dimension(*) :: input, output ! Any data type is allowed
integer :: direction
3.6.2. cufftXtExecDescriptor
This function executes any Fourier transform regardless of precision and type. In case of complex-to-real and real-to-complex transforms, the direction argument is ignored. Otherwise, the transform direction is specified by the direction parameter. This function stores the result in the specified output arrays.
integer(4) function cufftXtExecDescriptor( plan, input, output, direction )
integer :: plan
type(cudaLibXtDesc) :: input, output
integer :: direction
3.6.3. cufftXtExecDescriptorC2C
This function executes a single precision complex-to-complex transform plan in the transform direction as specified by the direction parameter. This multiple GPU function currently supports in-place transforms only; the result will be stored in the input arrays.
integer(4) function cufftXtExecDescriptorC2C( plan, input, output, direction )
integer :: plan
type(cudaLibXtDesc) :: input, output
integer :: direction
3.6.4. cufftXtExecDescriptorZ2Z
This function executes a double precision complex-to-complex transform plan in the transform direction as specified by the direction parameter. This multiple GPU function currently supports in-place transforms only; the result will be stored in the input arrays.
integer(4) function cufftXtExecDescriptorZ2Z( plan, input, output, direction )
integer :: plan
type(cudaLibXtDesc) :: input, output
integer :: direction
3.6.5. cufftXtExecDescriptorR2C
This function executes a single precision real-to-complex transform plan. This multiple GPU function currently supports in-place transforms only; the result will be stored in the input arrays.
integer(4) function cufftXtExecDescriptorR2C( plan, input, output )
integer :: plan
type(cudaLibXtDesc) :: input, output
3.6.6. cufftXtExecDescriptorD2Z
This function executes a double precision real-to-complex transform plan. This multiple GPU function currently supports in-place transforms only; the result will be stored in the input arrays.
integer(4) function cufftXtExecDescriptorD2Z( plan, input, output )
integer :: plan
type(cudaLibXtDesc) :: input, output
3.6.7. cufftXtExecDescriptorC2R
This function executes a single precision complex-to-real transform plan. This multiple GPU function currently supports in-place transforms only; the result will be stored in the input arrays.
integer(4) function cufftXtExecDescriptorC2R( plan, input, output )
integer :: plan
type(cudaLibXtDesc) :: input, output
3.6.8. cufftXtExecDescriptorZ2D
This function executes a double precision complex-to-real transform plan. This multiple GPU function currently supports in-place transforms only; the result will be stored in the input arrays.
integer(4) function cufftXtExecDescriptorZ2D( plan, input, output )
integer :: plan
type(cudaLibXtDesc) :: input, output
3.7. CUFFTMP Functions
This section contains the cuFFTMp functions which extend the cuFFTXt library functionality to multiple processes and multiple GPUs.
3.7.1. cufftMpNvshmemMalloc
This function allocates space from the NVSHMEM symmetric heap. The cuFFTMp library is based on NVSHMEM. However, the user is not allowed to link and use NVSHMEM in their own application; doing so may cause a crash at application start time. This limitation will be lifted in a future release of cuFFTMp.
However, some functionality of cuFFTMp requires NVSHMEM-allocated memory, so this function is currently exposed and supported. This function requires that at least one cuFFTMp plan is active prior to its use.
integer(4) function cufftMpNvshmemMalloc( size, workArea )
integer(8) :: size ! Size is in bytes
type(c_devptr) :: workArea
3.7.2. cufftMpNvshmemFree
This function frees the space previously allocated from the NVSHMEM symmetric heap. The cuFFTMp library is based on NVSHMEM. However, the user is not allowed to link and use NVSHMEM in their own application; doing so may cause a crash at application start time. This limitation will be lifted in a future release of cuFFTMp.
However, some functionality of cuFFTMp requires NVSHMEM-allocated memory, so this function is currently exposed and supported. This function requires that at least one cuFFTMp plan is active prior to its use.
integer(4) function cufftMpNvshmemFree( workArea )
type(c_devptr) :: workArea
3.7.3. cufftMpAttachComm
This function attaches a communicator, such as MPI_COMM_WORLD, to a cuFFT plan, for later application of a distributed FFT operation.
integer(4) function cufftMpAttachComm( plan, commType, fcomm )
integer(4) :: plan
integer(4) :: commType
integer(4) :: fcomm
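For example, a brief sketch of attaching MPI_COMM_WORLD after plan creation with cufftCreate and before the cufftMakePlan* call; MPI is assumed to be initialized, the ordering is assumed by analogy with cufftXtSetDistribution, and the remaining cuFFTMp plan setup is elided.
! Sketch: attach an MPI communicator to a plan prior to cufftMakePlan*.
use mpi
use cufft
use cufftxt
integer :: plan, ierr
ierr = cufftCreate(plan)
ierr = cufftMpAttachComm(plan, CUFFT_COMM_MPI, MPI_COMM_WORLD)
! ... cufftXtSetDistribution / cufftMakePlan3d / cufftXtMalloc follow ...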
3.7.4. cufftMpCreateReshape
This function creates a cuFFTMp reshape handle for later application of a distributed FFT operation.
integer(4) function cufftMpCreateReshape( reshapeHandle )
type(c_ptr) :: reshapeHandle
3.7.5. cufftMpAttachReshapeComm
This function attaches a communicator, such as MPI_COMM_WORLD, to a cuFFTMp reshape handle, for later application of a distributed FFT operation.
integer(4) function cufftMpAttachReshapeComm( reshapeHandle, commType, fcomm )
type(c_ptr) :: reshapeHandle
integer(4) :: commType
integer(4) :: fcomm
3.7.6. cufftMpGetReshapeSize
This function returns the size needed for work space in the subsequent cuFFTMp reshape execution. Currently, a work area is not required, but that may change in future releases.
integer(4) function cufftMpGetReshapeSize( reshapeHandle, workSize )
type(c_ptr) :: reshapeHandle
integer(8) :: workSize
3.7.7. cufftMpMakeReshape
This function creates a cuFFTMp reshape plan based on the input and output boxes. Note that the boxes use C conventions for bounds and strides.
integer(4) function cufftMpMakeReshape( reshapeHandle, &
elementSize, boxIn, boxOut )
type(c_ptr) :: reshapeHandle
integer(8) :: elementSize
type(cufftBox3d) :: boxIn
type(cufftBox3d) :: boxOut
3.7.8. cufftMpExecReshapeAsync
This function executes a cuFFTMp reshape plan on the specified stream.
integer(4) function cufftMpExecReshapeAsync( reshapeHandle, &
dataOut, dataIn, workSpace, stream )
type(c_ptr) :: reshapeHandle
type(c_devptr) :: dataOut
type(c_devptr) :: dataIn
type(c_devptr) :: workSpace
integer(kind=cuda_stream_kind) :: stream
3.7.9. cufftMpDestroyReshape
This function destroys a cuFFTMp reshape handle.
integer(4) function cufftMpDestroyReshape( reshapeHandle )
type(c_ptr) :: reshapeHandle
4. Random Number Runtime Library APIs
This section describes the Fortran interfaces to the CUDA cuRAND library. The cuRAND functionality is accessible from both host and device code. In the host library, all of the runtime API routines are integer functions that return an error code; they return a value of CURAND_STATUS_SUCCESS if the call was successful, or another cuRAND status return value if there was an error. The host library routines are meant to produce a series or array of random numbers. In the device library, the init routines are subroutines and the generator functions return the type of the value being generated. The device library routines are meant for producing a single value per thread per call.
Chapter 10 contains examples of accessing the cuRAND library routines from OpenACC and CUDA Fortran. In both cases, the interfaces to the library can be exposed in host code by adding the line
use curand
to your program unit.
Unless a specific kind is provided, the plain integer type implies integer(4) and the plain real type implies real(4).
4.1. CURAND Definitions and Helper Functions
This section contains definitions and data types used in the cuRAND library and interfaces to the cuRAND helper functions.
The curand module contains the following derived type definitions:
TYPE curandGenerator
TYPE(C_PTR) :: handle
END TYPE
The curand module contains the following enumerations:
! CURAND Status
enum, bind(c)
enumerator :: CURAND_STATUS_SUCCESS = 0
enumerator :: CURAND_STATUS_VERSION_MISMATCH = 100
enumerator :: CURAND_STATUS_NOT_INITIALIZED = 101
enumerator :: CURAND_STATUS_ALLOCATION_FAILED = 102
enumerator :: CURAND_STATUS_TYPE_ERROR = 103
enumerator :: CURAND_STATUS_OUT_OF_RANGE = 104
enumerator :: CURAND_STATUS_LENGTH_NOT_MULTIPLE = 105
enumerator :: CURAND_STATUS_DOUBLE_PRECISION_REQUIRED = 106
enumerator :: CURAND_STATUS_LAUNCH_FAILURE = 201
enumerator :: CURAND_STATUS_PREEXISTING_FAILURE = 202
enumerator :: CURAND_STATUS_INITIALIZATION_FAILED = 203
enumerator :: CURAND_STATUS_ARCH_MISMATCH = 204
enumerator :: CURAND_STATUS_INTERNAL_ERROR = 999
end enum
! CURAND Generator Types
enum, bind(c)
enumerator :: CURAND_RNG_TEST = 0
enumerator :: CURAND_RNG_PSEUDO_DEFAULT = 100
enumerator :: CURAND_RNG_PSEUDO_XORWOW = 101
enumerator :: CURAND_RNG_PSEUDO_MRG32K3A = 121
enumerator :: CURAND_RNG_PSEUDO_MTGP32 = 141
enumerator :: CURAND_RNG_PSEUDO_MT19937 = 142
enumerator :: CURAND_RNG_PSEUDO_PHILOX4_32_10 = 161
enumerator :: CURAND_RNG_QUASI_DEFAULT = 200
enumerator :: CURAND_RNG_QUASI_SOBOL32 = 201
enumerator :: CURAND_RNG_QUASI_SCRAMBLED_SOBOL32 = 202
enumerator :: CURAND_RNG_QUASI_SOBOL64 = 203
enumerator :: CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 = 204
end enum
! CURAND Memory Ordering
enum, bind(c)
enumerator :: CURAND_ORDERING_PSEUDO_BEST = 100
enumerator :: CURAND_ORDERING_PSEUDO_DEFAULT = 101
enumerator :: CURAND_ORDERING_PSEUDO_SEEDED = 102
enumerator :: CURAND_ORDERING_QUASI_DEFAULT = 201
end enum
! CURAND Direction Vectors
enum, bind(c)
enumerator :: CURAND_DIRECTION_VECTORS_32_JOEKUO6 = 101
enumerator :: CURAND_SCRAMBLED_DIRECTION_VECTORS_32_JOEKUO6 = 102
enumerator :: CURAND_DIRECTION_VECTORS_64_JOEKUO6 = 103
enumerator :: CURAND_SCRAMBLED_DIRECTION_VECTORS_64_JOEKUO6 = 104
end enum
! CURAND Methods
enum, bind(c)
enumerator :: CURAND_CHOOSE_BEST = 0
enumerator :: CURAND_ITR = 1
enumerator :: CURAND_KNUTH = 2
enumerator :: CURAND_HITR = 3
enumerator :: CURAND_M1 = 4
enumerator :: CURAND_M2 = 5
enumerator :: CURAND_BINARY_SEARCH = 6
enumerator :: CURAND_DISCRETE_GAUSS = 7
enumerator :: CURAND_REJECTION = 8
enumerator :: CURAND_DEVICE_API = 9
enumerator :: CURAND_FAST_REJECTION = 10
enumerator :: CURAND_3RD = 11
enumerator :: CURAND_DEFINITION = 12
enumerator :: CURAND_POISSON = 13
end enum
4.1.1. curandCreateGenerator
This function creates a new random number generator of type rng. See the beginning of this section for valid values of rng.
integer(4) function curandCreateGenerator(generator, rng)
type(curandGenerator) :: generator
integer :: rng
4.1.2. curandCreateGeneratorHost
This function creates a new host CPU random number generator of type rng. See the beginning of this section for valid values of rng.
integer(4) function curandCreateGeneratorHost(generator, rng)
type(curandGenerator) :: generator
integer :: rng
4.1.3. curandDestroyGenerator
This function destroys an existing random number generator.
integer(4) function curandDestroyGenerator(generator)
type(curandGenerator) :: generator
4.1.4. curandGetVersion
This function returns the version number of the cuRAND library.
integer(4) function curandGetVersion(version)
integer(4) :: version
4.1.5. curandSetStream
This function sets the current stream for the cuRAND kernel launches.
integer(4) function curandSetStream(generator, stream)
type(curandGenerator) :: generator
integer(kind=c_intptr_t) :: stream
4.1.6. curandSetPseudoRandomGeneratorSeed
This function sets the seed value of the pseudo-random number generator.
integer(4) function curandSetPseudoRandomGeneratorSeed(generator, seed)
type(curandGenerator) :: generator
integer(8) :: seed
4.1.7. curandSetGeneratorOffset
This function sets the absolute offset of the pseudo or quasirandom number generator.
integer(4) function curandSetGeneratorOffset(generator, offset)
type(curandGenerator) :: generator
integer(8) :: offset
4.1.8. curandSetGeneratorOrdering
This function sets the ordering of results of the pseudo or quasirandom number generator.
integer(4) function curandSetGeneratorOrdering(generator, order)
type(curandGenerator) :: generator
integer(4) :: order
4.1.9. curandSetQuasiRandomGeneratorDimensions
This function sets the number of dimensions of the quasirandom number generator.
integer(4) function curandSetQuasiRandomGeneratorDimensions(generator, num)
type(curandGenerator) :: generator
integer(4) :: num
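The helper routines above are typically called between creating and using a generator. As an illustration, the following fragment configures a quasirandom generator before any values are generated; the generator type, the three-dimensional setting, and the default ordering are arbitrary choices for this sketch.
subroutine setup_sobol(gen)
  use curand
  implicit none
  type(curandGenerator), intent(out) :: gen
  integer :: istat

  istat = curandCreateGenerator(gen, CURAND_RNG_QUASI_SOBOL32)
  istat = curandSetQuasiRandomGeneratorDimensions(gen, 3)
  istat = curandSetGeneratorOffset(gen, 0_8)
  istat = curandSetGeneratorOrdering(gen, CURAND_ORDERING_QUASI_DEFAULT)
  ! generate values with the routines in the next section, then destroy the generator
end subroutine setup_sobol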
4.2. CURAND Generator Functions
This section contains interfaces for the cuRAND generator functions.
4.2.1. curandGenerate
This function generates 32-bit pseudo or quasirandom numbers.
integer(4) function curandGenerate(generator, array, num )
type(curandGenerator) :: generator
integer(4), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
4.2.2. curandGenerateLongLong
This function generates 64-bit integer quasirandom numbers. The function curandGenerate() has also been overloaded to accept these function arguments.
integer(4) function curandGenerateLongLong(generator, array, num )
type(curandGenerator) :: generator
integer(8), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
4.2.3. curandGenerateUniform
This function generates 32-bit floating point uniformly distributed random numbers. The function curandGenerate() has also been overloaded to accept these function arguments.
integer(4) function curandGenerateUniform(generator, array, num )
type(curandGenerator) :: generator
real(4), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
4.2.4. curandGenerateUniformDouble
This function generates 64-bit floating point uniformly distributed random numbers. The function curandGenerate() has also been overloaded to accept these function arguments.
integer(4) function curandGenerateUniformDouble(generator, array, num )
type(curandGenerator) :: generator
real(8), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
4.2.5. curandGenerateNormal
This function generates 32-bit floating point normally distributed random numbers. The function curandGenerate() has also been overloaded to accept these function arguments.
integer(4) function curandGenerateNormal(generator, array, num, mean, stddev )
type(curandGenerator) :: generator
real(4), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
real(4) :: mean, stddev
4.2.6. curandGenerateNormalDouble
This function generates 64-bit floating point normally distributed random numbers. The function curandGenerate() has also been overloaded to accept these function arguments.
integer(4) function curandGenerateNormalDouble(generator, array, num, mean, stddev )
type(curandGenerator) :: generator
real(8), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
real(8) :: mean, stddev
4.2.7. curandGeneratePoisson
This function generates Poisson-distributed random numbers. The function curandGenerate() has also been overloaded to accept these function arguments.
integer(4) function curandGeneratePoisson(generator, array, num, lambda )
type(curandGenerator) :: generator
real(8), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
real(8) :: lambda
4.2.8. curandGenerateSeeds
This function sets the starting state of the generator.
integer(4) function curandGenerateSeeds(generator)
type(curandGenerator) :: generator
4.2.9. curandGenerateLogNormal
This function generates 32-bit floating point log-normally distributed random numbers.
integer(4) function curandGenerateLogNormal(generator, array, num, mean, stddev )
type(curandGenerator) :: generator
real(4), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
real(4) :: mean, stddev
4.2.10. curandGenerateLogNormalDouble
This function generates 64-bit floating point log-normally distributed random numbers.
integer(4) function curandGenerateLogNormalDouble(generator, array, num, mean, stddev )
type(curandGenerator) :: generator
real(8), device :: array(*) ! Host or device depending on the generator
integer(kind=c_intptr_t) :: num
real(8) :: mean, stddev
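When the host-library generator functions are called from OpenACC host code rather than CUDA Fortran, the device address of an OpenACC-managed array can be passed through a host_data region. The sketch below shows one possible arrangement, assuming the array is managed with OpenACC data directives; the generator type, seed, and distribution parameters are arbitrary.
subroutine acc_fill_normal(a, n)
  use curand
  implicit none
  integer, intent(in) :: n
  real(4) :: a(n)
  type(curandGenerator) :: gen
  integer(8) :: num
  integer :: istat

  num = n
  istat = curandCreateGenerator(gen, CURAND_RNG_PSEUDO_XORWOW)
  istat = curandSetPseudoRandomGeneratorSeed(gen, 1234_8)

  !$acc data copyout(a)
  !$acc host_data use_device(a)
  istat = curandGenerateNormal(gen, a, num, 0.0, 1.0)   ! mean 0, stddev 1
  !$acc end host_data
  !$acc end data

  istat = curandDestroyGenerator(gen)
end subroutine acc_fill_normal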
4.3. CURAND Device Definitions and Functions
This section contains definitions and data types used in the cuRAND device library and interfaces to the cuRAND functions.
The curand device module contains the following derived type definitions:
TYPE curandStateXORWOW
integer(4) :: d
integer(4) :: v(5)
integer(4) :: boxmuller_flag
integer(4) :: boxmuller_flag_double
real(4) :: boxmuller_extra
real(8) :: boxmuller_extra_double
END TYPE curandStateXORWOW
TYPE curandStateMRG32k3a
real(8) :: s1(3)
real(8) :: s2(3)
integer(4) :: boxmuller_flag
integer(4) :: boxmuller_flag_double
real(4) :: boxmuller_extra
real(8) :: boxmuller_extra_double
END TYPE curandStateMRG32k3a
TYPE curandStateSobol32
integer(4) :: d
integer(4) :: x
integer(4) :: c
integer(4) :: direction_vectors(32)
END TYPE curandStateSobol32
TYPE curandStateScrambledSobol32
integer(4) :: d
integer(4) :: x
integer(4) :: c
integer(4) :: direction_vectors(32)
END TYPE curandStateScrambledSobol32
TYPE curandStateSobol64
integer(8) :: d
integer(8) :: x
integer(8) :: c
integer(8) :: direction_vectors(32)
END TYPE curandStateSobol64
TYPE curandStateScrambledSobol64
integer(8) :: d
integer(8) :: x
integer(8) :: c
integer(8) :: direction_vectors(32)
END TYPE curandStateScrambledSobol64
TYPE curandStateMtgp32
integer(4) :: s(MTGP32_STATE_SIZE)
integer(4) :: offset
integer(4) :: pIdx
integer(kind=int_ptr_kind()) :: k
integer(4) :: precise_double_flag
END TYPE curandStateMtgp32
TYPE curandStatePhilox4_32_10
integer(4) :: ctr
integer(4) :: output
integer(2) :: key
integer(4) :: state
integer(4) :: boxmuller_flag
integer(4) :: boxmuller_flag_double
real(4) :: boxmuller_extra
real(8) :: boxmuller_extra_double
END TYPE curandStatePhilox4_32_10
4.3.1. curand_Init
This overloaded device subroutine initializes the state for the random number generator. These device subroutines are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.1.1. curandInitXORWOW
This function initializes the state for the XORWOW random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitXORWOW(seed, sequence, offset, state)
integer(8) :: seed
integer(8) :: sequence
integer(8) :: offset
TYPE(curandStateXORWOW) :: state
4.3.1.2. curandInitMRG32k3a
This function initializes the state for the MRG32k3a random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitMRG32k3a(seed, sequence, offset, state)
integer(8) :: seed
integer(8) :: sequence
integer(8) :: offset
TYPE(curandStateMRG32k3a) :: state
4.3.1.3. curandInitPhilox4_32_10
This function initializes the state for the Philox4_32_10 random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitPhilox4_32_10(seed, sequence, offset, state)
integer(8) :: seed
integer(8) :: sequence
integer(8) :: offset
TYPE(curandStatePhilox4_32_10) :: state
4.3.1.4. curandInitSobol32
This function initializes the state for the Sobol32 random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitSobol32(direction_vectors, offset, state)
integer :: direction_vectors(*)
integer(4) :: offset
TYPE(curandStateSobol32) :: state
4.3.1.5. curandInitScrambledSobol32
This function initializes the state for the scrambled Sobol32 random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitScrambledSobol32(direction_vectors, scramble, offset, state)
integer :: direction_vectors(*)
integer(4) :: scramble
integer(4) :: offset
TYPE(curandStateScrambledSobol32) :: state
4.3.1.6. curandInitSobol64
This function initializes the state for the Sobol64 random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitSobol64(direction_vectors, offset, state)
integer :: direction_vectors(*)
integer(8) :: offset
TYPE(curandStateSobol64) :: state
4.3.1.7. curandInitScrambledSobol64
This function initializes the state for the scrambled Sobol64 random number generator. The function curand_init() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
subroutine curandInitScrambledSobol64(direction_vectors, scramble, offset, state)
integer :: direction_vectors(*)
integer(8) :: scramble
integer(8) :: offset
TYPE(curandStateScrambledSobol64) :: state
4.3.2. curand
This overloaded device function returns 32 or 64 bits of random data based on the state argument. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.2.1. curandGetXORWOW
This function returns 32 bits of pseudorandomness from the XORWOW random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.2.2. curandGetMRG32k3a
This function returns 32 bits of pseudorandomness from the MRG32k3a random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.2.3. curandGetPhilox4_32_10
This function returns 32 bits of pseudorandomness from the Philox4_32_10 random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetPhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.2.4. curandGetSobol32
This function returns 32 bits of quasirandomness from the Sobol32 random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.2.5. curandGetScrambledSobol32
This function returns 32 bits of quasirandomness from the scrambled Sobol32 random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.2.6. curandGetSobol64
This function returns 64 bits of quasirandomness from the Sobol64 random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.2.7. curandGetScrambledSobol64
This function returns 64 bits of quasirandomness from the scrambled Sobol64 random number generator. The function curand() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
integer(4) function curandGetScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
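The device init and generate routines above are typically used with one state per thread inside a kernel. The CUDA Fortran sketch below initializes an XORWOW state per thread and stores 32 bits of pseudorandomness per element; the module name curand_device and the seed are assumptions here, so check the Examples chapter for the exact use statement required by your installation. From host code, a launch such as call draw_bits<<<grid,block>>>(a_d, n) would then fill the device array.
attributes(global) subroutine draw_bits(a, n)
  use curand_device      ! assumed module name for the device interfaces
  use cudafor
  implicit none
  integer, value :: n
  integer(4), device :: a(n)
  type(curandStateXORWOW) :: state
  integer :: i

  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= n) then
     ! one state per thread; the sequence number separates the streams
     call curand_init(4321_8, int(i,8), 0_8, state)
     a(i) = curand(state)    ! 32 bits of pseudorandomness
  end if
end subroutine draw_bits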
4.3.3. Curand_Normal
This overloaded device function returns a 32-bit floating point normally distributed random number. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.3.1. curandNormalXORWOW
This function returns a 32-bit floating point normally distributed random number from an XORWOW generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.3.2. curandNormalMRG32k3a
This function returns a 32-bit floating point normally distributed random number from an MRG32k3a generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.3.3. curandNormalPhilox4_32_10
This function returns a 32-bit floating point normally distributed random number from a Philox4_32_10 generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalPhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.3.4. curandNormalSobol32
This function returns a 32-bit floating point normally distributed random number from a Sobol32 generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.3.5. curandNormalScrambledSobol32
This function returns a 32-bit floating point normally distributed random number from a scrambled Sobol32 generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.3.6. curandNormalSobol64
This function returns a 32-bit floating point normally distributed random number from a Sobol64 generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.3.7. curandNormalScrambledSobol64
This function returns a 32-bit floating point normally distributed random number from a scrambled Sobol64 generator. The function curand_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandNormalScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
4.3.4. Curand_Normal_Double
This overloaded device function returns a 64-bit floating point normally distributed random number. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.4.1. curandNormalDoubleXORWOW
This function returns a 64-bit floating point normally distributed random number from an XORWOW generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoubleXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.4.2. curandNormalDoubleMRG32k3a
This function returns a 64-bit floating point normally distributed random number from an MRG32k3a generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoubleMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.4.3. curandNormalDoublePhilox4_32_10
This function returns a 64-bit floating point normally distributed random number from a Philox4_32_10 generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoublePhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.4.4. curandNormalDoubleSobol32
This function returns a 64-bit floating point normally distributed random number from a Sobol32 generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoubleSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.4.5. curandNormalDoubleScrambledSobol32
This function returns a 64-bit floating point normally distributed random number from a scrambled Sobol32 generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoubleScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.4.6. curandNormalDoubleSobol64
This function returns a 64-bit floating point normally distributed random number from a Sobol64 generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoubleSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.4.7. curandNormalDoubleScrambledSobol64
This function returns a 64-bit floating point normally distributed random number from a scrambled Sobol64 generator. The function curand_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandNormalDoubleScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
4.3.5. Curand_Log_Normal
This overloaded device function returns a 32-bit floating point log-normally distributed random number. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.5.1. curandLogNormalXORWOW
This function returns a 32-bit floating point log-normally distributed random number from an XORWOW generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.5.2. curandLogNormalMRG32k3a
This function returns a 32-bit floating point log-normally distributed random number from an MRG32k3a generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.5.3. curandLogNormalPhilox4_32_10
This function returns a 32-bit floating point log-normally distributed random number from a Philox4_32_10 generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalPhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.5.4. curandLogNormalSobol32
This function returns a 32-bit floating point log-normally distributed random number from a Sobol32 generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.5.5. curandLogNormalScrambledSobol32
This function returns a 32-bit floating point log-normally distributed random number from a scrambled Sobol32 generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.5.6. curandLogNormalSobol64
This function returns a 32-bit floating point log-normally distributed random number from a Sobol64 generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.5.7. curandLogNormalScrambledSobol64
This function returns a 32-bit floating point log-normally distributed random number from a scrambled Sobol64 generator. The function curand_log_normal() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandLogNormalScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
4.3.6. Curand_Log_Normal_Double
This overloaded device function returns a 64-bit floating point log-normally distributed random number. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.6.1. curandLogNormalDoubleXORWOW
This function returns a 64-bit floating point log-normally distributed random number from an XORWOW generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoubleXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.6.2. curandLogNormalDoubleMRG32k3a
This function returns a 64-bit floating point log-normally distributed random number from an MRG32k3a generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoubleMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.6.3. curandLogNormalDoublePhilox4_32_10
This function returns a 64-bit floating point log-normally distributed random number from a Philox4_32_10 generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoublePhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.6.4. curandLogNormalDoubleSobol32
This function returns a 64-bit floating point log-normally distributed random number from a Sobol32 generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoubleSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.6.5. curandLogNormalDoubleScrambledSobol32
This function returns a 64-bit floating point log-normally distributed random number from a scrambled Sobol32 generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoubleScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.6.6. curandLogNormalDoubleSobol64
This function returns a 64-bit floating point log-normally distributed random number from a Sobol64 generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoubleSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.6.7. curandLogNormalDoubleScrambledSobol64
This function returns a 64-bit floating point log-normally distributed random number from a scrambled Sobol64 generator. The function curand_log_normal_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandLogNormalDoubleScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
4.3.7. Curand_Uniform
This overloaded device function returns a 32-bit floating point uniformly distributed random number. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.7.1. curandUniformXORWOW
This function returns a 32-bit floating point uniformly distributed random number from an XORWOW generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.7.2. curandUniformMRG32k3a
This function returns a 32-bit floating point uniformly distributed random number from an MRG32k3a generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.7.3. curandUniformPhilox4_32_10
This function returns a 32-bit floating point uniformly distributed random number from a Philox4_32_10 generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformPhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.7.4. curandUniformSobol32
This function returns a 32-bit floating point uniformly distributed random number from a Sobol32 generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.7.5. curandUniformScrambledSobol32
This function returns a 32-bit floating point uniformly distributed random number from a scrambled Sobol32 generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.7.6. curandUniformSobol64
This function returns a 32-bit floating point uniformly distributed random number from a Sobol64 generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.7.7. curandUniformScrambledSobol64
This function returns a 32-bit floating point uniformly distributed random number from a scrambled Sobol64 generator. The function curand_uniform() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(4) function curandUniformScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
4.3.8. Curand_Uniform_Double
This overloaded device function returns a 64-bit floating point uniformly distributed random number. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
4.3.8.1. curandUniformDoubleXORWOW
This function returns a 64-bit floating point uniformly distributed random number from an XORWOW generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoubleXORWOW(state)
TYPE(curandStateXORWOW) :: state
4.3.8.2. curandUniformDoubleMRG32k3a
This function returns a 64-bit floating point uniformly distributed random number from an MRG32k3a generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoubleMRG32k3a(state)
TYPE(curandStateMRG32k3a) :: state
4.3.8.3. curandUniformDoublePhilox4_32_10
This function returns a 64-bit floating point uniformly distributed random number from a Philox4_32_10 generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoublePhilox4_32_10(state)
TYPE(curandStatePhilox4_32_10) :: state
4.3.8.4. curandUniformDoubleSobol32
This function returns a 64-bit floating point uniformly distributed random number from a Sobol32 generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoubleSobol32(state)
TYPE(curandStateSobol32) :: state
4.3.8.5. curandUniformDoubleScrambledSobol32
This function returns a 64-bit floating point uniformly distributed random number from a scrambled Sobol32 generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoubleScrambledSobol32(state)
TYPE(curandStateScrambledSobol32) :: state
4.3.8.6. curandUniformDoubleSobol64
This function returns a 64-bit floating point uniformly distributed random number from a Sobol64 generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoubleSobol64(state)
TYPE(curandStateSobol64) :: state
4.3.8.7. curandUniformDoubleScrambledSobol64
This function returns a 64-bit floating point uniformly distributed random number from a scrambled Sobol64 generator. The function curand_uniform_double() has also been overloaded to accept these function arguments, as in CUDA C++. Device Functions are declared attributes(device)
in CUDA Fortran and !$acc routine() seq
in OpenACC.
real(8) function curandUniformDoubleScrambledSobol64(state)
TYPE(curandStateScrambledSobol64) :: state
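From OpenACC device code, the same state types and generic names can be used inside a compute region, with one state kept private per loop iteration or gang. The following sketch fills an array with uniformly distributed values; the module name openacc_curand is an assumption (see the Examples chapter for the exact use statement), and re-initializing the state for every element is done only to keep the example short.
subroutine acc_fill_uniform(a, n)
  use openacc_curand     ! assumed module name for the OpenACC device interfaces
  implicit none
  integer, intent(in) :: n
  real(4), intent(out) :: a(n)
  type(curandStateXORWOW) :: state
  integer :: i

  !$acc parallel loop copyout(a) private(state)
  do i = 1, n
     ! for illustration only: one state per element, separated by sequence number
     call curand_init(12345_8, int(i,8), 0_8, state)
     a(i) = curand_uniform(state)
  end do
end subroutine acc_fill_uniform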
5. SPARSE Matrix Runtime Library APIs
This section describes the Fortran interfaces to the CUDA cuSPARSE library. The cuSPARSE functions are only accessible from host code. All of the runtime API routines are integer functions that return an error code; they return a value of CUSPARSE_STATUS_SUCCESS if the call was successful, or another cuSPARSE status return value if there was an error.
Chapter 10 contains examples of accessing the cuSPARSE library routines from OpenACC and CUDA Fortran. In both cases, the interfaces to the library can be exposed by adding the line
use cusparse
to your program unit.
A number of the function interfaces listed in this chapter can take host or device scalar arguments. Those functions have an additional v2 interface, which does not implicitly manage the pointer mode for these calls. See section 1.6 for further discussion on the handling of pointer modes.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
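As a quick illustration of the calling convention, the sketch below creates a cuSPARSE handle, queries the library version, and checks the returned status values against CUSPARSE_STATUS_SUCCESS. It is a minimal sketch of the error-handling pattern rather than a complete application.
program cusparse_setup
  use cusparse
  implicit none
  type(cusparseHandle) :: h
  integer :: istat, version

  istat = cusparseCreate(h)
  if (istat /= CUSPARSE_STATUS_SUCCESS) then
     print *, 'cusparseCreate: ', trim(cusparseGetErrorString(istat))
     stop 1
  end if

  istat = cusparseGetVersion(h, version)
  print *, 'cuSPARSE version: ', version

  istat = cusparseDestroy(h)
end program cusparse_setup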
5.1. CUSPARSE Definitions and Helper Functions
This section contains definitions and data types used in the cuSPARSE library and interfaces to the cuSPARSE helper functions.
The cuSPARSE module contains the following derived type definitions:
type cusparseHandle
type(c_ptr) :: handle
end type cusparseHandle
type :: cusparseMatDescr
type(c_ptr) :: descr
end type cusparseMatDescr
! This type was removed in CUDA 11.0
type cusparseSolveAnalysisInfo
type(c_ptr) :: info
end type cusparseSolveAnalysisInfo
! This type was removed in CUDA 11.0
type cusparseHybMat
type(c_ptr) :: mat
end type cusparseHybMat
type cusparseCsrsv2Info
type(c_ptr) :: info
end type cusparseCsrsv2Info
type cusparseCsric02Info
type(c_ptr) :: info
end type cusparseCsric02Info
type cusparseCsrilu02Info
type(c_ptr) :: info
end type cusparseCsrilu02Info
type cusparseBsrsv2Info
type(c_ptr) :: info
end type cusparseBsrsv2Info
type cusparseBsric02Info
type(c_ptr) :: info
end type cusparseBsric02Info
type cusparseBsrilu02Info
type(c_ptr) :: info
end type cusparseBsrilu02Info
type cusparseBsrsm2Info
type(c_ptr) :: info
end type cusparseBsrsm2Info
type cusparseCsrgemm2Info
type(c_ptr) :: info
end type cusparseCsrgemm2Info
type cusparseColorInfo
type(c_ptr) :: info
end type cusparseColorInfo
type cusparseCsru2csrInfo
type(c_ptr) :: info
end type cusparseCsru2csrInfo
type cusparseSpVecDescr
type(c_ptr) :: descr
end type cusparseSpVecDescr
type cusparseDnVecDescr
type(c_ptr) :: descr
end type cusparseDnVecDescr
type cusparseSpMatDescr
type(c_ptr) :: descr
end type cusparseSpMatDescr
type cusparseDnMatDescr
type(c_ptr) :: descr
end type cusparseDnMatDescr
type cusparseSpSVDescr
type(c_ptr) :: descr
end type cusparseSpSVDescr
type cusparseSpSMDescr
type(c_ptr) :: descr
end type cusparseSpSMDescr
type cusparseSpGEMMDescr
type(c_ptr) :: descr
end type cusparseSpGEMMDescr
The cuSPARSE module contains the following constants and enumerations:
! cuSPARSE Version Info
integer, parameter :: CUSPARSE_VER_MAJOR = 12
integer, parameter :: CUSPARSE_VER_MINOR = 1
integer, parameter :: CUSPARSE_VER_PATCH = 2
integer, parameter :: CUSPARSE_VER_BUILD = 129
integer, parameter :: CUSPARSE_VERSION = (CUSPARSE_VER_MAJOR * 1000 + &
CUSPARSE_VER_MINOR * 100 + CUSPARSE_VER_PATCH)
! cuSPARSE status return values
enum, bind(C) ! cusparseStatus_t
enumerator :: CUSPARSE_STATUS_SUCCESS=0
enumerator :: CUSPARSE_STATUS_NOT_INITIALIZED=1
enumerator :: CUSPARSE_STATUS_ALLOC_FAILED=2
enumerator :: CUSPARSE_STATUS_INVALID_VALUE=3
enumerator :: CUSPARSE_STATUS_ARCH_MISMATCH=4
enumerator :: CUSPARSE_STATUS_MAPPING_ERROR=5
enumerator :: CUSPARSE_STATUS_EXECUTION_FAILED=6
enumerator :: CUSPARSE_STATUS_INTERNAL_ERROR=7
enumerator :: CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8
enumerator :: CUSPARSE_STATUS_ZERO_PIVOT=9
enumerator :: CUSPARSE_STATUS_NOT_SUPPORTED=10
enumerator :: CUSPARSE_STATUS_INSUFFICIENT_RESOURCES=11
end enum
enum, bind(c) ! cusparsePointerMode_t
enumerator :: CUSPARSE_POINTER_MODE_HOST = 0
enumerator :: CUSPARSE_POINTER_MODE_DEVICE = 1
end enum
enum, bind(c) ! cusparseAction_t
enumerator :: CUSPARSE_ACTION_SYMBOLIC = 0
enumerator :: CUSPARSE_ACTION_NUMERIC = 1
end enum
enum, bind(C) ! cusparseMatrixType_t
enumerator :: CUSPARSE_MATRIX_TYPE_GENERAL = 0
enumerator :: CUSPARSE_MATRIX_TYPE_SYMMETRIC = 1
enumerator :: CUSPARSE_MATRIX_TYPE_HERMITIAN = 2
enumerator :: CUSPARSE_MATRIX_TYPE_TRIANGULAR = 3
end enum
enum, bind(C) ! cusparseFillMode_t
enumerator :: CUSPARSE_FILL_MODE_LOWER = 0
enumerator :: CUSPARSE_FILL_MODE_UPPER = 1
end enum
enum, bind(C) ! cusparseDiagType_t
enumerator :: CUSPARSE_DIAG_TYPE_NON_UNIT = 0
enumerator :: CUSPARSE_DIAG_TYPE_UNIT = 1
end enum
enum, bind(C) ! cusparseIndexBase_t
enumerator :: CUSPARSE_INDEX_BASE_ZERO = 0
enumerator :: CUSPARSE_INDEX_BASE_ONE = 1
end enum
enum, bind(C) ! cusparseOperation_t
enumerator :: CUSPARSE_OPERATION_NON_TRANSPOSE = 0
enumerator :: CUSPARSE_OPERATION_TRANSPOSE = 1
enumerator :: CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE = 2
end enum
enum, bind(C) ! cusparseDirection_t
enumerator :: CUSPARSE_DIRECTION_ROW = 0
enumerator :: CUSPARSE_DIRECTION_COLUMN = 1
end enum
enum, bind(C) ! cusparseHybPartition_t
enumerator :: CUSPARSE_HYB_PARTITION_AUTO = 0
enumerator :: CUSPARSE_HYB_PARTITION_USER = 1
enumerator :: CUSPARSE_HYB_PARTITION_MAX = 2
end enum
enum, bind(C) ! cusparseSolvePolicy_t
enumerator :: CUSPARSE_SOLVE_POLICY_NO_LEVEL = 0
enumerator :: CUSPARSE_SOLVE_POLICY_USE_LEVEL = 1
end enum
enum, bind(C) ! cusparseSideMode_t
enumerator :: CUSPARSE_SIDE_LEFT = 0
enumerator :: CUSPARSE_SIDE_RIGHT = 1
end enum
enum, bind(C) ! cusparseColorAlg_t
enumerator :: CUSPARSE_COLOR_ALG0 = 0
enumerator :: CUSPARSE_COLOR_ALG1 = 1
end enum
enum, bind(C) ! cusparseAlgMode_t;
enumerator :: CUSPARSE_ALG0 = 0
enumerator :: CUSPARSE_ALG1 = 1
enumerator :: CUSPARSE_ALG_NAIVE = 0
enumerator :: CUSPARSE_ALG_MERGE_PATH = 0
end enum
enum, bind(C) ! cusparseCsr2CscAlg_t;
enumerator :: CUSPARSE_CSR2CSC_ALG_DEFAULT = 1
enumerator :: CUSPARSE_CSR2CSC_ALG1 = 1
enumerator :: CUSPARSE_CSR2CSC_ALG2 = 2
end enum
enum, bind(C) ! cusparseFormat_t;
enumerator :: CUSPARSE_FORMAT_CSR = 1
enumerator :: CUSPARSE_FORMAT_CSC = 2
enumerator :: CUSPARSE_FORMAT_COO = 3
enumerator :: CUSPARSE_FORMAT_COO_AOS = 4
enumerator :: CUSPARSE_FORMAT_BLOCKED_ELL = 5
enumerator :: CUSPARSE_FORMAT_BSR = 6
enumerator :: CUSPARSE_FORMAT_SLICED_ELLPACK = 7
end enum
enum, bind(C) ! cusparseOrder_t;
enumerator :: CUSPARSE_ORDER_COL = 1
enumerator :: CUSPARSE_ORDER_ROW = 2
end enum
enum, bind(C) ! cusparseSpMVAlg_t;
enumerator :: CUSPARSE_MV_ALG_DEFAULT = 0
enumerator :: CUSPARSE_COOMV_ALG = 1
enumerator :: CUSPARSE_CSRMV_ALG1 = 2
enumerator :: CUSPARSE_CSRMV_ALG2 = 3
enumerator :: CUSPARSE_SPMV_ALG_DEFAULT = 0
enumerator :: CUSPARSE_SPMV_CSR_ALG1 = 2
enumerator :: CUSPARSE_SPMV_CSR_ALG2 = 3
enumerator :: CUSPARSE_SPMV_COO_ALG1 = 1
enumerator :: CUSPARSE_SPMV_COO_ALG2 = 4
enumerator :: CUSPARSE_SPMV_SELL_ALG1 = 5
end enum
enum, bind(C) ! cusparseSpMMAlg_t;
enumerator :: CUSPARSE_MM_ALG_DEFAULT = 0
enumerator :: CUSPARSE_COOMM_ALG1 = 1
enumerator :: CUSPARSE_COOMM_ALG2 = 2
enumerator :: CUSPARSE_COOMM_ALG3 = 3
enumerator :: CUSPARSE_CSRMM_ALG1 = 4
enumerator :: CUSPARSE_SPMM_ALG_DEFAULT = 0
enumerator :: CUSPARSE_SPMM_COO_ALG1 = 1
enumerator :: CUSPARSE_SPMM_COO_ALG2 = 2
enumerator :: CUSPARSE_SPMM_COO_ALG3 = 3
enumerator :: CUSPARSE_SPMM_COO_ALG4 = 5
enumerator :: CUSPARSE_SPMM_CSR_ALG1 = 4
enumerator :: CUSPARSE_SPMM_CSR_ALG2 = 6
enumerator :: CUSPARSE_SPMM_CSR_ALG3 = 12
enumerator :: CUSPARSE_SPMM_BLOCKED_ELL_ALG1 = 13
end enum
enum, bind(C) ! cusparseIndexType_t;
enumerator :: CUSPARSE_INDEX_16U = 1
enumerator :: CUSPARSE_INDEX_32I = 2
enumerator :: CUSPARSE_INDEX_64I = 3
end enum
enum, bind(C) ! cusparseSpMatAttribute_t;
enumerator :: CUSPARSE_SPMAT_FILL_MODE = 0
enumerator :: CUSPARSE_SPMAT_DIAG_TYPE = 1
end enum
enum, bind(C) ! cusparseSparseToDenseAlg_t;
enumerator :: CUSPARSE_SPARSETODENSE_ALG_DEFAULT = 0
enumerator :: CUSPARSE_DENSETOSPARSE_ALG_DEFAULT = 0
end enum
enum, bind(C) ! cusparseSpSVAlg_t;
enumerator :: CUSPARSE_SPSV_ALG_DEFAULT = 0
end enum
enum, bind(C) ! cusparseSpSVUpdate_t;
enumerator :: CUSPARSE_SPSV_UPDATE_GENERAL = 0
enumerator :: CUSPARSE_SPSV_UPDATE_DIAGONAL = 1
end enum
enum, bind(C) ! cusparseSpSMAlg_t;
enumerator :: CUSPARSE_SPSM_ALG_DEFAULT = 0
end enum
enum, bind(C) ! cusparseSpMMOpAlg_t;
enumerator :: CUSPARSE_SPMM_OP_ALG_DEFAULT = 0
end enum
enum, bind(C) ! cusparseSpGEMMAlg_t;
enumerator :: CUSPARSE_SPGEMM_DEFAULT = 0
enumerator :: CUSPARSE_SPGEMM_CSR_ALG_DETERMINISTIC = 1
enumerator :: CUSPARSE_SPGEMM_CSR_ALG_DETERMINITIC = 1
enumerator :: CUSPARSE_SPGEMM_CSR_ALG_NONDETERMINISTIC = 2
enumerator :: CUSPARSE_SPGEMM_CSR_ALG_NONDETERMINITIC = 2
enumerator :: CUSPARSE_SPGEMM_ALG1 = 3
enumerator :: CUSPARSE_SPGEMM_ALG2 = 4
enumerator :: CUSPARSE_SPGEMM_ALG3 = 5
end enum
enum, bind(C) ! cusparseSDDMMAlg_t;
enumerator :: CUSPARSE_SDDMM_ALG_DEFAULT = 0
end enum
5.1.1. cusparseCreate
This function initializes the cuSPARSE library and creates a handle on the cuSPARSE context. It must be called before any other cuSPARSE API function is invoked. It allocates hardware resources necessary for accessing the GPU.
integer(4) function cusparseCreate(handle)
type(cusparseHandle) :: handle
5.1.2. cusparseDestroy
This function releases CPU-side resources used by the cuSPARSE library. The release of GPU-side resources may be deferred until the application shuts down.
integer(4) function cusparseDestroy(handle)
type(cusparseHandle) :: handle
5.1.3. cusparseGetErrorName
This function returns the error code name.
character*128 function cusparseGetErrorName(ierr)
integer(c_int) :: ierr
5.1.4. cusparseGetErrorString
This function returns the description string for an error code.
character*128 function cusparseGetErrorString(ierr)
integer(c_int) :: ierr
5.1.5. cusparseGetVersion
This function returns the version number of the cuSPARSE library.
integer(4) function cusparseGetVersion(handle, version)
type(cusparseHandle) :: handle
integer(c_int) :: version
5.1.6. cusparseSetStream
This function sets the stream to be used by the cuSPARSE library to execute its routines.
integer(4) function cusparseSetStream(handle, stream)
type(cusparseHandle) :: handle
integer(cuda_stream_kind) :: stream
5.1.7. cusparseGetStream
This function gets the stream used by the cuSPARSE library to execute its routines. If the cuSPARSE library stream is not set, all kernels use the default NULL stream.
integer(4) function cusparseGetStream(handle, stream)
type(cusparseHandle) :: handle
integer(cuda_stream_kind) :: stream
5.1.8. cusparseGetPointerMode
This function obtains the pointer mode used by the cuSPARSE library. Please see section 1.6 for more details on pointer modes.
integer(4) function cusparseGetPointerMode(handle, mode)
type(cusparseHandle) :: handle
integer(c_int) :: mode
5.1.9. cusparseSetPointerMode
This function sets the pointer mode used by the cuSPARSE library. In these Fortran interfaces, this only has an effect when using the *_v2 interfaces. The default is for the values to be passed by reference on the host. Please see section 1.6 for more details on pointer modes.
integer(4) function cusparseSetPointerMode(handle, mode)
type(cusparseHandle) :: handle
integer(4) :: mode
5.1.10. cusparseCreateMatDescr
This function initializes the matrix descriptor. It sets the fields MatrixType and IndexBase to the default values CUSPARSE_MATRIX_TYPE_GENERAL and CUSPARSE_INDEX_BASE_ZERO, respectively, while leaving other fields uninitialized.
integer(4) function cusparseCreateMatDescr(descrA)
type(cusparseMatDescr) :: descrA
5.1.11. cusparseDestroyMatDescr
This function releases the memory allocated for the matrix descriptor.
integer(4) function cusparseDestroyMatDescr(descrA)
type(cusparseMatDescr) :: descrA
5.1.12. cusparseSetMatType
This function sets the MatrixType of the matrix descriptor descrA.
integer(4) function cusparseSetMatType(descrA, type)
type(cusparseMatDescr) :: descrA
integer(4) :: type
5.1.13. cusparseGetMatType
This function returns the MatrixType of the matrix descriptor descrA.
integer(4) function cusparseGetMatType(descrA)
type(cusparseMatDescr) :: descrA
5.1.14. cusparseSetMatFillMode
This function sets the FillMode field of the matrix descriptor descrA.
integer(4) function cusparseSetMatFillMode(descrA, mode)
type(cusparseMatDescr) :: descrA
integer(4) :: mode
5.1.15. cusparseGetMatFillMode
This function returns the FillMode field of the matrix descriptor descrA.
integer(4) function cusparseGetMatFillMode(descrA)
type(cusparseMatDescr) :: descrA
5.1.16. cusparseSetMatDiagType
This function sets the DiagType of the matrix descriptor descrA.
integer(4) function cusparseSetMatDiagType(descrA, type)
type(cusparseMatDescr) :: descrA
integer(4) :: type
5.1.17. cusparseGetMatDiagType
This function returns the DiagType of the matrix descriptor descrA.
integer(4) function cusparseGetMatDiagType(descrA)
type(cusparseMatDescr) :: descrA
5.1.18. cusparseSetMatIndexBase
This function sets the IndexBase field of the matrix descriptor descrA.
integer(4) function cusparseSetMatIndexBase(descrA, base)
type(cusparseMatDescr) :: descrA
integer(4) :: base
5.1.19. cusparseGetMatIndexBase
This function returns the IndexBase field of the matrix descriptor descrA.
integer(4) function cusparseGetMatIndexBase(descrA)
type(cusparseMatDescr) :: descrA
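A typical use of the descriptor helpers above is to create a descriptor and set its type, fill mode, diagonal type, and index base before passing it to the computational routines. The sketch below shows one such setup; the particular settings chosen here are arbitrary.
subroutine setup_mat_descr(descrA)
  use cusparse
  implicit none
  type(cusparseMatDescr), intent(out) :: descrA
  integer :: istat

  istat = cusparseCreateMatDescr(descrA)
  istat = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
  istat = cusparseSetMatFillMode(descrA, CUSPARSE_FILL_MODE_LOWER)
  istat = cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT)
  istat = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)
  ! use descrA in computational calls, then release it with cusparseDestroyMatDescr
end subroutine setup_mat_descr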
5.1.20. cusparseCreateSolveAnalysisInfo
This function creates and initializes the solve and analysis structure to default values. This function, and all functions which use the cusparseSolveAnalysisInfo type, were removed in CUDA 11.0.
integer(4) function cusparseCreateSolveAnalysisInfo(info)
type(cusparseSolveAnalysisinfo) :: info
5.1.21. cusparseDestroySolveAnalysisInfo
This function destroys and releases any memory required by the structure. This function, and all functions which use the cusparseSolveAnalysisInfo type, were removed in CUDA 11.0.
integer(4) function cusparseDestroySolveAnalysisInfo(info)
type(cusparseSolveAnalysisinfo) :: info
5.1.22. cusparseGetLevelInfo
This function returns the number of levels and the assignment of rows into the levels computed by either the csrsv_analysis, csrsm_analysis or hybsv_analysis routines.
integer(4) function cusparseGetLevelInfo(handle, info, nlevels, levelPtr, levelInd)
type(cusparseHandle) :: handle
type(cusparseSolveAnalysisinfo) :: info
integer(c_int) :: nlevels
type(c_ptr) :: levelPtr
type(c_ptr) :: levelInd
5.1.23. cusparseCreateHybMat
This function creates and initializes the hybA opaque data structure. This function, and all functions which use the cusparseHybMat type, were removed in CUDA 11.0.
integer(4) function cusparseCreateHybMat(hybA)
type(cusparseHybMat) :: hybA
5.1.24. cusparseDestroyHybMat
This function destroys and releases any memory required by the hybA structure. This function, and all functions which use the cusparseHybMat type, were removed in CUDA 11.0.
integer(4) function cusparseDestroyHybMat(hybA)
type(cusparseHybMat) :: hybA
5.1.25. cusparseCreateCsrsv2Info
This function creates and initializes the solve and analysis structure of csrsv2 to default values.
integer(4) function cusparseCreateCsrsv2Info(info)
type(cusparseCsrsv2Info) :: info
5.1.26. cusparseDestroyCsrsv2Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyCsrsv2Info(info)
type(cusparseCsrsv2Info) :: info
5.1.27. cusparseCreateCsric02Info
This function creates and initializes the solve and analysis structure of incomplete Cholesky to default values.
integer(4) function cusparseCreateCsric02Info(info)
type(cusparseCsric02Info) :: info
5.1.28. cusparseDestroyCsric02Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyCsric02Info(info)
type(cusparseCsric02Info) :: info
5.1.29. cusparseCreateCsrilu02Info
This function creates and initializes the solve and analysis structure of incomplete LU to default values.
integer(4) function cusparseCreateCsrilu02Info(info)
type(cusparseCsrilu02Info) :: info
5.1.30. cusparseDestroyCsrilu02Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyCsrilu02Info(info)
type(cusparseCsrilu02Info) :: info
5.1.31. cusparseCreateBsrsv2Info
This function creates and initializes the solve and analysis structure of bsrsv2 to default values.
integer(4) function cusparseCreateBsrsv2Info(info)
type(cusparseBsrsv2Info) :: info
5.1.32. cusparseDestroyBsrsv2Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyBsrsv2Info(info)
type(cusparseBsrsv2Info) :: info
5.1.33. cusparseCreateBsric02Info
This function creates and initializes the solve and analysis structure of block incomplete Cholesky to default values.
integer(4) function cusparseCreateBsric02Info(info)
type(cusparseBsric02Info) :: info
5.1.34. cusparseDestroyBsric02Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyBsric02Info(info)
type(cusparseBsric02Info) :: info
5.1.35. cusparseCreateBsrilu02Info
This function creates and initializes the solve and analysis structure of block incomplete LU to default values.
integer(4) function cusparseCreateBsrilu02Info(info)
type(cusparseBsrilu02Info) :: info
5.1.36. cusparseDestroyBsrilu02Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyBsrilu02Info(info)
type(cusparseBsrilu02Info) :: info
5.1.37. cusparseCreateBsrsm2Info
This function creates and initializes the solve and analysis structure of bsrsm2 to default values.
integer(4) function cusparseCreateBsrsm2Info(info)
type(cusparseBsrsm2Info) :: info
5.1.38. cusparseDestroyBsrsm2Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyBsrsm2Info(info)
type(cusparseBsrsm2Info) :: info
5.1.39. cusparseCreateCsrgemm2Info
This function creates and initializes the analysis structure of general sparse matrix-matrix multiplication.
integer(4) function cusparseCreateCsrgemm2Info(info)
type(cusparseCsrgemm2Info) :: info
5.1.40. cusparseDestroyCsrgemm2Info
This function destroys and releases any memory required by the structure.
integer(4) function cusparseDestroyCsrgemm2Info(info)
type(cusparseCsrgemm2Info) :: info
5.1.41. cusparseCreateColorInfo
This function creates coloring information used in calls like CSRCOLOR.
integer(4) function cusparseCreateColorInfo(info)
type(cusparseColorInfo) :: info
5.1.42. cusparseDestroyColorInfo
This function destroys coloring information used in calls like CSRCOLOR.
integer(4) function cusparseDestroyColorInfo(info)
type(cusparseColorInfo) :: info
5.1.43. cusparseCreateCsru2csrInfo
This function creates sorting information used in calls like CSRU2CSR.
integer(4) function cusparseCreateCsru2csrInfo(info)
type(cusparseCsru2csrInfo) :: info
5.1.44. cusparseDestroyCsru2csrInfo
This function destroys sorting information used in calls like CSRU2CSR.
integer(4) function cusparseDestroyCsru2csrInfo(info)
type(cusparseCsru2csrInfo) :: info
5.2. CUSPARSE Level 1 Functions
This section contains interfaces for the level 1 sparse linear algebra functions that perform operations between dense and sparse vectors.
5.2.1. cusparseSaxpyi
SAXPY performs constant times a vector plus a vector. This function multiplies the vector x in sparse format by the constant alpha and adds the result to the vector y in dense format, i.e. y = y + alpha * xVal(xInd)
integer(4) function cusparseSaxpyi(handle, nnz, alpha, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
real(4), device :: alpha ! device or host variable
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
real(4), device :: y(*)
integer :: idxBase
5.2.2. cusparseDaxpyi
DAXPY performs constant times a vector plus a vector. This function multiplies the vector x in sparse format by the constant alpha and adds the result to the vector y in dense format, i.e. y = y + alpha * xVal(xInd)
integer(4) function cusparseDaxpyi(handle, nnz, alpha, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
real(8), device :: alpha ! device or host variable
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
real(8), device :: y(*)
integer :: idxBase
5.2.3. cusparseCaxpyi
CAXPY performs constant times a vector plus a vector. This function multiplies the vector x in sparse format by the constant alpha and adds the result to the vector y in dense format, i.e. y = y + alpha * xVal(xInd)
integer(4) function cusparseCaxpyi(handle, nnz, alpha, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
complex(4), device :: alpha ! device or host variable
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
complex(4), device :: y(*)
integer :: idxBase
5.2.4. cusparseZaxpyi
ZAXPY performs constant times a vector plus a vector. This function multiplies the vector x in sparse format by the constant alpha and adds the result to the vector y in dense format, i.e. y = y + alpha * xVal(xInd)
integer(4) function cusparseZaxpyi(handle, nnz, alpha, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
complex(8), device :: alpha ! device or host variable
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
complex(8), device :: y(*)
integer :: idxBase
5.2.5. cusparseSdoti
SDOT forms the dot product of two vectors. This function returns the dot product of a vector x in sparse format and vector y in dense format, i.e. res = sum(y * xVal(xInd))
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpVV
integer(4) function cusparseSdoti(handle, nnz, xVal, xInd, y, res, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
real(4), device :: y(*)
real(4), device :: res ! device or host variable
integer :: idxBase
5.2.6. cusparseDdoti
DDOT forms the dot product of two vectors. This function returns the dot product of a vector x in sparse format and vector y in dense format, i.e. res = sum(y * xVal(xInd))
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpVV
integer(4) function cusparseDdoti(handle, nnz, xVal, xInd, y, res, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
real(8), device :: y(*)
real(8), device :: res ! device or host variable
integer :: idxBase
5.2.7. cusparseCdoti
CDOT forms the dot product of two vectors. This function returns the dot product of a vector x in sparse format and vector y in dense format, i.e. res = sum(y * xVal(xInd))
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpVV
integer(4) function cusparseCdoti(handle, nnz, xVal, xInd, y, res, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
complex(4), device :: y(*)
complex(4), device :: res ! device or host variable
integer :: idxBase
5.2.8. cusparseZdoti
ZDOT forms the dot product of two vectors. This function returns the dot product of a vector x in sparse format and vector y in dense format, i.e. res = sum(y * xVal(xInd))
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpVV
integer(4) function cusparseZdoti(handle, nnz, xVal, xInd, y, res, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
complex(8), device :: y(*)
complex(8), device :: res ! device or host variable
integer :: idxBase
5.2.9. cusparseCdotci
CDOTC forms the dot product of two vectors, conjugating the first vector. This function returns the dot product of a vector x in sparse format and vector y in dense format, i.e. res = sum(y * conjg(xVal(xInd)))
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpVV
integer(4) function cusparseCdotci(handle, nnz, xVal, xInd, y, res, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
complex(4), device :: y(*)
complex(4), device :: res ! device or host variable
integer :: idxBase
5.2.10. cusparseZdotci
ZDOTC forms the dot product of two vectors, conjugating the first vector. This function returns the dot product of a vector x in sparse format and vector y in dense format, i.e. res = sum(y * conjg(xVal(xInd)))
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpVV
integer(4) function cusparseZdotci(handle, nnz, xVal, xInd, y, res, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
complex(8), device :: y(*)
complex(8), device :: res ! device or host variable
integer :: idxBase
5.2.11. cusparseSgthr
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal, i.e. xVal = y(xInd). Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseSgthr(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
real(4), device :: y(*)
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
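For example, the following sketch gathers three elements of a dense vector into sparse form. It assumes use cusparse, a handle h already created with cusparseCreate, and the CUSPARSE_INDEX_BASE_ONE constant from the module.
! Sketch: xVal = y(xInd) for a single-precision dense y.
integer, parameter :: n = 6, nnz = 3
real(4), device :: yD(n), xValD(nnz)
integer(4), device :: xIndD(nnz)
integer :: istat

yD = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xIndD = [1, 3, 6]            ! 1-based positions to gather

istat = cusparseSgthr(h, nnz, yD, xValD, xIndD, CUSPARSE_INDEX_BASE_ONE)
! xValD now holds (1.0, 3.0, 6.0)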
5.2.12. cusparseDgthr
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal, i.e. xVal = y(xInd). Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseDgthr(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
real(8), device :: y(*)
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.13. cusparseCgthr
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal, i.e. xVal = y(xInd). Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseCgthr(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
complex(4), device :: y(*)
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.14. cusparseZgthr
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal, i.e. xVal = y(xInd). Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseZgthr(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
complex(8), device :: y(*)
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.15. cusparseSgthrz
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal and sets those elements of y to zero, i.e. xVal = y(xInd); y(xInd) = 0.0. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseSgthrz(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
real(4), device :: y(*)
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.16. cusparseDgthrz
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal and sets those elements of y to zero, i.e. xVal = y(xInd); y(xInd) = 0.0. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseDgthrz(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
real(8), device :: y(*)
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.17. cusparseCgthrz
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal and sets those elements of y to zero, i.e. xVal = y(xInd); y(xInd) = 0.0. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseCgthrz(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
complex(4), device :: y(*)
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.18. cusparseZgthrz
This function gathers the elements of the vector y listed in the index array xInd into the data array xVal and sets those elements of y to zero, i.e. xVal = y(xInd); y(xInd) = 0.0. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseZgthrz(handle, nnz, y, xVal, xInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
complex(8), device :: y(*)
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
integer(4) :: idxBase
5.2.19. cusparseSsctr
This function scatters the elements of the vector x in sparse format into the vector y in dense format, i.e. y(xInd) = x. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseSsctr(handle, nnz, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
real(4), device :: y(*)
integer(4) :: idxBase
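A sketch of the complementary scatter operation follows, under the same assumptions as the gather sketch above (use cusparse, an existing handle h, 1-based indexing):
! Sketch: y(xInd) = xVal, scattering a sparse x into a dense y.
integer, parameter :: n = 6, nnz = 2
real(4), device :: yD(n), xValD(nnz)
integer(4), device :: xIndD(nnz)
integer :: istat

yD = 0.0
xValD = [7.0, 9.0]
xIndD = [2, 4]

istat = cusparseSsctr(h, nnz, xValD, xIndD, yD, CUSPARSE_INDEX_BASE_ONE)
! yD now holds (0, 7, 0, 9, 0, 0)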
5.2.20. cusparseDsctr
This function scatters the elements of the vector x in sparse format into the vector y in dense format, i.e. y(xInd) = x. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseDsctr(handle, nnz, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
real(8), device :: y(*)
integer(4) :: idxBase
5.2.21. cusparseCsctr
This function scatters the elements of the vector x in sparse format into the vector y in dense format, i.e. y(xInd) = x. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseCsctr(handle, nnz, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
complex(4), device :: y(*)
integer(4) :: idxBase
5.2.22. cusparseZsctr
This function scatters the elements of the vector x in sparse format into the vector y in dense format, i.e. y(xInd) = x. Fortran programmers should normally use idxBase == 1.
integer(4) function cusparseZsctr(handle, nnz, xVal, xInd, y, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
complex(8), device :: y(*)
integer(4) :: idxBase
5.2.23. cusparseSroti
SROT applies a plane rotation. X is a sparse vector and Y is dense.
integer(4) function cusparseSroti(handle, nnz, xVal, xInd, y, c, s, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
real(4), device :: y(*)
real(4), device :: c, s ! device or host variable
integer :: idxBase
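A brief single-precision sketch follows, assuming use cusparse and an existing handle h; with the default host pointer mode the scalars c and s can be host variables.
! Sketch: apply the Givens rotation defined by cosine c and sine s to sparse x and dense y.
integer, parameter :: n = 6, nnz = 2
real(4), device :: xValD(nnz), yD(n)
integer(4), device :: xIndD(nnz)
real(4) :: c, s
integer :: istat

c = cos(0.25); s = sin(0.25)
xValD = [1.0, 2.0]
xIndD = [2, 5]
yD = 1.0

istat = cusparseSroti(h, nnz, xValD, xIndD, yD, c, s, CUSPARSE_INDEX_BASE_ONE)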
5.2.24. cusparseDroti
DROT applies a plane rotation. X is a sparse vector and Y is dense.
integer(4) function cusparseDroti(handle, nnz, xVal, xInd, y, c, s, idxBase)
type(cusparseHandle) :: handle
integer :: nnz
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
real(8), device :: y(*)
real(8), device :: c, s ! device or host variable
integer :: idxBase
5.3. CUSPARSE Level 2 Functions
This section contains interfaces for the level 2 sparse linear algebra functions that perform operations between sparse matrices and dense vectors.
5.3.1. cusparseSbsrmv
BSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an (mb*blockDim) x (nb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd
integer(4) function cusparseSbsrmv(handle, dir, trans, mb, nb, nnzb, alpha, descr, bsrVal, bsrRowPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: mb, nb, nnzb
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
real(4), device :: x(*)
real(4), device :: beta ! device or host variable
real(4), device :: y(*)
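The sketch below multiplies a 4x4 matrix, stored as a 2x2 grid of 2x2 blocks with only the two diagonal blocks present (nnzb = 2), by a dense vector. It assumes use cusparse, an existing handle h, the CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, and CUSPARSE_INDEX_BASE_ONE constants defined by the module, and the cusparseCreateMatDescr and cusparseSetMatIndexBase helper routines.
! Sketch: y = alpha*A*x + beta*y for a block-diagonal BSR matrix of 2x2 identity blocks.
integer, parameter :: bdim = 2, mb = 2, nb = 2, nnzb = 2, m = mb*bdim
type(cusparseMatDescr) :: descr
real(4), device :: bsrValD(bdim*bdim*nnzb), xD(m), yD(m)
integer(4), device :: bsrRowPtrD(mb+1), bsrColIndD(nnzb)
real(4) :: alpha, beta
integer :: istat

istat = cusparseCreateMatDescr(descr)
istat = cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ONE)

bsrValD    = [1.0, 0.0, 0.0, 1.0,  1.0, 0.0, 0.0, 1.0]  ! two 2x2 identity blocks
bsrRowPtrD = [1, 2, 3]
bsrColIndD = [1, 2]
xD = 1.0;  yD = 0.0
alpha = 3.0;  beta = 0.0

istat = cusparseSbsrmv(h, CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, &
        mb, nb, nnzb, alpha, descr, bsrValD, bsrRowPtrD, bsrColIndD, bdim, xD, beta, yD)
! yD now holds 3.0 in every entry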
5.3.2. cusparseDbsrmv
BSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an (mb*blockDim) x (nb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd
integer(4) function cusparseDbsrmv(handle, dir, trans, mb, nb, nnzb, alpha, descr, bsrVal, bsrRowPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: mb, nb, nnzb
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
real(8), device :: x(*)
real(8), device :: beta ! device or host variable
real(8), device :: y(*)
5.3.3. cusparseCbsrmv
BSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an (mb*blockDim) x (nb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd
integer(4) function cusparseCbsrmv(handle, dir, trans, mb, nb, nnzb, alpha, descr, bsrVal, bsrRowPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: mb, nb, nnzb
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
complex(4), device :: x(*)
complex(4), device :: beta ! device or host variable
complex(4), device :: y(*)
5.3.4. cusparseZbsrmv
BSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an (mb*blockDim) x (nb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd
integer(4) function cusparseZbsrmv(handle, dir, trans, mb, nb, nnzb, alpha, &
descr, bsrVal, bsrRowPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: mb, nb, nnzb
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
complex(8), device :: x(*)
complex(8), device :: beta ! device or host variable
complex(8), device :: y(*)
5.3.5. cusparseSbsrxmv
BSRXMV performs a BSRMV and a mask operation.
integer(4) function cusparseSbsrxmv(handle, dir, trans, sizeOfMask, mb, nb, nnzb, alpha, &
descr, bsrVal, bsrMaskPtr, bsrRowPtr, bsrEndPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: sizeOfMask
integer :: mb, nb, nnzb
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(4), device :: bsrVal(*)
integer(4), device :: bsrMaskPtr(*), bsrRowPtr(*), bsrEndPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
real(4), device :: x(*)
real(4), device :: beta ! device or host variable
real(4), device :: y(*)
5.3.6. cusparseDbsrxmv
BSRXMV performs a BSRMV and a mask operation.
integer(4) function cusparseDbsrxmv(handle, dir, trans, sizeOfMask, mb, nb, nnzb, alpha, &
descr, bsrVal, bsrMaskPtr, bsrRowPtr, bsrEndPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: sizeOfMask
integer :: mb, nb, nnzb
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(8), device :: bsrVal(*)
integer(4), device :: bsrMaskPtr(*), bsrRowPtr(*), bsrEndPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
real(8), device :: x(*)
real(8), device :: beta ! device or host variable
real(8), device :: y(*)
5.3.7. cusparseCbsrxmv
BSRXMV performs a BSRMV and a mask operation.
integer(4) function cusparseCbsrxmv(handle, dir, trans, sizeOfMask, mb, nb, nnzb, alpha, &
descr, bsrVal, bsrMaskPtr, bsrRowPtr, bsrEndPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: sizeOfMask
integer :: mb, nb, nnzb
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(4), device :: bsrVal(*)
integer(4), device :: bsrMaskPtr(*), bsrRowPtr(*), bsrEndPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
complex(4), device :: x(*)
complex(4), device :: beta ! device or host variable
complex(4), device :: y(*)
5.3.8. cusparseZbsrxmv
BSRXMV performs a BSRMV and a mask operation.
integer(4) function cusparseZbsrxmv(handle, dir, trans, sizeOfMask, mb, nb, nnzb, alpha, &
descr, bsrVal, bsrMaskPtr, bsrRowPtr, bsrEndPtr, bsrColInd, blockDim, x, beta, y)
type(cusparseHandle) :: handle
integer :: dir
integer :: trans
integer :: sizeOfMask
integer :: mb, nb, nnzb
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(8), device :: bsrVal(*)
integer(4), device :: bsrMaskPtr(*), bsrRowPtr(*), bsrEndPtr(*)
integer(4), device :: bsrColInd(*)
integer :: blockDim
complex(8), device :: x(*)
complex(8), device :: beta ! device or host variable
complex(8), device :: y(*)
5.3.9. cusparseScsrmv
CSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMV
integer(4) function cusparseScsrmv(handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m, n, nnz
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
real(4), device :: x(*)
real(4), device :: beta ! device or host variable
real(4), device :: y(*)
5.3.10. cusparseDcsrmv
CSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMV
integer(4) function cusparseDcsrmv(handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m, n, nnz
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
real(8), device :: x(*)
real(8), device :: beta ! device or host variable
real(8), device :: y(*)
5.3.11. cusparseCcsrmv
CSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMV
integer(4) function cusparseCcsrmv(handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m, n, nnz
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
complex(4), device :: x(*)
complex(4), device :: beta ! device or host variable
complex(4), device :: y(*)
5.3.12. cusparseZcsrmv
CSRMV performs one of the matrix-vector operations y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMV
integer(4) function cusparseZcsrmv(handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m, n, nnz
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
complex(8), device :: x(*)
complex(8), device :: beta ! device or host variable
complex(8), device :: y(*)
5.3.13. cusparseScsrsv_analysis
This function performs the analysis phase of csrsv.
integer(4) function cusparseScsrsv_analysis(handle, trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd, info)
type(cusparseHandle) :: handle
integer(4) :: trans
integer(4) :: m, nnz
type(cusparseMatDescr) :: descr
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
5.3.14. cusparseDcsrsv_analysis
This function performs the analysis phase of csrsv.
integer(4) function cusparseDcsrsv_analysis(handle, trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd, info)
type(cusparseHandle) :: handle
integer(4) :: trans
integer(4) :: m, nnz
type(cusparseMatDescr) :: descr
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
5.3.15. cusparseCcsrsv_analysis
This function performs the analysis phase of csrsv.
integer(4) function cusparseCcsrsv_analysis(handle, trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd, info)
type(cusparseHandle) :: handle
integer(4) :: trans
integer(4) :: m, nnz
type(cusparseMatDescr) :: descr
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
5.3.16. cusparseZcsrsv_analysis
This function performs the analysis phase of csrsv.
integer(4) function cusparseZcsrsv_analysis(handle, trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd, info)
type(cusparseHandle) :: handle
integer(4) :: trans
integer(4) :: m, nnz
type(cusparseMatDescr) :: descr
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
5.3.17. cusparseScsrsv_solve
This function performs the solve phase of csrsv.
integer(4) function cusparseScsrsv_solve(handle, trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
real(4), device :: x(*)
real(4), device :: y(*)
5.3.18. cusparseDcsrsv_solve
This function performs the solve phase of csrsv.
integer(4) function cusparseDcsrsv_solve(handle, trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
real(8), device :: x(*)
real(8), device :: y(*)
5.3.19. cusparseCcsrsv_solve
This function performs the solve phase of csrsv.
integer(4) function cusparseCcsrsv_solve(handle, trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
complex(4), device :: x(*)
complex(4), device :: y(*)
5.3.20. cusparseZcsrsv_solve
This function performs the solve phase of csrsv.
integer(4) function cusparseZcsrsv_solve(handle, trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
integer :: m
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*)
integer(4), device :: csrColInd(*)
type(cusparseSolveAnalysisInfo) :: info
complex(8), device :: x(*)
complex(8), device :: y(*)
5.3.21. cusparseSgemvi_bufferSize
This function returns the buffer size, in bytes, needed by cusparseSgemvi.
integer(4) function cusparseSgemvi_bufferSize(handle, transA, m, n, nnz, pBufferSize)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, nnz
integer(4) :: pBufferSize
5.3.22. cusparseDgemvi_bufferSize
This function returns the buffer size, in bytes, needed by cusparseDgemvi.
integer(4) function cusparseDgemvi_bufferSize(handle, transA, m, n, nnz, pBufferSize)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, nnz
integer(4) :: pBufferSize
5.3.23. cusparseCgemvi_bufferSize
This function returns the buffer size, in bytes, needed by cusparseCgemvi.
integer(4) function cusparseCgemvi_bufferSize(handle, transA, m, n, nnz, pBufferSize)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, nnz
integer(4) :: pBufferSize
5.3.24. cusparseZgemvi_bufferSize
This function returns the buffer size, in bytes, needed by cusparseZgemvi.
integer(4) function cusparseZgemvi_bufferSize(handle, transA, m, n, nnz, pBufferSize)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, nnz
integer(4) :: pBufferSize
5.3.25. cusparseSgemvi
GEMVI performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, A is an m x n dense matrix, x is a sparse vector, and y is a dense vector.
integer(4) function cusparseSgemvi(handle, transA, m, n, alpha, A, lda, nnz, xVal, xInd, beta, y, idxBase, pBuffer)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, lda, nnz, idxBase
real(4), device :: alpha, beta ! device or host variable
real(4), device :: A(lda,*)
real(4), device :: xVal(*)
integer(4), device :: xInd(*)
real(4), device :: y(*)
integer(1), device :: pBuffer(*) ! Any data type is allowed
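The gemvi routines are used together with the corresponding _bufferSize query: obtain the workspace size in bytes, allocate a device buffer of at least that size, and pass it to the computation. A single-precision sketch, assuming use cusparse, an existing handle h, and the CUSPARSE_OPERATION_NON_TRANSPOSE and CUSPARSE_INDEX_BASE_ONE constants from the module:
! Sketch: y = alpha*A*x + beta*y with dense A and sparse x given as (xVal, xInd).
integer, parameter :: m = 4, n = 6, nnz = 2, lda = m
real(4), device :: AD(lda,n), xValD(nnz), yD(m)
integer(4), device :: xIndD(nnz)
integer(1), device, allocatable :: buf(:)
real(4) :: alpha, beta
integer(4) :: bsize
integer :: istat

istat = cusparseSgemvi_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz, bsize)
allocate(buf(max(bsize,1)))

AD = 1.0
xValD = [2.0, 3.0];  xIndD = [2, 5]
yD = 0.0;  alpha = 1.0;  beta = 0.0

istat = cusparseSgemvi(h, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, alpha, AD, lda, &
        nnz, xValD, xIndD, beta, yD, CUSPARSE_INDEX_BASE_ONE, buf)
! each entry of yD is now 2.0 + 3.0 = 5.0
deallocate(buf)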
5.3.26. cusparseDgemvi
GEMVI performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, A is an m x n dense matrix, x is a sparse vector, and y is a dense vector.
integer(4) function cusparseDgemvi(handle, transA, m, n, alpha, A, lda, nnz, xVal, xInd, beta, y, idxBase, pBuffer)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, lda, nnz, idxBase
real(8), device :: alpha, beta ! device or host variable
real(8), device :: A(lda,*)
real(8), device :: xVal(*)
integer(4), device :: xInd(*)
real(8), device :: y(*)
integer(1), device :: pBuffer(*) ! Any data type is allowed
5.3.27. cusparseCgemvi
GEMVI performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, A is an m x n dense matrix, x is a sparse vector, and y is a dense vector.
integer(4) function cusparseCgemvi(handle, transA, m, n, alpha, A, lda, nnz, xVal, xInd, beta, y, idxBase, pBuffer)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, lda, nnz, idxBase
complex(4), device :: alpha, beta ! device or host variable
complex(4), device :: A(lda,*)
complex(4), device :: xVal(*)
integer(4), device :: xInd(*)
complex(4), device :: y(*)
integer(1), device :: pBuffer(*) ! Any data type is allowed
5.3.28. cusparseZgemvi
GEMVI performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, A is an m x n dense matrix, x is a sparse vector, and y is a dense vector.
integer(4) function cusparseZgemvi(handle, transA, m, n, alpha, A, lda, nnz, xVal, xInd, beta, y, idxBase, pBuffer)
type(cusparseHandle) :: handle
integer :: transA
integer :: m, n, lda, nnz, idxBase
complex(8), device :: alpha, beta ! device or host variable
complex(8), device :: A(lda,*)
complex(8), device :: xVal(*)
integer(4), device :: xInd(*)
complex(8), device :: y(*)
integer(1), device :: pBuffer(*) ! Any data type is allowed
5.3.29. cusparseShybmv
HYBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in the HYB storage format.
This function was removed in CUDA 11.0.
integer(4) function cusparseShybmv(handle, trans, alpha, descr, hyb, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
real(4), device :: x(*)
real(4), device :: beta ! device or host variable
real(4), device :: y(*)
5.3.30. cusparseDhybmv
HYBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in the HYB storage format.
This function was removed in CUDA 11.0.
integer(4) function cusparseDhybmv(handle, trans, alpha, descr, hyb, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
real(8), device :: x(*)
real(8), device :: beta ! device or host variable
real(8), device :: y(*)
5.3.31. cusparseChybmv
HYBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in the HYB storage format.
This function was removed in CUDA 11.0.
integer(4) function cusparseChybmv(handle, trans, alpha, descr, hyb, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
complex(4), device :: x(*)
complex(4), device :: beta ! device or host variable
complex(4), device :: y(*)
5.3.32. cusparseZhybmv
HYBMV performs the matrix-vector operation y := alpha*A*x + beta*y, where alpha and beta are scalars, x and y are vectors and A is an m x n sparse matrix that is defined in the HYB storage format.
This function was removed in CUDA 11.0.
integer(4) function cusparseZhybmv(handle, trans, alpha, descr, hyb, x, beta, y)
type(cusparseHandle) :: handle
integer :: trans
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
complex(8), device :: x(*)
complex(8), device :: beta ! device or host variable
complex(8), device :: y(*)
5.3.33. cusparseShybsv_analysis
This function performs the analysis phase of hybsv.
integer(4) function cusparseShybsv_analysis(handle, trans, descr, hyb, info)
type(cusparseHandle) :: handle
integer(4) :: trans
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
5.3.34. cusparseDhybsv_analysis
This function performs the analysis phase of hybsv.
integer(4) function cusparseDhybsv_analysis(handle, trans, descr, hyb, info)
type(cusparseHandle) :: handle
integer(4) :: trans
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
5.3.35. cusparseChybsv_analysis
This function performs the analysis phase of hybsv.
integer(4) function cusparseChybsv_analysis(handle, trans, descr, hyb, info)
type(cusparseHandle) :: handle
integer(4) :: trans
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
5.3.36. cusparseZhybsv_analysis
This function performs the analysis phase of hybsv.
integer(4) function cusparseZhybsv_analysis(handle, trans, descr, hyb, info)
type(cusparseHandle) :: handle
integer(4) :: trans
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
5.3.37. cusparseShybsv_solve
This function performs the solve phase of hybsv.
integer(4) function cusparseShybsv_solve(handle, trans, alpha, descr, hyb, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
real(4), device :: x(*)
real(4), device :: y(*)
5.3.38. cusparseDhybsv_solve
This function performs the solve phase of hybsv.
integer(4) function cusparseDhybsv_solve(handle, trans, alpha, descr, hyb, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
real(8), device :: x(*)
real(8), device :: y(*)
5.3.39. cusparseChybsv_solve
This function performs the solve phase of hybsv.
integer(4) function cusparseChybsv_solve(handle, trans, alpha, descr, hyb, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
complex(4), device :: x(*)
complex(4), device :: y(*)
5.3.40. cusparseZhybsv_solve
This function performs the solve phase of hybsv.
integer(4) function cusparseZhybsv_solve(handle, trans, alpha, descr, hyb, info, x, y)
type(cusparseHandle) :: handle
integer :: trans
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descr
type(cusparseHybMat) :: hyb
type(cusparseSolveAnalysisInfo) :: info
complex(8), device :: x(*)
complex(8), device :: y(*)
5.3.41. cusparseSbsrsv2_bufferSize
This function returns the size of the buffer used in bsrsv2.
integer(4) function cusparseSbsrsv2_bufferSize(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.42. cusparseDbsrsv2_bufferSize
This function returns the size of the buffer used in bsrsv2.
integer(4) function cusparseDbsrsv2_bufferSize(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.43. cusparseCbsrsv2_bufferSize
This function returns the size of the buffer used in bsrsv2.
integer(4) function cusparseCbsrsv2_bufferSize(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.44. cusparseZbsrsv2_bufferSize
This function returns the size of the buffer used in bsrsv2.
integer(4) function cusparseZbsrsv2_bufferSize(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.45. cusparseSbsrsv2_analysis
This function performs the analysis phase of bsrsv2.
integer(4) function cusparseSbsrsv2_analysis(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.46. cusparseDbsrsv2_analysis
This function performs the analysis phase of bsrsv2.
integer(4) function cusparseDbsrsv2_analysis(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.47. cusparseCbsrsv2_analysis
This function performs the analysis phase of bsrsv2.
integer(4) function cusparseCbsrsv2_analysis(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.48. cusparseZbsrsv2_analysis
This function performs the analysis phase of bsrsv2.
integer(4) function cusparseZbsrsv2_analysis(handle, dirA, transA, mb, nnzb, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.49. cusparseSbsrsv2_solve
This function performs the solve phase of bsrsv2.
integer(4) function cusparseSbsrsv2_solve(handle, dirA, transA, mb, nnzb, &
alpha, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, mb, nnzb
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim
type(cusparseBsrsv2Info) :: info
real(4), device :: x(*), y(*)
integer :: policy
character, device :: pBuffer(*)
5.3.50. cusparseDbsrsv2_solve
This function performs the solve phase of bsrsv2.
integer(4) function cusparseDbsrsv2_solve(handle, dirA, transA, mb, nnzb, &
alpha, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, mb, nnzb
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim
type(cusparseBsrsv2Info) :: info
real(8), device :: x(*), y(*)
integer :: policy
character, device :: pBuffer(*)
5.3.51. cusparseCbsrsv2_solve
This function performs the solve phase of bsrsv2.
integer(4) function cusparseCbsrsv2_solve(handle, dirA, transA, mb, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, mb, nnzb
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim
type(cusparseBsrsv2Info) :: info
complex(4), device :: x(*), y(*)
integer :: policy
character, device :: pBuffer(*)
5.3.52. cusparseZbsrsv2_solve
This function performs the solve phase of bsrsv2.
integer(4) function cusparseZbsrsv2_solve(handle, dirA, transA, mb, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, mb, nnzb
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim
type(cusparseBsrsv2Info) :: info
complex(8), device :: x(*), y(*)
integer :: policy
character, device :: pBuffer(*)
5.3.53. cusparseXbsrsv2_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either structurally or numerically zero. Otherwise, position is set to -1.
integer(4) function cusparseXbsrsv2_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseBsrsv2Info) :: info
integer(4), device :: position ! device or host variable
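Taken together, the bsrsv2 routines follow a fixed pattern: query the buffer size, allocate the workspace, run the analysis phase, optionally check for a zero pivot, and run the solve phase, reusing the same info structure and workspace. The sketch below solves a lower-triangular block system whose two diagonal blocks are 2x2 identities, so the solution is simply alpha*x. It assumes use cusparse, an existing handle h, the cusparseCreateBsrsv2Info/cusparseDestroyBsrsv2Info and matrix-descriptor helper routines, and the enumeration constants (CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_FILL_MODE_LOWER, CUSPARSE_DIAG_TYPE_NON_UNIT, CUSPARSE_SOLVE_POLICY_USE_LEVEL, CUSPARSE_STATUS_ZERO_PIVOT) defined by the module; the workspace is declared here as a character array to match the pBuffer declarations above.
! Sketch: solve op(A)*y = alpha*x for a lower-triangular BSR matrix A.
integer, parameter :: bdim = 2, mb = 2, nnzb = 2, m = mb*bdim
type(cusparseMatDescr) :: descr
type(cusparseBsrsv2Info) :: info
real(4), device :: bsrValD(bdim*bdim*nnzb), xD(m), yD(m)
integer(4), device :: bsrRowPtrD(mb+1), bsrColIndD(nnzb)
character(len=1), device, allocatable :: buf(:)
real(4) :: alpha
integer(4) :: bsize, istat, position

istat = cusparseCreateMatDescr(descr)
istat = cusparseSetMatFillMode(descr, CUSPARSE_FILL_MODE_LOWER)
istat = cusparseSetMatDiagType(descr, CUSPARSE_DIAG_TYPE_NON_UNIT)
istat = cusparseCreateBsrsv2Info(info)

bsrValD    = [1.0, 0.0, 0.0, 1.0,  1.0, 0.0, 0.0, 1.0]  ! identity blocks on the diagonal
bsrRowPtrD = [1, 2, 3]
bsrColIndD = [1, 2]
xD = 1.0;  alpha = 2.0

istat = cusparseSbsrsv2_bufferSize(h, CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, &
        mb, nnzb, descr, bsrValD, bsrRowPtrD, bsrColIndD, bdim, info, bsize)
allocate(buf(max(bsize,1)))

istat = cusparseSbsrsv2_analysis(h, CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, &
        mb, nnzb, descr, bsrValD, bsrRowPtrD, bsrColIndD, bdim, info, CUSPARSE_SOLVE_POLICY_USE_LEVEL, buf)
if (cusparseXbsrsv2_zeroPivot(h, info, position) == CUSPARSE_STATUS_ZERO_PIVOT) &
   print *, 'zero pivot in block row ', position

istat = cusparseSbsrsv2_solve(h, CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, &
        mb, nnzb, alpha, descr, bsrValD, bsrRowPtrD, bsrColIndD, bdim, info, xD, yD, &
        CUSPARSE_SOLVE_POLICY_USE_LEVEL, buf)
! yD now holds 2.0 in every entry

deallocate(buf)
istat = cusparseDestroyBsrsv2Info(info)
istat = cusparseDestroyMatDescr(descr)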
5.3.54. cusparseScsrsv2_bufferSize
This function returns the size of the buffer used in csrsv2.
integer(4) function cusparseScsrsv2_bufferSize(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.55. cusparseDcsrsv2_bufferSize
This function returns the size of the buffer used in csrsv2.
integer(4) function cusparseDcsrsv2_bufferSize(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.56. cusparseCcsrsv2_bufferSize
This function returns the size of the buffer used in csrsv2.
integer(4) function cusparseCcsrsv2_bufferSize(handle, transA, m, nnz, descrA, csrValA, &
csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.57. cusparseZcsrsv2_bufferSize
This function returns the size of the buffer used in csrsv2.
integer(4) function cusparseZcsrsv2_bufferSize(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.3.58. cusparseScsrsv2_analysis
This function performs the analysis phase of csrsv2.
integer(4) function cusparseScsrsv2_analysis(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.59. cusparseDcsrsv2_analysis
This function performs the analysis phase of csrsv2.
integer(4) function cusparseDcsrsv2_analysis(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.60. cusparseCcsrsv2_analysis
This function performs the analysis phase of csrsv2.
integer(4) function cusparseCcsrsv2_analysis(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.61. cusparseZcsrsv2_analysis
This function performs the analysis phase of csrsv2.
integer(4) function cusparseZcsrsv2_analysis(handle, transA, m, nnz, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.3.62. cusparseScsrsv2_solve
This function performs the solve phase of csrsv2.
integer(4) function cusparseScsrsv2_solve(handle, transA, m, nnz, alpha, descrA, &
csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: transA, m, nnz
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*), x(*), y(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer :: policy
character, device :: pBuffer(*)
5.3.63. cusparseDcsrsv2_solve
This function performs the solve phase of csrsv2.
integer(4) function cusparseDcsrsv2_solve(handle, transA, m, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: transA, m, nnz
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*), x(*), y(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer :: policy
character, device :: pBuffer(*)
5.3.64. cusparseCcsrsv2_solve
This function performs the solve phase of csrsv2.
integer(4) function cusparseCcsrsv2_solve(handle, transA, m, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: transA, m, nnz
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*), x(*), y(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer :: policy
character, device :: pBuffer(*)
5.3.65. cusparseZcsrsv2_solve
This function performs the solve phase of csrsv2.
integer(4) function cusparseZcsrsv2_solve(handle, transA, m, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: transA, m, nnz
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*), x(*), y(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrsv2Info) :: info
integer :: policy
character, device :: pBuffer(*)
5.3.66. cusparseXcsrsv2_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either structurally or numerically zero. Otherwise, position is set to -1.
integer(4) function cusparseXcsrsv2_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseCsrsv2Info) :: info
integer(4), device :: position ! device or host variable
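The csrsv2 family follows the same buffer-size/analysis/solve pattern illustrated above for bsrsv2. After the analysis (or solve) phase, a singular diagonal entry can be detected as in the following sketch, which assumes an existing handle h, a cusparseCsrsv2Info structure info that has already been through cusparseScsrsv2_analysis, and the CUSPARSE_STATUS_ZERO_PIVOT constant from the cusparse module.
! Sketch: test for a structurally or numerically zero diagonal after csrsv2 analysis.
integer(4) :: istat, position   ! position may also be a device variable

istat = cusparseXcsrsv2_zeroPivot(h, info, position)
if (istat == CUSPARSE_STATUS_ZERO_PIVOT) then
   print *, 'A(j,j) is zero at j = ', position
else
   ! position is -1 when no zero pivot was found
end if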
5.4. CUSPARSE Level 3 Functions
This section contains interfaces for the level 3 sparse linear algebra functions that perform operations between sparse and dense matrices.
5.4.1. cusparseScsrmm
CSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * B + beta*C, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseScsrmm(handle, transA, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, m, n, k, nnz
real(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.2. cusparseDcsrmm
CSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * B + beta*C, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseDcsrmm(handle, transA, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, m, n, k, nnz
real(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.3. cusparseCcsrmm
CSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * B + beta*C, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseCcsrmm(handle, transA, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, m, n, k, nnz
complex(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.4. cusparseZcsrmm
CSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * B + beta*C, where op( A ) is one of op( A ) = A or op( A ) = A**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseZcsrmm(handle, transA, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, m, n, k, nnz
complex(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.5. cusparseScsrmm2
CSRMM2 performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseScsrmm2(handle, transA, transB, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, transB, m, n, k, nnz
real(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.6. cusparseDcsrmm2
CSRMM2 performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseDcsrmm2(handle, transA, transB, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, transB, m, n, k, nnz
real(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.7. cusparseCcsrmm2
CSRMM2 performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseCcsrmm2(handle, transA, transB, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, transB, m, n, k, nnz
complex(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.8. cusparseZcsrmm2
CSRMM2 performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, alpha and beta are scalars. A is an m x k sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. B and C are dense matrices.
This function was removed in CUDA 11.0. It should be replaced with a call to cusparseSpMM
integer(4) function cusparseZcsrmm2(handle, transA, transB, m, n, k, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: transA, transB, m, n, k, nnz
complex(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*), B(*), C(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
integer :: ldb, ldc
5.4.9. cusparseScsrsm_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseScsrsm_analysis(handle, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.4.10. cusparseDcsrsm_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseDcsrsm_analysis(handle, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.4.11. cusparseCcsrsm_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseCcsrsm_analysis(handle, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.4.12. cusparseZcsrsm_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseZcsrsm_analysis(handle, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: transA, m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.4.13. cusparseScsrsm_solve
This function performs the solve phase of csrsm.
integer(4) function cusparseScsrsm_solve(handle, transA, m, n, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, X, ldx, Y, ldy)
type(cusparseHandle) :: handle
integer :: transA, m, n
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
real(4), device :: X(*), Y(*)
integer :: ldx, ldy
5.4.14. cusparseDcsrsm_solve
This function performs the solve phase of csrsm.
integer(4) function cusparseDcsrsm_solve(handle, transA, m, n, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, X, ldx, Y, ldy)
type(cusparseHandle) :: handle
integer :: transA, m, n
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
real(8), device :: X(*), Y(*)
integer :: ldx, ldy
5.4.15. cusparseCcsrsm_solve
This function performs the solve phase of csrsm.
integer(4) function cusparseCcsrsm_solve(handle, transA, m, n, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, X, ldx, Y, ldy)
type(cusparseHandle) :: handle
integer :: transA, m, n
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
complex(4), device :: X(*), Y(*)
integer :: ldx, ldy
5.4.16. cusparseZcsrsm_solve
This function performs the solve phase of csrsm.
integer(4) function cusparseZcsrsm_solve(handle, transA, m, n, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, info, X, ldx, Y, ldy)
type(cusparseHandle) :: handle
integer :: transA, m, n
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
complex(8), device :: X(*), Y(*)
integer :: ldx, ldy
5.4.17. cusparseScsrsm2_bufferSizeExt
This function computes the work buffer size needed for the cusparseScsrsm2 routines.
integer(4) function cusparseScsrsm2_bufferSizeExt(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
real(4) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
real(4), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(8) :: pBufferSize
5.4.18. cusparseDcsrsm2_bufferSizeExt
This function computes the work buffer size needed for the cusparseDcsrsm2 routines.
integer(4) function cusparseDcsrsm2_bufferSizeExt(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
real(8) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
real(8), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(8) :: pBufferSize
5.4.19. cusparseCcsrsm2_bufferSizeExt
This function computes the work buffer size needed for the cusparseCcsrsm2 routines.
integer(4) function cusparseCcsrsm2_bufferSizeExt(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
complex(4) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
complex(4), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(8) :: pBufferSize
5.4.20. cusparseZcsrsm2_bufferSizeExt
This function computes the work buffer size needed for the cusparseZcsrsm2 routines.
integer(4) function cusparseZcsrsm2_bufferSizeExt(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
complex(8) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
complex(8), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(8) :: pBufferSize
5.4.21. cusparseScsrsm2_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseScsrsm2_analysis(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
real(4) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
real(4), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.22. cusparseDcsrsm2_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseDcsrsm2_analysis(handle, algo, transA, transB, m, nrhs, nnz, alpha, descrA, &
csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
real(8) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
real(8), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.23. cusparseCcsrsm2_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseCcsrsm2_analysis(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
complex(4) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
complex(4), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.24. cusparseZcsrsm2_analysis
This function performs the analysis phase of csrsm.
integer(4) function cusparseZcsrsm2_analysis(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
complex(8) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
complex(8), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.25. cusparseScsrsm2_solve
This function performs the solve phase of csrsm2, solving the sparse triangular linear system op(A) * op(X) = alpha * op(B). A is an m x m sparse matrix in CSR storage format; B and X are the right-hand side matrix and the solution matrix, and B is overwritten with X.
integer(4) function cusparseScsrsm2_solve(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
real(4) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
real(4), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.26. cusparseDcsrsm2_solve
This function performs the solve phase of csrsm2, solving the sparse triangular linear system op(A) * op(X) = alpha * op(B). A is an m x m sparse matrix in CSR storage format; B and X are the right-hand side matrix and the solution matrix, and B is overwritten with X.
integer(4) function cusparseDcsrsm2_solve(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
real(8) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
real(8), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.27. cusparseCcsrsm2_solve
This function performs the solve phase of csrsm2, solving the sparse triangular linear system op(A) * op(X) = alpha * op(B). A is an m x m sparse matrix in CSR storage format; B and X are the right-hand side matrix and the solution matrix, and B is overwritten with X.
integer(4) function cusparseCcsrsm2_solve(handle, algo, transA, transB, m, nrhs, nnz, alpha, descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
complex(4) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
complex(4), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.28. cusparseZcsrsm2_solve
This function performs the solve phase of csrsm2, solving the sparse triangular linear system op(A) * op(X) = alpha * op(B). A is an m x m sparse matrix in CSR storage format; B and X are the right-hand side matrix and the solution matrix, and B is overwritten with X.
integer(4) function cusparseZcsrsm2_solve(handle, algo, transA, transB, m, nrhs, nnz, alpha, &
descrA, csrValA, csrRowPtrA, csrColIndA, B, ldb, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, transA, transB, m, nrhs, nnz, ldb, policy
complex(8) :: alpha ! host or device variable
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
complex(8), device :: B(ldb,*)
type(cusparseCsrsm2Info) :: info
integer(1), device :: pBuffer ! Any data type
5.4.29. cusparseXcsrsm2_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either structural zero or numerical zero. Otherwise, position is set to -1.
integer(4) function cusparseXcsrsm2_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseCsrsm2Info) :: info
integer(4), device :: position ! device or host variable
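The csrsm2 routines above are used together: the application queries the required work buffer size, allocates the buffer, runs the analysis phase once, and then calls the solve phase, optionally checking for zero pivots. The fragment below is a minimal sketch of that sequence for the single-precision case. It assumes the handle, the matrix descriptor descrA (with its fill mode and diagonal type set), and the csrsm2 info structure have already been created, that the scalar alpha has been set, that arrays with a _d suffix are device arrays, and that pBuffer_d is an integer(1) allocatable device array; the names istat, algo, pBufferSize, position, and pBuffer_d are illustrative and not part of the interfaces shown above.
algo = 0   ! algorithm selector passed through to cuSPARSE
istat = cusparseScsrsm2_bufferSizeExt(handle, algo, CUSPARSE_OPERATION_NON_TRANSPOSE, &
          CUSPARSE_OPERATION_NON_TRANSPOSE, m, nrhs, nnz, alpha, descrA, &
          csrValA_d, csrRowPtrA_d, csrColIndA_d, B_d, ldb, info, &
          CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBufferSize)
allocate(pBuffer_d(pBufferSize))   ! work buffer, sized in bytes
istat = cusparseScsrsm2_analysis(handle, algo, CUSPARSE_OPERATION_NON_TRANSPOSE, &
          CUSPARSE_OPERATION_NON_TRANSPOSE, m, nrhs, nnz, alpha, descrA, &
          csrValA_d, csrRowPtrA_d, csrColIndA_d, B_d, ldb, info, &
          CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer_d)
istat = cusparseScsrsm2_solve(handle, algo, CUSPARSE_OPERATION_NON_TRANSPOSE, &
          CUSPARSE_OPERATION_NON_TRANSPOSE, m, nrhs, nnz, alpha, descrA, &
          csrValA_d, csrRowPtrA_d, csrColIndA_d, B_d, ldb, info, &
          CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer_d)   ! B_d is overwritten with the solution
istat = cusparseXcsrsm2_zeroPivot(handle, info, position)
if (istat == CUSPARSE_STATUS_ZERO_PIVOT) print *, 'zero pivot at row ', position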
5.4.30. cusparseSbsrmm
BSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, and alpha and beta are scalars. A is an mb x kb sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. B and C are dense matrices.
integer(4) function cusparseSbsrmm(handle, dirA, transA, transB, mb, n, kb, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: dirA, transA, transB, mb, n, kb, nnzb, blockDim
real(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*), B(*), C(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: ldb, ldc
5.4.31. cusparseDbsrmm
BSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, and alpha and beta are scalars. A is an mb x kb sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. B and C are dense matrices.
integer(4) function cusparseDbsrmm(handle, dirA, transA, transB, mb, n, kb, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: dirA, transA, transB, mb, n, kb, nnzb, blockDim
real(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*), B(*), C(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: ldb, ldc
5.4.32. cusparseCbsrmm
BSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, and alpha and beta are scalars. A is an mb x kb sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. B and C are dense matrices.
integer(4) function cusparseCbsrmm(handle, dirA, transA, transB, mb, n, kb, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: dirA, transA, transB, mb, n, kb, nnzb, blockDim
complex(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*), B(*), C(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: ldb, ldc
5.4.33. cusparseZbsrmm
BSRMM performs one of the matrix-matrix operations C := alpha*op( A ) * op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T, and alpha and beta are scalars. A is an mb x kb sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. B and C are dense matrices.
integer(4) function cusparseZbsrmm(handle, dirA, transA, transB, mb, n, kb, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, B, ldb, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: dirA, transA, transB, mb, n, kb, nnzb, blockDim
complex(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*), B(*), C(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: ldb, ldc
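For the BSR storage format the logical dense dimensions are the block counts multiplied by the block size, so with op( A ) = A and op( B ) = B the dense operands satisfy m = mb*blockDim and k = kb*blockDim. A minimal single-precision sketch of such a call follows; it assumes the handle and descrA already exist, that arrays with a _d suffix are device arrays, and that istat is an integer(4) status variable (the variable names are illustrative).
istat = cusparseSbsrmm(handle, CUSPARSE_DIRECTION_ROW, CUSPARSE_OPERATION_NON_TRANSPOSE, &
          CUSPARSE_OPERATION_NON_TRANSPOSE, mb, n, kb, nnzb, alpha, descrA, &
          bsrValA_d, bsrRowPtrA_d, bsrColIndA_d, blockDim, &
          B_d, kb*blockDim, beta, C_d, mb*blockDim)
! In this non-transposed case B_d is (kb*blockDim) x n and C_d is (mb*blockDim) x n.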
5.4.34. cusparseSbsrsm2_bufferSize
This function returns the size of the buffer used in bsrsm2.
integer(4) function cusparseSbsrsm2_bufferSize(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.4.35. cusparseDbsrsm2_bufferSize
This function returns the size of the buffer used in bsrsm2.
integer(4) function cusparseDbsrsm2_bufferSize(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.4.36. cusparseCbsrsm2_bufferSize
This function returns the size of the buffer used in bsrsm2.
integer(4) function cusparseCbsrsm2_bufferSize(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.4.37. cusparseZbsrsm2_bufferSize
This function returns the size of the buffer used in bsrsm2.
integer(4) function cusparseZbsrsm2_bufferSize(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.4.38. cusparseSbsrsm2_analysis
This function performs the analysis phase of bsrsm2.
integer(4) function cusparseSbsrsm2_analysis(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.4.39. cusparseDbsrsm2_analysis
This function performs the analysis phase of bsrsm2.
integer(4) function cusparseDbsrsm2_analysis(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.4.40. cusparseCbsrsm2_analysis
This function performs the analysis phase of bsrsm2.
integer(4) function cusparseCbsrsm2_analysis(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.4.41. cusparseZbsrsm2_analysis
This function performs the analysis phase of bsrsm2.
integer(4) function cusparseZbsrsm2_analysis(handle, dirA, transA, transX, mb, n, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA, transA, transX, mb, n, nnzb, blockDim
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
type(cusparseBsrsm2Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.4.42. cusparseSbsrsm2_solve
This function performs the solve phase of bsrsm2.
integer(4) function cusparseSbsrsm2_solve(handle, dirA, transA, transX, mb, n, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, ldx, y, ldy, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, transX, mb, n, nnzb
real(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*), x(*), y(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim, policy, ldx, ldy
type(cusparseBsrsm2Info) :: info
character, device :: pBuffer(*)
5.4.43. cusparseDbsrsm2_solve
This function performs the solve phase of bsrsm2.
integer(4) function cusparseDbsrsm2_solve(handle, dirA, transA, transX, mb, n, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, ldx, y, ldy, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, transX, mb, n, nnzb
real(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*), x(*), y(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim, policy, ldx, ldy
type(cusparseBsrsm2Info) :: info
character, device :: pBuffer(*)
5.4.44. cusparseCbsrsm2_solve
This function performs the solve phase of bsrsm2.
integer(4) function cusparseCbsrsm2_solve(handle, dirA, transA, transX, mb, n, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, ldx, y, ldy, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, transX, mb, n, nnzb
complex(4), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*), x(*), y(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim, policy, ldx, ldy
type(cusparseBsrsm2Info) :: info
character, device :: pBuffer(*)
5.4.45. cusparseZbsrsm2_solve
This function performs the solve phase of bsrsm2.
integer(4) function cusparseZbsrsm2_solve(handle, dirA, transA, transX, mb, n, nnzb, alpha, &
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, x, ldx, y, ldy, policy, pBuffer)
type(cusparseHandle) :: handle
integer :: dirA, transA, transX, mb, n, nnzb
complex(8), device :: alpha ! device or host variable
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*), x(*), y(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: blockDim, policy, ldx, ldy
type(cusparseBsrsm2Info) :: info
character, device :: pBuffer(*)
5.4.46. cusparseXbsrsm2_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either structural zero or numerical zero. Otherwise, position is set to -1.
integer(4) function cusparseXbsrsm2_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseBsrsm2Info) :: info
integer(4), device :: position ! device or host variable
5.4.47. cusparseSgemmi
GEMMI performs the matrix-matrix operation C := alpha*A*B + beta*C, where alpha and beta are scalars, A is an m x k dense matrix, B is a k x n sparse matrix, and C is an m x n dense matrix. Fortran programmers should be aware that this function uses only zero-based indexing for B.
This function is deprecated, and will be removed in a future release. It is recommended to use cusparseSpMM instead.
integer(4) function cusparseSgemmi(handle, m, n, k, nnz, alpha, &
A, lda, cscValB, cscColPtrB, cscRowIndB, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: m, n, k, nnz, lda, ldc
real(4), device :: alpha, beta ! device or host variable
real(4), device :: A(lda,*)
real(4), device :: cscValB(*)
real(4), device :: C(ldc,*)
integer(4), device :: cscColPtrB(*), cscRowIndB(*)
5.4.48. cusparseDgemmi
GEMMI performs the matrix-matrix operation C := alpha*A*B + beta*C, where alpha and beta are scalars, A is an m x k dense matrix, B is a k x n sparse matrix, and C is an m x n dense matrix. Fortran programmers should be aware that this function uses only zero-based indexing for B.
This function is deprecated, and will be removed in a future release. It is recommended to use cusparseSpMM instead.
integer(4) function cusparseDgemmi(handle, m, n, k, nnz, alpha, &
A, lda, cscValB, cscColPtrB, cscRowIndB, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: m, n, k, nnz, lda, ldc
real(8), device :: alpha, beta ! device or host variable
real(8), device :: A(lda,*)
real(8), device :: cscValB(*)
real(8), device :: C(ldc,*)
integer(4), device :: cscColPtrB(*), cscRowIndB(*)
5.4.49. cusparseCgemmi
GEMMI performs the matrix-matrix operation C := alpha*A*B + beta*C, where alpha and beta are scalars, A is an m x k dense matrix, B is a k x n sparse matrix, and C is an m x n dense matrix. Fortran programmers should be aware that this function uses only zero-based indexing for B.
This function is deprecated, and will be removed in a future release. It is recommended to use cusparseSpMM instead.
integer(4) function cusparseCgemmi(handle, m, n, k, nnz, alpha, &
A, lda, cscValB, cscColPtrB, cscRowIndB, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: m, n, k, nnz, lda, ldc
complex(4), device :: alpha, beta ! device or host variable
complex(4), device :: A(lda,*)
complex(4), device :: cscValB(*)
complex(4), device :: C(ldc,*)
integer(4), device :: cscColPtrB(*), cscRowIndB(*)
5.4.50. cusparseZgemmi
GEMMI performs the matrix-matrix operation C := alpha*A*B + beta*C, where alpha and beta are scalars, A is an m x k dense matrix, B is a k x n sparse matrix, and C is an m x n dense matrix. Fortran programmers should be aware that this function uses only zero-based indexing for B.
This function is deprecated, and will be removed in a future release. It is recommended to use cusparseSpMM instead.
integer(4) function cusparseZgemmi(handle, m, n, k, nnz, alpha, &
A, lda, cscValB, cscColPtrB, cscRowIndB, beta, C, ldc)
type(cusparseHandle) :: handle
integer :: m, n, k, nnz, lda, ldc
complex(8), device :: alpha, beta ! device or host variable
complex(8), device :: A(lda,*)
complex(8), device :: cscValB(*)
complex(8), device :: C(ldc,*)
integer(4), device :: cscColPtrB(*), cscRowIndB(*)
5.5. CUSPARSE Extra Functions
This section contains interfaces for the extra functions that are used to manipulate sparse matrices.
5.5.1. cusparseXcsrgeamNnz
cusparseXcsrgeamNnz computes the number of nonzero elements which will be produced by CSRGEAM.
This function was removed in CUDA 11.0. It should be replaced with cusparseXcsrgeam2Nnz.
integer(4) function cusparseXcsrgeamNnz(handle, m, n, descrA, nnzA, csrRowPtrA, csrColIndA, descrB, nnzB, csrRowPtrB, csrColIndB, descrC, csrRowPtrC, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
type(cusparseMatDescr) :: descrA, descrB, descrC
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.5.2. cusparseScsrgeam
CSRGEAM performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeamNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseScsrgeam2 routines.
integer(4) function cusparseScsrgeam(handle, m, n, &
alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
real(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
real(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.3. cusparseDcsrgeam
CSRGEAM performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeamNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseDcsrgeam2 routines.
integer(4) function cusparseDcsrgeam(handle, m, n, &
alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
real(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
real(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.4. cusparseCcsrgeam
CSRGEAM performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeamNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseCcsrgeam2 routines.
integer(4) function cusparseCcsrgeam(handle, m, n, &
alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
complex(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
complex(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.5. cusparseZcsrgeam
CSRGEAM performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeamNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseZcsrgeam2 routines.
integer(4) function cusparseZcsrgeam(handle, m, n, alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
complex(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
complex(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.6. cusparseScsrgeam2_bufferSizeExt
This function determines the work buffer size for cusparseScsrgeam2. CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseScsrgeam2_bufferSizeExt(handle, m, n, alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
real(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
real(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(8) :: pBufferSizeInBytes
5.5.7. cusparseDcsrgeam2_bufferSizeExt
This function determines the work buffer size for cusparseDcsrgeam2. CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseDcsrgeam2_bufferSizeExt(handle, m, n, alpha, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, beta, descrB, nnzB, &
csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
real(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
real(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(8) :: pBufferSizeInBytes
5.5.8. cusparseCcsrgeam2_bufferSizeExt
This function determines the work buffer size for cusparseCcsrgeam2. CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseCcsrgeam2_bufferSizeExt(handle, m, n, alpha, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, beta, descrB, nnzB, &
csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
complex(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
complex(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(8) :: pBufferSizeInBytes
5.5.9. cusparseZcsrgeam2_bufferSizeExt
This function determines the work buffer size for cusparseZcsrgeam2. CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseZcsrgeam2_bufferSizeExt(handle, m, n, alpha, descrA, &
nnzA, csrValA, csrRowPtrA, csrColIndA, beta, descrB, nnzB, csrValB, &
csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
complex(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
complex(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(8) :: pBufferSizeInBytes
5.5.10. cusparseXcsrgeam2Nnz
cusparseXcsrgeam2Nnz computes the number of nonzero elements which will be produced by CSRGEAM2.
integer(4) function cusparseXcsrgeam2Nnz(handle, m, n, descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, descrC, csrRowPtrC, nnzTotalDevHostPtr, pBuffer)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA, descrB, descrC
integer(4) :: m, n, nnzA, nnzB
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*)
integer(c_int) :: nnzTotalDevHostPtr
character(c_char), device :: pBuffer(*)
5.5.11. cusparseScsrgeam2
CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseScsrgeam2(handle, m, n, alpha, descrA, nnzA, csrValA, csrRowPtrA, &
csrColIndA, beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBuffer)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
real(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
real(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(1), device :: pBuffer ! can be of any type
5.5.12. cusparseDcsrgeam2
CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseDcsrgeam2(handle, m, n, alpha, descrA, nnzA, csrValA, csrRowPtrA, &
csrColIndA, beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBuffer)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
real(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
real(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(1), device :: pBuffer ! can be of any type
5.5.13. cusparseCcsrgeam2
CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseCcsrgeam2(handle, m, n, alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBuffer)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
complex(4), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
complex(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(1), device :: pBuffer ! can be of any type
5.5.14. cusparseZcsrgeam2
CSRGEAM2 performs the matrix-matrix operation C := alpha * A + beta * B, where alpha and beta are scalars. A, B, and C are m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgeam2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseZcsrgeam2(handle, m, n, alpha, descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
beta, descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, descrC, csrValC, csrRowPtrC, csrColIndC, pBuffer)
type(cusparseHandle) :: handle
integer :: m, n, nnzA, nnzB
complex(8), device :: alpha, beta ! device or host variable
type(cusparseMatDescr):: descrA, descrB, descrC
complex(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
integer(1), device :: pBuffer ! can be of any type
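In the csrgeam2 workflow the output row pointer and nonzero count are produced by cusparseXcsrgeam2Nnz, so a typical call sequence is: query the buffer size, allocate the buffer and csrRowPtrC, compute the nonzero count of C, allocate csrValC and csrColIndC, and then form C. The following is a minimal single-precision sketch. It assumes the handle and the three matrix descriptors already exist, arrays with a _d suffix are device arrays, pBuffer_d is an integer(1) allocatable device array, and nnzC, bufferSize, and istat are host variables; the value and column-index arrays of C are given placeholder allocations only so the size query has valid arguments. All of these names are illustrative.
allocate(csrRowPtrC_d(m+1), csrValC_d(1), csrColIndC_d(1))   ! placeholders for the size query
istat = cusparseScsrgeam2_bufferSizeExt(handle, m, n, alpha, &
          descrA, nnzA, csrValA_d, csrRowPtrA_d, csrColIndA_d, &
          beta, descrB, nnzB, csrValB_d, csrRowPtrB_d, csrColIndB_d, &
          descrC, csrValC_d, csrRowPtrC_d, csrColIndC_d, bufferSize)
allocate(pBuffer_d(bufferSize))
istat = cusparseXcsrgeam2Nnz(handle, m, n, descrA, nnzA, csrRowPtrA_d, csrColIndA_d, &
          descrB, nnzB, csrRowPtrB_d, csrColIndB_d, descrC, csrRowPtrC_d, nnzC, pBuffer_d)
deallocate(csrValC_d, csrColIndC_d)
allocate(csrValC_d(nnzC), csrColIndC_d(nnzC))
istat = cusparseScsrgeam2(handle, m, n, alpha, &
          descrA, nnzA, csrValA_d, csrRowPtrA_d, csrColIndA_d, &
          beta, descrB, nnzB, csrValB_d, csrRowPtrB_d, csrColIndB_d, &
          descrC, csrValC_d, csrRowPtrC_d, csrColIndC_d, pBuffer_d)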
5.5.15. cusparseXcsrgemmNnz
cusparseXcsrgemmNnz computes the number of nonzero elements which will be produced by CSRGEMM.
This function was removed in CUDA 11.0. It should be replaced with the cusparseXcsrgemm2Nnz routines.
integer(4) function cusparseXcsrgemmNnz(handle, transA, transB, m, n, k, &
descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, descrC, csrRowPtrC, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: transA, transB, m, n, k, nnzA, nnzB
type(cusparseMatDescr) :: descrA, descrB, descrC
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.5.16. cusparseScsrgemm
CSRGEMM performs the matrix-matrix operation C := op( A ) * op( B ), where op( X ) is one of op( X ) = X or op( X ) = X**T. A, B, and C are m x k, k x n, and m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgemmNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseScsrgemm2 routines.
integer(4) function cusparseScsrgemm(handle, transA, transB, m, n, k, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, &
descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: transA, transB, m, n, k, nnzA, nnzB
type(cusparseMatDescr) :: descrA, descrB, descrC
real(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.17. cusparseDcsrgemm
CSRGEMM performs the matrix-matrix operation C := op( A ) * op( B ), where op( X ) is one of op( X ) = X or op( X ) = X**T. A, B, and C are m x k, k x n, and m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgemmNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseDcsrgemm2 routines.
integer(4) function cusparseDcsrgemm(handle, transA, transB, m, n, k, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, &
descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: transA, transB, m, n, k, nnzA, nnzB
type(cusparseMatDescr) :: descrA, descrB, descrC
real(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.18. cusparseCcsrgemm
CSRGEMM performs the matrix-matrix operation C := op( A ) * op( B ), where op( X ) is one of op( X ) = X or op( X ) = X**T. A, B, and C are m x k, k x n, and m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgemmNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseCcsrgemm2 routines.
integer(4) function cusparseCcsrgemm(handle, transA, transB, m, n, k, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, &
descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: transA, transB, m, n, k, nnzA, nnzB
type(cusparseMatDescr) :: descrA, descrB, descrC
complex(4), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.19. cusparseZcsrgemm
CSRGEMM performs the matrix-matrix operation C := op( A ) * op( B ), where op( X ) is one of op( X ) = X or op( X ) = X**T. A, B, and C are m x k, k x n, and m x n sparse matrices that are defined in CSR storage format by the three arrays csrVal{A|B|C}, csrRowPtr{A|B|C}, and csrColInd{A|B|C}. cusparseXcsrgemmNnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
This function was removed in CUDA 11.0. It should be replaced with the cusparseZcsrgemm2 routines.
integer(4) function cusparseZcsrgemm(handle, transA, transB, m, n, k, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, &
descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: transA, transB, m, n, k, nnzA, nnzB
type(cusparseMatDescr) :: descrA, descrB, descrC
complex(8), device :: csrValA(*), csrValB(*), csrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrC(*), csrColIndC(*)
5.5.20. cusparseScsrgemm2_bufferSizeExt
This function returns the size of the buffer used in csrgemm2.
integer(4) function cusparseScsrgemm2_bufferSizeExt(handle, m, n, k, alpha, &
descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrRowPtrD, csrColIndD, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
real(4), device :: alpha, beta ! device or host variable
integer :: m, n, k, nnzA, nnzB, nnzD
type(cusparseMatDescr) :: descrA, descrB, descrD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*)
type(cusparseCsrgemm2Info) :: info
integer(8) :: pBufferSizeInBytes
5.5.21. cusparseDcsrgemm2_bufferSizeExt
This function returns the size of the buffer used in csrgemm2.
integer(4) function cusparseDcsrgemm2_bufferSizeExt(handle, m, n, k, alpha, &
descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrRowPtrD, csrColIndD, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
real(8), device :: alpha, beta ! device or host variable
integer :: m, n, k, nnzA, nnzB, nnzD
type(cusparseMatDescr) :: descrA, descrB, descrD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*)
type(cusparseCsrgemm2Info) :: info
integer(8) :: pBufferSizeInBytes
5.5.22. cusparseCcsrgemm2_bufferSizeExt
This function returns the size of the buffer used in csrgemm2.
integer(4) function cusparseCcsrgemm2_bufferSizeExt(handle, m, n, k, alpha, &
descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrRowPtrD, csrColIndD, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
complex(4), device :: alpha, beta ! device or host variable
integer :: m, n, k, nnzA, nnzB, nnzD
type(cusparseMatDescr) :: descrA, descrB, descrD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*)
type(cusparseCsrgemm2Info) :: info
integer(8) :: pBufferSizeInBytes
5.5.23. cusparseZcsrgemm2_bufferSizeExt
This function returns the size of the buffer used in csrgemm2.
integer(4) function cusparseZcsrgemm2_bufferSizeExt(handle, m, n, k, alpha, &
descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrRowPtrD, csrColIndD, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
complex(8), device :: alpha, beta ! device or host variable
integer :: m, n, k, nnzA, nnzB, nnzD
type(cusparseMatDescr) :: descrA, descrB, descrD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*)
type(cusparseCsrgemm2Info) :: info
integer(8) :: pBufferSizeInBytes
5.5.24. cusparseXcsrgemm2Nnz
cusparseXcsrgemm2Nnz computes the number of nonzero elements which will be produced by CSRGEMM2.
integer(4) function cusparseXcsrgemm2Nnz(handle, m, n, k, &
descrA, nnzA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrRowPtrB, csrColIndB, &
descrD, nnzD, csrRowPtrD, csrColIndD, &
descrC, csrRowPtrC, nnzTotalDevHostPtr, info, pBuffer)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA, descrB, descrD, descrC
type(cusparseCsrgemm2Info) :: info
integer(4) :: m, n, k, nnzA, nnzB, nnzD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*), csrRowPtrC(*)
integer(c_int) :: nnzTotalDevHostPtr
character(c_char), device :: pBuffer(*)
5.5.25. cusparseScsrgemm2
CSRGEMM2 performs the matrix-matrix operation C := alpha * A * B + beta * D, where alpha and beta are scalars. A and B are m x k and k x n sparse matrices, and D and C are m x n sparse matrices; all are defined in CSR storage format by the three arrays csrVal{A|B|C|D}, csrRowPtr{A|B|C|D}, and csrColInd{A|B|C|D}. cusparseXcsrgemm2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseScsrgemm2(handle, m, n, k, alpha, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrValD, csrRowPtrD, csrColIndD, &
descrC, csrValC, csrRowPtrC, csrColIndC, info, pBuffer)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA, descrB, descrD, descrC
type(cusparseCsrgemm2Info) :: info
integer :: m, n, k, nnzA, nnzB, nnzD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*), csrRowPtrC(*), csrColIndC(*)
real(4), device :: csrValA(*), csrValB(*), csrValD(*), csrValC(*)
real(4), device :: alpha, beta ! device or host variable
character, device :: pBuffer(*)
5.5.26. cusparseDcsrgemm2
CSRGEMM2 performs the matrix-matrix operation C := alpha * A * B + beta * D, where alpha and beta are scalars. A and B are m x k and k x n sparse matrices, and D and C are m x n sparse matrices; all are defined in CSR storage format by the three arrays csrVal{A|B|C|D}, csrRowPtr{A|B|C|D}, and csrColInd{A|B|C|D}. cusparseXcsrgemm2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseDcsrgemm2(handle, m, n, k, alpha, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrValD, csrRowPtrD, csrColIndD, &
descrC, csrValC, csrRowPtrC, csrColIndC, info, pBuffer)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA, descrB, descrD, descrC
type(cusparseCsrgemm2Info) :: info
integer :: m, n, k, nnzA, nnzB, nnzD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*), csrRowPtrC(*), csrColIndC(*)
real(8), device :: csrValA(*), csrValB(*), csrValD(*), csrValC(*)
real(8), device :: alpha, beta ! device or host variable
character, device :: pBuffer(*)
5.5.27. cusparseCcsrgemm2
CSRGEMM2 performs the matrix-matrix operation C := alpha * A * B + beta * D, where alpha and beta are scalars. A and B are m x k and k x n sparse matrices, and D and C are m x n sparse matrices; all are defined in CSR storage format by the three arrays csrVal{A|B|C|D}, csrRowPtr{A|B|C|D}, and csrColInd{A|B|C|D}. cusparseXcsrgemm2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseCcsrgemm2(handle, m, n, k, alpha, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrValD, csrRowPtrD, csrColIndD, &
descrC, csrValC, csrRowPtrC, csrColIndC, info, pBuffer)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA, descrB, descrD, descrC
type(cusparseCsrgemm2Info) :: info
integer :: m, n, k, nnzA, nnzB, nnzD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*), csrRowPtrC(*), csrColIndC(*)
complex(4), device :: csrValA(*), csrValB(*), csrValD(*), csrValC(*)
complex(4), device :: alpha, beta ! device or host variable
character, device :: pBuffer(*)
5.5.28. cusparseZcsrgemm2
CSRGEMM2 performs the matrix-matrix operation C := alpha * A * B + beta * D, where alpha and beta are scalars. A and B are m x k and k x n sparse matrices, and D and C are m x n sparse matrices; all are defined in CSR storage format by the three arrays csrVal{A|B|C|D}, csrRowPtr{A|B|C|D}, and csrColInd{A|B|C|D}. cusparseXcsrgemm2Nnz should be used to determine csrRowPtrC and the number of nonzero elements in the result.
integer(4) function cusparseZcsrgemm2(handle, m, n, k, alpha, &
descrA, nnzA, csrValA, csrRowPtrA, csrColIndA, &
descrB, nnzB, csrValB, csrRowPtrB, csrColIndB, beta, &
descrD, nnzD, csrValD, csrRowPtrD, csrColIndD, &
descrC, csrValC, csrRowPtrC, csrColIndC, info, pBuffer)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA, descrB, descrD, descrC
type(cusparseCsrgemm2Info) :: info
integer :: m, n, k, nnzA, nnzB, nnzD
integer(4), device :: csrRowPtrA(*), csrColIndA(*), csrRowPtrB(*), csrColIndB(*), csrRowPtrD(*), csrColIndD(*), csrRowPtrC(*), csrColIndC(*)
complex(8), device :: csrValA(*), csrValB(*), csrValD(*), csrValC(*)
complex(8), device :: alpha, beta ! device or host variable
character, device :: pBuffer(*)
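The csrgemm2 routines follow the same query-count-compute pattern as the csrgeam2 routines: query the work buffer size, compute csrRowPtrC and the nonzero count of C with cusparseXcsrgemm2Nnz, allocate the output arrays, and then form C. Below is a minimal single-precision sketch; it assumes the handle, the four matrix descriptors, and the csrgemm2 info structure have already been created, that arrays with a _d suffix are device arrays, and that pBuffer_d (an integer(1) allocatable device array), bufferSize, nnzC, and istat are illustrative local variables.
istat = cusparseScsrgemm2_bufferSizeExt(handle, m, n, k, alpha, &
          descrA, nnzA, csrRowPtrA_d, csrColIndA_d, &
          descrB, nnzB, csrRowPtrB_d, csrColIndB_d, beta, &
          descrD, nnzD, csrRowPtrD_d, csrColIndD_d, info, bufferSize)
allocate(pBuffer_d(bufferSize), csrRowPtrC_d(m+1))
istat = cusparseXcsrgemm2Nnz(handle, m, n, k, descrA, nnzA, csrRowPtrA_d, csrColIndA_d, &
          descrB, nnzB, csrRowPtrB_d, csrColIndB_d, descrD, nnzD, csrRowPtrD_d, csrColIndD_d, &
          descrC, csrRowPtrC_d, nnzC, info, pBuffer_d)
allocate(csrValC_d(nnzC), csrColIndC_d(nnzC))
istat = cusparseScsrgemm2(handle, m, n, k, alpha, &
          descrA, nnzA, csrValA_d, csrRowPtrA_d, csrColIndA_d, &
          descrB, nnzB, csrValB_d, csrRowPtrB_d, csrColIndB_d, beta, &
          descrD, nnzD, csrValD_d, csrRowPtrD_d, csrColIndD_d, &
          descrC, csrValC_d, csrRowPtrC_d, csrColIndC_d, info, pBuffer_d)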
5.6. CUSPARSE Preconditioning Functions
This section contains interfaces for the preconditioning functions that are used in processing sparse matrices.
5.6.1. cusparseScsric0
CSRIC0 computes the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an m x m Hermitian/symmetric positive definite sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseScsric0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
real(4), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.2. cusparseDcsric0
CSRIC0 computes the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an m x m Hermitian/symmetric positive definite sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseDcsric0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
real(8), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.3. cusparseCcsric0
CSRIC0 computes the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an m x m Hermitian/symmetric positive definite sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseCcsric0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.4. cusparseZcsric0
CSRIC0 computes the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an m x m Hermitian/symmetric positive definite sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseZcsric0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.5. cusparseScsrilu0
CSRILU0 computes the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseScsrilu0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
real(4), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.6. cusparseDcsrilu0
CSRILU0 computes the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseDcsrilu0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
real(8), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.7. cusparseCcsrilu0
CSRILU0 computes the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseCcsrilu0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.8. cusparseZcsrilu0
CSRILU0 computes the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.
integer(4) function cusparseZcsrilu0(handle, trans, m, descrA, csrValM, csrRowPtrA, csrColIndA, info)
type(cusparseHandle) :: handle
integer(4) :: trans, m
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValM(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseSolveAnalysisInfo) :: info
5.6.9. cusparseSgtsv
GTSV computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of this tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
This function was removed in CUDA 11.0. It and routines like it should be replaced with the cusparseSgtsv2 variants.
integer(4) function cusparseSgtsv(handle, m, n, dl, d, du, B, ldb)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(4), device :: dl(*), d(*), du(*), B(*)
5.6.10. cusparseDgtsv
GTSV computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of this tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
This function was removed in CUDA 11.0. It and routines like it should be replaced with the cusparseDgtsv2 variants.
integer(4) function cusparseDgtsv(handle, m, n, dl, d, du, B, ldb)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(8), device :: dl(*), d(*), du(*), B(*)
5.6.11. cusparseCgtsv
GTSV computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of this tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
This function was removed in CUDA 11.0. It and routines like it should be replaced with the cusparseCgtsv2 variants.
integer(4) function cusparseCgtsv(handle, m, n, dl, d, du, B, ldb)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(4), device :: dl(*), d(*), du(*), B(*)
5.6.12. cusparseZgtsv
GTSV computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of this tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
This function was removed in CUDA 11.0. It and routines like it should be replaced with the cusparseZgtsv2 variants.
integer(4) function cusparseZgtsv(handle, m, n, dl, d, du, B, ldb)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(8), device :: dl(*), d(*), du(*), B(*)
5.6.13. cusparseSgtsv2_bufferSize
Sgtsv2_bufferSize returns the size of the buffer, in bytes, required in Sgtsv2().
integer(4) function cusparseSgtsv2_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(4), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.14. cusparseDgtsv2_bufferSize
Dgtsv2_bufferSize returns the size of the buffer, in bytes, required in Dgtsv2().
integer(4) function cusparseDgtsv2_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(8), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.15. cusparseCgtsv2_bufferSize
Cgtsv2_bufferSize returns the size of the buffer, in bytes, required in Cgtsv2().
integer(4) function cusparseCgtsv2_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(4), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.16. cusparseZgtsv2_bufferSize
Zgtsv2_bufferSize returns the size of the buffer, in bytes, required in Zgtsv2().
integer(4) function cusparseZgtsv2_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(8), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.17. cusparseSgtsv2
Sgtsv2 computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseSgtsv2(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(4), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
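The gtsv2 routines are used in two steps: query the required workspace size, then pass a device buffer of at least that size to the solver. The following is a minimal CUDA Fortran sketch of that sequence for cusparseSgtsv2; the matrix values, problem sizes, and the omission of error checking are illustrative only.
program gtsv2_example
  use cudafor
  use cusparse
  implicit none
  integer, parameter :: m = 1024, n = 2      ! system size and number of right-hand sides
  type(cusparseHandle) :: h
  real(4), device, allocatable :: dl(:), d(:), du(:), B(:,:)
  character(1), device, allocatable :: buf(:)
  integer(8) :: bsize
  integer :: istat
  istat = cusparseCreate(h)
  allocate(dl(m), d(m), du(m), B(m,n))
  dl = -1.0; d = 4.0; du = -1.0              ! diagonally dominant tridiagonal matrix
  dl(1) = 0.0; du(m) = 0.0                   ! off-diagonal entries outside the matrix
  B = 1.0                                    ! right-hand sides; overwritten by the solution
  istat = cusparseSgtsv2_bufferSize(h, m, n, dl, d, du, B, m, bsize)
  allocate(buf(bsize))                       ! device workspace of bsize bytes
  istat = cusparseSgtsv2(h, m, n, dl, d, du, B, m, buf)
  istat = cusparseDestroy(h)
end program gtsv2_example
The Dgtsv2, Cgtsv2, and Zgtsv2 variants, and the _nopivot forms below, follow the same two-step pattern.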
5.6.18. cusparseDgtsv2
Dgtsv2 computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseDgtsv2(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(8), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.19. cusparseCgtsv2
Cgtsv2 computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseCgtsv2(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(4), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.20. cusparseZgtsv2
Zgtsv2 computes the solution of a tridiagonal linear system with multiple right-hand sides: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseZgtsv2(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(8), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.21. cusparseSgtsv2_nopivot_bufferSize
Sgtsv2_nopivot_bufferSize returns the size of the buffer, in bytes, required in Sgtsv2_nopivot().
integer(4) function cusparseSgtsv2_nopivot_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(4), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.22. cusparseDgtsv2_nopivot_bufferSize
Dgtsv2_nopivot_bufferSize returns the size of the buffer, in bytes, required in Dgtsv2_nopivot().
integer(4) function cusparseDgtsv2_nopivot_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(8), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.23. cusparseCgtsv2_nopivot_bufferSize
Cgtsv2_nopivot_bufferSize returns the size of the buffer, in bytes, required in Cgtsv2_nopivot().
integer(4) function cusparseCgtsv2_nopivot_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(4), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.24. cusparseZgtsv2_nopivot_bufferSize
Zgtsv2_nopivot_bufferSize returns the size of the buffer, in bytes, required in Zgtsv2_nopivot().
integer(4) function cusparseZgtsv2_nopivot_bufferSize(handle, m, n, dl, d, du, B, ldb, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(8), device :: dl(m), d(m), du(m), B(ldb,n)
integer(8) :: pBufferSizeInBytes
5.6.25. cusparseSgtsv2_nopivot
Sgtsv2_nopivot computes the solution of a tridiagonal linear system with multiple right-hand sides, without pivoting: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseSgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(4), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.26. cusparseDgtsv2_nopivot
Dgtsv2_nopivot computes the solution of a tridiagonal linear system with multiple right-hand sides, without pivoting: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseDgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
real(8), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.27. cusparseCgtsv2_nopivot
Cgtsv2_nopivot computes the solution of a tridiagonal linear system with multiple right-hand sides, without pivoting: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseCgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(4), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.28. cusparseZgtsv2_nopivot
Zgtsv2_nopivot computes the solution of a tridiagonal linear system with multiple right-hand sides, without pivoting: A * X = B. The coefficient matrix A of the tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. The input m is the size of the linear system. The input n is the number of right-hand sides in B.
integer(4) function cusparseZgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, ldb
complex(8), device :: dl(m), d(m), du(m), B(ldb,n)
character(1), device :: pBuffer(*)
5.6.29. cusparseSgtsv2StridedBatch_bufferSize
Sgtsv2StridedBatch_bufferSize returns the size of the buffer, in bytes, required in Sgtsv2StridedBatch().
integer(4) function cusparseSgtsv2StridedBatch_bufferSize(handle, m, dl, d, du, x, batchCount, batchStride, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
real(4), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.30. cusparseDgtsv2StridedBatch_bufferSize
Dgtsv2StridedBatch_bufferSize returns the size of the buffer, in bytes, required in Dgtsv2StridedBatch().
integer(4) function cusparseDgtsv2StridedBatch_bufferSize(handle, m, dl, d, du, x, batchCount, batchStride, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
real(8), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.31. cusparseCgtsv2StridedBatch_bufferSize
Cgtsv2StridedBatch_bufferSize returns the size of the buffer, in bytes, required in Cgtsv2StridedBatch().
integer(4) function cusparseCgtsv2StridedBatch_bufferSize(handle, m, dl, d, du, x, batchCount, batchStride, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
complex(4), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.32. cusparseZgtsv2StridedBatch_bufferSize
Zgtsv2StridedBatch_bufferSize returns the size of the buffer, in bytes, required in Zgtsv2StridedBatch().
integer(4) function cusparseZgtsv2StridedBatch_bufferSize(handle, m, dl, d, du, x, batchCount, batchStride, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
complex(8), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.33. cusparseSgtsv2StridedBatch
Sgtsv2StridedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
integer(4) function cusparseSgtsv2StridedBatch(handle, m, dl, d, du, x, batchCount, batchStride, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
real(4), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.34. cusparseDgtsv2StridedBatch
Dgtsv2StridedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
integer(4) function cusparseDgtsv2StridedBatch(handle, m, dl, d, du, x, batchCount, batchStride, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
real(8), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.35. cusparseCgtsv2StridedBatch
Cgtsv2StridedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
integer(4) function cusparseCgtsv2StridedBatch(handle, m, dl, d, du, x, batchCount, batchStride, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
complex(4), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.36. cusparseZgtsv2StridedBatch
Zgtsv2StridedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit.
integer(4) function cusparseZgtsv2StridedBatch(handle, m, dl, d, du, x, batchCount, batchStride, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, batchCount, batchStride
complex(8), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.37. cusparseSgtsvInterleavedBatch_bufferSize
SgtsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in SgtsvInterleavedBatch().
integer(4) function cusparseSgtsvInterleavedBatch_bufferSize(handle, algo, m, dl, d, du, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(4), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.38. cusparseDgtsvInterleavedBatch_bufferSize
DgtsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in DgtsvInterleavedBatch().
integer(4) function cusparseDgtsvInterleavedBatch_bufferSize(handle, algo, m, dl, d, du, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(8), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.39. cusparseCgtsvInterleavedBatch_bufferSize
CgtsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in CgtsvInterleavedBatch().
integer(4) function cusparseCgtsvInterleavedBatch_bufferSize(handle, algo, m, dl, d, du, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(4), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.40. cusparseZgtsvInterleavedBatch_bufferSize
ZgtsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in ZgtsvInterleavedBatch().
integer(4) function cusparseZgtsvInterleavedBatch_bufferSize(handle, algo, m, dl, d, du, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(8), device :: dl(*), d(*), du(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.41. cusparseSgtsvInterleavedBatch
SgtsvInterleavedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the Sgtsv2StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseSgtsvInterleavedBatch(handle, algo, m, dl, d, du, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(4), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
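The following minimal sketch shows the calling sequence for cusparseSgtsvInterleavedBatch. The interleaved indexing noted in the comments (entry j of system i stored at element (j-1)*batchCount + i) and the algo value of 0 are assumptions based on the description above and the cuSPARSE Library document, not part of this interface.
program gtsv_interleaved_example
  use cudafor
  use cusparse
  implicit none
  integer, parameter :: m = 64, batchCount = 8
  integer, parameter :: algo = 0            ! see the cuSPARSE Library document for supported algo values
  type(cusparseHandle) :: h
  real(4), device, allocatable :: dl(:), d(:), du(:), x(:)
  character(1), device, allocatable :: buf(:)
  integer(8) :: bsize
  integer :: istat
  istat = cusparseCreate(h)
  ! Interleaved storage: entry j of system i is assumed to live at element (j-1)*batchCount + i.
  allocate(dl(m*batchCount), d(m*batchCount), du(m*batchCount), x(m*batchCount))
  dl = -1.0; d = 4.0; du = -1.0; x = 1.0    ! same diagonally dominant system in every batch slot
  istat = cusparseSgtsvInterleavedBatch_bufferSize(h, algo, m, dl, d, du, x, batchCount, bsize)
  allocate(buf(bsize))
  istat = cusparseSgtsvInterleavedBatch(h, algo, m, dl, d, du, x, batchCount, buf)
  ! x now holds the batchCount solution vectors in the same interleaved layout.
  istat = cusparseDestroy(h)
end program gtsv_interleaved_example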
5.6.42. cusparseDgtsvInterleavedBatch
DgtsvInterleavedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the Dgtsv2StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseDgtsvInterleavedBatch(handle, algo, m, dl, d, du, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(8), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.43. cusparseCgtsvInterleavedBatch
CgtsvInterleavedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the Cgtsv2StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseCgtsvInterleavedBatch(handle, algo, m, dl, d, du, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(4), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.44. cusparseZgtsvInterleavedBatch
ZgtsvInterleavedBatch computes the solution of multiple tridiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each tridiagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the Zgtsv2StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseZgtsvInterleavedBatch(handle, algo, m, dl, d, du, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(8), device :: dl(*), d(*), du(*), x(*)
character(1), device :: pBuffer(*)
5.6.45. cusparseSgpsvInterleavedBatch_bufferSize
SgpsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in SgpsvInterleavedBatch().
integer(4) function cusparseSgpsvInterleavedBatch_bufferSize(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(4), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.46. cusparseDgpsvInterleavedBatch_bufferSize
DgpsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in DgpsvInterleavedBatch().
integer(4) function cusparseDgpsvInterleavedBatch_bufferSize(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(8), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.47. cusparseCgpsvInterleavedBatch_bufferSize
CgpsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in CgpsvInterleavedBatch().
integer(4) function cusparseCgpsvInterleavedBatch_bufferSize(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(4), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.48. cusparseZgpsvInterleavedBatch_bufferSize
ZgpsvInterleavedBatch_bufferSize returns the size of the buffer, in bytes, required in ZgpsvInterleavedBatch().
integer(4) function cusparseZgpsvInterleavedBatch_bufferSize(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(8), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
integer(8) :: pBufferSizeInBytes
5.6.49. cusparseSgpsvInterleavedBatch
SgpsvInterleavedBatch computes the solution of multiple pentadiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each pentadiagonal linear system is defined with five vectors corresponding to its lower (ds, dl), main (d), and upper (du, dw) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseSgpsvInterleavedBatch(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(4), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
character(1), device :: pBuffer(*)
5.6.50. cusparseDgpsvInterleavedBatch
DgpsvInterleavedBatch computes the solution of multiple pentadiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each pentadiagonal linear system is defined with five vectors corresponding to its lower (ds, dl), main (d), and upper (du, dw) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseDgpsvInterleavedBatch(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
real(8), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
character(1), device :: pBuffer(*)
5.6.51. cusparseCgpsvInterleavedBatch
CgpsvInterleavedBatch computes the solution of multiple pentadiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each pentadiagonal linear system is defined with five vectors corresponding to its lower (ds, dl), main (d), and upper (du, dw) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseCgpsvInterleavedBatch(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(4), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
character(1), device :: pBuffer(*)
5.6.52. cusparseZgpsvInterleavedBatch
ZgpsvInterleavedBatch computes the solution of multiple pentadiagonal linear systems with multiple right-hand sides: A * X = B. The coefficient matrix A of each pentadiagonal linear system is defined with five vectors corresponding to its lower (ds, dl), main (d), and upper (du, dw) matrix diagonals; the right-hand sides are stored in the dense matrix B. The solution X overwrites the right-hand-side matrix B on exit. This routine differs from the StridedBatch routines in that the data for the diagonals, RHS, and solution vectors are interleaved, from one to batchCount, rather than stored one after another. See the cuSPARSE Library document for currently supported algo values.
integer(4) function cusparseZgpsvInterleavedBatch(handle, algo, m, ds, dl, d, du, dw, x, batchCount, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: algo, m, batchCount
complex(8), device :: ds(*), dl(*), d(*), du(*), dw(*), x(*)
character(1), device :: pBuffer(*)
5.6.53. cusparseScsric02_bufferSize
This function returns the size of the buffer used in csric02.
integer(4) function cusparseScsric02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.54. cusparseDcsric02_bufferSize
This function returns the size of the buffer used in csric02.
integer(4) function cusparseDcsric02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.55. cusparseCcsric02_bufferSize
This function returns the size of the buffer used in csric02.
integer(4) function cusparseCcsric02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.56. cusparseZcsric02_bufferSize
This function returns the size of the buffer used in csric02.
integer(4) function cusparseZcsric02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.57. cusparseScsric02_analysis
This function performs the analysis phase of csric02.
integer(4) function cusparseScsric02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.58. cusparseDcsric02_analysis
This function performs the analysis phase of csric02.
integer(4) function cusparseDcsric02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.59. cusparseCcsric02_analysis
This function performs the analysis phase of csric02.
integer(4) function cusparseCcsric02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.60. cusparseZcsric02_analysis
This function performs the analysis phase of csric02.
integer(4) function cusparseZcsric02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.61. cusparseScsric02
CSRIC02 performs the solve phase of computing the incomplete-Cholesky factorization with zero fill-in and no pivoting.
integer(4) function cusparseScsric02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.62. cusparseDcsric02
CSRIC02 performs the solve phase of computing the incomplete-Cholesky factorization with zero fill-in and no pivoting.
integer(4) function cusparseDcsric02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.63. cusparseCcsric02
CSRIC02 performs the solve phase of computing the incomplete-Cholesky factorization with zero fill-in and no pivoting.
integer(4) function cusparseCcsric02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.64. cusparseZcsric02
CSRIC02 performs the solve phase of computing the incomplete-Cholesky factorization with zero fill-in and no pivoting.
integer(4) function cusparseZcsric02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.65. cusparseXcsric02_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either a structural zero or a numerical zero. Otherwise, position is set to -1.
integer(4) function cusparseXcsric02_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseCsric02Info) :: info
integer(4), device :: position ! device or host variable
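The csric02 routines are used in a fixed sequence: create the info structure, query the buffer size, run the analysis phase, check for a structural zero pivot, run the factorization, and check for a numerical zero pivot. The sketch below walks through that sequence for a small symmetric positive definite matrix; it assumes the cusparseCreateCsric02Info helper and the CUSPARSE_INDEX_BASE_ONE and CUSPARSE_SOLVE_POLICY_USE_LEVEL constants from the cusparse module, and omits error checking.
program csric02_example
  use cudafor
  use cusparse
  implicit none
  integer, parameter :: m = 3, nnz = 7
  type(cusparseHandle) :: h
  type(cusparseMatDescr) :: descrA
  type(cusparseCsric02Info) :: info
  real(4), device :: csrVal(nnz)
  integer(4), device :: csrRowPtr(m+1), csrColInd(nnz)
  character(1), device, allocatable :: buf(:)
  integer(4) :: bsize, position, istat
  ! Small SPD tridiagonal matrix in 1-based CSR storage:
  !   [ 4 -1  0 ]
  !   [-1  4 -1 ]
  !   [ 0 -1  4 ]
  csrVal    = (/ 4.0, -1.0,  -1.0, 4.0, -1.0,  -1.0, 4.0 /)
  csrRowPtr = (/ 1, 3, 6, 8 /)
  csrColInd = (/ 1, 2,  1, 2, 3,  2, 3 /)
  istat = cusparseCreate(h)
  istat = cusparseCreateMatDescr(descrA)
  istat = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)
  istat = cusparseCreateCsric02Info(info)
  istat = cusparseScsric02_bufferSize(h, m, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, bsize)
  allocate(buf(bsize))
  istat = cusparseScsric02_analysis(h, m, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, &
                                    CUSPARSE_SOLVE_POLICY_USE_LEVEL, buf)
  istat = cusparseXcsric02_zeroPivot(h, info, position)   ! structural zero check
  istat = cusparseScsric02(h, m, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, &
                           CUSPARSE_SOLVE_POLICY_USE_LEVEL, buf)
  istat = cusparseXcsric02_zeroPivot(h, info, position)   ! numerical zero check
  ! csrVal now holds the incomplete-Cholesky factor in place.
  istat = cusparseDestroy(h)
end program csric02_example
The csrilu02, bsric02, and bsrilu02 families described below follow the same bufferSize/analysis/solve/zeroPivot pattern with their respective info types.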
5.6.66. cusparseScsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseScsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseCsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
real(4), device :: boost_val ! device or host variable
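A minimal, hypothetical fragment that enables the boost before the csrilu02 analysis and factorization calls is shown below. Note that tol is always real(8) while boost_val matches the precision of the factorization; cusparseCreateCsrilu02Info is assumed to come from the module's info-management routines.
program numeric_boost_example
  use cusparse
  implicit none
  type(cusparseHandle) :: h
  type(cusparseCsrilu02Info) :: info
  real(8) :: tol
  real(4) :: boost
  integer :: istat
  istat = cusparseCreate(h)
  istat = cusparseCreateCsrilu02Info(info)   ! assumed info-management helper
  tol   = 1.0d-10        ! pivots with magnitude below tol are treated as numerical zeros
  boost = 1.0e-6         ! replacement value applied when enable_boost = 1
  istat = cusparseScsrilu02_numericBoost(h, info, 1, tol, boost)
  ! ... the cusparseScsrilu02_bufferSize / _analysis / cusparseScsrilu02 calls follow here ...
  istat = cusparseDestroy(h)
end program numeric_boost_example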
5.6.67. cusparseDcsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseDcsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseCsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
real(8), device :: boost_val ! device or host variable
5.6.68. cusparseCcsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseCcsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseCsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
complex(4), device :: boost_val ! device or host variable
5.6.69. cusparseZcsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseZcsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseCsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
complex(8), device :: boost_val ! device or host variable
5.6.70. cusparseScsrilu02_bufferSize
This function returns the size of the buffer used in csrilu02.
integer(4) function cusparseScsrilu02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.71. cusparseDcsrilu02_bufferSize
This function returns the size of the buffer used in csrilu02.
integer(4) function cusparseDcsrilu02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.72. cusparseCcsrilu02_bufferSize
This function returns the size of the buffer used in csrilu02.
integer(4) function cusparseCcsrilu02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.73. cusparseZcsrilu02_bufferSize
This function returns the size of the buffer used in csrilu02.
integer(4) function cusparseZcsrilu02_bufferSize(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.74. cusparseScsrilu02_analysis
This function performs the analysis phase of csrilu02.
integer(4) function cusparseScsrilu02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.75. cusparseDcsrilu02_analysis
This function performs the analysis phase of csrilu02.
integer(4) function cusparseDcsrilu02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.76. cusparseCcsrilu02_analysis
This function performs the analysis phase of csrilu02.
integer(4) function cusparseCcsrilu02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.77. cusparseZcsrilu02_analysis
This function performs the analysis phase of csrilu02.
integer(4) function cusparseZcsrilu02_analysis(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.78. cusparseScsrilu02
CSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA_valM, csrRowPtrA, and csrColIndA.
integer(4) function cusparseScsrilu02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.79. cusparseDcsrilu02
CSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA_valM, csrRowPtrA, and csrColIndA.
integer(4) function cusparseDcsrilu02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.80. cusparseCcsrilu02
CSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA_valM, csrRowPtrA, and csrColIndA.
integer(4) function cusparseCcsrilu02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.81. cusparseZcsrilu02
CSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an m x m sparse matrix that is defined in CSR storage format by the three arrays csrValA_valM, csrRowPtrA, and csrColIndA.
integer(4) function cusparseZcsrilu02(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseCsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.82. cusparseXcsrilu02_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either a structural zero or a numerical zero. Otherwise, position is set to -1.
integer(4) function cusparseXcsrilu02_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseCsrilu02Info) :: info
integer(4), device :: position ! device or host variable
5.6.83. cusparseSbsric02_bufferSize
This function returns the size of the buffer used in bsric02.
integer(4) function cusparseSbsric02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.84. cusparseDbsric02_bufferSize
This function returns the size of the buffer used in bsric02.
integer(4) function cusparseDbsric02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.85. cusparseCbsric02_bufferSize
This function returns the size of the buffer used in bsric02.
integer(4) function cusparseCbsric02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.86. cusparseZbsric02_bufferSize
This function returns the size of the buffer used in bsric02.
integer(4) function cusparseZbsric02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.87. cusparseSbsric02_analysis
This function performs the analysis phase of bsric02.
integer(4) function cusparseSbsric02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.88. cusparseDbsric02_analysis
This function performs the analysis phase of bsric02.
integer(4) function cusparseDbsric02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.89. cusparseCbsric02_analysis
This function performs the analysis phase of bsric02.
integer(4) function cusparseCbsric02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.90. cusparseZbsric02_analysis
This function performs the analysis phase of bsric02.
integer(4) function cusparseZbsric02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.91. cusparseSbsric02
BSRIC02 performs the solve phase of the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseSbsric02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.92. cusparseDbsric02
BSRIC02 performs the solve phase of the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseDbsric02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.93. cusparseCbsric02
BSRIC02 performs the solve phase of the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseCbsric02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.94. cusparseZbsric02
BSRIC02 performs the solve phase of the incomplete-Cholesky factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseZbsric02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsric02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.95. cusparseXbsric02_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either a structural zero or a numerical zero. Otherwise, position is set to -1.
integer(4) function cusparseXbsric02_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseBsric02Info) :: info
integer(4), device :: position ! device or host variable
5.6.96. cusparseSbsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseSbsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseBsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
real(4), device :: boost_val ! device or host variable
5.6.97. cusparseDbsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseDbsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseBsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
real(8), device :: boost_val ! device or host variable
5.6.98. cusparseCbsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseCbsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseBsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
complex(4), device :: boost_val ! device or host variable
5.6.99. cusparseZbsrilu02_numericBoost
This function sets the numeric boost used in the incomplete-LU factorization: when enable_boost is set to 1 and a pivot is smaller in magnitude than tol, the pivot is replaced by boost_val.
integer(4) function cusparseZbsrilu02_numericBoost(handle, info, enable_boost, tol, boost_val)
type(cusparseHandle) :: handle
type(cusparseBsrilu02Info) :: info
integer :: enable_boost
real(8), device :: tol ! device or host variable
complex(8), device :: boost_val ! device or host variable
5.6.100. cusparseSbsrilu02_bufferSize
This function returns the size of the buffer used in bsrilu02.
integer(4) function cusparseSbsrilu02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.101. cusparseDbsrilu02_bufferSize
This function returns the size of the buffer used in bsrilu02.
integer(4) function cusparseDbsrilu02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.102. cusparseCbsrilu02_bufferSize
This function returns the size of the buffer used in bsrilu02.
integer(4) function cusparseCbsrilu02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.103. cusparseZbsrilu02_bufferSize
This function returns the size of the buffer used in bsrilu02.
integer(4) function cusparseZbsrilu02_bufferSize(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: pBufferSize ! integer(8) also accepted
5.6.104. cusparseSbsrilu02_analysis
This function performs the analysis phase of bsrilu02.
integer(4) function cusparseSbsrilu02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.105. cusparseDbsrilu02_analysis
This function performs the analysis phase of bsrilu02.
integer(4) function cusparseDbsrilu02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.106. cusparseCbsrilu02_analysis
This function performs the analysis phase of bsrilu02.
integer(4) function cusparseCbsrilu02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.107. cusparseZbsrilu02_analysis
This function performs the analysis phase of bsrilu02.
integer(4) function cusparseZbsrilu02_analysis(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.108. cusparseSbsrilu02
BSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseSbsrilu02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.109. cusparseDbsrilu02
BSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseDbsrilu02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.110. cusparseCbsrilu02
BSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseCbsrilu02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.111. cusparseZbsrilu02
BSRILU02 performs the solve phase of the incomplete-LU factorization with zero fill-in and no pivoting. A is an (mb*blockDim) x (mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.
integer(4) function cusparseZbsrilu02(handle, dirA, mb, nnzb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, info, policy, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dirA
integer(4) :: mb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: blockDim
type(cusparseBsrilu02Info) :: info
integer(4) :: policy
character(c_char), device :: pBuffer(*)
5.6.112. cusparseXbsrilu02_zeroPivot
This function returns an error code equal to CUSPARSE_STATUS_ZERO_PIVOT and sets position to j when A(j,j) is either a structural zero or a numerical zero. Otherwise, position is set to -1.
integer(4) function cusparseXbsrilu02_zeroPivot(handle, info, position)
type(cusparseHandle) :: handle
type(cusparseBsrilu02Info) :: info
integer(4), device :: position ! device or host variable
5.7. CUSPARSE Reordering Functions
This section contains interfaces for the reordering functions that are used to manipulate sparse matrices.
5.7.1. cusparseScsrColor
This function performs the coloring of the adjacency graph associated with the matrix A stored in CSR format.
integer(4) function cusparseScsrColor(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, fractionToColor, ncolors, coloring, reordering, info)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseColorInfo) :: info
integer :: m, nnz
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), coloring(*), reordering(*)
real(4), device :: fractionToColor ! device or host variable
integer(4), device :: ncolors ! device or host variable
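A minimal, hedged sketch of a coloring call follows. It assumes the CSR device arrays, handle, and descriptor are already initialized and that the cusparseColorInfo object has been created with the corresponding helper routine; setting fractionToColor to 1.0 requests that every row receive a color. All names here are illustrative.
! Illustrative fragment: color the whole adjacency graph of an m x m CSR
! matrix. ncolors comes back on the host; coloring/reordering have length m.
fractionToColor = 1.0
istat = cusparseScsrColor(handle, m, nnz, descrA, csrVal, csrRowPtr, &
        csrColInd, fractionToColor, ncolors, coloring, reordering, info)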
5.7.2. cusparseDcsrColor
This function performs the coloring of the adjacency graph associated with the matrix A stored in CSR format.
integer(4) function cusparseDcsrColor(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, fractionToColor, ncolors, coloring, reordering, info)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseColorInfo) :: info
integer :: m, nnz
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), coloring(*), reordering(*)
real(8), device :: fractionToColor ! device or host variable
integer(4), device :: ncolors ! device or host variable
5.7.3. cusparseCcsrColor
This function performs the coloring of the adjacency graph associated with the matrix A stored in CSR format.
integer(4) function cusparseCcsrColor(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, fractionToColor, ncolors, coloring, reordering, info)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseColorInfo) :: info
integer :: m, nnz
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), coloring(*), reordering(*)
real(4), device :: fractionToColor ! device or host variable
integer(4), device :: ncolors ! device or host variable
5.7.4. cusparseZcsrColor
This function performs the coloring of the adjacency graph associated with the matrix A stored in CSR format.
integer(4) function cusparseZcsrColor(handle, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, fractionToColor, ncolors, coloring, reordering, info)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseColorInfo) :: info
integer :: m, nnz
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), coloring(*), reordering(*)
real(8), device :: fractionToColor ! device or host variable
integer(4), device :: ncolors ! device or host variable
5.8. CUSPARSE Format Conversion Functions
This section contains interfaces for the conversion functions that are used to switch between different sparse and dense matrix storage formats.
5.8.1. cusparseSbsr2csr
This function converts a sparse matrix in BSR format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by the arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseSbsr2csr(handle, dirA, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, mb, nb, blockDim
type(cusparseMatDescr) :: descrA, descrC
real(4), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*), csrRowPtrC(*), csrColIndC(*)
5.8.2. cusparseDbsr2csr
This function converts a sparse matrix in BSR format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by the arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseDbsr2csr(handle, dirA, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, mb, nb, blockDim
type(cusparseMatDescr) :: descrA, descrC
real(8), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*), csrRowPtrC(*), csrColIndC(*)
5.8.3. cusparseCbsr2csr
This function converts a sparse matrix in BSR format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by the arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseCbsr2csr(handle, dirA, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, mb, nb, blockDim
type(cusparseMatDescr) :: descrA, descrC
complex(4), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*), csrRowPtrC(*), csrColIndC(*)
5.8.4. cusparseZbsr2csr
This function converts a sparse matrix in BSR format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by the arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseZbsr2csr(handle, dirA, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, mb, nb, blockDim
type(cusparseMatDescr) :: descrA, descrC
complex(8), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*), csrRowPtrC(*), csrColIndC(*)
5.8.5. cusparseXcoo2csr
This function converts the array containing the uncompressed row indices (corresponding to COO format) into an array of compressed row pointers (corresponding to CSR format).
integer(4) function cusparseXcoo2csr(handle, cooRowInd, nnz, m, csrRowPtr, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz, m, idxBase
integer(4), device :: cooRowInd(*), csrRowPtr(*)
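The COO-to-CSR conversion only touches the row indices, so a complete example is short. The following is a hedged, self-contained sketch (the program and data are illustrative, not from the Examples chapter) that compresses the sorted row indices of a 4x4 matrix with six nonzeros, using 0-based indexing.
program coo2csr_sketch
  use cudafor
  use cusparse
  implicit none
  integer, parameter :: m = 4, nnz = 6
  integer(4), device :: cooRowInd_d(nnz), csrRowPtr_d(m+1)
  integer(4) :: csrRowPtr(m+1), istat
  type(cusparseHandle) :: h
  istat = cusparseCreate(h)
  cooRowInd_d = [0, 0, 1, 2, 2, 3]      ! 0-based row index of each nonzero
  istat = cusparseXcoo2csr(h, cooRowInd_d, nnz, m, csrRowPtr_d, &
                           CUSPARSE_INDEX_BASE_ZERO)
  csrRowPtr = csrRowPtr_d               ! expected result: 0 2 3 5 6
  print *, csrRowPtr
  istat = cusparseDestroy(h)
end program coo2csr_sketch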
5.8.6. cusparseScsc2dense
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseScsc2dense(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(4), device :: cscValA(*), A(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.7. cusparseDcsc2dense
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseDcsc2dense(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(8), device :: cscValA(*), A(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.8. cusparseCcsc2dense
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseCcsc2dense(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(4), device :: cscValA(*), A(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.9. cusparseZcsc2dense
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseZcsc2dense(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(8), device :: cscValA(*), A(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.10. cusparseScsc2hyb
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the sparse matrix A in HYB format.
integer(4) function cusparseScsc2hyb(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
real(4), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
type(cusparseHybMat) :: hybA
5.8.11. cusparseDcsc2hyb
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the sparse matrix A in HYB format.
integer(4) function cusparseDcsc2hyb(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
real(8), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
type(cusparseHybMat) :: hybA
5.8.12. cusparseCcsc2hyb
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the sparse matrix A in HYB format.
integer(4) function cusparseCcsc2hyb(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
complex(4), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
type(cusparseHybMat) :: hybA
5.8.13. cusparseZcsc2hyb
This function converts the sparse matrix in CSC format that is defined by the three arrays cscValA, cscColPtrA, and cscRowIndA into the sparse matrix A in HYB format.
integer(4) function cusparseZcsc2hyb(handle, m, n, descrA, cscValA, cscRowIndA, cscColPtrA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
complex(8), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
type(cusparseHybMat) :: hybA
5.8.14. cusparseXcsr2bsrNnz
cusparseXcsr2bsrNnz computes the number of nonzero blocks which will be produced by CSR2BSR, along with the block row pointer array bsrRowPtrC.
integer(4) function cusparseXcsr2bsrNnz(handle, dirA, m, n, descrA, csrRowPtrA, csrColIndA, blockDim, descrC, bsrRowPtrC, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: dirA, m, n, blockdim
type(cusparseMatDescr) :: descrA, descrC
integer(4), device :: csrRowPtrA(*), csrColIndA(*), bsrRowPtrC(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.8.15. cusparseScsr2bsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseScsr2bsr(handle, dirA, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, blockDim, descrC, bsrValC, bsrRowPtrC, bsrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, m, n, blockdim
type(cusparseMatDescr) :: descrA, descrC
real(4), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), bsrRowPtrC(*), bsrColIndC(*)
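The CSR-to-BSR conversion is normally a two-step sequence: cusparseXcsr2bsrNnz to obtain the block row pointers and block count, then the typed conversion. A hedged single-precision sketch follows; the CSR device arrays and both descriptors are assumed set up with 0-based indexing, nnzb is a host integer (the default host pointer mode), and the allocatable device arrays are illustrative names.
! Illustrative fragment: CSR (m x n) to BSR with square blocks of size blockDim.
mb = (m + blockDim - 1) / blockDim
allocate(bsrRowPtrC(mb+1))                       ! integer(4), device, allocatable
istat = cusparseXcsr2bsrNnz(handle, CUSPARSE_DIRECTION_ROW, m, n, descrA,   &
        csrRowPtrA, csrColIndA, blockDim, descrC, bsrRowPtrC, nnzb)
allocate(bsrValC(nnzb*blockDim*blockDim), bsrColIndC(nnzb))
istat = cusparseScsr2bsr(handle, CUSPARSE_DIRECTION_ROW, m, n, descrA,      &
        csrValA, csrRowPtrA, csrColIndA, blockDim, descrC,                  &
        bsrValC, bsrRowPtrC, bsrColIndC)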
5.8.16. cusparseDcsr2bsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseDcsr2bsr(handle, dirA, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, blockDim, descrC, bsrValC, bsrRowPtrC, bsrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, m, n, blockdim
type(cusparseMatDescr) :: descrA, descrC
real(8), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), bsrRowPtrC(*), bsrColIndC(*)
5.8.17. cusparseCcsr2bsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseCcsr2bsr(handle, dirA, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, blockDim, descrC, bsrValC, bsrRowPtrC, bsrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, m, n, blockdim
type(cusparseMatDescr) :: descrA, descrC
complex(4), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), bsrRowPtrC(*), bsrColIndC(*)
5.8.18. cusparseZcsr2bsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseZcsr2bsr(handle, dirA, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, blockDim, descrC, bsrValC, bsrRowPtrC, bsrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dirA, m, n, blockdim
type(cusparseMatDescr) :: descrA, descrC
complex(8), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*), bsrRowPtrC(*), bsrColIndC(*)
5.8.19. cusparseXcsr2coo
This function converts the array containing the compressed row pointers (corresponding to CSR format) into an array of uncompressed row indices (corresponding to COO format).
integer(4) function cusparseXcsr2coo(handle, csrRowPtr, nnz, m, cooRowInd, idxBase)
type(cusparseHandle) :: handle
integer(4) :: nnz, m, idxBase
integer(4), device :: csrRowPtr(*), cooRowInd(*)
5.8.20. cusparseScsr2csc
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrVal, csrRowPtr, and csrColInd into a sparse matrix in CSC format that is defined by arrays cscVal, cscRowInd, and cscColPtr.
This function was removed in CUDA 11.0. Use the cusparseCsr2cscEx2 routines instead.
integer(4) function cusparseScsr2csc(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, cscRowInd, cscColPtr, copyValues, idxBase)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz, copyValues, idxBase
real(4), device :: csrVal(*), cscVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*), cscRowInd(*), cscColPtr(*)
5.8.21. cusparseDcsr2csc
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrVal, csrRowPtr, and csrColInd into a sparse matrix in CSC format that is defined by arrays cscVal, cscRowInd, and cscColPtr.
This function was removed in CUDA 11.0. Use the cusparseCsr2cscEx2 routines instead.
integer(4) function cusparseDcsr2csc(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, cscRowInd, cscColPtr, copyValues, idxBase)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz, copyValues, idxBase
real(8), device :: csrVal(*), cscVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*), cscRowInd(*), cscColPtr(*)
5.8.22. cusparseCcsr2csc
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrVal, csrRowPtr, and csrColInd into a sparse matrix in CSC format that is defined by arrays cscVal, cscRowInd, and cscColPtr.
This function was removed in CUDA 11.0. Use the cusparseCsr2cscEx2 routines instead.
integer(4) function cusparseCcsr2csc(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, cscRowInd, cscColPtr, copyValues, idxBase)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz, copyValues, idxBase
complex(4), device :: csrVal(*), cscVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*), cscRowInd(*), cscColPtr(*)
5.8.23. cusparseZcsr2csc
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrVal, csrRowPtr, and csrColInd into a sparse matrix in CSC format that is defined by arrays cscVal, cscRowInd, and cscColPtr.
This function was removed in CUDA 11.0. Use the cusparseCsr2cscEx2 routines instead.
integer(4) function cusparseZcsr2csc(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, cscRowInd, cscColPtr, copyValues, idxBase)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz, copyValues, idxBase
complex(8), device :: csrVal(*), cscVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*), cscRowInd(*), cscColPtr(*)
5.8.24. cusparseCsr2cscEx2_bufferSize
This function determines the size of the work buffer needed by cusparseCsr2cscEx2.
integer(4) function cusparseScsr2cscEx2_bufferSize(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, &
cscColPtr, cscRowInd, valType, copyValues, idxBase, alg, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz, valType, copyValues, idxBase, alg
real(4), device :: csrVal(*), cscVal(*) ! Can be any supported type
integer(4), device :: csrRowPtr(*), csrColInd(*), cscRowInd(*), cscColPtr(*)
integer(8) :: bufferSize
5.8.25. cusparseCsr2cscEx2
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrVal, csrRowPtr, and csrColInd into a sparse matrix in CSC format that is defined by arrays cscVal, cscRowInd, and cscColPtr. The type of the arrays is set by the valType argument.
integer(4) function cusparseScsr2cscEx2(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, &
cscColPtr, cscRowInd, valType, copyValues, idxBase, alg, buffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz, valType, copyValues, idxBase, alg
real(4), device :: csrVal(*), cscVal(*) ! Can be any supported type
integer(4), device :: csrRowPtr(*), csrColInd(*), cscRowInd(*), cscColPtr(*)
integer(1), device :: buffer ! Can be any type
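The two entries above are used together: query the work buffer size, allocate the buffer on the device, then convert. A hedged single-precision sketch follows; the handle and the CSR/CSC device arrays are assumed allocated, and CUDA_R_32F, CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO, and CUSPARSE_CSR2CSC_ALG1 are taken to be the module constants for the valType, copyValues, idxBase, and alg arguments.
! Illustrative fragment: CSR -> CSC with values copied.
integer(8) :: bufferSize
integer(1), device, allocatable :: buffer(:)
istat = cusparseScsr2cscEx2_bufferSize(handle, m, n, nnz, csrVal, csrRowPtr, &
        csrColInd, cscVal, cscColPtr, cscRowInd, CUDA_R_32F,                 &
        CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO,                   &
        CUSPARSE_CSR2CSC_ALG1, bufferSize)
allocate(buffer(bufferSize))
istat = cusparseScsr2cscEx2(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, &
        cscVal, cscColPtr, cscRowInd, CUDA_R_32F, CUSPARSE_ACTION_NUMERIC,   &
        CUSPARSE_INDEX_BASE_ZERO, CUSPARSE_CSR2CSC_ALG1, buffer)
deallocate(buffer)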
5.8.26. cusparseScsr2dense
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseScsr2dense(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*), A(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.27. cusparseDcsr2dense
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseDcsr2dense(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*), A(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.28. cusparseCcsr2dense
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseCcsr2dense(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*), A(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.29. cusparseZcsr2dense
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseZcsr2dense(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, A, lda)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*), A(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.30. cusparseScsr2hyb
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in HYB format.
integer(4) function cusparseScsr2hyb(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.31. cusparseDcsr2hyb
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in HYB format.
integer(4) function cusparseDcsr2hyb(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.32. cusparseCcsr2hyb
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in HYB format.
integer(4) function cusparseCcsr2hyb(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.33. cusparseZcsr2hyb
This function converts the sparse matrix in CSR format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in HYB format.
integer(4) function cusparseZcsr2hyb(handle, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.34. cusparseSdense2csc
This function converts the matrix A in dense format into a sparse matrix in CSC format.
integer(4) function cusparseSdense2csc(handle, m, n, descrA, A, lda, nnzPerCol, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(4), device :: A(*), cscValA(*)
integer(4), device :: nnzPerCol(*), cscRowIndA(*), cscColPtrA(*)
5.8.35. cusparseDdense2csc
This function converts the matrix A in dense format into a sparse matrix in CSC format.
integer(4) function cusparseDdense2csc(handle, m, n, descrA, A, lda, nnzPerCol, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(8), device :: A(*), cscValA(*)
integer(4), device :: nnzPerCol(*), cscRowIndA(*), cscColPtrA(*)
5.8.36. cusparseCdense2csc
This function converts the matrix A in dense format into a sparse matrix in CSC format.
integer(4) function cusparseCdense2csc(handle, m, n, descrA, A, lda, nnzPerCol, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(4), device :: A(*), cscValA(*)
integer(4), device :: nnzPerCol(*), cscRowIndA(*), cscColPtrA(*)
5.8.37. cusparseZdense2csc
This function converts the matrix A in dense format into a sparse matrix in CSC format.
integer(4) function cusparseZdense2csc(handle, m, n, descrA, A, lda, nnzPerCol, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(8), device :: A(*), cscValA(*)
integer(4), device :: nnzPerCol(*), cscRowIndA(*), cscColPtrA(*)
5.8.38. cusparseSdense2csr
This function converts the matrix A in dense format into a sparse matrix in CSR format.
integer(4) function cusparseSdense2csr(handle, m, n, descrA, A, lda, nnzPerRow, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(4), device :: A(*), csrValA(*)
integer(4), device :: nnzPerRow(*), csrRowPtrA(*), csrColIndA(*)
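Compressing a dense matrix to CSR is also a two-step sequence: cusparseSnnz (Section 5.8.58) first counts the nonzeros per row, then cusparseSdense2csr fills the CSR arrays. Below is a hedged, self-contained sketch for a small single-precision matrix; it assumes the default host pointer mode so nnzTotal can be an ordinary host integer, and all names are illustrative.
program dense2csr_sketch
  use cudafor
  use cusparse
  implicit none
  integer, parameter :: m = 3, n = 3, lda = 3
  real(4), device :: A_d(lda*n)
  integer(4), device :: nnzPerRow_d(m), csrRowPtr_d(m+1)
  real(4), device, allocatable :: csrVal_d(:)
  integer(4), device, allocatable :: csrColInd_d(:)
  integer(4) :: nnzTotal, istat
  type(cusparseHandle) :: h
  type(cusparseMatDescr) :: descrA
  ! Column-major 3x3 dense matrix with four nonzeros.
  A_d = [1.0, 0.0, 0.0,   0.0, 2.0, 3.0,   0.0, 0.0, 4.0]
  istat = cusparseCreate(h)
  istat = cusparseCreateMatDescr(descrA)
  istat = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO)
  ! Count nonzeros per row; nnzTotal is returned on the host.
  istat = cusparseSnnz(h, CUSPARSE_DIRECTION_ROW, m, n, descrA, A_d, lda, &
                       nnzPerRow_d, nnzTotal)
  allocate(csrVal_d(nnzTotal), csrColInd_d(nnzTotal))
  istat = cusparseSdense2csr(h, m, n, descrA, A_d, lda, nnzPerRow_d, &
                             csrVal_d, csrRowPtr_d, csrColInd_d)
  print *, 'nonzeros found: ', nnzTotal          ! expected: 4
  istat = cusparseDestroy(h)
end program dense2csr_sketch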
5.8.39. cusparseDdense2csr
This function converts the matrix A in dense format into a sparse matrix in CSR format.
integer(4) function cusparseDdense2csr(handle, m, n, descrA, A, lda, nnzPerRow, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
real(8), device :: A(*), csrValA(*)
integer(4), device :: nnzPerRow(*), csrRowPtrA(*), csrColIndA(*)
5.8.40. cusparseCdense2csr
This function converts the matrix A in dense format into a sparse matrix in CSR format.
integer(4) function cusparseCdense2csr(handle, m, n, descrA, A, lda, nnzPerRow, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(4), device :: A(*), csrValA(*)
integer(4), device :: nnzPerRow(*), csrRowPtrA(*), csrColIndA(*)
5.8.41. cusparseZdense2csr
This function converts the matrix A in dense format into a sparse matrix in CSR format.
integer(4) function cusparseZdense2csr(handle, m, n, descrA, A, lda, nnzPerRow, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda
type(cusparseMatDescr) :: descrA
complex(8), device :: A(*), csrValA(*)
integer(4), device :: nnzPerRow(*), csrRowPtrA(*), csrColIndA(*)
5.8.42. cusparseSdense2hyb
This function converts the matrix A in dense format into a sparse matrix in HYB format.
integer(4) function cusparseSdense2hyb(handle, m, n, descrA, A, lda, nnzPerRow, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(4), device :: A(*)
integer(4), device :: nnzPerRow(*)
5.8.43. cusparseDdense2hyb
This function converts the matrix A in dense format into a sparse matrix in HYB format.
integer(4) function cusparseDdense2hyb(handle, m, n, descrA, A, lda, nnzPerRow, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(8), device :: A(*)
integer(4), device :: nnzPerRow(*)
5.8.44. cusparseCdense2hyb
This function converts the matrix A in dense format into a sparse matrix in HYB format.
integer(4) function cusparseCdense2hyb(handle, m, n, descrA, A, lda, nnzPerRow, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(4), device :: A(*)
integer(4), device :: nnzPerRow(*)
5.8.45. cusparseZdense2hyb
This function converts the matrix A in dense format into a sparse matrix in HYB format.
integer(4) function cusparseZdense2hyb(handle, m, n, descrA, A, lda, nnzPerRow, hybA, userEllWidth, partitionType)
type(cusparseHandle) :: handle
integer(4) :: m, n, lda, userEllWidth, partitionType
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(8), device :: A(*)
integer(4), device :: nnzPerRow(*)
5.8.46. cusparseShyb2csc
This function converts the sparse matrix A in HYB format into a sparse matrix in CSC format.
integer(4) function cusparseShyb2csc(handle, descrA, hybA, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(4), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.47. cusparseDhyb2csc
This function converts the sparse matrix A in HYB format into a sparse matrix in CSC format.
integer(4) function cusparseDhyb2csc(handle, descrA, hybA, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(8), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.48. cusparseChyb2csc
This function converts the sparse matrix A in HYB format into a sparse matrix in CSC format.
integer(4) function cusparseChyb2csc(handle, descrA, hybA, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(4), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.49. cusparseZhyb2csc
This function converts the sparse matrix A in HYB format into a sparse matrix in CSC format.
integer(4) function cusparseZhyb2csc(handle, descrA, hybA, cscValA, cscRowIndA, cscColPtrA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(8), device :: cscValA(*)
integer(4), device :: cscRowIndA(*), cscColPtrA(*)
5.8.50. cusparseShyb2csr
This function converts the sparse matrix A in HYB format into a sparse matrix in CSR format.
integer(4) function cusparseShyb2csr(handle, descrA, hybA, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.51. cusparseDhyb2csr
This function converts the sparse matrix A in HYB format into a sparse matrix in CSR format.
integer(4) function cusparseDhyb2csr(handle, descrA, hybA, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.52. cusparseChyb2csr
This function converts the sparse matrix A in HYB format into a sparse matrix in CSR format.
integer(4) function cusparseChyb2csr(handle, descrA, hybA, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(4), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.53. cusparseZhyb2csr
This function converts the sparse matrix A in HYB format into a sparse matrix in CSR format.
integer(4) function cusparseZhyb2csr(handle, descrA, hybA, csrValA, csrRowPtrA, csrColIndA)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(8), device :: csrValA(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
5.8.54. cusparseShyb2dense
This function converts the sparse matrix in HYB format into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseShyb2dense(handle, descrA, hybA, A, lda)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(4), device :: A(*)
integer(4) :: lda
5.8.55. cusparseDhyb2dense
This function converts the sparse matrix in HYB format into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseDhyb2dense(handle, descrA, hybA, A, lda)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
real(8), device :: A(*)
integer(4) :: lda
5.8.56. cusparseChyb2dense
This function converts the sparse matrix in HYB format into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseChyb2dense(handle, descrA, hybA, A, lda)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(4), device :: A(*)
integer(4) :: lda
5.8.57. cusparseZhyb2dense
This function converts the sparse matrix in HYB format into the matrix A in dense format. The dense matrix A is filled in with the values of the sparse matrix and with zeros elsewhere.
integer(4) function cusparseZhyb2dense(handle, descrA, hybA, A, lda)
type(cusparseHandle) :: handle
type(cusparseMatDescr) :: descrA
type(cusparseHybMat) :: hybA
complex(8), device :: A(*)
integer(4) :: lda
5.8.58. cusparseSnnz
This function computes the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix.
integer(4) function cusparseSnnz(handle, dirA, m, n, descrA, A, lda, nnzPerRowColumn, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: dirA, m, n, lda
type(cusparseMatDescr) :: descrA
real(4), device :: A(*)
integer(4), device :: nnzPerRowColumn(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.8.59. cusparseDnnz
This function computes the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix.
integer(4) function cusparseDnnz(handle, dirA, m, n, descrA, A, lda, nnzPerRowColumn, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: dirA, m, n, lda
type(cusparseMatDescr) :: descrA
real(8), device :: A(*)
integer(4), device :: nnzPerRowColumn(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.8.60. cusparseCnnz
This function computes the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix.
integer(4) function cusparseCnnz(handle, dirA, m, n, descrA, A, lda, nnzPerRowColumn, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: dirA, m, n, lda
type(cusparseMatDescr) :: descrA
complex(4), device :: A(*)
integer(4), device :: nnzPerRowColumn(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.8.61. cusparseZnnz
This function computes the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix.
integer(4) function cusparseZnnz(handle, dirA, m, n, descrA, A, lda, nnzPerRowColumn, nnzTotalDevHostPtr)
type(cusparseHandle) :: handle
integer :: dirA, m, n, lda
type(cusparseMatDescr) :: descrA
complex(8), device :: A(*)
integer(4), device :: nnzPerRowColumn(*)
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
5.8.62. cusparseSgebsr2gebsc_bufferSize
This function returns the size of the buffer used in gebsr2gebsc.
integer(4) function cusparseSgebsr2gebsc_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, bsrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
real(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.63. cusparseDgebsr2gebsc_bufferSize
This function returns the size of the buffer used in gebsr2gebsc.
integer(4) function cusparseDgebsr2gebsc_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, bsrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
real(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.64. cusparseCgebsr2gebsc_bufferSize
This function returns the size of the buffer used in gebsr2gebsc.
integer(4) function cusparseCgebsr2gebsc_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, bsrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
complex(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.65. cusparseZgebsr2gebsc_bufferSize
This function returns the size of the buffer used in gebsr2gebsc.
integer(4) function cusparseZgebsr2gebsc_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, bsrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
complex(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.66. cusparseSgebsr2gebsc
This function converts a sparse matrix in general block-CSR storage format to a sparse matrix in general block-CSC storage format.
integer(4) function cusparseSgebsr2gebsc(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDim, colBlockDim, bscVal, bscRowInd, bscColPtr, copyValues, baseIdx, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
real(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
real(4), device :: bscVal(*)
integer(4), device :: bscRowInd(*), bscColPtr(*)
integer(4) :: copyValues, baseIdx
character(c_char), device :: pBuffer(*)
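As with most conversions that need scratch space, the buffer-size query above is paired with the conversion itself. A hedged single-precision sketch follows, assuming the general-BSR input arrays and the BSC output arrays are already allocated on the device, pBuffer is an allocatable device character array, and CUSPARSE_ACTION_NUMERIC and CUSPARSE_INDEX_BASE_ZERO are the usual module constants.
! Illustrative fragment: general BSR -> general BSC, copying numeric values.
istat = cusparseSgebsr2gebsc_bufferSize(handle, mb, nb, nnzb, bsrVal,        &
        bsrRowPtr, bsrColInd, rowBlockDim, colBlockDim, pBufferSize)
allocate(pBuffer(pBufferSize))
istat = cusparseSgebsr2gebsc(handle, mb, nb, nnzb, bsrVal, bsrRowPtr,        &
        bsrColInd, rowBlockDim, colBlockDim, bscVal, bscRowInd, bscColPtr,   &
        CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO, pBuffer)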
5.8.67. cusparseDgebsr2gebsc
This function converts a sparse matrix in general block-CSR storage format to a sparse matrix in general block-CSC storage format.
integer(4) function cusparseDgebsr2gebsc(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDim, colBlockDim, bscVal, bscRowInd, bscColPtr, copyValues, baseIdx, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
real(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
real(8), device :: bscVal(*)
integer(4), device :: bscRowInd(*), bscColPtr(*)
integer(4) :: copyValues, baseIdx
character(c_char), device :: pBuffer(*)
5.8.68. cusparseCgebsr2gebsc
This function converts a sparse matrix in general block-CSR storage format to a sparse matrix in general block-CSC storage format.
integer(4) function cusparseCgebsr2gebsc(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDim, colBlockDim, bscVal, bscRowInd, bscColPtr, copyValues, baseIdx, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
complex(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
complex(4), device :: bscVal(*)
integer(4), device :: bscRowInd(*), bscColPtr(*)
integer(4) :: copyValues, baseIdx
character(c_char), device :: pBuffer(*)
5.8.69. cusparseZgebsr2gebsc
This function converts a sparse matrix in general block-CSR storage format to a sparse matrix in general block-CSC storage format.
integer(4) function cusparseZgebsr2gebsc(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDim, colBlockDim, bscVal, bscRowInd, bscColPtr, copyValues, baseIdx, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
complex(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
complex(8), device :: bscVal(*)
integer(4), device :: bscRowInd(*), bscColPtr(*)
integer(4) :: copyValues, baseIdx
character(c_char), device :: pBuffer(*)
5.8.70. cusparseSgebsr2gebsr_bufferSize
This function returns the size of the buffer used in gebsr2gebsrnnz and gebsr2gebsr.
integer(4) function cusparseSgebsr2gebsr_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
real(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.71. cusparseDgebsr2gebsr_bufferSize
This function returns the size of the buffer used in gebsr2gebsrnnz and gebsr2gebsr.
integer(4) function cusparseDgebsr2gebsr_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
real(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.72. cusparseCgebsr2gebsr_bufferSize
This function returns the size of the buffer used in gebsr2gebsrnnz and gebsr2gebsr.
integer(4) function cusparseCgebsr2gebsr_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
complex(4), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.73. cusparseZgebsr2gebsr_bufferSize
This function returns the size of the buffer used in gebsr2gebsrnnz and gebsr2gebsr.
integer(4) function cusparseZgebsr2gebsr_bufferSize(handle, mb, nb, nnzb, bsrVal, bsrRowPtr, &
bsrColInd, rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: mb, nb, nnzb
complex(8), device :: bsrVal(*)
integer(4), device :: bsrRowPtr(*), bsrColInd(*)
integer(4) :: rowBlockDimA, colBlockDimA, rowBlockDimC, colBlockDimC
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.74. cusparseXgebsr2gebsrNnz
cusparseXgebsr2gebsrNnz computes the number of nonzero blocks which will be produced by GEBSR2GEBSR, along with the block row pointer array bsrRowPtrC.
integer(4) function cusparseXgebsr2gebsrNnz(handle, dir, mb, nb, nnzb, descrA, bsrRowPtrA, &
bsrColIndA, rowBlockDimA, colBlockDimA, descrC, bsrRowPtrC, rowBlockDimC, colBlockDimC, nnzTotalDevHostPtr, pBuffer)
type(cusparseHandle) :: handle
integer :: dir, mb, nb, nnzb
type(cusparseMatDescr) :: descrA
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*)
integer :: rowBlockDimC, colBlockDimC
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
character, device :: pBuffer(*)
5.8.75. cusparseSgebsr2gebsr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in another general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseSgebsr2gebsr(handle, dir, mb, nb, nnzb, descrA, bsrValA, bsrRowPtrA, &
bsrColIndA, rowBlockDimA, colBlockDimA, descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb, nnzb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*), bsrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.76. cusparseDgebsr2gebsr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in another general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseDgebsr2gebsr(handle, dir, mb, nb, nnzb, descrA, bsrValA, bsrRowPtrA, &
bsrColIndA, rowBlockDimA, colBlockDimA, descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb, nnzb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*), bsrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.77. cusparseCgebsr2gebsr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in another general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseCgebsr2gebsr(handle, dir, mb, nb, nnzb, descrA, bsrValA, bsrRowPtrA, &
bsrColIndA, rowBlockDimA, colBlockDimA, descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb, nnzb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*), bsrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.78. cusparseZgebsr2gebsr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in another general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseZgebsr2gebsr(handle, dir, mb, nb, nnzb, descrA, bsrValA, bsrRowPtrA, &
bsrColIndA, rowBlockDimA, colBlockDimA, descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb, nnzb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*), bsrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.79. cusparseSgebsr2csr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseSgebsr2csr(handle, dir, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, &
rowBlockDimA, colBlockDimA, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb
type(cusparseMatDescr) :: descrA
real(4), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: csrRowPtrC(*), csrColIndC(*)
5.8.80. cusparseDgebsr2csr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseDgebsr2csr(handle, dir, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, &
rowBlockDimA, colBlockDimA, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb
type(cusparseMatDescr) :: descrA
real(8), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: csrRowPtrC(*), csrColIndC(*)
5.8.81. cusparseCgebsr2csr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseCgebsr2csr(handle, dir, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, &
rowBlockDimA, colBlockDimA, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb
type(cusparseMatDescr) :: descrA
complex(4), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: csrRowPtrC(*), csrColIndC(*)
5.8.82. cusparseZgebsr2csr
This function converts a sparse matrix in general BSR storage format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by arrays csrValC, csrRowPtrC, and csrColIndC.
integer(4) function cusparseZgebsr2csr(handle, dir, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, &
rowBlockDimA, colBlockDimA, descrC, csrValC, csrRowPtrC, csrColIndC)
type(cusparseHandle) :: handle
integer(4) :: dir, mb, nb
type(cusparseMatDescr) :: descrA
complex(8), device :: bsrValA(*), csrValC(*)
integer(4), device :: bsrRowPtrA(*), bsrColIndA(*)
integer(4) :: rowBlockDimA, colBlockDimA
type(cusparseMatDescr) :: descrC
integer(4), device :: csrRowPtrC(*), csrColIndC(*)
5.8.83. cusparseScsr2gebsr_bufferSize
This function returns the size of the buffer used in csr2gebsrnnz and csr2gebsr.
integer(4) function cusparseScsr2gebsr_bufferSize(handle, dir, m, n, descrA, csrVal, csrRowPtr, csrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.84. cusparseDcsr2gebsr_bufferSize
This function returns the size of the buffer used in csr2gebsrnnz and csr2gebsr.
integer(4) function cusparseDcsr2gebsr_bufferSize(handle, dir, m, n, descrA, csrVal, csrRowPtr, csrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.85. cusparseCcsr2gebsr_bufferSize
This function returns the size of the buffer used in csr2gebsrnnz and csr2gebsr.
integer(4) function cusparseCcsr2gebsr_bufferSize(handle, dir, m, n, descrA, csrVal, csrRowPtr, csrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.86. cusparseZcsr2gebsr_bufferSize
This function returns the size of the buffer used in csr2gebsrnnz and csr2gebsr.
integer(4) function cusparseZcsr2gebsr_bufferSize(handle, dir, m, n, descrA, csrVal, csrRowPtr, csrColInd, rowBlockDim, colBlockDim, pBufferSize)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
integer(4) :: rowBlockDim, colBlockDim
integer(4) :: pBufferSize ! integer(8) also accepted
5.8.87. cusparseXcsr2gebsrNnz
cusparseXcsr2gebsrNnz computes the number of nonzero blocks which will be produced by CSR2GEBSR, along with the block row pointer array bsrRowPtrC.
integer(4) function cusparseXcsr2gebsrNnz(handle, dir, m, n, descrA, csrRowPtrA, csrColIndA, &
descrC, bsrRowPtrC, rowBlockDimC, colBlockDimC, nnzTotalDevHostPtr, pBuffer)
type(cusparseHandle) :: handle
integer :: dir, m, n
type(cusparseMatDescr) :: descrA
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*)
integer :: rowBlockDimC, colBlockDimC
integer(4), device :: nnzTotalDevHostPtr ! device or host variable
character, device :: pBuffer(*)
5.8.88. cusparseScsr2gebsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseScsr2gebsr(handle, dir, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, &
descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
real(4), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.89. cusparseDcsr2gebsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseDcsr2gebsr(handle, dir, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, &
descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
real(8), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.90. cusparseCcsr2gebsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseCcsr2gebsr(handle, dir, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, &
descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
complex(4), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.91. cusparseZcsr2gebsr
This function converts a sparse matrix in CSR storage format that is defined by the three arrays csrValA, csrRowPtrA, and csrColIndA into a sparse matrix in general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.
integer(4) function cusparseZcsr2gebsr(handle, dir, m, n, descrA, csrValA, csrRowPtrA, csrColIndA, &
descrC, bsrValC, bsrRowPtrC, bsrColIndC, rowBlockDimC, colBlockDimC, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: dir, m, n
type(cusparseMatDescr) :: descrA
complex(8), device :: csrValA(*), bsrValC(*)
integer(4), device :: csrRowPtrA(*), csrColIndA(*)
type(cusparseMatDescr) :: descrC
integer(4), device :: bsrRowPtrC(*), bsrColIndC(*)
integer(4) :: rowBlockDimC, colBlockDimC
character(c_char), device :: pBuffer(*)
5.8.92. cusparseCreateIdentityPermutation
This function creates an identity map. The output parameter p represents this map as p = 0, 1, ..., n-1. This function is typically used with coosort, csrsort, cscsort, and csr2csc_indexOnly.
integer(4) function cusparseCreateIdentityPermutation(handle, n, p)
type(cusparseHandle) :: handle
integer(4) :: n
integer(4), device :: p(*)
5.8.93. cusparseXcoosort_bufferSize
This function returns the size of the buffer used in coosort.
integer(4) function cusparseXcoosort_bufferSize(handle, m, n, nnz, cooRows, cooCols, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: cooRows(*), cooCols(*)
integer(8) :: pBufferSizeInBytes
5.8.94. cusparseXcoosortByRow
This function sorts by row the sparse matrix stored in COO format.
integer(4) function cusparseXcoosortByRow(handle, m, n, nnz, cooRows, cooCols, P, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: cooRows(*), cooCols(*), P(*)
character(c_char), device :: pBuffer(*)
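The sorting routines follow the same buffer-query pattern, and the identity permutation from Section 5.8.92 is the usual starting point when the values must be reordered afterwards. A hedged sketch follows, assuming cooRows and cooCols are device arrays holding an unsorted COO matrix and that all other names are illustrative.
! Illustrative fragment: sort a COO matrix by row and record the permutation.
istat = cusparseXcoosort_bufferSize(handle, m, n, nnz, cooRows, cooCols,     &
        pBufferSizeInBytes)
allocate(pBuffer(pBufferSizeInBytes), P(nnz))   ! both device arrays
istat = cusparseCreateIdentityPermutation(handle, nnz, P)
istat = cusparseXcoosortByRow(handle, m, n, nnz, cooRows, cooCols, P, pBuffer)
! P now records the permutation applied to the indices and can be used to
! reorder the value array with a gather operation.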
5.8.95. cusparseXcoosortByColumn
This function sorts by column the sparse matrix stored in COO format.
integer(4) function cusparseXcoosortByColumn(handle, m, n, nnz, cooRows, cooCols, P, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: cooRows(*), cooCols(*), P(*)
character(c_char), device :: pBuffer(*)
5.8.96. cusparseXcsrsort_bufferSize
This function returns the size of the buffer used in csrsort.
integer(4) function cusparseXcsrsort_bufferSize(handle, m, n, nnz, csrRowInd, csrColInd, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: csrRowInd(*), csrColInd(*)
integer(8) :: pBufferSizeInBytes
5.8.97. cusparseXcsrsort
This function sorts the sparse matrix stored in CSR format.
integer(4) function cusparseXcsrsort(handle, m, n, nnz, csrRowInd, csrColInd, P, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: csrRowInd(*), csrColInd(*), P(*)
character(c_char), device :: pBuffer(*)
5.8.98. cusparseXcscsort_bufferSize
This function returns the size of the buffer used in cscsort.
integer(4) function cusparseXcscsort_bufferSize(handle, m, n, nnz, cscColPtr, cscRowInd, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: cscColPtr(*), cscRowInd(*)
integer(8) :: pBufferSizeInBytes
5.8.99. cusparseXcscsort
This function sorts the sparse matrix stored in CSC format.
integer(4) function cusparseXcscsort(handle, m, n, nnz, cscColPtr, cscRowInd, P, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
integer(4), device :: cscColPtr(*), cscRowInd(*), P(*)
character(c_char), device :: pBuffer(*)
5.8.100. cusparseScsru2csr_bufferSize
This function returns the size of the buffer used in csru2csr.
integer(4) function cusparseScsru2csr_bufferSize(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
integer(8) :: pBufferSizeInBytes
5.8.101. cusparseDcsru2csr_bufferSize
This function returns the size of the buffer used in csru2csr.
integer(4) function cusparseDcsru2csr_bufferSize(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
integer(8) :: pBufferSizeInBytes
5.8.102. cusparseCcsru2csr_bufferSize
This function returns the size of the buffer used in csru2csr.
integer(4) function cusparseCcsru2csr_bufferSize(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
integer(8) :: pBufferSizeInBytes
5.8.103. cusparseZcsru2csr_bufferSize
This function returns the size of the buffer used in csru2csr.
integer(4) function cusparseZcsru2csr_bufferSize(handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, info, pBufferSizeInBytes)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
integer(8) :: pBufferSizeInBytes
5.8.104. cusparseScsru2csr
This function converts a matrix from unsorted CSR format to sorted CSR format.
integer(4) function cusparseScsru2csr(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
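A minimal sketch of the csru2csr sequence follows, assuming a handle h, a matrix descriptor descrA, a previously created cusparseCsru2csrInfo object info, and single-precision device CSR arrays csrVal_d, csrRowPtr_d, and csrColInd_d; the names are illustrative only:
integer(8) :: bsize
character(kind=c_char), device, allocatable :: buf_d(:)
status = cusparseScsru2csr_bufferSize(h, m, n, nnz, csrVal_d, csrRowPtr_d, csrColInd_d, info, bsize)
if (bsize .gt. 0) allocate(buf_d(bsize))
! Sort the column indices and values within each row, in place
status = cusparseScsru2csr(h, m, n, nnz, descrA, csrVal_d, csrRowPtr_d, csrColInd_d, info, buf_d)
if (bsize .gt. 0) deallocate(buf_d)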
5.8.105. cusparseDcsru2csr
This function converts a matrix from unsorted CSR format to sorted CSR format.
integer(4) function cusparseDcsru2csr(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.8.106. cusparseCcsru2csr
This function converts a matrix from unsorted CSR format to sorted CSR format.
integer(4) function cusparseCcsru2csr(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.8.107. cusparseZcsru2csr
This function converts a matrix from unsorted CSR format to sorted CSR format.
integer(4) function cusparseZcsru2csr(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.8.108. cusparseScsr2csru
This function performs the backwards transformation from sorted CSR format to unsorted CSR format.
integer(4) function cusparseScsr2csru(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
real(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.8.109. cusparseDcsr2csru
This function performs the backwards transformation from sorted CSR format to unsorted CSR format.
integer(4) function cusparseDcsr2csru(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
real(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.8.110. cusparseCcsr2csru
This function performs the backwards transformation from sorted CSR format to unsorted CSR format.
integer(4) function cusparseCcsr2csru(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
complex(4), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.8.111. cusparseZcsr2csru
This function performs the backwards transformation from sorted CSR format to unsorted CSR format.
integer(4) function cusparseZcsr2csru(handle, m, n, nnz, descrA, csrVal, csrRowPtr, csrColInd, info, pBuffer)
type(cusparseHandle) :: handle
integer(4) :: m, n, nnz
type(cusparseMatDescr) :: descrA
complex(8), device :: csrVal(*)
integer(4), device :: csrRowPtr(*), csrColInd(*)
type(cusparseCsru2csrInfo) :: info
character(c_char), device :: pBuffer(*)
5.9. CUSPARSE Generic API Functions
This section contains interfaces for the generic API functions, which create and manage the generic descriptors and perform vector-vector (SpVV), matrix-vector (SpMV), matrix-matrix (SpMM), triangular solve (SpSV, SpSM), sparse-sparse (SpGEMM), and conversion operations.
5.9.1. cusparseDenseToSparse_bufferSize
This function returns the size of the workspace needed by cusparseDenseToSparse_analysis(). The value returned is in bytes.
integer(4) function cusparseDenseToSparse_bufferSize(handle, matA, matB, alg, bufferSize)
type(cusparseHandle) :: handle
type(cusparseDnMatDescr) :: MatA
type(cusparseSpMatDescr) :: matB
integer(4) :: alg
integer(8), intent(out) :: bufferSize
5.9.2. cusparseDenseToSparse_analysis
This function updates the number of non-zero elements required in the sparse matrix descriptor matB.
integer(4) function cusparseDenseToSparse_analysis(handle, matA, matB, alg, buffer)
type(cusparseHandle) :: handle
type(cusparseDnMatDescr) :: matA
type(cusparseSpMatDescr) :: matB
integer(4) :: alg
integer(4), device :: buffer(*)
5.9.3. cusparseDenseToSparse_convert
This function fills the sparse matrix values in the area provided in descriptor matB.
integer(4) function cusparseDenseToSparse_convert(handle, matA, matB, alg, buffer)
type(cusparseHandle) :: handle
type(cusparseDnMatDescr) :: matA
type(cusparseSpMatDescr) :: matB
integer(4) :: alg
integer(4), device :: buffer(*)
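A minimal sketch of the full dense-to-CSR conversion sequence follows. It assumes a handle h, a dense descriptor matA created with cusparseCreateDnMat, a CSR descriptor matB created with nnz = 0 and null pointers (as in the SpGEMM example later in this section), a device row-pointer array csrRowPtrC_d(rows+1), and alg set to the default dense-to-sparse algorithm enumerator; cusparseSpMatGetSize and cusparseCsrSetPointers are described later in this section, and the names are illustrative only:
integer(8) :: bsize, nrows, ncols, nnz
integer(4), device, allocatable :: buf_d(:)
status = cusparseDenseToSparse_bufferSize(h, matA, matB, alg, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
status = cusparseDenseToSparse_analysis(h, matA, matB, alg, buf_d)
! matB now carries the number of non-zeros found in the dense matrix
status = cusparseSpMatGetSize(matB, nrows, ncols, nnz)
if (nnz .gt. 0) allocate(csrColIndC_d(nnz), csrValC_d(nnz))
status = cusparseCsrSetPointers(matB, csrRowPtrC_d, csrColIndC_d, csrValC_d)
status = cusparseDenseToSparse_convert(h, matA, matB, alg, buf_d)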
5.9.4. cusparseSparseToDense_bufferSize
This function returns the size of the workspace needed by cusparseSparseToDense_analysis(). The value returned is in bytes.
integer(4) function cusparseSparseToDense_bufferSize(handle, matA, matB, alg, bufferSize)
type(cusparseHandle) :: handle
type(cusparseSpMatDescr) :: MatA
type(cusparseDnMatDescr) :: matB
integer(4) :: alg
integer(8), intent(out) :: bufferSize
5.9.5. cusparseSparseToDense
This function fills the dense values in the area provided in descriptor matB.
integer(4) function cusparseSparseToDense(handle, matA, matB, alg, buffer)
type(cusparseHandle) :: handle
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: matB
integer(4) :: alg
integer(4), device :: buffer(*)
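The reverse direction is simpler because the dense output array is pre-allocated. A minimal sketch, assuming a sparse descriptor matA, a dense descriptor matB created over an existing device array, and alg set to the default sparse-to-dense algorithm enumerator:
status = cusparseSparseToDense_bufferSize(h, matA, matB, alg, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
status = cusparseSparseToDense(h, matA, matB, alg, buf_d)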
5.9.6. cusparseCreateSpVec
This function initializes the sparse vector descriptor used in the generic API. The type, kind, and rank of the input arguments indices and values are actually ignored, and taken from the input arguments idxType and valueType. The vectors are assumed to be contiguous. The idxBase argument is typically CUSPARSE_INDEX_BASE_ONE in Fortran.
integer(4) function cusparseCreateSpVec(descr, size, nnz, indices, values, idxType, idxBase, valueType)
type(cusparseSpVecDescr) :: descr
integer(8) :: size, nnz
integer(4), device :: indices(*)
real(4), device :: values(*)
integer(4) :: idxType, idxBase, valueType
5.9.7. cusparseDestroySpVec
This function releases the host memory associated with the sparse vector descriptor used in the generic API.
integer(4) function cusparseDestroySpVec(descr)
type(cusparseSpVecDescr) :: descr
5.9.8. cusparseSpVecGet
This function returns the fields within the sparse vector descriptor used in the generic API.
integer(4) function cusparseSpVecGet(descr, size, nnz, indices, values, idxType, idxBase, valueType)
type(cusparseSpVecDescr) :: descr
integer(8) :: size, nnz
type(c_devptr) :: indices
type(c_devptr) :: values
integer(4) :: idxType, idxBase, valueType
5.9.9. cusparseSpVecGetIndexBase
This function returns the idxBase field within the sparse vector descriptor used in the generic API.
integer(4) function cusparseSpVecGetIndexBase(descr, idxBase)
type(cusparseSpVecDescr) :: descr
integer(4) :: idxBase
5.9.10. cusparseSpVecGetValues
This function returns the values field within the sparse vector descriptor used in the generic API.
integer(4) function cusparseSpVecGetValues(descr, values)
type(cusparseSpVecDescr) :: descr
type(c_devptr) :: values
5.9.11. cusparseSpVecSetValues
This function sets the values field within the sparse vector descriptor used in the generic API. The type, kind, and rank of the values argument are ignored; the type is determined by the valueType field in the descriptor.
integer(4) function cusparseSpVecSetValues(descr, values)
type(cusparseSpVecDescr) :: descr
real(4), device :: values(*)
5.9.12. cusparseCreateDnVec
This function initializes the dense vector descriptor used in the generic API. The type, kind, and rank of the values input argument are ignored and taken from the input argument valueType. The vector is assumed to be contiguous.
integer(4) function cusparseCreateDnVec(descr, size, values, valueType)
type(cusparseDnVecDescr) :: descr
integer(8) :: size
real(4), device :: values(*)
integer(4) :: valueType
5.9.13. cusparseDestroyDnVec
This function releases the host memory associated with the dense vector descriptor used in the generic API.
integer(4) function cusparseDestroyDnVec(descr)
type(cusparseDnVecDescr) :: descr
5.9.14. cusparseDnVecGet
This function returns the fields within the dense vector descriptor used in the generic API.
integer(4) function cusparseDnVecGet(descr, size, values, valueType)
type(cusparseDnVecDescr) :: descr
integer(8) :: size
type(c_devptr) :: values
integer(4) :: valueType
5.9.15. cusparseDnVecGetValues
This function returns the values field within the dense vector descriptor used in the generic API.
integer(4) function cusparseDnVecGetValues(descr, values)
type(cusparseDnVecDescr) :: descr
type(c_devptr) :: values
5.9.16. cusparseDnVecSetValues
This function sets the values field within the dense vector descriptor used in the generic API. The type, kind, and rank of the values argument are ignored; the type is determined by the valueType field in the descriptor.
integer(4) function cusparseDnVecSetValues(descr, values)
type(cusparseDnVecDescr) :: descr
real(4), device :: values(*)
5.9.17. cusparseCreateCoo
This function initializes the sparse matrix descriptor in COO format used in the generic API. The type, kind, and rank of the input arguments cooRowInd, cooColInd, and cooValues are actually ignored, and taken from the input arguments idxType and valueType. The idxBase argument is typically CUSPARSE_INDEX_BASE_ONE in Fortran.
integer(4) function cusparseCreateCoo(descr, rows, cols, nnz, cooRowInd, cooColInd, cooValues, idxType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
integer(4), device :: cooRowInd(*), cooColInd(*)
real(4), device :: cooValues(*)
integer(4) :: idxType, idxBase, valueType
5.9.18. cusparseCreateCooAoS
This function initializes the sparse matrix descriptor in COO format, with Array of Structures layout, used in the generic API. The type, kind, and rank of the input arguments cooInd and cooValues are actually ignored, and taken from the input arguments idxType and valueType. The idxBase argument is typically CUSPARSE_INDEX_BASE_ONE in Fortran.
integer(4) function cusparseCreateCooAoS(descr, rows, cols, nnz, cooInd, cooValues, idxType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
integer(4), device :: cooInd(*)
real(4), device :: cooValues(*)
integer(4) :: idxType, idxBase, valueType
5.9.19. cusparseCreateCsr
This function initializes the sparse matrix descriptor in CSR format used in the generic API. The type, kind, and rank of the input arguments csrRowOffsets, csrColInd, and csrValues are actually ignored, and taken from the input arguments csrRowOffsetsType, csrColIndType, and valueType. The idxBase argument is typically CUSPARSE_INDEX_BASE_ONE in Fortran.
integer(4) function cusparseCreateCsr(descr, rows, cols, nnz, csrRowOffsets, csrColInd, csrValues, &
csrRowOffsetsType, csrColIndType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
integer(4), device :: csrRowOffsets(*), csrColInd(*)
real(4), device :: csrValues(*)
integer(4) :: csrRowOffsetsType, csrColIndType, idxBase, valueType
5.9.20. cusparseCreateBlockedEll
This function initializes the sparse matrix descriptor in Blocked-Ellpack (ELL) format used in the generic API. The type, kind, and rank of the input arguments ellColInd and ellValues are actually ignored, and taken from the input arguments ellIdxType and valueType. The idxBase argument is typically CUSPARSE_INDEX_BASE_ONE in Fortran.
integer(4) function cusparseCreateBlockedEll(descr, rows, cols, &
ellBlockSize, ellCols, ellColInd, ellValues, ellIdxType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, ellBlockSize, ellCols
integer(4), device :: ellColInd(*)
real(4), device :: ellValues(*)
integer(4) :: ellIdxType, idxBase, valueType
5.9.21. cusparseDestroySpMat
This function releases the host memory associated with the sparse matrix descriptor used in the generic API.
integer(4) function cusparseDestroySpMat(descr)
type(cusparseSpMatDescr) :: descr
5.9.22. cusparseCooGet
This function returns the fields from the sparse matrix descriptor in COO format used in the generic API.
integer(4) function cusparseCooGet(descr, rows, cols, nnz, cooRowInd, cooColInd, cooValues, idxType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
type(c_devptr) :: cooRowInd, cooColInd, cooValues
integer(4) :: idxType, idxBase, valueType
5.9.23. cusparseCooAoSGet
This function returns the fields from the sparse matrix descriptor in COO format, Array of Structures layout, used in the generic API.
integer(4) function cusparseCooAoSGet(descr, rows, cols, nnz, cooInd, cooValues, idxType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
type(c_devptr) :: cooInd, cooValues
integer(4) :: idxType, idxBase, valueType
5.9.24. cusparseCsrGet
This function returns the fields from the sparse matrix descriptor in CSR format used in the generic API.
integer(4) function cusparseCsrGet(descr, rows, cols, nnz, csrRowOffsets, csrColInd, csrValues, &
csrRowOffsetsType, csrColIndType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
type(c_devptr) :: csrRowOffsets, csrColInd, csrValues
integer(4) :: csrRowOffsetsType, csrColIndType, idxBase, valueType
5.9.25. cusparseBlockedEllGet
This function returns the fields from the sparse matrix descriptor stored in Blocked-Ellpack (ELL) format.
integer(4) function cusparseBlockedEllGet(descr, rows, cols, &
ellBlockSize, ellCols, ellColInd, ellValues, ellIdxType, idxBase, valueType)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, ellBlockSize, ellCols
type(c_devptr) :: ellColInd, ellValues
integer(4) :: ellIdxType, idxBase, valueType
5.9.26. cusparseCsrSetPointers
This function sets the pointers in the sparse matrix descriptor in CSR format used in the generic API. Any type for the row offsets, column indices, and values is accepted.
integer(4) function cusparseCsrSetPointers(descr, csrRowOffsets, csrColInd, csrValues)
type(cusparseSpMatDescr) :: descr
integer(4), device :: csrRowOffsets(*), csrColInd(*)
real(4), device :: csrValues(*)
5.9.27. cusparseCscSetPointers
This function sets the pointers in the sparse matrix descriptor in CSC format used in the generic API. Any type for the column offsets, row indices, and values is accepted.
integer(4) function cusparseCscSetPointers(descr, cscColOffsets, cscRowInd, cscValues)
type(cusparseSpMatDescr) :: descr
integer(4), device :: cscColOffsets(*), cscRowInd(*)
real(4), device :: cscValues(*)
5.9.28. cusparseSpMatGetFormat
This function returns the format field within the sparse matrix descriptor used in the generic API. Valid formats for the generic API are CUSPARSE_FORMAT_CSR, CUSPARSE_FORMAT_COO, and CUSPARSE_FORMAT_COO_AOS.
integer(4) function cusparseSpMatGetFormat(descr, format)
type(cusparseSpMatDescr) :: descr
integer(4) :: format
5.9.29. cusparseSpMatGetIndexBase
This function returns the idxBase field within the sparse matrix descriptor used in the generic API.
integer(4) function cusparseSpMatGetIndexBase(descr, idxBase)
type(cusparseSpMatDescr) :: descr
integer(4) :: idxBase
5.9.30. cusparseSpMatGetSize
This function returns the sparse matrix size within the sparse matrix descriptor used in the generic API.
integer(4) function cusparseSpMatGetSize(descr, rows, cols, nnz)
type(cusparseSpMatDescr) :: descr
integer(8) :: rows, cols, nnz
5.9.31. cusparseSpMatGetValues
This function returns the values field within the sparse matrix descriptor used in the generic API.
integer(4) function cusparseSpMatGetValues(descr, values)
type(cusparseSpMatDescr) :: descr
type(c_devptr) :: values
5.9.32. cusparseSpMatSetValues
This function sets the values field within the sparse matrix descriptor used in the generic API. The type, kind, and rank of the values argument are ignored; the type is determined by the valueType field in the descriptor.
integer(4) function cusparseSpMatSetValues(descr, values)
type(cusparseSpMatDescr) :: descr
real(4), device :: values(*)
5.9.33. cusparseSpMatGetStridedBatch
This function returns the batchCount field within the sparse matrix descriptor used in the generic API.
integer(4) function cusparseSpMatGetStridedBatch(descr, batchCount)
type(cusparseSpMatDescr) :: descr
integer(4) :: batchCount
5.9.34. cusparseSpMatSetStridedBatch
This function sets the batchCount field within the sparse matrix descriptor used in the generic API. It is removed in CUDA 12.0 and later versions.
integer(4) function cusparseSpMatSetStridedBatch(descr, batchCount)
type(cusparseSpMatDescr) :: descr
integer(4) :: batchCount
5.9.35. cusparseCooSetStridedBatch
This function sets the batchCount and batchStride fields within the COO sparse matrix descriptor.
integer(4) function cusparseCooSetStridedBatch(descr, batchCount, &
batchStride)
type(cusparseSpMatDescr) :: descr
integer(4) :: batchCount
integer(8) :: batchStride
5.9.36. cusparseCsrSetStridedBatch
This function sets the batchCount and batchStride fields within the CSR sparse matrix descriptor.
integer(4) function cusparseCsrSetStridedBatch(descr, batchCount, &
offsetsBatchStride, columnsValuesBatchStride)
type(cusparseSpMatDescr) :: descr
integer(4) :: batchCount
integer(8) :: offsetsBatchStride, columnsValuesBatchStride
5.9.37. cusparseBsrSetStridedBatch
This function sets the batchCount and batchStride fields within the BSR sparse matrix descriptor.
integer(4) function cusparseBsrSetStridedBatch(descr, batchCount, &
offsetsBatchStride, columnsBatchStride, valuesBatchStride)
type(cusparseSpMatDescr) :: descr
integer(4) :: batchCount
integer(8) :: offsetsBatchStride, columnsBatchStride
integer(8) :: valuesBatchStride
5.9.38. cusparseSpMatGetAttribute
This function returns the sparse matrix attribute from the sparse matrix descriptor used in the generic API. Attribute can currently be CUSPARSE_SPMAT_FILL_MODE or CUSPARSE_SPMAT_DIAG_TYPE.
integer(4) function cusparseSpMatGetAttribute(descr, attribute, data, dataSize)
type(cusparseSpMatDescr), intent(in) :: descr
integer(4), intent(in) :: attribute
integer(4), intent(out) :: data
integer(8), intent(in) :: dataSize
5.9.39. cusparseSpMatSetAttribute
This function sets the sparse matrix attribute for the sparse matrix descriptor used in the generic API. Attribute can currently be CUSPARSE_SPMAT_FILL_MODE or CUSPARSE_SPMAT_DIAG_TYPE.
integer(4) function cusparseSpMatSetAttribute(descr, attribute, data, dataSize)
type(cusparseSpMatDescr), intent(out) :: descr
integer(4), intent(in) :: attribute
integer(4), intent(in) :: data
integer(8), intent(in) :: dataSize
5.9.40. cusparseCreateDnMat
This function initializes the dense matrix descriptor used in the generic API. The type, kind, and rank of the values input argument are ignored and taken from the input argument valueType. The order argument in Fortran should normally be CUSPARSE_ORDER_COL.
integer(4) function cusparseCreateDnMat(descr, rows, cols, ld, values, valueType, order)
type(cusparseDnMatDescr) :: descr
integer(8) :: rows, cols, ld
real(4), device :: values(*)
integer(4), value :: valueType, order
5.9.41. cusparseDestroyDnMat
This function releases the host memory associated with the dense matrix descriptor used in the generic API.
integer(4) function cusparseDestroyDnMat(descr)
type(cusparseDnMatDescr) :: descr
5.9.42. cusparseDnMatGet
This function returns the fields from the dense matrix descriptor used in the generic API.
integer(4) function cusparseDnMatGet(descr, rows, cols, ld, values, valueType, order)
type(cusparseDnMatDescr) :: descr
integer(8) :: rows, cols, ld
type(c_devptr) :: values
integer(4) :: valueType, order
5.9.43. cusparseDnMatGetValues
This function returns the values field within the dense matrix descriptor used in the generic API.
integer(4) function cusparseDnMatGetValues(descr, values)
type(cusparseDnMatDescr) :: descr
type(c_devptr) :: values
5.9.44. cusparseDnMatSetValues
This function sets the values field within the dense matrix descriptor used in the generic API. The type, kind, and rank of the values argument are ignored; the type is determined by the valueType field in the descriptor.
integer(4) function cusparseDnMatSetValues(descr, values)
type(cusparseDnMatDescr) :: descr
real(4), device :: values(*)
5.9.45. cusparseDnMatGetStridedBatch
This function returns the batchCount field within the dense matrix descriptor used in the generic API.
integer(4) function cusparseDnMatGetStridedBatch(descr, batchCount)
type(cusparseDnMatDescr) :: descr
integer(4) :: batchCount
5.9.46. cusparseDnMatSetStridedBatch
This function sets the batchCount field within the dense matrix descriptor used in the generic API.
integer(4) function cusparseDnMatSetStridedBatch(descr, batchCount)
type(cusparseDnMatDescr) :: descr
integer(4) :: batchCount
5.9.47. cusparseSpVV_bufferSize
This function returns the size of the workspace needed by cusparseSpVV(). The value returned is in bytes.
integer(4) function cusparseSpVV_bufferSize(handle, opX, vecX, vecY, result, computeType, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: opX
type(cusparseSpVecDescr) :: vecX
type(cusparseDnVecDescr) :: vecY
real(4), device :: result ! device or host variable
integer(4) :: computeType
integer(8), intent(out) :: bufferSize
5.9.48. cusparseSpVV
This function forms the dot product of a sparse vector vecX and a dense vector vecY. The buffer argument can be any type, but the size should be greater than or equal to the size returned from cusparseSpVV_bufferSize(). See the cuSPARSE Library documentation for the datatype and computeType combinations supported in each release.
integer(4) function cusparseSpVV(handle, opX, vecX, vecY, result, computeType, buffer)
type(cusparseHandle) :: handle
integer(4) :: opX
type(cusparseSpVecDescr) :: vecX
type(cusparseDnVecDescr) :: vecY
real(4), device :: result ! device or host variable
integer(4) :: computeType
integer(4), device :: buffer(*)
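A minimal single-precision sketch follows, assuming a handle h, device arrays xInd_d(nnz) and xVal_d(nnz) for the sparse vector, and yVal_d(n) for the dense vector; n and nnz are integer(8) values and the names are illustrative only:
type(cusparseSpVecDescr) :: vecX
type(cusparseDnVecDescr) :: vecY
real(4) :: result
integer(8) :: bsize
integer(4), device, allocatable :: buf_d(:)
status = cusparseCreateSpVec(vecX, n, nnz, xInd_d, xVal_d, &
         CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ONE, CUDA_R_32F)
status = cusparseCreateDnVec(vecY, n, yVal_d, CUDA_R_32F)
status = cusparseSpVV_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, vecX, vecY, &
         result, CUDA_R_32F, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
status = cusparseSpVV(h, CUSPARSE_OPERATION_NON_TRANSPOSE, vecX, vecY, &
         result, CUDA_R_32F, buf_d)
! result now holds the dot product on the host
status = cusparseDestroySpVec(vecX)
status = cusparseDestroyDnVec(vecY)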
5.9.49. cusparseSpMV_bufferSize
This function returns the size of the workspace needed by cusparseSpMV(). The value returned is in bytes.
integer(4) function cusparseSpMV_bufferSize(handle, opA, alpha, matA, vecX, beta, vecY, computeType, alg, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: opA
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnVecDescr) :: vecX, vecY
integer(4) :: computeType, alg
integer(8), intent(out) :: bufferSize
5.9.50. cusparseSpMV
This function forms the multiplication of a sparse matrix matA and a dense vector vecX to produce the dense vector vecY. The buffer argument can be any type, but the size should be greater than or equal to the size returned from cusparseSpMV_bufferSize(). The type of the arguments alpha and beta should match the computeType argument. See the cuSPARSE Library documentation for the datatype, computeType, sparse format, and alg combinations supported in each release.
integer(4) function cusparseSpMV(handle, opA, alpha, matA, vecX, beta, vecY, computeType, alg, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA
real(4), device :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnVecDescr) :: vecX, vecY
integer(4) :: computeType, alg
integer(4), device :: buffer(*)
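As a minimal double-precision sketch (assuming a handle h, device CSR arrays csrRowPtr_d(m+1), csrColInd_d(nnz), and csrVal_d(nnz), dense device vectors x_d(n) and y_d(m), integer(8) dimensions, and CUSPARSE_SPMV_ALG_DEFAULT as the default algorithm enumerator in this module version; the names are illustrative only):
type(cusparseSpMatDescr) :: matA
type(cusparseDnVecDescr) :: vecX, vecY
real(8) :: alpha, beta
integer(8) :: bsize
integer(4), device, allocatable :: buf_d(:)
alpha = 1.0d0; beta = 0.0d0
status = cusparseCreateCsr(matA, m, n, nnz, csrRowPtr_d, csrColInd_d, csrVal_d, &
         CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ONE, CUDA_R_64F)
status = cusparseCreateDnVec(vecX, n, x_d, CUDA_R_64F)
status = cusparseCreateDnVec(vecY, m, y_d, CUDA_R_64F)
status = cusparseSpMV_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, &
         vecX, beta, vecY, CUDA_R_64F, CUSPARSE_SPMV_ALG_DEFAULT, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
! y = alpha*A*x + beta*y
status = cusparseSpMV(h, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX, &
         beta, vecY, CUDA_R_64F, CUSPARSE_SPMV_ALG_DEFAULT, buf_d)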
5.9.51. cusparseSpSV_CreateDescr
This function initializes the sparse matrix descriptor used in solving a sparse triangular system.
integer(4) function cusparseSpSV_CreateDescr(descr)
type(cusparseSpSVDescr) :: descr
5.9.52. cusparseSpSV_DestroyDescr
This function frees and destroys the sparse matrix descriptor used in solving a sparse triangular system.
integer(4) function cusparseSpSV_DestroyDescr(descr)
type(cusparseSpSVDescr) :: descr
5.9.53. cusparseSpSV_bufferSize
This function returns the size of the workspace needed by cusparseSpSV(). The value returned is in bytes.
integer(4) function cusparseSpSV_bufferSize(handle, opA, alpha, matA, vecX, vecY, &
computeType, alg, spsvDescr, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: opA
real(4) :: alpha ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnVecDescr) :: vecX, vecY
integer(4) :: computeType, alg
type(cusparseSpSVDescr) :: spsvDescr
integer(8), intent(out) :: bufferSize
5.9.54. cusparseSpSV_analysis
This function performs the analysis phase needed by cusparseSpSV().
integer(4) function cusparseSpSV_analysis(handle, opA, alpha, matA, vecX, vecY, &
computeType, alg, spsvDescr, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA
real(4) :: alpha ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnVecDescr) :: vecX, vecY
integer(4) :: computeType, alg
type(cusparseSpSVDescr) :: spsvDescr
integer(4), device :: buffer(*)
5.9.55. cusparseSpSV_solve
This function executes the solve phase for the sparse triangular linear system.
integer(4) function cusparseSpSV_solve(handle, opA, alpha, matA, vecX, vecY, &
computeType, alg, spsvDescr)
type(cusparseHandle) :: handle
integer(4) :: opA
real(4) :: alpha ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnVecDescr) :: vecX, vecY
integer(4) :: computeType, alg
type(cusparseSpSVDescr) :: spsvDescr
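As a minimal double-precision sketch of the complete SpSV sequence for a lower-triangular CSR matrix (assuming a handle h, an existing CSR descriptor matA, dense vector descriptors vecX and vecY, and that the enumerators CUSPARSE_FILL_MODE_LOWER, CUSPARSE_DIAG_TYPE_NON_UNIT, and CUSPARSE_SPSV_ALG_DEFAULT are available in this module version):
type(cusparseSpSVDescr) :: spsvDescr
real(8) :: alpha
integer(8) :: bsize
integer(4), device, allocatable :: buf_d(:)
alpha = 1.0d0
! Mark matA as lower triangular with a non-unit diagonal
status = cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_FILL_MODE, CUSPARSE_FILL_MODE_LOWER, 4_8)
status = cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_DIAG_TYPE, CUSPARSE_DIAG_TYPE_NON_UNIT, 4_8)
status = cusparseSpSV_CreateDescr(spsvDescr)
status = cusparseSpSV_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, &
         vecX, vecY, CUDA_R_64F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
status = cusparseSpSV_analysis(h, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, &
         vecX, vecY, CUDA_R_64F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, buf_d)
! Solve op(A) * y = alpha * x
status = cusparseSpSV_solve(h, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, &
         vecX, vecY, CUDA_R_64F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr)
status = cusparseSpSV_DestroyDescr(spsvDescr)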
5.9.56. cusparseSpSV_updateMatrix
This function updates the values in the sparse matrix used by cusparseSpSV(). The value type should match the sparse matrix type. The updates can be to the entire matrix, CUSPARSE_SPSV_UPDATE_GENERAL, or to the diagonal, CUSPARSE_SPSV_UPDATE_DIAGONAL.
integer(4) function cusparseSpSV_updateMatrix(handle, spsvDescr, &
newValues, updatePart)
type(cusparseHandle) :: handle
type(cusparseSpSVDescr) :: spsvDescr
real(4), device :: newValues(*) ! Any type, same as the Matrix
integer(4) :: updatePart ! General or Diagonal
5.9.57. cusparseSpMM_bufferSize
This function returns the size of the workspace needed by cusparseSpMM(). The value returned is in bytes.
integer(4) function cusparseSpMM_bufferSize(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4), device :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: matB, matC
integer(4) :: computeType, alg
integer(8), intent(out) :: bufferSize
5.9.58. cusparseSpMM_preprocess
This function can be used to speed up the cusparseSpMM computation. See the cuSPARSE Library documentation for the capabilities supported in each release.
integer(4) function cusparseSpMM_preprocess(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4), device :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: matB, matC
integer(4) :: computeType, alg
integer(4), device :: buffer(*)
5.9.59. cusparseSpMM
This function forms the multiplication of a sparse matrix matA and a dense matrix matB to produce the dense matrix matC. The buffer argument can be any type, but the size should be greater than or equal to the size returned from cusparseSpMM_bufferSize(). The type of the arguments alpha and beta should match the computeType argument. See the cuSPARSE Library documentation for the datatype, computeType, sparse format, and alg combinations supported in each release.
integer(4) function cusparseSpMM(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4), device :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: matB, matC
integer(4) :: computeType, alg
integer(4), device :: buffer(*)
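As a minimal double-precision sketch (assuming a handle h, an m x k CSR descriptor matA, column-major dense device arrays B_d(ldb,n) and C_d(ldc,n), integer(8) dimensions, and CUSPARSE_SPMM_ALG_DEFAULT as the default algorithm enumerator in this module version; the names are illustrative only):
type(cusparseDnMatDescr) :: matB, matC
real(8) :: alpha, beta
integer(8) :: bsize
integer(4), device, allocatable :: buf_d(:)
alpha = 1.0d0; beta = 0.0d0
status = cusparseCreateDnMat(matB, k, n, ldb, B_d, CUDA_R_64F, CUSPARSE_ORDER_COL)
status = cusparseCreateDnMat(matC, m, n, ldc, C_d, CUDA_R_64F, CUSPARSE_ORDER_COL)
status = cusparseSpMM_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
         CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, matB, beta, matC, &
         CUDA_R_64F, CUSPARSE_SPMM_ALG_DEFAULT, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
! C = alpha*A*B + beta*C
status = cusparseSpMM(h, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, &
         alpha, matA, matB, beta, matC, CUDA_R_64F, CUSPARSE_SPMM_ALG_DEFAULT, buf_d)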
5.9.60. cusparseSpSM_CreateDescr
This function initializes the sparse matrix descriptor used in solving a sparse triangular system, when there are multiple RHS vectors.
integer(4) function cusparseSpSM_CreateDescr(descr)
type(cusparseSpSMDescr) :: descr
5.9.61. cusparseSpSM_DestroyDescr
This function frees and destroys the sparse matrix descriptor used in solving a sparse triangular system.
integer(4) function cusparseSpSM_DestroyDescr(descr)
type(cusparseSpSMDescr) :: descr
5.9.62. cusparseSpSM_bufferSize
This function returns the size of the workspace needed by cusparseSpSM(). The value returned is in bytes.
integer(4) function cusparseSpSM_bufferSize(handle, opA, opB, alpha, matA, matB, matC, &
computeType, alg, spsmDescr, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpSMDescr) :: spsmDescr
integer(8), intent(out) :: bufferSize
5.9.63. cusparseSpSM_analysis
This function performs the analysis phase needed by cusparseSpSM().
integer(4) function cusparseSpSM_analysis(handle, opA, opB, alpha, &
matA, matB, matC, computeType, alg, spsmDescr, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: matB, matC
integer(4) :: computeType, alg
type(cusparseSpSMDescr) :: spsmDescr
integer(4), device :: buffer(*)
5.9.64. cusparseSpSM_solve
This function executes the solve phase for the sparse triangular linear system.
integer(4) function cusparseSpSM_solve(handle, opA, opB, &
alpha, matA, matB, matC, computeType, alg, spsmDescr)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha ! device or host variable
type(cusparseSpMatDescr) :: matA
type(cusparseDnMatDescr) :: matB, matC
integer(4) :: computeType, alg
type(cusparseSpSMDescr) :: spsmDescr
5.9.65. cusparseSDDMM_bufferSize
This function returns the size of the workspace needed by cusparseSDDMM(). The value returned is in bytes.
integer(4) function cusparseSDDMM_bufferSize(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, bufferSize)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4), device :: alpha, beta ! device or host variable
type(cusparseDnMatDescr) :: matA, matB
type(cusparseSpMatDescr) :: matC
integer(4) :: computeType, alg
integer(8), intent(out) :: bufferSize
5.9.66. cusparseSDDMM_preprocess
This function can be used to speed up the cusparseSDDMM computation. See the cuSPARSE Library documentation for the capabilities supported in each release.
integer(4) function cusparseSDDMM_preprocess(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4), device :: alpha, beta ! device or host variable
type(cusparseDnMatDescr) :: matA, matB
type(cusparseSpMatDescr) :: matC
integer(4) :: computeType, alg
integer(4), device :: buffer(*)
5.9.67. cusparseSDDMM
This function forms the multiplication of a dense matrix matA and a dense matrix matB, followed by an element-wise multiplication with the sparsity pattern of matrix matC. The buffer argument can be any type, but the size should be greater than or equal to the size returned from cusparseSDDMM_bufferSize(). The type of the arguments alpha and beta should match the computeType argument. See the cuSPARSE Library documentation for the datatype, computeType, sparse format, and alg combinations supported in each release.
integer(4) function cusparseSDDMM(handle, opA, opB, alpha, matA, matB, beta, matC, computeType, alg, buffer)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4), device :: alpha, beta ! device or host variable
type(cusparseDnMatDescr) :: matA, matB
type(cusparseSpMatDescr) :: matC
integer(4) :: computeType, alg
integer(4), device :: buffer(*)
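The calling pattern mirrors cusparseSpMM, except that matA and matB are dense descriptors and matC is a sparse (for example CSR) descriptor whose sparsity pattern selects the entries to compute. A minimal sketch, assuming those descriptors already exist and that CUSPARSE_SDDMM_ALG_DEFAULT is the default algorithm enumerator in this module version:
status = cusparseSDDMM_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
         CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, matB, beta, matC, &
         CUDA_R_64F, CUSPARSE_SDDMM_ALG_DEFAULT, bsize)
if (bsize .gt. 0) allocate(buf_d((bsize+3)/4))   ! bsize is in bytes
! Only the non-zero positions of C are computed: C = alpha*(A*B) masked by C's pattern + beta*C
status = cusparseSDDMM(h, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, &
         alpha, matA, matB, beta, matC, CUDA_R_64F, CUSPARSE_SDDMM_ALG_DEFAULT, buf_d)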
5.9.68. cusparseSpGEMM_CreateDescr
This function initializes the sparse matrix descriptor used in multiplying two sparse matrices together.
integer(4) function cusparseSpGEMM_CreateDescr(descr)
type(cusparseSpGEMMDescr) :: descr
5.9.69. cusparseSpGEMM_DestroyDescr
This function frees and destroys the sparse matrix descriptor used in multiplying two sparse matrices together.
integer(4) function cusparseSpGEMM_DestroyDescr(descr)
type(cusparseSpGEMMDescr) :: descr
5.9.70. cusparseSpGEMM_workEstimation
This function and cusparseSpGEMM_compute() are both used for determining the buffer requirements and for performing the actual computation. In the typical usage, the first call to cusparseSpGEMM_workEstimation() should pass a null pointer for buffer1. The function will return the size requirement needed. Once that space is allocated, another call to cusparseSpGEMM_workEstimation() should be made with the actual allocated buffer1 argument.
integer(4) function cusparseSpGEMM_workEstimation(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr, bufferSize1, buffer1)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
integer(8) :: bufferSize1
integer(4), device :: buffer1
5.9.71. cusparseSpGEMM_getNumProducts
This function queries the SpGEMM descriptor for the number of intermediate products.
integer(4) function cusparseSpGEMM_getNumProducts(descr, num_prods)
type(cusparseSpGEMMDescr) :: descr
integer(8) :: num_prods
5.9.72. cusparseSpGEMM_estimateMemory
This function, along with cusparseSpGEMM_compute() and cusparseSpGEMM_workEstimation(), is used in determining the memory requirements for performing SpGEMM Algorithms 2 and 3.
integer(4) function cusparseSpGEMM_estimateMemory(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr, &
chunk_fraction, bufferSize3, buffer3, bufferSize2)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
real(4) :: chunk_fraction
integer(8) :: bufferSize3
integer(4), device :: buffer3 ! Any type is okay
integer(8) :: bufferSize2
5.9.73. cusparseSpGEMM_compute
This function and cusparseSpGEMM_workEstimation() are both used for determining the buffer requirements and for performing the actual computation. In the typical usage, the first call to cusparseSpGEMM_compute() should pass a null pointer for buffer2. The function will return the size requirement needed. Once that space is allocated, another call to cusparseSpGEMM_compute() should be made with the actual allocated buffer2 argument.
integer(4) function cusparseSpGEMM_compute(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr, bufferSize2, buffer2)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
integer(8) :: bufferSize2
integer(4), device :: buffer2
5.9.74. cusparseSpGEMM_copy
This function, along with cusparseSpGEMM_workEstimation() and cusparseSpGEMM_compute(), is used in performing a sparse matrix multiply. Pointers to the work buffers are kept in the spgemmDescr argument.
integer(4) function cusparseSpGEMM_copy(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
An example of the set of calls needed to perform a sparse matrix multiply follows:
! Create a csr descriptor for sparse C. Sizes unknown
status = cusparseCreateCsr(matC, nd, nd, 0, nullp, nullp, &
nullp, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I, &
CUSPARSE_INDEX_BASE_ONE, CUDA_R_64F)
call printStatus('cusparseCreateCsr', status, CUSPARSE_STATUS_SUCCESS)
status = cusparseSpGEMM_workEstimation(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
CUSPARSE_OPERATION_NON_TRANSPOSE, Dalpha, matA, matB, Dbeta, &
matC, CUDA_R_64F, CUSPARSE_SPGEMM_DEFAULT, spgemmd, bsize, nullp)
call printStatus('cusparseSpGEMM_workEstimation', status, CUSPARSE_STATUS_SUCCESS)
print *,"SpGEMM buffersize1 required: ",bsize
if (bsize.gt.0) allocate(buffer_d(bsize))
status = cusparseSpGEMM_workEstimation(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
CUSPARSE_OPERATION_NON_TRANSPOSE, Dalpha, matA, matB, Dbeta, &
matC, CUDA_R_64F, CUSPARSE_SPGEMM_DEFAULT, spgemmd, bsize, buffer_d)
call printStatus('cusparseSpGEMM_workEstimation', status, CUSPARSE_STATUS_SUCCESS)
status = cusparseSpGEMM_compute(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
CUSPARSE_OPERATION_NON_TRANSPOSE, Dalpha, matA, matB, Dbeta, &
matC, CUDA_R_64F, CUSPARSE_SPGEMM_DEFAULT, spgemmd, bsize2, nullp)
call printStatus('cusparseSpGEMM_compute', status, CUSPARSE_STATUS_SUCCESS)
print *,"SpGEMM buffersize2 required: ",bsize2
if (bsize2.gt.0) allocate(buffer2_d(bsize2))
status = cusparseSpGEMM_compute(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
CUSPARSE_OPERATION_NON_TRANSPOSE, Dalpha, matA, matB, Dbeta, &
matC, CUDA_R_64F, CUSPARSE_SPGEMM_DEFAULT, spgemmd, bsize2, buffer2_d)
call printStatus('cusparseSpGEMM_compute', status, CUSPARSE_STATUS_SUCCESS)
status = cusparseSpMatGetSize(matC, nrows, ncols, nnz)
print *,"SpGEMM C matrix sizes: ",nrows, ncols, nnz
if (nrows.gt.0) allocate(csrRowPtrC_d(nrows+1))
if (nnz.gt.0) allocate(csrColIndC_d(nnz))
if (nnz.gt.0) allocate(csrValDC_d(nnz))
status = cusparseCsrSetPointers(matC, csrRowPtrC_d, csrColIndC_d, csrValDC_d)
call printStatus('cusparseCsrSetPointers', status, CUSPARSE_STATUS_SUCCESS)
status = cusparseSpGEMM_copy(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
CUSPARSE_OPERATION_NON_TRANSPOSE, Dalpha, matA, matB, Dbeta, &
matC, CUDA_R_64F, CUSPARSE_SPGEMM_DEFAULT, spgemmd)
call printStatus('cusparseSpGEMM_copy', status, CUSPARSE_STATUS_SUCCESS)
if (bsize.gt.0) deallocate(buffer_d)
if (bsize2.gt.0) deallocate(buffer2_d)
5.9.75. cusparseSpGEMMreuse_workEstimation
This function, along with the other cusparseSpGEMMreuse routines, is used for determining the buffer requirements and for performing the actual computation. In the typical usage, the first call to cusparseSpGEMMreuse_workEstimation() should pass a null pointer for buffer1. The function will return the size requirement needed. Once that space is allocated, another call to cusparseSpGEMMreuse_workEstimation() should be made with the actual allocated buffer1 argument.
integer(4) function cusparseSpGEMMreuse_workEstimation(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr, bufferSize1, buffer1)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
integer(8) :: bufferSize1
integer(4), device :: buffer1
5.9.76. cusparseSpGEMMreuse_nnz
This function is part of the cusparseSpGEMMreuse sequence and follows the same calling pattern as cusparseSpGEMMreuse_workEstimation(): in the typical usage, the first call should pass null pointers for buffer2, buffer3, and buffer4, and the function returns the size requirements. Once that space is allocated, another call should be made with the actual allocated buffer arguments.
integer(4) function cusparseSpGEMMreuse_nnz(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr, bufferSize2, buffer2, &
bufferSize3, buffer3, bufferSize4, buffer4)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
integer(8) :: bufferSize2, bufferSize3, bufferSize4
integer(4), device :: buffer2, buffer3, buffer4
5.9.77. cusparseSpGEMMreuse_copy
This function is part of the cusparseSpGEMMreuse sequence and follows the same calling pattern as cusparseSpGEMMreuse_workEstimation(): in the typical usage, the first call should pass a null pointer for buffer5, and the function returns the size requirement. Once that space is allocated, another call should be made with the actual allocated buffer5 argument.
integer(4) function cusparseSpGEMMreuse_copy(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr, bufferSize5, buffer5)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
integer(8) :: bufferSize5
integer(4), device :: buffer5
5.9.78. cusparseSpGEMMreuse_compute
This function performs the actual computation in the cusparseSpGEMMreuse sequence, after the structure of matC has been established by cusparseSpGEMMreuse_workEstimation(), cusparseSpGEMMreuse_nnz(), and cusparseSpGEMMreuse_copy(). Pointers to the work buffers are kept in the spgemmDescr argument.
integer(4) function cusparseSpGEMMreuse_compute(handle, opA, opB, &
alpha, matA, matB, beta, matC, computeType, alg, spgemmDescr)
type(cusparseHandle) :: handle
integer(4) :: opA, opB
real(4) :: alpha, beta ! device or host variable
type(cusparseSpMatDescr) :: matA, MatB, MatC
integer(4) :: computeType, alg
type(cusparseSpGEMMDescr) :: spgemmDescr
6. Matrix Solver Runtime Library APIs
This section describes the Fortran interfaces to the CUDA cuSOLVER library. The cuSOLVER functions are only accessible from host code. All of the runtime API routines are integer functions that return an error code; they return a value of CUSOLVER_STATUS_SUCCESS if the call was successful, or another cuSOLVER status return value if there was an error.
Currently, we provide Fortran interfaces to cuSolverDn, the dense LAPACK functions.
Chapter 10 contains examples of accessing the cuSOLVER library routines from OpenACC and CUDA Fortran. In both cases, the interfaces to the library can be exposed by adding the line
use cusolverDn
to your program unit.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
6.1. CUSOLVER Definitions and Helper Functions
This section contains definitions and data types used in the cuSOLVER library and interfaces to the cuSOLVER helper functions.
The cuSOLVER module contains the following parameter definitions:
! Definitions from cusolver_common.h
integer, parameter :: CUSOLVER_VER_MAJOR = 11
integer, parameter :: CUSOLVER_VER_MINOR = 6
integer, parameter :: CUSOLVER_VER_PATCH = 4
The cuSOLVER module contains the following enumerations:
enum, bind(c)
enumerator :: CUSOLVER_STATUS_SUCCESS = 0
enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED = 1
enumerator :: CUSOLVER_STATUS_ALLOC_FAILED = 2
enumerator :: CUSOLVER_STATUS_INVALID_VALUE = 3
enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH = 4
enumerator :: CUSOLVER_STATUS_MAPPING_ERROR = 5
enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED = 6
enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR = 7
enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED = 8
enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED = 9
enumerator :: CUSOLVER_STATUS_ZERO_PIVOT = 10
enumerator :: CUSOLVER_STATUS_INVALID_LICENSE = 11
enumerator :: CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED= 12
enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID = 13
enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_PREC = 14
enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_REFINE = 15
enumerator :: CUSOLVER_STATUS_IRS_PARAMS_INVALID_MAXITER= 16
enumerator :: CUSOLVER_STATUS_IRS_INTERNAL_ERROR = 20
enumerator :: CUSOLVER_STATUS_IRS_NOT_SUPPORTED = 21
enumerator :: CUSOLVER_STATUS_IRS_OUT_OF_RANGE = 22
enumerator :: CUSOLVER_STATUS_IRS_NRHS_NOT_SUPPORTED_FOR_REFINE_GMRES=23
enumerator :: CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED = 25
enumerator :: CUSOLVER_STATUS_IRS_INFOS_NOT_DESTROYED = 26
enumerator :: CUSOLVER_STATUS_IRS_MATRIX_SINGULAR = 30
enumerator :: CUSOLVER_STATUS_INVALID_WORKSPACE = 31
end enum
enum, bind(c)
enumerator :: CUSOLVER_EIG_TYPE_1 = 1
enumerator :: CUSOLVER_EIG_TYPE_2 = 2
enumerator :: CUSOLVER_EIG_TYPE_3 = 3
end enum
enum, bind(c)
enumerator :: CUSOLVER_EIG_MODE_NOVECTOR = 0
enumerator :: CUSOLVER_EIG_MODE_VECTOR = 1
end enum
enum, bind(c)
enumerator :: CUSOLVER_EIG_RANGE_ALL = 1001
enumerator :: CUSOLVER_EIG_RANGE_I = 1002
enumerator :: CUSOLVER_EIG_RANGE_V = 1003
end enum
enum, bind(c)
enumerator :: CUSOLVER_INF_NORM = 104
enumerator :: CUSOLVER_MAX_NORM = 105
enumerator :: CUSOLVER_ONE_NORM = 106
enumerator :: CUSOLVER_FRO_NORM = 107
end enum
enum, bind(c)
enumerator :: CUSOLVER_IRS_REFINE_NOT_SET = 1100
enumerator :: CUSOLVER_IRS_REFINE_NONE = 1101
enumerator :: CUSOLVER_IRS_REFINE_CLASSICAL = 1102
enumerator :: CUSOLVER_IRS_REFINE_CLASSICAL_GMRES = 1103
enumerator :: CUSOLVER_IRS_REFINE_GMRES = 1104
enumerator :: CUSOLVER_IRS_REFINE_GMRES_GMRES = 1105
enumerator :: CUSOLVER_IRS_REFINE_GMRES_NOPCOND = 1106
enumerator :: CUSOLVER_PREC_DD = 1150
enumerator :: CUSOLVER_PREC_SS = 1151
enumerator :: CUSOLVER_PREC_SHT = 1152
end enum
enum, bind(c)
enumerator :: CUSOLVER_R_8I = 1201
enumerator :: CUSOLVER_R_8U = 1202
enumerator :: CUSOLVER_R_64F = 1203
enumerator :: CUSOLVER_R_32F = 1204
enumerator :: CUSOLVER_R_16F = 1205
enumerator :: CUSOLVER_R_16BF = 1206
enumerator :: CUSOLVER_R_TF32 = 1207
enumerator :: CUSOLVER_R_AP = 1208
enumerator :: CUSOLVER_C_8I = 1211
enumerator :: CUSOLVER_C_8U = 1212
enumerator :: CUSOLVER_C_64F = 1213
enumerator :: CUSOLVER_C_32F = 1214
enumerator :: CUSOLVER_C_16F = 1215
enumerator :: CUSOLVER_C_16BF = 1216
enumerator :: CUSOLVER_C_TF32 = 1217
enumerator :: CUSOLVER_C_AP = 1218
end enum
enum, bind(c)
enumerator :: CUSOLVER_ALG_0 = 0
enumerator :: CUSOLVER_ALG_1 = 1
end enum
enum, bind(c)
enumerator :: CUBLAS_STOREV_COLUMNWISE = 0
enumerator :: CUBLAS_STOREV_ROWWISE = 1
end enum
enum, bind(c)
enumerator :: CUBLAS_DIRECT_FORWARD = 0
enumerator :: CUBLAS_DIRECT_BACKWARD = 1
end enum
The cuSOLVERDN module contains the following enumerations and type definitions:
type cusolverDnHandle
type(c_ptr) :: handle
end type
type cusolverDnParams
type(c_ptr) :: params
end type
type cusolverDnSyevjInfo
type(c_ptr) :: info
end type
enum, bind(c)
enumerator :: CUSOLVERDN_GETRF = 0
end enum
6.1.1. cusolverDnCreate
This function initializes the cusolverDn library and creates a handle on the cusolverDn context. It must be called before any other cuSolverDn API function is invoked. It allocates hardware resources necessary for accessing the GPU.
integer(4) function cusolverDnCreate(handle)
type(cusolverDnHandle) :: handle
6.1.2. cusolverDnDestroy
This function releases CPU-side resources used by the cuSolverDn library.
integer(4) function cusolverDnDestroy(handle)
type(cusolverDnHandle) :: handle
6.1.3. cusolverDnCreateParams
This function creates and initializes the 64-bit API structure to default values.
integer(4) function cusolverDnCreateParams(params)
type(cusolverDnParams) :: params
6.1.4. cusolverDnDestroyParams
This function releases any resources used by the 64-bit API structure.
integer(4) function cusolverDnDestroyParams(params)
type(cusolverDnParams) :: params
6.1.5. cusolverDnCreateSyevjInfo
This function creates and initializes the Syevj API structure to default values.
integer(4) function cusolverDnCreateSyevjInfo(info)
type(cusolverDnSyevjInfo) :: info
6.1.6. cusolverDnDestroySyevjInfo
This function releases any resources used by the Syevj API structure.
integer(4) function cusolverDnDestroySyevjInfo(info)
type(cusolverDnSyevjInfo) :: info
6.1.7. cusolverDnSetAdvOptions
This function configures the algorithm used in the 64-bit API.
integer(4) function cusolverDnSetAdvOptions(params, function, algo)
type(cusolverDnParams) :: params
integer(4) :: function, algo
6.1.8. cusolverDnGetStream
This function gets the stream used by the cuSolverDn library to execute its routines.
integer(4) function cusolverDnGetStream(handle, stream)
type(cusolverDnHandle) :: handle
integer(cuda_stream_kind) :: stream
6.1.9. cusolverDnSetStream
This function sets the stream to be used by the cuSolverDn library to execute its routines.
integer(4) function cusolverDnSetStream(handle, stream)
type(cusolverDnHandle) :: handle
integer(cuda_stream_kind) :: stream
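As a minimal sketch of typical context setup (assuming the cudafor and cusolverDn modules are used; the stream is created with cudaStreamCreate from the CUDA Fortran runtime):
type(cusolverDnHandle) :: h
integer(cuda_stream_kind) :: stream
integer(4) :: status, istat
istat  = cudaStreamCreate(stream)
status = cusolverDnCreate(h)
status = cusolverDnSetStream(h, stream)
! ... cuSolverDn calls made with h now execute on this stream ...
status = cusolverDnDestroy(h)
istat  = cudaStreamDestroy(stream)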
6.1.10. cusolverDnXsyevjSetTolerance
This function configures the tolerance value used in the Xsyevj functions.
integer(4) function cusolverDnXsyevjSetTolerance(info, tolerance)
type(cusolverDnSyevjInfo) :: info
real(8) :: tolerance
6.1.11. cusolverDnXsyevjSetMaxSweeps
This function sets the maximum number of sweeps used in the Xsyevj functions.
integer(4) function cusolverDnXsyevjSetMaxSweeps(info, sweeps)
type(cusolverDnSyevjInfo) :: info
integer(4) :: sweeps
6.1.12. cusolverDnXsyevjSetSortEig
This function sets the sorting behavior used in the Xsyevj functions.
integer(4) function cusolverDnXsyevjSetSortEig(info, sort_eig)
type(cusolverDnSyevjInfo) :: info
integer(4) :: sort_eig
6.1.13. cusolverDnXsyevjGetResidual
This function provides the residual after the execution of the Xsyevj functions.
integer(4) function cusolverDnXsyevjGetResidual(handle, info, residual)
type(cusolverDnHandle) :: handle
type(cusolverDnSyevjInfo) :: info
real(8) :: residual
6.1.14. cusolverDnXsyevjGetSweeps
This function provides the number of sweeps used after the execution of the Xsyevj functions.
integer(4) function cusolverDnXsyevjGetSweeps(handle, info, sweeps)
type(cusolverDnHandle) :: handle
type(cusolverDnSyevjInfo) :: info
integer(4) :: sweeps
6.2. cusolverDn Legacy API
This section describes the linear solver legacy API of cusolverDn, including Cholesky factorization, LU with partial pivoting, QR factorization, and Bunch-Kaufman (LDLT) factorization.
6.2.1. cusolverDnSpotrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSpotrf
integer function cusolverDnSpotrf_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.2. cusolverDnDpotrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDpotrf
integer function cusolverDnDpotrf_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.3. cusolverDnCpotrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCpotrf
integer function cusolverDnCpotrf_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.4. cusolverDnZpotrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZpotrf
integer function cusolverDnZpotrf_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.5. cusolverDnSpotrf
This function computes the Cholesky factorization of a Hermitian positive-definite matrix
integer function cusolverDnSpotrf(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.6. cusolverDnDpotrf
This function computes the Cholesky factorization of a Hermitian positive-definite matrix
integer function cusolverDnDpotrf(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.7. cusolverDnCpotrf
This function computes the Cholesky factorization of a Hermitian positive-definite matrix
integer function cusolverDnCpotrf(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.8. cusolverDnZpotrf
This function computes the Cholesky factorization of a Hermitian positive-definite matrix
integer function cusolverDnZpotrf(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.9. cusolverDnSpotrs
This function solves the system of linear equations resulting from the Cholesky factorization of a Hermitian positive-definite matrix using cusolverDnSpotrf
integer function cusolverDnSpotrs(handle, &
uplo, n, nrhs, A, lda, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.2.10. cusolverDnDpotrs
This function solves the system of linear equations resulting from the Cholesky factorization of a Hermitian positive-definite matrix using cusolverDnDpotrf
integer function cusolverDnDpotrs(handle, &
uplo, n, nrhs, A, lda, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
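As a minimal double-precision sketch of a Cholesky factor-and-solve (assuming a handle h, a symmetric positive-definite device matrix A_d(lda,n), right-hand sides B_d(ldb,nrhs), and the CUBLAS_FILL_MODE_LOWER enumerator from the cuBLAS definitions; the names are illustrative only):
integer(4) :: lwork
real(8), device, allocatable :: work_d(:)
integer(4), device :: devinfo_d
status = cusolverDnDpotrf_buffersize(h, CUBLAS_FILL_MODE_LOWER, n, A_d, lda, lwork)
allocate(work_d(lwork))
! Factor A = L * L**T in place
status = cusolverDnDpotrf(h, CUBLAS_FILL_MODE_LOWER, n, A_d, lda, work_d, lwork, devinfo_d)
! Solve A * X = B using the factor; the solution overwrites B_d
status = cusolverDnDpotrs(h, CUBLAS_FILL_MODE_LOWER, n, nrhs, A_d, lda, B_d, ldb, devinfo_d)
! Copy devinfo_d back to the host and check that it is zero
deallocate(work_d)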
6.2.11. cusolverDnCpotrs
This function solves the system of linear equations resulting from the Cholesky factorization of a Hermitian positive-definite matrix using cusolverDnCpotrf
integer function cusolverDnCpotrs(handle, &
uplo, n, nrhs, A, lda, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.2.12. cusolverDnZpotrs
This function solves the system of linear equations resulting from the Cholesky factorization of a Hermitian positive-definite matrix using cusolverDnZpotrf
integer function cusolverDnZpotrs(handle, &
uplo, n, nrhs, A, lda, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.2.13. cusolverDnSpotrfBatched
This function computes the Cholesky factorization of a sequence of Hermitian positive-definite matrices
integer function cusolverDnSpotrfBatched(handle, &
uplo, n, Aarray, lda, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, batchCount
type(c_devptr), device :: Aarray(*)
integer(4), device, intent(out) :: devinfo
6.2.14. cusolverDnDpotrfBatched
This function computes the Cholesky factorization of a sequence of Hermitian positive-definite matrices
integer function cusolverDnDpotrfBatched(handle, &
uplo, n, Aarray, lda, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, batchCount
type(c_devptr), device :: Aarray(*)
integer(4), device, intent(out) :: devinfo
6.2.15. cusolverDnCpotrfBatched
This function computes the Cholesky factorization of a sequence of Hermitian positive-definite matrices
integer function cusolverDnCpotrfBatched(handle, &
uplo, n, Aarray, lda, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, batchCount
type(c_devptr), device :: Aarray(*)
integer(4), device, intent(out) :: devinfo
6.2.16. cusolverDnZpotrfBatched
This function computes the Cholesky factorization of a sequence of Hermitian positive-definite matrices
integer function cusolverDnZpotrfBatched(handle, &
uplo, n, Aarray, lda, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, batchCount
type(c_devptr), device :: Aarray(*)
integer(4), device, intent(out) :: devinfo
6.2.17. cusolverDnSpotrsBatched
This function solves a sequence of linear systems resulting from the Cholesky factorization of a sequence of Hermitian positive-definite matrices using cusolverDnSpotrfBatched
integer function cusolverDnSpotrsBatched(handle, &
uplo, n, nrhs, Aarray, lda, Barray, ldb, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, batchCount
type(c_devptr), device :: Aarray(*)
type(c_devptr), device :: Barray(*)
integer(4), device, intent(out) :: devinfo
6.2.18. cusolverDnDpotrsBatched
This function solves a sequence of linear systems resulting from the Cholesky factorization of a sequence of Hermitian positive-definite matrices using cusolverDnDpotrfBatched
integer function cusolverDnDpotrsBatched(handle, &
uplo, n, nrhs, Aarray, lda, Barray, ldb, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, batchCount
type(c_devptr), device :: Aarray(*)
type(c_devptr), device :: Barray(*)
integer(4), device, intent(out) :: devinfo
6.2.19. cusolverDnCpotrsBatched
This function solves a sequence of linear systems resulting from the Cholesky factorization of a sequence of Hermitian positive-definite matrices using cusolverDnCpotrfBatched
integer function cusolverDnCpotrsBatched(handle, &
uplo, n, nrhs, Aarray, lda, Barray, ldb, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, batchCount
type(c_devptr), device :: Aarray(*)
type(c_devptr), device :: Barray(*)
integer(4), device, intent(out) :: devinfo
6.2.20. cusolverDnZpotrsBatched
This function solves a sequence of linear systems resulting from the Cholesky factorization of a sequence of Hermitian positive-definite matrices using cusolverDnZpotrfBatched
integer function cusolverDnZpotrsBatched(handle, &
uplo, n, nrhs, Aarray, lda, Barray, ldb, devinfo, batchCount)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, batchCount
type(c_devptr), device :: Aarray(*)
type(c_devptr), device :: Barray(*)
integer(4), device, intent(out) :: devinfo
6.2.21. cusolverDnSpotri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSpotri
integer function cusolverDnSpotri_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.22. cusolverDnDpotri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDpotri
integer function cusolverDnDpotri_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.23. cusolverDnCpotri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCpotri
integer function cusolverDnCpotri_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.24. cusolverDnZpotri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZpotri
integer function cusolverDnZpotri_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.25. cusolverDnSpotri
This function computes the inverse of a positive-definite matrix using the Cholesky factorization computed by cusolverDnSpotrf.
integer function cusolverDnSpotri(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.26. cusolverDnDpotri
This function computes the inverse of a positive-definite matrix using the Cholesky factorization computed by cusolverDnDpotrf.
integer function cusolverDnDpotri(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.27. cusolverDnCpotri
This function computes the inverse of a positive-definite matrix using the Cholesky factorization computed by cusolverDnCpotrf.
integer function cusolverDnCpotri(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.28. cusolverDnZpotri
This function computes the inverse of a positive-definite matrix using the Cholesky factorization computed by cusolverDnZpotrf.
integer function cusolverDnZpotri(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
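Computing a matrix inverse with the potri routines above follows the same pattern: factor with potrf, then call potri with its own workspace. The double-precision sketch below is illustrative only; it assumes the same modules and cuBLAS constants as the Cholesky example earlier and omits initialization and status checking.
program potri_sketch
  use cudafor
  use cublas       ! assumed to provide CUBLAS_FILL_MODE_UPPER
  use cusolverDn
  implicit none
  integer, parameter :: n = 128
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(n,n)
  real(8), device, allocatable :: work_d(:)
  integer(4), device :: devinfo_d
  integer :: istat, lwf, lwi
  ! ... fill A_d with a symmetric positive-definite matrix ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDpotrf_buffersize(h, CUBLAS_FILL_MODE_UPPER, n, A_d, n, lwf)
  istat = cusolverDnDpotri_buffersize(h, CUBLAS_FILL_MODE_UPPER, n, A_d, n, lwi)
  allocate(work_d(max(lwf, lwi)))
  istat = cusolverDnDpotrf(h, CUBLAS_FILL_MODE_UPPER, n, A_d, n, work_d, lwf, devinfo_d)
  ! Overwrite the upper triangle of A_d with the corresponding triangle of inv(A)
  istat = cusolverDnDpotri(h, CUBLAS_FILL_MODE_UPPER, n, A_d, n, work_d, lwi, devinfo_d)
  istat = cusolverDnDestroy(h)
end program potri_sketch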
6.2.29. cusolverDnStrtri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnStrtri
integer function cusolverDnStrtri_buffersize(handle, &
uplo, diag, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.30. cusolverDnDtrtri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDtrtri
integer function cusolverDnDtrtri_buffersize(handle, &
uplo, diag, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.31. cusolverDnCtrtri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCtrtri
integer function cusolverDnCtrtri_buffersize(handle, &
uplo, diag, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.32. cusolverDnZtrtri_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZtrtri
integer function cusolverDnZtrtri_buffersize(handle, &
uplo, diag, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.33. cusolverDnStrtri
This function computes the inverse of an upper or lower triangular matrix A. It is typically called by cusolverDnSpotri.
integer function cusolverDnStrtri(handle, &
uplo, diag, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.34. cusolverDnDtrtri
This function computes the inverse of an upper or lower triangular matrix A. It is typically called by cusolverDnDpotri.
integer function cusolverDnDtrtri(handle, &
uplo, diag, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.35. cusolverDnCtrtri
This function computes the inverse of an upper or lower triangular matrix A. It is typically called by cusolverDnCpotri.
integer function cusolverDnCtrtri(handle, &
uplo, diag, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.36. cusolverDnZtrtri
This function computes the inverse of an upper or lower triangular matrix A. It is typically called by cusolverDnZpotri.
integer function cusolverDnZtrtri(handle, &
uplo, diag, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.37. cusolverDnSlauum_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSlauum
integer function cusolverDnSlauum_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.38. cusolverDnDlauum_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDlauum
integer function cusolverDnDlauum_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.39. cusolverDnClauum_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnClauum
integer function cusolverDnClauum_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.40. cusolverDnZlauum_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZlauum
integer function cusolverDnZlauum_buffersize(handle, &
uplo, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.41. cusolverDnSlauum
This function computes the product U * U**T or L**T * L. It is typically called by cusolverDnSpotri.
integer function cusolverDnSlauum(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.42. cusolverDnDlauum
This function computes the product U * U**T or L**T * L. It is typically called by cusolverDnDpotri.
integer function cusolverDnDlauum(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.43. cusolverDnClauum
This function computes the product U * U**H or L**H * L. It is typically called by cusolverDnCpotri.
integer function cusolverDnClauum(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.44. cusolverDnZlauum
This function computes the product U * U**H or L**H * L. It is typically called by cusolverDnZpotri.
integer function cusolverDnZlauum(handle, &
uplo, n, A, lda, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.45. cusolverDnSgetrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSgetrf
integer function cusolverDnSgetrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.46. cusolverDnDgetrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDgetrf
integer function cusolverDnDgetrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.47. cusolverDnCgetrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCgetrf
integer function cusolverDnCgetrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.48. cusolverDnZgetrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZgetrf
integer function cusolverDnZgetrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.49. cusolverDnSgetrf
This function computes the LU factorization of a general mxn matrix
integer function cusolverDnSgetrf(handle, &
m, n, A, lda, workspace, devipiv, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: workspace
integer(4), device, dimension(*) :: devipiv
integer(4), device, intent(out) :: devinfo
6.2.50. cusolverDnDgetrf
This function computes the LU factorization of a general mxn matrix
integer function cusolverDnDgetrf(handle, &
m, n, A, lda, workspace, devipiv, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: workspace
integer(4), device, dimension(*) :: devipiv
integer(4), device, intent(out) :: devinfo
6.2.51. cusolverDnCgetrf
This function computes the LU factorization of a general mxn matrix
integer function cusolverDnCgetrf(handle, &
m, n, A, lda, workspace, devipiv, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: workspace
integer(4), device, dimension(*) :: devipiv
integer(4), device, intent(out) :: devinfo
6.2.52. cusolverDnZgetrf
This function computes the LU factorization of a general mxn matrix
integer function cusolverDnZgetrf(handle, &
m, n, A, lda, workspace, devipiv, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: workspace
integer(4), device, dimension(*) :: devipiv
integer(4), device, intent(out) :: devinfo
6.2.53. cusolverDnSgetrs
This function solves the system of linear equations resulting from the LU factorization of a matrix using cusolverDnSgetrf
integer function cusolverDnSgetrs(handle, &
trans, n, nrhs, A, lda, devipiv, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: trans
integer(4) :: n, nrhs, lda, ldb
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(4), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.2.54. cusolverDnDgetrs
This function solves the system of linear equations resulting from the LU factorization of a matrix using cusolverDnDgetrf
integer function cusolverDnDgetrs(handle, &
trans, n, nrhs, A, lda, devipiv, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: trans
integer(4) :: n, nrhs, lda, ldb
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(8), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.2.55. cusolverDnCgetrs
This function solves the system of linear equations resulting from the LU factorization of a matrix using cusolverDnCgetrf
integer function cusolverDnCgetrs(handle, &
trans, n, nrhs, A, lda, devipiv, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: trans
integer(4) :: n, nrhs, lda, ldb
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(4), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.2.56. cusolverDnZgetrs
This function solves the system of linear equations resulting from the LU factorization of a matrix using cusolverDnZgetrf
integer function cusolverDnZgetrs(handle, &
trans, n, nrhs, A, lda, devipiv, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: trans
integer(4) :: n, nrhs, lda, ldb
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(8), device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
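The following minimal double-precision sketch combines the LU routines above into a complete solve of A*X = B. It is illustrative only; it assumes the cudafor, cublas, and cusolverDn modules, takes the CUBLAS_OP_N constant from the cuBLAS definitions, and omits initialization and status checking.
program lu_solve_sketch
  use cudafor
  use cublas       ! assumed to provide CUBLAS_OP_N
  use cusolverDn
  implicit none
  integer, parameter :: n = 256, nrhs = 2
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(n,n), B_d(n,nrhs)
  real(8), device, allocatable :: work_d(:)
  integer(4), device :: ipiv_d(n), devinfo_d
  integer :: istat, lwork
  ! ... initialize A_d and B_d ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDgetrf_buffersize(h, n, n, A_d, n, lwork)
  allocate(work_d(lwork))
  istat = cusolverDnDgetrf(h, n, n, A_d, n, work_d, ipiv_d, devinfo_d)
  ! B_d is overwritten with the solution X
  istat = cusolverDnDgetrs(h, CUBLAS_OP_N, n, nrhs, A_d, n, ipiv_d, B_d, n, devinfo_d)
  istat = cusolverDnDestroy(h)
end program lu_solve_sketch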
6.2.57. cusolverDnSlaswp
This function performs row pivoting. It is typically called by cusolverDnSgetrf.
integer function cusolverDnSlaswp(handle, &
n, A, lda, k1, k2, devipiv, incx)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda, k1, k2, incx
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
6.2.58. cusolverDnDlaswp
This function performs row pivoting. It is typically called by cusolverDnDgetrf.
integer function cusolverDnDlaswp(handle, &
n, A, lda, k1, k2, devipiv, incx)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda, k1, k2, incx
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
6.2.59. cusolverDnClaswp
This function performs row pivoting. It is typically called by cusolverDnCgetrf.
integer function cusolverDnClaswp(handle, &
n, A, lda, k1, k2, devipiv, incx)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda, k1, k2, incx
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
6.2.60. cusolverDnZlaswp
This function performs row pivoting. It is typically called by cusolverDnZgetrf.
integer function cusolverDnZlaswp(handle, &
n, A, lda, k1, k2, devipiv, incx)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda, k1, k2, incx
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
6.2.61. cusolverDnSgeqrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSgeqrf
integer function cusolverDnSgeqrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.62. cusolverDnDgeqrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDgeqrf
integer function cusolverDnDgeqrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.63. cusolverDnCgeqrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCgeqrf
integer function cusolverDnCgeqrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.64. cusolverDnZgeqrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZgeqrf
integer function cusolverDnZgeqrf_buffersize(handle, &
m, n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.65. cusolverDnSgeqrf
This function computes the QR factorization of an mxn matrix
integer function cusolverDnSgeqrf(handle, &
m, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.66. cusolverDnDgeqrf
This function computes the QR factorization of an mxn matrix
integer function cusolverDnDgeqrf(handle, &
m, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.67. cusolverDnCgeqrf
This function computes the QR factorization of an mxn matrix
integer function cusolverDnCgeqrf(handle, &
m, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.68. cusolverDnZgeqrf
This function computes the QR factorization of an mxn matrix
integer function cusolverDnZgeqrf(handle, &
m, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.69. cusolverDnSorgqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSorgqr
integer function cusolverDnSorgqr_buffersize(handle, &
m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
integer(4) :: lwork
6.2.70. cusolverDnDorgqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDorgqr
integer function cusolverDnDorgqr_buffersize(handle, &
m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
integer(4) :: lwork
6.2.71. cusolverDnCorgqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCorgqr
integer function cusolverDnCorgqr_buffersize(handle, &
m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
integer(4) :: lwork
6.2.72. cusolverDnZorgqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZorgqr
integer function cusolverDnZorgqr_buffersize(handle, &
m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
integer(4) :: lwork
6.2.73. cusolverDnSorgqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix
integer function cusolverDnSorgqr(handle, &
m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.74. cusolverDnDorgqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix
integer function cusolverDnDorgqr(handle, &
m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.75. cusolverDnCorgqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix
integer function cusolverDnCorgqr(handle, &
m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.76. cusolverDnZorgqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix
integer function cusolverDnZorgqr(handle, &
m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, k, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
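To obtain an explicit Q factor, geqrf is followed by orgqr (or ungqr for complex data) with k set to the number of elementary reflectors. The double-precision sketch below is illustrative only; it assumes the cudafor and cusolverDn modules and omits initialization and status checking.
program qr_explicit_q_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer, parameter :: m = 512, n = 128
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(m,n), tau_d(n)
  real(8), device, allocatable :: work_d(:)
  integer(4), device :: devinfo_d
  integer :: istat, lwf, lwq
  ! ... initialize A_d ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDgeqrf_buffersize(h, m, n, A_d, m, lwf)
  istat = cusolverDnDorgqr_buffersize(h, m, n, n, A_d, m, tau_d, lwq)
  allocate(work_d(max(lwf, lwq)))
  istat = cusolverDnDgeqrf(h, m, n, A_d, m, tau_d, work_d, lwf, devinfo_d)
  ! Overwrite A_d with the m x n matrix Q having orthonormal columns
  istat = cusolverDnDorgqr(h, m, n, n, A_d, m, tau_d, work_d, lwq, devinfo_d)
  istat = cusolverDnDestroy(h)
end program qr_explicit_q_sketch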
6.2.77. cusolverDnSormqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSormqr
integer function cusolverDnSormqr_buffersize(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.2.78. cusolverDnDormqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDormqr
integer function cusolverDnDormqr_buffersize(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.2.79. cusolverDnCormqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCormqr
integer function cusolverDnCormqr_buffersize(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.2.80. cusolverDnZormqr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZormqr
integer function cusolverDnZormqr_buffersize(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.2.81. cusolverDnSormqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnSormqr(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(ldc,*) :: C
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.82. cusolverDnDormqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnDormqr(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(ldc,*) :: C
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.83. cusolverDnCormqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnCormqr(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(ldc,*) :: C
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.84. cusolverDnZormqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnZormqr(handle, &
side, trans, m, n, k, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, trans
integer(4) :: m, n, k, lda, ldc, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(ldc,*) :: C
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
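The ormqr routines above are commonly used to apply Q**H (or Q**T) from a previous geqrf factorization without forming Q explicitly, for example in a least-squares solve. The double-precision sketch below is illustrative only; it assumes the cudafor, cublas, and cusolverDn modules, takes the CUBLAS_SIDE_LEFT and CUBLAS_OP_T constants from the cuBLAS definitions, and omits initialization and status checking. A triangular solve with the R factor, for example using cublasDtrsm, would complete a least-squares solution.
program ormqr_sketch
  use cudafor
  use cublas       ! assumed to provide CUBLAS_SIDE_LEFT and CUBLAS_OP_T
  use cusolverDn
  implicit none
  integer, parameter :: m = 512, n = 128, nrhs = 1
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(m,n), B_d(m,nrhs), tau_d(n)
  real(8), device, allocatable :: work_d(:)
  integer(4), device :: devinfo_d
  integer :: istat, lwf, lwq
  ! ... initialize A_d and B_d ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDgeqrf_buffersize(h, m, n, A_d, m, lwf)
  istat = cusolverDnDormqr_buffersize(h, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, m, nrhs, n, &
          A_d, m, tau_d, B_d, m, lwq)
  allocate(work_d(max(lwf, lwq)))
  istat = cusolverDnDgeqrf(h, m, n, A_d, m, tau_d, work_d, lwf, devinfo_d)
  ! Overwrite B_d with Q**T * B
  istat = cusolverDnDormqr(h, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, m, nrhs, n, &
          A_d, m, tau_d, B_d, m, work_d, lwq, devinfo_d)
  istat = cusolverDnDestroy(h)
end program ormqr_sketch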
6.2.85. cusolverDnSsytrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsytrf
integer function cusolverDnSsytrf_buffersize(handle, &
n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.86. cusolverDnDsytrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsytrf
integer function cusolverDnDsytrf_buffersize(handle, &
n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.87. cusolverDnCsytrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCsytrf
integer function cusolverDnCsytrf_buffersize(handle, &
n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.88. cusolverDnZsytrf_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZsytrf
integer function cusolverDnZsytrf_buffersize(handle, &
n, A, lda, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
integer(4) :: lwork
6.2.89. cusolverDnSsytrf
This function computes the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix
integer function cusolverDnSsytrf(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.90. cusolverDnDsytrf
This function computes the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix
integer function cusolverDnDsytrf(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.91. cusolverDnCsytrf
This function computes the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix
integer function cusolverDnCsytrf(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.92. cusolverDnZsytrf
This function computes the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix
integer function cusolverDnZsytrf(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.93. cusolverDnSsytrs_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsytrs
integer function cusolverDnSsytrs_bufferSize(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(4), device, dimension(ldb,*) :: B
integer(4), intent(out) :: lwork
6.2.94. cusolverDnDsytrs_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsytrs
integer function cusolverDnDsytrs_bufferSize(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(8), device, dimension(ldb,*) :: B
integer(4), intent(out) :: lwork
6.2.95. cusolverDnCsytrs_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCsytrs
integer function cusolverDnCsytrs_bufferSize(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(4), device, dimension(ldb,*) :: B
integer(4), intent(out) :: lwork
6.2.96. cusolverDnZsytrs_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZsytrs
integer function cusolverDnZsytrs_bufferSize(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(8), device, dimension(ldb,*) :: B
integer(4), intent(out) :: lwork
6.2.97. cusolverDnSsytrs
This function solves the system of linear equations resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnSsytrf
integer function cusolverDnSsytrs(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, lwork
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(4), device, dimension(ldb,*) :: B
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.98. cusolverDnDsytrs
This function solves the system of linear equations resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnDsytrf
integer function cusolverDnDsytrs(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, lwork
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(8), device, dimension(ldb,*) :: B
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.99. cusolverDnCsytrs
This function solves the system of linear equations resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnCsytrf
integer function cusolverDnCsytrs(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, lwork
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(4), device, dimension(ldb,*) :: B
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.100. cusolverDnZsytrs
This function solves the system of linear equations resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnZsytrf
integer function cusolverDnZsytrs(handle, &
uplo, n, nrhs, A, lda, devipiv, B, ldb, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, nrhs, lda, ldb, lwork
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(8), device, dimension(ldb,*) :: B
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.101. cusolverDnSsytri_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsytri
integer function cusolverDnSsytri_bufferSize(handle, &
uplo, n, A, lda, devipiv, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
integer(4), intent(out) :: lwork
6.2.102. cusolverDnDsytri_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsytri
integer function cusolverDnDsytri_bufferSize(handle, &
uplo, n, A, lda, devipiv, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
integer(4), intent(out) :: lwork
6.2.103. cusolverDnCsytri_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCsytri
integer function cusolverDnCsytri_bufferSize(handle, &
uplo, n, A, lda, devipiv, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
integer(4), intent(out) :: lwork
6.2.104. cusolverDnZsytri_bufferSize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZsytri
integer function cusolverDnZsytri_bufferSize(handle, &
uplo, n, A, lda, devipiv, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
integer(4), intent(out) :: lwork
6.2.105. cusolverDnSsytri
This function inverts the matrix resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnSsytrf
integer function cusolverDnSsytri(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.106. cusolverDnDsytri
This function inverts the matrix resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnDsytrf
integer function cusolverDnDsytri(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.107. cusolverDnCsytri
This function inverts the matrix resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnCsytrf
integer function cusolverDnCsytri(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.2.108. cusolverDnZsytri
This function inverts the matrix resulting from the Bunch-Kaufman factorization of an nxn symmetric indefinite matrix using cusolverDnZsytrf
integer function cusolverDnZsytri(handle, &
uplo, n, A, lda, devipiv, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
integer(4), device, dimension(*) :: devipiv
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
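The Bunch-Kaufman routines above follow the same query, factor, and solve pattern. The double-precision sketch below is illustrative only; it assumes the cudafor, cublas, and cusolverDn modules and the CUBLAS_FILL_MODE_LOWER constant from the cuBLAS definitions, and omits initialization and status checking.
program sytrf_solve_sketch
  use cudafor
  use cublas       ! assumed to provide CUBLAS_FILL_MODE_LOWER
  use cusolverDn
  implicit none
  integer, parameter :: n = 256, nrhs = 1
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(n,n), B_d(n,nrhs)
  real(8), device, allocatable :: workf_d(:), works_d(:)
  integer(4), device :: ipiv_d(n), devinfo_d
  integer :: istat, lwf, lws
  ! ... fill A_d with a symmetric indefinite matrix and B_d with the right-hand sides ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDsytrf_buffersize(h, n, A_d, n, lwf)
  allocate(workf_d(lwf))
  istat = cusolverDnDsytrf(h, CUBLAS_FILL_MODE_LOWER, n, A_d, n, ipiv_d, workf_d, lwf, devinfo_d)
  istat = cusolverDnDsytrs_bufferSize(h, CUBLAS_FILL_MODE_LOWER, n, nrhs, A_d, n, ipiv_d, B_d, n, lws)
  allocate(works_d(lws))
  ! B_d is overwritten with the solution X
  istat = cusolverDnDsytrs(h, CUBLAS_FILL_MODE_LOWER, n, nrhs, A_d, n, ipiv_d, B_d, n, works_d, lws, devinfo_d)
  istat = cusolverDnDestroy(h)
end program sytrf_solve_sketch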
6.3. cusolverDn Legacy Eigenvalue Solver API
This section describes the legacy eigenvalue solver API of cusolverDn, including bidiagonalization and SVD.
6.3.1. cusolverDnSgebrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSgebrd
integer function cusolverDnSgebrd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.2. cusolverDnDgebrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDgebrd
integer function cusolverDnDgebrd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.3. cusolverDnCgebrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCgebrd
integer function cusolverDnCgebrd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.4. cusolverDnZgebrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZgebrd
integer function cusolverDnZgebrd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.5. cusolverDnSgebrd
This function reduces a general mxn matrix A to an upper or lower bidiagonal form B by an orthogonal transformation Q**H * A * P = B
integer function cusolverDnSgebrd(handle, &
m, n, A, lda, D, E, TauQ, TauP, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: D
real(4), device, dimension(*) :: E
real(4), device, dimension(*) :: TauQ
real(4), device, dimension(*) :: TauP
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.6. cusolverDnDgebrd
This function reduces a general mxn matrix A to an upper or lower bidiagonal form B by an orthogonal transformation Q**H * A * P = B
integer function cusolverDnDgebrd(handle, &
m, n, A, lda, D, E, TauQ, TauP, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: D
real(8), device, dimension(*) :: E
real(8), device, dimension(*) :: TauQ
real(8), device, dimension(*) :: TauP
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.7. cusolverDnCgebrd
This function reduces a general mxn matrix A to an upper or lower bidiagonal form B by an orthogonal transformation Q**H * A * P = B
integer function cusolverDnCgebrd(handle, &
m, n, A, lda, D, E, TauQ, TauP, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: D
real(4), device, dimension(*) :: E
complex(4), device, dimension(*) :: TauQ
complex(4), device, dimension(*) :: TauP
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.8. cusolverDnZgebrd
This function reduces a general mxn matrix A to an upper or lower bidiagonal form B by an orthogonal transformation Q**H * A * P = B
integer function cusolverDnZgebrd(handle, &
m, n, A, lda, D, E, TauQ, TauP, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: m, n, lda, lwork
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: D
real(8), device, dimension(*) :: E
complex(8), device, dimension(*) :: TauQ
complex(8), device, dimension(*) :: TauP
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
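The following minimal sketch, provided for illustration and assuming the cudafor and cusolverDn modules with error checking and data initialization omitted, shows the typical workspace-query pattern for the bidiagonalization routines above in double precision.
program gebrd_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer, parameter :: m = 300, n = 200      ! here min(m,n) = n
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(m,n)
  real(8), device :: D_d(n), E_d(n), tauq_d(n), taup_d(n)
  real(8), device, allocatable :: work_d(:)
  integer(4), device :: devinfo_d
  integer :: istat, lwork
  ! ... initialize A_d ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDgebrd_buffersize(h, m, n, lwork)
  allocate(work_d(lwork))
  ! A_d is overwritten; D_d and E_d receive the diagonal and off-diagonal of B
  istat = cusolverDnDgebrd(h, m, n, A_d, m, D_d, E_d, tauq_d, taup_d, work_d, lwork, devinfo_d)
  istat = cusolverDnDestroy(h)
end program gebrd_sketch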
6.3.9. cusolverDnSorgbr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSorgbr
integer function cusolverDnSorgbr_buffersize(handle, &
side, m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
integer(4) :: lwork
6.3.10. cusolverDnDorgbr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDorgbr
integer function cusolverDnDorgbr_buffersize(handle, &
side, m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
integer(4) :: lwork
6.3.11. cusolverDnCungbr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCungbr
integer function cusolverDnCungbr_buffersize(handle, &
side, m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
integer(4) :: lwork
6.3.12. cusolverDnZungbr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZungbr
integer function cusolverDnZungbr_buffersize(handle, &
side, m, n, k, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
integer(4) :: lwork
6.3.13. cusolverDnSorgbr
This function generates one of the unitary matrices Q or P**H determined by cusolverDnSgebrd.
integer function cusolverDnSorgbr(handle, &
side, m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.14. cusolverDnDorgbr
This function generates one of the unitary matrices Q or P**H determined by cusolverDnDgebrd.
integer function cusolverDnDorgbr(handle, &
side, m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.15. cusolverDnCungbr
This function generates one of the unitary matrices Q or P**H determined by cusolverDnCgebrd.
integer function cusolverDnCungbr(handle, &
side, m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.16. cusolverDnZungbr
This function generates one of the unitary matrices Q or P**H determined by cusolverDnZgebrd.
integer function cusolverDnZungbr(handle, &
side, m, n, k, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side
integer(4) :: m, n, k, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.17. cusolverDnSsytrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsytrd
integer function cusolverDnSsytrd_buffersize(handle, &
uplo, n, A, lda, D, E, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: D
real(4), device, dimension(*) :: E
real(4), device, dimension(*) :: tau
integer(4) :: lwork
6.3.18. cusolverDnDsytrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsytrd
integer function cusolverDnDsytrd_buffersize(handle, &
uplo, n, A, lda, D, E, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: D
real(8), device, dimension(*) :: E
real(8), device, dimension(*) :: tau
integer(4) :: lwork
6.3.19. cusolverDnChetrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnChetrd
integer function cusolverDnChetrd_buffersize(handle, &
uplo, n, A, lda, D, E, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: D
real(4), device, dimension(*) :: E
complex(4), device, dimension(*) :: tau
integer(4) :: lwork
6.3.20. cusolverDnZhetrd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZhetrd
integer function cusolverDnZhetrd_buffersize(handle, &
uplo, n, A, lda, D, E, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: D
real(8), device, dimension(*) :: E
complex(8), device, dimension(*) :: tau
integer(4) :: lwork
6.3.21. cusolverDnSsytrd
This function reduces a general symmetric or Hermitian nxn matrix to real symmetric tridiagonal form.
integer function cusolverDnSsytrd(handle, &
uplo, n, A, lda, D, E, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: D
real(4), device, dimension(*) :: E
real(4), device, dimension(*) :: tau
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.22. cusolverDnDsytrd
This function reduces a general symmetric or Hermitian nxn matrix to real symmetric tridiagonal form.
integer function cusolverDnDsytrd(handle, &
uplo, n, A, lda, D, E, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: D
real(8), device, dimension(*) :: E
real(8), device, dimension(*) :: tau
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.23. cusolverDnChetrd
This function reduces a general symmetric or Hermitian nxn matrix to real symmetric tridiagonal form.
integer function cusolverDnChetrd(handle, &
uplo, n, A, lda, D, E, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: D
real(4), device, dimension(*) :: E
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.24. cusolverDnZhetrd
This function reduces a general symmetric or Hermitian nxn matrix to real symmetric tridiagonal form.
integer function cusolverDnZhetrd(handle, &
uplo, n, A, lda, D, E, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: D
real(8), device, dimension(*) :: E
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
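As with the other factorizations, the tridiagonal reduction is normally preceded by its workspace query. The double-precision sketch below is illustrative only; it assumes the cudafor, cublas, and cusolverDn modules and the CUBLAS_FILL_MODE_UPPER constant from the cuBLAS definitions, and omits initialization and status checking.
program sytrd_sketch
  use cudafor
  use cublas       ! assumed to provide CUBLAS_FILL_MODE_UPPER
  use cusolverDn
  implicit none
  integer, parameter :: n = 256
  type(cusolverDnHandle) :: h
  real(8), device :: A_d(n,n), D_d(n), E_d(n), tau_d(n)
  real(8), device, allocatable :: work_d(:)
  integer(4), device :: devinfo_d
  integer :: istat, lwork
  ! ... fill A_d with a symmetric matrix ...
  istat = cusolverDnCreate(h)
  istat = cusolverDnDsytrd_buffersize(h, CUBLAS_FILL_MODE_UPPER, n, A_d, n, D_d, E_d, tau_d, lwork)
  allocate(work_d(lwork))
  ! D_d and E_d receive the diagonal and off-diagonal of the tridiagonal matrix T
  istat = cusolverDnDsytrd(h, CUBLAS_FILL_MODE_UPPER, n, A_d, n, D_d, E_d, tau_d, work_d, lwork, devinfo_d)
  istat = cusolverDnDestroy(h)
end program sytrd_sketch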
6.3.25. cusolverDnSormtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSormtr
integer function cusolverDnSormtr_buffersize(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.3.26. cusolverDnDormtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDormtr
integer function cusolverDnDormtr_buffersize(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.3.27. cusolverDnCunmtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCunmtr
integer function cusolverDnCunmtr_buffersize(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.3.28. cusolverDnZunmtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZunmtr
integer function cusolverDnZunmtr_buffersize(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(ldc,*) :: C
integer(4) :: lwork
6.3.29. cusolverDnSormtr
This function generates the unitary matrix Q formed by a sequence of elementary reflection vectors and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnSormtr(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(ldc,*) :: C
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.30. cusolverDnDormtr
This function generates the unitary matrix Q formed by a sequence of elementary reflection vectors and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnDormtr(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(ldc,*) :: C
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.31. cusolverDnCunmtr
This function generates the unitary matrix Q formed by a sequence of elementary reflection vectors and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnCunmtr(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(ldc,*) :: C
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.32. cusolverDnZunmtr
This function generates the unitary matrix Q formed by a sequence of elementary reflection vectors and overwrites the array C, based on the side and trans arguments.
integer function cusolverDnZunmtr(handle, &
side, uplo, trans, m, n, A, lda, tau, C, ldc, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: side, uplo, trans
integer(4) :: m, n, lda, ldc, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(ldc,*) :: C
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.33. cusolverDnSorgtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSorgtr
integer function cusolverDnSorgtr_buffersize(handle, &
uplo, n, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
integer(4) :: lwork
6.3.34. cusolverDnDorgtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDorgtr
integer function cusolverDnDorgtr_buffersize(handle, &
uplo, n, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
integer(4) :: lwork
6.3.35. cusolverDnCungtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCungtr
integer function cusolverDnCungtr_buffersize(handle, &
uplo, n, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
integer(4) :: lwork
6.3.36. cusolverDnZungtr_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZungtr
integer function cusolverDnZungtr_buffersize(handle, &
uplo, n, A, lda, tau, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
integer(4) :: lwork
6.3.37. cusolverDnSorgtr
This function generates a unitary matrix Q defined as the product of n-1 elementary reflectors of order n, produced by cusolverDnSsytrd.
integer function cusolverDnSorgtr(handle, &
uplo, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: tau
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.38. cusolverDnDorgtr
This function generates a unitary matrix Q defined as the product of n-1 elementary reflectors of order n, produced by cusolverDnDsytrd.
integer function cusolverDnDorgtr(handle, &
uplo, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: tau
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.39. cusolverDnCungtr
This function generates a unitary matrix Q defined as the product of n-1 elementary reflectors of order n, produced by cusolverDnChetrd.
integer function cusolverDnCungtr(handle, &
uplo, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(*) :: tau
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.40. cusolverDnZungtr
This function generates a unitary matrix Q defined as the product of n-1 elementary reflectors of order n, produced by cusolverDnZhetrd.
integer function cusolverDnZungtr(handle, &
uplo, n, A, lda, tau, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(*) :: tau
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.41. cusolverDnSgesvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSgesvd
integer function cusolverDnSgesvd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.42. cusolverDnDgesvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDgesvd
integer function cusolverDnDgesvd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.43. cusolverDnCgesvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCgesvd
integer function cusolverDnCgesvd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.44. cusolverDnZgesvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZgesvd
integer function cusolverDnZgesvd_buffersize(handle, &
m, n, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: m, n
integer(4) :: lwork
6.3.45. cusolverDnSgesvd
This function computes the singular value decomposition (SVD) of an mxn matrix A and the corresponding left and/or right singular vectors. cusolverDnSgesvd only supports m >= n.
integer function cusolverDnSgesvd(handle, &
jobu, jobvt, m, n, A, lda, S, U, ldu, VT, ldvt, workspace, lwork, rwork, devinfo)
type(cusolverDnHandle) :: handle
character*1 :: jobu, jobvt
integer(4) :: m, n, lda, ldu, ldvt, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: S
real(4), device, dimension(ldu,*) :: U
real(4), device, dimension(ldvt,*) :: VT
real(4), device, dimension(lwork) :: workspace
real(4), device, dimension(lwork) :: rwork
integer(4), device, intent(out) :: devinfo
6.3.46. cusolverDnDgesvd
This function computes the singular value decomposition (SVD) of an mxn matrix A and the corresponding left and/or right singular vectors. cusolverDnDgesvd only supports m >= n.
integer function cusolverDnDgesvd(handle, &
jobu, jobvt, m, n, A, lda, S, U, ldu, VT, ldvt, workspace, lwork, rwork, devinfo)
type(cusolverDnHandle) :: handle
character*1 :: jobu, jobvt
integer(4) :: m, n, lda, ldu, ldvt, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: S
real(8), device, dimension(ldu,*) :: U
real(8), device, dimension(ldvt,*) :: VT
real(8), device, dimension(lwork) :: workspace
real(8), device, dimension(lwork) :: rwork
integer(4), device, intent(out) :: devinfo
6.3.47. cusolverDnCgesvd
This function computes the singular value decomposition (SVD) of an mxn matrix A and the corresponding left and/or right singular vectors. cusolverDnCgesvd only supports m >= n.
integer function cusolverDnCgesvd(handle, &
jobu, jobvt, m, n, A, lda, S, U, ldu, VT, ldvt, workspace, lwork, rwork, devinfo)
type(cusolverDnHandle) :: handle
character*1 :: jobu, jobvt
integer(4) :: m, n, lda, ldu, ldvt, lwork
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: S
complex(4), device, dimension(ldu,*) :: U
complex(4), device, dimension(ldvt,*) :: VT
complex(4), device, dimension(lwork) :: workspace
real(4), device, dimension(lwork) :: rwork
integer(4), device, intent(out) :: devinfo
6.3.48. cusolverDnZgesvd
This function computes the singular value decomposition (SVD) of an mxn matrix A and the corresponding left and/or right singular vectors. cusolverDnZgesvd only supports m >= n.
integer function cusolverDnZgesvd(handle, &
jobu, jobvt, m, n, A, lda, S, U, ldu, VT, ldvt, workspace, lwork, rwork, devinfo)
type(cusolverDnHandle) :: handle
character*1 :: jobu, jobvt
integer(4) :: m, n, lda, ldu, ldvt, lwork
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: S
complex(8), device, dimension(ldu,*) :: U
complex(8), device, dimension(ldvt,*) :: VT
complex(8), device, dimension(lwork) :: workspace
real(8), device, dimension(lwork) :: rwork
integer(4), device, intent(out) :: devinfo
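As a brief illustration of how these gesvd routines are typically driven from CUDA Fortran, the following sketch (not taken from the Examples chapter; the matrix sizes are arbitrary and the 'A' job settings, which request all singular vectors, are illustrative) queries the workspace size, allocates the device workspace and rwork arrays, and then calls cusolverDnDgesvd. Note that m >= n is required.

program dgesvd_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer, parameter :: m = 512, n = 256
  type(cusolverDnHandle) :: h
  real(8), device :: A(m,n), S(n), U(m,m), VT(n,n)
  real(8), device, allocatable :: work(:), rwork(:)
  integer(4), device :: devinfo
  integer :: istat, lwork, info

  istat = cusolverDnCreate(h)
  istat = cusolverDnDgesvd_buffersize(h, m, n, lwork)
  allocate(work(lwork), rwork(lwork))
  ! ... initialize A on the device ...
  istat = cusolverDnDgesvd(h, 'A', 'A', m, n, A, m, S, U, m, VT, n, &
                           work, lwork, rwork, devinfo)
  info = devinfo                   ! copy the device status flag back to the host
  if (istat /= 0 .or. info /= 0) print *, 'dgesvd failed: ', istat, info
  deallocate(work, rwork)
  istat = cusolverDnDestroy(h)
end program dgesvd_sketch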
6.3.49. cusolverDnSsyevd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsyevd
integer function cusolverDnSsyevd_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: W
integer(4) :: lwork
6.3.50. cusolverDnDsyevd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsyevd
integer function cusolverDnDsyevd_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: W
integer(4) :: lwork
6.3.51. cusolverDnCheevd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCheevd
integer function cusolverDnCheevd_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: W
integer(4) :: lwork
6.3.52. cusolverDnZheevd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZheevd
integer function cusolverDnZheevd_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: W
integer(4) :: lwork
6.3.53. cusolverDnSsyevd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnSsyevd(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.54. cusolverDnDsyevd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnDsyevd(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.55. cusolverDnCheevd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnCheevd(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.56. cusolverDnZheevd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnZheevd(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
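A minimal CUDA Fortran sketch of the call sequence for these dense eigensolvers is shown below (not taken from the Examples chapter; the matrix size is arbitrary, and the CUSOLVER_EIG_MODE_VECTOR and CUBLAS_FILL_MODE_LOWER constants are assumed to be made available by the cusolverDn module). The workspace is sized with cusolverDnDsyevd_buffersize and then passed to cusolverDnDsyevd.

program dsyevd_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer, parameter :: n = 1000
  type(cusolverDnHandle) :: h
  real(8), device :: A(n,n), W(n)
  real(8), device, allocatable :: work(:)
  integer(4), device :: devinfo
  integer :: istat, lwork

  istat = cusolverDnCreate(h)
  ! ... fill the lower triangle of A on the device ...
  istat = cusolverDnDsyevd_buffersize(h, CUSOLVER_EIG_MODE_VECTOR, &
          CUBLAS_FILL_MODE_LOWER, n, A, n, W, lwork)
  allocate(work(lwork))
  istat = cusolverDnDsyevd(h, CUSOLVER_EIG_MODE_VECTOR, &
          CUBLAS_FILL_MODE_LOWER, n, A, n, W, work, lwork, devinfo)
  ! on success, W holds the eigenvalues in ascending order and A the eigenvectors
  deallocate(work)
  istat = cusolverDnDestroy(h)
end program dsyevd_sketch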
6.3.57. cusolverDnSsyevdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsyevdx
integer function cusolverDnSsyevdx_buffersize(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu
real(4) :: vl, vu
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.58. cusolverDnDsyevdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsyevdx
integer function cusolverDnDsyevdx_buffersize(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu
real(8) :: vl, vu
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.59. cusolverDnCheevdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCheevdx
integer function cusolverDnCheevdx_buffersize(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu
real(4) :: vl, vu
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.60. cusolverDnZheevdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZheevdx
integer function cusolverDnZheevdx_buffersize(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu
real(8) :: vl, vu
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.61. cusolverDnSsyevdx
This function computes all or a selection of eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnSsyevdx(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu, lwork
real(4) :: vl, vu
real(4), device, dimension(lda,*) :: A
integer(4), intent(out) :: meig
real(4), device, dimension(n) :: W
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.62. cusolverDnDsyevdx
This function computes all or a selection of eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnDsyevdx(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu, lwork
real(8) :: vl, vu
real(8), device, dimension(lda,*) :: A
integer(4), intent(out) :: meig
real(8), device, dimension(n) :: W
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.63. cusolverDnCheevdx
This function computes all or a selection of eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnCheevdx(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu, lwork
real(4) :: vl, vu
complex(4), device, dimension(lda,*) :: A
integer(4), intent(out) :: meig
real(4), device, dimension(n) :: W
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.64. cusolverDnZheevdx
This function computes all or a selection of eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnZheevdx(handle, &
jobz, range, uplo, n, A, lda, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, range, uplo
integer(4) :: n, lda, il, iu, lwork
real(8) :: vl, vu
complex(8), device, dimension(lda,*) :: A
integer(4), intent(out) :: meig
real(8), device, dimension(n) :: W
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.65. cusolverDnSsyevj_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsyevj
integer function cusolverDnSsyevj_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
6.3.66. cusolverDnDsyevj_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsyevj
integer function cusolverDnDsyevj_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
6.3.67. cusolverDnCheevj_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnCheevj
integer function cusolverDnCheevj_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(*) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
6.3.68. cusolverDnZheevj_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZheevj
integer function cusolverDnZheevj_buffersize(handle, &
jobz, uplo, n, A, lda, W, lwork, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(*) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
6.3.69. cusolverDnSsyevj
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnSsyevj(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
6.3.70. cusolverDnDsyevj
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnDsyevj(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
6.3.71. cusolverDnCheevj
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnCheevj(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
6.3.72. cusolverDnZheevj
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix A.
integer function cusolverDnZheevj(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
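The Jacobi-based syevj/heevj routines follow the same pattern but carry an extra parameter object. The sketch below is hypothetical (the size is arbitrary; cusolverDnCreateSyevjInfo and cusolverDnDestroySyevjInfo are the syevjInfo helper routines assumed to be documented elsewhere in this chapter, and the jobz/uplo constants are assumed to come from the cusolverDn module); it shows the double-precision case.

program dsyevj_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer, parameter :: n = 500
  type(cusolverDnHandle) :: h
  type(cusolverDnSyevjInfo) :: params
  real(8), device :: A(n,n), W(n)
  real(8), device, allocatable :: work(:)
  integer(4), device :: devinfo
  integer :: istat, lwork

  istat = cusolverDnCreate(h)
  istat = cusolverDnCreateSyevjInfo(params)   ! assumed helper; creates the syevj parameter object
  ! ... fill the lower triangle of A on the device ...
  istat = cusolverDnDsyevj_buffersize(h, CUSOLVER_EIG_MODE_VECTOR, &
          CUBLAS_FILL_MODE_LOWER, n, A, n, W, lwork, params)
  allocate(work(lwork))
  istat = cusolverDnDsyevj(h, CUSOLVER_EIG_MODE_VECTOR, &
          CUBLAS_FILL_MODE_LOWER, n, A, n, W, work, lwork, devinfo, params)
  deallocate(work)
  istat = cusolverDnDestroySyevjInfo(params)  ! assumed helper; releases the parameter object
  istat = cusolverDnDestroy(h)
end program dsyevj_sketch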
6.3.73. cusolverDnSsyevjBatched_bufferSize
This function calculates the buffer size needed for the device workspace passed into cusolverDnSsyevjBatched.
integer function cusolverDnSsyevjBatched_bufferSize(handle, &
jobz, uplo, n, A, lda, W, lwork, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.74. cusolverDnDsyevjBatched_bufferSize
This function calculates the buffer size needed for the device workspace passed into cusolverDnDsyevjBatched.
integer function cusolverDnDsyevjBatched_bufferSize(handle, &
jobz, uplo, n, A, lda, W, lwork, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.75. cusolverDnCheevjBatched_bufferSize
This function calculates the buffer size needed for the device workspace passed into cusolverDnCheevjBatched.
integer function cusolverDnCheevjBatched_bufferSize(handle, &
jobz, uplo, n, A, lda, W, lwork, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.76. cusolverDnZheevjBatched_bufferSize
This function calculates the buffer size needed for the device workspace passed into cusolverDnZheevjBatched.
integer function cusolverDnZheevjBatched_bufferSize(handle, &
jobz, uplo, n, A, lda, W, lwork, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
integer(4), intent(out) :: lwork
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.77. cusolverDnSsyevjBatched
This function computes the eigenvalues and eigenvectors of symmetric or Hermitian nxn matrices.
integer function cusolverDnSsyevjBatched(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.78. cusolverDnDsyevjBatched
This function computes the eigenvalues and eigenvectors of symmetric or Hermitian nxn matrices.
integer function cusolverDnDsyevjBatched(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.79. cusolverDnCheevjBatched
This function computes the eigenvalues and eigenvectors of symmetric or Hermitian nxn matrices.
integer function cusolverDnCheevjBatched(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
complex(4), device, dimension(lda,*) :: A
real(4), device, dimension(n) :: W
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.80. cusolverDnZheevjBatched
This function computes the eigenvalues and eigenvectors of symmetric or Hermitian nxn matrices.
integer function cusolverDnZheevjBatched(handle, &
jobz, uplo, n, A, lda, W, workspace, lwork, devinfo, params, batchSize)
type(cusolverDnHandle) :: handle
integer(4) :: jobz, uplo
integer(4) :: n, lda, lwork
complex(8), device, dimension(lda,*) :: A
real(8), device, dimension(n) :: W
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
type(cusolverDnSyevjInfo) :: params
integer(4) :: batchSize
6.3.81. cusolverDnSsygvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsygvd
integer function cusolverDnSsygvd_buffersize(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(ldb,*) :: B
real(4), device, dimension(n) :: W
integer(4) :: lwork
6.3.82. cusolverDnDsygvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsygvd
integer function cusolverDnDsygvd_buffersize(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(ldb,*) :: B
real(8), device, dimension(n) :: W
integer(4) :: lwork
6.3.83. cusolverDnChegvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnChegvd
integer function cusolverDnChegvd_buffersize(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(ldb,*) :: B
real(4), device, dimension(n) :: W
integer(4) :: lwork
6.3.84. cusolverDnZhegvd_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZhegvd
integer function cusolverDnZhegvd_buffersize(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(ldb,*) :: B
real(8), device, dimension(n) :: W
integer(4) :: lwork
6.3.85. cusolverDnSsygvd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnSsygvd(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb, lwork
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(ldb,*) :: B
real(4), device, dimension(n) :: W
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.86. cusolverDnDsygvd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnDsygvd(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb, lwork
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(ldb,*) :: B
real(8), device, dimension(n) :: W
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.87. cusolverDnChegvd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnChegvd(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb, lwork
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(ldb,*) :: B
real(4), device, dimension(n) :: W
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.88. cusolverDnZhegvd
This function computes eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnZhegvd(handle, &
itype, jobz, uplo, n, A, lda, B, ldb, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, uplo
integer(4) :: n, lda, ldb, lwork
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(ldb,*) :: B
real(8), device, dimension(n) :: W
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.89. cusolverDnSsygvdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnSsygvdx
integer function cusolverDnSsygvdx_buffersize(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu
real(4) :: vl, vu
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(ldb,*) :: B
real(4), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.90. cusolverDnDsygvdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnDsygvdx
integer function cusolverDnDsygvdx_buffersize(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu
real(8) :: vl, vu
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(ldb,*) :: B
real(8), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.91. cusolverDnChegvdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnChegvdx
integer function cusolverDnChegvdx_buffersize(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu
real(4) :: vl, vu
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(ldb,*) :: B
real(4), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.92. cusolverDnZhegvdx_buffersize
This function calculates the buffer sizes needed for the device workspace passed into cusolverDnZhegvdx
integer function cusolverDnZhegvdx_buffersize(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, lwork)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu
real(8) :: vl, vu
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(ldb,*) :: B
real(8), device, dimension(n) :: W
integer(4) :: meig, lwork
6.3.93. cusolverDnSsygvdx
This function computes all or a selection of the eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnSsygvdx(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu, lwork
real(4) :: vl, vu
real(4), device, dimension(lda,*) :: A
real(4), device, dimension(ldb,*) :: B
integer(4), intent(out) :: meig
real(4), device, dimension(n) :: W
real(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.94. cusolverDnDsygvdx
This function computes all or a selection of the eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnDsygvdx(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu, lwork
real(8) :: vl, vu
real(8), device, dimension(lda,*) :: A
real(8), device, dimension(ldb,*) :: B
integer(4), intent(out) :: meig
real(8), device, dimension(n) :: W
real(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.95. cusolverDnChegvdx
This function computes all or a selection of the eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnChegvdx(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu, lwork
real(4) :: vl, vu
complex(4), device, dimension(lda,*) :: A
complex(4), device, dimension(ldb,*) :: B
integer(4), intent(out) :: meig
real(4), device, dimension(n) :: W
complex(4), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.3.96. cusolverDnZhegvdx
This function computes all or a selection of the eigenvalues and eigenvectors of a symmetric or Hermitian nxn matrix pair (A,B).
integer function cusolverDnZhegvdx(handle, &
itype, jobz, range, uplo, n, A, lda, B, ldb, vl, vu, il, iu, meig, W, workspace, lwork, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: itype, jobz, range, uplo
integer(4) :: n, lda, ldb, il, iu, lwork
real(8) :: vl, vu
complex(8), device, dimension(lda,*) :: A
complex(8), device, dimension(ldb,*) :: B
integer(4), intent(out) :: meig
real(8), device, dimension(n) :: W
complex(8), device, dimension(lwork) :: workspace
integer(4), device, intent(out) :: devinfo
6.4. cusolverDn 64-bit API
This section describes the linear solver 64-bit API of cusolverDn, including Cholesky factorization, LU with partial pivoting, and QR factorization.
6.4.1. cusolverDnXpotrf_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXpotrf.
The type and kind of the A matrix are determined by the dataTypeA argument. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXpotrf_buffersize(handle, &
params, uplo, n, dataTypeA, A, lda, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: uplo
integer(8) :: n, lda
type(cudaDataType) :: dataTypeA, computeType
real, device, dimension(lda,*) :: A
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.2. cusolverDnXpotrf
This function computes the Cholesky factorization of a Hermitian positive-definite matrix.
The type and kind of the A matrix are determined by the dataTypeA argument. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXpotrf(handle, &
params, uplo, n, dataTypeA, A, lda, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: uplo
integer(8) :: n, lda
type(cudaDataType) :: dataTypeA, computeType
real, device, dimension(lda,*) :: A
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
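A hedged sketch of the 64-bit workflow follows (not from the Examples chapter; the size is arbitrary, the params object is assumed to be created with cusolverDnCreateParams, and the CUDA_R_64F and CUBLAS_FILL_MODE_UPPER constants are assumed to be supplied by the cusolverDn module). Both workspaces are sized in bytes and passed as integer(1) buffers.

program xpotrf_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer(8), parameter :: n = 2000_8
  type(cusolverDnHandle) :: h
  type(cusolverDnParams) :: params
  real(8), device :: A(n,n)
  integer(1), device, allocatable :: bufD(:)
  integer(1), allocatable :: bufH(:)
  integer(8) :: devBytes, hostBytes
  integer(4), device :: devinfo
  integer :: istat

  istat = cusolverDnCreate(h)
  istat = cusolverDnCreateParams(params)   ! assumed helper; creates the cusolverDnParams object
  ! ... store a symmetric positive-definite matrix in A on the device ...
  istat = cusolverDnXpotrf_buffersize(h, params, CUBLAS_FILL_MODE_UPPER, n, &
          CUDA_R_64F, A, n, CUDA_R_64F, devBytes, hostBytes)
  allocate(bufD(devBytes), bufH(hostBytes))
  istat = cusolverDnXpotrf(h, params, CUBLAS_FILL_MODE_UPPER, n, &
          CUDA_R_64F, A, n, CUDA_R_64F, bufD, devBytes, bufH, hostBytes, devinfo)
  deallocate(bufD, bufH)
  istat = cusolverDnDestroy(h)
end program xpotrf_sketch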
6.4.3. cusolverDnXpotrs
This function solves a system of linear equations using the Cholesky factorization of a Hermitian positive-definite matrix computed by cusolverDnXpotrf.
The types and kinds of the A and B matrices are determined by the dataTypeA and dataTypeB arguments, respectively. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXpotrs(handle, &
params, uplo, n, nrhs, dataTypeA, A, lda, &
dataTypeB, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: uplo
integer(8) :: n, nrhs, lda, ldb
type(cudaDataType) :: dataTypeA, dataTypeB
real, device, dimension(lda,*) :: A
real, device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
6.4.4. cusolverDnXgeqrf_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXgeqrf.
The types and kinds of the A matrix and the tau vector are determined by the dataTypeA and dataTypeTau arguments, respectively. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXgeqrf_buffersize(handle, &
params, m, n, dataTypeA, A, lda, dataTypeTau, tau, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(8) :: m, n, lda
type(cudaDataType) :: dataTypeA, dataTypeTau, computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: tau
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.5. cusolverDnXgeqrf
This function computes the QR factorization of a general mxn matrix.
The types and kinds of the A matrix and the tau vector are determined by the dataTypeA and dataTypeTau arguments, respectively. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXgeqrf(handle, &
params, m, n, dataTypeA, A, lda, dataTypeTau, tau, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(8) :: m, n, lda
type(cudaDataType) :: dataTypeA, dataTypeTau, computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: tau
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.6. cusolverDnXgetrf_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXgetrf.
The type and kind of the A matrix are determined by the dataTypeA argument. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXgetrf_buffersize(handle, &
params, m, n, dataTypeA, A, lda, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(8) :: m, n, lda
type(cudaDataType) :: dataTypeA, computeType
real, device, dimension(lda,*) :: A
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.7. cusolverDnXgetrf
This function computes the LU factorization of a general mxn matrix.
The type and kind of the A matrix are determined by the dataTypeA argument. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXgetrf(handle, &
params, m, n, dataTypeA, A, lda, devipiv, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(8) :: n, m, lda
type(cudaDataType) :: dataTypeA, computeType
real, device, dimension(lda,*) :: A
integer(8), device, dimension(*), intent(out) :: devipiv
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.8. cusolverDnXgetrs
This function solves a system of linear equations with multiple right-hand sides, using the LU factorization of a general nxn matrix computed by cusolverDnXgetrf.
The types and kinds of the A and B matrices are determined by the dataTypeA and dataTypeB arguments, respectively. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXgetrs(handle, &
params, trans, n, nrhs, dataTypeA, A, lda, devipiv, &
dataTypeB, B, ldb, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: trans
integer(8) :: n, nrhs, lda, ldb
type(cudaDataType) :: dataTypeA, dataTypeB
real, device, dimension(lda,*) :: A
integer(8), device, dimension(*), intent(in) :: devipiv
real, device, dimension(ldb,*) :: B
integer(4), device, intent(out) :: devinfo
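For a factor-and-solve sequence, the two LU routines are used together: cusolverDnXgetrf produces the factors and 64-bit pivot indices, which cusolverDnXgetrs then applies to the right-hand sides. The following sketch is hypothetical (arbitrary sizes; the params object, CUDA_R_64F, and CUBLAS_OP_N are assumed to be available as in the previous example).

program xgetrf_solve_sketch
  use cudafor
  use cusolverDn
  implicit none
  integer(8), parameter :: n = 4096_8, nrhs = 1_8
  type(cusolverDnHandle) :: h
  type(cusolverDnParams) :: params
  real(8), device :: A(n,n), B(n,nrhs)
  integer(8), device :: ipiv(n)
  integer(1), device, allocatable :: bufD(:)
  integer(1), allocatable :: bufH(:)
  integer(8) :: devBytes, hostBytes
  integer(4), device :: devinfo
  integer :: istat

  istat = cusolverDnCreate(h)
  istat = cusolverDnCreateParams(params)   ! assumed helper; creates the cusolverDnParams object
  ! ... fill A and B on the device ...
  istat = cusolverDnXgetrf_buffersize(h, params, n, n, CUDA_R_64F, A, n, &
          CUDA_R_64F, devBytes, hostBytes)
  allocate(bufD(devBytes), bufH(hostBytes))
  istat = cusolverDnXgetrf(h, params, n, n, CUDA_R_64F, A, n, ipiv, &
          CUDA_R_64F, bufD, devBytes, bufH, hostBytes, devinfo)
  ! solve A*X = B in place using the LU factors and pivots (no transpose)
  istat = cusolverDnXgetrs(h, params, CUBLAS_OP_N, n, nrhs, CUDA_R_64F, A, n, ipiv, &
          CUDA_R_64F, B, n, devinfo)
  deallocate(bufD, bufH)
  istat = cusolverDnDestroy(h)
end program xgetrf_solve_sketch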
6.4.9. cusolverDnXsyevd_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXsyevd.
The types and kinds of the A matrix and the W array are determined by the dataTypeA and dataTypeW arguments, respectively. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXsyevd_buffersize(handle, &
params, jobz, uplo, n, dataTypeA, A, lda, &
dataTypeW, W, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: jobz, uplo
integer(8) :: n, lda
type(cudaDataType) :: dataTypeA, dataTypeW, computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: W
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.10. cusolverDnXsyevd
This function computes the eigenvalues and eigenvectors of a symmetric (Hermitian) nxn matrix.
The types and kinds of the A matrix and the W array are determined by the dataTypeA and dataTypeW arguments, respectively. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXsyevd(handle, &
params, jobz, uplo, n, dataTypeA, A, lda, &
dataTypeW, W, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: jobz, uplo
integer(8) :: n, lda
type(cudaDataType) :: dataTypeA, dataTypeW, computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: W
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.11. cusolverDnXsyevdx_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXsyevdx.
The types and kinds of the A matrix and the W array are determined by the dataTypeA and dataTypeW arguments, respectively. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXsyevdx_buffersize(handle, &
params, jobz, range, uplo, n, dataTypeA, A, lda, &
vl, vu, il, iu, meig, &
dataTypeW, W, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: jobz, range, uplo
integer(8) :: n, lda, il, iu
type(cudaDataType) :: dataTypeA, dataTypeW, computeType
real, device, dimension(lda,*) :: A
real :: vl, vu
integer(8) :: meig
real, device, dimension(*) :: W
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.12. cusolverDnXsyevdx
This function computes all or some of the eigenvalues, and optionally the eigenvectors, of a symmetric (Hermitian) nxn matrix.
The types and kinds of the A matrix and the W array are determined by the dataTypeA and dataTypeW arguments, respectively. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXsyevdx(handle, &
params, jobz, range, uplo, n, dataTypeA, A, lda, &
vl, vu, il, iu, meig, &
dataTypeW, W, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: jobz, range, uplo
integer(8) :: n, lda, il, iu
type(cudaDataType) :: dataTypeA, dataTypeW, computeType
real, device, dimension(lda,*) :: A
real :: vl, vu
integer(8) :: meig
real, device, dimension(*) :: W
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.13. cusolverDnXsytrs_bufferSize
This function calculates the buffer sizes needed to solve a system of linear equations using the generic API cusolverDnXsytrs function.
integer function cusolverDnXsytrs_bufferSize(handle, &
uplo, n, nrhs, datatypeA, A, lda, &
devipiv, datatypeB, B, ldb, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
type(cudaDataType) :: dataTypeA, dataTypeB
integer(8) :: n, nrhs, lda, ldb
real, device, dimension(lda,*) :: A
integer(8), device :: devipiv(*)
real, device, dimension(ldb,*) :: B
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.14. cusolverDnXsytrs
This function solves a system of linear equations using the generic API.
integer function cusolverDnXsytrs(handle, &
uplo, n, nrhs, datatypeA, A, lda, &
devipiv, datatypeB, B, ldb, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo
type(cudaDataType) :: dataTypeA, dataTypeB
integer(8) :: n, nrhs, lda, ldb
real, device, dimension(lda,*) :: A
integer(8), device :: devipiv(*)
real, device, dimension(ldb,*) :: B
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.15. cusolverDnXtrtri_bufferSize
This function calculates the buffer sizes needed to compute the inverse of an upper or lower triangular matrix A using the generic API cusolverDnXtrtri function.
integer function cusolverDnXtrtri_bufferSize(handle, &
uplo, diag, n, datatypeA, A, lda, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
type(cudaDataType) :: dataTypeA
integer(8) :: n, lda
real, device, dimension(lda,*) :: A
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.16. cusolverDnXtrtri
This function computes the inverse of an upper or lower triangular matrix A using the generic API interface.
integer function cusolverDnXtrtri(handle, &
uplo, diag, n, datatypeA, A, lda, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
integer(4) :: uplo, diag
type(cudaDataType) :: dataTypeA
integer(8) :: n, lda
real, device, dimension(lda,*) :: A
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.17. cusolverDnXgesvd_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXgesvd.
The types and kinds of the A, U, and VT matrices and of the S array are determined by the associated dataType arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXgesvd_buffersize(handle, &
params, jobu, jobvt, m, n, dataTypeA, A, lda, &
dataTypeS, S, dataTypeU, U, ldu, &
dataTypeVT, VT, ldvt, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
character*1 :: jobu, jobvt
integer(8) :: m, n, lda, ldu, ldvt
type(cudaDataType) :: dataTypeA, dataTypeS, dataTypeU, dataTypeVT, &
computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: S
real, device, dimension(ldu,*) :: U
real, device, dimension(ldvt,*) :: VT
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.18. cusolverDnXgesvd
This function computes the singular value decomposition (SVD) of an mxn matrix.
The types and kinds of the A, U, and VT matrices and of the S array are determined by the associated dataType arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXgesvd(handle, &
params, jobu, jobvt, m, n, dataTypeA, A, lda, &
dataTypeS, S, dataTypeU, U, ldu, &
dataTypeVT, VT, ldvt, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
character*1 :: jobu, jobvt
integer(8) :: m, n, lda, ldu, ldvt
type(cudaDataType) :: dataTypeA, dataTypeS, dataTypeU, dataTypeVT, &
computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: S
real, device, dimension(ldu,*) :: U
real, device, dimension(ldvt,*) :: VT
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.19. cusolverDnXgesvdp_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXgesvdp.
The types and kinds of the A, U, and V matrices and of the S array are determined by the associated dataType arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXgesvdp_buffersize(handle, &
params, jobz, econ, m, n, dataTypeA, A, lda, &
dataTypeS, S, dataTypeU, U, ldu, &
dataTypeV, V, ldv, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: jobz, econ
integer(8) :: m, n, lda, ldu, ldv
type(cudaDataType) :: dataTypeA, dataTypeS, dataTypeU, dataTypeV, &
computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: S
real, device, dimension(ldu,*) :: U
real, device, dimension(ldv,*) :: V
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.20. cusolverDnXgesvdp
This function computes the singular value decomposition (SVD) of an mxn matrix. This variation combines polar decomposition and cusolverDnXsyevd to compute the SVD.
The types and kinds of the A, U, and V matrices and of the S array are determined by the associated dataType arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXgesvdp(handle, &
params, jobz, econ, m, n, dataTypeA, A, lda, &
dataTypeS, S, dataTypeU, U, ldu, &
dataTypeV, V, ldv, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
integer(4) :: jobz, econ
integer(8) :: m, n, lda, ldu, ldv
type(cudaDataType) :: dataTypeA, dataTypeS, dataTypeU, dataTypeV, &
computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: S
real, device, dimension(ldu,*) :: U
real, device, dimension(ldv,*) :: V
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.4.21. cusolverDnXgesvdr_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverDnXgesvdr.
The types and kinds of the A, Urand, and Vrand matrices and of the Srand array are determined by the associated dataType arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverDnXgesvdr_buffersize(handle, &
params, jobu, jobv, m, n, k, p, niters, dataTypeA, A, lda, &
dataTypeSrand, Srand, dataTypeUrand, Urand, ldurand, &
dataTypeVrand, Vrand, ldvrand, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
character*1 :: jobu, jobv
integer(8) :: m, n, k, p, niters, lda, ldurand, ldvrand
type(cudaDataType) :: dataTypeA, dataTypeSrand, dataTypeUrand, &
dataTypeVrand, computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: Srand
real, device, dimension(ldurand,*) :: Urand
real, device, dimension(ldvrand,*) :: Vrand
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.4.22. cusolverDnXgesvdr
This function computes the approximate rank-k singular value decomposition (SVD) of an mxn matrix.
The types and kinds of the A, Urand, and Vrand matrices and of the Srand array are determined by the associated dataType arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as n, m, and lda, will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be of any type and kind as long as an adequate amount of space, in bytes, is available.
integer function cusolverDnXgesvdr(handle, &
params, jobu, jobv, m, n, k, p, niters, dataTypeA, A, lda, &
dataTypeSrand, Srand, dataTypeUrand, Urand, ldurand, &
dataTypeVrand, Vrand, ldvrand, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverDnHandle) :: handle
type(cusolverDnParams) :: params
character*1 :: jobu, jobv
integer(8) :: m, n, k, p, niters, lda, ldurand, ldvrand
type(cudaDataType) :: dataTypeA, dataTypeSrand, dataTypeUrand, &
dataTypeVrand, computeType
real, device, dimension(lda,*) :: A
real, device, dimension(*) :: Srand
real, device, dimension(ldurand,*) :: Urand
real, device, dimension(ldvrand,*) :: Vrand
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
6.5. cusolverMp API
This section describes the distributed, multi-processor linear solvers supported in cusolverMp, along with supporting functions to set up and initialize the library.
The cusolverMp module contains the following parameter, derived type, and enumeration definitions:
! Definitions from cusolverMp.h
integer, parameter :: CUSOLVERMP_VER_MAJOR = 0
integer, parameter :: CUSOLVERMP_VER_MINOR = 4
integer, parameter :: CUSOLVERMP_VER_PATCH = 3
integer, parameter :: CUSOLVERMP_VER_BUILD = 0
integer, parameter :: CUSOLVERMP_VERSION = &
(CUSOLVERMP_VER_MAJOR * 1000 + CUSOLVERMP_VER_MINOR * 100 + CUSOLVERMP_VER_PATCH)
! Types from cusolverMp.h
TYPE cusolverMpHandle
TYPE(C_PTR) :: handle
END TYPE cusolverMpHandle
TYPE cusolverMpGrid
TYPE(C_PTR) :: handle
END TYPE cusolverMpGrid
TYPE cusolverMpMatrixDescriptor
TYPE(C_PTR) :: handle
END TYPE cusolverMpMatrixDescriptor
enum, bind(c)
enumerator :: CUDALIBMP_GRID_MAPPING_COL_MAJOR = 0
enumerator :: CUDALIBMP_GRID_MAPPING_ROW_MAJOR = 1
end enum
6.5.1. cusolverMpCreate
This function initializes the cusolverMp library and creates a handle to the cusolverMp library context. It must be called before other cuSolverMp API functions are invoked, but after a CUDA context and stream are created.
integer(4) function cusolverMpCreate(handle, deviceId, stream)
type(cusolverMpHandle) :: handle
integer(4) :: deviceId
integer(cuda_stream_kind) :: stream
6.5.2. cusolverMpDestroy
This function releases CPU-side resources used by the cuSolverMp library.
integer(4) function cusolverMpDestroy(handle)
type(cusolverMpHandle) :: handle
6.5.3. cusolverMpGetStream
This function gets the stream used by the cuSolverMp library to execute its routines.
integer(4) function cusolverMpGetStream(handle, stream)
type(cusolverMpHandle) :: handle
integer(cuda_stream_kind) :: stream
6.5.4. cusolverMpGetVersion
This function gets the version of the cuSolverMp library at runtime. The returned version has the value CUSOLVERMP_VER_MAJOR * 1000 + CUSOLVERMP_VER_MINOR * 100 + CUSOLVERMP_VER_PATCH.
integer(4) function cusolverMpGetVersion(handle, version)
type(cusolverMpHandle) :: handle
integer(4) :: version
6.5.5. cusolverMpCreateDeviceGrid
This function initializes the grid data structure used in the cusolverMp library. It takes a handle, a communicator, and other information related to the data layout as inputs.
integer(4) function cusolverMpCreateDeviceGrid(handle, &
grid, comm, numRowDevices, numColDevices, mapping)
type(cusolverMpHandle) :: handle
type(cusolverMpGrid) :: grid
type(cal_comm) :: comm
integer(4) :: numRowDevices, numColDevices
integer(4) :: mapping ! enum above, usually column major in Fortran
6.5.6. cusolverMpDestroyGrid
This function destroys the grid data structure and frees the resources used in the cusolverMp library.
integer(4) function cusolverMpDestroyGrid(grid)
type(cusolverMpGrid) :: grid
6.5.7. cusolverMpCreateMatrixDesc
This function initializes the matrix descriptor object used in the cusolverMp library. It takes the number of rows (M_A) and the number of columns (N_A) in the global array, along with the blocking factor over each dimension. RSRC_A and CSRC_A must currently be 0. LLD_A is the leading dimension of the local matrix, after blocking and distributing the matrix.
integer(4) function cusolverMpCreateMatrixDesc(descr, grid, &
dataType, M_A, N_A, MB_A, NB_A, RSRC_A, CSRC_A, LLD_A)
type(cusolverMpMatrixDescriptor) :: descr
type(cusolverMpGrid) :: grid
type(cudaDataType) :: dataType
integer(8) :: M_A, N_A, MB_A, NB_A
integer(4) :: RSRC_A, CSRC_A
integer(8) :: LLD_A
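As a hedged illustration of the typical setup order (create the handle, then the device grid, then a matrix descriptor), the fragment below assumes a CAL communicator comm has already been created for the participating processes, and that localDevice, stream, nprow, npcol, and the global and blocking sizes are set by the application. The CUDA_R_64F constant is used as an illustrative data type and is assumed to be available from the CUDA Fortran modules.
! Hedged setup sketch: handle -> device grid -> matrix descriptor.
type(cusolverMpHandle)           :: handle
type(cusolverMpGrid)             :: grid
type(cusolverMpMatrixDescriptor) :: descrA
integer(4) :: istat

istat = cusolverMpCreate(handle, localDevice, stream)
istat = cusolverMpCreateDeviceGrid(handle, grid, comm, nprow, npcol, &
                                   CUDALIBMP_GRID_MAPPING_COL_MAJOR)
! Global M_A x N_A matrix, MB_A x NB_A blocks, origin process (0,0).
! LLD_A is the local leading dimension (see cusolverMpNumROC below).
istat = cusolverMpCreateMatrixDesc(descrA, grid, CUDA_R_64F, &
                                   M_A, N_A, MB_A, NB_A, 0, 0, LLD_A)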
6.5.8. cusolverMpDestroyMatrixDesc
This function destroys the matrix descriptor data structure and frees the resources used in the cusolverMp library.
integer(4) function cusolverMpDestroyMatrixDesc(descr)
type(cusolverMpMatrixDescriptor) :: descr
6.5.9. cusolverMpNumROC
This utility function returns the number of rows or columns in the distributed matrix owned by the iproc
process.
integer(4) function cusolverMpNumROC(N, NB, iproc, isrcproc, nprocs)
integer(8), intent(in) :: N, NB
integer(4), intent(in) :: iproc, isrcproc, nprocs
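For example, the local storage sizes (and from them a local leading dimension) of a block-cyclically distributed matrix can be computed with this function. In this hedged fragment, myrow and mycol are assumed to be this process's coordinates in the nprow-by-npcol grid, M, N, MB, and NB are the global sizes and blocking factors, and the matrix is assumed to originate on process (0,0).
! Hedged sketch: local tile sizes owned by this process.
integer(4) :: localRows, localCols
localRows = cusolverMpNumROC(M, MB, myrow, 0, nprow)
localCols = cusolverMpNumROC(N, NB, mycol, 0, npcol)
LLD_A = max(1, localRows)   ! local leading dimension for cusolverMpCreateMatrixDesc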
6.5.10. cusolverMpGetrf_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpGetrf.
The type, kind, and rank of the A
matrix are ignored; they are already set by the descrA
argument. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGetrf_buffersize(handle, &
M, N, A, IA, JA, descrA, devipiv, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(8) :: M, N, IA, JA
real, device, dimension(*) :: A
type(cusolverMpMatrixDescriptor) :: descrA
integer(8), device, dimension(*) :: devipiv
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.11. cusolverMpGetrf
This function computes the LU factorization of a general mxn matrix.
The type, kind, and rank of the A
matrix are ignored; they are already set by the descrA
argument. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N
will be promoted to integer(8) by the compiler, so other integer kinds may be used. The workspaces can be any type and kind so long as an adequate amount of space in bytes is available.
integer function cusolverMpGetrf(handle, &
M, N, A, IA, JA, descrA, devipiv, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(8) :: M, N, IA, JA
real, device, dimension(*) :: A
type(cusolverMpMatrixDescriptor) :: descrA
integer(8), device, dimension(*) :: devipiv
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.12. cusolverMpGetrs_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpGetrs.
The type, kind, and rank of the A, B
matrices are ignored and are set by the descrA, descrB
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as N, NRHS
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGetrs_buffersize(handle, trans, &
N, NRHS, A, IA, JA, descrA, devipiv, B, IB, JB, descrB, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: trans
integer(8) :: N, NRHS, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
integer(8), device, dimension(*) :: devipiv
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.13. cusolverMpGetrs
This function solves a system of linear equations with multiple right-hand sides using the general nxn matrix that was LU-factored by cusolverMpGetrf.
The type, kind, and rank of the A, B
matrices are ignored and are set by the descrA, descrB
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as N, NRHS
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGetrs(handle, trans, &
N, NRHS, A, IA, JA, descrA, devipiv, B, IB, JB, descrB, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverMpHandle) :: handle
integer(4) :: trans
integer(8) :: N, NRHS, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
integer(8), device, dimension(*) :: devipiv
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
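Putting these pieces together, a hedged factor-and-solve fragment might look like the following. It assumes the handle, descriptors descrA and descrB, the distributed device arrays A and B, and the device pivot array devipiv have been set up as described above; the CUDA_R_64F compute type and the (1,1) global offsets are illustrative assumptions.
! Hedged sketch: query workspace sizes, allocate 1-byte buffers, then factor.
integer(8) :: devBytes, hostBytes
integer(1), device, allocatable :: workD(:)
integer(1), allocatable :: workH(:)
integer(4), device :: dinfo
integer(4) :: istat

! IA = JA = 1 selects the full global matrix starting at element (1,1).
istat = cusolverMpGetrf_buffersize(handle, M, N, A, 1_8, 1_8, descrA, &
          devipiv, CUDA_R_64F, devBytes, hostBytes)
allocate(workD(max(1_8,devBytes)), workH(max(1_8,hostBytes)))
istat = cusolverMpGetrf(handle, M, N, A, 1_8, 1_8, descrA, devipiv, &
          CUDA_R_64F, workD, devBytes, workH, hostBytes, dinfo)
! The solve phase has its own workspace query (cusolverMpGetrs_buffersize)
! before calling cusolverMpGetrs with the factored A and right-hand sides B.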
6.5.14. cusolverMpPotrf_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpPotrf.
The type, kind, and rank of the A
matrix are ignored; they are already set by the descrA
argument. The computeType is typically set to the same type. Array sizes and dimensions, such as N
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpPotrf_buffersize(handle, &
uplo, N, A, IA, JA, descrA, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: uplo
integer(8) :: N, IA, JA
real, device, dimension(*) :: A
type(cusolverMpMatrixDescriptor) :: descrA
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.15. cusolverMpPotrf
This function computes the Cholesky factorization of a Hermitian positive-definite matrix.
The type, kind, and rank of the A
matrix are ignored; they are already set by the descrA
argument. The computeType is typically set to the same type. Array sizes and dimensions, such as N
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpPotrf(handle, &
uplo, N, A, IA, JA, descrA, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: uplo
integer(8) :: N, IA, JA
real, device, dimension(*) :: A
type(cusolverMpMatrixDescriptor) :: descrA
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.16. cusolverMpPotrs_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpPotrs.
The type, kind, and rank of the A, B
matrices are ignored and are set by the descrA, descrB
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as N, NRHS
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpPotrs_buffersize(handle, uplo, &
N, NRHS, A, IA, JA, descrA, B, IB, JB, descrB, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: uplo
integer(8) :: N, NRHS, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.17. cusolverMpPotrs
This function solves the system of linear equations using the Cholesky factorization of a Hermitian positive-definite matrix computed by cusolverMpPotrf.
The type, kind, and rank of the A, B
matrices are ignored and are set by the descrA, descrB
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as N, NRHS
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpPotrs(handle, uplo, &
N, NRHS, A, IA, JA, descrA, B, IB, JB, descrB, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: uplo
integer(8) :: N, NRHS, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.18. cusolverMpOrmqr_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpOrmqr.
The type, kind, and rank of the A, C
matrices are ignored; they are already set by the descrA, descrC
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N, K
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpOrmqr_buffersize(handle, side, trans, &
M, N, K, A, IA, JA, descrA, tau, C, IC, JC, descrC, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: side, trans
integer(8) :: M, N, K, IA, JA, IC, JC
real, device, dimension(*) :: A, tau, C
type(cusolverMpMatrixDescriptor) :: descrA, descrC
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.19. cusolverMpOrmqr
This function generates the unitary matrix Q from the QR factorization of an mxn matrix and overwrites the array C, based on the side and trans arguments.
The type, kind, and rank of the A, C
matrices are ignored; they are already set by the descrA, descrC
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N, K
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpOrmqr(handle, side, trans, &
M, N, K, A, IA, JA, descrA, tau, C, IC, JC, descrC, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: side, trans
integer(8) :: M, N, K, IA, JA, IC, JC
real, device, dimension(*) :: A, tau, C
type(cusolverMpMatrixDescriptor) :: descrA, descrC
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.20. cusolverMpOrmtr_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpOrmtr.
The type, kind, and rank of the A, tau, C
matrices are ignored; they are already set by the descrA, descrC
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N, K
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpOrmtr_buffersize(handle, side, uplo, trans, &
M, N, A, IA, JA, descrA, tau, C, IC, JC, descrC, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: side, uplo, trans
integer(8) :: M, N, IA, JA, IC, JC
real, device, dimension(*) :: A, tau, C
type(cusolverMpMatrixDescriptor) :: descrA, descrC
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.21. cusolverMpOrmtr
This function generates the unitary matrix Q formed by a sequence of elementary reflection vectors and overwrites the array C, based on the side and trans arguments.
The type, kind, and rank of the A, C
matrices are ignored; they are already set by the descrA, descrC
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N, K
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpOrmtr(handle, side, uplo, trans, &
M, N, A, IA, JA, descrA, tau, C, IC, JC, descrC, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: side, uplo, trans
integer(8) :: M, N, IA, JA, IC, JC
real, device, dimension(*) :: A, tau, C
type(cusolverMpMatrixDescriptor) :: descrA, descrC
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.22. cusolverMpGels_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpGels.
The type, kind, and rank of the A, B
matrices are ignored and are set by the descrA, descrB
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as N, NRHS
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGels_buffersize(handle, trans, &
M, N, NRHS, A, IA, JA, descrA, B, IB, JB, descrB, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: trans
integer(8) :: M, N, NRHS, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.23. cusolverMpGels
This function solves overdetermined or underdetermined linear systems of the MxN matrix A, or its transpose, using QR or LQ factorization.
The type, kind, and rank of the A, B
matrices are ignored and are set by the descrA, descrB
arguments. The computeType is typically set to the same type. Array sizes and dimensions, such as N, NRHS
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGels(handle, trans, &
M, N, NRHS, A, IA, JA, descrA, B, IB, JB, descrB, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: trans
integer(8) :: M, N, NRHS, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.24. cusolverMpStedc_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpStedc.
The type, kind, and rank of the D, E, Q
matrices are ignored and are already set by the descrQ
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as N, IQ, JQ
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpStedc_buffersize(handle, compz, &
N, D, E, Q, IQ, JQ, descrQ, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
character :: compz
integer(8) :: N, IQ, JQ
real, device, dimension(*) :: D, E, Q
type(cusolverMpMatrixDescriptor) :: descrQ
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.25. cusolverMpStedc
This function computes all eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method.
The type, kind, and rank of the D, E, Q
matrices are ignored and are already set by the descrQ
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as N, IQ, JQ
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpStedc(handle, compz, &
N, D, E, Q, IQ, JQ, descrQ, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
character :: compz
integer(8) :: N, IQ, JQ
real, device, dimension(*) :: D, E, Q
type(cusolverMpMatrixDescriptor) :: descrQ
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.26. cusolverMpGeqrf_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpGeqrf.
The type, kind, and rank of the A
matrix are ignored; they are already set by the descrA
argument. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGeqrf_buffersize(handle, &
M, N, A, IA, JA, descrA, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(8) :: M, N, IA, JA
real, device, dimension(*) :: A
type(cusolverMpMatrixDescriptor) :: descrA
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.27. cusolverMpGeqrf
This function computes a QR factorization of an MxN matrix.
The type, kind, and rank of the A
matrix are ignored; they are already set by the descrA
argument. The computeType is typically set to the same type. Array sizes and dimensions, such as M, N, K
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpGeqrf(handle, &
M, N, A, IA, JA, descrA, tau, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(8) :: M, N, IA, JA
real, device, dimension(*) :: A, tau
type(cusolverMpMatrixDescriptor) :: descrA
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.28. cusolverMpSytrd_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpSytrd.
The type, kind, and rank of the A, D, E
matrix and vectors are ignored and are already set by the descrA
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as N, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSytrd_buffersize(handle, uplo, &
N, A, IA, JA, descrA, D, E, tau, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: uplo
integer(8) :: N, IA, JA
real, device, dimension(*) :: A, D, E, tau
type(cusolverMpMatrixDescriptor) :: descrA
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.29. cusolverMpSytrd
This function reduces a symmetric (Hermitian) NxN matrix to symmetric tridiagonal form.
The type, kind, and rank of the A, D, E
matrix and vectors are ignored and are already set by the descrA
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as N, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSytrd(handle, uplo, &
N, A, IA, JA, descrA, D, E, tau, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: uplo
integer(8) :: N, IA, JA
real, device, dimension(*) :: A, D, E, tau
type(cusolverMpMatrixDescriptor) :: descrA
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.30. cusolverMpSyevd_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpSyevd.
The type, kind, and rank of the A, D, Q
matrices are ignored and are already set by the descrA, descrQ
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as N, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSyevd_buffersize(handle, compz, &
uplo, N, A, IA, JA, descrA, D, Q, IQ, JQ, descrQ, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
character :: compz
integer(4) :: uplo
integer(8) :: N, IA, JA, IQ, JQ
real, device, dimension(*) :: A, D, Q
type(cusolverMpMatrixDescriptor) :: descrA, descrQ
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.31. cusolverMpSyevd
This function computes the eigenvalues and eigenvectors of a symmetric (Hermitian) nxn matrix.
The type, kind, and rank of the A, D, Q
matrices are ignored and are already set by the descrA, descrQ
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as N, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSyevd(handle, compz, &
uplo, N, A, IA, JA, descrA, D, Q, IQ, JQ, descrQ, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, devinfo)
type(cusolverMpHandle) :: handle
character :: compz
integer(4) :: uplo
integer(8) :: N, IA, JA, IQ, JQ
real, device, dimension(*) :: A, D, Q
type(cusolverMpMatrixDescriptor) :: descrA, descrQ
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: devinfo
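A hedged call sequence for the symmetric eigensolver is sketched below. It assumes the handle, the descriptors descrA and descrQ, and the distributed device arrays A, D, and Q have been set up as in the earlier subsections. The compz value 'V' (compute eigenvectors), the CUBLAS_FILL_MODE_LOWER constant for uplo (assumed to be available from the cublas module), the CUDA_R_64F compute type, and the (1,1) global offsets are all illustrative assumptions.
! Hedged sketch: query workspace sizes, allocate 1-byte buffers, then
! compute eigenvalues D and eigenvectors Q of the distributed matrix A.
integer(8) :: devBytes, hostBytes
integer(1), device, allocatable :: workD(:)
integer(1), allocatable :: workH(:)
integer(4), device :: dinfo
integer(4) :: istat

istat = cusolverMpSyevd_buffersize(handle, 'V', CUBLAS_FILL_MODE_LOWER, N, &
          A, 1_8, 1_8, descrA, D, Q, 1_8, 1_8, descrQ, CUDA_R_64F, &
          devBytes, hostBytes)
allocate(workD(max(1_8,devBytes)), workH(max(1_8,hostBytes)))
istat = cusolverMpSyevd(handle, 'V', CUBLAS_FILL_MODE_LOWER, N, &
          A, 1_8, 1_8, descrA, D, Q, 1_8, 1_8, descrQ, CUDA_R_64F, &
          workD, devBytes, workH, hostBytes, dinfo)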
6.5.32. cusolverMpSygst_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpSygst.
The types for the A, B
matrices are set by the descrA, descrB
arguments and not passed in to compute the buffer requirements. The computeType is typically set to the same type. Array offsets and dimensions, such as M, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSygst_buffersize(handle, ibtype, uplo, &
M, IA, JA, descrA, IB, JB, descrB, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: ibtype, uplo
integer(8) :: M, IA, JA, IB, JB
type(cusolverMpMatrixDescriptor) :: descrA, descrB
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.33. cusolverMpSygst
This function reduces a symmetric-definite generalized eigenproblem to standard form.
The type, kind, and rank of the A, B
matrices are ignored and are already set by the descrA, descrB
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as M, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSygst(handle, ibtype, uplo, &
M, A, IA, JA, descrA, B, IB, JB, descrB, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: ibtype, uplo
integer(8) :: M, IA, JA, IB, JB
real, device, dimension(*) :: A, B
type(cusolverMpMatrixDescriptor) :: descrA, descrB
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.34. cusolverMpSygvd_buffersize
This function calculates the buffer sizes needed for the host and device workspaces passed into cusolverMpSygvd.
The types for the A, B, Z
matrices are set by the descrA, descrB, descrZ
arguments and not passed in to compute the buffer requirements. The computeType is typically set to the same type. Array offsets and dimensions, such as M, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSygvd_buffersize(handle, ibtype, jobz, &
uplo, M, IA, JA, descrA, IB, JB, descrB, IZ, JZ, descrZ, computeType, &
workspaceInBytesOnDevice, workspaceInBytesOnHost)
type(cusolverMpHandle) :: handle
integer(4) :: ibtype, jobz, uplo
integer(8) :: M, IA, JA, IB, JB, IZ, JZ
type(cusolverMpMatrixDescriptor) :: descrA, descrB, descrZ
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
6.5.35. cusolverMpSygvd
This function computes all eigenvalues and eigenvectors of a symmetric (Hermitian) generalized eigenproblem.
The type, kind, and rank of the A, B, Z
matrices are ignored and are already set by the descrA, descrB, descrZ
argument. The computeType is typically set to the same type. Array offsets and dimensions, such as M, IA, JA
will be promoted to integer(8) by the compiler, so other integer kinds may be used.
integer function cusolverMpSygvd(handle, ibtype, jobz, &
uplo, M, A, IA, JA, descrA, B, IB, JB, descrB, &
W, Z, IZ, JZ, descrZ, computeType, &
bufferOnDevice, workspaceInBytesOnDevice, &
bufferOnHost, workspaceInBytesOnHost, info)
type(cusolverMpHandle) :: handle
integer(4) :: ibtype, jobz, uplo
integer(8) :: M, IA, JA, IB, JB, IZ, JZ
real, device, dimension(*) :: A, B, W, Z
type(cusolverMpMatrixDescriptor) :: descrA, descrB, descrZ
type(cudaDataType) :: computeType
integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
integer(1), device :: bufferOnDevice(workspaceInBytesOnDevice)
integer(1) :: bufferOnHost(workspaceInBytesOnHost)
integer(4), device, intent(out) :: info
6.5.36. cusolverMpLoggerSetFile
This function specifies the Fortran unit to be used as the cusolverMp logfile.
integer(4) function cusolverMpLoggerSetFile(unit)
integer :: unit
6.5.37. cusolverMpLoggerOpenFile
This function specifies a Fortran character string to be opened and used as the cusolverMp logfile.
integer(4) function cusolverMpLoggerOpenFile(logFile)
character*(*) :: logFile
6.5.38. cusolverMpLoggerSetLevel
This function specifies the cusolverMp logging level.
integer(4) function cusolverMpLoggerSetLevel(level)
integer :: level
6.5.39. cusolverMpLoggerSetMask
This function specifies the cusolverMp logging mask.
integer(4) function cusolverMpLoggerSetMask(mask)
integer :: mask
6.5.40. cusolverMpLoggerForceDisable
This function disables cusolverMp logging.
integer(4) function cusolverMpLoggerForceDisable()
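For example, logging can be directed to a file and given a verbosity level as in this hedged fragment; the file name and level value are illustrative, and the meaning of each level is defined by the cusolverMp documentation.
! Hedged sketch: send cusolverMp log output to a file with a chosen verbosity.
integer(4) :: istat
istat = cusolverMpLoggerOpenFile('cusolvermp.log')   ! illustrative file name
istat = cusolverMpLoggerSetLevel(5)                  ! illustrative level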
6.6. NVLAMATH Runtime Library
This section describes the NVLAmath runtime library, which enables offloading calls to standard LAPACK subroutines to the GPU by redirecting them to a wrapper around cuSOLVER functions.
In terms of functionality and performance, this library provides no additions to what is documented in the previous sections of this chapter. The most common use cases we see for NVLAmath are:
Drop-in acceleration. When running in a unified memory environment, NVLAmath allows recompile-and-run to offload LAPACK solvers to the GPU.
Quick-porting functionality which may or may not be performance critical, but where the data already resides on the GPU from other sections of the application, via CUDA, OpenACC, or OpenMP. Access to a GPU-enabled function avoids costly data movement back and forth between the GPU and CPU.
Keeping traditional code integrity. NVLAmath enables porting to the GPU with minimal code changes, for instance enabling 32-bit and 64-bit integers in the same source, and calling solver library entry points which are available on all targets.
The NVLAmath wrappers are enabled through the Fortran nvlamath module, which overloads the LAPACK subroutine names. There are two ways users can insert the module into their Fortran subprogram units. The first is non-invasive: add the compiler option -gpu=nvlamath
to the compile lines on the files containing LAPACK calls you wish to move to the GPU. The second way is more traditional, just add the use nvlamath
statement to the subroutines or functions that contain those LAPACK calls. If you use neither of these methods, the LAPACK calls will likely be resolved by the CPU LAPACK library, either the one we ship or the one you decide to use on your link line. In some cases, the NVLAmath wrappers might even call into that CPU library as well. In other words, there is no tampering with the CPU LAPACK symbol (library entry point) names.
To link the NVLAmath wrapper library, add -cudalib=nvlamath
to your link line. To use the 64-bit integer version of the library, either compile and link with the -i8
option, or add -cudalib=nvlamath_ilp64
to your link line.
6.6.1. NVLAMATH Automatic Drop-In Acceleration
In unified memory environments, the simplest method to move standard LAPACK calls to run on the GPU is through what we call automatic drop-in acceleration. This currently requires a GPU and operating system that support CUDA managed memory, and it requires that the arrays passed to the library functions are dynamically allocated.
Here is an example test program which calls dgetrf() and dgetrs():
program testdgetrf
integer, parameter :: M = 1000, N = M, NRHS=2
real(8), allocatable :: A(:,:), B(:,:)
integer, allocatable :: IPIV(:)
real(8), parameter :: eps = 1.0d-10
!
allocate(A(M,N),B(M,NRHS),IPIV(M))
call random_number(A)
do i = 1, M
A(i,i) = A(i,i) * 10.0d0
B(i,1) = sum(A(i,:))
B(i,2) = B(i,1) * 2.0d0
end do
!
lda = M; ldb = M
call dgetrf(M,N,A,lda,IPIV,info1)
call dgetrs('n',n,NRHS,A,lda,IPIV,B,ldb,info2)
!
if ((info1.ne.0) .or. (info2.ne.0)) then
print *,"test FAILED"
else if (any(abs(B(:,1)-1.0d0) .gt. eps) .or. &
any(abs(B(:,2)-2.0d0) .gt. eps)) then
print *,"test FAILED"
else
print *,"test PASSED"
end if
end
To compile and link this program for CPU execution using nvfortran, the simplest option would be nvfortran testdgetrf.f90 -llapack -lblas
. To enable acceleration on the GPU, the simplest option would be nvfortran -stdpar -gpu=nvlamath testdgetrf.f90 -cudalib=nvlamath
. Here, -stdpar
is one of the available methods to instruct the compiler and runtime to treat allocatable arrays as CUDA managed data. The -gpu=nvlamath
option instructs the compiler to insert the statement use nvhpc_nvlamath
into the program unit and then the two LAPACK calls are redirected to the wrappers. Finally, the -cudalib=nvlamath
linker option pulls in the NVLAmath wrapper runtime library.
6.6.2. NVLAMATH Usage From CUDA Fortran
From CUDA Fortran, usage of NVLAmath is straightforward. The interfaces in the nvhpc_nvlamath
module are similar in nature to all the other Fortran modules in this document. The library function device dummy arguments can be matched to device or managed actual arguments, or a combination of both, as used here.
This example code is written in CUDA Fortran. Note that all CUDA Fortran additions are behind the !@cuf
sentinel, or within a CUF kernel, which still allows this file to be compiled with any Fortran compiler.
program testdgetrf
!@cuf use nvhpc_nvlamath
!@cuf use cutensorex
integer, parameter :: M = 1000, N = M, NRHS=2
real(8), allocatable :: A(:,:), B(:,:)
integer, allocatable :: IPIV(:)
!@cuf attributes(device) :: A, IPIV
!@cuf attributes(managed) :: B
real(8), parameter :: eps = 1.0d-10
!
allocate(A(M,N),B(M,NRHS),IPIV(M))
call random_number(A)
!$cuf kernel do(1)<<<*,*>>>
do i = 1, M
A(i,i) = A(i,i) * 10.0d0
B(i,1) = sum(A(i,:))
B(i,2) = B(i,1) * 2.0d0
end do
!
lda = M; ldb = M
call dgetrf(M,N,A,lda,IPIV,info1)
call dgetrs('n',n,NRHS,A,lda,IPIV,B,ldb,info2)
!
if ((info1.ne.0) .or. (info2.ne.0)) then
print *,"test FAILED"
else if (any(abs(B(:,1)-1.0d0) .gt. eps) .or. &
any(abs(B(:,2)-2.0d0) .gt. eps)) then
print *,"test FAILED"
else
print *,"test PASSED"
end if
end
Here we use the nvhpc_nvlamath
module explicitly, and we also use the cutensorex
module, discussed elsewhere in this document, to generate random numbers on the GPU. Assuming this file has a .cuf extension, the simplest compile and link line to use with this file is nvfortran testdgetrf.cuf -cudalib=curand,nvlamath
.
6.6.3. NVLAMATH Usage From OpenACC
From OpenACC, the LAPACK functions are treated just like CUBLAS or CUSOLVER functions. Use the host_data use_device
directive to pass device pointers to the functions.
program testdgetrf
integer, parameter :: M = 1000, N = M, NRHS=2
real(8) :: A(M,N), B(M,NRHS)
integer :: IPIV(M)
real(8), parameter :: eps = 1.0d-10
!
call random_number(A)
!$acc data copyin(A), copyout(B), create(IPIV)
!$acc parallel loop
do i = 1, M
A(i,i) = A(i,i) * 10.0d0
B(i,1) = sum(A(i,:))
B(i,2) = B(i,1) * 2.0d0
end do
!
lda = M; ldb = M
!$acc host_data use_device(A, IPIV, B)
call dgetrf(M,N,A,lda,IPIV,info1)
call dgetrs('n',n,NRHS,A,lda,IPIV,B,ldb,info2)
!$acc end host_data
!$acc end data
!
if ((info1.ne.0) .or. (info2.ne.0)) then
print *,"test FAILED"
else if (any(abs(B(:,1)-1.0d0).gt.eps).or.any(abs(B(:,2)-2.0d0).gt.eps)) then
print *,"test FAILED"
else
print *,"test PASSED"
end if
end
Here we show static arrays rather than dynamic arrays, so we use OpenACC data directives to control the data movement. The simplest compile and link line to use with this file is nvfortran -acc=gpu -gpu=nvlamath testdgetrf.f90 -cudalib=nvlamath
.
6.6.4. NVLAMATH Usage From OpenMP
Usage of the NVLAmath functions from OpenMP is basically the same as from OpenACC. Use the target data use_device_ptr
directive to pass device pointers to the functions.
program testdgetrf
integer, parameter :: M = 1000, N = M, NRHS=2
real(8) :: A(M,N), B(M,NRHS)
integer :: IPIV(M)
real(8), parameter :: eps = 1.0d-10
!
call random_number(A)
!$omp target enter data map(to:A) map(alloc:B,IPIV)
!$omp target teams loop
do i = 1, M
A(i,i) = A(i,i) * 10.0d0
B(i,1) = sum(A(i,:))
B(i,2) = B(i,1) * 2.0d0
end do
!
lda = M; ldb = M
!$omp target data use_device_ptr(A, IPIV, B)
call dgetrf(M,N,A,lda,IPIV,info1)
call dgetrs('n',n,NRHS,A,lda,IPIV,B,ldb,info2)
!$omp end target data
!$omp target exit data map(from:B) map(delete:A,IPIV)
!
if ((info1.ne.0) .or. (info2.ne.0)) then
print *,"test FAILED"
else if (any(abs(B(:,1)-1.0d0).gt.eps).or.any(abs(B(:,2)-2.0d0).gt.eps)) then
print *,"test FAILED"
else
print *,"test PASSED"
end if
end
Again we show static arrays rather than dynamic arrays, so we use OpenMP data directives to control the data movement. The simplest compile and link line to use with this file is nvfortran -mp=gpu -gpu=nvlamath testdgetrf.f90 -cudalib=nvlamath
.
6.6.5. NVLAMATH List of Current Subroutines
Out of the thousands of LAPACK subroutines and functions, 31 of the most commonly used and requested subroutines are currently wrapped in this library.
| Category | Routine Class | S | C | D | Z |
|---|---|---|---|---|---|
| Linear Solvers | LU | sgetrf | cgetrf | dgetrf | zgetrf |
| | | sgetrs | cgetrs | dgetrs | zgetrs |
| | | sgesv | | | |
| | Cholesky | | | dpotrf | zpotrf |
| | | | | dpotrs | |
| Eigen Solvers | Eigen Solver | ssyev | | dsyev | zheev |
| | | ssyevd | cheevd | dsyevd | zheevd |
| | Partial Eigen Solver | ssyevx | cheevx | dsyevx | zheevx |
| | | ssyevr | | dsyevr | |
| | Eigen Solver for Generalized Problems | | | dsygv | zhegv |
| | | | | dsygvd | zhegvd |
| | Partial Eigen Solver for Generalized Problems | | | | zhegvx |
| | SVD | | | dgesvd | |
6.6.6. NVLAMATH Argument Checks and CPU Fallback
In large, complicated applications, the compiler (and developers) can lose track of the attributes of data arrays: whether they are host, managed, or device arrays. Note that the compiler's automatic conversion of arrays to CUDA managed data requires that they be allocatable. Consider the example from the drop-in case, but where the pivot array is declared statically.
program testdgetrf
integer, parameter :: M = 1000, N = M, NRHS=2
real(8), allocatable :: A(:,:), B(:,:)
integer :: IPIV(M)
real(8), parameter :: eps = 1.0d-10
!
allocate(A(M,N),B(M,NRHS))
call random_number(A)
do i = 1, M
A(i,i) = A(i,i) * 10.0d0
B(i,1) = sum(A(i,:))
B(i,2) = B(i,1) * 2.0d0
end do
!
lda = M; ldb = M
call dgetrf(M,N,A,lda,IPIV,info1)
call dgetrs('n',n,NRHS,A,lda,IPIV,B,ldb,info2)
!
if ((info1.ne.0) .or. (info2.ne.0)) then
print *,"test FAILED"
else if (any(abs(B(:,1)-1.0d0) .gt. eps) .or. &
any(abs(B(:,2)-2.0d0) .gt. eps)) then
print *,"test FAILED"
else
print *,"test PASSED"
end if
end
Now, we compile this as before, nvfortran -stdpar -gpu=nvlamath testdgetrf.f90 -cudalib=nvlamath
. To review, -stdpar
instructs the compiler and runtime to treat allocatable arrays as CUDA managed data. However, this time when the program is run:
** On entry to dgetrf parameter number 5 failed the pointer check
3: Accessible on GPU = T; Accessible on CPU = T
5: Accessible on GPU = F; Accessible on CPU = T
Fallback to CPU compute disabled.
Please (1) make sure that input arrays are accessible on the GPU; or (2) set
environment variable NVCOMPILER_LAMATH_FALLBACK=1 to fallback to CPU execution.
Terminate the application
Except when every array in the call is accessible only on the CPU, we assume that a developer using NVLAMATH intends to run the library function on the GPU; therefore, a mismatch in the GPU accessibility of the subroutine arguments is reported. As the error message states, you can continue through this error by setting the NVCOMPILER_LAMATH_FALLBACK
environment variable to 1.
** On entry to dgetrf parameter number 5 failed the pointer check
3: Accessible on GPU = T; Accessible on CPU = T
5: Accessible on GPU = F; Accessible on CPU = T
Fallback to CPU compute
** On entry to dgetrs parameter number 6 failed the pointer check
4: Accessible on GPU = T; Accessible on CPU = T
6: Accessible on GPU = F; Accessible on CPU = T
7: Accessible on GPU = T; Accessible on CPU = T
Fallback to CPU compute
test PASSED
Finally, we assume the developer would rather fix the issue by making the pivot array accessible on the GPU, either via CUDA Fortran device or managed attributes, OpenACC or OpenMP data directives, or changing it from statically declared to allocatable.
7. Tensor Primitives Runtime Library APIs
This section describes the Fortran interfaces to the CUDA cuTENSOR library. The cuTENSOR functions are only accessible from host code. Most of the runtime API routines, other than some utilities, are functions that return an error code; they return a value of CUTENSOR_STATUS_SUCCESS if the call was successful, or another cuTENSOR status return value if there was an error. Unlike earlier Fortran modules, we have created a cutensorStatus derived type for the return values. We have also overloaded the .eq.
and .ne.
logical operators for testing the return status.
Currently we provide two levels of Fortran interfaces to the cuTENSOR library: a low-level module which maps one-to-one to the C interfaces in cuTENSOR v2.0.0, and an experimental high-level module which maps several standard Fortran intrinsic functions to the functionality contained within the cuTENSOR library.
The cuTENSOR documentation for version 2.0 explains how to migrate your low-level code from version 1.x to 2.x. Briefly, the create functions cutensorCreateElementwiseBinary(), cutensorCreateElementwiseTrinary(), cutensorCreatePermutation(), cutensorCreateReduction(), and cutensorCreateContraction() create an Operation Descriptor type. You create a plan preference and a plan using cutensorCreatePlanPreference() and cutensorCreatePlan(), respectively. Some operations require a workspace, which you can query using cutensorEstimateWorkspaceSize(). At this point, you have a cuTENSOR Plan type which can be passed into one of the execute functions cutensorElementwiseBinaryExecute(), cutensorElementwiseTrinaryExecute(), cutensorPermute(), cutensorReduce(), or cutensorContract(), along with the actual pointers to the device data and scaling factors.
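This sequence can be sketched end to end for a simple out-of-place permutation. The CUDA Fortran program below is illustrative rather than definitive: the extents, mode labels, the 128-byte alignment value, and the choice of CUTENSOR_R_32F with cutensor_Compute_Desc_32F are assumptions, and error checking is shown for only one call.
program cutensor_permute_sketch
  use cudafor
  use cutensor_v2
  implicit none
  integer, parameter :: ni = 64, nj = 32
  real(4), device :: A(ni,nj), B(nj,ni)
  type(cutensorHandle) :: handle
  type(cutensorTensorDescriptor) :: descA, descB
  type(cutensorOperationDescriptor) :: opDesc
  type(cutensorPlanPreference) :: pref
  type(cutensorPlan) :: plan
  type(cutensorStatus) :: istat
  integer(8) :: extA(2), strA(2), extB(2), strB(2), wsSize
  integer(4) :: modeA(2), modeB(2)
  real(4) :: alpha
  integer(cuda_stream_kind) :: stream = 0

  A = 1.0; alpha = 2.5
  istat = cutensorCreate(handle)

  ! Column-major tensor descriptors: extents and strides are in elements.
  extA = [int(ni,8), int(nj,8)];  strA = [1_8, int(ni,8)]
  extB = [int(nj,8), int(ni,8)];  strB = [1_8, int(nj,8)]
  istat = cutensorCreateTensorDescriptor(handle, descA, 2, extA, strA, &
            CUTENSOR_R_32F, 128)
  istat = cutensorCreateTensorDescriptor(handle, descB, 2, extB, strB, &
            CUTENSOR_R_32F, 128)

  ! B(j,i) = alpha * A(i,j): label A's modes (1,2) and B's modes (2,1).
  modeA = [1, 2];  modeB = [2, 1]
  istat = cutensorCreatePermutation(handle, opDesc, descA, modeA, &
            CUTENSOR_OP_IDENTITY, descB, modeB, cutensor_Compute_Desc_32F)

  istat = cutensorCreatePlanPreference(handle, pref, CUTENSOR_ALGO_DEFAULT, &
            CUTENSOR_JIT_MODE_NONE)
  istat = cutensorEstimateWorkspaceSize(handle, opDesc, pref, &
            CUTENSOR_WORKSPACE_RECOMMENDED, wsSize)
  istat = cutensorCreatePlan(handle, plan, opDesc, pref, wsSize)

  istat = cutensorPermute(handle, plan, alpha, A, B, stream)
  if (istat .ne. CUTENSOR_STATUS_SUCCESS) &
    print *, trim(cutensorGetErrorString(istat))

  ! Release resources in reverse order of creation.
  istat = cutensorDestroyPlan(plan)
  istat = cutensorDestroyPlanPreference(pref)
  istat = cutensorDestroyOperationDescriptor(opDesc)
  istat = cutensorDestroyTensorDescriptor(descB)
  istat = cutensorDestroyTensorDescriptor(descA)
  istat = cutensorDestroy(handle)
end program
Compiled as CUDA Fortran (for example, with a .cuf suffix) and linked with -cudalib=cutensor, this follows the create, plan, execute, and destroy pattern described above.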
Chapter 12 contains examples of accessing the cuTENSOR library routines from OpenACC and CUDA Fortran. In both cases, the interfaces to the library can be exposed by adding the line
use cutensor_v2
to your program unit.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
7.1. CUTENSOR Definitions and Helper Functions
This section contains definitions and data types used in the cuTENSOR library and interfaces to the cuTENSOR helper functions.
Beginning with the NVHPC 24.1 release, the cuTENSOR version 2.x library is bundled in the release package. The version 2 API is significantly different from the previous 1.x releases, and the module name has been changed to avoid confusion. The new library interfaces can be exposed by adding the line
use cutensor_v2
to your program unit.
This cuTENSOR module contains the following derived type definitions:
! Definitions from cutensor.h
integer, parameter :: CUTENSOR_MAJOR = 2
integer, parameter :: CUTENSOR_MINOR = 0
integer, parameter :: CUTENSOR_PATCH = 0
! Types from cutensor/types.h
! Algorithm Control
type, bind(c) :: cutensorAlgo
integer(4) :: algo
end type
type(cutensorAlgo), parameter :: &
CUTENSOR_ALGO_DEFAULT_PATIENT = cutensorAlgo(-6), &
CUTENSOR_ALGO_GETT = cutensorAlgo(-4), &
CUTENSOR_ALGO_TGETT = cutensorAlgo(-3), &
CUTENSOR_ALGO_TTGT = cutensorAlgo(-2), &
CUTENSOR_ALGO_DEFAULT = cutensorAlgo(-1)
! Workspace Control
type, bind(c) :: cutensorWorksizePreference
integer(4) :: wksp
end type
type(cutensorWorksizePreference), parameter :: &
CUTENSOR_WORKSPACE_MIN = cutensorWorksizePreference(1), &
CUTENSOR_WORKSPACE_RECOMMENDED = cutensorWorksizePreference(2), &
CUTENSOR_WORKSPACE_MAX = cutensorWorksizePreference(3)
! Unary and Binary Element-wise Operations
type, bind(c) :: cutensorOperator
integer(4) :: opno
end type
type(cutensorOperator), parameter :: &
! Unary
CUTENSOR_OP_IDENTITY = cutensorOperator(1), & ! Identity operator
CUTENSOR_OP_SQRT = cutensorOperator(2), & ! Square root
CUTENSOR_OP_RELU = cutensorOperator(8), & ! Rectified linear unit
CUTENSOR_OP_CONJ = cutensorOperator(9), & ! Complex conjugate
CUTENSOR_OP_RCP = cutensorOperator(10), & ! Reciprocal
CUTENSOR_OP_SIGMOID = cutensorOperator(11), & ! y=1/(1+exp(-x))
CUTENSOR_OP_TANH = cutensorOperator(12), & ! y=tanh(x)
CUTENSOR_OP_EXP = cutensorOperator(22), & ! Exponentiation.
CUTENSOR_OP_LOG = cutensorOperator(23), & ! Log (base e).
CUTENSOR_OP_ABS = cutensorOperator(24), & ! Absolute value.
CUTENSOR_OP_NEG = cutensorOperator(25), & ! Negation.
CUTENSOR_OP_SIN = cutensorOperator(26), & ! Sine.
CUTENSOR_OP_COS = cutensorOperator(27), & ! Cosine.
CUTENSOR_OP_TAN = cutensorOperator(28), & ! Tangent.
CUTENSOR_OP_SINH = cutensorOperator(29), & ! Hyperbolic sine.
CUTENSOR_OP_COSH = cutensorOperator(30), & ! Hyperbolic cosine.
CUTENSOR_OP_ASIN = cutensorOperator(31), & ! Inverse sine.
CUTENSOR_OP_ACOS = cutensorOperator(32), & ! Inverse cosine.
CUTENSOR_OP_ATAN = cutensorOperator(33), & ! Inverse tangent.
CUTENSOR_OP_ASINH = cutensorOperator(34), & ! Inverse hyperbolic sine.
CUTENSOR_OP_ACOSH = cutensorOperator(35), & ! Inverse hyperbolic cosine.
CUTENSOR_OP_ATANH = cutensorOperator(36), & ! Inverse hyperbolic tangent.
CUTENSOR_OP_CEIL = cutensorOperator(37), & ! Ceiling.
CUTENSOR_OP_FLOOR = cutensorOperator(38), & ! Floor.
CUTENSOR_OP_MISH = cutensorOperator(39), & ! Mish y=x*tanh(softplus(x)).
CUTENSOR_OP_SWISH = cutensorOperator(40), & ! Swish y=x*sigmoid(x).
CUTENSOR_OP_SOFT_PLUS = cutensorOperator(41), & ! Softplus y=log(exp(x)+1).
CUTENSOR_OP_SOFT_SIGN = cutensorOperator(42), & ! Softsign y=x/(abs(x)+1).
! Binary
CUTENSOR_OP_ADD = cutensorOperator(3), & ! Addition of two elements
CUTENSOR_OP_MUL = cutensorOperator(5), & ! Multiplication of 2 elements
CUTENSOR_OP_MAX = cutensorOperator(6), & ! Maximum of two elements
CUTENSOR_OP_MIN = cutensorOperator(7), & ! Minimum of two elements
CUTENSOR_OP_UNKNOWN = cutensorOperator(126) ! reserved for internal use only
! Status Return Values
type, bind(c) :: cutensorStatus
integer(4) :: stat
end type
type(cutensorStatus), parameter :: &
! The operation completed successfully.
CUTENSOR_STATUS_SUCCESS = cutensorStatus(0), &
! The cuTENSOR library was not initialized.
CUTENSOR_STATUS_NOT_INITIALIZED = cutensorStatus(1), &
! Resource allocation failed inside the cuTENSOR library.
CUTENSOR_STATUS_ALLOC_FAILED = cutensorStatus(3), &
! An unsupported value or parameter was passed to the function.
CUTENSOR_STATUS_INVALID_VALUE = cutensorStatus(7), &
! Indicates that the device is either not ready,
! or the target architecture is not supported.
CUTENSOR_STATUS_ARCH_MISMATCH = cutensorStatus(8), &
! An access to GPU memory space failed, which is usually caused
! by a failure to bind a texture.
CUTENSOR_STATUS_MAPPING_ERROR = cutensorStatus(11), &
! The GPU program failed to execute. This is often caused by a
! launch failure of the kernel on the GPU, which can be caused by
! multiple reasons.
CUTENSOR_STATUS_EXECUTION_FAILED = cutensorStatus(13), &
! An internal cuTENSOR error has occurred.
CUTENSOR_STATUS_INTERNAL_ERROR = cutensorStatus(14), &
! The requested operation is not supported.
CUTENSOR_STATUS_NOT_SUPPORTED = cutensorStatus(15), &
! The functionality requested requires some license and an error
! was detected when trying to check the current licensing.
CUTENSOR_STATUS_LICENSE_ERROR = cutensorStatus(16), &
! A call to CUBLAS did not succeed.
CUTENSOR_STATUS_CUBLAS_ERROR = cutensorStatus(17), &
! Some unknown CUDA error has occurred.
CUTENSOR_STATUS_CUDA_ERROR = cutensorStatus(18), &
! The provided workspace was insufficient.
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE = cutensorStatus(19), &
! Indicates that the driver version is insufficient.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER = cutensorStatus(20), &
! Indicates an error related to file I/O
CUTENSOR_STATUS_IO_ERROR = cutensorStatus(21)
! Data Type
type, bind(c) :: cutensorDataType
integer(4) :: cudaDataType
end type
type(cutensorDataType), parameter :: &
CUTENSOR_R_16F = cutensorDataType(2), & ! real as a half
CUTENSOR_C_16F = cutensorDataType(6), & ! complex as a pair of half numbers
CUTENSOR_R_16BF = cutensorDataType(14), & ! real as a nv_bfloat16
CUTENSOR_C_16BF = cutensorDataType(15), & ! complex as a pair of nv_bfloat16 numbers
CUTENSOR_R_32F = cutensorDataType(0), & ! real as a float
CUTENSOR_C_32F = cutensorDataType(4), & ! complex as a pair of float numbers
CUTENSOR_R_64F = cutensorDataType(1), & ! real as a double
CUTENSOR_C_64F = cutensorDataType(5), & ! complex as a pair of double numbers
CUTENSOR_R_4I = cutensorDataType(16), & ! real as a signed 4-bit int
CUTENSOR_C_4I = cutensorDataType(17), & ! complex as a pair of signed 4-bit int numbers
CUTENSOR_R_4U = cutensorDataType(18), & ! real as a unsigned 4-bit int
CUTENSOR_C_4U = cutensorDataType(19), & ! complex as a pair of unsigned 4-bit int numbers
CUTENSOR_R_8I = cutensorDataType(3), & ! real as a signed 8-bit int
CUTENSOR_C_8I = cutensorDataType(7), & ! complex as a pair of signed 8-bit int numbers
CUTENSOR_R_8U = cutensorDataType(8), & ! real as a unsigned 8-bit int
CUTENSOR_C_8U = cutensorDataType(9), & ! complex as a pair of unsigned 8-bit int numbers
CUTENSOR_R_16I = cutensorDataType(20), & ! real as a signed 16-bit int
CUTENSOR_C_16I = cutensorDataType(21), & ! complex as a pair of signed 16-bit int numbers
CUTENSOR_R_16U = cutensorDataType(22), & ! real as a unsigned 16-bit int
CUTENSOR_C_16U = cutensorDataType(23), & ! complex as a pair of unsigned 16-bit int numbers
CUTENSOR_R_32I = cutensorDataType(10), & ! real as a signed 32-bit int
CUTENSOR_C_32I = cutensorDataType(11), & ! complex as a pair of signed 32-bit int numbers
CUTENSOR_R_32U = cutensorDataType(12), & ! real as a unsigned 32-bit int
CUTENSOR_C_32U = cutensorDataType(13), & ! complex as a pair of unsigned 32-bit int numbers
CUTENSOR_R_64I = cutensorDataType(24), & ! real as a signed 64-bit int
CUTENSOR_C_64I = cutensorDataType(25), & ! complex as a pair of signed 64-bit int numbers
CUTENSOR_R_64U = cutensorDataType(26), & ! real as a unsigned 64-bit int
CUTENSOR_C_64U = cutensorDataType(27) ! complex as a pair of unsigned 64-bit int numbers
! Compute Type, no longer used, will be removed
type, bind(c) :: cutensorComputeType
integer(4) :: ctyp
end type
type(cutensorComputeType), parameter :: &
CUTENSOR_R_MIN_16F = cutensorComputeType(2**0), & ! real as a half
CUTENSOR_C_MIN_16F = cutensorComputeType(2**1), & ! complex as a half
CUTENSOR_R_MIN_32F = cutensorComputeType(2**2), & ! real as a float
CUTENSOR_C_MIN_32F = cutensorComputeType(2**3), & ! complex as a float
CUTENSOR_R_MIN_64F = cutensorComputeType(2**4), & ! real as a double
CUTENSOR_C_MIN_64F = cutensorComputeType(2**5), & ! complex as a double
CUTENSOR_R_MIN_8U = cutensorComputeType(2**6), & ! real as a uint8
CUTENSOR_R_MIN_32U = cutensorComputeType(2**7), & ! real as a uint32
CUTENSOR_R_MIN_8I = cutensorComputeType(2**8), & ! real as a int8
CUTENSOR_R_MIN_32I = cutensorComputeType(2**9), & ! real as a int32
CUTENSOR_R_MIN_16BF = cutensorComputeType(2**10), & ! real as a bfloat16
CUTENSOR_R_MIN_TF32 = cutensorComputeType(2**11), & ! real as a tf32
CUTENSOR_C_MIN_TF32 = cutensorComputeType(2**12) ! complex as a tf32
! Operation Descriptor attribute
type, bind(c) :: cutensorOperationDescriptorAttribute
integer(4) :: attr
end type
type(cutensorOperationDescriptorAttribute), parameter :: &
CUTENSOR_OPERATION_DESCRIPTOR_TAG = &
cutensorOperationDescriptorAttribute(0), & ! integer(4)
CUTENSOR_OPERATION_DESCRIPTOR_SCALAR_TYPE = &
cutensorOperationDescriptorAttribute(1), & ! cutensorDataType
CUTENSOR_OPERATION_DESCRIPTOR_FLOPS = &
cutensorOperationDescriptorAttribute(2), & ! real(4)
CUTENSOR_OPERATION_DESCRIPTOR_MOVED_BYTES = &
cutensorOperationDescriptorAttribute(3), & ! real(4)
CUTENSOR_OPERATION_DESCRIPTOR_PADDING_LEFT = &
cutensorOperationDescriptorAttribute(4), & ! integer(4)
CUTENSOR_OPERATION_DESCRIPTOR_PADDING_RIGHT = &
cutensorOperationDescriptorAttribute(5), & ! integer(4)
CUTENSOR_OPERATION_DESCRIPTOR_PADDING_VALUE = &
cutensorOperationDescriptorAttribute(6) ! type(c_ptr)
! Plan Preference attribute
type, bind(c) :: cutensorPlanPreferenceAttribute
integer(4) :: attr
end type
type(cutensorPlanPreferenceAttribute), parameter :: &
CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE = &
cutensorPlanPreferenceAttribute(0), & ! cutensorAutotuneMode type
CUTENSOR_PLAN_PREFERENCE_CACHE_MODE = &
cutensorPlanPreferenceAttribute(1), & ! cutensorCacheMode type
CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT = &
cutensorPlanPreferenceAttribute(2), & ! integer(4)
CUTENSOR_PLAN_PREFERENCE_ALGO = &
cutensorPlanPreferenceAttribute(3), & ! cutensorAlgo type
CUTENSOR_PLAN_PREFERENCE_KERNEL_RANK = &
cutensorPlanPreferenceAttribute(4), & ! integer(4)
CUTENSOR_PLAN_PREFERENCE_JIT = &
cutensorPlanPreferenceAttribute(5) ! cutensorJitMode type
! Plan Info attribute
type, bind(c) :: cutensorPlanInfoAttribute
integer(4) :: attr
end type
type(cutensorPlanInfoAttribute), parameter :: &
! Pass integer(8), return exact required workspace in bytes needed
CUTENSOR_PLAN_REQUIRED_WORKSPACE = cutensorPlanInfoAttribute(0)
! Autotune Mode
type, bind(c) :: cutensorAutotuneMode
integer(4) :: mode
end type
type(cutensorAutotuneMode), parameter :: &
CUTENSOR_AUTOTUNE_MODE_NONE = cutensorAutotuneMode(0), &
CUTENSOR_AUTOTUNE_MODE_INCREMENTAL = cutensorAutotuneMode(1)
! JIT Mode
type, bind(c) :: cutensorJitMode
integer(4) :: mode
end type
type(cutensorJitMode), parameter :: &
CUTENSOR_JIT_MODE_NONE = cutensorJitMode(0), &
CUTENSOR_JIT_MODE_DEFAULT = cutensorJitMode(1)
! Cache Mode
type, bind(c) :: cutensorCacheMode
integer(4) :: mode
end type
type(cutensorCacheMode), parameter :: &
CUTENSOR_CACHE_MODE_NONE = cutensorCacheMode(0), &
CUTENSOR_CACHE_MODE_PEDANTIC = cutensorCacheMode(1)
! New 2.0 handle types
type cutensorHandle
TYPE(C_PTR) :: handle
end type
type cutensorTensorDescriptor
TYPE(C_PTR) :: desc
end type
type cutensorOperationDescriptor
TYPE(C_PTR) :: desc
end type
type cutensorComputeDescriptor
TYPE(C_PTR) :: desc
end type
type cutensorPlan
TYPE(C_PTR) :: desc
end type
type cutensorPlanPreference
TYPE(C_PTR) :: desc
end type
! These are global C symbols in libcutensor
type(cutensorComputeDescriptor), &
bind(C, name='CUTENSOR_COMPUTE_DESC_16F') :: cutensor_Compute_Desc_16F
type(cutensorComputeDescriptor), &
bind(C, name='CUTENSOR_COMPUTE_DESC_16BF') :: cutensor_Compute_Desc_16BF
type(cutensorComputeDescriptor), &
bind(C, name='CUTENSOR_COMPUTE_DESC_TF32') :: cutensor_Compute_Desc_TF32
type(cutensorComputeDescriptor), &
bind(C, name='CUTENSOR_COMPUTE_DESC_3XTF32') :: cutensor_Compute_Desc_3XTF32
type(cutensorComputeDescriptor), &
bind(C, name='CUTENSOR_COMPUTE_DESC_32F') :: cutensor_Compute_Desc_32F
type(cutensorComputeDescriptor), &
bind(C, name='CUTENSOR_COMPUTE_DESC_64F') :: cutensor_Compute_Desc_64F
7.1.1. cutensorCreate
This function initializes the cuTENSOR library and returns a handle for subsequent cuTENSOR calls.
type(cutensorStatus) function cutensorCreate(handle)
type(cutensorHandle) :: handle
7.1.2. cutensorDestroy
This function releases the resources used by the cuTENSOR library. This function is usually the last call with a particular handle to the cuTENSOR API.
type(cutensorStatus) function cutensorDestroy(handle)
type(cutensorHandle) :: handle
7.1.3. cutensorCreateTensorDescriptor
This function creates and initializes a cuTENSOR descriptor, given the number of modes, extents, strides, and type of the data.
type(cutensorStatus) function cutensorCreateTensorDescriptor(handle, desc, &
numModes, extent, stride, dataType, alignmentRequirement)
type(cutensorHandle) :: handle
type(cutensorTensorDescriptor) :: desc
integer(4) :: numModes
integer(8), dimension(*) :: extent
integer(8), dimension(*) :: stride
type(cutensorDataType) :: dataType
integer(4) :: alignmentRequirement
7.1.4. cutensorDestroyTensorDescriptor
This function deletes the cuTENSOR tensor descriptor and frees the resources associated with it.
type(cutensorStatus) function cutensorDestroyTensorDescriptor(desc)
type(cutensorTensorDescriptor) :: desc
7.1.5. cutensorOperationDescriptorGetAttribute
This function retrieves an attribute of the Operation Descriptor type.
type(cutensorStatus) function cutensorOperationDescriptorGetAttribute(handle, &
desc, attr, buf, sizeInBytes)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorOperationDescriptorAttribute) :: attr
integer(4) :: buf(*) ! Any type, of size sizeInBytes
integer(8) :: sizeInBytes
7.1.6. cutensorOperationDescriptorSetAttribute
This function sets an attribute of an Operation Descriptor type.
type(cutensorStatus) function cutensorOperationDescriptorSetAttribute(handle, &
desc, attr, buf, sizeInBytes)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorOperationDescriptorAttribute) :: attr
integer(4) :: buf(*) ! Any type, of size sizeInBytes
integer(8) :: sizeInBytes
7.1.7. cutensorDestroyOperationDescriptor
This function frees the resources related to the Operation Descriptor type.
type(cutensorStatus) function cutensorDestroyOperationDescriptor(desc)
type(cutensorOperationDescriptor) :: desc
7.1.8. cutensorCreatePlanPreference
This function allocates and sets a cuTENSOR plan preference type.
type(cutensorStatus) function cutensorCreatePlanPreference(handle, pref, algo, &
jitMode)
type(cutensorHandle) :: handle
type(cutensorPlanPreference) :: pref
type(cutensorAlgo) :: algo
type(cutensorJitMode) :: jitMode
7.1.9. cutensorDestroyPlanPreference
This function frees the resources associated with a plan preference type.
type(cutensorStatus) function cutensorDestroyPlanPreference(pref)
type(cutensorPlanPreference) :: pref
7.1.10. cutensorPlanPreferenceSetAttribute
This function sets an attribute of a plan preference type.
type(cutensorStatus) function cutensorPlanPreferenceSetAttribute(handle, pref, &
attr, buf, sizeInBytes)
type(cutensorHandle) :: handle
type(cutensorPlanPreference) :: pref
type(cutensorPlanPreferenceAttribute) :: attr
integer(4) :: buf(*) ! Any type, of size sizeInBytes
integer(8) :: sizeInBytes
7.1.11. cutensorEstimateWorkspaceSize
This function determines the size of the required workspace for a given operation and preferences.
type(cutensorStatus) function cutensorEstimateWorkspaceSize(handle, desc, &
pref, workspacePref, workspaceSizeEstimate)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorPlanPreference) :: pref
type(cutensorWorksizePreference) :: workspacePref
integer(8) :: workspaceSizeEstimate
7.1.12. cutensorCreatePlan
This function allocates and sets a cuTENSOR plan type.
type(cutensorStatus) function cutensorCreatePlan(handle, plan, &
desc, pref, workspaceSizeLimit)
type(cutensorHandle) :: handle
type(cutensorPlan), intent(out) :: plan
type(cutensorOperationDescriptor) :: desc
type(cutensorPlanPreference) :: pref
integer(8) :: workspaceSizeLimit
7.1.13. cutensorDestroyPlan
This function frees the resources associated with a cuTENSOR plan type.
type(cutensorStatus) function cutensorDestroyPlan(plan)
type(cutensorPlan) :: plan
7.1.14. cutensorGetErrorString
This function returns the description string for an error code.
character*128 function cutensorGetErrorString(ierr)
type(cutensorStatus) :: ierr
7.1.15. cutensorGetVersion
This function returns the version number of the cuTENSOR library.
integer(8) function cutensorGetVersion()
7.1.16. cutensorGetCudartVersion
This function returns the version of the CUDA runtime that the cuTENSOR library was compiled against.
integer(8) function cutensorGetCudartVersion()
7.2. CUTENSOR Element-wise Operations
This section contains interfaces for the cuTENSOR functions that perform element-wise operations between tensors.
7.2.1. cutensorCreatePermutation
This function creates an out-of-place tensor permutation operator of the form
B = alpha * opA(perm(A))
The permutation operation information is stored in the intent(out) desc argument. The arrays A, B can be of any supported type, kind, and rank. The permutations of A, B are set up using the mode arguments.
type(cutensorStatus) function cutensorCreatePermutation(handle, desc, &
descA, modeA, opA, descB, modeB, descCompute)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor), intent(out) :: desc
type(cutensorTensorDescriptor) :: descA, descB
integer(4), dimension(*) :: modeA, modeB
type(cutensorOperator) :: opA
type(cutensorComputeDescriptor) :: descCompute
7.2.2. cutensorPermute
This function executes an out-of-place tensor permutation of the form
B = alpha * opA(perm(A))
The type and kind of the alpha scalar are determined by the compute descriptor in the call to cutensorCreatePermutation(). The arrays A, B can be of any supported type, kind, and rank. The permutations of A, B are set up using the mode arguments. The operation opA(perm(A)) is set up in the call to cutensorCreatePermutation() via the opA argument.
type(cutensorStatus) function cutensorPermute(handle, plan, &
alpha, A, B, stream)
type(cutensorHandle) :: handle
type(cutensorPlan) :: plan
real :: alpha
real, device, dimension(*) :: A, B
integer(kind=cuda_stream_kind) :: stream
7.2.3. cutensorCreateElementwiseBinary
This function creates an element-wise tensor operation on two inputs of the form
D=opAC(alpha*op(perm(A)),gamma*op(perm(C)))
The opA, opC
arguments are an element-wise unary operator. The opAC
argument is an element-wise binary operator. The arrays A, C, D
can be of any supported type, kind, and rank. The permutations of A, C, D
are set up using the mode arguments.
type(cutensorStatus) function cutensorCreateElementwiseBinary(handle, desc, &
descA, modeA, opA, descC, modeC, opC, &
descD, modeD, opAC, descCompute)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorTensorDescriptor) :: descA, descC, descD
integer(4), dimension(*) :: modeA, modeC, modeD
type(cutensorOperator) :: opA, opC, opAC
type(cutensorComputeDescriptor) :: descCompute
7.2.4. cutensorElementwiseBinaryExecute
This function executes an element-wise tensor operation on two inputs of the form
D=opAC(alpha*op(perm(A)),gamma*op(perm(C)))
The specified plan is created by a call to cutensorCreatePlan(). The arrays A, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, gamma scalars are determined by the compute descriptor in the call to cutensorCreateElementwiseBinary().
type(cutensorStatus) function cutensorElementwiseBinaryExecute(handle, &
plan, alpha, A, gamma, C, D, stream)
type(cutensorHandle) :: handle
type(cutensorPlan) :: plan
real :: alpha, gamma ! Any compatible type and kind
real, device, dimension(*) :: A, C, D
integer(kind=cuda_stream_kind) :: stream
7.2.5. cutensorCreateElementwiseTrinary
This function creates an element-wise tensor operation on three inputs of the form
D=opABC(opAB(alpha*op(perm(A)),beta*op(perm(B))),gamma*op(perm(C)))
The opA, opB, opC arguments are element-wise unary operators. The opAB argument is an element-wise binary operator. The arrays A, B, C, D can be of any supported type, kind, and rank. The permutations of A, B, C, D are set up using the mode arguments.
type(cutensorStatus) function cutensorCreateElementwiseTrinary(handle, desc, &
descA, modeA, opA, descB, modeB, opB, descC, modeC, opC, &
descD, modeD, opAB, descCompute)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorTensorDescriptor) :: descA, descB, descC, descD
integer(4), dimension(*) :: modeA, modeB, modeC, modeD
type(cutensorOperator) :: opA, opB, opC, opAB
type(cutensorComputeDescriptor) :: descCompute
7.2.6. cutensorElementwiseTrinaryExecute
This function executes an element-wise tensor operation on three inputs of the form
D=opABC(opAB(alpha*op(perm(A)),beta*op(perm(B))),gamma*op(perm(C)))
The specified plan is created by a call to cutensorCreatePlan(). The arrays A, B, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, beta, gamma scalars are determined by the compute descriptor in the call to cutensorCreateElementwiseTrinary().
type(cutensorStatus) function cutensorElementwiseTrinaryExecute(handle, plan, &
alpha, A, beta, B, gamma, C, D, stream)
type(cutensorHandle) :: handle
type(cutensorPlan) :: plan
real :: alpha, beta, gamma ! Any compatible type and kind
real, device, dimension(*) :: A, B, C, D
integer(kind=cuda_stream_kind) :: stream
7.3. CUTENSOR Reduction Operations
This section contains interfaces for the cuTENSOR functions that perform reduction operations on tensors.
7.3.1. cutensorCreateReduction
This function creates a reduction tensor operation of the form:
D = alpha * opReduce(opA(A)) + beta * opC(C)
The opA, opC arguments are element-wise unary operators. The opReduce argument is a binary operator. The arrays A, C, D can be of any supported type, kind, and rank. The permutations of A, C are set up using the mode arguments.
type(cutensorStatus) function cutensorCreateReduction(handle, desc, &
descA, modeA, opA, descC, modeC, opC, &
descD, modeD, opReduce, descCompute)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorTensorDescriptor) :: descA, descC, descD
integer(4), dimension(*) :: modeA, modeC, modeD
type(cutensorOperator) :: opA, opC, opReduce
type(cutensorComputeDescriptor) :: descCompute
7.3.2. cutensorReduce
This function executes a reduction of the form:
D = alpha * opReduce(opA(A)) + beta * opC(C)
The specified plan is created by a call to cutensorCreatePlan(). The arrays A, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, beta scalars are determined by the compute descriptor in the call to cutensorCreateReduction().
type(cutensorStatus) function cutensorReduce(handle, plan, &
alpha, A, beta, C, D, workspace, workspaceSize, stream)
type(cutensorHandle) :: handle
type(cutensorPlan) :: plan
real :: alpha, beta ! Any compatible type and kind
real, device, dimension(*) :: A, C, D
real, device, dimension(*) :: workspace ! Any type
integer(8) :: workspaceSize
integer(kind=cuda_stream_kind) :: stream
7.4. CUTENSOR Contraction Operations
This section contains interfaces for the cuTENSOR functions that perform contraction operations between tensors.
7.4.1. cutensorCreateContraction
This function creates a contraction tensor operation of the form:
D = alpha*(AxB) + beta*C
The opA, opB, opC arguments are element-wise unary operators. The arrays A, B, C, D can be of any supported type, kind, and rank. The permutations of A, B, C are set up using the mode arguments.
type(cutensorStatus) function cutensorCreateContraction(handle, desc, &
descA, modeA, opA, descB, modeB, opB, &
descC, modeC, opC, descD, modeD, descCompute)
type(cutensorHandle) :: handle
type(cutensorOperationDescriptor) :: desc
type(cutensorTensorDescriptor) :: descA, descB, descC, descD
integer(4), dimension(*) :: modeA, modeB, modeC, modeD
type(cutensorOperator) :: opA, opB, opC
type(cutensorComputeDescriptor) :: descCompute
7.4.2. cutensorContract
This function executes a contraction of the form:
D = alpha*(AxB) + beta*C
The specified plan is created by a call to cutensorCreatePlan(). The arrays A, B, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, beta scalars are determined by the compute descriptor in the call to cutensorCreateContraction().
type(cutensorStatus) function cutensorContract(handle, plan, &
alpha, A, B, beta, C, D, workspace, workspaceSize, stream)
type(cutensorHandle) :: handle
type(cutensorPlan) :: plan
real :: alpha, beta ! Any compatible type and kind
real, device, dimension(*) :: A, B, C, D
real, device, dimension(*) :: workspace ! Any type
integer(8) :: workspaceSize
integer(kind=cuda_stream_kind) :: stream
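To show how the pieces above fit together, here is a hedged sketch of a real(4) matrix-matrix contraction D = alpha*(AxB) + beta*C using the interfaces in this chapter. The enumeration constants (CUTENSOR_R_32F, CUTENSOR_OP_IDENTITY, CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE, CUTENSOR_WORKSPACE_DEFAULT) are assumed to follow the C API spellings, and error checking is omitted for brevity.
! Sketch only: constant names are assumed to mirror the cuTENSOR C API
use cutensor
use cudafor
integer, parameter :: m = 96, n = 96, k = 64
real(4), device :: A(m,k), B(k,n), C(m,n), D(m,n)
real(4), device, allocatable :: work(:)
type(cutensorHandle) :: handle
type(cutensorTensorDescriptor) :: descA, descB, descC
type(cutensorOperationDescriptor) :: desc
type(cutensorPlanPreference) :: pref
type(cutensorPlan) :: plan
type(cutensorStatus) :: istat
integer(4) :: modeA(2), modeB(2), modeC(2)
integer(8) :: extA(2), extB(2), extC(2), strA(2), strB(2), strC(2), wsSize
integer(kind=cuda_stream_kind) :: stream = 0
real(4) :: alpha = 1.0, beta = 0.0

istat = cutensorCreate(handle)
! Label the modes: A(i,k) x B(k,j) -> C(i,j), D(i,j)
modeA = [1, 3];  modeB = [3, 2];  modeC = [1, 2]
extA = [m, k];   extB = [k, n];   extC = [m, n]
strA = [1, m];   strB = [1, k];   strC = [1, m]   ! column-major strides
istat = cutensorCreateTensorDescriptor(handle, descA, 2, extA, strA, CUTENSOR_R_32F, 256)
istat = cutensorCreateTensorDescriptor(handle, descB, 2, extB, strB, CUTENSOR_R_32F, 256)
istat = cutensorCreateTensorDescriptor(handle, descC, 2, extC, strC, CUTENSOR_R_32F, 256)
istat = cutensorCreateContraction(handle, desc, descA, modeA, CUTENSOR_OP_IDENTITY, &
          descB, modeB, CUTENSOR_OP_IDENTITY, descC, modeC, CUTENSOR_OP_IDENTITY, &
          descC, modeC, cutensor_Compute_Desc_32F)
istat = cutensorCreatePlanPreference(handle, pref, CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE)
istat = cutensorEstimateWorkspaceSize(handle, desc, pref, CUTENSOR_WORKSPACE_DEFAULT, wsSize)
allocate(work((wsSize+3)/4))                      ! workspace size is in bytes
istat = cutensorCreatePlan(handle, plan, desc, pref, wsSize)
istat = cutensorContract(handle, plan, alpha, A, B, beta, C, D, work, wsSize, stream)
! Clean up
istat = cutensorDestroyPlan(plan)
istat = cutensorDestroyOperationDescriptor(desc)
istat = cutensorDestroyTensorDescriptor(descA)
istat = cutensorDestroyTensorDescriptor(descB)
istat = cutensorDestroyTensorDescriptor(descC)
istat = cutensorDestroy(handle)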
7.5. CUTENSOR Fortran Extensions
This section contains extensions to the cuTENSOR interfaces for Fortran array intrinsic function operations, array expressions, and array assignment, built upon the cuTENSOR library. In CUDA Fortran, these operations take data with the device or managed attribute. In OpenACC, they can be invoked with a host_data construct. The interfaces to these extensions are exposed by adding the line use cutensorEx
to your program unit, or can be applied to specific statements using the Fortran block feature, as shown in the next example.
block; use cutensorEx
D = reshape(A,shape=[ni,nk,nj],order=[1,3,2])
end block
To enable the operations to take place in one underlying kernel invocation, evaluation of the RHS expression is deferred until the overloaded assignment operation has both the LHS and RHS available. This keeps performance on par with the low-level cuTENSOR API whenever possible, at some cost in generality: only specific forms of the Fortran statements are currently supported, and those are documented in the subsequent sections of this chapter.
Since the cuTENSOR library operations take a stream argument, we have added a way to set a cuTENSOR default stream that our runtime will maintain, one copy per CPU thread. That is:
integer function cutensorexSetStream(stream)
integer(kind=cuda_stream_kind) :: stream
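For example, to run the high-level operations on a non-default stream (a sketch; A, B, and D are assumed to be conforming device arrays declared in the enclosing scope):
block; use cutensorEx; use cudafor
  integer :: istat
  integer(kind=cuda_stream_kind) :: mystream
  istat = cudaStreamCreate(mystream)
  istat = cutensorexSetStream(mystream)   ! subsequent cutensorEx operations use mystream
  D = matmul(A, B)
  istat = cudaStreamSynchronize(mystream)
end block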
7.5.1. Fortran Reshape
This Fortran function changes the shape of an array and possibly permutes the dimensions and layout. It is invoked as:
D = alpha * func(reshape(A, shape=[...], order=[...]))
The arrays A and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A and D can be from 1 to 7. The alpha value is expected to be the same type as A, or as func(reshape(A)), if that differs. Accepted functions which can be applied to the result of reshape are listed at the end of this section. The pad argument to the F90 reshape function is not currently supported. This Fortran call, besides initialization and setting up cuTENSOR descriptors, maps to cutensorPermutation().
! Example to switch the 2nd and 3rd dimension layout
D = reshape(a,shape=[ni,nk,nj], order=[1,3,2])
! Same example, take the absolute value and scale by 2.5
D = 2.5 * abs(reshape(a,shape=[ni,nk,nj], order=[1,3,2]))
7.5.2. Fortran Transpose
This Fortran function transposes a matrix (a 2-dimensional array). It is invoked as:
D = alpha * func(transpose(A))
The arrays A and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A and D is 2. Applying scaling (the alpha
argument) or applying a function to the transpose result is optional. The alpha value is expected to be the same type as A, or as func(transpose(A)), if that differs. Accepted functions which can be applied to the result of the transpose are listed at the end of this section. This Fortran call, besides initialization and setting up cuTENSOR descriptors, maps to cutensorPermutation().
! Example of transpose
D = transpose(A)
! Same example, take the absolute value and scale by 2.5
D = 2.5 * abs(transpose(A))
7.5.3. Fortran Spread
This Fortran function increases the rank of an array by one across the specified dimension and broadcasts the values over the new dimension. It is invoked as:
D = alpha * func(spread(A, dim=i, ncopies=n))
The arrays A and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A and D can be from 1 to 7. The alpha value is expected to be the same type as A. Accepted functions which can be applied to the result of spread are listed at the end of this section. This Fortran call, besides initialization and setting up cuTENSOR descriptors, maps to cutensorPermutation().
! Example to add and broadcast values over the new first dimension
D = spread(A, dim=1, ncopies=n1)
! Same example, take the absolute value and scale by 2.5
D = 2.5 * abs(spread(A, dim=1, ncopies=n1))
7.5.4. Fortran Element-wise Expressions
There is some limited support for converting expressions involving two or three source arrays into cuTENSOR calls. The first one or two operands can be a permuted array, the result of a call to reshape(), transpose(), or spread(). An elemental function can be applied to the array operands, permuted or not, and they can also be scaled. Here are some supported forms:
D = A + B
D = permute(A) + B
D = A + permute(B)
D = permute(A) - B
D = A - permute(B)
D = A + func(permute(B))
D = func(permute(A)) + permute(B)
D = alpha * func(permute(A)) + beta * permute(B) + gamma * C
The arrays A, B, C, and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A, B, C, and D can be from 1 to 7. For the three-operand case, arrays C and D must have the same shape, strides, and type. The alpha value is expected to be the same type as A. The same applies for beta and B, and gamma and C. The Fortran wrapper does no type conversion, though cuTENSOR may. Compile-time checking of array conformance is limited. Other runtime checks for unsupported combinations may come from either the Fortran wrapper or from cuTENSOR. Accepted functions which can be applied to permuted or unpermuted arrays are listed at the end of this section. These Fortran expressions, besides initialization and setting up cuTENSOR descriptors, map to either cutensorElementwiseBinary()
or cutensorElementwiseTrinary().
! Example to scale and add two arrays together
D = alpha * A + beta * B
! Same example, take the absolute value of A and B and add to C
D = alpha * abs(A) + beta * abs(B) + C
! Transpose the first array before adding to the second
D = alpha * abs(transpose(A)) + beta * abs(B) + C
7.5.5. Fortran Matmul Operations
Matrix multiplication is one instance of tensor contraction. Either operand to matmul can be a permuted array, the result of a call to reshape(), transpose(), or spread(). The cuTENSOR library does not currently support applying an elemental function to the array operands, but the result and accumulator can be scaled. Here are some supported forms:
D = matmul(A, B)
D = matmul(permute(A), B)
D = matmul(A, permute(B))
D = matmul(permute(A), permute(B))
D = C + matmul(A, B)
D = C - matmul(A, B)
D = alpha * matmul(A, B) + beta * C
The arrays A, B, C, and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A, B, C, and D must be 2, after any permutations. Arrays C and D must currently have the same shape, strides, and type. The alpha value is expected to be the same type as A and B. The beta value should have the same type as C. The Fortran wrapper does no type conversion, though cuTENSOR may. Compile-time checking of array conformance is limited. Other runtime checks for unsupported combinations may come from either the Fortran wrapper or from cuTENSOR. Fortran support for Matmul
, besides initialization and setting up cuTENSOR descriptors, maps to cutensorContraction().
! Example to multiply two matrices together
D = matmul(A, B)
! Same example, accumulate into C
C = C + matmul(A, B)
! Same example, transpose the first argument
C = C + matmul(transpose(A), B)
On GPUs which support the TF32 type, to direct a contraction to use the compute type CUTENSOR_R_MIN_TF32 rather than CUTENSOR_R_MIN_32F for real(4) (or similar for complex(4)), we have provided a way to set an internal parameter, similar to the default stream, that our runtime will maintain. The default opt level is 0. Setting the opt level to be greater than 0 will use CUTENSOR_R_MIN_TF32.
integer function cutensorExSetOptLevel(level)
integer(4) :: level
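For example (a sketch; A, B, and C are assumed to be conforming real(4) device arrays):
block; use cutensorEx
  integer :: istat
  istat = cutensorExSetOptLevel(1)   ! opt level > 0 selects CUTENSOR_R_MIN_TF32 for real(4)
  C = C + matmul(A, B)
end block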
7.5.6. Fortran Dot_Product Operations
A Dot Product is another instance of tensor contraction. In this implementation, we have added a new dim
argument to make dot_product more generally applicable to higher-order arrays. It follows the matmul features in the previous section very closely. Either of the two operands to dot_product can be a permuted array, the result of a call to reshape(), transpose(), or spread(). Note that only reshape() can generate a 1-D array. The other calls can be used along with the dim argument. The cuTENSOR library does not currently support applying an elemental function to the array operands, but the result and accumulator can be scaled. Here are some supported forms:
X = dot_product(A, B)
X = dot_product(reshape(A,shape=[n]), B)
X = dot_product(A, reshape(B,shape=[n]))
D = dot_product(permute(A), permute(B),dim=i)
D = C + dot_product(A, B, dim=i)
D = C - dot_product(A, B, dim=i)
D = C + alpha * dot_product(A, B, dim=i)
The arrays A, B, C, and D and scalar X can be of type real(2), real(4), real(8), complex(4), or complex(8). Arrays C and D must currently have the same shape, strides, and type. The rank of D and C is one less than A and B for the case using the dim argument. The alpha value is expected to be the same type as A and B. The Fortran wrapper does no type conversion, though cuTENSOR may. Compile-time checking of array conformance is limited. Other runtime checks for unsupported combinations may come from either the Fortran wrapper or from cuTENSOR. Fortran support for Dot_Product
, besides initialization and setting up cuTENSOR descriptors, maps to cutensorContraction().
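For example (a sketch; u and v are assumed to be real(4) device arrays of length n, A and B device arrays of shape (n,m), D a device array of length m, and x a host scalar):
block; use cutensorEx
  ! Full contraction of two 1-D arrays to a scalar
  x = dot_product(u, v)
  ! Reduce over the first dimension only; D has rank one less than A and B
  D = dot_product(A, B, dim=1)
end block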
7.5.7. Supported Element-wise Functions
Of the element-wise functions in the cuTENSOR operator definitions listed above, several are supported from the high-level interface. The cuTENSOR library does not currently support many of the functions listed for complex data; consult the cuTENSOR documentation for the latest information. Here are the functions with at least some level of Fortran interface support:
SQRT RELU CONJG RCP SIGMOID
TANH EXP LOG ABS NEG
SIN COS TAN SINH COSH
ASIN ACOS ATAN ASINH ACOSH
ATANH CEIL FLOOR
Note the C complex conjugate function conj is spelled conjg in Fortran. Also, the functions for ceil and floor differ between C and Fortran in their return type. We have kept the C spelling and behavior (returning a real, not an integer).
Only ABS, CEIL, CONJG, COS, SIN
can be used on bare arrays in the current implementation. All functions can be applied to the result of a permutation. Use reshape(A, shape=shape(A))
as a NOP to put the bare array into a form that can currently be recognized.
Users will find that binary or trinary kernels, such as:
D = sin(A) + cos(B) + C
will not perform as well as kernels written and compiled for that specific operation (using CUDA, CUDA Fortran, or OpenACC) due to overhead in the cuTENSOR kernels needed for generally applying functions to the operands. If the operation permutes the operands, such as:
D = sin(permute(A)) + cos(permute(B)) + C
users may see good performance compared to other naive implementations depending on how complex the permutations on the input arrays are.
8. NVIDIA Collective Communications Library (NCCL) APIs
This section describes the Fortran interfaces to the NCCL library. The NCCL functions are only accessible from host code. Most of the runtime API routines, other than some utilities, are functions that return an error code; they return a value of ncclSuccess if the call was successful, or another value if there was an error. Unlike earlier Fortran modules, we have created an ncclResult derived type for the return values. We have also overloaded the .eq. and .ne. logical operators for testing the return status.
The NCCL interfaces and definitions described in this chapter can be exposed in host code by adding the line
use nccl
to your program unit.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
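For example, the return status can be tested against the named constants directly (a sketch):
use nccl
type(ncclResult) :: stat
integer(4) :: version
stat = ncclGetVersion(version)
if (stat .ne. ncclSuccess) then
  print *, 'NCCL error: ', trim(ncclGetErrorString(stat))
end if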
8.1. NCCL Definitions and Helper Functions
This section contains definitions and data types used in the NCCL library and interfaces to the NCCL communicator creation and management functions.
The Fortran NCCL module contains the following derived type definitions:
! Definitions from nccl.h
integer, parameter :: NCCL_MAJOR = 2
integer, parameter :: NCCL_MINOR = 19
integer, parameter :: NCCL_PATCH = 3
integer, parameter :: NCCL_VERSION = &
(NCCL_MAJOR * 10000 + NCCL_MINOR * 100 + NCCL_PATCH)
integer, parameter :: NCCL_SPLIT_NOCOLOR = -1
! Types from nccl.h
! ncclUniqueId
type, bind(c) :: ncclUniqueId
character(c_char) :: internal(NCCL_UNIQUE_ID_BYTES)
end type ncclUniqueId
! ncclComm
type, bind(c) :: ncclComm
type(c_ptr) :: member
end type ncclComm
! ncclResult
type, bind(c) :: ncclResult
integer(c_int) :: member
end type ncclResult
type(ncclResult), parameter :: &
ncclSuccess = ncclResult(0), &
ncclUnhandledCudaError = ncclResult(1), &
ncclSystemError = ncclResult(2), &
ncclInternalError = ncclResult(3), &
ncclInvalidArgument = ncclResult(4), &
ncclInvalidUsage = ncclResult(5), &
ncclRemoteError = ncclResult(6), &
ncclInProgress = ncclResult(7), &
ncclNumResults = ncclResult(8)
! ncclDataType
type, bind(c) :: ncclDataType
integer(c_int) :: member
end type ncclDataType
type(ncclDataType), parameter :: &
ncclInt8 = ncclDataType(0), &
ncclChar = ncclDataType(0), &
ncclUint8 = ncclDataType(1), &
ncclInt32 = ncclDataType(2), &
ncclInt = ncclDataType(2), &
ncclUint32 = ncclDataType(3), &
ncclInt64 = ncclDataType(4), &
ncclUint64 = ncclDataType(5), &
ncclFloat16 = ncclDataType(6), &
ncclHalf = ncclDataType(6), &
ncclFloat32 = ncclDataType(7), &
ncclFloat = ncclDataType(7), &
ncclFloat64 = ncclDataType(8), &
ncclDouble = ncclDataType(8), &
ncclNumTypes = ncclDataType(9)
! ncclRedOp
type, bind(c) :: ncclRedOp
integer(c_int) :: member
end type ncclRedOp
type(ncclRedOp), parameter :: &
ncclSum = ncclRedOp(0), &
ncclProd = ncclRedOp(1), &
ncclMax = ncclRedOp(2), &
ncclMin = ncclRedOp(3), &
ncclAvg = ncclRedOp(4), &
ncclNumOps = ncclRedOp(5)
! ncclConfig
type, bind(c) :: ncclConfig
integer(c_size_t) :: size = 48
integer(c_int) :: magic = z'cafebeef'
integer(c_int) :: version = NCCL_VERSION
integer(c_int) :: blocking = z'80000000'
integer(c_int) :: cgaClusterSize = z'80000000'
integer(c_int) :: minCTAs = z'80000000'
integer(c_int) :: maxCTAs = z'80000000'
type(c_ptr) :: netName = c_null_ptr
integer(c_int) :: splitShare = z'80000000'
end type ncclConfig
8.1.1. ncclGetVersion
This function returns the version number of the NCCL library.
type(ncclResult) function ncclGetVersion(version)
integer(4) :: version
8.1.2. ncclGetUniqueId
This function generates an ID to be used with ncclCommInitRank. This routine should be called once, and the generated ID should be distributed to all ranks.
type(ncclResult) function ncclGetUniqueId(uniqueId)
type(ncclUniqueId) :: uniqueId
8.1.3. ncclCommInitRank
This function generates a new NCCL communicator, of type(ncclComm). The rank argument must be between 0 and nranks-1. The uniqueId argument should be generated with ncclGetUniqueId.
type(ncclResult) function ncclCommInitRank(comm, nranks, uniqueId, rank)
type(ncclComm) :: comm
integer(4) :: nranks
type(ncclUniqueId) :: uniqueId
integer(4) :: rank
8.1.4. ncclCommInitRankConfig
This function generates a new NCCL communicator, of type(ncclComm), using a configuration argument. The rank argument must be between 0 and nranks-1. The uniqueId argument should be generated with ncclGetUniqueId. One variation of this interface will accept c_null_ptr
for the last argument.
type(ncclResult) function ncclCommInitRankConfig(comm, nranks, &
commId, rank, config)
type(ncclComm) :: comm
integer(4) :: nranks
type(ncclUniqueId) :: commId
integer(4) :: rank
type(ncclConfig) :: config
8.1.5. ncclCommInitAll
This function creates a single-process communicator clique, an array of type(ncclComm).
type(ncclResult) function ncclCommInitAll(comms, ndev, devlist)
type(ncclComm) :: comms(*)
integer(4) :: ndev
integer(4) :: devlist(*)
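For example, a single-process program driving several GPUs might create its clique like this (a sketch; the device numbering is an assumption for illustration):
use nccl
integer(4), parameter :: ndev = 4
type(ncclComm)   :: comms(ndev)
type(ncclResult) :: stat
integer(4) :: devlist(ndev), i
devlist = [(i-1, i = 1, ndev)]     ! use CUDA devices 0..ndev-1
stat = ncclCommInitAll(comms, ndev, devlist)
! ... issue collectives on each communicator ...
do i = 1, ndev
  stat = ncclCommDestroy(comms(i))
end do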
8.1.6. ncclCommDestroy
This function frees resources allocated to a NCCL communicator. It will wait for uncompleted operations.
type(ncclResult) function ncclCommDestroy(comm)
type(ncclComm) :: comm
8.1.7. ncclCommFinalize
This function finalizes the NCCL communicator object.
type(ncclResult) function ncclCommFinalize(comm)
type(ncclComm) :: comm
8.1.8. ncclCommSplit
This function creates a new NCCL communicator, of type(ncclComm), from an existing one, based on the input color and key. A variation of this interface will accept c_null_ptr
for the last argument.
type(ncclResult) function ncclCommSplit(comm, color, &
key, newcomm, config)
type(ncclComm) :: comm
integer(4) :: color, key
type(ncclComm) :: newcomm
type(ncclConfig) :: config
8.1.9. ncclCommAbort
This function frees resources allocated to a NCCL communicator. It will abort uncompleted operations.
type(ncclResult) function ncclCommAbort(comm)
type(ncclComm) :: comm
8.1.10. ncclCommRegister
This function registers a CUDA buffer, making it available for a zero-copy operation. The buffer can be of any type.
type(ncclResult) function ncclCommRegister(comm, buff, size, handle)
type(ncclComm) :: comm
real(4), device :: buff(*) ! Any type of device array
integer(8) :: size
type(c_ptr), intent(out) :: handle
8.1.11. ncclCommDeregister
This function deregisters a CUDA buffer used for a zero-copy operation.
type(ncclResult) function ncclCommDeregister(comm, handle)
type(ncclComm) :: comm
type(c_ptr), intent(in) :: handle
8.1.12. ncclGetErrorString
This function returns an error string for a given ncclResult value.
character*128 function ncclGetErrorString(ierr)
type(ncclResult) :: ierr
8.1.13. ncclGetLastError
This function returns a string for the last error that occurred.
character*128 function ncclGetLastError(comm)
type(ncclComm) :: comm
8.1.14. ncclCommGetAsyncError
This function queries whether the communicator has encountered any asynchronous errors.
type(ncclResult) function ncclCommGetAsyncError(comm, asyncError)
type(ncclComm) :: comm
type(ncclResult) :: asyncError
8.1.15. ncclCommCount
This function sets the count argument to the number of ranks in the NCCL communicator.
type(ncclResult) function ncclCommCount(comm, count)
type(ncclComm) :: comm
integer(4) :: count
8.1.16. ncclCommCuDevice
This function sets the device argument to the CUDA device associated with a NCCL communicator.
type(ncclResult) function ncclCommCuDevice(comm, device)
type(ncclComm) :: comm
integer(4) :: device
8.1.17. ncclCommUserRank
This function sets the rank argument to the rank within a NCCL communicator.
type(ncclResult) function ncclCommUserRank(comm, rank)
type(ncclComm) :: comm
integer(4) :: rank
8.2. NCCL Collective Communication Functions
This section contains interfaces for the NCCL functions that perform collective communication operations on device data. All functions can take either CUDA Fortran device arrays, OpenACC arrays within a host_data use_device data directive, or Fortran type(c_devptr) arguments.
8.2.1. ncclAllReduce
This function performs the specified reduction on data across devices and writes the results into the receive buffer of every rank.
type(ncclResult) function ncclAllReduce(sendbuff, recvbuff, &
count, datatype, op, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
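For example, summing a device array across all ranks on the default stream might look like this sketch (comm is assumed to be a communicator created with one of the initialization routines above):
use nccl
use cudafor
integer(cuda_count_kind), parameter :: n = 1024
real(4), device :: sendbuf(n), recvbuf(n)
type(ncclComm)   :: comm
type(ncclResult) :: stat
integer(cuda_stream_kind) :: stream = 0
integer :: istat
stat  = ncclAllReduce(sendbuf, recvbuf, n, ncclFloat, ncclSum, comm, stream)
istat = cudaDeviceSynchronize()    ! wait for the collective before using recvbuf on the host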
8.2.2. ncclBroadcast
This function copies the send buffer on the root rank to all other ranks in the NCCL communicator. An in-place operation will happen if sendbuff and recvbuff are the same address.
type(ncclResult) function ncclBroadcast(sendbuff, recvbuff, &
count, datatype, root, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: root
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
8.2.3. ncclReduce
This function performs the same operation as AllReduce, but writes the results only to the receive buffers of the specified root rank.
type(ncclResult) function ncclReduce(sendbuff, recvbuff, &
count, datatype, op, root, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
integer(4) :: root
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
8.2.4. ncclAllGather
This function gathers the send buffers from each rank and stores them in rank order in the receive buffer of all ranks.
type(ncclResult) function ncclAllGather(sendbuff, recvbuff, &
sendcount, datatype, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: sendcount
type(ncclDataType) :: datatype
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
8.2.5. ncclReduceScatter
This function performs the specified reduction on the data, and leaves the result scattered in equal blocks among the ranks, based on the rank index.
type(ncclResult) function ncclReduceScatter(sendbuff, recvbuff, &
recvcount, datatype, op, comm, stream)
type(c_devptr) :: sendbuff, recvbuff
! These combinations of sendbuff, recvbuff are also accepted:
! integer(4), device :: sendbuff(*), recvbuff(*)
! integer(8), device :: sendbuff(*), recvbuff(*)
! real(2), device :: sendbuff(*), recvbuff(*)
! real(4), device :: sendbuff(*), recvbuff(*)
! real(8), device :: sendbuff(*), recvbuff(*)
integer(cuda_count_kind) :: recvcount
type(ncclDataType) :: datatype
type(ncclRedOp) :: op
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
8.3. NCCL Point To Point Communication Functions
This section contains interfaces for the NCCL functions that perform point to point communication operations on device data. All functions can take either CUDA Fortran device arrays, OpenACC arrays within a host_data use_device data directive, or Fortran type(c_devptr) arguments. The point to point operations were added in NCCL 2.7.
8.3.1. ncclSend
This function sends data from the send buffer to a communicator peer. This operation blocks the GPU. The receiving peer must call ncclRecv, with the same datatype and count.
type(ncclResult) function ncclSend(sendbuff, &
count, datatype, peer, comm, stream)
type(c_devptr) :: sendbuff
! These types for sendbuff are also accepted:
! integer(4), device :: sendbuff(*)
! integer(8), device :: sendbuff(*)
! real(2), device :: sendbuff(*)
! real(4), device :: sendbuff(*)
! real(8), device :: sendbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: peer
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
8.3.2. ncclRecv
This function receives data from a communicator peer. This operation blocks the GPU. The sending peer must call ncclSend, with the same datatype and count.
type(ncclResult) function ncclRecv(recvbuff, &
count, datatype, peer, comm, stream)
type(c_devptr) :: recvbuff
! These types for recvbuff are also accepted:
! integer(4), device :: recvbuff(*)
! integer(8), device :: recvbuff(*)
! real(2), device :: recvbuff(*)
! real(4), device :: recvbuff(*)
! real(8), device :: recvbuff(*)
integer(cuda_count_kind) :: count
type(ncclDataType) :: datatype
integer(4) :: peer
type(ncclComm) :: comm
integer(cuda_stream_kind) :: stream
8.4. NCCL Group Calls
This section contains interfaces for the NCCL functions that begin and end a group such that multiple calls can be merged.
8.4.1. ncclGroupStart
This function starts a group call. Subsequent calls to NCCL functions will not block due to inter-CPU synchronization.
type(ncclResult) function ncclGroupStart()
8.4.2. ncclGroupEnd
This function ends a group call. It returns when all operations since the corresponding call to ncclGroupStart have been processed, but not necessarily completed.
type(ncclResult) function ncclGroupEnd()
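For example, a pairwise exchange merges a send and a receive into one group so that neither call blocks before both are posted (a sketch; comm, stream, peer, the count n of kind cuda_count_kind, and the device buffers are assumed to exist):
use nccl
type(ncclResult) :: stat
stat = ncclGroupStart()
stat = ncclSend(sendbuf, n, ncclFloat, peer, comm, stream)
stat = ncclRecv(recvbuf, n, ncclFloat, peer, comm, stream)
stat = ncclGroupEnd()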
9. NVSHMEM Communication Library APIs
This section describes the Fortran interfaces to the NVSHMEM library. NVSHMEM is a software library that implements the OpenSHMEM application programming interface (API) for clusters of NVIDIA GPUs. OpenSHMEM is a community standard, one-sided communication API that provides a partitioned global address space (PGAS) parallel programming model. NVSHMEM provides an easy-to-use host-side interface for allocating symmetric memory, which can be distributed across a cluster of NVIDIA GPUs interconnected with NVLink, PCIe, and InfiniBand. The NVSHMEM communication functions are accessible from both host and device code. Most of the runtime API routines are written as C void functions, and we have implemented their Fortran wrappers as subroutines.
The NVSHMEM interfaces and definitions described in this chapter can be exposed by adding the line
use nvshmem
to your program unit. The same module is used for both host and device code. Device functions which run on a thread block or warp are declared as acc vector nohost routines. Others are acc seq routines.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).
9.1. NVSHMEM Definitions, Setup, Exit, and Query Functions
This section contains definitions and data types used in the NVSHMEM library and interfaces to the NVSHMEM initialization and access to the parallel environment of the PEs.
The Fortran NVSHMEM module contains the following constant and derived type definitions:
! These are not available to the user, internal only
! defines, from nvshmemx_api.h
#define INIT_HANDLE_BYTES 128
! defines, from nvshmem_constants.h
#define SYNC_SIZE 27648
! Constant Definitions
integer, parameter :: NVSHMEM_SYNC_VALUE = 0
integer, parameter :: NVSHMEM_SYNC_SIZE = (2 * SYNC_SIZE)
integer, parameter :: NVSHMEM_BARRIER_SYNC_SIZE = (2 * SYNC_SIZE)
integer, parameter :: NVSHMEM_BCAST_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_REDUCE_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_REDUCE_MIN_WRKDATA_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_COLLECT_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEM_ALLTOALL_SYNC_SIZE = SYNC_SIZE
integer, parameter :: NVSHMEMX_CMP_EQ = 0
integer, parameter :: NVSHMEMX_CMP_NE = 1
integer, parameter :: NVSHMEMX_CMP_GT = 2
integer, parameter :: NVSHMEMX_CMP_LE = 3
integer, parameter :: NVSHMEMX_CMP_LT = 4
integer, parameter :: NVSHMEMX_CMP_GE = 5
integer, parameter :: NVSHMEMX_THREAD_SINGLE = 0
integer, parameter :: NVSHMEMX_THREAD_FUNNELED = 1
integer, parameter :: NVSHMEMX_THREAD_SERIALIZED = 2
integer, parameter :: NVSHMEMX_THREAD_MULTIPLE = 3
integer, parameter :: NVSHMEM_TEAM_INVALID = -1
integer, parameter :: NVSHMEM_TEAM_WORLD = 0
integer, parameter :: NVSHMEM_TEAM_SHARED = 1
integer, parameter :: NVSHMEMX_TEAM_NODE = 2
integer, parameter :: NVSHMEMX_INIT_THREAD_PES = 1
integer, parameter :: NVSHMEMX_INIT_WITH_MPI_COMM = 2
integer, parameter :: NVSHMEMX_INIT_WITH_SHMEM = 4
integer, parameter :: NVSHMEMX_INIT_WITH_HANDLE = 8
! Types from nvshmemx_api.h
type, bind(c) :: nvshmemx_init_handle
character(c_char) :: content(INIT_HANDLE_BYTES)
end type nvshmemx_init_handle
! Types from nvshmemx_api.h
type, bind(c) :: nvshmemx_init_attr_type
integer(8) heap_size
integer(4) num_threads
integer(4) n_pes
integer(4) my_pe
type(c_ptr) mpi_comm
type(nvshmemx_init_handle) handle
end type nvshmemx_init_attr_type
! Types from nvshmem_types.h
type, bind(c) :: nvshmem_team_config
integer(c_int) :: num_contexts
end type nvshmem_team_config
! nvshmemx_status, from nvshmem_error.h
type, bind(c) :: nvshmemx_status
integer(c_int) :: member
end type nvshmemx_status
type(nvshmemx_status), parameter :: &
NVSHMEMX_SUCCESS = nvshmemx_status(0), &
NVSHMEMX_ERROR_INVALID_VALUE = nvshmemx_status(1), &
NVSHMEMX_ERROR_OUT_OF_MEMORY = nvshmemx_status(2), &
NVSHMEMX_ERROR_NOT_SUPPORTED = nvshmemx_status(3), &
NVSHMEMX_ERROR_SYMMETRY = nvshmemx_status(4), &
NVSHMEMX_ERROR_GPU_NOT_SELECTED = nvshmemx_status(5), &
NVSHMEMX_ERROR_COLLECTIVE_LAUNCH_FAILED = nvshmemx_status(6), &
NVSHMEMX_ERROR_INTERNAL = nvshmemx_status(7)
9.1.1. nvshmem_init
This subroutine allocates and initializes resources used by the NVSHMEM library.
subroutine nvshmem_init()
9.1.2. nvshmemx_init_attr
This function initializes the NVSHMEM library based on an existing MPI communicator. Since the C and Fortran mpi_comm objects differ, this function has a different argument list than the corresponding C library entry point.
type(nvshmemx_status) function nvshmemx_init_attr(flags, comm)
integer(4) :: flags, comm
Here is an example of using this function with MPI:
use nvshmem
type(nvshmemx_status) :: nvstat
. . .
! Setup MPI
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierror)
!
nvstat = nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, MPI_COMM_WORLD)
9.1.3. nvshmem_my_pe
This function returns the PE number of the calling PE, a number between 0 and npes-1.
integer(4) function nvshmem_my_pe()
9.1.4. nvshmem_n_pes
This function returns the number of PEs running in the program.
integer(4) function nvshmem_n_pes()
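For example, a minimal host program that starts the library and queries its PE environment (a sketch):
program hello_pes
  use nvshmem
  integer(4) :: mype, npes
  call nvshmem_init()
  mype = nvshmem_my_pe()
  npes = nvshmem_n_pes()
  print *, 'Hello from PE', mype, 'of', npes
  call nvshmem_finalize()
end program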
9.1.5. nvshmem_team_my_pe
This function returns the PE number of the calling PE, within the specified team.
integer(4) function nvshmem_team_my_pe(team)
integer(4) :: team
9.1.6. nvshmem_team_n_pes
This function returns the number of PEs in the specified team.
integer(4) function nvshmem_team_n_pes(team)
integer(4) :: team
9.1.7. nvshmem_team_get_config
This function returns the configuration parameters as described by the mask in the config argument.
integer(4) function nvshmem_team_get_config(team, mask, config)
integer(4) :: team
integer(8) :: mask
type(nvshmem_team_config) :: config
9.1.8. nvshmem_team_translate_pe
This function returns the translated destination pe given the source team, the source pe, and the destination team.
integer(4) function nvshmem_team_translate_pe(src_team, src_pe, dest_team)
integer(4) :: src_team, src_pe, dest_team
9.1.9. nvshmem_team_split_strided
This function performs a collective operation and creates a new team given a parent team and a desired slice (start, stride, and size) from the parent team.
integer(4) function nvshmem_team_split_strided(parent_team, &
start, stride, size, config, mask, new_team)
integer(4) :: parent_team
integer(4) :: start, stride, size
type(nvshmem_team_config) :: config
integer(8) :: mask
integer(4) :: new_team
9.1.10. nvshmem_team_split_2d
This function performs a collective operation and creates two new teams given a parent team and a specification of the 2D space. The result is two teams containing the PEs which map to the 2D space’s row and column.
integer(4) function nvshmem_team_split_2d(parent_team, &
xrange, xaxis_config, xaxis_mask, xaxis_team, &
yaxis_config, yaxis_mask, yaxis_team)
integer(4) :: parent_team
integer(4) :: xrange
type(nvshmem_team_config) :: xaxis_config, yaxis_config
integer(8) :: xaxis_mask, yaxis_mask
integer(4), intent(out) :: xaxis_team, yaxis_team
9.1.11. nvshmem_team_destroy
This function is a collective operation which destroys the team and frees the resources associated with it.
integer(4) function nvshmem_team_destroy(team)
integer(4) :: team
9.1.12. nvshmem_info_get_version
This subroutine returns the major and minor version number of the NVSHMEM library.
subroutine nvshmem_info_get_version(major, minor)
integer(4) :: major, minor
9.1.13. nvshmem_info_get_name
This subroutine returns the vendor-defined name string for the library.
subroutine nvshmem_info_get_name(name)
character*256, intent(out) :: name
9.1.14. nvshmem_finalize
This subroutine releases resources and ends the NVSHMEM portion of a program started with nvshmem_init().
subroutine nvshmem_finalize()
9.1.15. nvshmem_ptr
This function returns a local address that may be used to directly reference the destination data on the specified PE. The function nvshmem_ptr is implemented as a Fortran generic function, and can take any datatype, as long as it is a symmetric address.
type(c_devptr) function nvshmem_ptr(dest, pe)
! dest can be of type integer, logical, real, complex, character,
! or a type(c_devptr)
integer(4) :: pe
The following specific functions are also supported:
type(c_devptr) function nvshmem_ptri(dest, pe)
integer :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrl(dest, pe)
logical :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrr(dest, pe)
real :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrc(dest, pe)
complex :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrc1(dest, pe)
character :: dest ! Any kind and rank
integer(4) :: pe
type(c_devptr) function nvshmem_ptrcd(dest, pe)
type(c_devptr) :: dest
integer(4) :: pe
9.2. NVSHMEM Memory Management Functions
This section contains the Fortran interfaces to NVSHMEM functions used to manage the symmetric heap.
9.2.1. nvshmem_malloc
This function allocates a block containing the specified number of bytes from the symmetric heap. This routine is a collective operation and requires participation by all PEs.
type(c_devptr) function nvshmem_malloc(size)
integer(8) :: size ! Size is in bytes
Entities of type(c_devptr) can be cast as Fortran arrays in a few ways. Here are some examples:
use nvshmem
! Contiguous will avoid some runtime checks
real(8), device, pointer, contiguous :: array(:)
. . .
call c_f_pointer(nvshmem_malloc(N*8), array, [N])
use nvshmem
! Cray Pointer
real(8), device :: array(N); pointer(pa,array)
. . .
pa = transfer(nvshmem_malloc(N*8), pa)
9.2.2. nvshmem_free
This subroutine frees a block of symmetric data which was previously allocated.
subroutine nvshmem_free(ptr)
! ptr can be of type(c_devptr), or other types if it was cast to a Fortran
! array using the techniques described in the nvshmem_malloc section.
9.2.3. nvshmem_align
This function allocates a block from the symmetric heap that has a byte alignment specified by the alignment argument.
type(c_devptr) function nvshmem_align(alignment, size)
integer(8) :: alignment
integer(8) :: size ! Size is in bytes
9.2.4. nvshmem_calloc
This function allocates a block containing the specified number of bytes from the symmetric heap. This routine is a collective operation and requires participation by all PEs. The space is also initialized to zero.
type(c_devptr) function nvshmem_calloc(size)
integer(8) :: size ! Size is in bytes
9.2.5. nvshmemx_buffer_register
This function registers the given buffer with the remote transport and with CUDA for subsequent nvshmem operations. The address passed as the first argument can be of any type, with host or device attributes.
integer(4) function nvshmemx_buffer_register(addr, len)
real(4) :: addr(*) ! Any type is accepted
integer(8) :: len ! Size is in bytes
9.2.6. nvshmemx_buffer_unregister
This function unregisters the buffer which was previously registered with the nvshmemx_buffer_register library function.
integer(4) function nvshmemx_buffer_unregister(addr)
real(4) :: addr(*) ! Any type is accepted
9.2.7. nvshmemx_buffer_unregister_all
This subroutine unregisters all buffers which were previously registered with the nvshmemx_buffer_register library function.
subroutine nvshmemx_buffer_unregister_all()
9.3. NVSHMEM Remote Memory Access Functions
This section contains the Fortran interfaces to NVSHMEM functions used to perform reads and writes to symmetric data objects. The CUDA C library contains a number of functions for each C type. We have tried to distill those down into a useful, but non-redundant set for Fortran programmers. In addition, we have provided the following generic interfaces which are overloaded to take multiple types:
nvshmem_put
nvshmem_p
nvshmem_iput
nvshmem_put_nbi
nvshmemx_put_block
nvshmemx_put_warp
nvshmem_get
nvshmem_g
nvshmem_iget
nvshmem_get_nbi
nvshmemx_get_block
nvshmemx_get_warp
Many of these functions are available on both the host and device. Certain programming models may not currently support generic functions on the device. Some of the functions are only available on the device (most notably, those performed by a whole block or whole warp).
9.3.1. nvshmem_put
This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_put
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmem_putmem(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_putmem_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int8_put(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_put(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_put(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_put(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_put(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_put(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_put(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_put(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem put subroutines are not part of the generic nvshmem_put group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_put8(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put16(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put32(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put64(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put128(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
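For example, a ring exchange using the generic nvshmem_put on symmetric device data (a sketch; the buffers are allocated from the symmetric heap as shown in the previous section, and the synchronization needed before dst is read is left to the routines described later in this chapter):
use nvshmem
integer(8), parameter :: n = 1024
real(8), device, pointer, contiguous :: src(:), dst(:)
integer(4) :: mype, npes, rightpe
call c_f_pointer(nvshmem_malloc(n*8), src, [n])
call c_f_pointer(nvshmem_malloc(n*8), dst, [n])
mype = nvshmem_my_pe();  npes = nvshmem_n_pes()
rightpe = mod(mype + 1, npes)
src = real(mype, kind=8)
call nvshmem_put(dst, src, n, rightpe)   ! resolves to nvshmem_double_put for real(8) data
! ... synchronize before dst is read on the destination PE ...
call nvshmem_free(src)
call nvshmem_free(dst)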
9.3.2. nvshmem_p
This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_p
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device, and the source is passed by value and should be host-resident or device-resident, respectively. The specific names and argument lists are below.
subroutine nvshmem_int8_p(dest, source, nelems, pe)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_p_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_p(dest, source, nelems, pe)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_p_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_p(dest, source, nelems, pe)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_p_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_p(dest, source, nelems, pe)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_p_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_p(dest, source, nelems, pe)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_p_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_p(dest, source, nelems, pe)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_p_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
9.3.3. nvshmem_iput
This subroutine provides a way to copy strided data elements to a destination. This subroutine returns after the data has been copied out of the source array on the local PE. The subroutine nvshmem_iput
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_iput(dest, source, dst, sst, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int8_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_iput(dest, source, dst, sst, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int16_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_iput(dest, source, dst, sst, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int32_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_iput(dest, source, dst, sst, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int64_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_iput(dest, source, dst, sst, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_float_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_iput(dest, source, dst, sst, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_double_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_iput(dest, source, dst, sst, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_complex_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_iput(dest, source, dst, sst, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_iput_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem iput subroutines are not part of the generic nvshmem_iput group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_iput8(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput8_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput16(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput16_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput32(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput32_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput64(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput64_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iput128(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iput128_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
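For illustration, here is a minimal host-side sketch of a strided put, assuming the interfaces above are made available through the nvshmem module and that dst and src were allocated from the NVSHMEM symmetric heap with the allocation routines described earlier in this chapter.
subroutine strided_push(dst, src, n, peer)
  use nvshmem
  implicit none
  real(8), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  integer(8), value :: n
  integer(4), value :: peer
  ! pack every 2nd element of src into consecutive slots of dst on PE 'peer':
  ! dst stride 1, src stride 2, n elements
  call nvshmem_double_iput(dst, src, 1_8, 2_8, n, peer)
end subroutine strided_push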
9.3.4. nvshmem_put_nbi
This subroutine returns after initiating the put operation. The subroutine nvshmem_put_nbi
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_put_nbi(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_put_nbi(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_put_nbi(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_put_nbi(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_nbi_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_put_nbi(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_nbi_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_put_nbi(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_nbi_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_put_nbi(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_nbi_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_put_nbi(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_nbi_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem put_nbi subroutines are not part of the generic nvshmem_put_nbi group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_put8_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put16_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put32_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put64_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_put128_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
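A minimal sketch of overlapping non-blocking puts from the host is shown below. It assumes the nvshmem module is in use, that dst and src are symmetric device arrays, and that a Fortran interface to the NVSHMEM quiet operation (nvshmem_quiet in the C API) is available to complete the outstanding transfers.
subroutine push_nbi(dst, src, n, peer)
  use nvshmem
  implicit none
  real(4), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  integer(8), value :: n
  integer(4), value :: peer
  call nvshmem_float_put_nbi(dst, src, n, peer)   ! returns after initiating the put
  ! ... independent work can be issued here while the transfer proceeds ...
  call nvshmem_quiet()   ! assumed interface; waits for completion of outstanding puts
end subroutine push_nbi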
9.3.5. nvshmemx_put_block
This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_put_block
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_putmem_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_block(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_block(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_put_block(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_block(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_block(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_block(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_block(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem put block subroutines are not part of the generic nvshmemx_put_block group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_put_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_put_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
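As a device-side sketch, the kernel below has each thread block cooperatively push one row of a two-dimensional symmetric array to a destination PE. The kernel name, array shapes, and launch configuration are illustrative assumptions; every thread of the block must make the block-scoped call.
attributes(global) subroutine push_rows(dst, src, n, peer)
  use nvshmem
  use cudafor
  implicit none
  integer(8), value :: n
  real(8), device :: dst(n,*), src(n,*)   ! symmetric device arrays (assumption)
  integer(4), value :: peer
  ! all threads in the block participate in the block-scoped put of one row
  call nvshmemx_double_put_block(dst(1,blockIdx%x), src(1,blockIdx%x), n, peer)
end subroutine push_rows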
9.3.6. nvshmemx_put_warp
This subroutine returns after the data has been copied out of the source array on the local PE. It is only available from device code. The subroutine nvshmemx_put_warp
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_putmem_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_put_warp(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_put_warp(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_put_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_put_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_put_warp(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_put_warp(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_put_warp(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_put_warp(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_put_warp(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem put warp subroutines are not part of the generic nvshmemx_put_warp group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_put_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_put_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put8_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put16_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put32_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put64_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_put128_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
9.3.7. nvshmem_get
This subroutine returns after the data has been delivered to the dest array on the local PE. The subroutine nvshmem_get
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmem_getmem(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_getmem_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int8_get(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_get(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_get(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_get(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_get(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_get(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_get(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_get(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem get subroutines are not part of the generic nvshmem_get group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_get8(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get16(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get32(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get64(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get128(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
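A minimal host-side sketch of a stream-ordered get is shown here; it assumes the nvshmem module, a symmetric source array, and a valid CUDA stream created by the caller.
subroutine fetch_async(dst, src, n, peer, stream)
  use nvshmem
  use cudafor
  implicit none
  real(8), device :: dst(*)             ! local destination buffer
  real(8), device :: src(*)             ! symmetric source array (assumption)
  integer(8), value :: n
  integer(4), value :: peer
  integer(cuda_stream_kind), value :: stream
  ! enqueue the get on 'stream'; it completes in stream order, so synchronize
  ! the stream before the host reads dst
  call nvshmemx_double_get_on_stream(dst, src, n, peer, stream)
end subroutine fetch_async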
9.3.8. nvshmem_g
This subroutine returns after the data has been delivered to the dest array on the local PE. The subroutine nvshmem_g
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device, and the source is passed by value and should be host-resident or device-resident, respectively. The specific names and argument lists are below.
subroutine nvshmem_int8_g(dest, source, nelems, pe)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_g_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_g(dest, source, nelems, pe)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_g_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_g(dest, source, nelems, pe)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_g_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_g(dest, source, nelems, pe)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_g_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_g(dest, source, nelems, pe)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_g_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_g(dest, source, nelems, pe)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_g_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
9.3.9. nvshmem_iget
This subroutine provides a way to copy strided data elements from a remote source array to a local destination. This subroutine returns after the data has been delivered to the dest array on the local PE. The subroutine nvshmem_iget
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_iget(dest, source, dst, sst, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int8_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_iget(dest, source, dst, sst, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int16_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_iget(dest, source, dst, sst, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int32_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_iget(dest, source, dst, sst, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_int64_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_iget(dest, source, dst, sst, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_float_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_iget(dest, source, dst, sst, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_double_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_iget(dest, source, dst, sst, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_complex_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_iget(dest, source, dst, sst, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_iget_on_stream(dest, source, dst, sst, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem iget subroutines are not part of the generic nvshmem_iget group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_iget8(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget8_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget16(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget16_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget32(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget32_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget64(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget64_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_iget128(dest, source, dst, sst, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
subroutine nvshmemx_iget128_on_stream(dest, source, dst, sst, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: dst, sst, nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
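Mirroring the strided-put sketch earlier in this chapter, the following sketch assumes the nvshmem module and a symmetric source array, and gathers every 4th element of a remote array into a packed local buffer.
subroutine gather_strided(dst, src, n, peer)
  use nvshmem
  implicit none
  real(4), device :: dst(*), src(*)   ! src symmetric, dst local (assumption)
  integer(8), value :: n
  integer(4), value :: peer
  ! dst stride 1 (packed), src stride 4, n elements fetched from PE 'peer'
  call nvshmem_float_iget(dst, src, 1_8, 4_8, n, peer)
end subroutine gather_strided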
9.3.10. nvshmem_get_nbi
This subroutine returns after initiating the get operation. The subroutine nvshmem_get_nbi
is overloaded to take a number of different sets of arguments. These subroutines can be called from either the host or device. The specific names and argument lists are below.
subroutine nvshmem_int8_get_nbi(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int16_get_nbi(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int32_get_nbi(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_get_nbi(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_nbi_on_stream(dest, source, nelems, pe, stream)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_float_get_nbi(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_nbi_on_stream(dest, source, nelems, pe, stream)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_double_get_nbi(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_nbi_on_stream(dest, source, nelems, pe, stream)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_complex_get_nbi(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_nbi_on_stream(dest, source, nelems, pe, stream)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_dcomplex_get_nbi(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_nbi_on_stream(dest, source, nelems, pe, stream)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
The following nvshmem get_nbi subroutines are not part of the generic nvshmem_get_nbi group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmem_get8_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get16_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get32_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get64_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
subroutine nvshmem_get128_nbi(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_nbi_on_stream(dest, source, nelems, pe, stream)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
integer(cuda_stream_kind) :: stream
9.3.11. nvshmemx_get_block
This subroutine returns after the data has been delivered to the dest array on the local PE. It is only available from device code. The subroutine nvshmemx_get_block
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_getmem_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_block(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_block(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_get_block(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_block(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_block(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_block(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_block(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem get block subroutines are not part of the generic nvshmemx_get_block group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_get_block(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_get_block(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_block(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
9.3.12. nvshmemx_get_warp
This subroutine returns after the data has been delivered to the dest array on the local PE. It is only available from device code. The subroutine nvshmemx_get_warp
is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmemx_getmem_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int8_get_warp(dest, source, nelems, pe)
integer(1), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int16_get_warp(dest, source, nelems, pe)
integer(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int32_get_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_int64_get_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_fp16_get_warp(dest, source, nelems, pe)
real(2), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_float_get_warp(dest, source, nelems, pe)
real(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_double_get_warp(dest, source, nelems, pe)
real(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_complex_get_warp(dest, source, nelems, pe)
complex(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_dcomplex_get_warp(dest, source, nelems, pe)
complex(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
The following nvshmem get warp subroutines are not part of the generic nvshmemx_get_warp group, but are provided for flexibility and for compatibility with the C names:
subroutine nvshmemx_int_get_warp(dest, source, nelems, pe)
integer(4), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_long_get_warp(dest, source, nelems, pe)
integer(8), device :: dest(*), source(*)
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get8_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get16_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get32_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get64_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
subroutine nvshmemx_get128_warp(dest, source, nelems, pe)
type(c_devptr) :: dest, source
integer(8) :: nelems
integer(4) :: pe
9.4. NVSHMEM Collective Communication Functions
This section contains the Fortran interfaces to NVSHMEM functions that perform coordinated communication or synchronization operations within a group of PEs. The section can be further divided between barrier and sync functions, all-to-all, broadcast, and collect functions, and reductions.
9.4.1. nvshmem_barrier, nvshmem_barrier_all
These subroutines perform a collective synchronization over all (nvshmem_barrier_all
) or a provided subset (nvshmem_barrier
) of PEs. Ordering APIs initiated on the CPU only order communication operations that were issued from the CPU. Use cudaDeviceSynchronize() or an equivalent synchronization to ensure that GPU operations have completed. The specific names and argument lists are below.
subroutine nvshmem_barrier_all()
subroutine nvshmemx_barrier_all_on_stream(stream)
integer(cuda_stream_kind) :: stream
subroutine nvshmem_barrier(pe_start, pe_stride, pe_size, psync)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
subroutine nvshmemx_barrier_on_stream(pe_start, pe_stride, pe_size, psync, stream)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
integer(cuda_stream_kind) :: stream
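As a minimal sketch of the barrier semantics, the host code below issues a non-blocking put and then calls nvshmem_barrier_all, which completes the outstanding put and synchronizes all PEs before either side reads the destination buffer. The buffers are assumed to be symmetric and the nvshmem module is assumed to provide the interfaces listed in this chapter.
subroutine push_and_barrier(dst, src, n, peer)
  use nvshmem
  implicit none
  real(8), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  integer(8), value :: n
  integer(4), value :: peer
  call nvshmem_double_put_nbi(dst, src, n, peer)
  call nvshmem_barrier_all()   ! completes the put and synchronizes all PEs
end subroutine push_and_barrier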
9.4.2. nvshmem_sync, nvshmem_sync_all
These subroutines perform a collective synchronization over all (nvshmem_sync_all
) or a provided subset (nvshmem_sync
) of PEs. Unlike the barrier routines, these subroutines ensure only the completion and visibility of previously issued memory stores; they do not ensure completion of remote memory updates. The specific names and argument lists are below.
subroutine nvshmem_sync_all()
subroutine nvshmemx_sync_all_on_stream(stream)
integer(cuda_stream_kind) :: stream
subroutine nvshmem_sync(pe_start, pe_stride, pe_size, psync)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
subroutine nvshmemx_sync_on_stream(pe_start, pe_stride, pe_size, psync, stream)
integer(4) :: pe_start, pe_stride, pe_size
integer(8), device :: psync(*)
integer(cuda_stream_kind) :: stream
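The sketch below illustrates the lighter-weight sync: each PE fills only its own symmetric buffer, so no remote updates are outstanding and nvshmem_sync_all is sufficient before fetching a peer's copy. In this sketch, buf is assumed to be symmetric and peer_copy is a local destination buffer.
subroutine publish_then_fetch(buf, peer_copy, n, peer)
  use nvshmem
  implicit none
  real(8), device :: buf(*)         ! symmetric buffer filled locally by this PE
  real(8), device :: peer_copy(*)   ! local destination buffer
  integer(8), value :: n
  integer(4), value :: peer
  call nvshmem_sync_all()                          ! local stores visible, PEs synchronized
  call nvshmem_double_get(peer_copy, buf, n, peer) ! read the peer's copy of buf
end subroutine publish_then_fetch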
9.4.3. nvshmem_alltoall
These functions perform a collective all-to-all operation over a team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. These functions exchange the specified number of data elements with all other PEs in the team. These generic names are supported in the Fortran interfaces: nvshmem_alltoall, nvshmemx_alltoall_block, and nvshmemx_alltoall_warp. The nvshmem_alltoall functions are callable from host or device, the nvshmemx_alltoall_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int16_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int32_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int64_alltoall(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmem_float_alltoall(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_double_alltoall(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_alltoall_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_alltoall_block(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_alltoall_block(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_alltoall_block(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_alltoall_warp(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
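A minimal host-side sketch of an all-to-all exchange follows. The team handle (for example, the world-team constant from the team-management interfaces) is passed in by the caller, dst and src are assumed to be symmetric and sized for one element per PE in the team, and each PE contributes one element per destination PE.
integer function exchange_one_each(team, dst, src)
  use nvshmem
  implicit none
  integer(4), value :: team
  real(8), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  ! exchange 1 element with every PE in 'team'; returns the library status code
  exchange_one_each = nvshmem_double_alltoall(team, dst, src, 1_8)
end function exchange_one_each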
9.4.4. nvshmem_broadcast
These functions perform a collective broadcast operation over a team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. These functions send the specified number of elements of source data from the specified root to all other PEs in the team. These generic names are supported in the Fortran interfaces: nvshmem_broadcast, nvshmemx_broadcast_block, and nvshmemx_broadcast_warp. The nvshmem_broadcast functions are callable from host or device, the nvshmemx_broadcast_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_int16_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_int32_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_int64_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_float_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmem_double_broadcast(team, dest, source, nelems, pe_root)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int8_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_broadcast_on_stream(team, &
dest, source, nelems, pe_root, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int16_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int32_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int64_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_float_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_double_broadcast_block(team, dest, source, nelems, pe_root)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int8_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int16_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int32_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_int64_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_float_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
integer function nvshmemx_double_broadcast_warp(team, dest, source, nelems, pe_root)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(4) :: pe_root
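A minimal sketch of a broadcast follows, assuming symmetric device arrays and a caller-supplied team handle; PE pe_root supplies the data and every other PE in the team receives it in dst.
integer function share_config(team, dst, src, n, pe_root)
  use nvshmem
  implicit none
  integer(4), value :: team, pe_root
  real(8), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  integer(8), value :: n
  ! broadcast n elements from pe_root to all PEs in 'team'; returns the status code
  share_config = nvshmem_double_broadcast(team, dst, src, n, pe_root)
end function share_config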
9.4.5. nvshmem_collect
These functions perform a collective operation to concatenate the specified number of elements from each source array into the dest array for each PE in the team. Starting in nvshmem version 2.0, the specific names for collective operations take a team argument and are specific to the type. The collected data is ordered by PE number within the team, and nelems can vary from PE to PE. These generic names are supported in the Fortran interfaces: nvshmem_collect, nvshmemx_collect_block, and nvshmemx_collect_warp. The nvshmem_collect functions are callable from host or device, the nvshmemx_collect_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_collect(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int16_collect(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int32_collect(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_int64_collect(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmem_float_collect(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmem_double_collect(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_collect_on_stream(team, &
dest, source, nelems, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_collect_block(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_collect_block(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_collect_block(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int8_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int16_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int32_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_int64_collect_warp(team, dest, source, nelems)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_float_collect_warp(team, dest, source, nelems)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nelems
integer function nvshmemx_double_collect_warp(team, dest, source, nelems)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nelems
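A minimal sketch of a collect follows, assuming symmetric device arrays and a caller-supplied team handle; each PE contributes n locally computed values, and dst must be large enough to hold the concatenation from all PEs.
integer function gather_results(team, dst, src, n)
  use nvshmem
  implicit none
  integer(4), value :: team
  real(4), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  integer(8), value :: n
  ! concatenate n elements from every PE, ordered by PE within the team
  gather_results = nvshmem_float_collect(team, dst, src, n)
end function gather_results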
9.4.6. NVSHMEM Reductions
This section contains the Fortran interfaces to NVSHMEM functions that perform reductions: collective operations in which a group of PEs combines corresponding values from each PE element-wise with a bitwise or arithmetic operation, producing the reduced result on every PE in the group.
9.4.6.1. nvshmem_and_reduce
These functions perform a bitwise AND reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_and_reduce, nvshmemx_and_reduce_block, and nvshmemx_and_reduce_warp. The nvshmem_and_reduce functions are callable from host or device, the nvshmemx_and_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_and_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_and_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_and_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_and_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
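A minimal sketch of a bitwise AND reduction follows, assuming symmetric device arrays and a caller-supplied team handle; after the call, every PE in the team holds the element-wise AND of all PEs' masks in dst.
integer function combine_masks(team, dst, src, nreduce)
  use nvshmem
  implicit none
  integer(4), value :: team
  integer(4), device :: dst(*), src(*)   ! symmetric device arrays (assumption)
  integer(8), value :: nreduce
  ! element-wise AND of 'nreduce' values across the team; returns the status code
  combine_masks = nvshmem_int32_and_reduce(team, dst, src, nreduce)
end function combine_masks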
9.4.6.2. nvshmem_or_reduce
These functions perform a bitwise OR reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_or_reduce, nvshmemx_or_reduce_block, and nvshmemx_or_reduce_warp. The nvshmem_or_reduce functions are callable from host or device, the nvshmemx_or_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_or_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_or_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_or_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_or_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
9.4.6.3. nvshmem_xor_reduce
These functions perform a bitwise XOR reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for the reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_xor_reduce, nvshmemx_xor_reduce_block, and nvshmemx_xor_reduce_warp. The nvshmem_xor_reduce functions are callable from host or device, the nvshmemx_xor_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_xor_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_xor_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_xor_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_xor_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
9.4.6.4. nvshmem_max_reduce
These functions perform a maximum value, MAX, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_max_reduce, nvshmemx_max_reduce_block, and nvshmemx_max_reduce_warp. The nvshmem_max_reduce functions are callable from host or device, the nvshmemx_max_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_max_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_max_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_max_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_max_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_max_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_max_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
9.4.6.5. nvshmem_min_reduce
These functions perform a minimum value, MIN, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_min_reduce, nvshmemx_min_reduce_block, and nvshmemx_min_reduce_warp. The nvshmem_min_reduce functions are callable from host or device, the nvshmemx_min_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_min_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_min_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_min_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_min_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_min_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_min_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
9.4.6.6. nvshmem_sum_reduce
These functions perform a summation, or SUM, reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_sum_reduce, nvshmemx_sum_reduce_block, and nvshmemx_sum_reduce_warp. The nvshmem_sum_reduce functions are callable from host or device, the nvshmemx_sum_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_sum_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_sum_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_sum_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_sum_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
9.4.6.7. nvshmem_prod_reduce
These functions perform a product reduction across a set of PEs in a team. Starting in nvshmem version 2.0, the specific names for reduction operations take a team argument and are specific to the type. These generic names are supported in the Fortran interfaces: nvshmem_prod_reduce, nvshmemx_prod_reduce_block, and nvshmemx_prod_reduce_warp. The nvshmem_prod_reduce functions are callable from host or device, the nvshmemx_prod_reduce_on_stream functions are callable only from the host, and the block and warp functions are callable only from the device.
integer function nvshmem_int8_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int16_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int32_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_int64_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_float_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmem_double_prod_reduce(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int16_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int32_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int64_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_float_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_double_prod_reduce_on_stream(team, &
dest, source, nreduce, stream)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer(cuda_stream_kind) :: stream
integer function nvshmemx_int8_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_prod_reduce_block(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int8_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(1), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int16_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(2), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int32_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_int64_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
integer(8), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_float_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(4), device :: dest, source
integer(8) :: nreduce
integer function nvshmemx_double_prod_reduce_warp(team, dest, source, nreduce)
integer(4) :: team
real(8), device :: dest, source
integer(8) :: nreduce
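All of the team-based reductions in this section share the same calling pattern; only the operation and the data type vary. The following minimal sketch, for illustration only, sums one integer(4) value per PE across all PEs from host code. It assumes NVSHMEM has been initialized, that dst and src are symmetric device objects residing in the NVSHMEM symmetric heap, and that NVSHMEM_TEAM_WORLD is the predefined world-team constant.
! Hedged sketch: one-element SUM reduction over the world team, called from host code.
! dst and src must be symmetric integer(4) objects on every PE.
integer(4) :: istat
integer(8) :: nreduce
nreduce = 1
istat = nvshmem_int32_sum_reduce(NVSHMEM_TEAM_WORLD, dst, src, nreduce)
The bitwise, MIN, MAX, and PROD reductions, as well as the _on_stream, _block, and _warp variants, are called analogously.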
9.5. NVSHMEM Point to Point Synchronization Functions
This section contains the Fortran interfaces to NVSHMEM functions that provide a mechanism for synchronization between two PEs based on a value in the symmetric memory.
9.5.1. nvshmem_wait_until
This subroutine blocks until the value contained in the symmetric data object at the calling PE satisfies the condition specified by the comparison operator and the comparison value. The subroutine nvshmem_wait_until is overloaded to take a number of different sets of arguments. The specific names and argument lists are below.
subroutine nvshmem_int32_wait_until(ivar, cmp, value)
integer(4), device :: ivar
integer(4) :: cmp, value
subroutine nvshmemx_int32_wait_until_on_stream(ivar, cmp, value, stream)
integer(4), device :: ivar
integer(4) :: cmp, value
integer(cuda_stream_kind) :: stream
subroutine nvshmem_int64_wait_until(ivar, cmp, value)
integer(8), device :: ivar
integer(4) :: cmp
integer(8) :: value
subroutine nvshmemx_int64_wait_until_on_stream(ivar, cmp, value, stream)
integer(8), device :: ivar
integer(4) :: cmp
integer(8) :: value
integer(cuda_stream_kind) :: stream
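For illustration, a device-side consumer might spin on a symmetric flag until a producer PE sets it. This is only a sketch; it assumes flag is a symmetric integer(4) device object and that NVSHMEM_CMP_EQ is the equality comparison constant defined by the module.
! Hedged sketch: wait in device code until flag becomes 1.
call nvshmem_int32_wait_until(flag, NVSHMEM_CMP_EQ, 1)
! flag is now 1; it is safe to read the data it guards.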
9.6. NVSHMEM Memory Ordering Functions
This section contains the Fortran interfaces to NVSHMEM functions that provide mechanisms to ensure ordering and/or delivery of completion on NVSHMEM operations.
9.6.1. nvshmem_fence
This subroutine ensures the ordering of delivery of operations on symmetric data objects.
subroutine nvshmem_fence()
9.6.2. nvshmem_quiet
These subroutines ensure completion of all operations on symmetric data objects issued by the calling PE.
subroutine nvshmem_quiet()
subroutine nvshmemx_quiet_on_stream(stream)
integer(cuda_stream_kind) :: stream
10. NVTX Profiling Library APIs
This chapter describes the Fortran interfaces to the NVIDIA Tools Extension (NVTX) library. NVTX is a set of functions that a developer can use to provide additional information to tools, such as NVIDIA’s Nsight Systems performance analysis tool. NVTX functions are accessible from host code, but can be useful in marking and viewing time spans (ranges) of both host and device sections of an application.
The NVTX interfaces and definitions described in this chapter can be exposed by adding the line
use nvtx
to your program unit. A version of this module has been available through other means in the past, but this chapter documents the Fortran module now included in the NVIDIA HPC SDK. Since we are targeting the NVTX v3 API, a header-only C library, we have instantiated Fortran-callable wrappers and provide those in a library, libnvhpcwrapnvtx.[a|so]; linking requires that the developer add -cudalib=nvtx to the link line, or explicitly add some form of -lnvhpcwrapnvtx.
This chapter is divided into three sections. The first describes the traditional Fortran NVTX interfaces which have been available previously. The second describes advanced functions which are now supported in the NVTX v3 API. The third shows a method which leverages the nvfortran -Minstrument option to automatically insert NVTX ranges across subprogram entry and exit.
Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4).
10.1. NVTX Basic Tooling APIs
This section describes the most basic Fortran interfaces to the NVIDIA Tools Extension (NVTX) library. These interfaces were first defined in blog posts and via a publicly available source repository. The simplest interfaces merely push and pop user-labeled, nested time ranges.
The StartRange/EndRange names were transposed from the advanced RangeStart/RangeEnd names for ease of use. Both styles can be used in the same program.
10.1.1. nvtxStartRange
This subroutine begins a simple labelled time span range using the NVTX library. The icolor argument is optional, and will map to one of many predefined colors. The ranges can be nested.
subroutine nvtxStartRange( label, icolor )
character(len=*) :: label
integer, optional :: icolor
10.1.2. nvtxEndRange
This subroutine terminates a simple labelled time span range initiated by nvtxStartRange. It takes no arguments.
subroutine nvtxEndRange()
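These two calls are typically paired around a phase of the program. A minimal sketch with nested ranges follows; the assemble_matrix and run_solver routines are hypothetical placeholders, and the icolor values are arbitrary indices into the predefined color map.
call nvtxStartRange("assembly", 1)
call assemble_matrix()          ! hypothetical user routine
call nvtxStartRange("solve", 2)
call run_solver()               ! hypothetical user routine
call nvtxEndRange()             ! ends the "solve" range
call nvtxEndRange()             ! ends the "assembly" range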
10.2. NVTX Advanced Tooling APIs
This section describes the advanced Fortran interfaces to the NVIDIA Tools Extension (NVTX) library which target the NVTX v3 API.
10.2.1. NVTX Definitions and Derived Types
This section contains the definitions and data types used in the advanced Fortran interfaces to the NVIDIA Tools Extension (NVTX) library, v3 API.
! Parameters
integer, parameter :: NVTX_VERSION = 3
integer, parameter :: NVTX_EVENT_ATTRIB_STRUCT_SIZE = 48
! NVTX Status
enum, bind(C)
enumerator :: NVTX_SUCCESS = 0
enumerator :: NVTX_FAIL = 1
enumerator :: NVTX_ERR_INIT_LOAD_PROPERTY = 2
enumerator :: NVTX_ERR_INIT_ACCESS_LIBRARY = 3
enumerator :: NVTX_ERR_INIT_LOAD_LIBRARY = 4
enumerator :: NVTX_ERR_INIT_MISSING_LIBRARY_ENTRY_POINT = 5
enumerator :: NVTX_ERR_INIT_FAILED_LIBRARY_ENTRY_POINT = 6
enumerator :: NVTX_ERR_NO_INJECTION_LIBRARY_AVAILABLE = 7
end enum
! nvtxColorType_t, from nvToolsExt.h
type, bind(c) :: nvtxColorType
integer(4) :: type
end type
type(nvtxColorType), parameter :: &
NVTX_COLOR_UNKNOWN = nvtxColorType(0), &
NVTX_COLOR_ARGB = nvtxColorType(1)
! nvtxMessageType_t, from nvToolsExt.h
type, bind(c) :: nvtxMessageType
integer(4) :: type
end type
type(nvtxMessageType), parameter :: &
NVTX_MESSAGE_UNKNOWN = nvtxMessageType(0), &
NVTX_MESSAGE_TYPE_ASCII = nvtxMessageType(1), &
NVTX_MESSAGE_TYPE_UNICODE = nvtxMessageType(2), &
NVTX_MESSAGE_TYPE_REGISTERED = nvtxMessageType(3)
! nvtxPayloadType_t, from nvToolsExt.h
type, bind(c) :: nvtxPayloadType
integer(4) :: type
end type
type(nvtxPayloadType), parameter :: &
NVTX_PAYLOAD_UNKNOWN = nvtxPayloadType(0), &
NVTX_PAYLOAD_TYPE_UNSIGNED_INT64 = nvtxPayloadType(1), &
NVTX_PAYLOAD_TYPE_INT64 = nvtxPayloadType(2), &
NVTX_PAYLOAD_TYPE_DOUBLE = nvtxPayloadType(3), &
NVTX_PAYLOAD_TYPE_UNSIGNED_INT32 = nvtxPayloadType(4), &
NVTX_PAYLOAD_TYPE_INT32 = nvtxPayloadType(5), &
NVTX_PAYLOAD_TYPE_FLOAT = nvtxPayloadType(6)
! Something just for Fortran ease of use, C compat.
! The Fortran structure is bigger, but the first 48 bytes are the same
! Making it allocatable means it will get deallocated properly
type nvtxFtnStringType
character(1), allocatable :: chars(:)
end type
! nvtxEventAttributes_v2, from nvToolsExt.h
type, bind(C):: nvtxEventAttributes
integer(C_INT16_T) :: version = NVTX_VERSION
integer(C_INT16_T) :: size = NVTX_EVENT_ATTRIB_STRUCT_SIZE
integer(C_INT) :: category = 0
type(nvtxColorType) :: colorType = NVTX_COLOR_ARGB
integer(C_INT) :: color = z'ffffffff'
type(nvtxPayloadType) :: payloadType = NVTX_PAYLOAD_UNKNOWN
integer(C_INT) :: reserved0
integer(C_INT64_T) :: payload ! union uint,int,double
type(nvtxMessageType) :: messageType = NVTX_MESSAGE_TYPE_ASCII
type(nvtxFtnStringType) :: message ! ascii char
end type
! This module provides a type constructor for the nvtxEventAttributes type.
! For example:
! event = nvtxEventAttributes(message, color)
! message can be a Fortran character string, or
! an nvtx registered string.
! color is an optional argument, integer(C_INT), assigned to
! the color field
type nvtxRangeId
integer(8) :: id
end type
type nvtxDomainHandle
type(C_PTR) :: handle
end type
type nvtxStringHandle
type(C_PTR) :: handle
end type
10.2.2. nvtxInitialize
This subroutine forces the NVTX library to initialize. It can be used to move the initialization overhead, for timing purposes. It takes no arguments.
subroutine nvtxInitialize()
10.2.3. nvtxDomainCreate
This function creates a new named NVTX domain. Each domain maintains its own push and pop stack.
function nvtxDomainCreate(message) result(domain)
character(len=*) :: message
type(nvtxDomainHandle) :: domain
10.2.4. nvtxDomainDestroy
This subroutine destroys an NVTX domain.
subroutine nvtxDomainDestroy(domain)
type(nvtxDomainHandle) :: domain
10.2.5. nvtxDomainRegisterString
This function registers an immutable string with NVTX, for use with the type(eventAttributes) message field.
function nvtxDomainRegisterString(domain, message) &
result(stringHandle)
type(nvtxDomainHandle) :: domain
character(len=*) :: message
type(nvtxStringHandle) :: stringHandle
Using overloaded assignment defined in this module, users can enable a registered string using these two statements:
event%message = nvtxDomainRegisterString(domain, "Str 1")
event%messageType = NVTX_MESSAGE_TYPE_REGISTERED
A type(eventAttributes) variable can also be initialized by passing a registered string to the type constructor, along with an optional color:
regstr = nvtxDomainRegisterString(domain, "Str 2")
event = nvtxEventAttributes(regstr, icolor)
10.2.6. nvtxDomainNameCategory
This subroutine allows the user to assign a name to a category ID that is specific to the domain.
subroutine nvtxDomainNameCategory(domain, category, name)
type(nvtxDomainHandle) :: domain
integer(4) :: category
character(len=*) :: name
10.2.7. nvtxNameCategory
This subroutine allows the user to assign a name to a category ID.
subroutine nvtxNameCategory(category, name)
integer(4) :: category
character(len=*) :: name
10.2.8. nvtxDomainMarkEx
This subroutine marks an instantaneous event in the application, with full control over the NVTX domain and event attributes.
subroutine nvtxDomainMarkEx(domain, event)
type(nvtxDomainHandle) :: domain
type(nvtxEventAttributes) :: event
10.2.9. nvtxMarkEx
This subroutine marks an instantaneous event in the application, with user-supplied NVTX event attributes.
subroutine nvtxMarkEx(event)
type(nvtxEventAttributes) :: event
10.2.10. nvtxMark
This subroutine marks an instantaneous event in the application with a user-supplied message.
subroutine nvtxMark(message)
character(len=*) :: message
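For example, to drop an instantaneous marker at the start of each time step (the label is arbitrary):
call nvtxMark("begin time step")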
10.2.11. nvtxDomainRangeStartEx
This function starts a process range in the application, with full control over the NVTX domain and event attributes, and returns a unique range ID.
function nvtxDomainRangeStartEx(domain, event) result(id)
type(nvtxDomainHandle) :: domain
type(nvtxEventAttributes) :: event
type(nvtxRangeId) :: id
10.2.12. nvtxRangeStartEx
This function starts a process range in the application, with user-supplied NVTX event attributes, and returns a unique range ID.
function nvtxRangeStartEx(event) result(id)
type(nvtxEventAttributes) :: event
type(nvtxRangeId) :: id
10.2.13. nvtxRangeStart
This function starts a process range in the application with a user-supplied message, and returns a unique range ID.
function nvtxRangeStart(message) result(id)
character(len=*) :: message
type(nvtxRangeId) :: id
10.2.14. nvtxDomainRangeEnd
This subroutine ends a process range in the application. Arguments are the domain and range ID from a previous call to nvtxDomainRangeStartEx.
subroutine nvtxDomainRangeEnd(domain, id)
type(nvtxDomainHandle) :: domain
type(nvtxRangeId) :: id
10.2.15. nvtxRangeEnd
This subroutine ends a process range in the application. The argument is a range ID returned from a previous call to any nvtxRangeStart function.
subroutine nvtxRangeEnd(id)
type(nvtxRangeId) :: id
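Start/end ranges are matched by the returned ID rather than by nesting. A minimal sketch, with a hypothetical I/O routine as the timed work:
type(nvtxRangeId) :: rid
rid = nvtxRangeStart("checkpoint I/O")
call write_checkpoint()         ! hypothetical user routine
call nvtxRangeEnd(rid)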
10.2.16. nvtxDomainRangePushEx
This function starts a nested thread range in the application, with full control over the NVTX domain and event attributes, and returns the nested range level.
function nvtxDomainRangePushEx(domain, event) result(ilvl)
type(nvtxDomainHandle) :: domain
type(nvtxEventAttributes) :: event
integer(4) :: ilvl
10.2.17. nvtxRangePushEx
This function starts a nested thread range in the application, with user-supplied event attributes, and returns the nested range level.
function nvtxRangePushEx(event) result(ilvl)
type(nvtxEventAttributes) :: event
integer(4) :: ilvl
10.2.18. nvtxRangePush
This function starts a nested range in the application with a user-supplied message, and returns the level of the range being started.
function nvtxRangePush(message) result(ilvl)
character(len=*) :: message
integer(4) :: ilvl
10.2.19. nvtxDomainRangePop
This function ends a nested thread range in the application, within a specific domain.
function nvtxDomainRangePop(domain) result(ilvl)
type(nvtxDomainHandle) :: domain
integer(4) :: ilvl
10.2.20. nvtxRangePop
This function ends a nested thread range in the application, and returns the level of the range being ended.
function nvtxRangePop() result(ilvl)
integer(4) :: ilvl
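Putting the advanced interfaces together, the following minimal sketch creates a named domain, builds an event attribute with the type constructor described above, and brackets a phase with a domain push/pop. The domain name, message, and do_sweep routine are illustrative only.
type(nvtxDomainHandle) :: dom
type(nvtxEventAttributes) :: ev
integer(4) :: ilvl
dom = nvtxDomainCreate("MyLibrary")
ev = nvtxEventAttributes("sweep")   ! the color argument is optional
ilvl = nvtxDomainRangePushEx(dom, ev)
call do_sweep()                     ! hypothetical user routine
ilvl = nvtxDomainRangePop(dom)
call nvtxDomainDestroy(dom)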
10.3. NVTX Automated Instrumentation
This section describes a method to automatically insert NVIDIA Tools Extension (NVTX) ranges into your code without making source changes. This method is only supported on Linux systems.
The first step is to determine which source files you want to view NVTX labels for. In your build process, add this compiler option for those files:
-Minstrument
This standard compiler option instructs the compiler to insert two calls into the generated code: at subprogram entry, it will insert a call to __cyg_profile_func_enter(), and at subprogram exit, it will insert a call to __cyg_profile_func_exit(). These entry points are meant to be supplied by profiling tools. One important input argument to these functions, inserted by the compiler, is the function address.
The next step, for best user experience, is to link your executable with these options:
-traceback -lnvhpcwrapnvtx
or alternatively:
-fPIC -Wl,-export-dynamic -lnvhpcwrapnvtx
These options enable the runtime to convert the function or subroutine address into a symbol via the dladdr() library call. Without these options, the label will contain the subprogram unit address in hexadecimal, which is still useful but requires additional manual processing steps to determine the associated symbol name.
As with all of the NVTX instrumentation methods, you need to enable the processing of the NVTX API calls when you run. An example of enabling NVTX, using Nsight Systems, is to use
nsys profile --trace=nvtx
which will result in the NVTX time span ranges presented on the Nsight timeline. Currently,
--trace=nvtx
is set by default, so just specifying
nsys profile ./a.out
will provide you with the NVTX annotations, along with CUDA traces.
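For example, a complete build-and-profile sequence for a single source file might look like the following; the file name is illustrative.
nvfortran -Minstrument -traceback main.f90 -o a.out -cudalib=nvtx
nsys profile ./a.out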
11. NVIDIA Communication Abstraction Library (CAL) APIs
This section describes the Fortran interfaces to the CAL library. This library contains a small number of functions used in conjunction with other multi-processor libraries, for initializing a communications layer used for handling distributed matrices and arrays.
The CAL interfaces and definitions described in this chapter can be exposed in host code by adding the line
use nvf_cal_comm
to your program unit. However, since the module definitions are also used within other modules in this document, it is suggested you access them through the higher-level modules, such as use cublasMp or use cusolverMp.
11.1. CAL API
This section describes the parameters and derived types defined in the CAL module.
! Definitions from cal.h
integer, parameter :: CAL_VER_MAJOR = 0
integer, parameter :: CAL_VER_MINOR = 4
integer, parameter :: CAL_VER_PATCH = 2
integer, parameter :: CAL_VER_BUILD = 0
integer, parameter :: CAL_VERSION = &
(CAL_VER_MAJOR * 1000 + CAL_VER_MINOR * 100 + CAL_VER_PATCH)
enum, bind(c)
enumerator :: CAL_OK = 0 ! Success
enumerator :: CAL_ERROR_INPROGRESS = 1 ! Request is in progress
enumerator :: CAL_ERROR = 2 ! Generic error
enumerator :: CAL_ERROR_INVALID_PARAMETER = 3 ! Invalid parameter to the interface function.
enumerator :: CAL_ERROR_INTERNAL = 4 ! Internal error
enumerator :: CAL_ERROR_CUDA = 5 ! Error in CUDA runtime/driver API
enumerator :: CAL_ERROR_UCC = 6 ! Error in UCC call
enumerator :: CAL_ERROR_NOT_SUPPORTED = 7 ! Requested configuration not supported
end enum
! Types from cal.h
TYPE cal_comm
TYPE(C_PTR) :: handle
END TYPE cal_comm
11.1.1. cal_comm_create_mpi
CAL is a communications library used by cublasMp and other multi-processor libraries. It uses MPI for initialization and potentially other purposes. This is a convenience function provided with Fortran to initialize the CAL communicator. Because of the dependence on MPI, source code for some CAL Fortran wrappers is shipped in the NVHPC package and can be built with the MPI headers used in your application. The version we ship works with the default MPI bundled in the NVHPC package. The CAL communicator derived type output by this function is an input to cublasMpGridCreate(), cusolverMpCreateDeviceGrid(), and similar functions.
integer(4) function cal_comm_create_mpi(mpi_comm, &
rank, nranks, local_device, comm)
integer(4), intent(in) :: mpi_comm, rank, nranks, local_device
type(cal_comm), intent(out) :: comm
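A minimal host-side sketch of the typical setup sequence follows. It assumes MPI has already been initialized and that ndevices holds the number of visible GPUs; all variable names are illustrative.
type(cal_comm) :: comm
integer(4) :: rank, nranks, ierr, istat
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
istat = cal_comm_create_mpi(MPI_COMM_WORLD, rank, nranks, mod(rank, ndevices), comm)
! ... pass comm to cublasMpGridCreate() or cusolverMpCreateDeviceGrid() ...
istat = cal_comm_destroy(comm)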
11.1.2. cal_comm_destroy
This function destroys the cal_comm data structure and frees the resources associated with it.
integer(4) function cal_comm_destroy(comm)
type(cal_comm) :: comm
11.1.3. cal_stream_sync
This function blocks the calling thread until all outstanding device operations, including cal operations, are finished in the specified stream.
integer(4) function cal_stream_sync(comm, stream)
type(cal_comm) :: comm
integer(cuda_stream_kind) :: stream
11.1.4. cal_comm_barrier
This function synchronizes streams from all processes in the CAL communicator, effectively an all-to-all synchronization.
integer(4) function cal_comm_barrier(comm, stream)
type(cal_comm) :: comm
integer(cuda_stream_kind) :: stream
11.1.5. cal_comm_get_rank
This function returns the rank of the calling process in the CAL communicator.
integer(4) function cal_comm_get_rank(comm, rank)
type(cal_comm) :: comm
integer(4) :: rank
11.1.6. cal_comm_get_size
This function returns the size of the CAL communicator.
integer(4) function cal_comm_get_size(comm, size)
type(cal_comm) :: comm
integer(4) :: size
12. Examples
This section contains examples with source code.
12.1. Using cuBLAS from OpenACC Host Code
This example demonstrates the use of the cublas module, the cublasHandle type, and several forms of blas calls from OpenACC data regions.
Simple OpenACC BLAS Test
program testcublas
! compile with pgfortran -ta=tesla -cudalib=cublas -cuda testcublas.f90
call testcu1(1000)
call testcu2(1000)
end
!
subroutine testcu1(n)
use openacc
use cublas
integer :: a(n), b(n)
type(cublasHandle) :: h
istat = cublasCreate(h)
! Force OpenACC kernels and cuBLAS to use the OpenACC stream.
istat = cublasSetStream(h, acc_get_cuda_stream(acc_async_sync))
!$acc data copyout(a, b)
!$acc kernels
a = 1
b = 2
!$acc end kernels
! No host_data, we are lexically inside a data region
! sswap will accept any kind(4) data type
call sswap(n, a, 1, b, 1)
call cublasSswap(n, a, 1, b, 1)
!$acc end data
if (all(a.eq.1).and.all(b.eq.2)) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
!
subroutine testcu2(n)
use openacc
use cublas
real(8) :: a(n), b(n)
a = 1.0d0
b = 2.0d0
!$acc data copy(a, b)
!$acc host_data use_device(a,b)
call dswap(n, a, 1, b, 1)
call cublasDswap(n, a, 1, b, 1)
!$acc end host_data
!$acc end data
if (all(a.eq.1.0d0).and.all(b.eq.2.0d0)) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
CUBLASXT BLAS Test
This example demonstrates the use of the cublasxt module.
program testcublasxt
call testxt1(1000)
call testxt2(1000)
end
!
subroutine testxt1(n)
use cublasxt
real(4) :: a(n,n), b(n,n), c(n,n), alpha, beta
type(cublasXtHandle) :: h
integer ndevices(1)
a = 1.0
b = 2.0
c = -1.0
alpha = 1.0
beta = 0.0
istat = cublasXtCreate(h)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
ndevices(1) = 0
istat = cublasXtDeviceSelect(h, 1, ndevices)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
istat = cublasXtSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, &
n, n, n, &
alpha, A, n, B, n, beta, C, n)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
istat = cublasXtDestroy(h)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
if (all(c.eq.2.0*n)) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
print *,c(1,1),c(n,n)
end
!
subroutine testxt2(n)
use cublasxt
real(8) :: a(n,n), b(n,n), c(n,n), alpha, beta
type(cublasXtHandle) :: h
integer ndevices(1)
a = 1.0d0
b = 2.0d0
c = -1.0d0
alpha = 1.0d0
beta = 0.0
istat = cublasXtCreate(h)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
ndevices(1) = 0
istat = cublasXtDeviceSelect(h, 1, ndevices)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
istat = cublasXtDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, &
n, n, n, &
alpha, A, n, B, n, beta, C, n)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
istat = cublasXtDestroy(h)
if (istat .ne. CUBLAS_STATUS_SUCCESS) print *,istat
if (all(c.eq.2.0d0*n)) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
print *,c(1,1),c(n,n)
end
12.2. Using cuBLAS from CUDA Fortran Host Code
This example demonstrates the use of the cublas module, the cublasHandle type, and several forms of blas calls.
Simple BLAS Test
program testisamax
! Compile with "pgfortran testisamax.cuf -cudalib=cublas -lblas"
! Use the NVIDIA cudafor and cublas modules
use cudafor
use cublas
!
real*4, device, allocatable :: xd(:)
real*4 x(1000)
integer, device :: kd
type(cublasHandle) :: h
call random_number(x)
! F90 way
i = maxloc(x,dim=1)
print *,i
print *,x(i-1),x(i),x(i+1)
! Host way
j = isamax(1000,x,1)
print *,j
print *,x(j-1),x(j),x(j+1)
! CUDA Generic BLAS way
allocate(xd(1000))
xd = x
k = isamax(1000,xd,1)
print *,k
print *,x(k-1),x(k),x(k+1)
! CUDA Specific BLAS way
k = cublasIsamax(1000,xd,1)
print *,k
print *,x(k-1),x(k),x(k+1)
! CUDA V2 Host Specific BLAS way
istat = cublasCreate(h)
if (istat .ne. 0) print *,"cublasCreate returned ",istat
k = 0
istat = cublasIsamax_v2(h, 1000, xd, 1, k)
if (istat .ne. 0) print *,"cublasIsamax 1 returned ",istat
print *,k
print *,x(k-1),x(k),x(k+1)
! CUDA V2 Device Specific BLAS way
k = 0
istat = cublasIsamax_v2(h, 1000, xd, 1, kd)
if (istat .ne. 0) print *,"cublasIsamax 2 returned ",istat
k = kd
print *,k
print *,x(k-1),x(k),x(k+1)
istat = cublasDestroy(h)
if (istat .ne. 0) print *,"cublasDestroy returned ",istat
end program
Multi-threaded BLAS Test
This example demonstrates the use of the cublas module in a multi-threaded code. Each thread will attach to a different GPU, create a context, and combine the results at the end.
program tsgemm
!
! Multi-threaded example calling sgemm from cuda fortran.
! Compile with "pgfortran -mp tsgemm.cuf -cudalib=cublas"
! Set OMP_NUM_THREADS=number of GPUs in your system.
!
use cublas
use cudafor
use omp_lib
!
! Size this according to number of GPUs
!
! Small
!integer, parameter :: K = 2500
!integer, parameter :: M = 2000
!integer, parameter :: N = 2000
! Large
integer, parameter :: K = 10000
integer, parameter :: M = 10000
integer, parameter :: N = 10000
integer, parameter :: NTIMES = 10
!
real*4, device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)
!$omp THREADPRIVATE(a_d,b_d,c_d)
real*4 a(m,k), b(k,n), c(m,n)
real*4 alpha, beta
integer, allocatable :: offs(:)
type(cudaEvent) :: start, stop
a = 1.0; b = 0.0; c = 0.0
do i = 1, N
b(i,i) = 1.0
end do
alpha = 1.0; beta = 1.0
! Break up the B and C array into sections
nthr = omp_get_max_threads()
nsec = N / nthr
print *,"Running with ",nthr," threads, each section = ",nsec
allocate(offs(nthr))
offs = (/ (i*nsec,i=0,nthr-1) /)
! Allocate and initialize the arrays
! Each thread connects to a device and creates a CUDA context.
!$omp PARALLEL private(i,istat)
i = omp_get_thread_num() + 1
istat = cudaSetDevice(i-1)
allocate(a_d(M,K), b_d(K,nsec), c_d(M,nsec))
a_d = a
b_d = b(:,offs(i)+1:offs(i)+nsec)
c_d = c(:,offs(i)+1:offs(i)+nsec)
!$omp end parallel
istat = cudaEventCreate(start)
istat = cudaEventCreate(stop)
time = 0.0
istat = cudaEventRecord(start, 0)
! Run the traditional blas kernel
!$omp PARALLEL private(j,istat)
do j = 1, NTIMES
call sgemm('n','n', M, N/nthr, K, alpha, a_d, M, b_d, K, beta, c_d, M)
end do
istat = cudaDeviceSynchronize()
!$omp end parallel
istat = cudaEventRecord(stop, 0)
istat = cudaEventElapsedTime(time, start, stop)
time = time / (NTIMES*1.0e3)
!$omp PARALLEL private(i)
i = omp_get_thread_num() + 1
c(:,offs(i)+1:offs(i)+nsec) = c_d
!$omp end parallel
nerrors = 0
do j = 1, N
do i = 1, M
if (c(i,j) .ne. NTIMES) nerrors = nerrors + 1
end do
end do
print *,"Number of Errors:",nerrors
gflops = 2.0 * N * M * K/time/1e9
write (*,901) m,k,k,N,time*1.0e3,gflops
901 format(i0,'x',i0,' * ',i0,'x',i0,':\t',f8.3,' ms\t',f12.3,' GFlops/s')
end
12.3. Using cuFFT from OpenACC Host Code
This example demonstrates the use of the cufft module, the cufftHandle type, and several cuFFT library calls.
Simple cuFFT Test
program cufft2dTest
use cufft
use openacc
integer, parameter :: m=768, n=512
complex, allocatable :: a(:,:),b(:,:),c(:,:)
real, allocatable :: r(:,:),q(:,:)
integer :: iplan1, iplan2, iplan3, ierr
allocate(a(m,n),b(m,n),c(m,n))
allocate(r(m,n),q(m,n))
a = 1; r = 1
xmx = -99.0
ierr = cufftPlan2D(iplan1,n,m,CUFFT_C2C)
ierr = ierr + cufftSetStream(iplan1,acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(a,b,c)
ierr = ierr + cufftExecC2C(iplan1,a,b,CUFFT_FORWARD)
ierr = ierr + cufftExecC2C(iplan1,b,c,CUFFT_INVERSE)
!$acc end host_data
! scale c
!$acc kernels
c = c / (m*n)
!$acc end kernels
! Check forward answer
write(*,*) 'Max error C2C FWD: ', cmplx(maxval(real(b)) - sum(real(b)), &
maxval(imag(b)))
! Check inverse answer
write(*,*) 'Max error C2C INV: ', maxval(abs(a-c))
! Real transform
ierr = ierr + cufftPlan2D(iplan2,n,m,CUFFT_R2C)
ierr = ierr + cufftPlan2D(iplan3,n,m,CUFFT_C2R)
ierr = ierr + cufftSetStream(iplan2,acc_get_cuda_stream(acc_async_sync))
ierr = ierr + cufftSetStream(iplan3,acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(r,b,q)
ierr = ierr + cufftExecR2C(iplan2,r,b)
ierr = ierr + cufftExecC2R(iplan3,b,q)
!$acc end host_data
!$acc kernels
xmx = maxval(abs(r-q/(m*n)))
!$acc end kernels
! Check R2C + C2R answer
write(*,*) 'Max error R2C/C2R: ', xmx
ierr = ierr + cufftDestroy(iplan1)
ierr = ierr + cufftDestroy(iplan2)
ierr = ierr + cufftDestroy(iplan3)
if (ierr.eq.0) then
print *,"test PASSED"
else
print *,"test FAILED"
endif
end program cufft2dTest
12.4. Using cuFFT from CUDA Fortran Host Code
This example demonstrates the use of the cuFFT module, the cufftHandle type, and several cuFFT library calls.
Simple cuFFT Test
program cufft2dTest
use cudafor
use cufft
implicit none
integer, parameter :: m=768, n=512
complex, managed :: a(m,n),b(m,n)
real, managed :: ar(m,n),br(m,n)
real x
integer plan, ierr
logical passing
a = 1; ar = 1
ierr = cufftPlan2D(plan,n,m,CUFFT_C2C)
ierr = ierr + cufftExecC2C(plan,a,b,CUFFT_FORWARD)
ierr = ierr + cufftExecC2C(plan,b,b,CUFFT_INVERSE)
ierr = ierr + cudaDeviceSynchronize()
x = maxval(abs(a-b/(m*n)))
write(*,*) 'Max error C2C: ', x
passing = x .le. 1.0e-5
ierr = ierr + cufftPlan2D(plan,n,m,CUFFT_R2C)
ierr = ierr + cufftExecR2C(plan,ar,b)
ierr = ierr + cufftPlan2D(plan,n,m,CUFFT_C2R)
ierr = ierr + cufftExecC2R(plan,b,br)
ierr = ierr + cudaDeviceSynchronize()
x = maxval(abs(ar-br/(m*n)))
write(*,*) 'Max error R2C/C2R: ', x
passing = passing .and. (x .le. 1.0e-5)
ierr = ierr + cufftDestroy(plan)
print *,ierr
passing = passing .and. (ierr .eq. 0)
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end program cufft2dTest
12.5. Using cufftXt from CUDA Fortran Host Code
This example demonstrates the use of the cufftXt module, the cudaLibXtDesc type, many of the cufftXt API calls, the padding required for real/complex FFTs, and the distribution of active modes across multiple GPUs.
Simple Poisson 3D CufftXt Test
program cufftXt3DTest
use cudafor
use cufftXt ! cufftXt module includes cufft
implicit none
integer, parameter :: nGPUs = 2
integer, parameter :: M = 1, N = 1, P = 1
integer, parameter :: nx = 32, ny=32, nz=32
real, parameter :: twopi = 8.0*atan(1.0)
real, parameter :: Lx = 1.0, Ly = 1.0, Lz = 1.0
logical, parameter :: printModes = .true.
! GPU
integer :: nodeGPUs
integer :: whichGPUs(nGPUs)
! grid
real :: x(nx), y(ny), z(nz)
real :: hx, hy, hz
! wavenumbers
real :: kx(nx/2+1), ky(ny), kz(nz)
type realDeviceArray
real, device, allocatable :: v(:)
end type realDeviceArray
type(realDeviceArray) :: kx_da(nGPUs), ky_da(nGPUs), kz_da(nGPUs)
real :: phi(nx+2,ny, nz), u(nx+2,ny,nz), ua(nx,ny,nz)
integer :: status, i, j, k
! check M, N
if (M==0 .or. N==0 .or. P==0) then
write(*,*) 'M, N or P is 0 thus solution is u=0'
stop
else
write(*,"(' Running with modes M, N, P = ', i0, ', ', i0, ', ', i0)") M, N, P
write(*,"(' on a grid of ', i0, 'x', i0, 'x', i0)") nx, ny, nz
end if
! choose GPUs
! check for multiple GPUs
status = cudaGetDeviceCount(nodeGPUs)
if (status /= cudaSuccess) write(*,*) 'Error in cudaGetDeviceCount()'
if (nodeGPUs .lt. nGPUs) then
write(*,*) 'Number of GPUs on node:', nodeGPUs
write(*,*) 'Insufficient number of GPUs for this code, exiting ...'
stop
end if
! check for GPUs of same compute capability
block
type(cudaDeviceProp) :: prop
integer :: cc(0:nodeGPUs-1,2)
logical :: foundIdenticalGPUs = .false.
integer :: iWG, nIdentGPUs
write(*,*) 'Available devices:'
do i = 0, nodeGPUs-1
status = cudaGetDeviceProperties(prop, i)
if (status /= cudaSuccess) write(*,*) 'Error in cudaGetDeviceProperties()'
cc(i,1) = prop%major
cc(i,2) = prop%minor
write(*,"(' ', i0, ' CC: ', i0, '.', i0)") i, cc(i,1), cc(i,2)
enddo
do i = 0, nodeGPUs-1
nIdentGPUs = 1
do j = i+1, nodeGPUs-1
if (all(cc(i,:) == cc(j,:))) nIdentGPUs = nIdentGPUs+1
enddo
if (nIdentGPUs .ge. nGPUs) then
foundIdenticalGPUs = .true.
iWG = 1
whichGPUs(iWG) = i
do j = i+1, nodeGPUs-1
if (all(cc(i,:) == cc(j,:))) then
iWG = iWG+1
whichGPUs(iWG) = j
end if
if (iWG == nGPUs) exit
end do
exit
end if
end do
if (foundIdenticalGPUS) then
write(*,*) 'Running on GPUs:'
write(*,*) whichGPUs
else
write(*,"(' No ', i0, ' identical GPUs found, exiting ...')"), nGPUs
stop
end if
end block
! Physical grid
hx = Lx/nx
do i = 1, nx
x(i) = hx*i
enddo
hy = Ly/ny
do j = 1, ny
y(j) = hy*j
enddo
hz = Lz/nz
do k = 1, nz
z(k) = hz*k
enddo
! Wavenumbers
do i = 1, nx/2+1
kx(i) = (i-1)*(twoPi/Lx)
enddo
do j = 1, ny/2
ky(j) = (j-1)*(twoPi/Ly)
enddo
do j = ny/2+1, ny
ky(j) = (j-1-ny)*(twoPi/Ly)
enddo
do k = 1, nz/2
kz(k) = (k-1)*(twoPi/Lz)
enddo
do k = nz/2+1, nz
kz(k) = (k-1-nz)*(twoPi/Lz)
enddo
! copy wavenumber arrays to each device
do i = 1, nGPUs
status = cudaSetDevice(whichGPUs(i))
allocate(kx_da(i)%v, source=kx)
allocate(ky_da(i)%v, source=ky)
allocate(kz_da(i)%v, source=kz)
enddo
! Initialize phi and get analytical solution
do k = 1, nz
do j = 1, ny
do i = 1, nx
phi(i,j,k) = sin(twoPi*M*x(i))*sin(twoPi*N*y(j))*sin(twoPi*P*z(k))
ua(i,j,k) = -phi(i,j,k)/((twoPi*M)**2 + (twoPi*N)**2 + (twoPi*P)**2)
enddo
enddo
end do
! cufft block
block
integer :: planR2C, planC2R
integer(c_size_t) :: worksize(nGPUs)
type(cudaLibXtDesc), pointer :: phi_desc
status = cufftCreate(planR2C)
if (status /= CUFFT_SUCCESS) write(*,*) 'Error in cufftCreate'
status = cufftCreate(planC2R)
if (status /= CUFFT_SUCCESS) write(*,*) 'Error in cufftCreate'
status = cufftXtSetGPUs(planR2C, nGPUs, whichGPUs)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtSetGPUs failed'
status = cufftXtSetGPUs(planC2R, nGPUs, whichGPUs)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtSetGPUs failed'
status = cufftMakePlan3d(planR2C, nz, ny, nx, CUFFT_R2C, worksize)
if (status /= CUFFT_SUCCESS) write(*,*) 'Error in cufftMakePlan3d'
status = cufftMakePlan3d(planC2R, nz, ny, nx, CUFFT_C2R, worksize)
if (status /= CUFFT_SUCCESS) write(*,*) 'Error in cufftMakePlan3d'
! allocate memory on separate devices
status = cufftXtMalloc(planR2C, phi_desc, CUFFT_XT_FORMAT_INPLACE)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtMalloc failed'
! H2D transfer
write(*,*) 'cufftXtMemcpy H2D ...'
status = cufftXtMemcpy(planR2C, phi_desc, phi, CUFFT_COPY_HOST_TO_DEVICE)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtMemcpy H2D failed'
! forward FFT
write(*,*) 'Forward transform ...'
status = cufftXtExecDescriptorR2C(planR2C, phi_desc, phi_desc)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtExecDescriptorR2C failed:', status
if (printModes) then
block
real :: threshold = 1.e-3
type(cudaXtDesc), pointer :: phiXtDesc
complex, device, pointer :: phi_d(:,:,:)
integer :: dev, g, jl, nyl
complex :: phi_h(nx/2+1, ny, nz)
call c_f_pointer(phi_desc%descriptor, phiXtDesc)
do g = 1, nGPUs
write(*,"(' Active modes on g = ', i0)") g
dev = phiXtDesc%GPUs(g)
status = cudaSetDevice(dev)
if (status /= cudaSuccess) write(*,*) 'cudaSetDevice failed'
! XtMalloc done with FORMAT_INPLACE, so data prior to transform
! are in natural order (split in least frequently changing index,
! in this case z), and after transform are in shuffled order (split in
! second least frequently changing index, in this case y)
nyl = ny/nGPUs
if (g .le. mod(ny, nGPUs)) nyl = nyl+1
call c_f_pointer(phiXtDesc%data(g), phi_d, [nx/2+1, nyl, nz])
!$cuf kernel do (3)
do k = 1, nz
do jl = 1, nyl
do i = 1, nx/2+1
if (abs(phi_d(i,jl,k)) .gt. threshold) print*, i, jl, k, phi_d(i,jl,k)
enddo
enddo
enddo
status = cudaDeviceSynchronize()
call flush()
end do
status = cufftXtMemcpy(planR2C, phi_h, phi_desc, CUFFT_COPY_DEVICE_TO_HOST)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtMemcpy D2H failed'
write(*,*) 'Active modes on host (after D2H):'
do k = 1, nz
do j = 1, ny
do i = 1, nx/2+1
if (abs(phi_h(i,j,k)) .gt. threshold) print*, i, j, k, phi_h(i,j,k)
end do
end do
enddo
end block
end if
! solve Poisson equation
write(*,*) 'Solve Poisson Eq ...'
block
type(cudaXtDesc), pointer :: phiXtDesc
complex, device, pointer :: phi_d(:,:,:)
integer :: dev, g, jl, nyl, yOffset
real :: k2
call c_f_pointer(phi_desc%descriptor, phiXtDesc)
yOffset = 0
do g = 1, nGPUs
dev = phiXtDesc%GPUs(g)
status = cudaSetDevice(dev)
if (status /= cudaSuccess) write(*,*) 'cudaSetDevice failed'
! XtMalloc done with FORMAT_INPLACE, so data prior to transform
! are in natural order (split in least frequently changing index,
! in this case z), and after transform are in shuffled order (split in
! second least frequently changing index, in this case y)
nyl = ny/nGPUs
if (g .le. mod(ny, nGPUs)) nyl = nyl+1
call c_f_pointer(phiXtDesc%data(g), phi_d, [nx/2+1, nyl, nz])
associate(kx_d => kx_da(g)%v, ky_d => ky_da(g)%v, kz_d => kz_da(g)%v)
!$cuf kernel do (3)
do k = 1, nz
do jl = 1, nyl
j = jl + yOffset
do i = 1, nx/2+1
k2 = kx_d(i)**2 + ky_d(j)**2 + kz_d(k)**2
phi_d(i,jl,k) = -phi_d(i,jl,k)/k2/(nx*ny*nz)
enddo
enddo
enddo
end associate
! specify mean (corrects division by zero wavenumber above)
if (g == 1) phi_d(1,1,1) = 0.0
yOffset = yOffset + nyl
end do
do g = 1, nGPUs
dev = phiXtDesc%GPUs(g)
status = cudaSetDevice(dev)
if (status /= cudaSuccess) write(*,*) 'cudaSetDevice failed'
status = cudaDeviceSynchronize()
if (status /= cudaSuccess) write(*,*) 'CUF kernel sync error'
end do
end block ! poisson
! inverse FFT
write(*,*) 'Inverse transform ...'
status = cufftXtExecDescriptorC2R(planC2R, phi_desc, phi_desc)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtExecDescriptorC2R failed'
! D2H transfer
write(*,*) 'cufftXtMemcpy D2H ...'
status = cufftXtMemcpy(planC2R, u, phi_desc, CUFFT_COPY_DEVICE_TO_HOST)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtMemcpy D2H failed'
! cufft block cleanup
status = cufftXtFree(phi_desc)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftXtFree failed'
status = cufftDestroy(planR2C)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftDestroy failed'
status = cufftDestroy(planC2R)
if (status /= CUFFT_SUCCESS) write(*,*) 'cufftDestroy failed'
end block
write(*,*) 'Max error: ', maxval(abs(u(1:nx,1:ny,1:nz)-ua(1:nx,1:ny,1:nz)))
! cleanup
do i = 1, nGPUs
status = cudaSetDevice(whichGPUs(i))
deallocate(kx_da(i)%v, ky_da(i)%v, kz_da(i)%v)
enddo
write(*,*) '... finished'
end program cufftXt3DTest
12.6. Using cuFFTMp from either OpenACC or CUDA Fortran
This example demonstrates the use of the cuFFTMp API from the cuFFTXt module, the cudaLibXtDesc and cudaXtDesc types, and attaching the MPI communicator to the cuFFT plan.
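At its core, the listing below follows a short call sequence: create a plan, attach the MPI communicator with cufftMpAttachComm before making the plan, allocate the distributed data with cufftXtMalloc, execute with cufftXtExecDescriptor, then free the descriptor and destroy the plan. The following is a minimal sketch of that sequence only, distilled from the full example; per-rank device selection, error checking, data initialization, and host/device transfers are omitted, and the grid size is a placeholder.
program cufftmp_sketch
use iso_c_binding
use cufftXt
use cufft
use mpi
implicit none
integer :: plan, ierr, status
integer, parameter :: nx = 64, ny = 64, nz = 64   ! placeholder grid size
integer(c_size_t) :: worksize(1)
type(cudaLibXtDesc), pointer :: desc
call mpi_init(ierr)
status = cufftCreate(plan)
! attach the MPI communicator before the plan is made
status = cufftMpAttachComm(plan, CUFFT_COMM_MPI, MPI_COMM_WORLD)
status = cufftMakePlan3d(plan, nx, ny, nz, CUFFT_R2C, worksize)
! descriptor-based allocation and execution on the distributed data
status = cufftXtMalloc(plan, desc, CUFFT_XT_FORMAT_INPLACE)
status = cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD)
! cleanup
status = cufftXtFree(desc)
status = cufftDestroy(plan)
call mpi_finalize(ierr)
end program cufftmp_sketch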
Real-to-Complex and Complex-to-Real cuFFTMp Test
!
! This sample illustrates a basic use of cuFFTMp using the built-in,
! optimized data distributions.
!
! It assumes the CPU data is initially distributed according to
! CUFFT_XT_FORMAT_INPLACE, a.k.a. X-Slabs.
! Given a global array of size X * Y * Z, every MPI rank owns approximately
! (X / ngpus) * Y * Z entries.
! More precisely,
! - The first (X % ngpus) MPI ranks each own (X/ngpus+1) planes of size Y*Z,
! - The remaining MPI ranks each own (X / ngpus) planes of size Y*Z
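! (As a concrete illustration only: with nx = ny = nz = 256 as below and,
! say, 3 MPI ranks, 256 % 3 = 1, so rank 0 owns 86 X-planes and ranks 1
! and 2 own 85 X-planes each before the forward transform; the same
! arithmetic applies in Y after the transform.)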
!
! The CPU data is then copied on GPU and a forward transform is applied.
!
! After that transform, GPU data is distributed according to
! CUFFT_XT_FORMAT_INPLACE_SHUFFLED, a.k.a. Y-Slabs.
! Given a global array of size X * Y * Z, every MPI rank owns approximately
! X * (Y / ngpus) * Z entries.
! More precisely,
! - The first (Y % ngpus) MPI ranks each own (Y/ngpus+1) planes of size X*Z,
! - The remaining MPI ranks each own (Y / ngpus) planes of size X*Z
!
! A scaling kernel is applied to the distributed GPU data (distributed
! according to CUFFT_XT_FORMAT_INPLACE_SHUFFLED).
! This kernel prints some elements to illustrate the
! CUFFT_XT_FORMAT_INPLACE_SHUFFLED data distribution and normalizes entries
! by (nx * ny * nz)
!
! Finally, a backward transform is applied.
! After this, data is again distributed according to CUFFT_XT_FORMAT_INPLACE,
! same as the input data.
!
! Data is finally copied back to CPU and compared to the input data. They
! should be almost identical.
!
! This program can be used by either OpenACC or CUDA Fortran:
! mpif90 -acc cufftmp_r2c.F90 -cudalib=cufftmp
! Or:
! mpif90 -cuda cufftmp_r2c.F90 -cudalib=cufftmp
!
module cufft_required
integer :: planr2c, planc2r
integer :: local_rshape(3), local_rshape_permuted(3)
integer :: local_permuted_cshape(3)
end module cufft_required
program cufftmp_r2c
use iso_c_binding
use cufftXt
use cufft
!@cuf use cudafor
!@acc use openacc
use mpi
use cufft_required
implicit none
integer :: size, rank, ndevices, ierr
integer :: nx, ny, nz ! nx slowest
integer :: i, j, k
integer :: my_nx, my_ny, my_nz, ranks_cutoff, whichgpu(1)
real, dimension(:, :, :), allocatable :: u, ref
complex, dimension(:,:,:), allocatable :: u_permuted
real :: max_norm, max_diff
! cufft stuff
integer(c_size_t) :: worksize(1)
type(cudaLibXtDesc), pointer :: u_desc
type(cudaXtDesc), pointer :: u_descptr
#ifdef _OPENACC
complex, pointer :: u_dptr(:,:,:)
type(c_ptr) :: tmpcptr
#else
complex, pointer, device :: u_dptr(:,:,:)
#endif
call mpi_init(ierr)
call mpi_comm_size(MPI_COMM_WORLD,size,ierr)
call mpi_comm_rank(MPI_COMM_WORLD,rank,ierr)
#ifdef _OPENACC
ndevices = acc_get_num_devices(acc_device_nvidia)
call acc_set_device_num(mod(rank, ndevices), acc_device_nvidia)
#else
call checkCuda(cudaGetDeviceCount(ndevices))
call checkCuda(cudaSetDevice(mod(rank, ndevices)))
#endif
whichgpu(1) = mod(rank, ndevices)
print*,"Hello from rank ",rank," gpu id",mod(rank,ndevices),"size",size
nx = 256
ny = nx
nz = nx
! We start with X-Slabs
! Ranks 0 ... (nx % size - 1) have 1 more element in the X dimension
! and every rank own all elements in the Y and Z dimensions.
ranks_cutoff = mod(nx, size)
my_nx = nx / size
if (rank < ranks_cutoff) my_nx = my_nx + 1
my_ny = ny;
my_nz = nz;
! Note nz is first dimension, nx third
local_rshape = [2*(nz/2+1), ny, my_nx]
local_permuted_cshape = [nz/2+1, ny/size, nx]
local_rshape_permuted = [2*(nz/2+1), ny/size, nx]
if (mod(ny, size) > 0) then
print*," ny must be divisible by the number of MPI ranks"
call mpi_finalize(ierr)
stop
end if
if (rank == 0) then
write(*,*) "local_rshape :", local_rshape(1:3)
write(*,*) "local_permuted_cshape :", local_permuted_cshape(1:3)
end if
! Generate local, distributed data
allocate(u(local_rshape(1), local_rshape(2), local_rshape(3)))
allocate(u_permuted(local_permuted_cshape(1), local_permuted_cshape(2), &
local_permuted_cshape(3)))
allocate(ref(local_rshape(1), local_rshape(2), local_rshape(3)))
print*,'shape of u is ', shape(u)
print*,'shape of u_permuted is ', shape(u_permuted)
call generate_random(nz, local_rshape(1), local_rshape(2), &
local_rshape(3), u)
ref = u
u_permuted = (0.0,0.0)
call checkNorm(nz, local_rshape(1), local_rshape(2), local_rshape(3), &
u, max_norm)
print*, "initial data on ", rank, " max_norm is ", max_norm
call checkCufft(cufftCreate(planr2c))
call checkCufft(cufftCreate(planc2r))
call checkCufft(cufftMpAttachComm(planr2c, CUFFT_COMM_MPI, &
MPI_COMM_WORLD), 'cufftMpAttachComm error')
call checkCufft(cufftMpAttachComm(planc2r, CUFFT_COMM_MPI, &
MPI_COMM_WORLD), 'cufftMpAttachComm error')
! Note nx, ny, nz order
call checkCufft(cufftMakePlan3d(planr2c, nx, ny, nz, CUFFT_R2C, &
worksize), 'cufftMakePlan3d r2c error')
call checkCufft(cufftMakePlan3d(planc2r, nx, ny, nz, CUFFT_C2R, &
worksize), 'cufftMakePlan3d c2r error')
call checkCufft(cufftXtMalloc(planr2c, u_desc, &
CUFFT_XT_FORMAT_INPLACE), 'cufftXtMalloc error')
! These are equivalent, and work as well
! istat = cufftXtMemcpy(planr2c, u_desc, u, CUFFT_COPY_HOST_TO_DEVICE)
! call cufft_memcpyH2D(u_desc, u, CUFFT_XT_FORMAT_INPLACE, .false.)
call cufft_memcpyH2D(u_desc, u, CUFFT_XT_FORMAT_INPLACE, .true.)
! now reset u to make sure the check later is valid
u = 0.0
!xxxxxxxxxxxxxxxxxxxxxxxxxx Forward
call checkCufft(cufftXtExecDescriptor(planr2c, u_desc, u_desc, &
CUFFT_FORWARD),'forward fft failed')
! in case we want to check the results after Forward
!call checkCufft(cufftXtMemcpy(planr2c, u_permuted, u_desc, CUFFT_COPY_DEVICE_TO_HOST), 'permuted D2H error')
!call checkNormComplex(local_permuted_cshape(1), local_permuted_cshape(2), local_permuted_cshape(3), u_permuted, max_norm)
!write(*,'(A18, I1, A14, F25.8)') "after R2C ", rank, " max_norm is ", max_norm
! Data is now distributed as Y-Slab. We need to scale the output
call c_f_pointer(u_desc%descriptor, u_descptr)
#ifdef _OPENACC
tmpcptr = transfer(u_descptr%data(1), tmpcptr)
call c_f_pointer(tmpcptr, u_dptr, local_permuted_cshape(1:3))
call scalingData(local_permuted_cshape(1), local_permuted_cshape(2), &
local_permuted_cshape(3), u_dptr, real(nx*ny*nz))
#else
call c_f_pointer(u_descptr%data(1), u_dptr, local_permuted_cshape(1:3))
!$cuf kernel do (3)
do k =1, local_permuted_cshape(3)
do j = 1, local_permuted_cshape(2)
do i = 1, local_permuted_cshape(1)
u_dptr(i,j,k) = u_dptr(i,j,k) / real(nx*ny*nz)
end do
end do
end do
call checkCuda(cudaDeviceSynchronize())
#endif
! in case we want to check again after scaling
call checkCufft(cufftXtMemcpy(planr2c, u_permuted, u_desc, &
CUFFT_COPY_DEVICE_TO_HOST), 'permuted D2H error')
call checkNormComplex(local_permuted_cshape(1), &
local_permuted_cshape(2), local_permuted_cshape(3), &
u_permuted, max_norm)
write(*,'(A18, I1, A14, F25.8)') "after scaling ", rank, &
" max_norm is ", max_norm
!xxxxxxxxxxxxxxxxxxxxxxxxxxxx inverse
call checkCufft(cufftXtExecDescriptor(planc2r, u_desc, u_desc, &
CUFFT_INVERSE), 'inverse fft failed')
! These are equivalent, and work as well
! istat = cufftXtMemcpy(planc2r, u, u_desc, CUFFT_COPY_DEVICE_TO_HOST)
! call cufft_memcpyD2H(u, u_desc, CUFFT_XT_FORMAT_INPLACE, .false.)
call cufft_memcpyD2H(u, u_desc, CUFFT_XT_FORMAT_INPLACE, .true.)
call checkCufft(cufftXtFree(u_desc))
call checkCufft(cufftDestroy(planr2c))
call checkCufft(cufftDestroy(planc2r))
call checkNormDiff(nz, local_rshape(1), local_rshape(2), &
local_rshape(3), u, ref, max_norm, max_diff)
write(*,'(A18,I1,A14,F25.8,A14,F15.8)') "after C2R ", rank, &
" max_norm is ", max_norm, " max_diff is ", max_diff
write(*,'(A25,I1,A14,F25.8)') "Relative Linf on rank ", rank, &
" is ", max_diff/max_norm
deallocate(u)
deallocate(ref)
deallocate(u_permuted)
call mpi_finalize(ierr)
if(max_diff / max_norm > 1e-5) then
print*, ">>>> FAILED on rank ", rank
stop 1
else
print*, ">>>> PASSED on rank ", rank
end if
contains
#ifdef _CUDA
subroutine checkCuda(istat, message)
implicit none
integer, intent(in) :: istat
character(len=*),intent(in), optional :: message
if (istat /= cudaSuccess) then
write(*,"('Error code: ',I0, ': ')") istat
write(*,*) cudaGetErrorString(istat)
if(present(message)) write(*,*) message
call mpi_finalize(ierr)
endif
end subroutine checkCuda
#endif
subroutine checkCufft(istat, message)
implicit none
integer, intent(in) :: istat
character(len=*),intent(in), optional :: message
if (istat /= CUFFT_SUCCESS) then
write(*,"('Error code: ',I0, ': ')") istat
!@cuf write(*,*) cudaGetErrorString(istat)
if(present(message)) write(*,*) message
call mpi_finalize(ierr)
endif
end subroutine checkCufft
subroutine generate_random(nz1, nz, ny, nx, data)
implicit none
integer, intent(in) :: nx, ny, nz, nz1
real, dimension(nz, ny, nx), intent(out) :: data
real :: rand(1)
integer :: i,j,k
!call random_seed(put=(/seed, seed+1/))
do k =1, nx
do j = 1, ny
do i = 1, nz1
call random_number(rand)
data(i,j,k) = rand(1)
end do
end do
end do
end subroutine generate_random
subroutine checkNorm(nz1, nz, ny, nx, data, max_norm)
implicit none
integer, intent(in) :: nx, ny, nz, nz1
real, dimension(nz, ny, nx), intent(in) :: data
real :: max_norm
integer :: i, j, k
max_norm = 0
do k =1, nx
do j = 1, ny
do i = 1, nz1
max_norm = max(max_norm, abs(data(i,j,k)))
end do
end do
end do
end subroutine checkNorm
subroutine checkNormComplex(nz, ny, nx, data, max_norm)
implicit none
integer, intent(in) :: nx, ny, nz
complex, dimension(nz, ny, nx), intent(in) :: data
real :: max_norm, max_diff
integer :: i,j,k
max_norm = 0
do k =1, nx
do j = 1, ny
do i = 1, nz
max_norm = max(max_norm, abs(data(i,j,k)%re))
max_norm = max(max_norm, abs(data(i,j,k)%im))
end do
end do
end do
end subroutine checkNormComplex
subroutine checkNormDiff(nz1, nz, ny, nx, data, ref, max_norm, max_diff)
implicit none
integer, intent(in) :: nx, ny, nz, nz1
real, dimension(nz, ny, nx), intent(in) :: data, ref
real :: max_norm, max_diff
integer :: i, j, k
max_norm = 0
max_diff = 0
do k =1, nx
do j = 1, ny
do i = 1, nz1
max_norm = max(max_norm, abs(data(i,j,k)))
max_diff = max(max_diff, abs(ref(i,j,k)-data(i,j,k)))
if (abs(ref(i,j,k)-data(i,j,k)) > 0.0001) then
write(*,'(A9,I3,I3,I3,A2,F18.8,A7,F18.8,A9,I2)') "diff ref[", &
i,j,k,"]",ref(i,j,k),"data ",data(i,j,k)," at rank",rank
end if
end do
end do
end do
end subroutine checkNormDiff
#ifdef _OPENACC
subroutine scalingData(nz, ny, nx, data, factor)
implicit none
integer, intent(in) :: nx, ny, nz
integer :: i, j, k
complex, dimension(nz, ny, nx) :: data
!$acc declare deviceptr(data)
real, intent(in) :: factor
!$acc parallel loop collapse(3)
do k =1, nx
do j = 1, ny
do i = 1, nz
data(i, j, k) = data(i, j, k) / factor
end do
end do
end do
end subroutine scalingData
#endif
subroutine cufft_memcpyH2D(ulibxt, u_h, data_format, ismemcpy)
implicit none
type(cudaLibXtDesc), pointer, intent(inout) :: ulibxt
real, dimension(*), intent(in) :: u_h
integer, intent(in) :: data_format
logical, intent(in) :: ismemcpy
type(cudaXtDesc), pointer :: uxt
!@cuf real, dimension(:,:,:), device, pointer :: u_d
if (.not. ismemcpy) then
call checkCufft(cufftXtMemcpy(planc2r, ulibxt, u_h, &
CUFFT_COPY_HOST_TO_DEVICE), "cufft_memcpyHToD Error")
else
call c_f_pointer(ulibxt%descriptor, uxt)
if(data_format == CUFFT_XT_FORMAT_INPLACE_SHUFFLED) then
#ifdef _OPENACC
call acc_memcpy_to_device(uxt%data(1), u_h, &
product(int(local_rshape_permuted,kind=8))*4_8) ! bytes
#else
call c_f_pointer(uxt%data(1), u_d, local_rshape_permuted)
call checkCuda(cudaMemcpy(u_d, u_h, &
product(int(local_rshape_permuted,kind=8))), &
"cudamemcpy H2D Error")
#endif
else if (data_format == CUFFT_XT_FORMAT_INPLACE) then
#ifdef _OPENACC
call acc_memcpy_to_device(uxt%data(1), u_h, &
product(int(local_rshape,kind=8))*4_8) ! bytes
#else
call c_f_pointer(uxt%data(1), u_d, local_rshape)
call checkCuda(cudaMemcpy(u_d, u_h, &
product(int(local_rshape,kind=8))), &
"cudamemcpy H2D Error")
#endif
endif
endif
end subroutine cufft_memcpyH2D
subroutine cufft_memcpyD2H(u_h, ulibxt, data_format,ismemcpy)
implicit none
type(cudaLibXtDesc), pointer, intent(in) :: ulibxt
real, dimension(*), intent(out) :: u_h
integer, intent(in) :: data_format
logical, intent(in) :: ismemcpy
type(cudaXtDesc), pointer :: uxt
!@cuf real, dimension(:,:,:), device, pointer :: u_d
if (.not. ismemcpy) then
call checkCufft(cufftXtMemcpy(planr2c, u_h, ulibxt, &
CUFFT_COPY_DEVICE_TO_HOST), "cufft_memcpyDToH Error")
else
call c_f_pointer(ulibxt%descriptor, uxt)
if(data_format == CUFFT_XT_FORMAT_INPLACE_SHUFFLED) then
#ifdef _OPENACC
call acc_memcpy_from_device(u_h, uxt%data(1), &
product(int(local_rshape_permuted,kind=8))*4_8) ! bytes
#else
call c_f_pointer(uxt%data(1), u_d, local_rshape_permuted)
call checkCuda(cudaMemcpy(u_h, u_d, &
product(int(local_rshape_permuted,kind=8))), &
"cudamemcpy D2H Error")
#endif
else if (data_format == CUFFT_XT_FORMAT_INPLACE) then
#ifdef _OPENACC
call acc_memcpy_from_device(u_h, uxt%data(1), &
product(int(local_rshape,kind=8))*4_8) ! bytes
#else
call c_f_pointer(uxt%data(1), u_d, local_rshape)
call checkCuda(cudaMemcpy(u_h, u_d, &
product(int(local_rshape,kind=8))), &
"cudamemcpy D2H Error")
#endif
endif
endif
end subroutine cufft_memcpyD2H
end program cufftmp_r2c
12.7. Using cuRAND from OpenACC Host Code
This example demonstrates the use of the curand module, the curandGenerator type, and several forms of cuRAND generate calls.
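The three subroutines below differ only in the type of the array being filled; curandGenerate is a generic that accepts integer, real(4), and real(8) arrays. The following short sketch (host-generator path only, error codes ignored, program name purely illustrative) shows that pattern in one place:
program curand_generic_sketch
use curand
implicit none
integer, parameter :: n = 16
integer :: ia(n)
real(4) :: ra(n)
real(8) :: da(n)
type(curandGenerator) :: g
integer :: istat
istat = curandCreateGeneratorHost(g, CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, ia, n)   ! random 32-bit integers
istat = curandGenerate(g, ra, n)   ! uniform single precision values
istat = curandGenerate(g, da, n)   ! uniform double precision values
istat = curandDestroyGenerator(g)
print *, ia(1), ra(1), da(1)
end program curand_generic_sketch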
Simple cuRAND Tests
program testcurand
! compile with the flags -ta=tesla -cuda -cudalib=curand
call cur1(1000, .true.); call cur1(1000, .false.)
call cur2(1000, .true.); call cur2(1000, .false.)
call cur3(1000, .true.); call cur3(1000, .false.)
end
!
subroutine cur1(n, onhost)
use curand
integer :: a(n)
type(curandGenerator) :: g
integer(8) nbits
logical onhost, passing
a = 0
passing = .true.
if (onhost) then
istat = curandCreateGeneratorHost(g,CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, a, n)
istat = curandDestroyGenerator(g)
else
!$acc data copy(a)
istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
!$acc host_data use_device(a)
istat = curandGenerate(g, a, n)
!$acc end host_data
istat = curandDestroyGenerator(g)
!$acc end data
endif
nbits = 0
do i = 1, n
if (i.lt.10) print *,i,a(i)
nbits = nbits + popcnt(a(i))
end do
print *,"Should be roughly half the bits set"
nbits = nbits / n
if ((nbits .lt. 12) .or. (nbits .gt. 20)) then
passing = .false.
else
print *,"nbits is ",nbits," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
!
subroutine cur2(n, onhost)
use curand
real :: a(n)
type(curandGenerator) :: g
logical onhost, passing
a = 0.0
passing = .true.
if (onhost) then
istat = curandCreateGeneratorHost(g,CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, a, n)
istat = curandDestroyGenerator(g)
else
!$acc data copy(a)
istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
!$acc host_data use_device(a)
istat = curandGenerate(g, a, n)
!$acc end host_data
istat = curandDestroyGenerator(g)
!$acc end data
endif
print *,"Should be uniform around 0.5"
do i = 1, n
if (i.lt.10) print *,i,a(i)
if ((a(i).lt.0.0) .or. (a(i).gt.1.0)) passing = .false.
end do
rmean = sum(a)/n
if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
!
subroutine cur3(n, onhost)
use curand
real(8) :: a(n)
type(curandGenerator) :: g
logical onhost, passing
a = 0.0d0
passing = .true.
if (onhost) then
istat = curandCreateGeneratorHost(g,CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, a, n)
istat = curandDestroyGenerator(g)
else
!$acc data copy(a)
istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
!$acc host_data use_device(a)
istat = curandGenerate(g, a, n)
!$acc end host_data
istat = curandDestroyGenerator(g)
!$acc end data
endif
do i = 1, n
if (i.lt.10) print *,i,a(i)
if ((a(i).lt.0.0d0) .or. (a(i).gt.1.0d0)) passing = .false.
end do
rmean = sum(a)/n
if ((rmean .lt. 0.4d0) .or. (rmean .gt. 0.6d0)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
12.8. Using cuRAND from OpenACC Device Code
This example demonstrates the use of the openacc_curand module to generate random numbers from within an OpenACC compute region.
Simple cuRAND Test from OpenACC Device Code
module mtests
integer, parameter :: n = 1000
contains
subroutine testrand( a, b )
use openacc_curand
real :: a(n), b(n)
type(curandStateXORWOW) :: h
integer(8) :: seed, seq, offset
!$acc parallel num_gangs(1) vector_length(1) copy(a,b) private(h)
seed = 12345
seq = 0
offset = 0
call curand_init(seed, seq, offset, h)
!$acc loop seq
do i = 1, n
a(i) = curand_uniform(h)
b(i) = curand_normal(h)
end do
!$acc end parallel
return
end subroutine
end module mtests
program t
use mtests
real :: a(n), b(n), c(n)
logical passing
a = 1.0
b = 2.0
passing = .true.
call testrand(a,b)
c = a
print *,"Should be uniform around 0.5"
do i = 1, n
if (i.lt.10) print *,i,c(i)
if ((c(i).lt.0.0) .or. (c(i).gt.1.0)) passing = .false.
end do
rmean = sum(c)/n
if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
c = b
print *,"Should be normal around 0.0"
nc1 = 0;
nc2 = 0;
do i = 1, n
if (i.lt.10) print *,i,c(i)
if ((c(i) .gt. -4.0) .and. (c(i) .lt. 0.0)) nc1 = nc1 + 1
if ((c(i) .gt. 0.0) .and. (c(i) .lt. 4.0)) nc2 = nc2 + 1
end do
print *,"Found on each side of zero ",nc1,nc2
if (abs(nc1-nc2) .gt. (n/10)) passing = .false.
rmean = sum(c,mask=abs(c).lt.4.0)/n
if ((rmean .lt. -0.1) .or. (rmean .gt. 0.1)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end program
12.9. Using cuRAND from CUDA Fortran Host Code
This example demonstrates the use of the curand module, the curandGenerator type, and several forms of cuRAND calls.
Simple cuRAND Test
program testcurand1
call testr1(1000)
call testr2(1000)
call testr3(1000)
end
!
subroutine testr1(n)
use cudafor
use curand
integer, managed :: a(n)
type(curandGenerator) :: g
integer(8) nbits
logical passing
a = 0
passing = .true.
istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, a, n)
istat = cudaDeviceSynchronize()
istat = curandDestroyGenerator(g)
nbits = 0
do i = 1, n
if (i.lt.10) print *,i,a(i)
nbits = nbits + popcnt(a(i))
end do
print *,"Should be roughly half the bits set"
nbits = nbits / n
if ((nbits .lt. 12) .or. (nbits .gt. 20)) then
passing = .false.
else
print *,"nbits is ",nbits," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
!
subroutine testr2(n)
use cudafor
use curand
real, managed :: a(n)
type(curandGenerator) :: g
logical passing
a = 0.0
passing = .true.
istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, a, n)
istat = cudaDeviceSynchronize()
istat = curandDestroyGenerator(g)
print *,"Should be uniform around 0.5"
do i = 1, n
if (i.lt.10) print *,i,a(i)
if ((a(i).lt.0.0) .or. (a(i).gt.1.0)) passing = .false.
end do
rmean = sum(a)/n
if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
!
subroutine testr3(n)
use cudafor
use curand
real(8), managed :: a(n)
type(curandGenerator) :: g
logical passing
a = 0.0d0
passing = .true.
istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
istat = curandGenerate(g, a, n)
istat = cudaDeviceSynchronize()
istat = curandDestroyGenerator(g)
do i = 1, n
if (i.lt.10) print *,i,a(i)
if ((a(i).lt.0.0d0) .or. (a(i).gt.1.0d0)) passing = .false.
end do
rmean = sum(a)/n
if ((rmean .lt. 0.4d0) .or. (rmean .gt. 0.6d0)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
12.10. Using cuRAND from CUDA Fortran Device Code
This example demonstrates the use of the curand_device module from a CUDA Fortran global subroutine.
Simple cuRAND Test from Device Code
module mtests
use curand_device
integer, parameter :: n = 10000
contains
attributes(global) subroutine testr( a, b )
real, device :: a(n), b(n)
type(curandStateXORWOW) :: h
integer(8) :: seed, seq, offset
integer :: iam
iam = threadIdx%x
seed = iam*64 + 12345
seq = 0
offset = 0
call curand_init(seed, seq, offset, h)
do i = iam, n, blockdim%x
a(i) = curand_uniform(h)
b(i) = curand_normal(h)
end do
return
end subroutine
end module mtests
program t
use mtests
real, allocatable, device :: a(:), b(:)
real c(n), rmean
logical passing
allocate(a(n))
allocate(b(n))
a = 0.0
b = 0.0
passing = .true.
call testr<<<1,32>>> (a,b)
c = a
print *,"Should be uniform around 0.5"
do i = 1, n
if (i.lt.10) print *,i,c(i)
if ((c(i).lt.0.0) .or. (c(i).gt.1.0)) passing = .false.
end do
rmean = sum(c)/n
if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
c = b
print *,"Should be normal around 0.0"
nc1 = 0;
nc2 = 0;
do i = 1, n
if (i.lt.10) print *,i,c(i)
if ((c(i) .gt. -4.0) .and. (c(i) .lt. 0.0)) nc1 = nc1 + 1
if ((c(i) .gt. 0.0) .and. (c(i) .lt. 4.0)) nc2 = nc2 + 1
end do
print *,"Found on each side of zero ",nc1,nc2
if (abs(nc1-nc2) .gt. (n/10)) passing = .false.
rmean = sum(c,mask=abs(c).lt.4.0)/n
if ((rmean .lt. -0.1) .or. (rmean .gt. 0.1)) then
passing = .false.
else
print *,"Mean is ",rmean," which passes"
endif
if (passing) then
print *,"Test PASSED"
else
print *,"Test FAILED"
endif
end
12.11. Using cuSPARSE from OpenACC Host Code
This example demonstrates the use of the cuSPARSE module, the cusparseHandle type, and several calls to the cuSPARSE library.
Simple cuSPARSE Test
program sparseMatVec
integer n
n = 25 ! # rows/cols in dense matrix
call sparseMatVecSub1(n)
n = 45 ! # rows/cols in dense matrix
call sparseMatVecSub1(n)
end program
subroutine sparseMatVecSub1(n)
use openacc
use cusparse
implicit none
integer n
! dense data
real(4), allocatable :: Ade(:,:), x(:), y(:)
! sparse CSR arrays
real(4), allocatable :: csrValA(:)
integer, allocatable :: nnzPerRowA(:), csrRowPtrA(:), csrColIndA(:)
allocate(Ade(n,n), x(n), y(n))
allocate(csrValA(n))
allocate(nnzPerRowA(n), csrRowPtrA(n+1), csrColIndA(n))
call sparseMatVecSub2(Ade, x, y, csrValA, nnzPerRowA, csrRowPtrA, &
csrColIndA, n)
deallocate(Ade)
deallocate(x)
deallocate(y)
deallocate(csrValA)
deallocate(nnzPerRowA)
deallocate(csrRowPtrA)
deallocate(csrColIndA)
end subroutine
subroutine sparseMatVecSub2(Ade, x, y, csrValA, nnzPerRowA, csrRowPtrA, &
csrColIndA, n)
use openacc
use cusparse
implicit none
integer :: n, nnz, status, i
! dense data
real(4) :: Ade(n,n), x(n), y(n)
! sparse CSR arrays
real(4) :: csrValA(n)
integer :: nnzPerRowA(n), csrRowPtrA(n+1), csrColIndA(n)
type(cusparseHandle) :: h
type(cusparseMatDescr) :: descrA
! parameters
real(4) :: alpha, beta
! result
real(4) :: xerr
! initialize CUSPARSE and matrix descriptor
status = cusparseCreate(h)
if (status /= CUSPARSE_STATUS_SUCCESS) &
write(*,*) 'cusparseCreate error: ', status
status = cusparseCreateMatDescr(descrA)
status = cusparseSetMatType(descrA, &
CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrA, &
CUSPARSE_INDEX_BASE_ONE)
status = cusparseSetStream(h, acc_get_cuda_stream(acc_async_sync))
!$acc data create(Ade, x, y, csrValA, nnzPerRowA, csrRowPtrA, csrColIndA)
! Initialize matrix (upper circular shift matrix)
!$acc kernels
Ade = 0.0
do i = 1, n-1
Ade(i,i+1) = 1.0
end do
Ade(n,1) = 1.0
! Initialize vectors and constants
do i = 1, n
x(i) = i
enddo
y = 0.0
!$acc end kernels
!$acc update host(x)
write(*,*) 'Original vector:'
write(*,'(5(1x,f7.2))') x
! convert matrix from dense to csr format
!$acc host_data use_device(Ade, nnzPerRowA, csrValA, csrRowPtrA, csrColIndA)
status = cusparseSnnz_v2(h, CUSPARSE_DIRECTION_ROW, &
n, n, descrA, Ade, n, nnzPerRowA, nnz)
status = cusparseSdense2csr(h, n, n, descrA, Ade, n, &
nnzPerRowA, csrValA, csrRowPtrA, csrColIndA)
!$acc end host_data
! A is upper circular shift matrix
! y = alpha*A*x + beta*y
alpha = 1.0
beta = 0.0
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, x, y)
status = cusparseScsrmv(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, n, alpha, descrA, csrValA, csrRowPtrA, &
csrColIndA, x, beta, y)
!$acc end host_data
!$acc wait
write(*,*) 'Shifted vector:'
write(*,'(5(1x,f7.2))') y
! shift-down y and add original x
! A' is lower circular shift matrix
! x = alpha*A'*y + beta*x
beta = -1.0
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, x, y)
status = cusparseScsrmv(h, CUSPARSE_OPERATION_TRANSPOSE, &
n, n, n, alpha, descrA, csrValA, csrRowPtrA, &
csrColIndA, y, beta, x)
!$acc end host_data
!$acc kernels
xerr = maxval(abs(x))
!$acc end kernels
!$acc end data
write(*,*) 'Max error = ', xerr
if (xerr.le.1.e-5) then
write(*,*) 'Test PASSED'
else
write(*,*) 'Test FAILED'
endif
end subroutine
12.12. Using cuSPARSE from CUDA Fortran Host Code
This example demonstrates the use of the cuSPARSE module, the cusparseHandle type, and several forms of cuSPARSE calls.
Simple cuSPARSE Test
program sparseMatVec
use cudafor
use cusparse
implicit none
integer, parameter :: n = 25 ! # rows/cols in dense matrix
type(cusparseHandle) :: h
type(cusparseMatDescr) :: descrA
! dense data
real(4), managed :: Ade(n,n), x(n), y(n)
! sparse CSR arrays
real(4), managed :: csrValA(n)
integer, managed :: nnzPerRowA(n), &
csrRowPtrA(n+1), csrColIndA(n)
integer :: nnz, status, i
! parameters
real(4) :: alpha, beta
! initialize CUSPARSE and matrix descriptor
status = cusparseCreate(h)
if (status /= CUSPARSE_STATUS_SUCCESS) &
write(*,*) 'cusparseCreate error: ', status
status = cusparseCreateMatDescr(descrA)
status = cusparseSetMatType(descrA, &
CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrA, &
CUSPARSE_INDEX_BASE_ONE)
! Initialize matrix (upper circular shift matrix)
Ade = 0.0
do i = 1, n-1
Ade(i,i+1) = 1.0
end do
Ade(n,1) = 1.0
! Initialize vectors and constants
x = [(i,i=1,n)]
y = 0.0
write(*,*) 'Original vector:'
write(*,'(5(1x,f7.2))') x
! convert matrix from dense to csr format
status = cusparseSnnz_v2(h, CUSPARSE_DIRECTION_ROW, &
n, n, descrA, Ade, n, nnzPerRowA, nnz)
status = cusparseSdense2csr(h, n, n, descrA, Ade, n, &
nnzPerRowA, csrValA, csrRowPtrA, csrColIndA)
! A is upper circular shift matrix
! y = alpha*A*x + beta*y
alpha = 1.0
beta = 0.0
status = cusparseScsrmv(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, n, alpha, descrA, csrValA, csrRowPtrA, &
csrColIndA, x, beta, y)
! shift-down y and add original x
! A' is lower circular shift matrix
! x = alpha*A'*y + beta*x
beta = -1.0
status = cusparseScsrmv(h, CUSPARSE_OPERATION_TRANSPOSE, &
n, n, n, alpha, descrA, csrValA, csrRowPtrA, &
csrColIndA, y, beta, x)
status = cudaDeviceSynchronize()
write(*,*) 'Shifted vector:'
write(*,'(5(1x,f7.2))') y
write(*,*) 'Max error = ', maxval(abs(x))
if (maxval(abs(x)).le.1.e-5) then
write(*,*) 'Test PASSED'
else
write(*,*) 'Test FAILED'
endif
end program sparseMatVec
12.13. Using cuTENSOR from CUDA Fortran Host Code
This example demonstrates the use of the low-level cuTENSOR module, version 2, from CUDA Fortran to perform a reshape permutation and a sum reduction across one dimension.
This example can be compiled and linked as a normal CUDA Fortran subprogram by adding the “-cudalib=cutensor” option to the link line.
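The operation the cuTENSOR calls perform is the permuted reshape plus sum reduction stated in the listing's comments. The small reference program below expresses the same operation with host intrinsics only (shapes taken from the example; it makes no cuTENSOR calls) and may help when reading the descriptor setup that follows:
program reduction_reference
implicit none
real, allocatable :: hA(:,:,:), hC(:,:)
allocate(hA(100,60,80))
call random_number(hA)
! permute (100,60,80) -> (80,60,100), then sum across the middle dimension
hC = sum(reshape(hA, shape=[80,60,100], order=[3,2,1]), dim=2)
print *, 'result shape: ', shape(hC)   ! 80 x 100
end program reduction_reference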
Simple cuTENSOR Test from CUDA Fortran
program testcutensorcuf1
use cudafor
use cutensor
integer, parameter :: ndim = 3
real, managed, allocatable :: dA(:,:,:), dC(:,:)
real, allocatable :: hA(:,:,:), hC(:,:)
real, device, allocatable :: workbuf(:)
real :: alpha, beta
integer(4) :: numModesA, numModesC
integer(8), dimension(ndim) :: extA, strA, extC, strC
integer(4), dimension(ndim) :: modeA, modeC
integer(8) :: workbufsize
type(cutensorStatus) :: cstat
type(cutensorHandle) :: handle
type(cutensorTensorDescriptor) :: descA, descC
type(cutensorOperationDescriptor) :: rdesc
type(cutensorComputeDescriptor) :: descCompute
type(cutensorPlan) :: plan
type(cutensorAlgo) :: algo
type(cutensorPlanPreference) :: pref
! Init
cstat = cutensorCreate(handle)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
descCompute = CUTENSOR_COMPUTE_DESC_32F
allocate(dA(100,60,80))
! This is the operation we are going to perform
! dC = sum(reshape(dA,shape=[80,60,100],order=[3,2,1]),dim=2)
!
call random_number(dA); dA = real(int(dA * 10.0))
extA = shape(dA)
strA(1) = 1; strA(2) = 100; strA(3) = 6000
modeA(1) = 3; modeA(2) = 2; modeA(3) = 1
numModesA = ndim
ialign = 256
print *,"Desc A"
cstat = cutensorCreateTensorDescriptor(handle, descA, numModesA, extA, strA, &
CUTENSOR_R_32F, ialign)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
allocate(dC(80,100))
extC(1:ndim-1) = shape(dC)
strC(1) = 1; strC(2) = 80
numModesC = ndim-1
dC = 0.0
modeC(1) = 1; modeC(2) = 3 ! Skip mode 2 for reduction across that dim
print *,"Desc C"
cstat = cutensorCreateTensorDescriptor(handle, descC, numModesC, extC, strC, &
CUTENSOR_R_32F, ialign)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
print *,"CreateRed "
cstat = cutensorCreateReduction(handle, rdesc, &
descA, modeA, CUTENSOR_OP_IDENTITY, &
descC, modeC, CUTENSOR_OP_IDENTITY, &
descC, modeC, CUTENSOR_OP_ADD, descCompute)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
print *,"CreatePlanPref "
cstat = cutensorCreatePlanPreference(handle, pref, CUTENSOR_ALGO_DEFAULT, &
CUTENSOR_JIT_MODE_NONE)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
print *,"EstimateWork "
cstat = cutensorEstimateWorkspaceSize(handle, rdesc, pref, &
CUTENSOR_WORKSPACE_DEFAULT, workbufsize)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
print *,"Estimated workspace size: ",workbufsize
print *,"CreatePlan "
cstat = cutensorCreatePlan(handle, plan, rdesc, pref, workbufsize)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
allocate(workbuf((workbufsize+3)/4))
alpha = 1.0; beta = 0.0
print *,"Reduce "
cstat = cutensorReduce(handle, plan, alpha, dA, beta, dC, &
dC, workbuf, workbufsize, 0)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
hA = dA
hC = sum(reshape(hA,[80,60,100],order=[3,2,1]), dim=2)
istat = cudaDeviceSynchronize() ! Managed memory, to be sure
if (all(hC.eq.dC)) then
print *,"test PASSED"
else
print *,"test FAILED"
end if
cstat = cutensorDestroy(handle)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
cstat = cutensorDestroyPlan(plan)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
cstat = cutensorDestroyPlanPreference(pref)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
cstat = cutensorDestroyOperationDescriptor(rdesc)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
cstat = cutensorDestroyTensorDescriptor(descA)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
cstat = cutensorDestroyTensorDescriptor(descC)
if (cstat.ne.CUTENSOR_STATUS_SUCCESS) print *,cutensorGetErrorString(cstat)
deallocate(workbuf)
deallocate(dA, dC)
deallocate(hA, hC)
end program
12.14. Using cuTENSOREX from CUDA Fortran Host Code
This example demonstrates the use of the higher-level cuTENSOREX module from CUDA Fortran to perform a large matrix multiplication using multiple OpenMP threads.
This example can be compiled and linked as a normal CUDA Fortran subprogram by adding the “-mp -cudalib=cutensor” options to the compile and link line.
Multi-threaded cuTENSOREX Example from CUDA Fortran
! Test cuTensor + cuda Fortran + OMP multi-stream matmul
program testCuCufMsMatmul
use cudafor
use cutensorex
integer, parameter :: m=8192, k=1280, n=1024
integer, parameter :: mblksize = 128
integer, parameter :: mslices = m / mblksize
integer, parameter :: nstreams = 4
integer, parameter :: numtimes = mslices / nstreams
real(8), allocatable, dimension(:,:), device :: a_d, d_d
real(8), allocatable, dimension(:,:,:), pinned :: ha
real(8), dimension(k,n), device :: b_d
real(8), allocatable, dimension(:,:), pinned :: d
real(8) :: alpha
integer(kind=cuda_stream_kind) :: mystream
!$OMP THREADPRIVATE(a_d, d_d, mystream)
allocate( ha(k,mblksize,nstreams))
allocate( d(1:m,1:n))
b_d = 1.0d0
alpha = 1.0d0
!$OMP PARALLEL NUM_THREADS(nstreams) PRIVATE(istat)
istat = cudaStreamCreate(mystream)
istat = cudaforSetDefaultStream(mystream)
istat = cutensorExSetStream(mystream)
! At this point, all new allocations will pick up default stream
allocate(a_d(k,mblksize))
allocate(d_d(mblksize,n))
!$OMP END PARALLEL
! Test matmul
!$OMP PARALLEL DO NUM_THREADS(nstreams) PRIVATE(jlcl,jgbl,jend)
do ns = 1, nstreams
do nt = 1, numtimes
jgbl = 1 + ((ns-1) + (nt-1)*nstreams)*mblksize
jend = jgbl + mblksize - 1
! Do some host work
do jlcl = 1, mblksize
ha(:,jlcl,ns) = dble(jlcl+jgbl-1)
end do
! Move data to the device on default stream
a_d = ha(:,:,ns)
! Matrix multiply on my thread cutensorEx stream
d_d = alpha * matmul(transpose(a_d),b_d)
! Move result back to host on default stream
d(jgbl:jend,:) = d_d
end do
end do
! Wait for all threads to finish GPU work
istat = cudaDeviceSynchronize()
nfailed = 0
do j = 1, n
do i = 1, m
if (d(i,j) .ne. i*k) then
if (nfailed .lt. 100) print *,i,j,d(i,j)
nfailed = nfailed + 1
end if
end do
end do
if (nfailed .eq. 0) then
print *,"test PASSED"
else
print *,"test FAILED"
endif
end program
12.15. Using cuTENSOR from OpenACC Host Code
This example demonstrates the use of the cuTENSOREX module, calling Matmul() using OpenACC device data, and setting the cuTENSOR library stream to be consistent with the OpenACC default stream.
This example can be compiled and run with or without OpenACC. To compile with OpenACC, the options are “-ta=tesla -cuda -cudalib=cutensor”. To run on the CPU, leave off those options.
Simple cuTENSOREX Test from OpenACC
program testcutensorOpenACC
!@acc use openacc
!@acc use cutensorex
integer, parameter :: ni=1280, nj=1024, nk=960, ntimes=1
real(8) :: a(ni,nk), b(nk,nj), c(ni,nj), d(ni,nj)
call random_number(a)
call random_number(b)
a = dble(int(4.0d0*a - 2.0d0))
b = dble(int(8.0d0*b - 4.0d0))
c = 2.0; d = 0.0
!$acc enter data copyin(a,b,c) create(d)
!@acc istat = cutensorExSetStream(acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(a,b,c,d)
do nt = 1, ntimes
d = c + matmul(a,b)
end do
!$acc end host_data
!$acc update host(d)
print *,sum(d)
do nt = 1, ntimes
!$acc kernels
do j = 1, nj
do i = 1, ni
d(i,j) = c(i,j)
do k = 1, nk
d(i,j) = d(i,j) + a(i,k) * b(k,j)
end do
end do
end do
!$acc end kernels
end do
!$acc exit data copyout(d)
print *,sum(d)
end program
Notices
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, the NVIDIA logo, CUDA, CUDA-X, GPUDirect, HPC SDK, NGC, NVIDIA Volta, NVIDIA DGX, NVIDIA Nsight, NVLink, NVSwitch, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.