CUDA Compiler Driver NVCC
Release 12.0
NVIDIA
1 Overview
1.1 CUDA Programming Model
1.2 CUDA Sources
1.3 Purpose of NVCC
3 Compilation Phases
3.1 NVCC Identification Macro
3.2 NVCC Phases
3.3 Supported Input File Suffixes
3.4 Supported Phases
--device-c (-dc)
--device-w (-dw)
--cuda (-cuda)
--compile (-c)
--fatbin (-fatbin)
--cubin (-cubin)
--ptx (-ptx)
--preprocess (-E)
--generate-dependencies (-M)
--generate-nonsystem-dependencies (-MM)
--generate-dependencies-with-compile (-MD)
--generate-nonsystem-dependencies-with-compile (-MMD)
--optix-ir (-optix-ir)
--run (-run)
5.2.3 Options for Specifying Behavior of Compiler/Linker
--profile (-pg)
--debug (-g)
--device-debug (-G)
--extensible-whole-program (-ewp)
--no-compress (-no-compress)
--generate-line-info (-lineinfo)
--optimization-info kind,... (-opt-info)
--optimize level (-O)
--dopt kind (-dopt)
--dlink-time-opt (-dlto)
--ftemplate-backtrace-limit limit (-ftemplate-backtrace-limit)
--ftemplate-depth limit (-ftemplate-depth)
--no-exceptions (-noeh)
--shared (-shared)
--x {c|c++|cu} (-x)
--std {c++03|c++11|c++14|c++17|c++20} (-std)
--no-host-device-initializer-list (-nohdinitlist)
--expt-relaxed-constexpr (-expt-relaxed-constexpr)
--extended-lambda (-extended-lambda)
--expt-extended-lambda (-expt-extended-lambda)
--machine {64} (-m)
--m64 (-m64)
--host-linker-script {use-lcs|gen-lcs} (-hls)
--augment-host-linker-script (-aug-hls)
--host-relocatable-link (-r)
5.2.4 Options for Passing Specific Phase Options
--compiler-options options,... (-Xcompiler)
--linker-options options,... (-Xlinker)
--archive-options options,... (-Xarchive)
--ptxas-options options,... (-Xptxas)
--nvlink-options options,... (-Xnvlink)
5.2.5 Options for Guiding the Compiler Driver
--forward-unknown-to-host-compiler (-forward-unknown-to-host-compiler)
--forward-unknown-to-host-linker (-forward-unknown-to-host-linker)
--dont-use-profile (-noprof)
--threads number (-t)
--dryrun (-dryrun)
--verbose (-v)
--keep (-keep)
--keep-dir directory (-keep-dir)
--save-temps (-save-temps)
--clean-targets (-clean)
--run-args arguments,... (-run-args)
--use-local-env (-use-local-env)
--input-drive-prefix prefix (-idp)
--dependency-drive-prefix prefix (-ddp)
--drive-prefix prefix (-dp)
--dependency-target-name target (-MT)
--no-align-double
--no-device-link (-nodlink)
--allow-unsupported-compiler (-allow-unsupported-compiler)
5.2.6 Options for Steering CUDA Compilation
--default-stream {legacy|null|per-thread} (-default-stream)
5.2.7 Options for Steering GPU Code Generation
--gpu-architecture {arch|native|all|all-major} (-arch)
--gpu-code code,... (-code)
--generate-code specification (-gencode)
--relocatable-device-code {true|false} (-rdc)
--entries entry,... (-e)
--maxrregcount amount (-maxrregcount)
--use_fast_math (-use_fast_math)
--ftz {true|false} (-ftz)
--prec-div {true|false} (-prec-div)
--prec-sqrt {true|false} (-prec-sqrt)
--fmad {true|false} (-fmad)
--extra-device-vectorization (-extra-device-vectorization)
--compile-as-tools-patch (-astoolspatch)
--keep-device-functions (-keep-device-functions)
5.2.8 Generic Tool Options
--disable-warnings (-w)
--source-in-ptx (-src-in-ptx)
--restrict (-restrict)
--Wno-deprecated-gpu-targets (-Wno-deprecated-gpu-targets)
--Wno-deprecated-declarations (-Wno-deprecated-declarations)
--Wreorder (-Wreorder)
--Wdefault-stream-launch (-Wdefault-stream-launch)
--Wmissing-launch-bounds (-Wmissing-launch-bounds)
--Wext-lambda-captures-this (-Wext-lambda-captures-this)
--Werror kind,... (-Werror)
--display-error-number (-err-no)
--no-display-error-number (-no-err-no)
--diag-error errNum,... (-diag-error)
--diag-suppress errNum,... (-diag-suppress)
--diag-warn errNum,... (-diag-warn)
--resource-usage (-res-usage)
--help (-h)
--version (-V)
--options-file file,... (-optf)
--time filename (-time)
--qpp-config config (-qpp-config)
--list-gpu-code (-code-ls)
--list-gpu-arch (-arch-ls)
5.2.9 Phase Options
Ptxas Options
--allow-expensive-optimizations (-allow-expensive-optimizations)
--compile-only (-c)
--def-load-cache (-dlcm)
--def-store-cache (-dscm)
--device-debug (-g)
--disable-optimizer-constants (-disable-optimizer-consts)
--entry entry,... (-e)
--fmad (-fmad)
--force-load-cache (-flcm)
--force-store-cache (-fscm)
--generate-line-info (-lineinfo)
--gpu-name gpuname (-arch)
--help (-h)
--machine (-m)
--maxrregcount amount (-maxrregcount)
--opt-level N (-O)
--options-file file,... (-optf)
--position-independent-code (-pic)
--preserve-relocs (-preserve-relocs)
--sp-bound-check (-sp-bound-check)
--verbose (-v)
--version (-V)
--warning-as-error (-Werror)
--warn-on-double-precision-use (-warn-double-usage)
--warn-on-local-memory-usage (-warn-lmem-usage)
--warn-on-spills (-warn-spills)
--compile-as-tools-patch (-astoolspatch)
NVLINK Options
--disable-warnings (-w)
--preserve-relocs (-preserve-relocs)
--verbose (-v)
--warning-as-error (-Werror)
--suppress-arch-warning (-suppress-arch-warning)
--suppress-stack-size-warning (-suppress-stack-size-warning)
--dump-callgraph (-dump-callgraph)
5.3 NVCC Environment Variables
6 GPU Compilation
6.1 GPU Generations
6.2 GPU Feature List
6.3 Application Compatibility
6.4 Virtual Architectures
6.5 Virtual Architecture Feature List
6.6 Further Mechanisms
6.6.1 Just-in-Time Compilation
6.6.2 Fatbinaries
6.7 NVCC Examples
6.7.1 Base Notation
6.7.2 Shorthand
Shorthand 1
Shorthand 2
Shorthand 3
6.7.3 Extended Notation
6.7.4 Virtual Architecture Macros
9 Notices
9.1 Notice
9.2 OpenCL
9.3 Trademarks
9.4 Copyright
Chapter 1. Overview
Chapter 2. Supported Host Compilers
A general purpose C++ host compiler is needed by nvcc in the following situations:
▶ During non-CUDA phases (except the run phase), because these phases will be forwarded by
nvcc to this compiler.
▶ During CUDA phases, for several preprocessing stages and host code compilation (see also The
CUDA Compilation Trajectory).
nvcc assumes that the host compiler is installed with the standard method designed by the compiler
provider. If the host compiler installation is non-standard, the user must make sure that the environ-
ment is set appropriately and use relevant nvcc compile options.
The following documents provide detailed information about supported host compilers:
▶ NVIDIA CUDA Installation Guide for Linux
▶ NVIDIA CUDA Installation Guide for Microsoft Windows
On all platforms, the default host compiler executable (gcc and g++ on Linux and cl.exe on Windows)
found in the current execution search path will be used, unless specified otherwise with appropriate
options (see File and Path Specifications).
to display the compilation steps that it executes; these are for debugging purposes only and must not
be copied into build scripts.
nvcc phases are selected by a combination of command line options and input file name suffixes, and
the execution of these phases may be modified by other command line options. In phase selection, the
input file suffix defines the phase input, while the command line option defines the required output
of the phase.
The following paragraphs list the recognized file name suffixes and the supported compilation phases.
A full explanation of the nvcc command line options can be found in NVCC Command Options.
Note that nvcc does not make any distinction between object, library or resource files. It just passes
files of these types to the linker when the linking phase is executed.
Notes:
▶ The last phase in this list is more of a convenience phase. It allows running the compiled and
linked executable without having to explicitly set the library path to the CUDA dynamic libraries.
▶ Unless a phase option is specified, nvcc will compile and link all its input files.
CUDA compilation works as follows: the input program is preprocessed for device compilation and is
compiled to CUDA binary (cubin) and/or PTX intermediate code, which are placed in a fatbinary. The
input program is preprocessed once again for host compilation and is synthesized to embed the
fatbinary and transform CUDA-specific C++ extensions into standard C++ constructs.
Then the C++ host compiler compiles the synthesized host code with the embedded fatbinary into a
host object. The exact steps that are followed to achieve this are displayed in Figure 1.
The embedded fatbinary is inspected by the CUDA runtime system whenever the device code is
launched by the host program to obtain an appropriate fatbinary image for the current GPU.
CUDA programs are compiled in whole program compilation mode by default, i.e., the device code
cannot reference an entity from a separate file. In whole program compilation mode, device link
steps have no effect. For more information on separate compilation and whole program compilation,
see Using Separate Compilation in CUDA.
Long option names are used throughout the document, unless specified otherwise; however, short
names can be used instead of long names to have the same effect.
--objdir-as-tempdir (-objtemp)
Create all intermediate files in the same directory as the object file. These intermediate files are deleted
when the compilation is finished. This option will take effect only if -c, -dc or -dw is also used. Using this
option will ensure that the intermediate file name that is embedded in the object file will not change
in multiple compiles of the same file. However, this is not guaranteed if the input is stdin. If the
same file is compiled with two different options, e.g., nvcc -c t.cu and nvcc -c -ptx t.cu, then the files
should be compiled in different directories. Compiling them in the same directory can either cause the
compilation to fail or produce incorrect results.
--library library,... (-l)
Specify libraries to be used in the linking stage without the library file extension.
The libraries are searched for on the library search paths that have been specified using option
--library-path (see Libraries).
--generate-dependency-targets (-MP)
Add an empty target for each dependency.

--compiler-bindir directory (-ccbin)
Specify the directory in which the default host compiler executable resides.
The host compiler executable name can be also specified to ensure that the correct host compiler is se-
lected. In addition, driver prefix options (--input-drive-prefix, --dependency-drive-prefix,
or --drive-prefix) may need to be specified, if nvcc is executed in a Cygwin shell or a MinGW shell
on Windows.
--allow-unsupported-compiler (-allow-unsupported-compiler)
Disable nvcc checks for supported host compiler versions. Using an unsupported host compiler may
cause compilation failure or incorrect runtime execution; use at your own risk.

--archiver-binary executable (-arbin)
Specify the path of the archiver tool used to create static libraries with --lib.
--cudart {none|shared|static} (-cudart)
Specify the type of CUDA runtime library to be used: no CUDA runtime library, shared/dynamic CUDA
runtime library, or static CUDA runtime library.
Allowed Values
▶ none
▶ shared
▶ static
Default
The static CUDA runtime library is used by default.
--cudadevrt {none|static} (-cudadevrt)
Specify the type of CUDA device runtime library to be used: no CUDA device runtime library, or static
CUDA device runtime library.
Allowed Values
▶ none
▶ static
Default
The static CUDA device runtime library is used by default.
--target-directory string (-target-dir)
Specify the subfolder name in the targets directory where the default include and library paths are
located.
--link (-link)
Specify the default behavior: compile and link all input files.
Default Output File Name
a.exe on Windows or a.out on other platforms is used as the default output file name.
--lib (-lib)
Compile all input files into object files, if necessary, and add the results to the specified library output
file.
Default Output File Name
a.lib on Windows or a.a on other platforms is used as the default output file name.
--device-link (-dlink)
Link object files with relocatable device code and .ptx, .cubin, and .fatbin files into an object file
with executable device code, which can be passed to the host linker.
Default Output File Name
a_dlink.obj on Windows or a_dlink.o on other platforms is used as the default output file name.
When this option is used in conjunction with --fatbin, a_dlink.fatbin is used as the default out-
put file name. When this option is used in conjunction with --cubin, a_dlink.cubin is used as the
default output file name.
--device-c (-dc)
Compile each .c, .cc, .cpp, .cxx, and .cu input file into an object file that contains relocatable device
code.
It is equivalent to --relocatable-device-code=true --compile.
Default Output File Name
The source file name extension is replaced by .obj on Windows and .o on other platforms to create
the default output file name. For example, the default output file name for x.cu is x.obj on Windows
and x.o on other platforms.
--device-w (-dw)
Compile each .c, .cc, .cpp, .cxx, and .cu input file into an object file that contains executable device
code.
It is equivalent to --relocatable-device-code=false --compile.
Default Output File Name
The source file name extension is replaced by .obj on Windows and .o on other platforms to create
the default output file name. For example, the default output file name for x.cu is x.obj on Windows
and x.o on other platforms.
--cuda (-cuda)
Compile each .cu input file to a .cu.cpp.ii file. For example, the default output file name for x.cu is
x.cu.cpp.ii.
--compile (-c)
Compile each .c, .cc, .cpp, .cxx, and .cu input file into an object file.
Default Output File Name
The source file name extension is replaced by .obj on Windows and .o on other platforms to create
the default output file name. For example, the default output file name for x.cu is x.obj on Windows
and x.o on other platforms.
--fatbin (-fatbin)
Compile all .cu, .ptx, and .cubin input files to device-only .fatbin files.
nvcc discards the host code for each .cu input file with this option.
Default Output File Name
The source file name extension is replaced by .fatbin to create the default output file name. For
example, the default output file name for x.cu is x.fatbin.
--cubin (-cubin)
Compile all .cu and .ptx input files to device-only .cubin files.
nvcc discards the host code for each .cu input file with this option.
Default Output File Name
The source file name extension is replaced by .cubin to create the default output file name. For
example, the default output file name for x.cu is x.cubin.
--ptx (-ptx)
Compile all .cu input files to device-only .ptx files.
nvcc discards the host code for each .cu input file with this option.
Default Output File Name
The source file name extension is replaced by .ptx to create the default output file name. For
example, the default output file name for x.cu is x.ptx.
--preprocess (-E)
Preprocess all .c, .cc, .cpp, .cxx, and .cu input files.
Default Output File Name
The output is generated in stdout by default.
--generate-dependencies (-M)
Generate a dependency file that can be included in a Makefile for the .c, .cc, .cpp, .cxx, and .cu
input file.
nvcc uses a fixed prefix to identify dependencies in the preprocessed file (‘#line 1’ on Linux and
‘# 1’ on Windows). The files mentioned in source location directives starting with this prefix will be
included in the dependency list.
Default Output File Name
The output is generated in stdout by default.
--generate-nonsystem-dependencies (-MM)
Same as --generate-dependencies but skip header files found in system directories (Linux only).
Default Output File Name
The output is generated in stdout by default.
--generate-dependencies-with-compile (-MD)
Generate a dependency file and compile the input file. The dependency file can be included in a Make-
file for the .c, .cc, .cpp, .cxx, and .cu input file.
This option cannot be specified together with -E. The dependency file name is computed as follows:
▶ If -MF is specified, then the specified file is used as the dependency file name.
▶ If -o is specified, the dependency file name is computed from the specified file name by replacing
the suffix with ‘.d’.
▶ Otherwise, the dependency file name is computed by replacing the input file name's suffix with
‘.d’.
If the dependency file name is computed based on either -MF or -o, then multiple input files are not
supported.
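The suffix-replacement rules above can be illustrated with plain shell string substitution (a sketch;
kernel.cu and build/kernel.o are hypothetical names, and the parameter expansion mirrors the name
computation described above):

```shell
# No -MF and no -o: the input file's suffix is replaced with .d
src="kernel.cu"
echo "${src%.*}.d"     # kernel.d

# With -o build/kernel.o: the output file's suffix is replaced with .d
out="build/kernel.o"
echo "${out%.*}.d"     # build/kernel.d
```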
--generate-nonsystem-dependencies-with-compile (-MMD)
Same as --generate-dependencies-with-compile, but skip header files found in system directories
(Linux only).
--optix-ir (-optix-ir)
Compile CUDA source to OptiX IR (.optixir) output. The OptiX IR is only intended for consumption by
OptiX through appropriate APIs. This feature is not supported with link-time-optimization (-dlto), the
lto_NN -arch target, or with -gencode.
Default Output File Name
The source file name extension is replaced by .optixir to create the default output file name. For
example, the default output file name for x.cu is x.optixir.
--run (-run)
Compile and link all input files into an executable, and execute it.
When the input is a single executable, it is executed without any compilation or linking. This step
is intended for developers who do not want to be bothered with setting the necessary environment
variables; these are set temporarily by nvcc.
--debug (-g)
Generate debug information for host code.
--device-debug (-G)
Generate debug information for device code. If --dopt is not specified, this option also turns off all
optimizations on device code. It is not intended for profiling; use --generate-line-info instead.
--extensible-whole-program (-ewp)
Generate extensible whole program device code, which allows some calls to not be resolved until linking
with libcudadevrt.
--no-compress (-no-compress)
Do not compress device code in fatbinary.
--generate-line-info (-lineinfo)
Generate line-number information for device code.

--dopt kind (-dopt)
Enable device code optimization. When specified along with -G, enables limited debug information
generation for optimized device code (currently, only line number information). When -G is not speci-
fied, -dopt=on is implicit.
Allowed Values
▶ on: enable device code optimization.
--dlink-time-opt (-dlto)
Perform link-time optimization of device code. The option ‘-lto’ is also an alias for ‘-dlto’. Link-time
optimization must be specified at both compile and link time: at compile time it stores high-level
intermediate code, and at link time it links together and optimizes that intermediate code. If the
intermediate code is not found at link time, nothing happens. Intermediate code is also stored at
compile time with the --gpu-code='lto_NN' target. The options -dlto -arch=sm_NN will add an lto_NN
target; if you want to only add an lto_NN target and not the compute_NN target that -arch=sm_NN
usually generates, use -arch=lto_NN.
--ftemplate-backtrace-limit limit (-ftemplate-backtrace-limit)
Set the maximum number of template instantiation notes for a single warning or error to limit.
A value of 0 is allowed, and indicates that no limit should be enforced. This value is also passed to the
host compiler if it provides an equivalent flag.
--no-exceptions (-noeh)
Disable exception handling for host code.
--shared (-shared)
Generate a shared library during linking. Use option --linker-options when other linker options are
required for more control.

--x {c|c++|cu} (-x)
Explicitly specify the language for the input files, rather than letting the compiler choose a default
based on the file name suffix.
Allowed Values
▶ c
▶ c++
▶ cu
Default
The language of the source code is determined based on the file name suffix.
--no-host-device-initializer-list (-nohdinitlist)
Do not implicitly consider member functions of std::initializer_list as __host__ __device__ functions.
--expt-relaxed-constexpr (-expt-relaxed-constexpr)
Experimental flag: Allow host code to invoke __device__ constexpr functions, and device code to
invoke __host__ constexpr functions.
Note that the behavior of this flag may change in future compiler releases.
--extended-lambda (-extended-lambda)
Allow __host__, __device__ annotations in lambda declarations.
--expt-extended-lambda (-expt-extended-lambda)
Alias for --extended-lambda.
--m64 (-m64)
Alias for --machine=64.

--host-linker-script {use-lcs|gen-lcs} (-hls)
Use the host linker script (GNU/Linux only) to enable support for certain CUDA specific requirements,
while building executable files or shared libraries.
Allowed Values
use-lcs
Prepares a host linker script and enables the host linker to support relocatable device object
files that are larger in size and would otherwise, in certain cases, cause the host linker to
fail with a relocation truncation error.
gen-lcs
Generates a host linker script that can be passed to the host linker manually, in the case where
the host linker is invoked separately outside of nvcc. This option can be combined with the
-shared or -r option to generate linker scripts that can be used while generating host shared
libraries or host relocatable links, respectively.
The file generated using this option must be provided as the last input file to the host
linker.
The output is generated to stdout by default. Use the option -o filename to specify the
output filename.
A linker script may already be in use and passed to the host linker using the host linker option
--script (or -T); in that case, the generated host linker script must augment the existing linker
script, and the option -aug-hls must be used to generate a linker script that contains only the
augmentation parts. Otherwise, the host linker behaviour is undefined.
A host linker option, such as -z with a non-default argument, that can modify the default linker script
internally, is incompatible with this option and the behavior of any such usage is undefined.
Default Value
use-lcs is used as the default type.
--augment-host-linker-script (-aug-hls)
Enables generation of host linker script that augments an existing host linker script (GNU/Linux only).
See option --host-linker-script for more details.
--host-relocatable-link (-r)
When used in combination with -hls=gen-lcs, controls the behaviour of -hls=gen-lcs and sets
it to generate host linker script that can be used in host relocatable link (ld -r linkage). See option
-hls=gen-lcs for more information.
This option is currently effective only when used with -hls=gen-lcs; in all other cases, it is
ignored.
--forward-unknown-to-host-compiler (-forward-unknown-to-host-compiler)
Forward unknown options to the host compiler. An ‘unknown option’ is a command line argument that
starts with - followed by another character, and is not a recognized nvcc flag or an argument for a
recognized nvcc flag.
If the unknown option is followed by a separate command line argument, the argument will not be
forwarded, unless it begins with the - character.
For example:
▶ nvcc -forward-unknown-to-host-compiler -foo=bar a.cu will forward -foo=bar to the
host compiler.
▶ nvcc -forward-unknown-to-host-compiler -foo bar a.cu will report an error for the
bar argument.
▶ nvcc -forward-unknown-to-host-compiler -foo -bar a.cu will forward -foo and -bar
to the host compiler.
--forward-unknown-to-host-linker (-forward-unknown-to-host-linker)
Forward unknown options to the host linker. An ‘unknown option’ is a command line argument that
starts with - followed by another character, and is not a recognized nvcc flag or an argument for a
recognized nvcc flag.
If the unknown option is followed by a separate command line argument, the argument will not be
forwarded, unless it begins with the - character.
For example:
▶ nvcc -forward-unknown-to-host-linker -foo=bar a.cu will forward -foo=bar to the host
linker.
▶ nvcc -forward-unknown-to-host-linker -foo bar a.cu will report an error for the bar
argument.
▶ nvcc -forward-unknown-to-host-linker -foo -bar a.cu will forward -foo and -bar to
the host linker.
--dont-use-profile (-noprof)
Do not use configurations from the nvcc.profile file for compilation.
--threads number (-t)
Specify the maximum number of threads to be used to execute the compilation steps in parallel.
This option can be used to improve the compilation speed when compiling for multiple architectures.
The compiler creates number threads to execute the compilation steps in parallel. If number is 1, this
option is ignored. If number is 0, the number of threads used is the number of CPUs on the machine.
--dryrun (-dryrun)
--verbose (-v)
--keep (-keep)
Keep all intermediate files that are generated during internal compilation steps.
--keep-dir directory (-keep-dir)
Keep all intermediate files that are generated during internal compilation steps in this directory.
--save-temps (-save-temps)
--clean-targets (-clean)
Delete all the non-temporary files that the same nvcc command would generate without this option.
This option reverses the behavior of nvcc: when specified, none of the compilation phases are
executed; instead, all of the non-temporary files that nvcc would otherwise create are deleted.
--run-args arguments,... (-run-args)
Specify command line arguments for the executable when used in conjunction with --run.
--use-local-env (-use-local-env)
--dependency-target-name target (-MT)
Specify the target name of the generated rule when generating a dependency file (see
--generate-dependencies).
--no-align-double
Specify that -malign-double should not be passed as a compiler argument on 32-bit platforms.
WARNING: this makes the ABI incompatible with CUDA's kernel ABI for certain 64-bit types.
--no-device-link (-nodlink)
--allow-unsupported-compiler (-allow-unsupported-compiler)
--default-stream {legacy|null|per-thread} (-default-stream)
Specify the stream that CUDA commands from the compiled program will be sent to by default.
Allowed Values
legacy
The CUDA legacy stream (per context, implicitly synchronizes with other streams)
per-thread
Normal CUDA stream (per thread, does not implicitly synchronize with other streams)
null
Deprecated alias for legacy
Default
legacy is used as the default stream.
--gpu-architecture arch (-arch)
Specify the name of the class of NVIDIA virtual GPU architecture for which the CUDA input files must
be compiled.
With the exception as described for the shorthand below, the architecture specified with this option
must be a virtual architecture (such as compute_50). Normally, this option alone does not trigger
assembly of the generated PTX for a real architecture (that is the role of nvcc option --gpu-code,
see below); rather, its purpose is to control preprocessing and compilation of the input to PTX.
For convenience, in case of simple nvcc compilations, the following shorthand is supported.
If no value for option --gpu-code is specified, then the value of this option defaults to the
value of --gpu-architecture. In this situation, as only exception to the description above,
the value specified for --gpu-architecture may be a real architecture (such as sm_50), in
which case nvcc uses the specified real architecture and its closest virtual architecture as effec-
tive architecture values. For example, nvcc --gpu-architecture=sm_50 is equivalent to nvcc
--gpu-architecture=compute_50 --gpu-code=sm_50,compute_50.
When -arch=native is specified, nvcc detects the GPUs visible on the system and generates code
for them; no PTX program is generated with this option. If no supported GPU is visible on the system,
a warning is issued and the default architecture is used.
If -arch=all is specified, nvcc embeds a compiled code image for all supported architectures
(sm_*), and a PTX program for the highest major virtual architecture. For -arch=all-major, nvcc
embeds a compiled code image for all supported major versions (sm_*0), plus the earliest supported
architecture, and adds a PTX program for the highest major virtual architecture.
See Virtual Architecture Feature List for the list of supported virtual architectures and GPU Feature
List for the list of supported real architectures.
Default
sm_52 is used as the default value; PTX is generated for compute_52 then assembled and optimized
for sm_52.
--gpu-code code,... (-code)
Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
nvcc embeds a compiled code image in the resulting executable for each specified code architecture,
which is a true binary load image for each real architecture (such as sm_50), and PTX code for the
virtual architecture (such as compute_50).
During runtime, such embedded PTX code is dynamically compiled by the CUDA runtime system if no
binary load image is found for the current GPU.
Architectures specified for options --gpu-architecture and --gpu-code may be virtual as well
as real, but the code architectures must be compatible with the arch architecture. When the
--gpu-code option is used, the value for the --gpu-architecture option must be a virtual PTX
architecture.
For instance, --gpu-architecture=compute_60 is not compatible with --gpu-code=sm_52, be-
cause the earlier compilation stages will assume the availability of compute_60 features that are not
present on sm_52.
See Virtual Architecture Feature List for the list of supported virtual architectures and GPU Feature
List for the list of supported real architectures.
Specify the global entry functions for which code must be generated.
PTX is generated for all entry functions, but only the selected entry functions are assembled. Entry
function names for this option must be specified in mangled form.
Default
nvcc generates code for all entry functions.
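As a hypothetical illustration of the mangled form: for the kernel sketched below, compiled with the Itanium C++ ABI that nvcc uses on Linux, the name to pass is _Z5scalePfi rather than the source name scale (the kernel itself is made up for this example).

```cuda
// Mangles to _Z5scalePfi: "_Z" + "5scale" + "Pf" (float*) + "i" (int).
__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] *= 2.0f;
}
```

Declaring the kernel extern "C" keeps the unmangled name scale instead.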
--maxrregcount amount (-maxrregcount)
Specify the maximum amount of registers that GPU functions can use.
Up to a function-specific limit, a higher value will generally increase the performance of individual GPU
threads that execute this function. However, because thread registers are allocated from a global
register pool on each GPU, a higher value of this option will also reduce the maximum thread block
size, thereby reducing the amount of thread parallelism. Hence, a good maxrregcount value is the
result of a trade-off.
A value lower than the minimum number of registers required by the ABI will be raised by the compiler
to the ABI minimum.
The user program may not be able to make use of all registers, as some registers are reserved by the
compiler.
Default
No maximum is assumed.
--use_fast_math (-use_fast_math)
Make use of fast math library.
--fmad (-fmad)
This option enables (disables) the contraction of floating-point multiplies and adds/subtracts into
floating-point multiply-add operations (FMAD, FFMA, or DFMA).
--use_fast_math implies --fmad=true.
Allowed Values
▶ true
▶ false
Default
This option is set to true and nvcc enables the contraction of floating-point multiplies and
adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA).
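The numerical effect of contraction can be seen with a small host-side sketch (the file name and values are hypothetical): a fused multiply-add rounds the exact product only once, so it can recover a residue that the separately rounded sequence loses.

```cuda
// fmad_demo.cu (hypothetical): contrast separate rounding with fused rounding.
#include <cstdio>
#include <cmath>

int main(void) {
    double a = 134217729.0;      // 2^27 + 1: a*a is not exactly representable
    volatile double p = a * a;   // product rounded once (volatile blocks re-contraction)
    double c = -p;

    double separate = p + c;     // two roundings: the residue is lost
    double fused = fma(a, a, c); // one rounding of the exact a*a + c: residue survives

    printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}
```

On the device, with --fmad=true (the default), an expression such as a * b + c may be contracted into a fused operation without an explicit fma() call; --fmad=false preserves the two separately rounded operations.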
--extra-device-vectorization (-extra-device-vectorization)
--compile-as-tools-patch (-astoolspatch)
--keep-device-functions (-keep-device-functions)
In whole program compilation mode, preserve user-defined external-linkage __device__ function def-
initions in the generated PTX.
--source-in-ptx (-src-in-ptx)
--restrict (-restrict)
--Wno-deprecated-gpu-targets (-Wno-deprecated-gpu-targets)
--Wno-deprecated-declarations (-Wno-deprecated-declarations)
--Wreorder (-Wreorder)
--Wdefault-stream-launch (-Wdefault-stream-launch)
Generate warning when an explicit stream argument is not provided in the <<<...>>> kernel launch
syntax.
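For illustration, a minimal sketch (the kernel k is hypothetical) of a launch that would trigger this warning, next to one that would not:

```cuda
__global__ void k(void) {}

int main(void) {
    k<<<1, 32>>>();                          // no stream argument: warned about
    k<<<1, 32, 0, cudaStreamPerThread>>>();  // explicit stream: no warning
    return cudaDeviceSynchronize() != cudaSuccess;
}
```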
--Wmissing-launch-bounds (-Wmissing-launch-bounds)
Generate warning when a __global__ function does not have an explicit __launch_bounds__ an-
notation.
--Wext-lambda-captures-this (-Wext-lambda-captures-this)
--Werror kind,... (-Werror)
Make warnings of the specified kinds into errors:
cross-execution-space-call
Be more strict about unsupported cross-execution-space calls. The compiler will generate an
error instead of a warning for a call from a __host__ __device__ function to a __host__ function.
reorder
Generate errors when member initializers are reordered.
default-stream-launch
Generate error when an explicit stream argument is not provided in the <<<...>>> kernel
launch syntax.
missing-launch-bounds
Generate error when a __global__ function does not have an explicit
__launch_bounds__ annotation.
ext-lambda-captures-this
Generate error when an extended lambda implicitly captures this.
deprecated-declarations
Generate error on use of a deprecated entity.
--display-error-number (-err-no)
This option displays a diagnostic number for any message generated by the CUDA frontend compiler
(note: not the host compiler).
--no-display-error-number (-no-err-no)
This option disables the display of a diagnostic number for any message generated by the CUDA fron-
tend compiler (note: not the host compiler).
--diag-error errNum,... (-diag-error)
Emit error for specified diagnostic message(s) generated by the CUDA frontend compiler (note: does
not affect diagnostics generated by the host compiler/preprocessor).
--diag-suppress errNum,... (-diag-suppress)
Suppress specified diagnostic message(s) generated by the CUDA frontend compiler (note: does not
affect diagnostics generated by the host compiler/preprocessor).
--diag-warn errNum,... (-diag-warn)
Emit warning for specified diagnostic message(s) generated by the CUDA frontend compiler (note:
does not affect diagnostics generated by the host compiler/preprocessor).
--resource-usage (-res-usage)
Show resource usage such as registers and memory of the GPU code.
This option implies --nvlink-options=--verbose when --relocatable-device-code=true is
set. Otherwise, it implies --ptxas-options=--verbose.
--help (-h)
--version (-V)
--time filename (-time)
Generate a comma separated value table with the time taken by each compilation phase, and append
it at the end of the file given as the option argument. If the file is empty, the column headings are
generated in the first row of the table.
If the file name is -, the timing data is written to stdout.
--qpp-config config (-qpp-config)
Specify the configuration ([[compiler/]version,][target]) when using the q++ host compiler. The argument
will be forwarded to the q++ compiler with its -V flag.
--list-gpu-code (-code-ls)
List the GPU architectures (sm_XX) supported by the tool and exit.
If both --list-gpu-code and --list-gpu-arch are set, the list is displayed using the same format
as the --generate-code value.
--list-gpu-arch (-arch-ls)
List the virtual device architectures (compute_XX) supported by the tool and exit.
If both --list-gpu-arch and --list-gpu-code are set, the list is displayed using the same format
as the --generate-code value.
Ptxas Options
The following table lists some useful ptxas options which can be specified with nvcc option -Xptxas.
--allow-expensive-optimizations (-allow-expensive-optimizations)
Enable (disable) the compiler to perform expensive optimizations using maximum available re-
sources (memory and compile time).
If unspecified, the default behavior is to enable this feature at optimization level >= O2.
--compile-only (-c)
--def-load-cache (-dlcm)
--def-store-cache (-dscm)
--device-debug (-g)
--disable-optimizer-constants (-disable-optimizer-consts)
--fmad (-fmad)
--force-load-cache (-flcm)
--force-store-cache (-fscm)
--generate-line-info (-lineinfo)
--help (-h)
--machine (-m)
--opt-level N (-O)
--position-independent-code (-pic)
--preserve-relocs (-preserve-relocs)
This option makes ptxas generate relocatable references for variables and preserve the relocations
generated for them in the linked executable.
--sp-bound-check (-sp-bound-check)
--verbose (-v)
--version (-V)
--warning-as-error (-Werror)
--warn-on-double-precision-use (-warn-double-usage)
--warn-on-local-memory-usage (-warn-lmem-usage)
--warn-on-spills (-warn-spills)
--compile-as-tools-patch (-astoolspatch)
NVLINK Options
The following table lists some useful nvlink options which can be specified with nvcc option
--nvlink-options.
--disable-warnings (-w)
--preserve-relocs (-preserve-relocs)
--verbose (-v)
--warning-as-error (-Werror)
--suppress-arch-warning (-suppress-arch-warning)
Suppress the warning that is otherwise printed when an object does not contain code for the target
architecture.
--suppress-stack-size-warning (-suppress-stack-size-warning)
Suppress the warning that is otherwise printed when the stack size cannot be determined.
--dump-callgraph (-dump-callgraph)
These environment variables can be useful for injecting nvcc flags globally without modifying build
scripts.
The additional flags coming from either NVCC_PREPEND_FLAGS or NVCC_APPEND_FLAGS will be
listed in the verbose log (--verbose).
This chapter describes the GPU compilation model that is maintained by nvcc, in cooperation with the
CUDA driver. It goes through some technical sections, with concrete examples at the end.
The above table lists the currently defined virtual architectures. The virtual architecture naming
scheme is the same as the real architecture naming scheme shown in Section GPU Feature List.
The disadvantage of just-in-time compilation is increased application startup delay, but this can be
alleviated by letting the CUDA driver use a compilation cache (refer to Section 3.1.1.2, Just-in-Time
Compilation, of the CUDA C++ Programming Guide), which persists over multiple runs of the
application.
6.6.2. Fatbinaries
A different solution, which overcomes the JIT startup delay while still allowing execution on newer
GPUs, is to specify multiple code instances, as in
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52
This command generates exact code for two Maxwell variants, plus PTX code for use by JIT in case a
next-generation GPU is encountered. nvcc organizes its device code in fatbinaries, which are able to
hold multiple translations of the same GPU source code. At runtime, the CUDA driver will select the
most appropriate translation when the device function is launched.
6.7.2. Shorthand
nvcc allows a number of shorthands for simple cases.
Shorthand 1
--gpu-code arguments can be virtual architectures. In this case the stage 2 translation will be omit-
ted for such virtual architectures, and the stage 1 PTX result will be embedded instead. At application
launch, in case the driver does not find a better alternative, the stage 2 compilation will be invoked
by the driver with the PTX as input.
Example
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52
Shorthand 2
The --gpu-code option can be omitted. Only in this case may the --gpu-architecture value be
a non-virtual architecture. The --gpu-code values default to the closest virtual architecture that is
implemented by the GPU specified with --gpu-architecture, plus the --gpu-architecture value
itself. The closest virtual architecture is used as the effective --gpu-architecture value. If the
--gpu-architecture value is a virtual architecture, it is also used as the effective --gpu-code value.
Example
nvcc x.cu --gpu-architecture=sm_52
nvcc x.cu --gpu-architecture=compute_50
are equivalent to
nvcc x.cu --gpu-architecture=compute_52 --gpu-code=sm_52,compute_52
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50
Shorthand 3
Both the --gpu-architecture and --gpu-code options can be omitted.
Example
nvcc x.cu
is equivalent to
nvcc x.cu --gpu-architecture=compute_52 --gpu-code=sm_52,compute_52
Sometimes it is necessary to perform different GPU code generation steps, partitioned over different
architectures. This is possible using nvcc option --generate-code, which then must be used instead
of a --gpu-architecture and --gpu-code combination.
Unlike option --gpu-architecture, option --generate-code may be repeated on the nvcc com-
mand line. It takes sub-options arch and code, which must not be confused with their main op-
tion equivalents, but behave similarly. If repeated architecture compilation is used, then the device
code must use conditional compilation based on the value of the architecture identification macro
__CUDA_ARCH__, which is described in the next section.
For example, the following assumes absence of half-precision floating-point operation support for the
sm_50 and sm_52 code, but full support on sm_53:
nvcc x.cu \
--generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_50,code=sm_52 \
--generate-code arch=compute_53,code=sm_53
Or, leaving actual GPU code generation to the JIT compiler in the CUDA driver:
nvcc x.cu \
--generate-code arch=compute_50,code=compute_50 \
--generate-code arch=compute_53,code=compute_53
The code sub-options can be combined with a slightly more complex syntax:
nvcc x.cu \
--generate-code arch=compute_50,code=[sm_50,sm_52] \
--generate-code arch=compute_53,code=sm_53
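A sketch of what the device code in x.cu might contain for the half-precision scenario above (the kernel and its fallback are hypothetical): the __CUDA_ARCH__ macro selects the native half-precision path only when compiling for sm_53 or newer.

```cuda
#include <cuda_fp16.h>

__global__ void scale(__half *out, const __half *in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
#if __CUDA_ARCH__ >= 530
    out[i] = __hmul(in[i], __float2half(s));         // native half arithmetic
#else
    out[i] = __float2half(__half2float(in[i]) * s);  // float fallback for sm_50/sm_52
#endif
}
```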
Prior to the 5.0 release, CUDA did not support separate compilation, so CUDA code could not call device
functions or access variables across files. Such compilation is referred to as whole program compilation.
Separate compilation of host code has always been supported; it was only the device CUDA code
that needed to be within one file. Starting with CUDA 5.0, separate compilation of device code is
supported, but the old whole program mode is still the default, so there are new options to invoke
separate compilation.
nvcc <objects>
can be used to implicitly call both the device and host linkers. This works because if the device linker
does not see any relocatable code it does not do anything.
Figure 4 shows the flow.
7.3. Libraries
The device linker has the ability to read the static host library formats (.a on Linux and Mac OS X, .lib
on Windows). It ignores any dynamic (.so or .dll) libraries. The --library and --library-path
options can be used to pass libraries to both the device and host linker. The library name is specified
without the library file extension when the --library option is used.
nvcc --gpu-architecture=sm_50 a.o b.o --library-path=<path> --library=foo
Alternatively, the library name, including the library file extension, can be used without the --library
option on Windows.
nvcc --gpu-architecture=sm_50 a.obj b.obj foo.lib --library-path=<path>
Note that the device linker ignores any objects that do not have relocatable device code.
7.4. Examples
Suppose we have the following files:
//---------- b.h ----------
#define N 8
extern __device__ int g[N];
extern __device__ void bar(void);

//---------- b.cu ----------
#include "b.h"
__device__ int g[N];
__device__ void bar(void) { g[threadIdx.x]++; }

//---------- a.cu ----------
#include <stdio.h>
#include "b.h"

__global__ void foo(void) {
    __syncthreads();
    g[threadIdx.x]++;
    bar();
}

int main(void) {
    int *dg, hg[N];
    foo<<<1, N>>>();
    if(cudaGetSymbolAddress((void**)&dg, g)){
        printf("couldn't get the symbol addr\n");
        return 1;
    }
    if(cudaMemcpy(hg, dg, N * sizeof(int), cudaMemcpyDeviceToHost)){
        printf("couldn't memcpy\n");
        return 1;
    }
    return 0;
}
These can be compiled with the following commands (these examples are for Linux):
nvcc --gpu-architecture=sm_50 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_50 a.o b.o
If you want to invoke the device and host linker separately, you can do:
nvcc --gpu-architecture=sm_50 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_50 --device-link a.o b.o --output-file link.o
g++ a.o b.o link.o --library-path=<path> --library=cudart
Note that all desired target architectures must be passed to the device linker, as that specifies what will
be in the final executable (some objects or libraries may contain device code for multiple architectures,
and the link step can then choose what to put in the final executable).
If you want to use the driver API to load a linked cubin, you can request just the cubin:
nvcc --gpu-architecture=sm_50 --device-link a.o b.o \
--cubin --output-file link.cubin
Note that only static libraries are supported by the device linker.
A PTX file can be compiled to a host object file and then linked by using:
nvcc --gpu-architecture=sm_50 --device-c a.ptx
An example that uses libraries, host linker, and dynamic parallelism would be:
nvcc --gpu-architecture=sm_50 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_50 --device-link a.o b.o --output-file link.o
nvcc --lib --output-file libgpu.a a.o b.o link.o
g++ host.o --library=gpu --library-path=<path> \
--library=cudadevrt --library=cudart
It is possible to do multiple device links within a single host executable, as long as each device link is in-
dependent of the other. This requirement of independence means that they cannot share code across
device executables, nor can they share addresses (e.g., a device function address can be passed from
host to device for a callback only if the device link sees both the caller and potential callback callee;
you cannot pass an address from one device executable to another, as those are separate address
spaces).
sm_52 and sm_50. An object could have been compiled for a different architecture but also have PTX
available, in which case the device linker will JIT the PTX to cubin for the desired architecture and then
link. Relocatable device code requires CUDA 5.0 or later Toolkit.
If Link Time Optimization is used with -dlto, the intermediate LTOIR is only guaranteed to be com-
patible within a major release (e.g. can link together 12.0 and 12.1 LTO intermediates, but not 12.1 and
11.6).
If a kernel is limited to a certain number of registers with the launch_bounds attribute or the
--maxrregcount option, then all functions that the kernel calls must not use more than that number
of registers; if they exceed the limit, then a link error will be given.
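For reference, the launch_bounds attribute mentioned here caps register use per kernel rather than globally via --maxrregcount; a hypothetical kernel with illustrative values:

```cuda
// Asks the compiler to keep register use low enough that blocks of up to
// 256 threads can run with at least 4 resident blocks per multiprocessor.
__global__ void __launch_bounds__(256, 4)
saxpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```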
Then if a.cu and b.cu both include a.h and instantiate getptr for the same type, and b.cu expects a
non-NULL address, and they are compiled with:
nvcc --gpu-architecture=compute_50 --device-c a.cu
nvcc --gpu-architecture=compute_52 --device-c b.cu
nvcc --gpu-architecture=sm_52 a.o b.o
At link time only one version of getptr is used, so the behavior depends on which version is picked.
To avoid this, either a.cu and b.cu must be compiled for the same compute architecture, or
__CUDA_ARCH__ should not be used in the shared header function.
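The header a.h itself is not shown in this excerpt; a hypothetical version exhibiting the hazard could look like the following, where the body of getptr differs by architecture even though both translation units instantiate the same symbol.

```cuda
// a.h (hypothetical): the instantiated body depends on __CUDA_ARCH__, so
// a.cu (compute_50) and b.cu (compute_52) emit conflicting definitions of
// the same instantiation, and the device linker keeps only one of them.
template <typename T>
__device__ T *getptr(void) {
#if __CUDA_ARCH__ >= 520
    __shared__ T buf;   // sm_52 path: returns a real address
    return &buf;
#else
    return NULL;        // sm_50 path: no address available
#endif
}
```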
As shown in the above example, the amount of statically allocated global memory (gmem) is listed.
Global memory and some of the constant banks are module scoped resources and not per kernel
resources. Allocation of constant variables to constant banks is profile specific.
Following this, per-kernel resource information is printed.
Stack frame is the per-thread stack usage of this function. Spill stores and loads represent stores
and loads done on stack memory, which is used for storing variables that could not be allocated
to physical registers.
Similarly, the number of registers, the amount of shared memory, and the total space allocated in the
constant bank are shown.
9.1. Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a
certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no repre-
sentations or warranties, expressed or implied, as to the accuracy or completeness of the information
contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall
have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to
develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any
other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that
such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the
time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by
authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects
to applying any customer general terms and conditions with regards to the purchase of the NVIDIA
product referenced in this document. No contractual obligations are formed either directly or indirectly
by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military,
aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA
product can reasonably be expected to result in personal injury, death, or property or environmental
damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or
applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for
any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA.
It is customer’s sole responsibility to evaluate and determine the applicability of any information con-
tained in this document, ensure the product is suitable and fit for the application planned by customer,
and perform the necessary testing for the application in order to avoid a default of the application or
the product. Weaknesses in customer’s product designs may affect the quality and reliability of the
NVIDIA product and may result in additional or different conditions and/or requirements beyond those
contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or prob-
lem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is
contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other
NVIDIA intellectual property right under this document. Information published by NVIDIA regarding
third-party products or services does not constitute a license from NVIDIA to use such products or
services or a warranty or endorsement thereof. Use of such information may require a license from a
third party under the patents or other intellectual property rights of the third party, or a license from
NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA
in writing, reproduced without alteration and in full compliance with all applicable export laws and
regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE
BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WAR-
RANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CON-
SEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARIS-
ING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatso-
ever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein
shall be limited in accordance with the Terms of Sale for the product.
9.2. OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
9.3. Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the
U.S. and other countries. Other company and product names may be trademarks of the respective
companies with which they are associated.
9.4. Copyright
© 2012-2023, NVIDIA Corporation & Affiliates. All rights reserved.