User Guide
TABLE OF CONTENTS
www.nvidia.com
User Guide v2023.1.1 | ii
Symbol Locations..........................................................................................132
2.3. Profiling QNX Targets from the GUI.................................................................132
Chapter 3. Export Formats..................................................................................134
3.1. SQLite Schema Reference............................................................................ 134
3.2. SQLite Schema Event Values......................................................................... 136
3.3. Common SQLite Examples............................................................................ 142
3.4. Arrow Format Description............................................................................ 156
3.5. JSON and Text Format Description................................................................. 157
Chapter 4. Report Scripts................................................................................... 158
Report Scripts Shipped With Nsight Systems............................................................ 158
cuda_api_gpu_sum[:base] -- CUDA Summary (API/Kernels/MemOps)............................ 158
cuda_api_sum -- CUDA API Summary..................................................................159
cuda_api_trace -- CUDA API Trace.....................................................................159
cuda_gpu_kern_sum[:base] -- CUDA GPU Kernel Summary........................................ 160
cuda_gpu_mem_size_sum -- CUDA GPU MemOps Summary (by Size)............................ 160
cuda_gpu_mem_time_sum -- CUDA GPU MemOps Summary (by Time).......................... 160
cuda_gpu_sum[:base] -- CUDA GPU Summary (Kernels/MemOps)................................ 161
cuda_gpu_trace -- CUDA GPU Trace................................................................... 161
nvtx_pushpop_sum -- NVTX Push/Pop Range Summary.............................................162
openmp_sum -- OpenMP Summary..................................................................... 162
osrt_sum -- OS Runtime Summary..................................................................... 163
vulkan_marker_sum -- Vulkan Range Summary...................................................... 163
pixsum -- D3D12 PIX Range Summary................................................................. 164
opengl_khr_range_sum -- OpenGL KHR_debug Range Summary.................................. 164
Report Formatters Shipped With Nsight Systems....................................................... 164
Column...................................................................................................... 165
Table........................................................................................................ 165
CSV.......................................................................................................... 166
TSV.......................................................................................................... 166
JSON......................................................................................................... 166
HDoc.........................................................................................................167
HTable.......................................................................................................167
Chapter 5. Migrating from NVIDIA nvprof................................................................ 168
Using the Nsight Systems CLI nvprof Command........................................................ 168
CLI nvprof Command Switch Options.....................................................................168
Next Steps.....................................................................................................171
Chapter 6. Profiling in a Docker on Linux Devices.................................................... 172
GUI VNC container...........................................................................................173
Chapter 7. Direct3D Trace.................................................................................. 176
7.1. D3D11 API trace........................................................................................176
7.2. D3D12 API Trace....................................................................................... 176
Chapter 8. WDDM Queues................................................................................... 181
Chapter 9. WDDM HW Scheduler.......................................................................... 183
Chapter 10. Vulkan API Trace.............................................................................. 184
10.1. Vulkan Overview...................................................................................... 184
10.2. Pipeline Creation Feedback.........................................................................186
10.3. Vulkan GPU Trace Notes.............................................................................186
Chapter 11. Stutter Analysis................................................................................187
11.1. FPS Overview..........................................................................................187
11.2. Frame Health..........................................................................................191
11.3. GPU Memory Utilization............................................................................. 191
11.4. Vertical Synchronization.............................................................................191
Chapter 12. OpenMP Trace..................................................................................192
Chapter 13. OS Runtime Libraries Trace................................................................. 194
13.1. Locking a Resource...................................................................................195
13.2. Limitations............................................................................................. 195
13.3. OS Runtime Libraries Trace Filters................................................................ 196
13.4. OS Runtime Default Function List................................................................. 197
Chapter 14. NVTX Trace..................................................................................... 200
Chapter 15. CUDA Trace..................................................................................... 204
15.1. CUDA GPU Memory Allocation Graph............................................................. 207
15.2. Unified Memory Transfer Trace.................................................................... 208
Unified Memory CPU Page Faults...................................................................... 209
Unified Memory GPU Page Faults...................................................................... 210
15.3. CUDA Graph Trace....................................................................................211
15.4. CUDA Default Function List for CLI............................................................... 213
15.5. cuDNN Function List for X86 CLI...................................................................215
Chapter 16. OpenACC Trace................................................................................ 217
Chapter 17. OpenGL Trace.................................................................................. 219
17.1. OpenGL Trace Using Command Line...............................................................221
Chapter 18. Custom ETW Trace............................................................................223
Chapter 19. GPU Metrics.................................................................................... 225
Overview.......................................................................................................225
Launching GPU Metric from the CLI...................................................................... 228
Launching GPU Metrics from the GUI.................................................................... 229
Sampling frequency..........................................................................................229
Available metrics.............................................................................................230
Exporting and Querying Data.............................................................................. 233
Limitations.................................................................................................... 234
Chapter 20. CPU Profiling Using Linux OS Perf Subsystem...........................................236
Chapter 21. NVIDIA Video Codec SDK Trace.............................................................244
21.1. NV Encoder API Functions Traced by Default.................................................... 245
21.2. NV Decoder API Functions Traced by Default....................................................246
21.3. NV JPEG API Functions Traced by Default........................................................247
Chapter 22. Network Communication Profiling.........................................................248
22.1. MPI API Trace......................................................................................... 249
22.2. OpenSHMEM Library Trace.......................................................................... 252
22.3. UCX Library Trace.................................................................................... 253
22.4. NVIDIA NVSHMEM and NCCL Trace................................................................. 254
22.5. NIC Metric Sampling................................................................................. 254
22.6. InfiniBand Switch Metric Sampling................................................................ 256
Chapter 23. Python Backtrace Sampling................................................................. 257
Chapter 24. Reading Your Report in GUI.................................................................258
24.1. Generating a New Report........................................................................... 258
24.2. Opening an Existing Report......................................................................... 258
24.3. Sharing a Report File................................................................................ 258
24.4. Report Tab............................................................................................. 258
24.5. Analysis Summary View..............................................................................259
24.6. Timeline View......................................................................................... 259
24.6.1. Timeline...........................................................................................259
Timeline Navigation................................................................................... 259
Row Height.............................................................................................. 262
Row Percentage........................................................................................ 263
24.6.2. Events View...................................................................................... 263
24.6.3. Function Table Modes.......................................................................... 265
24.6.4. Function Table Notes........................................................................... 268
24.6.5. Filter Dialog...................................................................................... 270
24.6.6. Example of Using Timeline with Function Table........................................... 270
24.7. Diagnostics Summary View..........................................................................276
24.8. Symbol Resolution Logs View....................................................................... 276
Chapter 25. Adding Report to the Timeline.............................................................277
25.1. Time Synchronization................................................................................ 277
25.2. Timeline Hierarchy................................................................................... 279
25.3. Example: MPI.......................................................................................... 280
25.4. Limitations............................................................................................. 281
Chapter 26. Using Nsight Systems Expert System......................................................282
Using Expert System from the CLI........................................................................ 282
Using Expert System from the GUI....................................................................... 282
Expert System Rules.........................................................................................283
CUDA Synchronous Operation Rules....................................................................283
GPU Low Utilization Rules.............................................................................. 284
Chapter 27. Import NVTXT..................................................................................286
Commands.....................................................................................................287
Chapter 28. Visual Studio Integration.................................................................... 289
Chapter 29. Troubleshooting............................................................................... 291
29.1. General Troubleshooting.............................................................................291
29.2. CLI Troubleshooting.................................................................................. 292
29.3. Launch Processes in Stopped State................................................................292
LD_PRELOAD............................................................................................... 293
Launcher.................................................................................................... 293
29.4. GUI Troubleshooting..................................................................................294
Ubuntu 18.04/20.04/22.04 and CentOS 7/8/9 with root privileges.............................. 294
Ubuntu 18.04/20.04/22.04 and CentOS 7/8/9 without root privileges.......................... 295
Other platforms, or if the previous steps did not help............................................ 295
29.5. Symbol Resolution.................................................................................... 295
Broken Backtraces on Tegra............................................................................ 297
Debug Versions of ELF Files.............................................................................298
29.6. Logging................................................................................................. 299
Verbose Remote Logging on Linux Targets............................................................299
Verbose CLI Logging on Linux Targets................................................................. 299
Verbose Logging on Windows Targets..................................................................300
Chapter 30. Other Resources...............................................................................301
Training Seminars............................................................................................ 301
Blog Posts..................................................................................................... 301
Feature Videos............................................................................................... 301
Conference Presentations.................................................................................. 302
For More Support............................................................................................ 302
Chapter 1.
PROFILING FROM THE CLI
nsys [global-options]

or

nsys [command_switch][optional command_switch_options][application] [optional application_options]
All command line options are case sensitive. For command switch options, when
short options are used, the parameters should follow the switch after a space; e.g. -s
process-tree. When long options are used, the switch should be followed by an equal
sign and then the parameter(s); e.g. --sample=process-tree.
For this version of Nsight Systems, if you launch a process from the command line to
begin analysis, the launched process will be terminated when collection is complete,
including runs with --duration set, unless the user specifies the --kill none option (details
www.nvidia.com
User Guide v2023.1.1 | 1
Profiling from the CLI
below). The exception is that if the user uses NVTX, cudaProfilerStart/Stop, or hotkeys to
control the duration, the application will continue unless --kill is set.
The Nsight Systems CLI supports concurrent analysis by using sessions. Each Nsight
Systems session is defined by a sequence of CLI commands that define one or more
collections (e.g. when and what data is collected). A session begins with either a start,
launch, or profile command. A session ends with a shutdown command, when a profile
command terminates, or, if requested, when all the process tree(s) launched in the
session exit. Multiple sessions can run concurrently on the same system.
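A minimal interactive-mode session following the lifecycle described above might look like the following sketch. The session name "mysession" and the application path are illustrative; the --session-new and --session switches are used by recent Nsight Systems releases to name and address a session.

```shell
# Launch the application under a new named session;
# no data is collected yet.
nsys launch --session-new=mysession ./my_app

# Begin a collection in that session, then stop it later.
nsys start --session=mysession
nsys stop --session=mysession

# End the session and disconnect from the launched application.
nsys shutdown --session=mysession
```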
analyze
    Post-process an existing Nsight Systems result, either in .nsys-rep or SQLite format, to generate an expert systems report.

cancel
    Cancel an existing collection started in interactive mode. All data already collected in the current collection is discarded.

export
    Generate an export file from an existing .nsys-rep file. For more information about the exported formats, see the /documentation/nsys-exporter directory in your Nsight Systems installation directory.

launch
    In interactive mode, launch an application in an environment that supports the requested options. The launch command can be executed before or after a start command.

nvprof
    Special option to help with the transition from the legacy NVIDIA nvprof tool. Calling nsys nvprof [options] will provide the best available translation of nvprof [options]. See the Migrating from NVIDIA nvprof topic for details. No additional nsys functionality is available when using this option. Note: not available on IBM Power targets.

profile
    A fully formed profiling description requiring and accepting no further input. The command switch options used (see the table below) determine when the collection starts and stops, which collectors are used (e.g. API trace, IP sampling, etc.), which processes are monitored, etc.

sessions
    Give information about all sessions running on the system.

shutdown
    Disconnect the CLI process from the launched application and force the CLI process to exit. If a collection is pending or active, it is cancelled.

start
    Start a collection in interactive mode. The start command can be executed before or after a launch command.

stats
    Post-process an existing Nsight Systems result, either in .nsys-rep or SQLite format, to generate statistical information.

status
    Report on the status of a CLI-based collection or the suitability of the profiling environment.

stop
    Stop a collection that was started in interactive mode. When executed, all active collections stop, the CLI process terminates, but the application continues running.
The analyze command accepts either an SQLite export or the path of a .nsys-rep file. If a .nsys-rep file is specified, Nsight Systems will look for an accompanying SQLite file and use it. If no SQLite export file exists, one will be created.
After choosing the analyze command switch, the following options are available.
Usage:
nsys [global-options] analyze [options] [input-file]
list
    List all active sessions, including ID, name, and state information.
Individual reports are generated by calling out to scripts that read data from the SQLite file and return their report data in CSV format. Nsight Systems ingests this data and formats it as requested, then displays it in the console, writes it to a file, or pipes it to an external process. Adding a new report is as simple as writing a script that can read the SQLite file and generate the required CSV output. See the shipped scripts as examples. Both reports and formatters may take arguments to tweak their processing. For details on the shipped scripts and formatters, see the Report Scripts topic.
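To illustrate the pattern (query the exported SQLite database, emit CSV), here is a minimal standalone sketch. It does not use the actual Nsight Systems report-script interface, and the table and column names (API_CALLS, name, duration_ns) are hypothetical, not the real export schema; see the SQLite Schema Reference for the actual tables.

```python
import csv
import io
import sqlite3

def summarize_api_calls(db_path: str) -> str:
    """Hypothetical report: call count and total time per API name.

    The schema below (table API_CALLS, columns name/duration_ns)
    is illustrative only, not the Nsight Systems export schema.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT name, COUNT(*), SUM(duration_ns) "
        "FROM API_CALLS GROUP BY name "
        "ORDER BY SUM(duration_ns) DESC"
    ).fetchall()
    con.close()

    # Return the report data as CSV text, the format the stats
    # machinery expects from a report script.
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["Name", "Calls", "Total Time (ns)"])
    writer.writerows(rows)
    return out.getvalue()
```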
Reports are processed using a three-tuple that consists of 1) the requested report (and
any arguments), 2) the presentation format (and any arguments), and 3) the output
(filename, console, or external process). The first report specified uses the first format
specified, and is presented via the first output specified. The second report uses the
second format for the second output, and so forth. If more reports are specified than
formats or outputs, the format and/or output list is expanded to match the number of
provided reports by repeating the last specified element of the list (or the default, if
nothing was specified).
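As a sketch of these pairing rules (the report names are from the shipped set; the input file name and output base name are illustrative):

```shell
# Two reports, paired positionally with two formats and two outputs:
#   cuda_api_sum   -> table -> console ("-" means console)
#   cuda_gpu_trace -> csv   -> file(s) derived from the name "trace"
nsys stats --report cuda_api_sum --report cuda_gpu_trace \
    --format table --format csv \
    --output - --output trace \
    report1.nsys-rep
```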
nsys stats is a very powerful command that can handle complex argument structures; see the Example Stats Command Sequences topic below.
After choosing the stats command switch, the following options are available. Usage:
nsys [global-options] stats [options] [input-file]
Run with elevated privileges
sudo nsys profile <application> [application-arguments]
Effect: Nsight Systems CLI (and target application) will run with elevated privilege. This is necessary for some features, such as FTrace or system-wide CPU sampling. If you don't want the target application to be elevated, use the --run-as option.
Default analysis run
nsys profile <application> [application-arguments]
Effect: Launch the application using the given arguments. Start collecting immediately and end collection when the application stops. Trace CUDA, OpenGL, NVTX, and OS runtime library APIs. Collect CPU sampling information and thread scheduling information. With Nsight Systems Embedded Platforms Edition this will only analyze the single process. With Nsight Systems Workstation Edition this will trace the process tree. Generate the report#.nsys-rep file in the default location, incrementing the report number if needed to avoid overwriting any existing output files.
Limited trace only run
nsys profile --trace=cuda,nvtx -d 20 --sample=none --cpuctxsw=none -o my_test <application> [application-arguments]
Effect: Launch the application using the given arguments. Start collecting immediately
and end collection after 20 seconds or when the application ends. Trace CUDA and
NVTX APIs. Do not collect CPU sampling information or thread scheduling information.
Profile any child processes. Generate the output file as my_test.nsys-rep in the current
working directory.
Delayed start run
nsys profile -e TEST_ONLY=0 -y 20 <application> [application-arguments]
Effect: Set environment variable TEST_ONLY=0. Launch the application using the given
arguments. Start collecting after 20 seconds and end collection at application exit. Trace
CUDA, OpenGL, NVTX, and OS runtime libraries APIs. Collect CPU sampling and
thread schedule information. Profile any child processes. Generate the report#.nsys-rep
file in the default location, incrementing if needed to avoid overwriting any existing
output files.
Collect ftrace events
nsys profile --ftrace=drm/drm_vblank_event -d 20
Effect: Collect the drm/drm_vblank_event ftrace events for 20 seconds. Generate the report#.nsys-rep file in the default location, incrementing if needed to avoid overwriting any existing output files.
Collect GPU metrics for the first GPU
nsys profile --gpu-metrics-device=0 <application> [application-arguments]
Effect: Launch application. Collect default options and GPU metrics for the first GPU (a TU10x), using the tu10x-gfxt metric set at the default frequency (10 kHz). Profile any child processes. Generate the report#.nsys-rep file in the default location, incrementing if needed to avoid overwriting any existing output files.
Collect GPU metrics for all GPUs at 20 kHz
nsys profile --gpu-metrics-device=all --gpu-metrics-frequency=20000 <application> [application-arguments]
Effect: Launch application. Collect default options and GPU metrics for all available GPUs using the first suitable metric set for each and sampling at 20 kHz. Profile any child processes. Generate the report#.nsys-rep file in the default location, incrementing if needed to avoid overwriting any existing output files.
Collect CPU IP/backtrace and CPU context switch
nsys profile --sample=system-wide --duration=5
Effect: Collects both CPU IP/backtrace samples using the default backtrace mechanism
and traces CPU context switch activity for the whole system for 5 seconds. Note that it
requires root permission to run. No hardware or OS events are sampled. Post processing
of this collection will take longer due to the large number of symbols to be resolved
caused by system-wide sampling.
Get list of available CPU core events
nsys profile --cpu-core-events=help
Effect: Lists the CPU events that can be sampled and the maximum number of CPU
events that can be sampled concurrently.
Collect system-wide CPU events and trace application
nsys profile --event-sample=system-wide
--cpu-core-events='1,2' --event-sampling-frequency=5 <app> [app args]
Effect: Collects CPU IP/backtrace samples using the default backtrace mechanism, traces
CPU context switch activity, and samples each CPU's “CPU Cycles” and “Instructions
Retired” event every 200 ms for the whole system. Note that it requires root permission
to run. Note that CUDA, NVTX, OpenGL, and OSRT within the app launched by
Nsight Systems are traced by default while using this command. Post processing of this
collection will take longer due to the large number of symbols to be resolved caused by
system-wide sampling.
Collect custom ETW trace using configuration file
nsys profile --etw-provider=file.JSON
Effect: Configure custom ETW collectors using the contents of file.JSON. Collect data for
20 seconds. Generate the report#.nsys-rep file in the current working directory.
A template JSON configuration file is located in the Nsight Systems installation
directory at \target-windows-x64\etw_providers_template.json. This path will show up
automatically if you call
nsys profile --help
The flags attribute can only be set to one or more of the following:
‣ EVENT_TRACE_FLAG_ALPC
‣ EVENT_TRACE_FLAG_CSWITCH
‣ EVENT_TRACE_FLAG_DBGPRINT
‣ EVENT_TRACE_FLAG_DISK_FILE_IO
‣ EVENT_TRACE_FLAG_DISK_IO
‣ EVENT_TRACE_FLAG_DISK_IO_INIT
‣ EVENT_TRACE_FLAG_DISPATCHER
‣ EVENT_TRACE_FLAG_DPC
‣ EVENT_TRACE_FLAG_DRIVER
‣ EVENT_TRACE_FLAG_FILE_IO
‣ EVENT_TRACE_FLAG_FILE_IO_INIT
‣ EVENT_TRACE_FLAG_IMAGE_LOAD
‣ EVENT_TRACE_FLAG_INTERRUPT
‣ EVENT_TRACE_FLAG_JOB
‣ EVENT_TRACE_FLAG_MEMORY_HARD_FAULTS
‣ EVENT_TRACE_FLAG_MEMORY_PAGE_FAULTS
‣ EVENT_TRACE_FLAG_NETWORK_TCPIP
‣ EVENT_TRACE_FLAG_NO_SYSCONFIG
‣ EVENT_TRACE_FLAG_PROCESS
‣ EVENT_TRACE_FLAG_PROCESS_COUNTERS
‣ EVENT_TRACE_FLAG_PROFILE
‣ EVENT_TRACE_FLAG_REGISTRY
‣ EVENT_TRACE_FLAG_SPLIT_IO
‣ EVENT_TRACE_FLAG_SYSTEMCALL
‣ EVENT_TRACE_FLAG_THREAD
‣ EVENT_TRACE_FLAG_VAMAP
‣ EVENT_TRACE_FLAG_VIRTUAL_ALLOC
Typical case: profile a Python script that uses CUDA
nsys profile --trace=cuda,cudnn,cublas,osrt,nvtx
--delay=60 python my_dnn_script.py
Effect: Launch a Python script and start profiling it 60 seconds after the launch, tracing
CUDA, cuDNN, cuBLAS, OS runtime APIs, and NVTX as well as collecting thread
schedule information.
Typical case: profile an app that uses Vulkan
nsys profile --trace=vulkan,osrt,nvtx
--delay=60 ./myapp
Effect: Launch an app and start profiling it 60 seconds after the launch, tracing Vulkan,
OS runtime APIs, and NVTX as well as collecting CPU sampling and thread schedule
information.
Effect: Create interactive CLI process and set it up to begin collecting as soon as an
application is launched. Launch the application, set up to allow tracing of CUDA and
NVTX as well as collection of thread schedule information. Stop only when explicitly
requested. Generate the report#.nsys-rep in the default location.
Note: If you start a collection and fail to stop the collection (or if you are allowing it to stop on exit, and the application runs for too long), your system's storage space may be filled with collected data, causing significant issues for the system. Nsight Systems will collect a different amount of data/sec depending on options, but in general Nsight Systems does not support runs of more than 5 minutes duration.
Effect: Create interactive CLI and launch an application set up for default analysis.
Send application output to the terminal. No data is collected until you manually
start collection at the area of interest. Profile until the application ends. Generate the
report#.nsys-rep in the default location.
Note: If you launch an application and that application and any descendants exit before start is called, Nsight Systems will create a fully formed .nsys-rep file containing no data.
Effect: Create interactive CLI process and set it up to begin collecting as soon as
a cudaProfilerStart() is detected. Launch the application for default analysis, sending
application output to the terminal. Stop collection at the next call to cudaProfilerStop(),
when the user calls nsys stop, or when the root process terminates. Generate the
report#.nsys-rep in the default location.
Note: If you call nsys launch before nsys start -c cudaProfilerApi and the code contains a large number of short duration cudaProfilerStart/Stop pairs, Nsight Systems may be unable to process them correctly, causing a fault. This will be corrected in a future version.
Note: The Nsight Systems CLI does not support multiple calls to the cudaProfilerStart/Stop API at this time.
Effect: Create interactive CLI process and set it up to begin collecting as soon as an
NVTX range with given message in given domain (capture range) is opened. Launch
application for default analysis, sending application output to the terminal. Stop
collection when all capture ranges are closed, when the user calls nsys stop, or when
the root process terminates. Generate the report#.nsys-rep in the default location.
Note: The Nsight Systems CLI only triggers the profiling session for the first capture range.
This would make the profiling start when the first range with message "profiler" is
opened in domain "service".
‣ Message@*: All ranges with given message in all domains are capture ranges. For
example:
nsys launch -w true -p profiler@* ./app
This would make the profiling start when the first range with message "profiler" is
opened in any domain.
‣ Message: All ranges with given message in default domain are capture ranges. For
example:
nsys launch -w true -p profiler ./app
This would make the profiling start when the first range with message "profiler" is
opened in the default domain.
‣ By default, only messages provided by NVTX registered strings are considered, to
avoid additional overhead. To enable the check for non-registered strings, launch
your application with the NSYS_NVTX_PROFILER_REGISTER_ONLY=0 environment variable:
nsys launch -w true -p profiler@service -e
NSYS_NVTX_PROFILER_REGISTER_ONLY=0 ./app
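The three spec forms can be summarized with a hypothetical matcher (illustrative only — the function name and the use of None to stand for the default, unnamed NVTX domain are assumptions, not Nsight Systems internals):

```python
# Hypothetical matcher for the -p capture-range specs described above.
# None stands in for the default (unnamed) NVTX domain; this is an
# assumption for illustration, not Nsight Systems internals.
def matches_capture_range(spec, message, domain):
    if "@" in spec:
        want_msg, want_dom = spec.split("@", 1)
        if want_dom == "*":                      # Message@*: any domain
            return message == want_msg
        return message == want_msg and domain == want_dom
    return message == spec and domain is None    # bare Message: default domain

print(matches_capture_range("profiler@service", "profiler", "service"))  # True
print(matches_capture_range("profiler@*", "profiler", "other"))          # True
print(matches_capture_range("profiler", "profiler", "service"))          # False
```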
Effect: Create interactive CLI and launch an application set up for default analysis.
Send application output to the terminal. No data is collected until the start command
is executed. Collect data from the first start until a stop is requested, and generate
report#.nsys-rep in the current working directory. Collect data from the second start until
the second stop request, and generate report#.nsys-rep (incremented by one) in the current
working directory. Shut down the interactive CLI and send SIGKILL to the target
application's process group.
Note: Calling nsys cancel after nsys start will cancel the collection without generating a report.
command will filter out everything but the header, formatting, and the cudaFree data,
and display the results to the console.
Note: When the output name starts with @, it is defined as a command. The command
is run, and the output of the report is piped to the command's stdin (standard-input).
The command's stdout and stderr remain attached to the console, so any output will be
displayed directly to the console.
Be aware there are some limitations in how the command string is parsed. No shell
expansions (including *, ?, [], and ~) are supported. The command cannot be piped
to another command, nor redirected to a file using shell syntax. The command and
command arguments are split on whitespace, and no quotes (within the command
syntax) are supported. For commands that require complex command line syntax, it is
suggested that the command be put into a shell script file, and the script designated as
the output command.
If your run traces graphics debug markers (these include DX11 debug markers, DX12
debug markers, Vulkan debug markers, and KHR debug markers):
Recipes for these statistics as well as documentation on how to create your own metrics
will be available in a future version of the tool.
The import of very large (multi-gigabyte) .qdstrm files may take up all of the memory
on the host computer and lock up the system. This will be fixed in a later version.
Importing Windows ETL files
For Windows targets, ETL files captured with Xperf or the log.cmd command supplied
with GPUView in the Windows Performance Toolkit can be imported to create reports
as if they were captured with Nsight Systems' "WDDM trace" and "Custom ETW trace"
features. Simply choose the .etl file from the Import dialog to convert it to a .nsys-rep
file.
Create .nsys-rep Using QdstrmImporter
The CLI and QdstrmImporter versions must match to convert a .qdstrm file into a .nsys-
rep file. This .nsys-rep file can then be opened in the same version or more recent
versions of the GUI.
To run QdstrmImporter on the host system, find the QdstrmImporter binary in the Host-
x86_64 directory in your installation. QdstrmImporter is available for all host platforms.
See options below.
To run QdstrmImporter on the target system, copy the Linux Host-x86_64 directory to
the target Linux system or install Nsight Systems for Linux host directly on the target.
The Windows or macOS host QdstrmImporter will not work on a Linux Target. See
options below.
Profile multi-node runs: nsys profile must be prefixed to the program to be
profiled. One report file will be created for each MPI rank. This also works for single-
node runs.
mpirun [mpirun options] nsys profile [nsys options]
Profile a single MPI process or a subset of MPI processes: Use a wrapper script similar
to the following script (called "profile_rank0.sh").
#!/bin/bash
The script runs nsys on rank 0 only. Add appropriate profiling options to the script and
execute it with mpirun [mpirun options] ./profile_rank0.sh ./myapp [app
options].
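The rank-selection logic such a wrapper performs can be sketched as follows (OMPI_COMM_WORLD_RANK is Open MPI's rank variable — other MPI launchers export differently named variables — and the nsys options shown are placeholders to be replaced with your own):

```python
import os

# Sketch of the rank-selection logic in a profile_rank0.sh-style wrapper:
# prefix the command with nsys only on rank 0, run it unchanged elsewhere.
def build_command(argv, rank):
    if rank == 0:
        return ["nsys", "profile"] + argv   # add your profiling options here
    return argv

# OMPI_COMM_WORLD_RANK is Open MPI's rank variable (an assumption here);
# Slurm, MPICH, etc. export differently named variables.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
print(build_command(["./myapp"], rank))
```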
Note: If only a subset of MPI ranks is profiled, set the environment variable NSYS_MPI_STORE_TEAMS_PER_RANK=1 to store all members of custom MPI communicators per MPI rank. Otherwise, the execution might hang or fail with an MPI error.
Avoid redundant GPU and NIC metrics collection: If multiple instances of nsys
profile are executed concurrently on the same node and GPU and/or NIC metrics
collection is enabled, each process will collect metrics for all available NICs and try to
collect GPU metrics for the specified devices. This can be avoided with a simple bash
script similar to the following:
#!/bin/bash
The above script will collect NIC and GPU metrics only for one rank, the node-local
rank 0. Alternatively, if one rank per GPU is used, the GPU metrics devices can be
specified based on the node-local rank in a wrapper script as follows:
#!/bin/bash
Chapter 2.
PROFILING FROM THE GUI
On Tegra:
The dialog has simple controls that allow adding, removing, and modifying connections:
Security notice: SSH is only used to establish the initial connection to a target device,
perform checks, and upload necessary files. The actual profiling commands and data
are transferred through a raw, unencrypted socket. Nsight Systems should not be used
in a network setup where an attacker-in-the-middle attack is possible, or where untrusted
parties may have network access to the target device.
While connecting to the target device, you will be prompted to input the user's
password. Please note that if you choose to remember the password, it will be stored in
plain text in the configuration file on the host. Stored passwords are bound to the public
key fingerprint of the remote device.
The No authentication option is useful for devices configured for passwordless
login using the root username. To enable such a configuration, edit the file /etc/ssh/
sshd_config on the target and specify the following option:
PermitRootLogin yes
Then set an empty password using passwd and restart the SSH service with service ssh
restart.
Open ports: The Nsight Systems daemon requires port 22 and port 45555 to be open for
listening. You can confirm that these ports are open with the following command:
sudo firewall-cmd --list-ports --permanent
sudo firewall-cmd --reload
To open a port, use the following command (skip the --permanent option to open it
only for the current session):
sudo firewall-cmd --permanent --add-port 45555/tcp
sudo firewall-cmd --reload
Likewise, if you are running on a cloud system, you must open port 22 and port 45555
for ingress.
Kernel Version Number - To check the version number of the kernel support for
Nsight Systems on a target device, run the following command on the remote device:
cat /proc/quadd/version
Note: The Ftrace profiling option will not be displayed in the GUI unless you are running with sudo. Also be aware that enabling too many options can cause significant setup/teardown overhead.
Trace all processes – On compatible devices (with kernel module support version 1.107
or higher), this enables trace of all processes and threads in the system. Scheduler events
from all tasks will be recorded.
Collect PMU counters – This allows you to choose which PMU (Performance
Monitoring Unit) counters Nsight Systems will sample. Enable specific counters when
interested in correlating cache misses to functions in your application.
Three different backtrace collection options are available when sampling CPU
instruction pointers. Backtraces can be generated using Intel(c) Last Branch Record
(LBR) registers. LBR backtraces generate minimal overhead but the backtraces have
limited depth. Backtraces can also be generated using DWARF debug data. DWARF
backtraces incur more overhead than LBR backtraces but have much better depth.
Finally, backtraces can be generated using frame pointers. Frame pointer backtraces
incur medium overhead and have good depth but only resolve frames in the portions
of the application and its libraries (including 3rd party libraries) that were compiled
with frame pointers enabled. Normally, frame pointers are disabled by default during
compilation.
By default, Nsight Systems will use Intel(c) LBRs if available and fall back to using
DWARF unwind if they are not. Choose modes... will allow you to override the default.
The Include child processes switch controls whether API tracing is only for the
launched process, or for all existing and new child processes of the launched process. If
you are running your application through a script, for example a bash script, you need
to set this checkbox.
The Include child processes switch does not control sampling in this version of Nsight
Systems. The full process tree will be sampled regardless of this setting. This will be
fixed in a future version of the product.
Nsight Systems can sample one process tree. Sampling here means interrupting each
processor after a certain number of events and collecting an instruction pointer (IP)/
backtrace sample if the processor is executing the profilee.
When sampling the CPU on a workstation target, Nsight Systems traces thread
context switches and infers thread state as either Running or Blocked. Note that
Blocked in the timeline indicates the thread may be Blocked (Interruptible) or Blocked
(Uninterruptible). Blocked (Uninterruptible) often occurs when a thread has transitioned
into the kernel and cannot be interrupted by a signal. Sampling can be enhanced with
OS runtime libraries tracing; see OS Runtime Libraries Trace for more information.
Currently Nsight Systems can only sample one process. Sampling here means that the
profilee will be stopped periodically, and backtraces of active threads will be recorded.
Most applications use stripped libraries. In this case, many symbols may stay
unresolved. If unstripped libraries exist, paths to them can be specified using the
Symbol locations... button. Symbol resolution happens on the host, and therefore does
not affect the performance of profiling on the target.
Additionally, debug versions of ELF files may be picked up from the target system. Refer
to Debug Versions of ELF Files for more information.
The Edit arguments... link will open an editor window, where every command line
argument is edited on a separate line. This is convenient when arguments contain spaces
or quotes.
Nsight Systems can sample one process tree. Sampling here means interrupting each
processor periodically. The sampling rate is defined in the project settings and is either
100 Hz, 1 kHz (default value), 2 kHz, 4 kHz, or 8 kHz.
On Windows, Nsight Systems can collect thread activity of one process tree. Collecting
thread activity means that each thread context switch event is logged and (optionally) a
backtrace is collected at the point that the thread is scheduled back for execution. Thread
states are displayed on the timeline.
If it was collected, the thread backtrace is displayed when hovering over a region where
the thread execution is blocked.
Symbol Locations
Symbol resolution happens on the host, and therefore does not affect the performance of
profiling on the target.
Press the Symbol locations... button to open the Configure debug symbols location
dialog.
‣ The filesystem on a QNX device might be mounted read-only. In that case, Nsight
Systems is not able to install the target-side binaries required to run the profiling
session. Please make sure that the target filesystem is writable before connecting to
the QNX target. For example, make sure the following command works:
echo XX > /xx && ls -l /xx
Chapter 3.
EXPORT FORMATS
0 - TRACE_PROCESS_EVENT_CUDA_RUNTIME
1 - TRACE_PROCESS_EVENT_CUDA_DRIVER
13 - TRACE_PROCESS_EVENT_CUDA_EGL_DRIVER
28 - TRACE_PROCESS_EVENT_CUDNN
29 - TRACE_PROCESS_EVENT_CUBLAS
33 - TRACE_PROCESS_EVENT_CUDNN_START
34 - TRACE_PROCESS_EVENT_CUDNN_FINISH
35 - TRACE_PROCESS_EVENT_CUBLAS_START
36 - TRACE_PROCESS_EVENT_CUBLAS_FINISH
67 - TRACE_PROCESS_EVENT_CUDABACKTRACE
77 - TRACE_PROCESS_EVENT_CUDA_GRAPH_NODE_CREATION
See CUPTI documentation for detailed information on collected event and data types.
NVTX Event Type Values
33 - NvtxCategory
34 - NvtxMark
39 - NvtxThread
59 - NvtxPushPopRange
60 - NvtxStartEndRange
75 - NvtxDomainCreate
76 - NvtxDomainDestroy
The difference between the text and textId columns is that if an NVTX event message
was passed via a call to the nvtxDomainRegisterString function, the message will be
available through the textId field; otherwise, the text field will contain the message, if one
was provided.
OpenGL Events
KHR event class values
62 - KhrDebugPushPopRange
63 - KhrDebugGpuPushPopRange
0x8249 - GL_DEBUG_SOURCE_THIRD_PARTY
0x824A - GL_DEBUG_SOURCE_APPLICATION
0x824C - GL_DEBUG_TYPE_ERROR
0x824D - GL_DEBUG_TYPE_DEPRECATED_BEHAVIOR
0x824E - GL_DEBUG_TYPE_UNDEFINED_BEHAVIOR
0x824F - GL_DEBUG_TYPE_PORTABILITY
0x8250 - GL_DEBUG_TYPE_PERFORMANCE
0x8251 - GL_DEBUG_TYPE_OTHER
0x8268 - GL_DEBUG_TYPE_MARKER
0x8269 - GL_DEBUG_TYPE_PUSH_GROUP
0x826A - GL_DEBUG_TYPE_POP_GROUP
0x826B - GL_DEBUG_SEVERITY_NOTIFICATION
0x9146 - GL_DEBUG_SEVERITY_HIGH
0x9147 - GL_DEBUG_SEVERITY_MEDIUM
0x9148 - GL_DEBUG_SEVERITY_LOW
27 - TRACE_PROCESS_EVENT_OS_RUNTIME
31 - TRACE_PROCESS_EVENT_OS_RUNTIME_START
32 - TRACE_PROCESS_EVENT_OS_RUNTIME_FINISH
41 - TRACE_PROCESS_EVENT_DX12_API
42 - TRACE_PROCESS_EVENT_DX12_WORKLOAD
43 - TRACE_PROCESS_EVENT_DX12_START
44 - TRACE_PROCESS_EVENT_DX12_FINISH
52 - TRACE_PROCESS_EVENT_DX12_DISPLAY
59 - TRACE_PROCESS_EVENT_DX12_CREATE_OBJECT
65 - TRACE_PROCESS_EVENT_DX12_DEBUG_API
75 - TRACE_PROCESS_EVENT_DX11_DEBUG_API
53 - TRACE_PROCESS_EVENT_VULKAN_API
54 - TRACE_PROCESS_EVENT_VULKAN_WORKLOAD
55 - TRACE_PROCESS_EVENT_VULKAN_START
56 - TRACE_PROCESS_EVENT_VULKAN_FINISH
60 - TRACE_PROCESS_EVENT_VULKAN_CREATE_OBJECT
66 - TRACE_PROCESS_EVENT_VULKAN_DEBUG_API
Vulkan Flags
VALID_BIT = 0x00000001
CACHE_HIT_BIT = 0x00000002
BASE_PIPELINE_ACCELERATION_BIT = 0x00000004
62 - TRACE_PROCESS_EVENT_SLI
63 - TRACE_PROCESS_EVENT_SLI_START
64 - TRACE_PROCESS_EVENT_SLI_FINISH
0 - P2P_SKIPPED
1 - P2P_EARLY_PUSH
2 - P2P_PUSH_FAILED
3 - P2P_2WAY_OR_PULL
4 - P2P_PRESENT
5 - P2P_DX12_INIT_PUSH_ON_WRITE
0 - None
101 - RestoreSegments
102 - PurgeSegments
103 - CleanupPrimary
104 - AllocatePagingBufferResources
105 - FreePagingBufferResources
106 - ReportVidMmState
107 - RunApertureCoherencyTest
108 - RunUnmapToDummyPageTest
109 - DeferredCommand
110 - SuspendMemorySegmentAccess
111 - ResumeMemorySegmentAccess
112 - EvictAndFlush
113 - CommitVirtualAddressRange
114 - UncommitVirtualAddressRange
115 - DestroyVirtualAddressAllocator
116 - PageInDevice
117 - MapContextAllocation
118 - InitPagingProcessVaSpace
200 - CloseAllocation
202 - ComplexLock
203 - PinAllocation
204 - FlushPendingGpuAccess
205 - UnpinAllocation
206 - MakeResident
207 - Evict
208 - LockInAperture
209 - InitContextAllocation
210 - ReclaimAllocation
211 - DiscardAllocation
212 - SetAllocationPriority
1000 - EvictSystemMemoryOfferList
0 - VIDMM_PAGING_QUEUE_TYPE_UMD
1 - VIDMM_PAGING_QUEUE_TYPE_Default
2 - VIDMM_PAGING_QUEUE_TYPE_Evict
3 - VIDMM_PAGING_QUEUE_TYPE_Reclaim
0 - DXGKETW_RENDER_COMMAND_BUFFER
1 - DXGKETW_DEFERRED_COMMAND_BUFFER
2 - DXGKETW_SYSTEM_COMMAND_BUFFER
3 - DXGKETW_MMIOFLIP_COMMAND_BUFFER
4 - DXGKETW_WAIT_COMMAND_BUFFER
5 - DXGKETW_SIGNAL_COMMAND_BUFFER
6 - DXGKETW_DEVICE_COMMAND_BUFFER
7 - DXGKETW_SOFTWARE_COMMAND_BUFFER
0 - DXGK_ENGINE_TYPE_OTHER
1 - DXGK_ENGINE_TYPE_3D
2 - DXGK_ENGINE_TYPE_VIDEO_DECODE
3 - DXGK_ENGINE_TYPE_VIDEO_ENCODE
4 - DXGK_ENGINE_TYPE_VIDEO_PROCESSING
5 - DXGK_ENGINE_TYPE_SCENE_ASSEMBLY
6 - DXGK_ENGINE_TYPE_COPY
7 - DXGK_ENGINE_TYPE_OVERLAY
8 - DXGK_ENGINE_TYPE_CRYPTO
1 = DXGK_INTERRUPT_DMA_COMPLETED
2 = DXGK_INTERRUPT_DMA_PREEMPTED
4 = DXGK_INTERRUPT_DMA_FAULTED
9 = DXGK_INTERRUPT_DMA_PAGE_FAULTED
0 = Queue_Packet
1 = Dma_Packet
2 = Paging_Queue_Packet
Driver Events
Load balance event type values
1 - LoadBalanceEvent_GPU
8 - LoadBalanceEvent_CPU
21 - LoadBalanceMasterEvent_GPU
22 - LoadBalanceMasterEvent_CPU
OpenMP Events
OpenMP event class values
78 - TRACE_PROCESS_EVENT_OPENMP
79 - TRACE_PROCESS_EVENT_OPENMP_START
80 - TRACE_PROCESS_EVENT_OPENMP_FINISH
15 - OPENMP_EVENT_KIND_TASK_CREATE
16 - OPENMP_EVENT_KIND_TASK_SCHEDULE
17 - OPENMP_EVENT_KIND_CANCEL
20 - OPENMP_EVENT_KIND_MUTEX_RELEASED
21 - OPENMP_EVENT_KIND_LOCK_INIT
22 - OPENMP_EVENT_KIND_LOCK_DESTROY
25 - OPENMP_EVENT_KIND_DISPATCH
26 - OPENMP_EVENT_KIND_FLUSH
27 - OPENMP_EVENT_KIND_THREAD
28 - OPENMP_EVENT_KIND_PARALLEL
29 - OPENMP_EVENT_KIND_SYNC_REGION_WAIT
30 - OPENMP_EVENT_KIND_SYNC_REGION
31 - OPENMP_EVENT_KIND_TASK
32 - OPENMP_EVENT_KIND_MASTER
33 - OPENMP_EVENT_KIND_REDUCTION
34 - OPENMP_EVENT_KIND_MUTEX_WAIT
35 - OPENMP_EVENT_KIND_CRITICAL_SECTION
36 - OPENMP_EVENT_KIND_WORKSHARE
1 - Barrier
2 - Implicit barrier
3 - Explicit barrier
4 - Implementation-dependent barrier
5 - Taskwait
6 - Taskgroup
1 - Initial task
2 - Implicit task
3 - Explicit task
1 - Task completed
2 - Task yielded to another task
3 - Task was cancelled
7 - Task was switched out for other reasons
1 - Loop region
2 - Sections region
3 - Single region (executor)
4 - Single region (waiting)
5 - Workshare region
6 - Distribute region
7 - Taskloop region
1 - Iteration
2 - Section
.mode column
.headers on
Default column width is determined by the data in the first row of results. If this doesn’t
work out well, you can specify widths manually.
.width 10 20 50
Note: The globalTid field includes both TID and PID values, while globalPid only contains
the PID value.
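The bit layout the SQL examples in this section rely on (TID in the low 24 bits, PID in the next 24 bits, matching expressions such as `globalTid / 0x1000000 % 0x1000000`) can be decoded outside of SQL as well; a minimal sketch:

```python
# Decode PID and TID from a globalTid value, using the same bit layout the
# SQL examples in this section use: TID in the low 24 bits, PID in the
# next 24 bits. The field widths are taken from those queries.
def decode_global_tid(global_tid):
    tid = global_tid % 0x1000000                 # low 24 bits
    pid = (global_tid // 0x1000000) % 0x1000000  # next 24 bits
    return pid, tid

gtid = (1234 * 0x1000000) + 5678   # synthetic value: PID=1234, TID=5678
print(decode_global_tid(gtid))     # (1234, 5678)
```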
Correlate CUDA Kernel Launches With CUDA API Kernel Launches
Results:
Results:
COUNT(*)
----------
1095
Find CUDA API Calls That Resulted in Original Graph Node Creation.
Results:
Results:
Results:
---------- -------------------------------------------------------
---------------------------------------------------------------------------------------------
19163 /tmp/nvidia/nsight_systems/streams/pid_19163_stderr.log
Thread Summary
Please note that Nsight Systems applies additional logic during sampling event
processing to work around lost events. This means that the results of the query below
might differ slightly from the ones shown in the “Analysis summary” tab.
Thread summary calculated using CPU cycles (when available).
SELECT
globalTid / 0x1000000 % 0x1000000 AS PID,
globalTid % 0x1000000 AS TID,
ROUND(100.0 * SUM(cpuCycles) /
(
SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS
GROUP BY globalTid / 0x1000000000000 % 0x100
),
2
) as CPU_utilization,
(SELECT value FROM StringIds WHERE id =
(
SELECT nameId FROM ThreadNames
WHERE ThreadNames.globalTid = COMPOSITE_EVENTS.globalTid
)
) as thread_name
FROM COMPOSITE_EVENTS
GROUP BY globalTid
ORDER BY CPU_utilization DESC
LIMIT 10;
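To see the query's shape in action without an exported report, the sketch below runs a simplified version (the per-CPU GROUP BY inside the inner subquery is dropped) against a tiny in-memory mock of the exported tables; all data values are invented:

```python
import sqlite3

# Minimal mock of an exported database, only to exercise the query shape
# from the Thread Summary example above; data values are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE COMPOSITE_EVENTS (globalTid INT, cpuCycles INT);
CREATE TABLE ThreadNames (globalTid INT, nameId INT);
CREATE TABLE StringIds (id INT, value TEXT);
INSERT INTO StringIds VALUES (1, 'worker'), (2, 'main');
""")
def gtid(pid, tid):               # same bit layout as the SQL examples
    return pid * 0x1000000 + tid
con.execute("INSERT INTO ThreadNames VALUES (?, 1)", (gtid(100, 7),))
con.execute("INSERT INTO ThreadNames VALUES (?, 2)", (gtid(100, 8),))
con.execute("INSERT INTO COMPOSITE_EVENTS VALUES (?, 300)", (gtid(100, 7),))
con.execute("INSERT INTO COMPOSITE_EVENTS VALUES (?, 100)", (gtid(100, 8),))

rows = con.execute("""
SELECT globalTid / 0x1000000 % 0x1000000 AS PID,
       globalTid % 0x1000000 AS TID,
       ROUND(100.0 * SUM(cpuCycles) /
             (SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS CPU_utilization,
       (SELECT value FROM StringIds WHERE id =
           (SELECT nameId FROM ThreadNames
            WHERE ThreadNames.globalTid = COMPOSITE_EVENTS.globalTid)) AS thread_name
FROM COMPOSITE_EVENTS
GROUP BY globalTid
ORDER BY CPU_utilization DESC;
""").fetchall()
print(rows)   # [(100, 7, 75.0, 'worker'), (100, 8, 25.0, 'main')]
```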
Results:
Thread running time may be calculated using scheduling data when PMU counter data
was not collected.
SELECT
globalTid / 0x1000000 % 0x1000000 AS PID,
globalTid % 0x1000000 AS TID,
(SELECT value FROM StringIds WHERE nameId == id) AS thread_name,
ROUND(100.0 * total_duration / (SELECT SUM(total_duration) FROM CPU_USAGE),
2) as CPU_utilization
FROM CPU_USAGE
ORDER BY CPU_utilization DESC;
Results:
Function Table
These examples demonstrate how to calculate Flat and BottomUp (for top level only)
views statistics.
To set up:
Results:
This example demonstrates how to calculate DX12 CPU frame durations and construct a
histogram from them.
SELECT
CAST((end - start) / 1000000.0 AS INT) AS duration_ms,
count(*)
FROM DX12_API_FPS
WHERE end IS NOT NULL
GROUP BY duration_ms
ORDER BY duration_ms;
Results:
duration_ms count(*)
----------- ----------
3 1
4 2
5 7
6 153
7 19
8 116
9 16
10 8
11 2
12 2
13 1
14 4
16 3
17 2
18 1
SELECT (CASE tag WHEN 8 THEN "BEGIN" WHEN 7 THEN "END" END) AS tag,
globalPid / 0x1000000 % 0x1000000 AS PID,
vmId, seqNo, contextId, timestamp, gpuId FROM FECS_EVENTS
WHERE tag in (7, 8) ORDER BY seqNo LIMIT 10;
Results:
WITH
event AS (
SELECT *
FROM NVTX_EVENTS
WHERE eventType IN (34, 59, 60) -- mark, push/pop, start/end
),
category AS (
SELECT
category,
domainId,
text AS categoryName
FROM NVTX_EVENTS
WHERE eventType == 33 -- new category
)
SELECT
start,
end,
globalTid,
eventType,
domainId,
category,
categoryName,
text
FROM event JOIN category USING (category, domainId)
ORDER BY start;
Results:
Results:
Results:
SELECT *
FROM SLI_P2P
WHERE resourceSize < 98304 AND start > 1568063100 AND end < 1579468901
ORDER BY resourceSize DESC;
Results:
Generic Events
Syscall usage histogram by PID:
Results:
PID total
---------- ----------
5551 32811
9680 3988
4328 1477
9564 1246
4376 1204
4377 1167
4357 656
4355 655
4356 640
4354 633
SELECT json_insert('{}',
'$.sourceId', sourceId,
'$.data', json(data)
)
FROM GENERIC_EVENT_SOURCES LIMIT 2;
SELECT json_insert('{}',
'$.typeId', typeId,
'$.sourceId', sourceId,
'$.data', json(data)
)
FROM GENERIC_EVENT_TYPES LIMIT 2;
SELECT json_insert('{}',
'$.rawTimestamp', rawTimestamp,
'$.timestamp', timestamp,
'$.typeId', typeId,
'$.data', json(data)
)
FROM GENERIC_EVENTS LIMIT 2;
Results:
json_insert('{}',
'$.sourceId', sourceId,
'$.data', json(data)
)
----------------------------------------------------------------------------------------------
{"sourceId":72057602627862528,"data":
{"Name":"FTrace","TimeSource":"ClockMonotonicRaw","SourceGroup":"FTrace"}}
json_insert('{}',
'$.typeId', typeId,
'$.sourceId', sourceId,
'$.data', json(data)
)
----------------------------------------------------------------------------------------------
{"typeId":72057602627862547,"sourceId":72057602627862528,"data":
{"Name":"raw_syscalls:sys_enter","Format":"\"NR %ld (%lx,
%lx, %lx, %lx, %lx, %lx)\", REC->id, REC->args[0], REC-
>args[1], REC->args[2], REC->args[3], REC->args[4], REC-
>args[5]","Fields":[{"Name":"common_pid","Prefix":"int","Suffix":""},
{"Name":"id","Prefix":"long","S
{"typeId":72057602627862670,"sourceId":72057602627862528,"data":
{"Name":"irq:irq_handler_entry","Format":"\"irq=%d name=%s\", REC->irq,
__get_str(name)","Fields":[{"Name":"common_pid","Prefix":"int","Suffix":""},
{"Name":"irq","Prefix":"int","Suffix":""},{"Name":"name","Prefix":"__data_loc
char[]","Suffix":""},{"Name":"common_type",
json_insert('{}',
'$.rawTimestamp', rawTimestamp,
'$.timestamp', timestamp,
'$.typeId', typeId,
'$.data', json(data)
)
----------------------------------------------------------------------------------------------
{"rawTimestamp":1183694330725221,"timestamp":6236683,"typeId":72057602627862670,"data":
{"common_pid":"0","irq":"66","name":"327696","common_type":"142","common_flags":"9","common_pr
{"rawTimestamp":1183694333695687,"timestamp":9207149,"typeId":72057602627862670,"data":
{"common_pid":"0","irq":"66","name":"327696","common_type":"142","common_flags":"9","common_pr
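The objects that json_insert() builds above can also be assembled client-side; a minimal sketch against a mock table (the sourceId value is taken from the results above, the rest of the row is invented, and the real export contains more columns):

```python
import json
import sqlite3

# Mock of GENERIC_EVENT_SOURCES with one invented row; in a real exported
# database the 'data' column already holds JSON text.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE GENERIC_EVENT_SOURCES (sourceId INT, data TEXT)")
con.execute("INSERT INTO GENERIC_EVENT_SOURCES VALUES (?, ?)",
            (72057602627862528, json.dumps({"Name": "FTrace"})))

# Assemble the same object json_insert() builds, but in Python.
for source_id, data in con.execute(
        "SELECT sourceId, data FROM GENERIC_EVENT_SOURCES LIMIT 2"):
    obj = {"sourceId": source_id, "data": json.loads(data)}
    print(json.dumps(obj))
```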
field has the key table_name. The titles of all the available tables can be found in
section SQLite Schema Reference.
{Event #1}
{Event #2}
...
{Event #N}
{Strings}
{Streams}
{Threads}
For easier grepping of JSON output, the --separate-strings switch may be used to
force manual splitting of strings, streams, and thread names data.
Example line split: nsys export --export-json --separate-strings
sample.nsys-rep -- -
Note that only the last few lines are shown here for clarity, and that carriage returns and
indents were added to avoid wrapping in the documentation.
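Consuming that layout can be sketched as follows, assuming one JSON object per line with event records first and the strings/streams/threads records at the end (the record keys used here — "type", "id", "value" — are invented for illustration; check an actual export for the real field names):

```python
import json

# Invented sample of a line-delimited export: event records first, then
# the strings/streams/threads records at the end of the file.
sample = "\n".join([
    '{"type": "Event", "id": 1}',
    '{"type": "Event", "id": 2}',
    '{"type": "String", "id": 10, "value": "kernelA"}',
])

events, trailing = [], []
for line in sample.splitlines():
    rec = json.loads(line)
    (events if rec.get("type") == "Event" else trailing).append(rec)

print(len(events), len(trailing))   # 2 1
```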
Chapter 4.
REPORT SCRIPTS
percent of the execution time of the APIs, kernels and memory operations listed, and not
a percentage of the application wall or CPU execution time.
This report combines data from the cuda_api_sum, cuda_gpu_kern_sum, and
cuda_gpu_mem_size_sum reports. It is very similar to the profile section of nvprof
--dependency-analysis.
Column
Usage:
column[:nohdr][:nolimit][:nofmt][:<width>[:<width>]...]
Arguments
‣ nohdr : Do not display the header
‣ nolimit : Remove the 100-character limit from auto-width columns. Note: this can
result in extremely wide columns.
‣ nofmt : Do not reformat numbers.
‣ <width>... : Define the explicit width of one or more columns. If the value "." is
given, the column will auto-adjust. If a width of 0 is given, the column will not be
displayed.
The column formatter presents data in vertical text columns. It is primarily designed to
be a human-readable format for displaying data on a console display.
Text data will be left-justified, while numeric data will be right-justified. If the data
overflows the available column width, it will be marked with a "…" character, to indicate
the data values were clipped. Clipping always occurs on the right-hand side, even for
numeric data.
Numbers will be reformatted to make them easier to visually scan and understand.
This includes adding thousands separators. This process requires that the string
representation of the number be converted into its native representation (integer or
floating point) and then converted back into a string for printing. The conversion
attempts to preserve elements of the number's presentation, such as the number of
decimal places or the use of scientific notation, but it is not always perfect (the number
will always be the same, but the presentation may not be). To disable the reformatting
process, use the argument nofmt.
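The reformatting described above can be sketched as follows. This is an illustrative Python approximation, not Nsight Systems source code: parse the string, add thousands separators, preserve the original decimal-place count, and leave scientific notation and non-numeric values untouched.

```python
# Sketch of column-formatter number reformatting: thousands separators
# are added while the original decimal-place count is preserved.
def reformat_number(text: str) -> str:
    try:
        if "e" in text.lower():           # keep scientific notation as-is
            return text
        if "." in text:
            decimals = len(text.split(".")[1])   # preserve decimal places
            return f"{float(text):,.{decimals}f}"
        return f"{int(text):,}"
    except ValueError:
        return text                       # non-numeric data is left alone

print(reformat_number("1234567"))    # -> 1,234,567
print(reformat_number("1234.500"))   # -> 1,234.500
print(reformat_number("kernel_a"))   # -> kernel_a
```

Passing nofmt would simply skip this round-trip and print the original string.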
If no explicit width is given, the columns auto-adjust their width based on the header
size and the first 100 lines of data. This auto-adjustment is limited to a maximum
width of 100 characters. To allow larger auto-width columns, pass the initial argument
nolimit. If the first 100 lines do not produce the correct column width, it is suggested
that explicit column widths be provided.
Table
Usage:
table[:nohdr][:nolimit][:nofmt][:<width>[:<width>]...]
Arguments
‣ nohdr : Do not display the header
‣ nolimit : Remove the 100-character limit from auto-width columns. Note: this can
result in extremely wide columns.
‣ nofmt : Do not reformat numbers.
‣ <width>... : Define the explicit width of one or more columns. If the value "." is
given, the column will auto-adjust. If a width of 0 is given, the column will not be
displayed.
The table formatter presents data in vertical text columns inside text boxes. Other than
the lines between columns, it is identical to the column formatter.
CSV
Usage:
csv[:nohdr]
Arguments
‣ nohdr : Do not display the header
The csv formatter outputs data as comma-separated values. This format is commonly
used for import into other data applications, such as spreadsheets and databases.
There are many different standards for CSV files. Most differences are in how escaping
is handled for data values that contain a comma or space.
This CSV formatter escapes commas by surrounding the whole value in double quotes.
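The quoting rule described above matches the minimal-quoting convention in Python's standard csv module, which can be used to reproduce it. The row values below are hypothetical:

```python
# Values containing a comma are wrapped in double quotes; others are not.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["Name", "Time (ns)"])
writer.writerow(["memcpy HtoD, pinned", 12345])   # value contains a comma
print(buf.getvalue())
```

The second row is emitted as `"memcpy HtoD, pinned",12345`, keeping the embedded comma intact for downstream importers.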
TSV
Usage:
tsv[:nohdr][:esc]
Arguments
‣ nohdr : Do not display the header
‣ esc : escape tab characters, rather than removing them
The tsv formatter outputs data as tab-separated values. This format is sometimes used
for import into other data applications, such as spreadsheets and databases.
Most TSV import/export systems disallow the tab character in data values. The formatter
will normally replace any tab characters with a single space. If the esc argument has
been provided, any tab characters will be replaced with the literal characters "\t".
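The tab handling above amounts to a single substitution per field, sketched here (illustrative, not the formatter's actual code):

```python
# By default, tabs inside a value become a single space; with the esc
# argument they become the literal two characters "\t".
def tsv_field(value: str, esc: bool = False) -> str:
    return value.replace("\t", "\\t" if esc else " ")

print(tsv_field("a\tb"))             # -> a b
print(tsv_field("a\tb", esc=True))   # -> a\tb
```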
JSON
Usage:
json
Arguments: no arguments
The json formatter outputs data as an array of JSON objects. Each object represents one
line of data, and uses the column names as field labels. All objects have the same fields.
The formatter attempts to recognize numeric values, as well as JSON keywords, and
converts them. Empty values are passed as an empty string (not null, and not as a
missing field).
At this time the formatter does not escape quotes, so if a data value includes double-quotation
marks, it will corrupt the JSON file.
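The output shape described above can be sketched as follows. The column names and row values are hypothetical; the point is the array-of-objects layout, numeric conversion, and empty-string handling:

```python
# One JSON object per data row, keyed by column name. Numeric strings
# are converted; empty values stay "" rather than becoming null.
import json

columns = ["Name", "Calls", "Avg"]
rows = [["cudaMalloc", "4", "1520.5"], ["cudaFree", "4", ""]]

def convert(v):
    try:
        return int(v)
    except ValueError:
        try:
            return float(v)
        except ValueError:
            return v          # empty or non-numeric values pass through

objs = [dict(zip(columns, (convert(v) for v in r))) for r in rows]
print(json.dumps(objs))
```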
HDoc
Usage:
hdoc[:title=<title>][:css=<URL>]
Arguments:
‣ title : string for HTML document title
‣ css : URL of CSS document to include
The hdoc formatter generates a complete, verifiable (mostly), standalone HTML
document. It is designed to be opened in a web browser, or included in a larger
document via an <iframe>.
HTable
Usage:
htable
Arguments: no arguments
The htable formatter outputs a raw HTML <table> without any of the surrounding
HTML document. It is designed to be included into a larger HTML document. Although
most web browsers will open and display the document, it is better to use the hdoc
format for this type of use.
Chapter 5.
MIGRATING FROM NVIDIA NVPROF
Next Steps
NVIDIA Visual Profiler (NVVP) and NVIDIA nvprof are deprecated. New GPUs and
features will not be supported by those tools. We encourage you to make the move to
Nsight Systems now. For additional information, suggestions, and rationale, see the blog
series in Other Resources.
Chapter 6.
PROFILING IN A DOCKER ON LINUX
DEVICES
Download the default seccomp profile file, default.json, relevant to your Docker version.
If perf_event_open is already listed in the file as guarded by CAP_SYS_ADMIN, then
remove the perf_event_open line. Add the following lines under "syscalls" and save
the resulting file as default_with_perf.json.
{
"name": "perf_event_open",
"action": "SCMP_ACT_ALLOW",
"args": []
},
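The edit above can be automated with a short script. This sketch operates on an in-memory stand-in for default.json rather than the real file; in practice you would load your downloaded profile, apply the same append, and save the result as default_with_perf.json:

```python
# Append a rule allowing perf_event_open under "syscalls", matching the
# snippet shown above. The profile contents here are a minimal stand-in.
import json

profile = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}

profile["syscalls"].append({
    "name": "perf_event_open",
    "action": "SCMP_ACT_ALLOW",
    "args": [],
})

print(json.dumps(profile, indent=4))
```

Remember to first remove any existing perf_event_open entry guarded by CAP_SYS_ADMIN, as described above.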
Then you will be able to use the following switch when starting the Docker to apply the
new seccomp profile.
--security-opt seccomp=default_with_perf.json
There is a known issue where Docker collections terminate prematurely with older
versions of the driver and the CUDA Toolkit. If collection is ending unexpectedly, please
update to the latest versions.
After the Docker has been started, use the Nsight Systems CLI to launch a collection
within the Docker. The resulting .qdstrm file can be imported into the Nsight Systems
host like any other CLI result.
Ports
These ports can be published from the container to provide access to the Docker
container:
Volumes
Environment variables
Examples
With VNC access on port 5916:
sudo docker run -p 5916:5900/tcp -ti nsys-ui-vnc:1.0
With VNC access on port 5916 and HTTP access on port 8080:
sudo docker run -p 5916:5900/tcp -p 8080:80/tcp -ti nsys-ui-vnc:1.0
With VNC access on port 5916, HTTP access on port 8080, and RDP access on port 33890:
sudo docker run -p 5916:5900/tcp -p 8080:80/tcp -p 33890:3389/tcp -ti nsys-ui-vnc:1.0
With VNC access on port 5916, shared "HOME" folder from the host, VNC server
resolution 3840x2160, and custom VNC password:
sudo docker run -p 5916:5900/tcp -v $HOME:/mnt/host/home -e NSYS_WINDOW_WIDTH=3840 -e NSYS_WINDOW_HEIGHT=2160 -e VNC_PASSWORD=7654321 -ti nsys-ui-vnc:1.0
With VNC access on port 5916, shared "HOME" folder from the host, and the projects
folder to access reports created by the Nsight Systems GUI in the container:
sudo docker run -p 5916:5900/tcp -v $HOME:/mnt/host/home -v /opt/NsysProjects:/mnt/host/Projects -ti nsys-ui-vnc:1.0
Chapter 7.
DIRECT3D TRACE
Nsight Systems has the ability to trace both the Direct3D 11 API and the Direct3D 12 API
on Windows targets.
SLI Trace
Trace SLI queries and peer-to-peer transfers of D3D11 applications. Requires SLI
hardware and an active SLI profile definition in the NVIDIA console.
The Command List Creation row displays time periods when command lists
were being created. This enables developers to improve their application’s multi-
threaded command list creation. Command list creation time period is measured
between the call to ID3D12GraphicsCommandList::Reset and the call to
ID3D12GraphicsCommandList::Close.
The GPU row shows a compressed view of the D3D12 queue activity, color-coded by the
queue type. Expanding it will show the individual queues and their corresponding API
calls.
A Command Queue row is displayed for each D3D12 command queue created by the
profiled application. The row’s header displays the queue's running index and its type
(Direct, Compute, Copy).
The DX12 API Memory Ops row displays all API memory operations and non-persistent
resource mappings. Event ranges in the row are color-coded by the heap type they
belong to (Default, Readback, Upload, Custom, or CPU-Visible VRAM), with usage
warnings highlighted in yellow. A breakdown of the operations can be found by
expanding the row to show rows for each individual heap type.
In addition, you can see the PIX command queue CPU-side performance markers, GPU-side
performance markers, and the GPU Command List performance markers, each in
its own row.
Detecting which CPU thread was blocked by a fence can be difficult in complex apps
that run tens of CPU threads. The timeline view displays the 3 operations involved:
‣ The CPU thread pushing a signal command and fence value into the command
queue. This is displayed on the DX12 Synchronization sub-row of the calling thread.
‣ The GPU executing that command, setting the fence value and signaling the fence.
This is displayed on the GPU Queue Synchronization sub-row.
‣ The CPU thread calling a Win32 wait API to block-wait until the fence is signaled.
This is displayed on the Thread's OS runtime libraries row.
Clicking one of these will highlight it and the corresponding other two calls.
Chapter 8.
WDDM QUEUES
The Windows Display Driver Model (WDDM) architecture uses queues to send work
packets from the CPU to the GPU. Each D3D device in each process is associated
with one or more contexts. Graphics, compute, and copy commands that the profiled
application uses are associated with a context, batched in a command buffer, and pushed
into the relevant queue associated with that context.
Nsight Systems can capture the state of these queues during the trace session.
Enabling the "Collect additional range of ETW events" option will also capture extended
DxgKrnl events from the Microsoft-Windows-DxgKrnl provider, such as context
status, allocations, sync wait, signal events, etc.
A command buffer in a WDDM queue may have one of the following types:
‣ Render
‣ Deferred
‣ System
‣ MMIOFlip
‣ Wait
‣ Signal
‣ Device
‣ Software
It may also be marked as a Present buffer, indicating that the application has finished
rendering and requests to display the source surface.
See the Microsoft documentation for the WDDM architecture and the
DXGKETW_QUEUE_PACKET_TYPE enumeration.
To retain the .etl trace files captured, so that they can be viewed in other tools (e.g.
GPUView), change the "Save ETW log files in project folder" option under "Profile
Behavior" in Nsight Systems's global Options dialog. The .etl files will appear in the
same folder as the .nsys-rep file, accessible by right-clicking the report in the Project
Explorer and choosing "Show in Folder...". Data collected from each ETW provider will
appear in its own .etl file, and an additional .etl file named "Report XX-Merged-*.etl",
containing the events from all captured sources, will be created as well.
Chapter 9.
WDDM HW SCHEDULER
Chapter 10.
VULKAN API TRACE
The Command Buffer Creation row displays time periods when command buffers were
being created. This enables developers to improve their application’s multi-threaded
command buffer creation. Command buffer creation time period is measured between
the call to vkBeginCommandBuffer and the call to vkEndCommandBuffer.
A Queue row is displayed for each Vulkan queue created by the profiled application.
The API sub-row displays time periods where vkQueueSubmit was called. The GPU
Workload sub-row displays time periods where workloads were executed by the GPU.
In addition, you can see Vulkan debug util labels on both the CPU and the GPU.
The Vulkan Memory Operations row contains an aggregation of all the Vulkan host-
side memory operations, such as host-blocking writes and reads or non-persistent map-
unmap ranges.
The row is separated into sub-rows by heap index and memory type - the tooltip for
each row and the ranges inside show the heap flags and the memory property flags.
Chapter 11.
STUTTER ANALYSIS
The frame duration row displays live FPS statistics for the current timeline viewport.
Values shown are:
1. Number of CPU frames shown of the total number captured
2. Average, minimum, and maximum CPU frame time of the currently displayed time
range
3. Average FPS value for the currently displayed frames
4. The 99th percentile value of the frame lengths (such that only 1% of the frames in the
range are longer than this value).
The values will update automatically when scrolling, zooming or filtering the timeline
view.
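The statistics listed above can be computed from the frame durations visible in the viewport. This sketch uses a hypothetical set of durations in milliseconds:

```python
# Live FPS statistics for a window of CPU frame durations (ms).
frames_ms = [16.7, 16.9, 17.1, 16.8, 33.4, 16.6, 16.7, 17.0, 16.8, 16.9]

avg = sum(frames_ms) / len(frames_ms)
fps = 1000.0 / avg
# 99th percentile: only 1% of the frames in the range are longer than this
p99 = sorted(frames_ms)[min(len(frames_ms) - 1, round(0.99 * len(frames_ms)))]

print(f"frames shown: {len(frames_ms)}")
print(f"avg/min/max frame time: {avg:.2f}/{min(frames_ms):.1f}/{max(frames_ms):.1f} ms")
print(f"average FPS: {fps:.1f}")
print(f"99th percentile: {p99} ms")
```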
The stutter row highlights frames that are significantly longer than the other frames in
their immediate vicinity.
The stutter row uses an algorithm that compares the duration of each frame to the
median duration of the surrounding 19 frames. Duration difference under 4 milliseconds
is never considered a stutter, to avoid cluttering the display with frames whose absolute
stutter is small and not noticeable to the user.
For example, if the stutter threshold is set at 20%:
1. Median duration is 10 ms. Frame with 13 ms time will not be reported (relative
difference > 20%, absolute difference < 4 ms)
2. Median duration is 60 ms. Frame with 71 ms time will not be reported (relative
difference < 20%, absolute difference > 4 ms)
3. Median duration is 60 ms. Frame with 80 ms is a stutter (relative difference > 20%,
absolute difference > 4 ms, both conditions met)
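The rule behind these three examples can be sketched as a single predicate. This is an illustrative reimplementation of the documented conditions, not the product's source:

```python
# A frame is a stutter only if it exceeds the surrounding 19-frame
# median by BOTH the relative threshold and the 4 ms absolute minimum.
def is_stutter(frame_ms, median_ms, threshold=0.20, min_abs_ms=4.0):
    delta = frame_ms - median_ms
    return delta > median_ms * threshold and delta > min_abs_ms

print(is_stutter(13, 10))  # False: over 20% relative, but only 3 ms over
print(is_stutter(71, 60))  # False: 11 ms over, but under 20% relative
print(is_stutter(80, 60))  # True: both conditions met
```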
OSC detection
The "19 frame window median" algorithm by itself may not work well with some cases
of "oscillation" (consecutive fast and slow frames), resulting in some false positives. The
median duration is not meaningful in cases of oscillation and can be misleading.
To address this issue and identify oscillating frames, the following method is applied:
1. For every frame, calculate the median duration, 1st and 3rd quartiles of 19-frames
window.
2. Calculate the delta and ratio between 1st and 3rd quartiles.
3. If the 90th percentile of 3rd – 1st quartile delta array > 4 ms AND the 90th percentile
of 3rd/1st quartile array > 1.2 (120%) then mark the results with "OSC" text.
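The per-window part of this method (steps 1 and 2) can be sketched as below; the full algorithm then takes the 90th percentile of these deltas and ratios across all windows. The frame durations are hypothetical:

```python
# 1st and 3rd quartiles of a 19-frame duration window, and the delta
# and ratio between them, as used by the OSC check.
import statistics

def window_osc_stats(window_ms):
    q1, _, q3 = statistics.quantiles(window_ms, n=4)
    return q3 - q1, q3 / q1

# Oscillating window: alternating fast (10 ms) and slow (30 ms) frames.
window = ([10.0, 30.0] * 10)[:19]
delta, ratio = window_osc_stats(window)
print(delta > 4.0 and ratio > 1.2)   # prints True: contributes to "OSC"
```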
Right-clicking the Frame Duration row caption lets you choose the target frame rate (30,
60, 90 or custom frames per second).
By clicking the Customize FPS Display option, a customization dialog pops up. In the
dialog, you can now define the frame duration threshold to customize the view of the
potentially problematic frames. In addition, you can define the threshold for the stutter
analysis frames.
The CPU Frame Duration row displays the CPU frame duration measured between the
ends of consecutive frame boundary calls:
‣ The OpenGL frame boundaries are eglSwapBuffers/glXSwapBuffers/
SwapBuffers calls.
‣ The D3D11 and D3D12 frame boundaries are IDXGISwapChainX::Present calls.
‣ The Vulkan frame boundaries are vkQueuePresentKHR calls.
The timing of the actual frame boundary calls can be seen in the blue bar at the bottom
of the CPU Frame Duration row.
The GPU Frame Duration row displays the time measured between
‣ The start time of the first GPU workload execution of this frame.
‣ The start time of the first GPU workload execution of the next frame.
Reflex SDK
NVIDIA Reflex SDK is a series of NVAPI calls that allow applications to integrate the
Ultra Low Latency driver feature more directly into their game to further optimize
synchronization between simulation and rendering stages and lower the latency
between user input and final image rendering. For more details about Reflex SDK, see
Reflex SDK Site.
Nsight Systems will automatically capture NVAPI functions when Direct3D 11,
Direct3D 12, or Vulkan API trace is enabled.
The Reflex SDK row displays timeline ranges for the following types of latency markers:
‣ RenderSubmit.
‣ Simulation.
‣ Present.
‣ Driver.
‣ OS Render Queue.
‣ GPU Render.
Note that this is not the same as the CUDA kernel memory allocation graph; see CUDA
GPU Memory Graph for that functionality.
Chapter 12.
OPENMP TRACE
Nsight Systems for Linux is capable of capturing information about OpenMP events.
This functionality is built on the OpenMP Tools Interface (OMPT); full support is
available only for runtime libraries that support the tools interface defined in OpenMP
5.0 or greater.
For example, the LLVM OpenMP runtime library partially implements the tools
interface. If you use a PGI compiler <= 20.4 to build your OpenMP applications, add the
-mp=libomp switch to use the LLVM OpenMP runtime and enable OMPT-based tracing.
If you use Clang, make sure the LLVM OpenMP runtime library you link to was
compiled with the tools interface enabled.
Note: The raw OMPT events are used to generate ranges indicating the runtime of
OpenMP operations and constructs.
Example screenshot:
Chapter 13.
OS RUNTIME LIBRARIES TRACE
You can also use Skip if shorter than. This will skip calls shorter than the given
threshold. Enabling this option will improve performance as well as reduce noise on
the timeline. We strongly encourage you to skip OS runtime library calls shorter than 1
μs.
Note that even if a call is determined as potentially blocking, there is a chance that it
may not actually block after a few cycles have elapsed. The call will still be traced in this
scenario.
13.2. Limitations
‣ Nsight Systems only traces syscall wrappers exposed by the C runtime. It is not able
to trace syscalls invoked through assembly code.
‣ Additional thread states, as well as backtrace collection on long calls, are only
enabled if sampling is turned on.
‣ It is not possible to configure the depth and duration threshold when collecting
backtraces. Currently, only OS runtime libraries calls longer than 80 μs will generate
a backtrace with a maximum of 24 frames. This limitation will be removed in a
future version of the product.
‣ It is required to compile your application and libraries with the -funwind-tables
compiler flag in order for Nsight Systems to unwind the backtraces correctly.
POSIX Threads
pthread_barrier_wait
pthread_cancel
pthread_cond_broadcast
pthread_cond_signal
pthread_cond_timedwait
pthread_cond_wait
pthread_create
pthread_join
pthread_kill
pthread_mutex_lock
pthread_mutex_timedlock
pthread_mutex_trylock
pthread_rwlock_rdlock
pthread_rwlock_timedrdlock
pthread_rwlock_timedwrlock
pthread_rwlock_tryrdlock
pthread_rwlock_trywrlock
pthread_rwlock_wrlock
pthread_spin_lock
pthread_spin_trylock
pthread_timedjoin_np
pthread_tryjoin_np
pthread_yield
sem_timedwait
sem_trywait
sem_wait
I/O
aio_fsync
aio_fsync64
aio_suspend
aio_suspend64
fclose
fcloseall
fflush
fflush_unlocked
fgetc
fgetc_unlocked
fgets
fgets_unlocked
fgetwc
fgetwc_unlocked
fgetws
fgetws_unlocked
flockfile
fopen
fopen64
fputc
fputc_unlocked
fputs
fputs_unlocked
fputwc
fputwc_unlocked
fputws
fputws_unlocked
fread
fread_unlocked
freopen
freopen64
ftrylockfile
fwrite
fwrite_unlocked
getc
getc_unlocked
getdelim
getline
getw
getwc
getwc_unlocked
lockf
lockf64
mkfifo
mkfifoat
posix_fallocate
posix_fallocate64
putc
putc_unlocked
putwc
putwc_unlocked
Miscellaneous
forkpty
popen
posix_spawn
posix_spawnp
sigwait
sigwaitinfo
sleep
system
usleep
Chapter 14.
NVTX TRACE
The NVIDIA Tools Extension Library (NVTX) is a powerful mechanism that allows
users to manually instrument their application. Nsight Systems can then collect the
information and present it on the timeline.
Nsight Systems supports version 3.0 of the NVTX specification.
The following features are supported:
‣ Domains
nvtxDomainCreate(), nvtxDomainDestroy()
nvtxDomainRegisterString()
‣ Push-pop ranges (nested ranges that start and end in the same thread).
nvtxRangePush(), nvtxRangePushEx()
nvtxRangePop()
nvtxDomainRangePushEx()
nvtxDomainRangePop()
‣ Start-end ranges (ranges that are global to the process and are not restricted to a
single thread)
nvtxRangeStart(), nvtxRangeStartEx()
nvtxRangeEnd()
nvtxDomainRangeStartEx()
nvtxDomainRangeEnd()
‣ Marks
nvtxMark(), nvtxMarkEx()
nvtxDomainMarkEx()
‣ Thread names
nvtxNameOsThread()
‣ Categories
nvtxNameCategory()
nvtxDomainNameCategory()
To learn more about specific features of NVTX, please refer to the NVTX header file:
nvToolsExt.h or the NVTX documentation.
Note: Range annotations should be matched carefully. If many ranges are opened but
not closed, Nsight Systems has no meaningful way to visualize it. A rule of thumb is
to not have more than a couple dozen ranges open at any point in time. Nsight Systems
does not support reports with many unclosed ranges.
NVTX domains enable scoping of annotations. Unless specified differently, all events
and annotations are in the default domain. Additionally, categories can be used to group
events.
Nsight Systems gives the user the ability to include or exclude NVTX events from a
particular domain. This can be especially useful if you are profiling across multiple
libraries and are only interested in NVTX events from some of them.
This functionality is also available from the CLI. See the CLI documentation for
--nvtx-domain-include and --nvtx-domain-exclude for more details.
Categories that are set by the user will be recognized and displayed in the GUI.
Chapter 15.
CUDA TRACE
Near the bottom of the timeline row tree, the GPU node will appear and contain a
CUDA node. Within the CUDA node, each CUDA context used within the process will
be shown along with its corresponding CUDA streams. Streams will contain memory
operations and kernel launches on the GPU. Kernel launches are represented in blue,
while memory transfers are displayed in red.
The easiest way to capture CUDA information is to launch the process from Nsight
Systems, and it will set up the environment for you. To do so, simply set up a normal
launch and select the Collect CUDA trace checkbox.
For Nsight Systems Workstation Edition this looks like:
cudaDeviceReset(), and then let the application gracefully exit (as opposed to
crashing).
This option allows flushing CUDA trace data even before the device is finalized.
However, it might introduce additional overhead to a random CUDA Driver or
CUDA Runtime API call.
‣ Skip some API calls — avoids tracing insignificant CUDA Runtime
API calls (namely, cudaConfigureCall(), cudaSetupArgument(),
cudaHostGetDevicePointers()). Not tracing these functions allows Nsight
Systems to significantly reduce the profiling overhead, without losing any
interesting data. (See CUDA Trace Filters, below)
‣ Collect GPU Memory Usage - collects information used to generate a graph of
CUDA allocated memory across time. Note that this will increase overhead. See
section on CUDA GPU Memory Allocation Graph below.
‣ Collect Unified Memory CPU page faults - collects information on page faults that
occur when CPU code tries to access a memory page that resides on the device. See
section on Unified Memory CPU Page Faults in the Unified Memory Transfer
Trace documentation below.
‣ Collect Unified Memory GPU page faults - collects information on page faults that
occur when GPU code tries to access a memory page that resides on the CPU. See
section on Unified Memory GPU Page Faults in the Unified Memory Transfer
Trace documentation below.
‣ Collect CUDA Graph trace - by default, CUDA tracing will collect and expose
information on a whole graph basis. The user can opt to collect on a node per node
basis. See section on CUDA Graph Trace below.
‣ For Nsight Systems Workstation Edition, Collect cuDNN trace, Collect cuBLAS
trace, Collect OpenACC trace - selects which (if any) extra libraries that depend on
CUDA to trace.
OpenACC versions 2.0, 2.5, and 2.6 are supported when using PGI runtime version
15.7 or greater and not compiling statically. In order to differentiate constructs, a PGI
runtime of 16.1 or later is required. Note that Nsight Systems Workstation Edition
does not support the GCC implementation of OpenACC at this time.
Note: If your application crashes before all collected CUDA trace data has been copied
out, some or all data might be lost and not present in the report.
Note: Nsight Systems will not have information about CUDA events that were still in
device buffers when analysis terminated. If you are using cudaProfilerApi to control
analysis, it is a good idea to call cudaDeviceReset before ending analysis.
memory graph generated during stutter analysis on the Windows target (see Stutter
Memory Trace)
Below, in the report on the left, memory is allocated and freed during the collection. In
the report on the right, memory is allocated, but not freed during the collection.
HtoD transfer indicates the CUDA kernel accessed managed memory that was residing
on the host, so the kernel execution paused and transferred the data to the device. Heavy
traffic here will incur performance penalties in CUDA kernels, so consider using manual
cudaMemcpy operations from pinned host memory instead.
PtoP transfer indicates the CUDA kernel accessed managed memory that was residing
on a different device, so the kernel execution paused and transferred the data to this
device. Heavy traffic here will incur performance penalties, so consider using manual
cudaMemcpyPeer operations to transfer from other devices' memory instead. The row
showing these events is for the destination device -- the source device is shown in the
tooltip for each transfer event.
DtoH transfer indicates the CPU accessed managed memory that was residing on a
CUDA device, so the CPU execution paused and transferred the data to system memory.
Heavy traffic here will incur performance penalties in CPU code, so consider using
manual cudaMemcpy operations from pinned host memory instead.
Some Unified Memory transfers are highlighted with red to indicate potential
performance issues:
Note: Collecting Unified Memory CPU page faults can cause overhead of up to 70% in
testing. Please use this functionality only when needed.
Note: Collecting Unified Memory GPU page faults can cause overhead of up to 70% in
testing. Please use this functionality only when needed.
When CUDA graph trace is set to graph, the user sees each graph as one item on the
timeline:
When CUDA graph trace is set to node, the user sees each graph as a set of nodes on
the timeline:
Tracing CUDA graphs at the graph level rather than tracing the underlying nodes
results in significantly less overhead. This option is only available with CUDA driver
515.43 or higher.
Chapter 16.
OPENACC TRACE
Nsight Systems for Linux x86_64 and Power targets is capable of capturing information
about OpenACC execution in the profiled process.
OpenACC versions 2.0, 2.5, and 2.6 are supported when using PGI runtime version 15.7
or later. In order to differentiate constructs (see tooltip below), a PGI runtime of 16.0 or
later is required. Note that Nsight Systems does not support the GCC implementation of
OpenACC at this time.
Under the CPU rows in the timeline tree, each thread that uses OpenACC will show
OpenACC trace information. You can click on an OpenACC API call to see the
correlation with the underlying CUDA API calls (highlighted in teal):
If the OpenACC API results in GPU work, that will also be highlighted:
Hovering over a particular OpenACC construct will bring up a tooltip with details about
that construct:
To capture OpenACC information from the Nsight Systems GUI, select the Collect
OpenACC trace checkbox under Collect CUDA trace configurations. Note that turning
on OpenACC tracing will also turn on CUDA tracing.
Please note that if your application crashes before all collected OpenACC trace data has
been copied out, some or all data might be lost and not present in the report.
Chapter 17.
OPENGL TRACE
OpenGL and OpenGL ES APIs can be traced to assist in the analysis of CPU and GPU
interactions.
A few usage examples are:
1. Visualize how long eglSwapBuffers (or similar) is taking.
2. API trace can easily show correlations between thread state and graphics driver's
behavior, uncovering where the CPU may be waiting on the GPU.
3. Spot bubbles of opportunity on the GPU, where more GPU workload could be
created.
4. Use KHR_debug extension to trace GL events on both the CPU and GPU.
The OpenGL trace feature in Nsight Systems consists of two different activities, which
will be shown in the CPU rows for those threads:
‣ CPU trace: interception of API calls that an application makes to APIs (such as
OpenGL, OpenGL ES, EGL, GLX, WGL, etc.).
‣ GPU trace (or workload trace): trace of GPU workload (activity) triggered by use
of OpenGL or OpenGL ES. Since draw calls are executed back-to-back, the GPU
workload trace ranges include many OpenGL draw calls and operations in order to
optimize performance overhead, rather than tracing each individual operation.
To collect GPU trace, the glQueryCounter() function is used to measure how much
time batches of GPU workload take to complete.
Ranges defined by the KHR_debug calls are represented similarly to OpenGL API and
OpenGL GPU workload trace. GPU ranges in this case represent incremental draw cost.
They cannot fully account for GPUs that can execute multiple draw calls in parallel. In
this case, Nsight Systems will not show overlapping GPU ranges.
Chapter 18.
CUSTOM ETW TRACE
Use the custom ETW trace feature to enable and collect any manifest-based ETW log.
The collected events are displayed on the timeline on dedicated rows for each event
type.
Custom ETW is available on Windows target machines.
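From the CLI, a manifest-based provider can be specified with the --etw-provider switch (a sketch; the provider name and GUID below are hypothetical — substitute the values from your provider's manifest, and verify the switch on your version of the nsys CLI):

```shell
# Collect events from a custom manifest-based ETW provider on Windows.
nsys profile --etw-provider="MyCustomProvider,11111111-2222-3333-4444-555555555555" app.exe
```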
To retain the .etl trace files captured, so that they can be viewed in other tools (e.g.
GPUView), change the "Save ETW log files in project folder" option under "Profile
Behavior" in the Nsight Systems global Options dialog. The .etl files will appear in the
same folder as the .nsys-rep file, accessible by right-clicking the report in the Project
Explorer and choosing "Show in Folder...". Data collected from each ETW provider will
appear in its own .etl file, and an additional .etl file named "Report XX-Merged-*.etl",
containing the events from all captured sources, will be created as well.
Chapter 19.
GPU METRICS
Overview
The GPU Metrics feature is intended to identify performance limiters in applications that
use the GPU for computation and graphics. It uses periodic sampling to gather
performance metrics and detailed timing statistics associated with different GPU
hardware units, taking advantage of specialized hardware to capture this data in a
single pass with minimal overhead.
Note: GPU Metrics will give you precise device-level information, but it does not know
which process or context is involved. GPU context switch trace provides less precise
information, but will give you process and context information.
These metrics provide an overview of GPU efficiency over time within compute,
graphics, and input/output (IO) activities such as:
Note: Permissions: Elevated permissions are required. On Linux use sudo to elevate
privileges. On Windows the user must run from an admin command prompt or accept
the UAC escalation dialog. See Permissions Issues and Performance Counters for more
information.
Note: Tensor Core: If you run nsys profile --gpu-metrics-device all, the Tensor Core
utilization can be found in the GUI under the SM instructions/Tensor Active row.
Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core
utilization since there are other overheads. In general, the more computation-intensive
an operation is, the higher the Tensor Core utilization rate the CUDA kernel can
achieve.
By default, the first metric set which supports all selected GPUs is used, but you can
manually select another metric set from the list. To see the available metric sets, use:
$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
[0] [tu10x] General Metrics for NVIDIA TU10x (any frequency)
[1] [tu11x] General Metrics for NVIDIA TU11x (any frequency)
[2] [ga100] General Metrics for NVIDIA GA100 (any frequency)
[3] [ga10x] General Metrics for NVIDIA GA10x (any frequency)
[4] [tu10x-gfxt] Graphics Throughput Metrics for NVIDIA TU10x (frequency
>= 10kHz)
[5] [ga10x-gfxt] Graphics Throughput Metrics for NVIDIA GA10x (frequency
>= 10kHz)
[6] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x
(frequency >= 10kHz)
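A specific set can then be selected by name (a sketch using a set name from the list above; ./myapp is a placeholder):

```shell
# Sample all GPUs with the Graphics Throughput metric set for GA10x.
nsys profile --gpu-metrics-device=all --gpu-metrics-set=ga10x-gfxt ./myapp
```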
By default, the sampling frequency is set to 10 kHz, but you can manually set it from
10 Hz to 200 kHz using
--gpu-metrics-frequency=<value>
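For example (a sketch; ./myapp is a placeholder):

```shell
# Sample GPU metrics at 20 kHz instead of the default 10 kHz.
nsys profile --gpu-metrics-device=all --gpu-metrics-frequency=20000 ./myapp
```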
Select the GPUs dropdown to pick which GPUs you wish to sample.
Select the Metric set: dropdown to choose which available metric set you would like to
sample.
Note that metric sets for GPUs that are not being sampled will be greyed out.
Sampling frequency
Sampling frequency can be selected from the range of 10 Hz - 200 kHz. The default value
is 10 kHz.
The maximum sampling frequency without buffer overflow events depends on the GPU
(SM count), GPU load intensity, and overall system load. The bigger the chip and the
higher the load, the lower the maximum frequency. If you need a higher frequency, you
can increase it until you get a "Buffer overflow" message in the Diagnostics Summary
report page.
Each metric set has a recommended sampling frequency range in its description. These
ranges are approximate. If you observe Inconsistent Data or Missing Data ranges on the
timeline, please try a frequency closer to the recommended range.
Available metrics
‣ GPC Clock Frequency - gpc__cycles_elapsed.avg.per_second
The average GPC clock frequency in hertz. In public documentation the GPC clock
may be called the "Application" clock, "Graphic" clock, "Base" clock, or "Boost" clock.
Note: The collection mechanism for GPC can result in a small fluctuation between
samples.
‣ SYS Clock Frequency - sys__cycles_elapsed.avg.per_second
The average SYS clock frequency in hertz. The GPU front end (command processor),
copy engines, and the performance monitor run at the SYS clock. On Turing and
NVIDIA GA100 GPUs the sampling frequency is based upon a period of SYS clocks
(not time) so samples per second will vary with SYS clock. On NVIDIA GA10x
GPUs the sampling frequency is based upon a fixed frequency clock. The maximum
frequency scales linearly with the SYS clock.
‣ GR Active - gr__cycles_active.sum.pct_of_peak_sustained_elapsed
The percentage of cycles the graphics/compute engine is active. The graphics/
compute engine is active if there is any work in the graphics pipe or if the compute
pipe is processing work.
GA100 MIG - MIG is not yet supported. This counter will report the activity of the
primary GR engine.
‣ Sync Compute In Flight -
gr__dispatch_cycles_active_queue_sync.avg.pct_of_peak_sustained_elapsed
The percentage of cycles with synchronous compute in flight.
CUDA: CUDA will only report a synchronous queue in the case of MPS configured
with 64 sub-contexts. Synchronous refers to work submitted in VEID=0.
Graphics: This will be true if any compute work submitted from the direct queue is
in flight.
‣ Async Compute in Flight -
gr__dispatch_cycles_active_queue_async.avg.pct_of_peak_sustained_elapsed
The percentage of cycles with asynchronous compute in flight.
CUDA: CUDA reports all compute work as asynchronous. The one exception is if MPS
is configured and all 64 sub-contexts are in use; 1 sub-context (VEID=0) will report as
synchronous.
Graphics: This will be true if any compute work submitted from a compute queue is
in flight.
‣ Draw Started - fe__draw_count.avg.pct_of_peak_sustained_elapsed
The ratio of draw calls issued to the graphics pipe to the maximum sustained rate of
the graphics pipe.
Note: The percentage will always be very low as the front end can issue draw calls
significantly faster than the pipe can execute the draw call. The rendering of this row
will be changed to help indicate when draw calls are being issued.
‣ Dispatch Started -
gr__dispatch_count.avg.pct_of_peak_sustained_elapsed
The ratio of compute grid launches (dispatches) to the compute pipe to the
maximum sustained rate of the compute pipe.
Note: The percentage will always be very low as the front end can issue grid
launches significantly faster than the pipe can execute them. The rendering
of this row will be changed to help indicate when grid launches are being issued.
‣ Vertex/Tess/Geometry Warps in Flight -
tpc__warps_active_shader_vtg_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of active vertex, geometry, tessellation, and meshlet shader warps resident
on the SMs to the maximum number of warps per SM as a percentage.
‣ Pixel Warps in Flight -
tpc__warps_active_shader_ps_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of active pixel/fragment shader warps resident on the SMs to the
maximum number of warps per SM as a percentage.
‣ Compute Warps in Flight -
tpc__warps_active_shader_cs_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of active compute shader warps resident on the SMs to the maximum
number of warps per SM as a percentage.
‣ Active SM Unused Warp Slots -
tpc__warps_inactive_sm_active_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of inactive warp slots on the SMs to the maximum number of warps per
SM as a percentage. This is an indication of how many more warps may fit on the
SMs if occupancy is not limited by a resource such as max warps of a shader type,
shared memory, registers per thread, or thread blocks per SM.
‣ Idle SM Unused Warp Slots -
tpc__warps_inactive_sm_idle_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of inactive warp slots due to idle SMs to the maximum number of
warps per SM as a percentage.
This is an indicator that the current workload on the SM is not sufficient to put work
on all SMs. This can be due to:
‣ CPU starving the GPU
‣ current work is too small to saturate the GPU
‣ current work is trailing off but blocking next work
‣ SM Active - sm__cycles_active.avg.pct_of_peak_sustained_elapsed
The ratio of cycles SMs had at least 1 warp in flight (allocated on SM) to the number
of cycles as a percentage. A value of 0 indicates all SMs were idle (no warps in
flight). A value of 50% can indicate anything between all SMs being active for 50% of
the sample period and 50% of the SMs being active for 100% of the sample period.
‣ SM Issue -
sm__inst_executed_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of cycles that SM sub-partitions (warp schedulers) issued an instruction to
the number of cycles in the sample period as a percentage.
‣ Tensor Active -
sm__pipe_tensor_cycles_active_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of cycles the SM tensor pipes were active issuing tensor instructions to the
number of cycles in the sample period as a percentage.
TU102/4/6: This metric is not available on TU10x for periodic sampling. Please see
Tensor Active/FP16 Active.
‣ Tensor Active / FP16 Active -
sm__pipe_shared_cycles_active_realtime.avg.pct_of_peak_sustained_elapsed
TU102/4/6 only
The ratio of cycles the SM tensor pipes or FP16x2 pipes were active issuing tensor
instructions to the number of cycles in the sample period as a percentage.
‣ VRAM Bandwidth -
dram__throughput.avg.pct_of_peak_sustained_elapsed
The ratio of cycles the GPU device memory controllers were actively performing
read or write operations to the number of cycles in the sample period as a
percentage.
‣ NVLink bytes received -
nvlrx__bytes.avg.pct_of_peak_sustained_elapsed
The ratio of bytes received on the NVLink interface to the maximum number of
bytes receivable in the sample period as a percentage. This value includes protocol
overhead.
‣ NVLink bytes transmitted -
nvltx__bytes.avg.pct_of_peak_sustained_elapsed
The ratio of bytes transmitted on the NVLink interface to the maximum number
of bytes transmittable in the sample period as a percentage. This value includes
protocol overhead.
‣ PCIe Read Throughput -
pcie__read_bytes.avg.pct_of_peak_sustained_elapsed
The ratio of bytes received on the PCIe interface to the maximum number of bytes
receivable in the sample period as a percentage. The theoretical value is calculated
based upon the PCIe generation and number of lanes. This value includes protocol
overhead.
‣ PCIe Write Throughput -
pcie__write_bytes.avg.pct_of_peak_sustained_elapsed
The ratio of bytes transmitted on the PCIe interface to the maximum number of
bytes transmittable in the sample period as a percentage. The theoretical value is
calculated based upon the PCIe generation and number of lanes. This value includes
protocol overhead.
Limitations
‣ If metric sets with NVLink are used but the links are not active, they may appear as
fully utilized.
‣ Only one tool that subscribes to these counters can be used at a time, therefore,
Nsight Systems GPU Metrics feature cannot be used at the same time as the
following tools:
‣ Nsight Graphics
‣ Nsight Compute
‣ DCGM (Data Center GPU Manager)
To pause and resume DCGM collection, use the following commands:
‣ dcgmi profile --pause
‣ dcgmi profile --resume
Or the API calls:
‣ dcgmProfPause
‣ dcgmProfResume
‣ Non-NVIDIA products which use:
‣ CUPTI sampling used directly in the application. CUPTI trace is okay
(although it will block Nsight Systems CUDA trace)
‣ DCGM library
‣ Nsight Systems limits the amount of memory that can be used to store GPU Metrics
samples. Analysis with higher sampling rates or on GPUs with more SMs has a risk
of exceeding this limit, which will lead to gaps on the timeline filled with Missing
Data ranges. Future releases will reduce the frequency of this happening.
Chapter 20.
CPU PROFILING USING LINUX OS PERF
SUBSYSTEM
On Linux targets, Nsight Systems utilizes the Linux OS's perf subsystem to sample CPU
Instruction Pointers (IPs) and backtraces, trace CPU context switches, and sample CPU
and OS event counts. The Linux perf tool utilizes the same perf subsystem.
On L4T and potentially other Arm targets, Nsight Systems may use a custom kernel
module to collect the same data. The Nsight Systems CLI command nsys status --
environment indicates when the kernel module is used instead of the Linux OS's perf
subsystem.
Features
‣ CPU Instruction Pointer / Backtrace Sampling
Nsight Systems can sample CPU Instruction Pointers / backtraces periodically. The
collection of a sample is triggered by a hardware event overflow - e.g. a sample is
collected after every 1 million CPU reference cycles on a per thread basis. In the
GUI, samples are shown on the individual thread timelines, in the Event Viewer,
and in the Top Down, Bottom Up, or Flat views which provide histogram-like
summaries of the data. IP / backtrace collections can be configured in process-tree or
system-wide mode. In process-tree mode, Nsight Systems will sample the process,
and any of its descendants, launched by the tool. In system-wide mode, Nsight
Systems will sample all processes running on the system, including any processes
launched by the tool.
‣ CPU Context Switch Tracing
Nsight Systems can trace every time the OS schedules a thread on a logical CPU
and every time the OS thread gets unscheduled from a logical CPU. The data is
used to show CPU utilization and OS thread utilization within the Nsight Systems
GUI. Context switch collections can be configured in process-tree or system-wide
mode. In process-tree mode, Nsight Systems will trace the process, and any of its
descendants, launched by Nsight Systems. In system-wide mode, Nsight Systems
will trace all processes running on the system, including any processes launched by
Nsight Systems.
‣ CPU Event Sampling
Nsight Systems can periodically sample CPU hardware event counts and OS event
counts and show the event's rate over time in the Nsight Systems GUI. Event sample
collections can be configured in system-wide mode only. In system-wide mode,
Nsight Systems samples the event counts of all CPUs and the OS events of the
system. Event counts are not directly associated with processes or threads.
System Requirements
‣ Paranoid Level
The system's paranoid level must be 2 or lower.
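On a typical Linux system the level can be inspected and lowered as follows (a sketch; writing requires root, and the sysctl.d path assumes a systemd-style distribution):

```shell
# Check the current paranoid level (must be 2 or lower):
cat /proc/sys/kernel/perf_event_paranoid

# Lower it until the next reboot:
sudo sh -c 'echo 2 > /proc/sys/kernel/perf_event_paranoid'

# Persist the setting across reboots:
echo 'kernel.perf_event_paranoid=2' | sudo tee /etc/sysctl.d/99-perf.conf
```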
‣ Kernel Version
To support CPU profiling, the target system must run a 3.10.0-693 or later kernel.
Use the uname -r command to check the kernel's version.
‣ perf_event_open syscall
The perf_event_open syscall needs to be available. When running within a Docker
container, the default seccomp settings will normally block the perf_event_open
syscall. To work around this issue, use the docker run --privileged switch when
launching the container, or modify the container's seccomp settings. Some VMs
(virtual machines), e.g. on AWS, may also block the perf_event_open syscall.
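For example (a sketch; my-image and perf.json are placeholders for your container image and a custom seccomp profile that allows perf_event_open):

```shell
# Option 1: disable the default seccomp restrictions entirely.
docker run --privileged --rm my-image nsys profile ./myapp

# Option 2: keep seccomp but supply a profile that permits perf_event_open.
docker run --security-opt seccomp=perf.json --rm my-image nsys profile ./myapp
```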
‣ Sampling Trigger
In some rare cases, a sampling trigger is not available. The sampling trigger is either
a hardware or software event that causes a sample to be collected. Some VMs block
hardware events from being accessed and therefore, prevent hardware events from
being used as sampling triggers. In those cases, Nsight Systems will fall back to
using a software trigger if possible.
‣ Checking Your Target System
Use the nsys status --environment command to check if a system meets the
Nsight Systems CPU profiling requirements. Example output from this command is
shown below. Note that this command does not check for Linux capability overrides
- i.e. if the user or executable files have CAP_SYS_ADMIN or CAP_PERFMON
capability. Also, note that this command does not indicate if system-wide mode can
be used.
The configuration used during CPU profiling is documented in the Analysis Summary:
In the timeline, yellow-orange marks can be found under each thread's timeline that
indicate the moment an IP / backtrace sample was collected on that thread (e.g. see the
yellow-orange marks in the Specific Samples box above). Hovering the cursor over a
mark will cause a tooltip to display the backtrace for that sample.
Below the Timeline is a drop-down list with multiple options, including Events View,
Top-Down View, Bottom-Up View, and Flat View. All four of these views can be used to
view CPU IP / backtrace sampling data.
Example of Event Sampling
Event sampling samples hardware or software event counts during a collection and then
graphs those events as rates on the Timeline. The above screenshot shows 4 hardware
events. Core and cache events are graphed under the associated CPU row (see the red
box in the screenshot) while uncore and OS events are graphed in their own row (see
the green box in the screenshot). Hovering the cursor over an event sampling row in the
timeline shows the event's rate at that moment.
Common Issues
‣ Reducing Overhead Caused By Sampling
There are several ways to reduce overhead caused by sampling.
‣ disable sampling (i.e. use the --sampling=none switch)
‣ increase the sampling period (i.e. reduce the sampling rate) using the --
sampling-period switch
‣ stop collecting backtraces (i.e. use the --backtrace=none switch) or collect
more efficient backtraces - if available, use the --backtrace=lbr switch.
‣ reduce the number of backtraces collected per sample. See documentation for
the --samples-per-backtrace switch.
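The following command combines two of these mitigations (a sketch; ./myapp is a placeholder, and lbr backtraces require hardware support):

```shell
# Sample every 2,000,000 CPU cycles and use cheaper last-branch-record backtraces.
nsys profile --sampling-period=2000000 --backtrace=lbr ./myapp
```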
‣ Throttling
The Linux operating system enforces a maximum time to handle sampling
interrupts. This means that if collecting samples takes more than a specified amount
of time, the OS will throttle (i.e. slow down) the sampling rate to prevent the perf
subsystem from causing too much overhead. When this occurs, sampling data may
become irregular even though the thread is very busy.
The above screenshot shows a case where CPU IP / backtrace sampling was throttled
during a collection. Note the irregular intervals of sampling tickmarks on the thread
timeline. The number of times a collection throttled is provided in the Nsight
Systems GUI's Diagnostics messages. If a collection throttles frequently (e.g. 1000s of
times), increasing the sampling period should help reduce throttling.
Note: When throttling occurs, the OS sets a new (lower) maximum sampling rate in the
procfs. This value must be reset before the sampling rate can be increased again. Use
the following command to reset the OS' max sampling rate:
echo '100000' | sudo tee /proc/sys/kernel/perf_event_max_sample_rate
Chapter 21.
NVIDIA VIDEO CODEC SDK TRACE
Nsight Systems for x86 Linux and Windows targets can trace calls from the NVIDIA
Video Codec SDK. This software trace can be launched from the GUI or by using the
--trace=nvvideo switch from the CLI.
On the timeline, calls on the CPU to the NV Encoder API and NV Decoder API will be
shown.
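For example (a sketch; ./myapp is a placeholder for an application that uses the Video Codec SDK):

```shell
# Trace NVIDIA Video Codec SDK (NVENC/NVDEC) API calls.
nsys profile --trace=nvvideo ./myapp
```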
Network Communication Profiling
Only a subset of the MPI API, including blocking and non-blocking point-to-point and
collective communication, and file I/O operations, is traced. If you require more control
over the list of traced APIs or if you are using a different MPI implementation, you can
use the NVTX wrappers for MPI. If you set the environment variable LD_PRELOAD to the
path of the generated wrapper library, Nsight Systems will capture and report the MPI
API trace information when NVTX tracing is enabled. Choose an NVTX domain name
other than "MPI", since that name is filtered out by Nsight Systems when MPI tracing is
not enabled.
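A typical invocation looks like this (a sketch; the wrapper library path is a placeholder, and propagation of LD_PRELOAD to the MPI ranks depends on your launcher):

```shell
# Preload the generated NVTX MPI wrapper library and collect the NVTX trace.
LD_PRELOAD=/path/to/libnvtx_mpi_wrappers.so nsys profile --trace=nvtx mpirun -np 4 ./myapp
```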
When not all processes that are involved in an MPI communication are loaded into
Nsight Systems, the following information is available.
‣ The right-hand screenshot shows a reused communicator handle (last number
increased).
‣ Encoding: MPI_COMM[*team size*]*global-group-root-rank*.*group-ID*
MPI_Init[_thread], MPI_Finalize
MPI_Send, MPI_{B,S,R}send, MPI_Recv, MPI_Mrecv
MPI_Sendrecv[_replace]
MPI_Barrier, MPI_Bcast
MPI_Scatter[v], MPI_Gather[v]
MPI_Allgather[v], MPI_Alltoall[{v,w}]
MPI_Allreduce, MPI_Reduce[_{scatter,scatter_block,local}]
MPI_Scan, MPI_Exscan
MPI_Win_allocate[_shared]
MPI_Win_create[_dynamic]
MPI_Win_{attach, detach}
MPI_Win_free
MPI_Win_fence
MPI_Win_{start, complete, post, wait}
MPI_Win_[un]lock[_all]
MPI_Win_flush[_local][_all]
MPI_Win_sync
MPI_File_{open,close,delete,sync}
MPI_File_{read,write}[_{all,all_begin,all_end}]
MPI_File_{read,write}_at[_{all,all_begin,all_end}]
MPI_File_{read,write}_shared
MPI_File_{read,write}_ordered[_{begin,end}]
MPI_File_i{read,write}[_{all,at,at_all,shared}]
MPI_File_set_{size,view,info}
MPI_File_get_{size,view,info,group,amode}
MPI_File_preallocate
MPI_Pack[_external]
MPI_Unpack[_external]
shmem_my_pe
shmem_n_pes
shmem_global_exit
shmem_pe_accessible
shmem_addr_accessible
shmem_ctx_{create,destroy,get_team}
shmem_global_exit
shmem_info_get_{version,name}
shmem_{my_pe,n_pes,pe_accessible,ptr}
shmem_query_thread
shmem_team_{create_ctx,destroy}
shmem_team_get_config
shmem_team_{my_pe,n_pes,translate_pe}
shmem_team_split_{2d,strided}
shmem_test*
ucp_am_send_nb[x]
ucp_am_recv_data_nbx
ucp_am_data_release
ucp_atomic_{add{32,64},cswap{32,64},fadd{32,64},swap{32,64}}
ucp_atomic_{post,fetch_nb,op_nbx}
ucp_cleanup
ucp_config_{modify,read,release}
ucp_disconnect_nb
ucp_dt_{create_generic,destroy}
ucp_ep_{create,destroy,modify_nb,close_nbx}
ucp_ep_flush[{_nb,_nbx}]
ucp_listener_{create,destroy,query,reject}
ucp_mem_{advise,map,unmap,query}
ucp_{put,get}[_nbi]
ucp_{put,get}_nb[x]
ucp_request_{alloc,cancel,is_completed}
ucp_rkey_{buffer_release,destroy,pack,ptr}
ucp_stream_data_release
ucp_stream_recv_data_nb
ucp_stream_{send,recv}_nb[x]
ucp_stream_worker_poll
ucp_tag_msg_recv_nb[x]
ucp_tag_{send,recv}_nbr
ucp_tag_{send,recv}_nb[x]
ucp_tag_send_sync_nb[x]
ucp_worker_{create,destroy,get_address,get_efd,arm,fence,wait,signal,wait_mem}
ucp_worker_flush[{_nb,_nbx}]
ucp_worker_set_am_{handler,recv_handler}
ucp_config_print
ucp_conn_request_query
ucp_context_{query,print_info}
ucp_get_version[_string]
ucp_ep_{close_nb,print_info,query,rkey_unpack}
ucp_mem_print_info
ucp_request_{check_status,free,query,release,test}
ucp_stream_recv_request_test
ucp_tag_probe_nb
ucp_tag_recv_request_test
ucp_worker_{address_query,print_info,progress,query,release_address}
Additional API functions from other UCX layers may be added in a future version of the
product.
Available Metrics
‣ Bytes sent - Number of bytes sent through all NIC ports.
‣ Bytes received - Number of bytes received by all NIC ports.
‣ CNPs sent - Number of congestion notification packets sent by the NIC.
‣ CNPs received - Number of congestion notification packets received and handled
by the NIC.
‣ Send waits - The number of ticks during which ports had data to transmit but no
data was sent during the entire tick (either because of insufficient credits or because
of lack of arbitration)
Note: Each one of the mentioned metrics is shown only if it has a non-zero value during
profiling.
Usage Examples
‣ The Bytes sent/sec and Bytes received/sec metrics enable identifying
idle and busy NIC times.
‣ Developers may shift network operations from busy to idle times to reduce
network congestion and latency.
‣ Developers can use idle NIC times to send additional data without reducing
application performance.
‣ CNPs (congestion notification packets) received/sent and Send waits metrics may
explain network latencies. A developer seeing the time periods when the network
was congested may rewrite the algorithm to avoid the observed congestion.
Note: RDMA over Converged Ethernet (RoCE) traffic is not logged into the Nsight
Systems NIC metrics.
Available Metrics
‣ Bytes sent - Number of bytes sent through all switch ports
‣ Bytes received - Number of bytes received by all switch ports
Chapter 23.
PYTHON BACKTRACE SAMPLING
Nsight Systems for Arm server (SBSA) platforms, x86 Linux, and Windows targets is
capable of periodically capturing Python backtrace information. This functionality is
available when tracing Python interpreters of version 3.9 or later. Python backtraces
are captured as periodic samples at a selected frequency ranging from 1 Hz to 2 kHz,
with a default value of 1 kHz.
To enable Python backtrace sampling from Nsight Systems:
CLI — Set --python-sampling=true and use the --python-sampling-frequency
option to set the sampling rate.
GUI — Select the Collect Python backtrace samples checkbox.
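A CLI invocation might look like this (a sketch; my_script.py is a placeholder):

```shell
# Sample Python backtraces at 2 kHz (the default is 1 kHz).
nsys profile --python-sampling=true --python-sampling-frequency=2000 python3 my_script.py
```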
Example screenshot:
Chapter 24.
READING YOUR REPORT IN GUI
24.6.1. Timeline
The Timeline is a versatile control that contains a tree-like hierarchy on the left and
corresponding charts on the right.
The contents of the hierarchy depend on the project settings used to collect the report.
For example, if a certain feature has not been enabled, the corresponding rows will not
be shown on the timeline.
To generate a timeline screenshot without opening the full GUI, use the command
nsys-ui.exe --screenshot filename.nsys-rep
Timeline Navigation
Zoom and Scroll
At the upper right portion of your Nsight Systems GUI you will see this section:
The slider sets the vertical size of screen rows, and the magnifying glass resets it to the
original settings.
There are many ways to zoom and scroll horizontally through the timeline. Clicking on
the keyboard icon seen above, opens the below dialog that explains them.
Timeline/Events correlation
To display trace events in the Events View, right-click a timeline row and select the
“Show in Events View” command. The events of the selected row and all of its sub-rows
will be displayed in the Events View. Note that the events displayed will correspond to
the current zoom in the timeline; zooming in or out will reset the event pane filter.
If a timeline row has been selected for display in the Events View, then double-clicking
a timeline item on that row will automatically scroll the content of the Events View to
make the corresponding events view item visible and select it. If that event has tool tip
information, it will be displayed in the right hand pane.
Likewise, double-clicking on a particular instance in the Events View will highlight the
corresponding event in the timeline.
Row Height
Several of the rows in the timeline use height as a way to model the percent utilization
of resources. This gives the user insight into what is going on even when the timeline is
zoomed all the way out.
In this picture you see that for kernel occupation there is a colored bar of variable height.
Nsight Systems calculates the average occupancy for the period of time represented by
a particular pixel width of the screen. It then uses that average to set the top of the
colored section. So, for instance, if the kernel is active for 25% of that timeslice, the bar
goes 25% of the distance to the top of the row.
To make the difference clear, if the percentage of the row height is non-zero but would
be represented by less than one vertical pixel, Nsight Systems displays it as one pixel
high. The gray height represents the maximum usage in that time range.
This row height coding is used in the CPU utilization, thread and process occupancy,
kernel occupancy, and memory transfer activity rows.
Row Percentage
In the image below you see that there are percentages prefixing the stream rows in the
GPU.
The percentage shown in front of the stream indicates the proportion of context running
time this particular stream takes.
So "26% Stream 1" means that Stream 1 takes 26% of its context's total running time.
Total running time = sum of durations of all kernels and memory ops
that run in this context
API calls, GPU executions, and debug markers that occurred within the boundaries of a
debug marker are displayed nested to that debug marker. Multiple levels of nesting are
supported.
Events view recognizes these types of debug markers:
‣ NVTX
‣ Vulkan VK_EXT_debug_marker markers, VK_EXT_debug_utils labels
‣ PIX events and markers
‣ OpenGL KHR_debug markers
You can copy and paste from the events view by highlighting rows, using Shift or Ctrl
to enable multi-select. Right clicking on the selection will give you a copy option.
Each of the views helps understand particular performance issues of the application
being profiled. For example:
‣ When trying to find specific bottleneck functions that can be optimized, the Bottom-
Up view should be used. Typically, the top few functions should be examined.
Expand them to understand in which contexts they are being used.
‣ To navigate the call tree of the application while generally searching for
algorithms and parts of the code that consume unexpectedly large amounts of CPU
time, the Top-Down view should be used.
‣ To quickly assess which parts of the application, or high-level parts of an algorithm,
consume significant amounts of CPU time, use the Flat view.
The Top-Down and Bottom-Up views have Self and Total columns, while the Flat view
has a Flat column. It is important to understand the meaning of each of the columns:
‣ Top-Down view
‣ Self column denotes the relative amount of time spent executing instructions of
this particular function.
‣ Total column shows how much time has been spent executing this function,
including all other functions called from this one. Total values of sibling rows
sum up to the Total value of the parent row, or 100% for the top-level rows.
‣ Bottom-Up view
‣ Self column for top-level rows, as in the Top-Down view, shows how much time
has been spent directly in this function. Self times of all top-level rows add up to
100%.
‣ Self column for children rows breaks down the value of the parent row based on
the various call chains leading to that function. Self times of sibling rows add up
to the value of the parent row.
‣ Flat view
‣ Flat column shows how much time this function has spent anywhere on the
call stack. Values in this column do not add up or have other significant
relationships.
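The Self/Total relationship above can be made concrete with a toy call tree. This is a sketch under the assumption that each node carries the time spent directly in that function; the function names and times are invented for illustration.

```python
# Sketch of the Self/Total column semantics from a toy call tree.
# self_time is the time spent directly in the function (the Self column);
# Total = Self plus the Total times of every callee.

class Node:
    def __init__(self, name, self_time, children=()):
        self.name = name
        self.self_time = self_time          # "Self" column
        self.children = list(children)

    def total_time(self):
        # "Total" column: this function plus everything it called.
        return self.self_time + sum(c.total_time() for c in self.children)

# main spends 10 itself, parse 30, solve 20, and solve calls mul (40).
tree = Node("main", 10, [Node("parse", 30),
                         Node("solve", 20, [Node("mul", 40)])])
print(tree.self_time, tree.total_time())  # 10 100
```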
Note: If low-impact functions have been filtered out, values may not add up correctly
to 100%, or to the value of the parent row. This filtering can be disabled.
The contents of the symbols table are tightly related to the timeline. Users can apply and
modify filters on the timeline, and these filters affect which information is displayed in
the symbols table:
‣ Per-thread filtering — Each thread that has sampling information associated with it
has a checkbox next to it on the timeline. Only threads with selected checkboxes are
represented in the symbols table.
‣ Time filtering — A time filter can be set up on the timeline by pressing the left
mouse button, dragging over a region of interest on the timeline, and then choosing
Filter by selection in the dropdown menu. In this case, only sampling information
collected during the selected time range will be used to build the symbols table.
Note: If too little sampling data is used to build the symbols table (for example, when
the sampling rate is configured to be low and a short period of time is used for
time-based filtering), the numbers in the symbols table might not be representative or
accurate in some cases.
push %rbp
mov %rsp,%rbp
When frame pointers are available in a binary, full stack traces will be captured. Note
that libraries that are frequently used by apps and ship with the operating system, such
as libc, are generated in release mode and therefore do not include frame pointers.
Frequently, when a backtrace includes an address from a system library, the backtrace
will fail to resolve further as the frame pointer trail goes cold due to a missing frame
pointer.
A simple application was developed to show the difference. The application calls
function a(), which calls b(), which calls c(), and so on. Function z() calls a heavy compute
function called matrix_multiply(). Almost all of the IP samples are collected while
matrix_multiply is executing. The next two screenshots show one of the main
differences between frame pointers and LBRs.
Note that the frame pointer example shows the full stack trace, while the LBR example
only shows part of the stack due to the limited number of LBR registers in the CPU.
Kernel Samples
When an IP sample is captured while a kernel mode (i.e., operating system) function is
executing, the sample will be shown with an address that starts with 0xffffffff and mapped
to the [kernel.kallsyms] module.
[vdso]
Samples may be collected while a CPU is executing functions in the Virtual Dynamic
Shared Object. In this case, the sample will be resolved (i.e. mapped) to the [vdso]
module. The vdso man page provides the following description of the vdso:
Why does the vDSO exist at all? There are some system calls the
kernel provides that user-space code ends up using frequently, to the
point that such calls can dominate overall performance. This is due
both to the frequency of the call as well as the context-switch
overhead that results from exiting user space and entering the
kernel.
[Unknown]
When an address cannot be resolved (i.e., mapped to a module), its address within the
process's address space will be shown and its module will be marked as [Unknown].
‣ Collapse unresolved lines is useful if some of the binary code does not have
symbols. In this case, subtrees that consist of only unresolved symbols get collapsed
in the Top-Down view, since they provide very little useful information.
‣ Hide functions with CPU usage below X% is useful for large applications, where
the sampling profiler hits lots of functions just a few times. To filter out the "long
tail," which is typically not important for CPU performance bottleneck analysis, this
checkbox should be selected.
When a collection result is opened in the Nsight Systems GUI, there are multiple ways to
view the CPU profiling data - especially the CPU IP / backtrace data.
In the timeline, yellow-orange marks can be found under each thread's timeline that
indicate the moment an IP / backtrace sample was collected on that thread (e.g. see the
yellow-orange marks in the Specific Samples box above). Hovering the cursor over a
mark will cause a tooltip to display the backtrace for that sample.
Below the Timeline is a drop-down list with multiple options including Events View,
Top-Down View, Bottom-Up View, and Flat View. All four of these views can be used to
view CPU IP / backtrace sampling data.
If the Bottom-Up View is selected, the sampling summary is shown in the bottom
half of the Timeline View screen. Notice that the summary includes the phrase “65,022
samples are used”, indicating how many samples are summarized. By default, functions
that were found in less than 0.5% of the samples are not shown. Use the filter
button to modify that setting.
When sampling data is filtered, the Sampling Summary will summarize the selected
samples. Samples can be filtered on an OS thread basis, on a time basis, or both.
Above, deselecting a checkbox next to a thread removes its samples from the sampling
summary. Dragging the cursor over the timeline and selecting “Filter and Zoom In”
chooses the samples during the time selected, as seen below. The sample summary
includes the phrase “0.35% (225 samples) of data is shown due to applied filters”
indicating that only 225 samples are included in the summary results.
Deselecting threads one at a time by deselecting their checkbox can be tedious. Click
on the down arrow next to a thread and choose Show Only This Thread to deselect all
threads except that thread.
If Events View is selected in the Timeline View's drop-down list, right click on a specific
thread and choose Show in Events View. The samples collected while that thread
executed will be shown in the Events View. Double clicking on a specific sample in the
Events view causes the timeline to show when that sample was collected - see the green
boxes below. The backtrace for that sample is also shown in the Events View.
Backtraces
To understand the code path used to get to a specific function shown in the sampling
summary, right click on a function and select Expand.
The above shows what happens when a function’s backtraces are expanded. In this case,
the PCQueuePop function was called from the CmiGetNonLocal function which was
called by the CsdNextMessage function which was called by the CsdScheduleForever
function. The [Max depth] string marks the end of the collected backtrace.
Note that, by default, backtraces that account for less than 0.5% of the total samples are
hidden. This behavior can make the percentage results hard to understand. If all backtraces
are shown (i.e., the filter is disabled), the results look very different and the numbers add
up as expected. To disable the filter, click on the Filter… button and uncheck the Hide
functions with CPU usage below X% checkbox.
When the filter is disabled, the backtraces are recalculated. Note that you may need
to right click on the function and select Expand again to get all of the backtraces to be
shown.
When backtraces are collected, the whole sample (IP and backtrace) is handled as a
single sample. If two samples have the exact same IP and backtrace, they are summed in
the final results. If two samples have the same IP but a different backtrace, they will be
shown as having the same leaf (i.e. IP) but a different backtrace. As mentioned earlier,
when backtraces end, they are marked with the [Max depth] string (unless the backtrace
can be traced back to its origin - e.g. __libc_start_main) or the backtrace breaks because
an IP cannot be resolved.
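The aggregation rule above can be sketched as a simple counting step. This is an illustration only, assuming each sample is represented as a (leaf IP, backtrace) pair; the function names are taken from the example discussed in this section.

```python
# Sketch: samples with an identical (IP, backtrace) pair are summed;
# the same IP with a different backtrace stays a separate entry that
# merely shares the leaf.
from collections import Counter

samples = [
    ("PCQueuePop", ("CmiGetNonLocal", "CsdNextMessage", "[Max depth]")),
    ("PCQueuePop", ("CmiGetNonLocal", "CsdNextMessage", "[Max depth]")),
    ("PCQueuePop", ("CmiGetNonLocal", "[Max depth]")),
]
counts = Counter(samples)  # two distinct backtraces, same leaf
print(counts)
```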
Above, the leaf function is PCQueuePop. In this case, there are 11 different backtraces
that lead to PCQueuePop - all of them end with [Max depth]. For example, the dominant
path is PCQueuePop<-CmiGetNonLocal<-CsdNextMessage<-CsdScheduleForever<-
[Max depth]. This path accounts for 5.67% of all samples, as shown in line 5 (red
numbers). The second most dominant path is PCQueuePop<-CmiGetNonLocal<-[Max
depth], which accounts for 0.44% of all samples, as shown in line 24 (red numbers).
The path PCQueuePop<-CmiGetNonLocal<-CsdNextMessage<-CsdScheduleForever<-
Sequencer::integrate(int)<-[Max depth] accounts for 0.03% of the samples, as shown in
line 7 (red numbers). Adding up the percentages shown in the [Max depth] lines (lines 5,
7, 9, 13, 15, 16, 17, 19, 21, 23, and 24) yields 7.04%, which equals the percentage of
samples associated with the PCQueuePop function shown in line 0 (red numbers).
Information from this view can be selected and copied using the mouse cursor.
Chapter 25.
ADDING REPORT TO THE TIMELINE
Starting with 2021.3, Nsight Systems can load multiple report files into a single timeline.
This is a BETA feature and will be improved in future releases. Please let us know
about your experience on the forums or through Help > Send Feedback... in the main
menu.
To load multiple report files into a single timeline, first open a report as usual
— using File > Open... from the main menu, or double clicking on a report in the Project
Explorer window. Then additional report files can be loaded into the same timeline
using one of the following methods:
‣ File > Add Report (beta)... in the main menu, and select another report file that you
want to open
‣ Right click on the report in the Project Explorer window, and click Add Report
(beta)
Nsight Systems can automatically adjust timestamps based on UTC time recorded
around the collection start time. This method is used by default when other more
precise methods are not available. This time can be seen as UTC time at t=0 in the
Analysis Summary page of the report file. Refer to your OS documentation to learn how
to sync the software clock using the Network Time Protocol (NTP). NTP-based time
synchronization is not very precise, with typical errors on the scale of one to tens of
milliseconds.
Reports collected on the same physical machine can use synchronization based on
Timestamp Counter (TSC) values. These are platform-specific counters, typically
accessed in user space applications using the RDTSC instruction on x86_64 architecture,
or by reading the CNTVCT register on Arm64. Their values converted to nanoseconds
can be seen as TSC value at t=0 in the Analysis Summary page of the report file.
Reports synchronized using TSC values can be aligned with nanoseconds-level
precision.
TSC-based time synchronization is activated automatically when Nsight Systems
detects that the reports come from the same target and that the same TSC value corresponds
to very close UTC times. Targets are considered to be the same when the explicitly
set NSYS_HW_ID environment variable is the same for both reports, or when the target
hostnames are the same and NSYS_HW_ID is not set for either target. The difference
between the UTC and TSC time offsets must be below 1 second for TSC-based time
synchronization to be chosen.
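The selection rule above can be sketched as a small predicate. This is an illustration, not the tool's implementation; the report metadata dictionaries and field names are hypothetical stand-ins for the hostname, NSYS_HW_ID, and time-offset values a report carries.

```python
# Sketch of the TSC-vs-UTC synchronization choice described above.
# a, b: hypothetical report metadata with 'hostname', 'hw_id' (None if
# NSYS_HW_ID was not set), and 'utc_minus_tsc_s' (the report's UTC/TSC
# time offset, in seconds).

def use_tsc_sync(a, b, max_offset_s=1.0):
    if a["hw_id"] is not None or b["hw_id"] is not None:
        # If NSYS_HW_ID is involved, both reports must carry the same value.
        same_target = a["hw_id"] is not None and a["hw_id"] == b["hw_id"]
    else:
        # Otherwise fall back to matching hostnames.
        same_target = a["hostname"] == b["hostname"]
    # Offsets must agree to within 1 second for TSC-based alignment.
    return same_target and abs(a["utc_minus_tsc_s"] - b["utc_minus_tsc_s"]) < max_offset_s

r1 = {"hostname": "node0", "hw_id": None, "utc_minus_tsc_s": 100.002}
r2 = {"hostname": "node0", "hw_id": None, "utc_minus_tsc_s": 100.010}
print(use_tsc_sync(r1, r2))  # True
```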
To find out which synchronization method was used, navigate to the Analysis Summary
tab of an added report and check the Report alignment source property of a target.
Note that the first report won't have this parameter.
When loading multiple reports into a single timeline, it is always advisable to first
check that time synchronization looks correct by zooming into synchronization or
communication events that are expected to be aligned.
When each MPI rank runs on a different node, the command above works fine, since the
default pairing mode (different hardware) will be used.
When all MPI ranks run on localhost only, use this command (the value "A" was chosen
arbitrarily; it can be any non-empty string):
NSYS_SYSTEM_ID=A mpirun <mpirun-options> nsys profile -o report-%p
<nsys-options> ./myApp
For convenience, the MPI rank can be encoded into the report filename. For Open MPI,
use the following command to create report files based on the global rank value:
25.4. Limitations
‣ Only report files collected with Nsight Systems version 2021.3 and newer are fully
supported.
‣ Sequential reports collected in a single CLI profiling session cannot be loaded into a
single timeline yet.
Chapter 26.
USING NSIGHT SYSTEMS EXPERT SYSTEM
If a .nsys-rep file is given as the input file and there is no .sqlite file with the same name
in the same directory, it will be generated.
Note: The Expert System view in the GUI will give you the equivalent command line.
A context menu is available to correlate a table entry with the timeline. The options are
the same as in the Events View:
‣ Zoom to Selected on Timeline (ctrl+double-click)
The highlighting is not supported for rules that do not return an event but rather an
arbitrary time range (e.g. GPU utilization rules).
The CLI and GUI share the same rule scripts and messages. There might be some
formatting differences between the output table in GUI and CLI.
Synchronous Memcpy
This rule identifies synchronous memory transfers that block the host.
Suggestion: Use cudaMemcpy*Async APIs instead.
Synchronous Memset
This rule identifies synchronous memset operations that block the host.
Suggestion: Use cudaMemset*Async APIs instead.
Synchronization APIs
This rule identifies synchronization APIs that block the host until all issued CUDA calls
are complete.
Suggestions: Avoid excessive use of synchronization. Use asynchronous CUDA event
calls, such as cudaStreamWaitEvent and cudaEventSynchronize, to prevent host
synchronization.
‣ For each process, each GPU is examined, and gaps are found within the time range
that starts with the beginning of the first GPU operation on that device and ends
with the end of the last GPU operation on that device.
‣ GPU gaps that cannot be addressed by the user are excluded. This includes:
‣ Profiling overhead in the middle of a GPU gap.
‣ The initial gap in the report that is seen before the first GPU operation.
‣ The final gap that is seen after the last GPU operation.
GPU Low Utilization
This rule identifies time regions with low utilization.
Suggestions: Use CPU sampling data, OS Runtime blocked state backtraces, and/or OS
Runtime APIs related to thread synchronization to understand if a sluggish or blocked
CPU is causing the gaps. Add NVTX annotations to CPU code to understand the reason
behind the gaps.
Notes:
‣ For each process, each GPU is examined, and gaps are found within the time range
that starts with the beginning of the first GPU operation on that device and ends
with the end of the last GPU operation on that device. This time range is then
divided into equal chunks, and the GPU utilization is calculated for each chunk. The
utilization includes all GPU operations as well as profiling overheads that the user
cannot address.
‣ The utilization refers to the "time" utilization and not the "resource" utilization.
This rule attempts to find time gaps when the GPU is or isn't being used, but does
not take into account how many GPU resources are being used. Therefore, a single
running memcpy is considered the same amount of "utilization" as a huge kernel
that takes over all the cores. If multiple operations run concurrently in the same
chunk, their utilization will be added up and may exceed 100%.
‣ Chunks with an in-use percentage less than the threshold value are displayed.
If consecutive chunks have a low in-use percentage, the individual chunks are
coalesced into a single display record, keeping the weighted average of percentages.
This is why returned chunks may have different durations.
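The coalescing step described in the last bullet can be sketched as follows. This is an illustration under the assumption that each chunk is a (duration, utilization-percentage) pair; the threshold and chunk values are hypothetical.

```python
# Sketch: merge consecutive low-utilization chunks into one record whose
# percentage is the duration-weighted average, then keep only the
# low-utilization records. Merged records can therefore have different
# durations, as noted above.

def coalesce_low_chunks(chunks, threshold=50.0):
    """chunks: list of (duration_ns, in_use_percent)."""
    out = []
    for dur, pct in chunks:
        if pct < threshold and out and out[-1][1] < threshold:
            pdur, ppct = out[-1]
            merged = pdur + dur
            # Weighted average keeps the combined in-use time consistent.
            out[-1] = (merged, (pdur * ppct + dur * pct) / merged)
        else:
            out.append((dur, pct))
    return [c for c in out if c[1] < threshold]  # only low chunks displayed

print(coalesce_low_chunks([(10, 5.0), (10, 15.0), (10, 90.0), (20, 20.0)]))
# [(20, 10.0), (20, 20.0)]
```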
Chapter 27.
IMPORT NVTXT
Mode descriptions:
‣ lerp - Insert with linear interpolation
--mode lerp --ns_a arg --ns_b arg [--nvtxt_a arg --nvtxt_b arg]
‣ lin - Insert with linear equation
--mode lin --ns_a arg --freq arg [--nvtxt_a arg]
Mode parameters:
Commands
Info
To find out a report's start and end time, use the info command.
Usage:
ImportNvtxt --cmd info -i [--input] arg
Example:
ImportNvtxt info Report.nsys-rep
Analysis start (ns) 83501026500000
Analysis end (ns) 83506375000000
Create
You can create a report file from an existing NVTXT file with the create command.
Usage:
ImportNvtxt --cmd create -n [--nvtxt] arg -o [--output] arg [-m [--mode]
mode_name mode_args]
with:
‣ ns_a — a nanoseconds value.
‣ ns_b — a nanoseconds value (greater than ns_a).
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
‣ nvtxt_b — an nvtxt file's time unit value corresponding to ns_b nanoseconds.
If nvtxt_a and nvtxt_b are not specified, they are set to the nvtxt file's minimum and
maximum time values, respectively.
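The lerp mapping implied by these parameters can be sketched as follows. This is an illustration only, assuming an nvtxt timestamp t is mapped linearly onto nanoseconds through the two anchor pairs (nvtxt_a, ns_a) and (nvtxt_b, ns_b); the numbers are hypothetical.

```python
# Sketch of the lerp mode: linear interpolation between two anchor points.
# (nvtxt_a -> ns_a) and (nvtxt_b -> ns_b) define the line; any nvtxt
# timestamp t maps onto it.

def lerp_to_ns(t, ns_a, ns_b, nvtxt_a, nvtxt_b):
    return ns_a + (t - nvtxt_a) * (ns_b - ns_a) / (nvtxt_b - nvtxt_a)

# Hypothetical: nvtxt ticks 100..200 span 1_000..2_000 ns, so tick 150
# lands halfway.
print(lerp_to_ns(150, 1_000, 2_000, 100, 200))  # 1500.0
```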
Usage for lin mode is:
--mode lin --ns_a arg --freq arg [--nvtxt_a arg]
with:
The output will be a new generated report file which can be opened and viewed by
Nsight Systems.
Merge
To merge an NVTXT file with an existing report file, use the merge command.
Usage:
ImportNvtxt --cmd merge -i [--input] arg -n [--nvtxt] arg -o [--output] arg [-m
[--mode] mode_name mode_args]
with:
‣ ns_a — a nanoseconds value.
‣ ns_b — a nanoseconds value (greater than ns_a).
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
‣ nvtxt_b — an nvtxt file's time unit value corresponding to ns_b nanoseconds.
If nvtxt_a and nvtxt_b are not specified, they are set to the nvtxt file's minimum and
maximum time values, respectively.
Usage for lin mode is:
--mode lin --ns_a arg --freq arg [--nvtxt_a arg]
with:
‣ ns_a — a nanoseconds value.
‣ freq — the nvtxt file's timer frequency.
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
If nvtxt_a is not specified, it is set to the nvtxt file's minimum time value.
Time values in <filename.nvtxt> are assumed to be nanoseconds if no mode is
specified.
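The lin mapping can be sketched in the same spirit. This is an illustration only, assuming freq is the nvtxt timer frequency in ticks per second, so every tick after nvtxt_a adds 1e9/freq nanoseconds; the example values are hypothetical.

```python
# Sketch of the lin mode: a linear equation anchored at (nvtxt_a -> ns_a)
# with slope derived from the timer frequency (ticks per second).

def lin_to_ns(t, ns_a, freq, nvtxt_a):
    return ns_a + (t - nvtxt_a) * 1e9 / freq

# Hypothetical 1 MHz timer: each tick is 1000 ns, so 5 ticks past the
# anchor adds 5000 ns.
print(lin_to_ns(105, 2_000, 1_000_000, 100))  # 7000.0
```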
Example
ImportNvtxt --cmd merge -i Report.nsys-rep -n Sample.nvtxt -o NewReport.nsys-rep
Chapter 28.
VISUAL STUDIO INTEGRATION
NVIDIA Nsight Integration is a Visual Studio extension that allows you to access the
power of Nsight Systems from within Visual Studio.
When Nsight Systems is installed along with NVIDIA Nsight Integration, Nsight
Systems activities will appear under the NVIDIA Nsight menu in the Visual Studio
menu bar. These activities launch Nsight Systems with the current project settings and
executable.
Selecting the "Trace" command will launch Nsight Systems, create a new Nsight Systems
project and apply settings from the current Visual Studio project:
‣ Target application path
‣ Command line parameters
‣ Working folder
If the "Trace" command has already been used with this Visual Studio project then
Nsight Systems will load the respective Nsight Systems project and any previously
captured trace sessions will be available for review using the Nsight Systems project
explorer tree.
For more information about using Nsight Systems from within Visual Studio, please
visit
‣ NVIDIA Nsight Integration Overview
‣ NVIDIA Nsight Integration User Guide
Chapter 29.
TROUBLESHOOTING
Environment variable control support for Windows target trace is not available, but
there is a quick workaround:
‣ Create a batch file that sets the env vars and launches your application.
‣ Set Nsight Systems to launch the batch file as its target, i.e., set the project settings
target path to the path of the batch file.
‣ Start the trace. Nsight Systems will launch the batch file in a new cmd instance and
trace any child process it launches. In fact, it will trace the whole process tree whose
root is the cmd running your batch file.
WebGL Testing
Nsight Systems cannot profile using the default Chrome launch command. To profile
WebGL, use the following command structure:
“C:\Program Files (x86)\Google\Chrome\Application\chrome.exe”
--inprocess-gpu --no-sandbox --disable-gpu-watchdog --use-angle=gl
https://webglsamples.org/aquarium/aquarium.html
LD_PRELOAD
The first mechanism uses the LD_PRELOAD environment variable. It only works with
dynamically linked binaries, since static binaries do not invoke the runtime linker and
are therefore not affected by the LD_PRELOAD environment variable.
‣ For ARMv7 binaries, preload
/opt/nvidia/nsight_systems/libLauncher32.so
‣ Otherwise if running from host, preload
/opt/nvidia/nsight_systems/libLauncher64.so
‣ Otherwise if running from CLI, preload
[installation_directory]/libLauncher64.so
The most common way to do that is to specify the environment variable as part of the
process launch command, for example:
$ LD_PRELOAD=/opt/nvidia/nsight_systems/libLauncher64.so ./my-aarch64-binary --
arguments
When loaded, this library will send itself a SIGSTOP signal, which is equivalent to typing
Ctrl+Z in the terminal. The process is now a background job, and you can use standard
commands like jobs, fg, and bg to control it. Use jobs -l to see the PID of the
launched process.
When attaching to a stopped process, Nsight Systems will send a SIGCONT signal, which is
equivalent to using the bg command.
Launcher
The second mechanism can be used with any binary. Use
[installation_directory]/launcher to launch your application, for example:
$ /opt/nvidia/nsight_systems/launcher ./my-binary --arguments
The process will be launched, daemonized, and wait for SIGUSR1 signal. After attaching
to the process with Nsight Systems, the user needs to manually resume execution of the
process from command line:
$ pkill -USR1 launcher
Note: pkill will send the signal to any process with a matching name. If that is not
desirable, use kill to send it to a specific process. The standard output and error
streams are redirected to /tmp/stdout_<PID>.txt and /tmp/stderr_<PID>.txt.
The launcher mechanism is more complex and less automated than the LD_PRELOAD
option, but gives more control to the user.
or
error while loading shared libraries: [library_name]: cannot open shared object
file: No such file or directory
If the workload does not run when launched via Nsight Systems or the timeline is
empty, check the stderr.log and stdout.log (click on drop-down menu showing Timeline
View and click on Files) to see the errors encountered by the app.
The following ELF sections should be considered empty if they have size of 4 bytes:
.debug_frame, .eh_frame, .ARM.exidx. In this case, these sections only contain
termination records and no useful information.
For GCC, use the following compiler invocation to see which compiler flags are enabled
in your toolchain by default (for example, to check if -funwind-tables is enabled by
default):
$ gcc -Q --help=common
For GCC and Clang, add -### to the compiler invocation command to see which
compiler flags are actually being used.
Since EHABI and DWARF information is compiled on a per-unit basis (every .cpp or
.c file, as well as every static library, can be built with or without this information),
the presence of the ELF sections does not guarantee that every function has the necessary
unwind information.
Frame pointers are required by the AArch64 Procedure Call Standard. Adding frame
pointers slows down execution time, but in most cases the difference is negligible.
29.6. Logging
To enable logging on the host, refer to this config file:
host-linux-x64/nvlog.config.template
When reporting any bugs please include the build version number as described in the
Help → About dialog. If possible, attach log files and report (.nsys-rep) files, as they
already contain necessary version information.
To enable verbose logging on the target device, when launched from the host, follow
these steps:
1. Close the host application.
2. Restart the target device.
3. Place nvlog.config from host directory to the /opt/nvidia/nsight_systems
directory on target.
4. From SSH console, launch the following command:
sudo /opt/nvidia/nsight_systems/nsys --daemon --debug
5. Start the host application and connect to the target device.
Logs on the target devices are collected into this file (if enabled):
nsys.log
Run a collection; the target-linux-x64 directory should then include a file named
nsys-agent.log.
Please note that in some cases, debug logging can significantly slow down the profiler.
To enable verbose logging on the target device, when launched from the host, follow
these steps:
1. Close the host application.
2. Terminate the nsys process.
3. Place nvlog.config from host directory next to Nsight Systems Windows agent on
the target device
‣ Local Windows target:
C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.1\target-
windows-x64
‣ Remote Windows target:
C:\Users\<user name>\AppData\Local\Temp\nvidia\nsight_systems
4. Start the host application and connect to the target device.
Logs on the target devices are collected into this file (if enabled):
nsight-sys.log
Chapter 30.
OTHER RESOURCES
Looking for information to help you use Nsight Systems most effectively? Here are
some more resources you might want to review:
Training Seminars
NVIDIA Deep Learning Institute Training - Self-Paced Online Course Optimizing CUDA
Machine Learning Codes With Nsight Profiling Tools
2018 NCSA Blue Waters Webinar - Video Only Introduction to NVIDIA Nsight Systems
Blog Posts
NVIDIA developer blogs are longer-form technical pieces written by tool and
domain experts.
‣ 2021 - Optimizing DX12 Resource Uploads to the GPU Using CPU-Visible VRAM
‣ 2019 - Migrating to NVIDIA Nsight Tools from NVVP and nvprof
‣ 2019 - Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof
‣ 2019 - NVIDIA Nsight Systems Add Vulkan Support
‣ 2019 - TensorFlow Performance Logging Plugin nvtx-plugins-tf Goes Public
‣ 2020 - NVIDIA Nsight Systems in Containers and the Cloud
‣ 2020 - Understanding the Visualization of Overhead and Latency in Nsight Systems
Feature Videos
Short videos, only a minute or two, to introduce new features.
‣ OpenMP Trace Feature Spotlight
‣ Command Line Sessions Video Spotlight
‣ Direct3D11 Feature Spotlight
‣ Vulkan Trace
‣ Statistics Driven Profiling
‣ Analyzing NCCL Usage with NVIDIA Nsight Systems
Conference Presentations
‣ GTC 2022 - Killing Cloud Monsters Has Never Been Smoother
‣ GTC 2022 - Optimizing Communication with Nsight Systems Network Profiling
‣ GTC 2021 - Tuning GPU Network and Memory Usage in Apache Spark
‣ GTC 2020 - Rebalancing the Load: Profile-Guided Optimization of the NAMD
Molecular Dynamics Program for Modern GPUs using Nsight Systems
‣ GTC 2020 - Scaling the Transformer Model Implementation in PyTorch Across
Multiple Nodes
‣ GTC 2019 - Using Nsight Tools to Optimize the NAMD Molecular Dynamics
Simulation Program
‣ GTC 2019 - Optimizing Facebook AI Workloads for NVIDIA GPUs
‣ GTC 2018 - Optimizing HPC Simulation and Visualization Codes Using NVIDIA
Nsight Systems
‣ GTC 2018 - Israel - Boost DNN Training Performance using NVIDIA Tools
‣ Siggraph 2018 - Taming the Beast; Using NVIDIA Tools to Unlock Hidden GPU
Performance