Nvidia Profiling Tools Keipert 10 4 22

https://developer.nvidia.com/compute-profiler-assistant

NSIGHT COMPUTE WORKFLOW
1. Profile application
2. Analyze kernel timeline
3. Dive into kernel details
   - Instructions
   - Occupancy
   - Register usage
   - Memory usage
   - Latency analysis
4. Repeat with optimizations
   - Re-profile
   - Compare results
5. Share reports
6. Get expert advice

NSIGHT COMPUTE UI
Key Features:
- Kernel timeline
- Call stack
- Occupancy
- Instructions
- Register usage
- Memory usage
- Latency analysis
- Customizable views
- Compare profiles

Uploaded by

TJK001
NVIDIA PROFILING TOOLS

KRISTOPHER KEIPERT
PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network

ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran

    std::transform(par, x, x+n, y, y,
        [=](float x, float y){ return y + a*x; }
    );

    do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
    enddo

    import cunumeric as np
    ...
    def saxpy(a, x, y):
        y[:] += a*x

INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP

    #pragma acc data copy(x,y) {
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y){
            return y + a*x;
    });
    ...
    }

    #pragma omp target data map(x,y) {
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y){
            return y + a*x;
    });
    ...
    }

PLATFORM SPECIALIZATION
CUDA

    __global__
    void saxpy(int n, float a,
               float *x, float *y) {
        int i = blockIdx.x*blockDim.x +
                threadIdx.x;
        if (i < n) y[i] += a*x[i];
    }

    int main(void) {
        ...
        cudaMemcpy(d_x, x, ...);
        cudaMemcpy(d_y, y, ...);

        saxpy<<<(N+255)/256,256>>>(...);

        cudaMemcpy(y, d_y, ...);
    }

ACCELERATION LIBRARIES
Core | Math | Communication | Data Analytics | AI | Quantum
NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud

DEVELOPMENT
▪ Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
▪ Compilers: nvcc, nvc, nvc++, nvfortran
▪ Core Libraries: libcu++, Thrust, CUB
▪ Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
▪ Communication Libraries: HPC-X (MPI, UCX, SHMEM, SHARP, HCOLL), NVSHMEM, NCCL

ANALYSIS
▪ Profilers: Nsight Systems, Nsight Compute
▪ Debugger: cuda-gdb (Host, Device)

Develop for the NVIDIA Platform: GPU, CPU and Interconnect
Libraries | Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
DEVELOPER TOOLS
▪ Debuggers: cuda-gdb, Nsight Visual Studio Edition
▪ Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
▪ Correctness Checker: Compute Sanitizer
▪ IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
NSIGHT TOOLS WORKFLOW

▪ Start here: Nsight Systems (comprehensive system-level performance)
▪ Dive into top CUDA kernels by using metrics/counter collection: Nsight Compute (detailed CUDA kernel performance)
▪ Dive into graphics frames: Nsight Graphics (detailed frame/render performance)
▪ After either deep dive, return to Nsight Systems to re-check overall performance
NSIGHT SYSTEMS
System Profiler
Key Features:
▪ System-wide application algorithm tuning
  ▪ Multi-process tree support
▪ Locate optimization opportunities
  ▪ Visualize millions of events on a very fast GUI timeline
  ▪ Or spot gaps of unused CPU and GPU time
▪ Balance your workload across multiple CPUs and GPUs
  ▪ CPU algorithms, utilization and thread state
  ▪ GPU streams, kernels, memory transfers, etc.
▪ Command Line, Standalone, IDE Integration

OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+

Docs/product: https://developer.nvidia.com/nsight-systems
[Figure: Nsight Systems timeline. Callouts: processes and threads; thread/core migration; thread state; CUDA and OpenGL API trace; cuDNN and cuBLAS trace; kernel and memory transfer activities; multi-GPU]
ZOOM/FILTER TO EXACT AREAS OF INTEREST

• Zoom in valleys to find gaps!

NVTX: NVIDIA TOOLS EXTENSIONS
Code Annotation API
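
A minimal sketch of what code annotation can look like from Python, assuming the optional `nvtx` package (NVIDIA's NVTX Python bindings) is installed. The `annotate` wrapper and the `load_data`, `compute`, and `run` functions are hypothetical names chosen for illustration; they are not from the slides. When `nvtx` is missing, the wrapper falls back to a no-op so the program still runs unchanged.

```python
import contextlib

try:
    import nvtx  # optional dependency: NVTX Python bindings
except ImportError:
    nvtx = None


def annotate(name):
    """Return a context manager that opens a named NVTX range if nvtx is available."""
    if nvtx is None:
        # Assumption for this sketch: silently skip annotation when nvtx is absent.
        return contextlib.nullcontext()
    return nvtx.annotate(name)


def load_data(n):
    # Hypothetical phase 1: appears as a "load_data" range on the profiler timeline.
    with annotate("load_data"):
        return list(range(n))


def compute(values):
    # Hypothetical phase 2: appears as a "compute" range.
    with annotate("compute"):
        return sum(v * v for v in values)


def run(n=1000):
    # Ranges nest, so "run" encloses the two phases above.
    with annotate("run"):
        return compute(load_data(n))


if __name__ == "__main__":
    print(run())
```

When the script is launched under Nsight Systems, the named ranges should show up alongside the CPU and GPU rows, which makes it easier to relate timeline gaps to phases of the program.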

EXPERT SYSTEMS & STATISTICS
Built-in Data Analytics with Advice
MULTI-REPORT TILING
Visualize More Parallel Activity

▪ Open multiple reports
▪ Loaded on same timeline based on wall-clock
APPLICATION PROFILES WITH NSIGHT SYSTEMS

$ nsys profile -o report --stats=true ./myapp.exe

• Generated file: report.qdrep (or report.nsys-rep)
• Open for viewing in the Nsight Systems UI

• When using MPI, it is recommended to place nsys after mpirun/srun:

$ mpirun -n 4 nsys profile ./myapp.exe
PROFILING DL MODELS

• PyTorch
o DNN layer annotations are disabled by default
o Enable with "with torch.autograd.profiler.emit_nvtx():"
o Annotate manually with torch.cuda.nvtx.range_push / torch.cuda.nvtx.range_pop
o TensorRT backend is already annotated

• TensorFlow
o Annotated by default with NVTX in NVIDIA TF containers
o Set TF_DISABLE_NVTX_RANGES=1 to disable for production
NSIGHT COMPUTE
Kernel Profiling Tool

Key Features:
▪ Interactive CUDA API debugging and kernel profiling
▪ Built-in rules expertise
▪ Fully customizable data collection and display
▪ Command Line, Standalone, IDE Integration, Remote Targets

OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX (host only)
GPUs: Volta+

Docs/product: https://developer.nvidia.com/nsight-compute
[Figures: Nsight Compute UI. Callouts:]
▪ Targeted metric sections
▪ Customizable data collection and presentation
▪ Built-in expertise for Guided Analysis and optimization
▪ Visual memory analysis chart
▪ Metrics for peak performance ratios
▪ Source/PTX/SASS analysis and correlation
▪ Source metrics per instruction
▪ Metric heatmap to quickly identify hotspots
OCCUPANCY CALCULATOR
Model Hardware Usage and Identify Limiters

▪ Model theoretical hardware usage
▪ Understand limitations from hardware vs. kernel parameters
▪ Configure model to vary HW and kernel parameters
▪ Opened from an existing report or as a new activity
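
To make the idea of a limiter concrete, here is a simplified sketch of the arithmetic behind theoretical occupancy. It is not the Nsight Compute calculator: it ignores register and shared-memory allocation granularity, and the per-SM limits in `SMLimits` are assumed values for an A100-class GPU that should be checked against the actual device. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed per-SM hardware limits (A100-class values); verify for your GPU.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 164 * 1024
    warp_size: int = 32


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          sm=SMLimits()):
    """Return (occupancy, limiter) for one kernel launch configuration.

    Simplified model: each resource caps the number of resident blocks per SM,
    and the smallest cap is the limiter.
    """
    warps_per_block = -(-threads_per_block // sm.warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size

    blocks_by = {
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
        "registers": sm.registers // regs_per_block if regs_per_block else sm.max_blocks,
        "shared_memory": (sm.shared_mem_bytes // smem_per_block
                          if smem_per_block else sm.max_blocks),
    }
    limiter = min(blocks_by, key=blocks_by.get)
    active_warps = blocks_by[limiter] * warps_per_block
    return active_warps / sm.max_warps, limiter


if __name__ == "__main__":
    # Varying kernel parameters shows which resource becomes the limiter.
    for regs, smem in [(32, 0), (64, 0), (32, 48 * 1024)]:
        occ, limiter = theoretical_occupancy(256, regs, smem)
        print(f"regs={regs} smem={smem}: occupancy={occ:.3f}, limited by {limiter}")
```

Under these assumptions, doubling registers per thread from 32 to 64 halves the number of resident blocks, which is the kind of trade-off the calculator lets you explore interactively.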
HIERARCHICAL ROOFLINE

▪ Visualize multiple levels of the memory hierarchy
▪ Identify bottlenecks caused by memory limitations
▪ Determine how modifying algorithms may (or may not) impact performance
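
The roofline model behind this chart can be sketched in a few lines: at each memory level, attainable performance is the smaller of peak compute and arithmetic intensity times that level's bandwidth. The machine numbers below (`PEAK_GFLOPS`, `BANDWIDTH_GBS`) are hypothetical placeholders, not measurements of any GPU, and the function name is an assumption made for illustration.

```python
def roofline(flops, bytes_per_level, peak_gflops, bandwidth_gbs):
    """Return {level: (arithmetic_intensity, attainable_gflops, bound)}.

    flops            -- floating-point operations performed by the kernel
    bytes_per_level  -- bytes moved at each memory level, e.g. {"L2": ..., "DRAM": ...}
    peak_gflops      -- peak compute throughput in GFLOP/s
    bandwidth_gbs    -- peak bandwidth per memory level in GB/s
    """
    report = {}
    for level, nbytes in bytes_per_level.items():
        intensity = flops / nbytes                   # FLOP per byte at this level
        ceiling = intensity * bandwidth_gbs[level]   # bandwidth-limited GFLOP/s
        attainable = min(peak_gflops, ceiling)
        bound = "memory" if ceiling < peak_gflops else "compute"
        report[level] = (intensity, attainable, bound)
    return report


# Hypothetical machine numbers, for illustration only.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GBS = {"L2": 4_000.0, "DRAM": 1_500.0}

# SAXPY on 32-bit floats: 2 FLOP per element, 12 bytes per element
# (read x, read y, write y). The same traffic is assumed at both levels.
saxpy = roofline(2.0, {"L2": 12.0, "DRAM": 12.0}, PEAK_GFLOPS, BANDWIDTH_GBS)

if __name__ == "__main__":
    for level, (ai, gflops, bound) in saxpy.items():
        print(f"{level}: AI={ai:.3f} FLOP/B, attainable={gflops:.1f} GFLOP/s, {bound}-bound")
```

With these placeholder numbers SAXPY sits far below the compute ceiling at every level, so changes that do not reduce memory traffic would not be expected to help.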
KERNEL PROFILES WITH NSIGHT COMPUTE

$ ncu -k mykernel -o report ./myapp.exe

• Generated file: report.ncu-rep
• Open for viewing in the Nsight Compute UI

• (Without the -k option, Nsight Compute will profile every kernel, which can take a long time)
CUDA-GDB
Command-Line and IDE Back-End Debugger

▪ Unified CPU and CUDA debugging
▪ CUDA-C/SASS support
▪ Built on GDB and uses many of the same CLI commands
▪ Local/Remote connection support
COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues

▪ Compute Sanitizer checks correctness issues via sub-tools:
  ▪ Memcheck: memory access error and leak detection tool.
  ▪ Racecheck: shared memory data access hazard detection tool.
  ▪ Initcheck: uninitialized device global memory access detection tool.
  ▪ Synccheck: thread synchronization hazard detection tool.

▪ https://github.com/NVIDIA/compute-sanitizer-samples
NSIGHT VISUAL STUDIO CODE EDITION

Visual Studio Code extension that provides:

• CUDA code syntax highlighting
• CUDA code completion
• Build warnings/errors
• Debug CPU & GPU code
• Remote connection support via SSH
• Available on the VS Code Marketplace now!

[Figure: Nsight Visual Studio Code Edition debug session. Callouts: Variables view; CPU & GPU registers; Exec debugger commands; CUDA Call Stack; Watch CPU & GPU vars; Session status; CUDA focus]

https://developer.nvidia.com/nsight-visual-studio-code-edition
ADDITIONAL RESOURCES

▪ Sessions
▪ A41100 - CUDA: New Features and Beyond
▪ A41131 - Developing Efficient CUDA Kernels for Fourth-Generation Tensor Cores
▪ Labs
▪ DLIT41277 - Optimizing CUDA Machine Learning Codes with Nsight Profiling Tools
▪ DLIT41274 - Debugging and Analyzing Correctness of CUDA Applications
▪ DLIT41276 - Developer Tools Fundamentals for Ray Tracing using NVIDIA Nsight Graphics and NVIDIA Nsight Systems
▪ Ampere Architecture Detailed Blog
▪ NVIDIA Ampere Architecture In-Depth
▪ Developer Tools are free and packaged in the latest version of the CUDA Toolkit
▪ https://developer.nvidia.com/cuda-downloads
▪ Support is available via:
▪ https://forums.developer.nvidia.com/c/development-tools/
▪ More information at:
▪ https://developer.nvidia.com/tools-overview
HANDS-ON

/lus/eagle/projects/SDL_Workshop/jacobi

▪ Solving Laplace Equation with Jacobi Iterations
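
For orientation, the numerical method named above can be sketched in plain Python. This is an independent illustration, not the workshop code stored at the path above: the function names, grid size, boundary values, and tolerance are all assumptions. Each Jacobi sweep replaces every interior point with the average of its four neighbours until the largest change drops below a tolerance.

```python
def make_grid(n, top=1.0):
    """Build an n x n grid of zeros with the top boundary row held at `top`."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [top] * n
    return grid


def jacobi(grid, tol=1e-4, max_iters=10_000):
    """Solve Laplace's equation on `grid` with fixed boundary values.

    Returns (solution, iterations). Boundary rows and columns are never modified.
    """
    rows, cols = len(grid), len(grid[0])
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new = [row[:] for row in grid]  # Jacobi reads old values, writes a new grid
        max_change = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                    grid[i][j - 1] + grid[i][j + 1])
                max_change = max(max_change, abs(new[i][j] - grid[i][j]))
        grid = new
        if max_change < tol:
            break
    return grid, iterations


if __name__ == "__main__":
    solution, iters = jacobi(make_grid(16))
    print(f"converged after {iters} iterations; centre value {solution[8][8]:.4f}")
```

The doubly nested update loop is the part that a GPU port would typically turn into a kernel, which makes it a natural first target to look for in a profile.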


WWW.OPENHACKATHONS.ORG
