Nvidia Profiling Tools Keipert 10 4 22


NVIDIA PROFILING TOOLS
KRISTOPHER KEIPERT

PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network

ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran

    std::transform(par, x, x+n, y, y,
        [=](float x, float y){ return y + a*x; }
    );

    do concurrent (i = 1:n)
      y(i) = y(i) + a*x(i)
    enddo

    import cunumeric as np
    …
    def saxpy(a, x, y):
        y[:] += a*x

INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP

    #pragma acc data copy(x,y) {
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y){
            return y + a*x;
    });
    ...
    }

    #pragma omp target data map(x,y) {
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y){
            return y + a*x;
    });
    ...
    }

PLATFORM SPECIALIZATION
CUDA

    __global__
    void saxpy(int n, float a,
               float *x, float *y) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] += a*x[i];
    }

    int main(void) {
      ...
      cudaMemcpy(d_x, x, ...);
      cudaMemcpy(d_y, y, ...);

      saxpy<<<(N+255)/256,256>>>(...);

      cudaMemcpy(y, d_y, ...);
    }
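
The CUDA launch configuration on this slide, (N+255)/256 blocks of 256 threads, is a ceiling division, and the "if (i < n)" guard protects the unused threads in the last block. The following is a minimal CPU-only sketch of that indexing in plain Python. The function name saxpy_launch and the threads_per_block parameter are illustrative assumptions, not part of the slide, and the loops only emulate what the GPU would run in parallel.

```python
def saxpy_launch(n, a, x, y, threads_per_block=256):
    """Emulate the CUDA saxpy launch on the CPU, one loop iteration per GPU thread."""
    # Ceiling division: enough blocks to cover all n elements.
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            # Same global index as blockIdx.x*blockDim.x + threadIdx.x.
            i = block_idx * threads_per_block + thread_idx
            # The last block may hold threads past the end of the arrays.
            if i < n:
                y[i] += a * x[i]
    return num_blocks


n = 1000
x = [1.0] * n
y = [2.0] * n
blocks = saxpy_launch(n, 3.0, x, y)
print(blocks, y[0], y[-1])
```

With n = 1000 the launch needs 4 blocks (1024 threads), so 24 threads fall outside the arrays and are skipped by the bounds check.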

ACCELERATION LIBRARIES
Core | Math | Communication | Data Analytics | AI | Quantum

NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud

DEVELOPMENT
▪ Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
▪ Compilers: nvcc, nvc, nvc++, nvfortran
▪ Core Libraries: libcu++, Thrust, CUB
▪ Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
▪ Communication Libraries: HPC-X (MPI, UCX, SHMEM, SHARP, HCOLL), NVSHMEM, NCCL

ANALYSIS
▪ Profilers: Nsight Systems, Nsight Compute
▪ Debugger: cuda-gdb (Host, Device)

Develop for the NVIDIA Platform: GPU, CPU and Interconnect
Libraries | Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available

DEVELOPER TOOLS
▪ Debuggers: cuda-gdb, Nsight Visual Studio Edition
▪ Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
▪ Correctness Checker: Compute Sanitizer
▪ IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition

NSIGHT TOOLS WORKFLOW

▪ Start here: Nsight Systems (comprehensive system-level performance)
▪ Dive into top CUDA kernels by using metrics/counter collection: Nsight Compute (detailed CUDA kernel performance), then re-check overall performance in Nsight Systems
▪ Dive into graphics frames: Nsight Graphics (detailed frame/render performance), then re-check overall performance in Nsight Systems

NSIGHT SYSTEMS
System Profiler
Key Features:
▪ System-wide application algorithm tuning
  ▪ Multi-process tree support
▪ Locate optimization opportunities
  ▪ Visualize millions of events on a very fast GUI timeline
  ▪ Find gaps of unused CPU and GPU time
▪ Balance your workload across multiple CPUs and GPUs
  ▪ CPU algorithms, utilization and thread state
  ▪ GPU streams, kernels, memory transfers, etc.
▪ Command Line, Standalone, IDE Integration

OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+

Docs/product: https://developer.nvidia.com/nsight-systems

[Figure: Nsight Systems timeline, annotated with: processes and threads; thread state; thread/core migration; CUDA and OpenGL API trace; cuDNN and cuBLAS trace; kernel and memory transfer activities; multi-GPU]

ZOOM/FILTER TO EXACT AREAS OF INTEREST

• Zoom in on valleys in the timeline to find gaps!

NVTX: NVIDIA TOOLS EXTENSIONS
Code Annotation API
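
As a rough illustration of code annotation, the sketch below wraps two phases of a toy program in named NVTX ranges so they would show up as labelled spans on a profiler timeline. It assumes the optional "nvtx" Python package (NVIDIA's Python bindings) and falls back to a no-op when that package is not installed. The helper annotate and the functions load_data and compute are hypothetical names used only for this example.

```python
import contextlib
import time

try:
    import nvtx  # optional: NVIDIA's NVTX Python bindings (pip install nvtx)
except ImportError:
    nvtx = None


def annotate(message, color="green"):
    """Return an NVTX range context manager, or a no-op if nvtx is missing."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(message=message, color=color)


def load_data():
    time.sleep(0.01)  # stand-in for real input work
    return list(range(1000))


def compute(values):
    return sum(v * v for v in values)


def main():
    # Each "with" block becomes one named range on the timeline.
    with annotate("load_data", color="blue"):
        values = load_data()
    with annotate("compute", color="red"):
        result = compute(values)
    return result


if __name__ == "__main__":
    print(main())
```

The ranges only become visible when the script runs under a profiler that collects NVTX, such as Nsight Systems; run on its own, the script behaves as if the annotations were not there.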

EXPERT SYSTEMS & STATISTICS
Built-in Data Analytics with Advice

MULTI-REPORT TILING
Visualize More Parallel Activity
▪ Open multiple reports
▪ Loaded on same timeline based on wall-clock

APPLICATION PROFILES WITH NSIGHT SYSTEMS

$ nsys profile -o report --stats=true ./myapp.exe

• Generated file: report.qdrep (or report.nsys-rep)
• Open for viewing in the Nsight Systems UI
• When using MPI, it is recommended to invoke nsys after mpirun/srun:

$ mpirun -n 4 nsys profile ./myapp.exe
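
When profiling runs are scripted, the two command lines above can be assembled programmatically. The sketch below is one possible way to do that in Python. It uses only the options that appear on this slide (-o, --stats=true, and mpirun -n); the function name nsys_command and its parameters are assumptions made for illustration.

```python
import shlex


def nsys_command(app, output="report", stats=True, mpi_ranks=None):
    """Build the argv list for the nsys invocations shown on the slide."""
    cmd = ["nsys", "profile"]
    if output is not None:
        cmd += ["-o", output]
    if stats:
        cmd.append("--stats=true")
    cmd.append(app)
    if mpi_ranks is not None:
        # nsys goes after the MPI launcher, so each rank runs under its own nsys.
        cmd = ["mpirun", "-n", str(mpi_ranks)] + cmd
    return cmd


print(shlex.join(nsys_command("./myapp.exe")))
print(shlex.join(nsys_command("./myapp.exe", output=None, stats=False, mpi_ranks=4)))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where nsys is installed.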
PROFILING DL MODELS

• PyTorch
o DNN layer annotations are disabled by default
o Enable them with "with torch.autograd.profiler.emit_nvtx():"
o Annotate manually with torch.cuda.nvtx.range_push / range_pop
o TensorRT backend is already annotated

• TensorFlow
o Annotated by default with NVTX in NVIDIA TF containers
o Set TF_DISABLE_NVTX_RANGES=1 to disable for production
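
The PyTorch options above might be combined as in the following sketch: emit_nvtx turns on the automatic layer annotations, and range_push / range_pop add one manual range around the forward pass. The model, the tensor sizes, and the function name profiled_forward are placeholders chosen for illustration. The code does nothing (returns None) when PyTorch or a CUDA device is not available.

```python
try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def profiled_forward(batch=8, features=16):
    """Run one annotated forward pass; return None without a CUDA-enabled PyTorch."""
    if torch is None or not torch.cuda.is_available():
        return None
    model = torch.nn.Linear(features, features).cuda()
    data = torch.randn(batch, features, device="cuda")
    # emit_nvtx makes autograd wrap each operator in an NVTX range.
    with torch.autograd.profiler.emit_nvtx():
        torch.cuda.nvtx.range_push("forward")  # manual range around the step
        out = model(data)
        torch.cuda.nvtx.range_pop()
    torch.cuda.synchronize()
    return tuple(out.shape)


if __name__ == "__main__":
    print(profiled_forward())
```

The ranges are only recorded when the script is launched under a profiler, for example with the nsys profile command shown earlier in the deck.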

NSIGHT COMPUTE
Kernel Profiling Tool

Key Features:
▪ Interactive CUDA API debugging and kernel profiling
▪ Built-in rules expertise
▪ Fully customizable data collection and display
▪ Command Line, Standalone, IDE Integration, Remote Targets

OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX (host only)
GPUs: Volta+

Docs/product: https://developer.nvidia.com/nsight-compute

[Figures: Nsight Compute report views, annotated with:]
▪ Targeted metric sections
▪ Customizable data collection and presentation
▪ Built-in expertise for Guided Analysis and optimization
▪ Visual memory analysis chart
▪ Metrics for peak performance ratios
▪ Source/PTX/SASS analysis and correlation
▪ Source metrics per instruction
▪ Metric heatmap to quickly identify hotspots

OCCUPANCY CALCULATOR
Model Hardware Usage and Identify Limiters

▪ Model theoretical hardware usage
▪ Understand limitations from hardware vs. kernel parameters
▪ Configure model to vary HW and kernel parameters
▪ Opened from an existing report or as a new activity
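
To make "theoretical hardware usage" concrete, here is a simplified sketch of the kind of model an occupancy calculator evaluates: how many thread blocks fit on one SM given the warp, block, register, and shared-memory limits, and which limit binds first. The per-SM limits in SMLimits are assumed values for illustration, not figures from the slides, and the real tool accounts for more allocation details than this sketch does. Check the limits for your own GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM limits. These defaults are ASSUMED values for illustration only."""
    warp_size: int = 32
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    reg_alloc_unit: int = 256          # registers are allocated per warp in chunks
    shared_mem_bytes: int = 164 * 1024


def ceil_div(a, b):
    return (a + b - 1) // b


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          sm=SMLimits()):
    """Return (occupancy, limiter, resident_blocks) for one SM."""
    warps_per_block = ceil_div(threads_per_block, sm.warp_size)
    # Blocks per SM allowed by each resource, considered independently.
    limits = {"warps": sm.max_warps // warps_per_block, "blocks": sm.max_blocks}
    if regs_per_thread > 0:
        regs_per_warp = ceil_div(regs_per_thread * sm.warp_size,
                                 sm.reg_alloc_unit) * sm.reg_alloc_unit
        limits["registers"] = (sm.registers // regs_per_warp) // warps_per_block
    if smem_per_block > 0:
        limits["shared memory"] = sm.shared_mem_bytes // smem_per_block
    limiter = min(limits, key=limits.get)   # the tightest resource
    blocks = limits[limiter]
    occupancy = blocks * warps_per_block / sm.max_warps
    return occupancy, limiter, blocks


for config in [(256, 32, 0), (256, 64, 0), (128, 32, 48 * 1024)]:
    print(config, theoretical_occupancy(*config))
```

Under these assumed limits, doubling register use from 32 to 64 per thread halves the theoretical occupancy of a 256-thread block, which is the hardware-versus-kernel-parameter trade-off the calculator is meant to expose.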
HIERARCHICAL ROOFLINE

▪ Visualize multiple levels of the memory hierarchy
▪ Identify bottlenecks caused by memory limitations
▪ Determine how modifying algorithms may (or may not) impact performance
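
The roofline idea can be sketched in a few lines: at each memory level, attainable performance is bounded by the smaller of the compute peak and arithmetic intensity times that level's bandwidth. The peak and bandwidth numbers below are hypothetical placeholders, not measurements of any real GPU, and the names attainable_gflops and hierarchical_roofline are assumptions for this example.

```python
# Hypothetical machine parameters, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 10000.0
BANDWIDTH_GBS = {"L1": 10000.0, "L2": 4000.0, "DRAM": 1500.0}


def attainable_gflops(arith_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute roof, intensity * bandwidth)."""
    return min(peak_gflops, arith_intensity * bandwidth_gbs)


def hierarchical_roofline(flops, bytes_moved):
    """Evaluate the bound at every memory level.

    bytes_moved maps each level name to the bytes transferred at that level,
    so each level gets its own arithmetic intensity (flops per byte).
    """
    bounds = {}
    for level, bandwidth in BANDWIDTH_GBS.items():
        intensity = flops / bytes_moved[level]
        bounds[level] = attainable_gflops(intensity, PEAK_GFLOPS, bandwidth)
    return bounds


# saxpy in float32: 2 flops per element, 12 bytes per element (read x, read y, write y).
# Assumes no reuse, so the same traffic is seen at every level.
n = 1_000_000
traffic = {level: 12 * n for level in BANDWIDTH_GBS}
bounds = hierarchical_roofline(2 * n, traffic)
bottleneck = min(bounds, key=bounds.get)
print(bounds, bottleneck)
```

With these placeholder numbers saxpy is bound by DRAM bandwidth far below the compute roof, which suggests that reducing arithmetic work alone would not speed it up.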
KERNEL PROFILES WITH NSIGHT COMPUTE

$ ncu -k mykernel -o report ./myapp.exe

• Generated file: report.ncu-rep
• Open for viewing in the Nsight Compute UI
• (Without the -k option, Nsight Compute will profile every kernel, which can take a long time)
CUDA-GDB
Command-Line and IDE Back-End Debugger

▪ Unified CPU and CUDA Debugging
▪ CUDA-C/SASS support
▪ Built on GDB and uses many of the same CLI commands
▪ Local/Remote connection support
COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues

▪ Compute Sanitizer checks correctness issues via sub-tools:
  ▪ Memcheck – Memory access error and leak detection tool.
  ▪ Racecheck – Shared memory data access hazard detection tool.
  ▪ Initcheck – Uninitialized device global memory access detection tool.
  ▪ Synccheck – Thread synchronization hazard detection tool.

▪ https://github.com/NVIDIA/compute-sanitizer-samples
NSIGHT VISUAL STUDIO CODE EDITION

Visual Studio Code extension that provides:

• CUDA code syntax highlighting
• CUDA code completion
• Build warnings/errors
• Debug CPU & GPU code
• Remote connection support via SSH
• Available on the VS Code Marketplace now!

[Figure: debugger screenshot, annotated with: Variables view; CPU & GPU registers; Exec debugger commands; CUDA Call Stack; Watch CPU & GPU vars; Session status; CUDA focus]

https://developer.nvidia.com/nsight-visual-studio-code-edition
ADDITIONAL RESOURCES

▪ Sessions
  ▪ A41100 - CUDA: New Features and Beyond
  ▪ A41131 - Developing Efficient CUDA Kernels for Fourth-Generation Tensor Cores
▪ Labs
  ▪ DLIT41277 - Optimizing CUDA Machine Learning Codes with Nsight Profiling Tools
  ▪ DLIT41274 - Debugging and Analyzing Correctness of CUDA Applications
  ▪ DLIT41276 - Developer Tools Fundamentals for Ray Tracing using NVIDIA Nsight Graphics and NVIDIA Nsight Systems
▪ Ampere Architecture Detailed Blog
  ▪ NVIDIA Ampere Architecture In-Depth
▪ Developer Tools are free and packaged in the latest version of the CUDA Toolkit
  ▪ https://developer.nvidia.com/cuda-downloads
▪ Support is available via:
  ▪ https://forums.developer.nvidia.com/c/development-tools/
▪ More information at:
  ▪ https://developer.nvidia.com/tools-overview
HANDS-ON

/lus/eagle/projects/SDL_Workshop/jacobi

▪ Solving Laplace Equation with Jacobi Iterations
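
For readers without access to the workshop directory, the following is a minimal pure-Python sketch of Jacobi iteration for the 2D Laplace equation. It is not the workshop code: the function names, grid size, boundary condition, and tolerance are all assumptions chosen for illustration. Each sweep replaces every interior point with the average of its four neighbours until the largest change drops below a tolerance.

```python
def make_grid(n):
    """n x n grid whose boundary follows u = x (column / (n - 1)); interior starts at 0."""
    grid = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in (0, n - 1) or j in (0, n - 1):
                grid[i][j] = j / (n - 1)
    return grid


def jacobi_laplace(grid, tol=1e-10, max_iters=10_000):
    """Iterate until the largest point-wise update is below tol."""
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for iteration in range(1, max_iters + 1):
        # Jacobi reads only the old grid, so every point can be updated independently.
        new = [row[:] for row in current]
        max_change = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                new[i][j] = 0.25 * (current[i - 1][j] + current[i + 1][j]
                                    + current[i][j - 1] + current[i][j + 1])
                max_change = max(max_change, abs(new[i][j] - current[i][j]))
        current = new
        if max_change < tol:
            return current, iteration
    return current, max_iters


if __name__ == "__main__":
    solution, iterations = jacobi_laplace(make_grid(8))
    print(iterations, [round(v, 4) for v in solution[4]])
```

Because every point update depends only on the previous sweep, the two inner loops are the natural candidates for a GPU kernel, and the convergence loop is where a profiler timeline would show repeated kernel launches.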

WWW.OPENHACKATHONS.ORG
