Pytorch Performance Tuning Guide: Szymon Migacz, 04/12/2021

This document provides techniques for optimizing PyTorch training performance, including enabling asynchronous data loading and augmentation, disabling bias for convolutions followed by batch norm, efficiently setting gradients to zero, and disabling gradient calculation for inference. It also discusses optimizations specific to GPUs like using mixed precision, enabling the cuDNN autotuner, avoiding CPU-GPU synchronization, and load balancing workload across multiple GPUs.


PYTORCH PERFORMANCE TUNING GUIDE
Szymon Migacz, 04/12/2021

CONTENT
PyTorch Performance Tuning Guide
• Simple techniques to improve training performance
• Implement by changing a few lines of code

GENERAL OPTIMIZATIONS

ENABLE ASYNC DATA LOADING & AUGMENTATION

PyTorch DataLoader supports asynchronous data loading / augmentation.
Default settings: num_workers=0, pin_memory=False
• Use num_workers > 0 to enable asynchronous data processing
• Use pin_memory=True

Example: PyTorch MNIST example, DataLoader with default {'num_workers': 1, 'pin_memory': True}.

Setting for the training DataLoader         Time for one training epoch
{'num_workers': 0, 'pin_memory': False}     8.2 s
{'num_workers': 1, 'pin_memory': False}     6.75 s
{'num_workers': 1, 'pin_memory': True}      6.7 s
{'num_workers': 2, 'pin_memory': True}      4.2 s
{'num_workers': 4, 'pin_memory': False}     4.5 s
{'num_workers': 4, 'pin_memory': True}      4.1 s
{'num_workers': 8, 'pin_memory': True}      4.5 s

PyTorch 1.6, NVIDIA Quadro RTX 8000
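
A minimal sketch of the DataLoader settings discussed above. The helper name make_train_loader and the random stand-in dataset are hypothetical, used only for illustration. pin_memory is switched on only when a CUDA device is present, since pinned host memory only helps host-to-GPU copies.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_train_loader(dataset, batch_size=64, num_workers=4):
    """Build a DataLoader that loads batches in background worker processes."""
    use_cuda = torch.cuda.is_available()
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,  # > 0 enables asynchronous loading
        pin_memory=use_cuda,      # pinned host memory for faster GPU copies
    )


# Hypothetical stand-in for MNIST: 256 random 1x28x28 images with labels.
images = torch.randn(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(images, labels)
```

The timings in the table suggest that the best num_workers value depends on the machine, so it is worth timing one epoch for a few values rather than assuming that more workers is always faster.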

DISABLE BIAS FOR CONVOLUTIONS DIRECTLY FOLLOWED BY A BATCH NORM

Before:
    ...
    nn.Conv2d(..., bias=True, ...)
    nn.BatchNorm2d()
    ...

After:
    ...
    nn.Conv2d(..., bias=False, ...)
    nn.BatchNorm2d()
    ...

Also applicable to Conv1d, Conv3d if BatchNorm normalizes on the same dimension as convolution's bias.
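
The following sketch illustrates why the bias can be dropped: in training mode, batch norm subtracts the per-channel batch mean, which cancels any constant per-channel offset added by the convolution. The layer sizes and variable names are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two convolutions with identical weights; only one of them has a bias.
conv_with_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_no_bias.weight.copy_(conv_with_bias.weight)

bn = nn.BatchNorm2d(8)  # modules start in training mode, so batch statistics are used
x = torch.randn(16, 3, 32, 32)

out_with_bias = bn(conv_with_bias(x))
out_no_bias = bn(conv_no_bias(x))

# The outputs agree up to floating-point rounding: the bias was redundant.
max_diff = (out_with_bias - out_no_bias).abs().max().item()
```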

EFFICIENTLY SET GRADIENTS TO ZERO

Less efficient:
    model.zero_grad()
    # or
    optimizer.zero_grad()

• executes memset for every parameter in the model
• backward pass updates gradients with "+=" operator (read + write)

More efficient:
    for param in model.parameters():
        param.grad = None
    # or (in PyT >= 1.7)
    model.zero_grad(set_to_none=True)

• doesn't execute memset for every parameter
• memory is zeroed-out by the allocator in a more efficient way
• backward pass updates gradients with "=" operator (write)
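
As a rough illustration, the set_to_none variant slots into a training loop as shown below. The model, data, and learning rate are placeholders chosen for the sketch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(4, 10)
targets = torch.randn(4, 2)

for _ in range(3):
    # Drop the gradient tensors instead of filling them with zeros.
    optimizer.zero_grad(set_to_none=True)
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()   # gradients are written fresh rather than accumulated
    optimizer.step()

# After clearing, the .grad attributes are None rather than zero tensors.
optimizer.zero_grad(set_to_none=True)
grads_are_none = all(p.grad is None for p in model.parameters())
```

One consequence worth keeping in mind: code that reads param.grad between the reset and the next backward pass has to handle None.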

DISABLE GRADIENT CALCULATION FOR INFERENCE

# torch.no_grad() as a context manager:

with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:

@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output

FUSE POINTWISE OPERATIONS
PyTorch JIT can fuse pointwise operations into a single CUDA kernel.
Unfused pointwise operations are memory-bound; for each unfused op PyTorch:
• launches a separate kernel
• loads data from global memory
• performs computation
• stores results back into global memory

Example:

def gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

Function name     Number of CUDA kernels launched     Execution time [us] (input vector with 1M elements)
gelu(x)           5                                   65
fused_gelu(x)     1                                   16

PyTorch 1.6, NVIDIA Quadro RTX 8000

GPU SPECIFIC OPTIMIZATIONS

USE MIXED PRECISION AND AMP

• Set sizes to multiples of 8
  • See Deep Learning Performance Documentation for more details and guidelines specific to layer type
  • Use explicit padding when necessary (e.g. vocabulary size in NLP)
• Enable AMP
  • Introduction to Mixed Precision Training and AMP: video, slides
  • Native PyTorch AMP is available starting from PyTorch 1.6: documentation, examples, tutorial
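
A minimal sketch of both recommendations, under stated assumptions: pad_to_multiple is a hypothetical helper, the vocabulary size 50257 is an arbitrary example value, and the tiny linear model stands in for a real network. The sketch uses torch.autocast, which newer PyTorch releases provide; the PyTorch 1.6 API exposes the same context manager as torch.cuda.amp.autocast. On a machine without CUDA, autocast and the gradient scaler are disabled and the loop runs in full precision.

```python
import torch
import torch.nn as nn


def pad_to_multiple(size, multiple=8):
    """Round size up to the next multiple (hypothetical helper)."""
    return ((size + multiple - 1) // multiple) * multiple


use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

vocab_size = pad_to_multiple(50257)  # explicit padding of an example vocabulary size
model = nn.Linear(64, vocab_size).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # loss scaling for FP16 gradients

inputs = torch.randn(16, 64, device=device)
targets = torch.randint(0, vocab_size, (16,), device=device)

for _ in range(2):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass and loss run in mixed precision when CUDA is available.
    with torch.autocast(device_type=device.type, enabled=use_cuda):
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()                # adjusts the scale factor for the next iteration
```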

ENABLE cuDNN AUTOTUNER
For convolutional neural networks, enable cuDNN autotuner by setting:

    torch.backends.cudnn.benchmark = True

• cuDNN supports many algorithms to compute convolution
• autotuner runs a short benchmark and selects algorithm with the best performance

Example:
nn.Conv2d with 64 3x3 filters applied to an input with batch size = 32, channels = width = height = 64.

Setting                                        cudnn.benchmark = False (the default)     cudnn.benchmark = True     Speedup
Forward propagation (FP32) [us]                1430                                      840                        1.70
Forward + backward propagation (FP32) [us]     2870                                      2260                       1.27

PyTorch 1.6, NVIDIA Quadro RTX 8000

CREATE TENSORS DIRECTLY ON TARGET DEVICE

Instead of:
    torch.rand(size).cuda()

Use:
    torch.rand(
        size,
        device=torch.device('cuda'),
    )

Also applicable to:
torch.empty(), torch.zeros(), torch.full(), torch.ones(),
torch.eye(), torch.randint(), torch.randn()
and similar functions.

AVOID CPU-GPU SYNC

Operations which require synchronization:
• print(cuda_tensor)
• cuda_tensor.item()
• memory copies: tensor.cuda(), cuda_tensor.cpu() and tensor.to(device) calls
• cuda_tensor.nonzero()
• Python control flow which depends on operations on CUDA tensors, e.g.
      if (cuda_tensor != 0).all()
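
One common place where such synchronization creeps in is loss logging. The sketch below contrasts calling .item() on every step with accumulating on the device and reading the value back once. The list of random scalars is a stand-in for per-step losses, and the code falls back to the CPU when no GPU is present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the scalar losses produced by 100 training steps.
step_losses = [torch.rand((), device=device) for _ in range(100)]

# Synchronizing pattern: every .item() copies a value to the host
# and forces the CPU to wait for the GPU.
total_with_sync = 0.0
for loss in step_losses:
    total_with_sync += loss.item()

# Deferred pattern: accumulate on the device, read the value back once.
running_total = torch.zeros((), device=device)
for loss in step_losses:
    running_total += loss.detach()
total_deferred = running_total.item()  # single synchronization point
```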

DISTRIBUTED OPTIMIZATIONS

USE EFFICIENT MULTI-GPU BACKEND

[Diagram: with DataParallel, CPU core 0 drives GPU 0 to GPU 3; with DistributedDataParallel, CPU core N drives GPU N for N = 0 to 3.]

DataParallel
• 1 CPU core drives multiple GPUs
• 1 python process drives multiple GPUs (GIL)
• only up to a single node

DistributedDataParallel
• 1 CPU core for each GPU
• 1 python process for each GPU
• single-node and multi-node (same API)
• efficient implementation:
  • automatic bucketing for grad all-reduce
  • all-reduce overlapped with backward pass
• multi-process programming
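
The per-process structure of DistributedDataParallel in the comparison above can be sketched as follows. This is an illustrative outline, not a complete training script: train_worker, the toy model, the step count, and the default address and port are all assumptions, and the function has to be started once per GPU by a launcher.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank, world_size, steps=2):
    """Runs in one process per GPU (hypothetical worker function)."""
    # Rendezvous settings; these defaults are assumptions for a single node.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    use_cuda = torch.cuda.is_available()
    dist.init_process_group("nccl" if use_cuda else "gloo",
                            rank=rank, world_size=world_size)
    try:
        device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
        if use_cuda:
            torch.cuda.set_device(device)
        model = nn.Linear(16, 4).to(device)
        # DDP buckets gradients and overlaps all-reduce with the backward pass.
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        for _ in range(steps):
            optimizer.zero_grad(set_to_none=True)
            x = torch.randn(8, 16, device=device)
            loss = ddp_model(x).pow(2).mean()
            loss.backward()  # gradient all-reduce happens during this call
            optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

One way to launch it is torch.multiprocessing.spawn(train_worker, args=(world_size,), nprocs=world_size), which passes the process index as the first argument. In a real job each rank would also read a distinct shard of the data, for example through a DistributedSampler.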

LOAD-BALANCE WORKLOAD ON MULTIPLE GPUs
Gradient all-reduce after backward pass is a synchronization point in a multi-GPU setting

[Figure: timeline for GPU 0 to GPU 3 with the phases Forward, Backward, All-reduce, and Optimizer. A second version of the timeline, for an unbalanced workload, marks idle periods on some GPUs around the all-reduce.]
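
One possible way to even out per-GPU work, sketched here as an assumption rather than as the method from the slides, is to assign samples to workers so that each worker receives a similar total amount of work. The example uses sequence length as the cost of a sample and a greedy longest-first assignment; balance_by_length and the length values are hypothetical.

```python
import heapq


def balance_by_length(lengths, num_workers):
    """Greedy longest-first assignment; returns sample indices per worker."""
    heap = [(0, w) for w in range(num_workers)]  # (assigned load, worker id)
    heapq.heapify(heap)
    shards = [[] for _ in range(num_workers)]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        load, worker = heapq.heappop(heap)  # least-loaded worker so far
        shards[worker].append(idx)
        heapq.heappush(heap, (load + lengths[idx], worker))
    return shards


# Hypothetical sequence lengths for one global batch of 8 samples.
lengths = [512, 480, 64, 60, 256, 250, 128, 120]
shards = balance_by_length(lengths, num_workers=4)
loads = [sum(lengths[i] for i in shard) for shard in shards]
# loads == [512, 480, 440, 438]
```

For comparison, splitting the same list into consecutive pairs gives loads of 992, 124, 506, and 248, so three workers would wait for the first one at the all-reduce. The greedy split hands out a different number of samples per worker, which a real input pipeline would need to account for.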
SUMMARY

General optimizations:
• Use asynchronous data loading
• Disable bias for convolutions directly followed by batch norm
• Efficiently set gradients to zero
• Disable gradient calculation for validation/inference
• Fuse pointwise operations with PyTorch JIT

GPU specific optimizations:
• Use mixed precision and AMP
• Enable cuDNN autotuner
• Create tensors directly on a GPU
• Avoid CPU-GPU sync

Distributed optimizations:
• Use DistributedDataParallel
• Load-balance workload on all GPUs
ADDITIONAL RESOURCES

• PyTorch Tutorial: Performance Tuning Guide
  • Check for more optimizations
• NVIDIA Deep Learning Performance Documentation
• Introduction to Mixed Precision Training and AMP: video, slides
• Using Nsight Systems to profile GPU workload (PyTorch Dev forum)
