PyTorch Performance Tuning Guide: Szymon Migacz, 04/12/2021

This document provides techniques for optimizing PyTorch training performance, including enabling asynchronous data loading and augmentation, disabling bias for convolutions followed by batch norm, efficiently setting gradients to zero, and disabling gradient calculation for inference. It also discusses optimizations specific to GPUs like using mixed precision, enabling the cuDNN autotuner, avoiding CPU-GPU synchronization, and load balancing workload across multiple GPUs.

PYTORCH PERFORMANCE TUNING GUIDE
Szymon Migacz, 04/12/2021
CONTENT
PyTorch Performance Tuning Guide

Simple techniques to improve training performance

Implement by changing a few lines of code


GENERAL OPTIMIZATIONS

ENABLE ASYNC DATA LOADING & AUGMENTATION

PyTorch DataLoader supports asynchronous data loading / augmentation
Default settings: num_workers=0, pin_memory=False
Use num_workers > 0 to enable asynchronous data processing
Use pin_memory=True

Example: PyTorch MNIST example: DataLoader with default {'num_workers': 1, 'pin_memory': True}.

Setting for the training DataLoader        Time for one training epoch
{'num_workers': 0, 'pin_memory': False}    8.2 s
{'num_workers': 1, 'pin_memory': False}    6.75 s
{'num_workers': 1, 'pin_memory': True}     6.7 s
{'num_workers': 2, 'pin_memory': True}     4.2 s
{'num_workers': 4, 'pin_memory': False}    4.5 s
{'num_workers': 4, 'pin_memory': True}     4.1 s
{'num_workers': 8, 'pin_memory': True}     4.5 s

PyTorch 1.6, NVIDIA Quadro RTX 8000

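The DataLoader settings listed in the table can be sketched as follows. This is only an illustration, not the MNIST example itself: the synthetic TensorDataset, the batch size, and the helper names make_loader and run_epoch are assumptions made for the sketch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, num_workers=2, batch_size=64):
    """Build a training DataLoader with background workers and pinned memory."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,               # > 0 loads batches in worker processes
        pin_memory=torch.cuda.is_available(),  # pinned host memory only matters with a GPU
    )


def run_epoch(loader, device):
    """Iterate once over the loader, moving each batch to the target device."""
    seen = 0
    for images, labels in loader:
        # non_blocking=True lets the copy overlap with compute when memory is pinned
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        seen += images.size(0)
    return seen


# Synthetic stand-in for MNIST (assumption): 256 grayscale 28x28 images.
dataset = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if __name__ == "__main__":
    # The guard matters on platforms that start worker processes with "spawn".
    print(run_epoch(make_loader(dataset, num_workers=2), device))
```

As the table suggests, the best num_workers value is found by measurement: in the slide's experiment the epoch time stopped improving beyond 4 workers.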
DISABLE BIAS FOR CONVOLUTIONS DIRECTLY FOLLOWED BY A BATCH NORM

Instead of:

...
nn.Conv2d(..., bias=True, ...)
nn.BatchNorm2d()
...

use:

...
nn.Conv2d(..., bias=False, ...)
nn.BatchNorm2d()
...

Also applicable to Conv1d, Conv3d if BatchNorm normalizes on the same dimension as convolution's bias.

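The reason the bias is redundant is that batch norm subtracts the per-channel mean, which cancels any per-channel constant added by the convolution. The following sketch checks this numerically in training mode; the layer sizes and input shape are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two convolutions with identical weights; only one has a bias term.
conv_with_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_no_bias.weight.copy_(conv_with_bias.weight)

# A fresh module is in training mode, so it normalizes with batch statistics.
bn = nn.BatchNorm2d(8)

x = torch.randn(4, 3, 16, 16)
out_with_bias = bn(conv_with_bias(x))
out_no_bias = bn(conv_no_bias(x))

# The per-channel bias shifts the mean, which batch norm subtracts again.
max_diff = (out_with_bias - out_no_bias).abs().max().item()
extra_params = sum(p.numel() for p in conv_with_bias.parameters()) - sum(
    p.numel() for p in conv_no_bias.parameters()
)
print(f"max difference: {max_diff:.2e}, extra parameters: {extra_params}")
```

The two outputs agree up to floating-point rounding, while the biased convolution carries one extra parameter per output channel.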
EFFICIENTLY SET GRADIENTS TO ZERO

Instead of:

model.zero_grad()
# or
optimizer.zero_grad()

• executes memset for every parameter in the model
• backward pass updates gradients with "+=" operator (read + write)

use:

for param in model.parameters():
    param.grad = None
# or (in PyT >= 1.7)
model.zero_grad(set_to_none=True)

• doesn't execute memset for every parameter
• memory is zeroed-out by the allocator in a more efficient way
• backward pass updates gradients with "=" operator (write)

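The difference between the two approaches can be observed directly on a small model. In this sketch the model, the optimizer, and the helper backward_once are illustrative assumptions; set_to_none is passed explicitly because its default value depends on the PyTorch version.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def backward_once():
    """Populate .grad on every parameter with one forward/backward pass."""
    model(torch.randn(4, 8)).sum().backward()


# Variant 1: gradients are kept as tensors and overwritten with zeros.
backward_once()
optimizer.zero_grad(set_to_none=False)
zeros_kept = all(
    p.grad is not None and bool((p.grad == 0).all()) for p in model.parameters()
)

# Variant 2: gradients are dropped; the next backward pass writes fresh tensors.
backward_once()
optimizer.zero_grad(set_to_none=True)
all_none = all(p.grad is None for p in model.parameters())

print(zeros_kept, all_none)
```

Code that reads param.grad between zeroing and the next backward pass has to handle None in the second variant.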
DISABLE GRADIENT CALCULATION FOR INFERENCE

# torch.no_grad() as a context manager:
with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:
@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output

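A quick way to confirm that no autograd graph is recorded is to inspect the output tensor. The tiny nn.Linear model and the input shape below are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
inputs = torch.randn(3, 4)

# Regular forward pass: autograd records the graph needed for backward.
tracked = model(inputs)


@torch.no_grad()
def validation(model, inputs):
    """Forward pass without recording operations for autograd."""
    return model(inputs)


untracked = validation(model, inputs)

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
```

The values are identical in both cases; only the bookkeeping for the backward pass is skipped.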
FUSE POINTWISE OPERATIONS

PyTorch JIT can fuse pointwise operations into a single CUDA kernel.
Unfused pointwise operations are memory-bound. For each unfused op PyTorch:
• launches a separate kernel
• loads data from global memory
• performs computation
• stores results back into global memory

Example:

import torch

# Unfused: each pointwise op runs as its own kernel.
def gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

# Fused: the JIT can compile the same expression into one kernel.
@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

(input vector with 1M elements)

Function name    Number of CUDA kernels launched    Execution time [us]
gelu(x)          5                                  65
fused_gelu(x)    1                                  16

PyTorch 1.6, NVIDIA Quadro RTX 8000

GPU SPECIFIC OPTIMIZATIONS

USE MIXED PRECISION AND AMP

• Set sizes to multiples of 8
  • See Deep Learning Performance Documentation for more details and guidelines specific to layer type
  • Use explicit padding when necessary (e.g. vocabulary size in NLP)
• Enable AMP
  • Introduction to Mixed Precision Training and AMP: video, slides
  • Native PyTorch AMP is available starting from PyTorch 1.6: documentation, examples, tutorial

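Both recommendations can be sketched in a few lines using the native AMP API introduced in PyTorch 1.6. The helper pad_to_multiple, the vocabulary size, and the small model are hypothetical, chosen only for illustration; AMP is switched off automatically when no GPU is present so that the sketch also runs on CPU. Newer PyTorch releases may emit a deprecation warning for the torch.cuda.amp spelling used here.

```python
import torch
import torch.nn as nn


def pad_to_multiple(n, multiple=8):
    """Round n up to the next multiple (hypothetical helper for this sketch)."""
    return ((n + multiple - 1) // multiple) * multiple


# Example vocabulary size (assumption), padded up to a multiple of 8.
vocab_size = pad_to_multiple(33278)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision only when a GPU is present

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(32, 64, device=device)
targets = torch.randint(0, 8, (32,), device=device)

for _ in range(3):
    optimizer.zero_grad(set_to_none=True)
    # Inside autocast, eligible ops run in reduced precision.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to limit gradient underflow
    scaler.step(optimizer)         # unscales gradients, then runs optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step

print(vocab_size, float(loss))
```

All layer sizes in the sketch (64, 128, 8) and the batch size (32) are multiples of 8, following the first recommendation.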
ENABLE cuDNN AUTOTUNER

For convolutional neural networks, enable cuDNN autotuner by setting:

torch.backends.cudnn.benchmark = True

• cuDNN supports many algorithms to compute convolution
• autotuner runs a short benchmark and selects algorithm with the best performance

Example:
nn.Conv2d with 64 3x3 filters applied to an input with batch size = 32, channels = width = height = 64.

Setting                                       cudnn.benchmark = False (the default)    cudnn.benchmark = True    Speedup
Forward propagation (FP32) [us]               1430                                     840                       1.70
Forward + backward propagation (FP32) [us]    2870                                     2260                      1.27

PyTorch 1.6, NVIDIA Quadro RTX 8000

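The setup from the example can be reproduced roughly as follows. The padding value and the number of iterations are assumptions, since the slide does not state them. The flag only has an effect when the convolution actually runs through cuDNN on a GPU; on CPU the sketch still executes, but nothing is autotuned.

```python
import torch
import torch.nn as nn

# Ask cuDNN to benchmark the available convolution algorithms and keep the fastest.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 64 filters of size 3x3; batch size = 32, channels = width = height = 64.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)  # padding is an assumption
x = torch.randn(32, 64, 64, 64, device=device)

with torch.no_grad():
    for _ in range(3):
        # On a GPU, the first call with a given input shape triggers the benchmark;
        # later calls with the same shape reuse the selected algorithm.
        out = conv(x)

if device.type == "cuda":
    torch.cuda.synchronize()  # wait for the GPU before timing or reading results

print(tuple(out.shape))
```

Because the benchmark is tied to the input shape, the setting tends to pay off most when shapes stay constant across iterations.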
CREATE TENSORS DIRECTLY ON TARGET DEVICE

Instead of:

torch.rand(size).cuda()

use:

torch.rand(
    size,
    device=torch.device('cuda'),
)

Also applicable to:
torch.empty(), torch.zeros(), torch.full(), torch.ones(),
torch.eye(), torch.randint(), torch.randn()
and similar functions.

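A device-agnostic version of the same pattern might look like this. The tensor size and variable names are arbitrary, and the CPU fallback is an assumption added so that the sketch runs on machines without a GPU.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
size = (1024, 1024)

# Two steps: allocate and fill on the CPU, then copy to the target device.
via_copy = torch.rand(size).to(device)

# One step: allocate and fill directly on the target device.
direct = torch.rand(size, device=device)
mask = torch.zeros(size, dtype=torch.bool, device=device)
filled = torch.full(size, 0.5, device=device)

# The *_like factories inherit the device (and dtype) of their argument.
ones = torch.ones_like(direct)

print(direct.device, mask.device, ones.device)
```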
AVOID CPU-GPU SYNC

Operations which require synchronization:
• print(cuda_tensor)
• cuda_tensor.item()
• memory copies: tensor.cuda(), cuda_tensor.cpu() and tensor.to(device) calls
• cuda_tensor.nonzero()
• python control flow which depends on operations on CUDA tensors e.g.
  if (cuda_tensor != 0).all()

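A common case is logging the loss at every step. The sketch below contrasts calling .item() in every iteration with accumulating on the device and synchronizing once at the end; the random scalars stand in for per-step losses and are an assumption made for the illustration.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Stand-ins for per-step training losses (assumption): 100 scalar tensors.
losses = [torch.rand((), device=device) for _ in range(100)]

# Synchronizing version: .item() forces the CPU to wait for the GPU every step.
total_sync = 0.0
for loss in losses:
    total_sync += loss.item()

# Deferred version: accumulate on the device and read the value once at the end.
total_on_device = torch.zeros((), device=device)
for loss in losses:
    total_on_device += loss.detach()
total_deferred = total_on_device.item()  # single synchronization point

print(total_sync, total_deferred)
```

Both variants produce the same number up to rounding; the second one simply moves the synchronization out of the loop.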
DISTRIBUTED OPTIMIZATIONS

USE EFFICIENT MULTI-GPU BACKEND

DataParallel
[Diagram: CPU core 0 drives GPU 0, GPU 1, GPU 2 and GPU 3]
• 1 CPU core drives multiple GPUs
• 1 python process drives multiple GPUs (GIL)
• only up to a single node

DistributedDataParallel
[Diagram: CPU core 0 drives GPU 0, CPU core 1 drives GPU 1, CPU core 2 drives GPU 2, CPU core 3 drives GPU 3]
• 1 CPU core for each GPU
• 1 python process for each GPU
• single-node and multi-node (same API)
• efficient implementation:
  • automatic bucketing for grad all-reduce
  • all-reduce overlapped with backward pass
• multi-process programming

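A minimal DistributedDataParallel setup could look like the sketch below. The function names train_worker and run_single_process_demo, the file-based rendezvous, the toy model, and the fallback to the gloo backend on CPU are all assumptions made so that the sketch can run as a single process without GPUs; a real job would start one such process per GPU.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank, world_size, init_file, steps=2):
    """Body of one training process; every process runs this same function."""
    use_cuda = torch.cuda.is_available() and torch.cuda.device_count() >= world_size
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(
        backend,
        init_method=f"file://{init_file}",  # file-based rendezvous (assumption)
        rank=rank,
        world_size=world_size,
    )
    try:
        device = torch.device("cuda", rank) if use_cuda else torch.device("cpu")
        model = nn.Linear(16, 4).to(device)
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        loss = None
        for _ in range(steps):
            optimizer.zero_grad(set_to_none=True)
            x = torch.randn(8, 16, device=device)
            loss = ddp_model(x).pow(2).mean()
            # Gradients are all-reduced in buckets while backward is still running.
            loss.backward()
            optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


def run_single_process_demo():
    """Run the worker as rank 0 of a world of size 1 (illustration only)."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    return train_worker(rank=0, world_size=1, init_file=init_file)
```

For multi-GPU training the same train_worker would be launched once per GPU with a distinct rank, for example through torch.multiprocessing.spawn.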
LOAD-BALANCE WORKLOAD ON MULTIPLE GPUs

Gradient all-reduce after backward pass is a synchronization point in a multi-GPU setting

[Timeline diagram: GPU 0, GPU 1, GPU 2 and GPU 3 each run Forward, Backward, All-reduce and Optimizer along the time axis. With an unbalanced workload, GPUs that finish Backward early are idle until All-reduce.]

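One possible way to even out the work, sketched here for variable-length samples such as text sequences, is to assign samples to GPUs so that the total length per GPU is roughly equal. The greedy longest-first strategy, the function name balance_by_length, and the sample lengths are assumptions for illustration, not a method described in the slides.

```python
import heapq


def balance_by_length(lengths, num_gpus):
    """Greedy longest-first assignment; returns one list of sample indices per GPU."""
    # Min-heap of (total length assigned so far, gpu index).
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_gpus)]
    # Place the longest samples first, always on the currently least-loaded GPU.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        total, gpu = heapq.heappop(heap)
        shards[gpu].append(idx)
        heapq.heappush(heap, (total + lengths[idx], gpu))
    return shards


# Hypothetical sequence lengths for one global batch of 8 samples.
lengths = [512, 48, 300, 64, 490, 120, 256, 32]

shards = balance_by_length(lengths, num_gpus=4)
balanced_loads = [sum(lengths[i] for i in shard) for shard in shards]

# Naive alternative: split the batch into contiguous chunks of 2 samples.
naive_loads = [sum(lengths[i:i + 2]) for i in range(0, len(lengths), 2)]

print("balanced:", balanced_loads)
print("naive:   ", naive_loads)
```

The step time is set by the most heavily loaded GPU, so a lower maximum load means less idle time before the all-reduce.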
SUMMARY

General optimizations:
• Use asynchronous data loading
• Disable bias for convolutions directly followed by batch norm
• Efficiently set gradients to zero
• Disable gradient calculation for validation/inference
• Fuse pointwise operations with PyTorch JIT

GPU specific optimizations:
• Use mixed precision and AMP
• Enable cuDNN autotuner
• Create tensors directly on a GPU
• Avoid CPU-GPU sync

Distributed optimizations:
• Use DistributedDataParallel
• Load-balance workload on all GPUs

ADDITIONAL RESOURCES

• PyTorch Tutorial: Performance Tuning Guide
  • Check for more optimizations
• NVIDIA Deep Learning Performance Documentation
• Introduction to Mixed Precision Training and AMP: video, slides
• Using Nsight Systems to profile GPU workload (PyTorch Dev forum)