PYTORCH PERFORMANCE TUNING GUIDE
Szymon Migacz, 04/12/2021
CONTENT
General optimizations
Distributed optimizations
GENERAL OPTIMIZATIONS
ENABLE ASYNC DATA LOADING & AUGMENTATION
PyTorch DataLoader supports asynchronous data loading / augmentation.
Default settings: num_workers=0, pin_memory=False.
Example: PyTorch MNIST example, DataLoader with {'num_workers': 1, 'pin_memory': True}.
Setting for the training DataLoader | Time for one training epoch
{'num_workers': 0, 'pin_memory': False} | 8.2 s
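A minimal sketch of enabling asynchronous loading for the MNIST example above; it assumes torchvision is installed, and the batch size is a placeholder:

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_dataset = datasets.MNIST(root="data", train=True, download=True,
                                   transform=transforms.ToTensor())

    # num_workers > 0 moves loading / augmentation into background worker
    # processes; pin_memory=True enables faster, asynchronous host-to-GPU copies.
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                              num_workers=1, pin_memory=True)

    for images, labels in train_loader:
        images = images.to("cuda", non_blocking=True)  # async copy from pinned memory
        labels = labels.to("cuda", non_blocking=True)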
DISABLE BIAS FOR CONVOLUTIONS DIRECTLY FOLLOWED BY A BATCH NORM
A nn.BatchNorm2d() layer directly after a convolution re-centers its input and adds its own learnable shift, so the convolution's bias is redundant; pass bias=False to nn.Conv2d() in that case.
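A minimal sketch of the pattern; the channel counts and kernel size are placeholders:

    import torch.nn as nn

    # With bias (redundant): nn.Conv2d(64, 128, kernel_size=3, bias=True) + BatchNorm2d
    # Without bias (preferred when directly followed by a batch norm):
    block = nn.Sequential(
        nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(128),
    )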
EFFICIENTLY SET GRADIENTS TO ZERO
model.zero_grad() / optimizer.zero_grad():
• executes memset for every parameter in the model
• backward pass updates gradients with "+=" operator (read + write)
Setting gradients to None (param.grad = None, or zero_grad(set_to_none=True) since PyTorch 1.7):
• doesn't execute memset for every parameter
• memory is zeroed-out by the allocator in a more efficient way
• backward pass updates gradients with "=" operator (write)
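A minimal sketch of a training step that sets gradients to None instead of zeroing them; the model, data, and optimizer are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10).cuda()            # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    inp = torch.randn(32, 10, device="cuda")
    target = torch.randn(32, 10, device="cuda")

    # Setting grads to None avoids a memset per parameter; the next backward
    # writes gradients with "=" instead of accumulating with "+=".
    # Before PyTorch 1.7: for p in model.parameters(): p.grad = None
    optimizer.zero_grad(set_to_none=True)
    loss = nn.functional.mse_loss(model(inp), target)
    loss.backward()
    optimizer.step()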
DISABLE GRADIENT CALCULATION FOR INFERENCE
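A minimal sketch of inference wrapped in torch.no_grad(); the model and input are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2).cuda().eval()      # placeholder model
    inp = torch.randn(32, 10, device="cuda")

    # No autograd graph is built and no intermediate activations are kept.
    with torch.no_grad():
        logits = model(inp)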
FUSE POINTWISE OPERATIONS
PyTorch JIT can fuse pointwise operations into a single CUDA kernel.
Unfused pointwise operations are memory-bound; for each unfused op PyTorch:
• launches a separate kernel
• loads data from global memory
• performs computation
• stores results back into global memory
Example:

    def gelu(x):
        return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

    @torch.jit.script
    def fused_gelu(x):
        return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

Call | Kernel launches | Time
gelu(x) | 5 | 65
fused_gelu(x) | 1 | 16
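A sketch of how the fused version could be timed on a GPU; the tensor shape, warm-up count, and use of CUDA events are assumptions, not part of the original measurement:

    import torch

    @torch.jit.script
    def fused_gelu(x):
        return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

    x = torch.randn(1024, 1024, device="cuda")

    # Warm up so JIT compilation is excluded from the measurement.
    for _ in range(10):
        fused_gelu(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fused_gelu(x)
    end.record()
    torch.cuda.synchronize()
    print(start.elapsed_time(end), "ms")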
See the Deep Learning Performance Documentation for more details and guidelines specific to each layer type.
ENABLE AMP
Native PyTorch AMP is available starting from PyTorch 1.6: see the documentation, examples, and tutorial.
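A minimal sketch of an AMP training step using the torch.cuda.amp API introduced in PyTorch 1.6; the model, data, and optimizer are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 512).cuda()          # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()
    inp = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 512, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    # Ops inside autocast run in float16 where it is safe, float32 otherwise.
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(inp), target)
    # Scale the loss to avoid float16 gradient underflow, then unscale and step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()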
ENABLE cuDNN AUTOTUNER
For convolutional neural networks, enable the cuDNN autotuner by setting:
torch.backends.cudnn.benchmark = True
• cuDNN supports many algorithms to compute a convolution
• the autotuner runs a short benchmark and selects the algorithm with the best performance
Example:
nn.Conv2d with 64 3x3 filters applied to an input with batch size = 32, channels = width = height = 64.
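A sketch of the example above, assuming a CUDA device; the autotuner is enabled before the first forward pass so cuDNN can pick the fastest algorithm for this shape:

    import torch
    import torch.nn as nn

    torch.backends.cudnn.benchmark = True       # enable the cuDNN autotuner

    conv = nn.Conv2d(64, 64, kernel_size=3).cuda()       # 64 3x3 filters
    inp = torch.randn(32, 64, 64, 64, device="cuda")     # N=32, C=H=W=64

    # The first call benchmarks the available algorithms for this input shape;
    # subsequent calls with the same shape reuse the fastest one.
    out = conv(inp)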
CREATE TENSORS DIRECTLY ON TARGET DEVICE
Instead of:
    torch.rand(size).cuda()
create the tensor directly on the target device:
    torch.rand(size, device=torch.device('cuda'))
AVOID CPU-GPU SYNC
These operations implicitly synchronize the CPU with the GPU:
• print(cuda_tensor)
• cuda_tensor.item()
• cuda_tensor.nonzero()
• if (cuda_tensor != 0).all()
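A sketch of one way to reduce such syncs: keep running statistics on the GPU and call .item() only once per epoch; the loop and loss are placeholders:

    import torch

    losses = []                                  # GPU tensors, no sync yet
    for step in range(100):
        loss = torch.rand((), device="cuda")     # placeholder for a real loss
        losses.append(loss.detach())             # stays on the GPU

    # Single synchronization point at the end of the epoch.
    mean_loss = torch.stack(losses).mean().item()
    print(mean_loss)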
DISTRIBUTED OPTIMIZATIONS
USE EFFICIENT MULTI-GPU BACKEND
DataParallel:
• 1 CPU core drives multiple GPUs
• 1 python process drives multiple GPUs (GIL)
• only up to a single node
DistributedDataParallel:
• 1 CPU core for each GPU
• 1 python process for each GPU
• single-node and multi-node (same API)
• efficient implementation:
  • automatic bucketing for grad all-reduce
  • all-reduce overlapped with backward pass
• multi-process programming
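A minimal single-node DDP sketch, assuming the script is launched with torchrun --nproc_per_node=<num_gpus> (one process per GPU); the model and data are placeholders:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE for each process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Linear(512, 512).cuda(local_rank)     # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

        inp = torch.randn(64, 512, device=local_rank)
        target = torch.randn(64, 512, device=local_rank)

        optimizer.zero_grad(set_to_none=True)
        loss = nn.functional.mse_loss(ddp_model(inp), target)
        loss.backward()     # gradients are bucketed and all-reduced here,
                            # overlapped with the rest of the backward pass
        optimizer.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()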
LOAD-BALANCE WORKLOAD ON MULTIPLE GPUs
Gradient all-reduce after backward pass is a synchronization point in a multi-GPU setting.
[Diagram: timeline of forward, backward, all-reduce, and optimizer steps on GPU 0-3 with a balanced workload]
LOAD-BALANCE WORKLOAD ON MULTIPLE GPUs
[Diagram: with an imbalanced workload, GPUs that finish their backward pass early sit idle at the all-reduce, waiting for the slowest GPU]
SUMMARY
General optimizations:
• Use asynchronous data loading
• Disable bias for convolutions directly followed by batch norm
• Efficiently set gradients to zero
• Disable gradient calculation for validation/inference
• Fuse pointwise operations with PyTorch JIT
Distributed optimizations:
• Use DistributedDataParallel
• Load-balance workload on all GPUs
ADDITIONAL RESOURCES