Data Loading
1. Move the active data to the SSD
2. DataLoader(dataset, num_workers=4*num_GPU)
3. DataLoader(dataset, pin_memory=True)
Why 3? pin_memory=True allocates the staging memory for the data on the CPU host
directly as page-locked (pinned) memory, saving the time of transferring data from
pageable memory to staging memory before each host-to-device copy.
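The two DataLoader settings above can be combined. A minimal sketch (the toy dataset, batch size, and the 4-workers-per-GPU rule of thumb are illustrative choices, not prescriptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

num_GPU = torch.cuda.device_count()
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4 * num_GPU,               # tip 2: e.g., 4 workers per GPU
    pin_memory=torch.cuda.is_available(),  # tip 3: page-locked staging memory
)
features, target = next(iter(loader))
```

pin_memory only pays off when a CUDA device is present, hence the `torch.cuda.is_available()` guard.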
Data Operations
4. Directly create vectors/matrices/tensors as torch.Tensor and at the device where
they will run operations
Example 4? torch.rand(size, device=torch.device('cuda:0'))
5. Avoid unnecessary data transfer between CPU and GPU
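As an illustration of tip 5, keep intermediate results on the device and transfer only the final scalar back to the host. This sketch uses a generic device so it also runs on CPU; the chunked reduction is a stand-in for any per-iteration computation:

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1000, device=device)

# Avoid: a device-to-host transfer (.item()) on every iteration
# total = 0.0
# for chunk in x.split(100):
#     total += chunk.sum().item()

# Better: accumulate on the device, transfer once at the end
total = torch.zeros((), device=device)
for chunk in x.split(100):
    total += chunk.sum()
result = total.item()  # single device-to-host transfer
```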
6. Use torch.from_numpy(numpy_array) or torch.as_tensor(others)
Why 6? If the source data is a NumPy array, it's faster to use
torch.from_numpy(numpy_array), which shares memory with the array instead of
copying it. If the source data is a tensor with the same data type and device
type, then torch.as_tensor(others) may avoid copying data if applicable. others
can be a Python list, tuple, or torch.Tensor.
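A small sketch of the difference: torch.from_numpy and torch.as_tensor share memory with a NumPy source when they can, while torch.tensor always copies.

```python
import numpy as np
import torch

numpy_array = np.array([1.0, 2.0, 3.0])

# Shares memory with the NumPy array instead of copying it
t1 = torch.from_numpy(numpy_array)

# May avoid a copy when dtype and device already match (here it does)
t2 = torch.as_tensor(numpy_array)

# By contrast, torch.tensor() always copies
t3 = torch.tensor(numpy_array)

numpy_array[0] = 9.0
print(t1[0].item())  # 9.0 -- t1 and t2 see the change, t3 does not
```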
7. Use tensor.to(non_blocking=True) when it's applicable to overlap data transfers
Why 7? non_blocking=True allows asynchronous data transfers (from pinned memory)
that can overlap with computation and reduce the execution time.
Example 7?
for features, target in loader:
    # These two calls are non-blocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # This is a synchronization point:
    # it will wait for the previous two lines
    output = model(features)
8. Fuse pointwise (elementwise) operations into a single kernel with PyTorch JIT
Why 8? PyTorch JIT automatically fuses adjacent pointwise operations into a single
kernel to save multiple memory reads/writes.
Example 8?
@torch.jit.script  # JIT decorator
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
Model Architecture
9. Set the sizes of all different architecture designs as multiples of 8 (for FP16
mixed precision)
10. Set the batch size as a multiple of 8 and maximize GPU memory usage
Why 9 and 10? To maximize the computational efficiency of the GPU, it's best to
ensure that the different architecture designs (including the input and output
sizes/dimensions/channel numbers of the neural network, as well as the batch size)
are multiples of 8 or even larger powers of two (e.g., 64, 128, and up to 256).
This is because the Tensor Cores of NVIDIA GPUs achieve the best performance for
matrix multiplication when the matrix dimensions align to multiples of powers of
two.
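A sketch of tips 9 and 10: the concrete sizes below are arbitrary, but each one (batch size, input, hidden, and output dimensions) is chosen as a multiple of 8 so that FP16 matrix multiplications can map cleanly onto Tensor Cores.

```python
import torch
import torch.nn as nn

batch_size = 64      # multiple of 8 (and a power of two)
in_features = 1024   # multiple of 8
hidden = 4096        # multiple of 8
out_features = 256   # multiple of 8

model = nn.Sequential(
    nn.Linear(in_features, hidden),
    nn.ReLU(),
    nn.Linear(hidden, out_features),
)
x = torch.randn(batch_size, in_features)
y = model(x)
assert all(d % 8 == 0 for d in (batch_size, in_features, hidden, out_features))
```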
11. Use mixed precision for the forward pass (but not the backward pass)
Why 11? Running operations at lower precision saves both memory and execution
time.
Example 11?
scaler = GradScaler()
for features, target in data:
    # Forward pass with mixed precision
    with autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision
    # (it's not recommended to use mixed precision for the backward pass,
    # because we need more precise gradients)
    scaler.scale(loss).backward()
    # scaler.step() first unscales the gradients.
    # If these gradients contain infs or NaNs,
    # optimizer.step() is skipped.
    scaler.step(optimizer)
    # If optimizer.step() was skipped, the scaling factor
    # is reduced by the backoff_factor in GradScaler()
    scaler.update()
Example 11?
class AutocastModel(nn.Module):
    ...
    @autocast()  # autocast as a decorator
    def forward(self, input):
        x = self.model(input)
        return x
12. Set gradients to None (e.g., model.zero_grad(set_to_none=True) ) before the
optimizer updates the weights
Why 12? Setting gradients to zero via model.zero_grad() or optimizer.zero_grad()
executes memset for all parameters and updates the gradients with both reading and
writing operations. Setting the gradients to None instead skips the memset and
updates the gradients with writing operations only.
Example 12?
# Reset gradients before each step of the optimizer
for param in model.parameters():
    param.grad = None
# or (PyTorch >= 1.7)
model.zero_grad(set_to_none=True)
# or (PyTorch >= 1.7)
optimizer.zero_grad(set_to_none=True)
13. Gradient accumulation: update the weights only every x batches to mimic a
larger batch size
Why 13? With a larger effective batch size, the estimate of the gradients is more
accurate and the weight updates move more directly toward the local/global minimum.
Example 13?
for i, (features, target) in enumerate(dataloader):
    # Forward pass
    output = model(features)
    loss = criterion(output, target)
    # Backward pass
    loss.backward()
    # Only update weights every 2 iterations
    # Effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # Update weights
        optimizer.step()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)
Inference/Validation
14. Turn off gradient calculation
Example 14?
# torch.no_grad() as a context manager:
with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:
@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output
CNN (Convolutional Neural Network) specific
15. torch.backends.cudnn.benchmark = True
Why 15? Setting torch.backends.cudnn.benchmark = True before the training loop can
accelerate the computation. Because the performance of the cuDNN algorithms that
compute convolutions varies with the kernel size and input shape, the auto-tuner
can run a benchmark to find the best algorithm. It's recommended to turn on this
setting when your input size doesn't change often.
Example 15? torch.backends.cudnn.benchmark = True
16. Use channels_last memory format for 4D NCHW Tensors
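A minimal sketch of tip 16; note that both the model and the input should be converted, so that the convolution runs entirely in the NHWC physical layout:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
model = model.to(memory_format=torch.channels_last)

x = torch.randn(2, 3, 32, 32)                # logical NCHW shape is unchanged
x = x.to(memory_format=torch.channels_last)  # physical layout becomes NHWC

out = model(x)
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```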
17. Turn off bias for convolutional layers that are right before batch
normalization
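A sketch of tip 17: BatchNorm2d subtracts its own running mean and adds its own learned shift (beta), so a bias in the immediately preceding convolution is redundant and can be dropped.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    # bias=False: BatchNorm's beta makes the conv bias redundant
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
out = block(torch.randn(4, 3, 8, 8))
```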
Distributed optimizations
18. Use DistributedDataParallel instead of DataParallel
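A minimal single-file sketch of tip 18, intended to be launched with torchrun (e.g. `torchrun --nproc_per_node=N this_script.py`); the tiny linear model and random batch are placeholders for a real model and dataloader:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(
        f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(10, 2).to(device)
    # One process per GPU; unlike DataParallel, DDP overlaps the
    # gradient all-reduce with the backward pass
    ddp_model = DDP(
        model,
        device_ids=[local_rank] if torch.cuda.is_available() else None)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(32, 10, device=device)
    loss = ddp_model(x).sum()
    loss.backward()   # gradients are averaged across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__" and "RANK" in os.environ:  # only under torchrun
    main()
```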
Code
# Combining tips No. 7, 11, 12, 13: non-blocking transfers, AMP,
# setting gradients to None, and a larger effective batch size
model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()
for i, (features, target) in enumerate(dataloader):
    # These two calls are non-blocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # Forward pass with mixed precision
    with autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision
    # (it's not recommended to use mixed precision for the backward pass,
    # because we need more precise gradients)
    scaler.scale(loss).backward()
    # Only update weights every 2 iterations
    # Effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # scaler.step() first unscales the gradients.
        # If these gradients contain infs or NaNs,
        # optimizer.step() is skipped.
        scaler.step(optimizer)
        # If optimizer.step() was skipped, the scaling factor
        # is reduced by the backoff_factor in GradScaler()
        scaler.update()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)