PyTorch
2. DataLoader(dataset, num_workers=4*num_GPU)
Why 2? Multiple worker processes load and preprocess batches in parallel, so the
GPU does not sit idle waiting for data.
3. DataLoader(dataset, pin_memory=True)
Why 3? pin_memory=True allocates the staging (pinned) memory for the data directly
on the CPU host, which saves the extra copy from pageable memory to staging memory
before each host-to-GPU transfer.
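Example 2 and 3? A minimal sketch combining the two DataLoader settings (dataset is
assumed to be an existing Dataset, and 4*num_GPU is a heuristic starting point
rather than a fixed rule):
import torch
from torch.utils.data import DataLoader

num_GPU = torch.cuda.device_count()
train_loader = DataLoader(
    dataset,                  # an existing torch.utils.data.Dataset
    batch_size=256,
    shuffle=True,
    num_workers=4 * num_GPU,  # tip 2: parallel worker processes for data loading
    pin_memory=True,          # tip 3: allocate pinned (staging) memory on the host
)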
Data Operations
4. Directly create vectors/matrices/tensors as torch.Tensor on the device where
their operations will run
Example 4? torch.rand(size, device=torch.device('cuda:0'))
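For comparison, a minimal sketch of creating a tensor on the CPU first versus
creating it directly on the GPU (the shape and device index are arbitrary):
# Slower: create the tensor on the CPU, then copy it to the GPU
x = torch.rand(64, 1024).to(torch.device('cuda:0'))

# Faster: create the tensor directly on the device where it will be used
x = torch.rand(64, 1024, device=torch.device('cuda:0'))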
8. Fuse pointwise (elementwise) operations into a single kernel with PyTorch JIT
Why 8? PyTorch JIT automatically fuses adjacent pointwise operations into a single
kernel, saving multiple memory reads/writes.
Example 8?
@torch.jit.script  # JIT decorator
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
Model Architecture
9. Set the sizes of all architecture components to multiples of 8 (for FP16 mixed
precision)
10. Set the batch size to a multiple of 8 and maximize GPU memory usage
Why 9 and 10? To maximize GPU computation efficiency, it is best to ensure that the
architecture sizes (including the input and output sizes/dimensions/channel counts
of the network's layers, as well as the batch size) are multiples of 8, or even
higher powers of two (e.g., 64, 128, and up to 256). This is because the Tensor
Cores of NVIDIA GPUs achieve their best matrix-multiplication performance when the
matrix dimensions are multiples of 8 (for FP16).
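Example 9 and 10? A minimal sketch of the idea (the specific layer sizes are
illustrative, not prescriptive): every layer dimension and the batch size are
multiples of 8, here powers of two.
import torch
import torch.nn as nn

# All layer dimensions are multiples of 8, so FP16 matrix multiplications
# can map cleanly onto Tensor Cores
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

batch_size = 256  # also a multiple of 8; increase until GPU memory is nearly full
x = torch.randn(batch_size, 1024, device='cuda:0')
output = model(x)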
11. Use mixed precision for the forward pass (but not the backward pass)
Why 11? Running operations in lower precision saves both memory and execution time.
Example 11?
from torch.cuda.amp import GradScaler

scaler = GradScaler()
for features, target in data:
    # Forward pass with mixed precision
    with torch.cuda.amp.autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision:
    # it's not recommended to use mixed precision for the backward pass
    # because the loss and gradients need higher precision
    scaler.scale(loss).backward()
    # scaler.step() first unscales the gradients.
    # If these gradients contain infs or NaNs,
    # optimizer.step() is skipped.
    scaler.step(optimizer)
    # If optimizer.step() was skipped, the scaling factor is reduced
    # by the backoff_factor set in GradScaler()
    scaler.update()
Example 11?
class AutocastModel(nn.Module):
    ...
    @autocast()  # autocast as a decorator
    def forward(self, input):
        x = self.model(input)
        return x
13. Gradient accumulation: update weights only every x batches to mimic a larger
batch size
Why 13? The gradient estimate is averaged over more samples and therefore less
noisy, so the weight updates move more directly toward the local/global minimum.
Example 13?
for i, (features, target) in enumerate(dataloader):
    # Forward pass
    output = model(features)
    loss = criterion(output, target)
    # Backward pass: gradients are accumulated across iterations
    loss.backward()
    # Only update weights every 2 iterations,
    # so the effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # Update weights
        optimizer.step()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)
Inference/Validation
14. Turn off gradient calculation for inference/validation
Why 14? Gradients are only needed for weight updates, so skipping their computation
during inference/validation saves memory and time.
Example 14?
# torch.no_grad() as a context manager:
with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:
@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output
Distributed Optimizations
18. Use DistributedDataParallel instead of DataParallel
Why 18? DataParallel runs a single process with multiple threads and is limited by
the Python GIL as well as by scattering inputs and gathering outputs through one
GPU; DistributedDataParallel runs one process per GPU and only synchronizes
gradients, so it scales much better.
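Example 18? A minimal single-node sketch, assuming the script is launched with
torchrun (which sets the LOCAL_RANK environment variable); model and dataset are
placeholders for your own objects:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# One process per GPU; torchrun sets LOCAL_RANK for each process
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = model.to(local_rank)                     # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])

# Each process loads a distinct shard of the dataset
sampler = DistributedSampler(dataset)            # placeholder dataset
loader = DataLoader(dataset, batch_size=256, sampler=sampler,
                    num_workers=4, pin_memory=True)
Launch with, e.g., torchrun --nproc_per_node=<num_GPU> train.py;
DistributedDataParallel averages gradients across processes automatically during
backward().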
Code
# Combining the tips No.7, 11, 12, 13: nonblocking transfer, AMP,
# setting gradients to None, and a larger effective batch size
from torch.cuda.amp import GradScaler

model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()
for i, (features, target) in enumerate(dataloader):
    # These two transfers are nonblocking and can overlap with computation
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # Forward pass with mixed precision
    with torch.cuda.amp.autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision:
    # it's not recommended to use mixed precision for the backward pass
    # because the loss and gradients need higher precision
    scaler.scale(loss).backward()
    # Only update weights every 2 iterations,
    # so the effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # scaler.step() first unscales the gradients.
        # If these gradients contain infs or NaNs,
        # optimizer.step() is skipped.
        scaler.step(optimizer)
        # If optimizer.step() was skipped, the scaling factor is reduced
        # by the backoff_factor set in GradScaler()
        scaler.update()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)