PyTorch

1. Optimize data loading by moving data to an SSD, using multiple workers for loading, and pinning memory to reduce data-transfer time. 2. Create tensors directly on the device where they will be used and avoid unnecessary transfers between CPU and GPU; overlap transfers with non_blocking=True. 3. Fuse pointwise operations with the PyTorch JIT, use mixed precision for the forward pass, set gradients to None before the optimizer step, and accumulate gradients to mimic a larger batch size for a less noisy gradient estimate.


Data Loading

1. Move the active data to an SSD

2. DataLoader(dataset, num_workers=4*num_GPU)

3. DataLoader(dataset, pin_memory=True)
Why 3? Pinned (page-locked) staging memory for the data is allocated on the CPU host
directly, which saves the time of copying data from pageable memory to a staging buffer
before each host-to-device transfer.
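Example 1-3? (not in the original; a minimal sketch — the dataset, batch size, and worker
count are placeholder values)
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,           # a torch.utils.data.Dataset stored on fast local storage (e.g., SSD)
    batch_size=256,
    shuffle=True,
    num_workers=4,     # e.g., 4 * number of GPUs
    pin_memory=True,   # page-locked host memory speeds up CPU-to-GPU copies
)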

Data Operations
4. Directly create vectors/matrices/tensors as torch.Tensor on the device where
they will run operations
Example 4? torch.rand(2, 3, device=torch.device('cuda:0'))

5. Avoid unnecessary data transfer between CPU and GPU
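Example 5? (not in the original; a sketch of a common anti-pattern and its fix — the
tensor name is a placeholder)
# Slow: creates the tensor on the CPU, then copies it to the GPU
x = torch.rand(2, 3).to('cuda:0')
# Faster: create it on the GPU directly (tip 4) and keep it there
x = torch.rand(2, 3, device='cuda:0')
# Also avoid repeated .cpu()/.item() calls inside the training loop,
# since each one forces a GPU-to-CPU synchronization and transfer.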

6. Use torch.from_numpy(numpy_array) or torch.as_tensor(others)


Why 6? If the source data is a NumPy array, it’s faster to use
torch.from_numpy(numpy_array), which shares memory with the array instead of copying it.
If the source data is already a tensor with the same data type and device,
torch.as_tensor(others) may avoid copying the data. others can be a Python list, tuple,
or torch.Tensor.
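Example 6? (not in the original; a minimal sketch)
import numpy as np
import torch

numpy_array = np.random.rand(4, 4)
t1 = torch.from_numpy(numpy_array)      # shares memory with numpy_array, no copy
t2 = torch.as_tensor([1.0, 2.0, 3.0])   # accepts lists, tuples, ndarrays, or tensors
t3 = torch.as_tensor(t2)                # same dtype and device: returns t2 without copying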

7. Use tensor.to(non_blocking=True) when it’s applicable to overlap data transfers


Why 7? non_blocking=True lets the host-to-device copy run asynchronously so it can
overlap with other work; the source tensor must be in pinned memory (see tip 3) for the
copy to actually be asynchronous.
Example 7?
for features, target in loader:
    # These two copies are non-blocking and overlap with each other
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # The forward pass is a synchronization point:
    # it will wait for the previous two transfers to finish
    output = model(features)

8. Fuse pointwise (elementwise) operations into a single kernel with the PyTorch JIT
Why 8? The PyTorch JIT automatically fuses adjacent pointwise operations into a single
kernel, saving multiple memory reads/writes.
Example 8?
@torch.jit.script  # JIT decorator
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

Model Architecture
9. Set all architecture sizes to multiples of 8 (for FP16 mixed precision)
10. Set the batch size to a multiple of 8 and maximize GPU memory usage
Why 9 and 10? To maximize GPU computation efficiency, it is best to ensure that the
architecture dimensions (including the input and output sizes, dimensions, and channel
counts of the network layers, as well as the batch size) are multiples of 8, or even
larger powers of two (e.g., 64, 128, and up to 256). This is because the Tensor Cores of
NVIDIA GPUs achieve the best matrix-multiplication performance when the matrix
dimensions are aligned to multiples of 8.
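Example 9 and 10? (not in the original; a sketch with placeholder layer sizes that are
all multiples of 8)
import torch.nn as nn

batch_size = 256                # a multiple of 8 (and a power of two)
model = nn.Sequential(
    nn.Linear(1024, 4096),      # input/output features are multiples of 8
    nn.ReLU(),
    nn.Linear(4096, 1024),
)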

11. Use mixed precision for the forward pass (but not the backward pass)
Why 11? Running operations at lower precision saves both memory and execution time.
Example 11?
from torch.cuda.amp import GradScaler

scaler = GradScaler()
for features, target in data:
    # Forward pass with mixed precision
    with torch.cuda.amp.autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass outside autocast; running backward() under autocast is not
    # recommended. Backward ops run in the same precision that autocast chose
    # for the corresponding forward ops.
    scaler.scale(loss).backward()
    # scaler.step() first unscales the gradients.
    # If the gradients contain infs or NaNs, optimizer.step() is skipped.
    scaler.step(optimizer)
    # If optimizer.step() was skipped, the scaling factor is reduced
    # by the backoff_factor set in GradScaler()
    scaler.update()
    # Reset the gradients to None (see tip 12)
    optimizer.zero_grad(set_to_none=True)
Example 11?
from torch.cuda.amp import autocast
import torch.nn as nn

class AutocastModel(nn.Module):
    ...
    @autocast()  # autocast as a decorator
    def forward(self, input):
        x = self.model(input)
        return x

12. Set gradients to None (e.g., model.zero_grad(set_to_none=True)) before the
optimizer updates the weights
Why 12? Setting gradients to zero with model.zero_grad() or optimizer.zero_grad()
executes a memset for every parameter and updates the gradients with both read and
write operations. Setting the gradients to None skips the memset and updates the
gradients with write operations only.
Example 12?
# Reset gradients before each optimizer step
for param in model.parameters():
    param.grad = None
# or (PyTorch >= 1.7)
model.zero_grad(set_to_none=True)
# or (PyTorch >= 1.7)
optimizer.zero_grad(set_to_none=True)

13. Gradient accumulation: update the weights only every x batches to mimic a
larger batch size
Why 13? The gradient estimated over the larger effective batch is less noisy, so each
weight update moves more reliably toward the minimum of the loss.
Example 13?
for i, (features, target) in enumerate(dataloader):
    # Forward pass
    output = model(features)
    loss = criterion(output, target)
    # Scale the loss so the accumulated gradient matches the larger batch
    loss = loss / 2
    # Backward pass (gradients accumulate across iterations)
    loss.backward()
    # Only update the weights every 2 iterations,
    # so the effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # Update weights
        optimizer.step()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)
Inference/Validation
14. Turn off gradient calculation
Example 14?
# torch.no_grad() as a context manager:
with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:
@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output

CNN (Convolutional Neural Network) specific


15. torch.backends.cudnn.benchmark = True
Why 15? Setting torch.backends.cudnn.benchmark = True before the training loop can
accelerate the computation. Because the performance of the cuDNN algorithms for
computing convolutions varies with the kernel size, the auto-tuner can run a benchmark
to find the best algorithm. It's recommended to turn on this setting when your input
size doesn't change often.
Example 15? torch.backends.cudnn.benchmark = True

16. Use channels_last memory format for 4D NCHW Tensors
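Example 16? (not in the original; a minimal sketch — the model and input names are
placeholders)
# Both the model and its 4D NCHW inputs are converted to channels_last;
# operators that support it can then run faster on Tensor Cores.
model = model.to(memory_format=torch.channels_last)
input = input.to(memory_format=torch.channels_last)
output = model(input)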


17. Turn off bias for convolutional layers that are right before batch
normalization
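Example 17? (not in the original; a minimal sketch with placeholder channel counts)
import torch.nn as nn

# BatchNorm2d has its own learnable shift (beta), which makes the convolution's
# bias redundant, so the bias can be disabled to save memory and computation.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)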

Distributed optimizations
18. Use DistributedDataParallel instead of DataParallel
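Example 18? (not in the original; a minimal single-node sketch — the backend choice,
launch command, and model are assumptions)
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with, e.g.: torchrun --nproc_per_node=NUM_GPUS train.py
dist.init_process_group(backend='nccl')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
model = model.to(local_rank)
# One process per GPU, each with its own DDP-wrapped replica
model = DDP(model, device_ids=[local_rank])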

Code
# Combining tips No. 7, 11, 12, and 13: non-blocking transfers, AMP, setting
# gradients to None, and a larger effective batch size
from torch.cuda.amp import GradScaler

model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()
for i, (features, target) in enumerate(dataloader):
    # These two copies are non-blocking and overlap with each other
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # Forward pass with mixed precision
    with torch.cuda.amp.autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass outside autocast (see tip 11)
    scaler.scale(loss).backward()
    # Only update the weights every 2 iterations,
    # so the effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # scaler.step() first unscales the gradients.
        # If the gradients contain infs or NaNs, optimizer.step() is skipped.
        scaler.step(optimizer)
        # If optimizer.step() was skipped, the scaling factor is reduced
        # by the backoff_factor set in GradScaler()
        scaler.update()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)
