Data Loading
1. Move the active data to the SSD
2. DataLoader(dataset, num_workers=4*num_GPU)
3. DataLoader(dataset, pin_memory=True)
Why 3? pin_memory=True allocates the staging memory for the data on the CPU host
directly as page-locked (pinned) memory, saving the time of transferring data from
pageable memory to staging memory before each host-to-device copy.
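The two DataLoader settings above can be combined. A minimal sketch (the toy dataset, batch size, and the 4-workers-per-GPU rule of thumb are illustrative choices, not prescriptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

num_GPU = torch.cuda.device_count()
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4 * num_GPU,               # tip 2: e.g., 4 workers per GPU
    pin_memory=torch.cuda.is_available(),  # tip 3: page-locked staging memory
)
features, target = next(iter(loader))
```

pin_memory only pays off when a CUDA device is present, hence the `torch.cuda.is_available()` guard.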
Data Operations
4. Directly create vectors/matrices/tensors as torch.Tensor and at the device where
they will run operations
Example 4? torch.rand(size, device=torch.device('cuda:0'))
5. Avoid unnecessary data transfer between CPU and GPU
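As an illustration of tip 5, keep intermediate results on the device and transfer only the final scalar back to the host. This sketch uses a generic device so it also runs on CPU; the chunked reduction is a stand-in for any per-iteration computation:

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1000, device=device)

# Avoid: a device-to-host transfer (.item()) on every iteration
# total = 0.0
# for chunk in x.split(100):
#     total += chunk.sum().item()

# Better: accumulate on the device, transfer once at the end
total = torch.zeros((), device=device)
for chunk in x.split(100):
    total += chunk.sum()
result = total.item()  # single device-to-host transfer
```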
6. Use torch.from_numpy(numpy_array) or torch.as_tensor(others)
Why 6? If the source data is a NumPy array, it's faster to use
torch.from_numpy(numpy_array), which shares memory with the array instead of
copying it. If the source data is a tensor with the same data type and device
type, then torch.as_tensor(others) may avoid copying data if applicable. others
can be a Python list, tuple, or torch.Tensor.
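A small sketch of the difference: torch.from_numpy and torch.as_tensor share memory with a NumPy source when they can, while torch.tensor always copies.

```python
import numpy as np
import torch

numpy_array = np.array([1.0, 2.0, 3.0])

# Shares memory with the NumPy array instead of copying it
t1 = torch.from_numpy(numpy_array)

# May avoid a copy when dtype and device already match (here it does)
t2 = torch.as_tensor(numpy_array)

# By contrast, torch.tensor() always copies
t3 = torch.tensor(numpy_array)

numpy_array[0] = 9.0
print(t1[0].item())  # 9.0 -- t1 and t2 see the change, t3 does not
```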
7. Use tensor.to(non_blocking=True) when it's applicable to overlap data transfers
Why 7? non_blocking=True allows asynchronous data transfers (from pinned memory)
that can overlap with computation and reduce the execution time.
Example 7?
for features, target in loader:
    # These two calls are non-blocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # This is a synchronization point:
    # it will wait for the previous two lines
    output = model(features)
8. Fuse pointwise (elementwise) operations into a single kernel with PyTorch JIT
Why 8? PyTorch JIT automatically fuses adjacent pointwise operations into a single
kernel to save multiple memory reads/writes.
Example 8?
@torch.jit.script  # JIT decorator
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
Model Architecture
9. Set the sizes of all different architecture designs as multiples of 8 (for FP16
mixed precision)
10. Set the batch size as a multiple of 8 and maximize GPU memory usage
Why 9 and 10? To maximize the computational efficiency of the GPU, it's best to
ensure that the different architecture designs (including the input and output
sizes/dimensions/channel numbers of the neural network, as well as the batch size)
are multiples of 8 or even larger powers of two (e.g., 64, 128, and up to 256).
This is because the Tensor Cores of NVIDIA GPUs achieve the best performance for
matrix multiplication when the matrix dimensions align to multiples of powers of
two.
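A sketch of tips 9 and 10: the concrete sizes below are arbitrary, but each one (batch size, input, hidden, and output dimensions) is chosen as a multiple of 8 so that FP16 matrix multiplications can map cleanly onto Tensor Cores.

```python
import torch
import torch.nn as nn

batch_size = 64      # multiple of 8 (and a power of two)
in_features = 1024   # multiple of 8
hidden = 4096        # multiple of 8
out_features = 256   # multiple of 8

model = nn.Sequential(
    nn.Linear(in_features, hidden),
    nn.ReLU(),
    nn.Linear(hidden, out_features),
)
x = torch.randn(batch_size, in_features)
y = model(x)
assert all(d % 8 == 0 for d in (batch_size, in_features, hidden, out_features))
```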
11. Use mixed precision for the forward pass (but not the backward pass)
Why 11? Running operations at lower precision saves both memory and execution
time.
Example 11?
scaler = GradScaler()
for features, target in data:
    # Forward pass with mixed precision
    with autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision
    # (it's not recommended to use mixed precision for the backward pass,
    # because we need more precise gradients)
    scaler.scale(loss).backward()
    # scaler.step() first unscales the gradients.
    # If these gradients contain infs or NaNs,
    # optimizer.step() is skipped.
    scaler.step(optimizer)
    # If optimizer.step() was skipped, the scaling factor
    # is reduced by the backoff_factor in GradScaler()
    scaler.update()
Example 11?
class AutocastModel(nn.Module):
    ...
    @autocast()  # autocast as a decorator
    def forward(self, input):
        x = self.model(input)
        return x
12. Set gradients to None (e.g., model.zero_grad(set_to_none=True) ) before the
optimizer updates the weights
Why 12? Setting gradients to zero via model.zero_grad() or optimizer.zero_grad()
executes memset for all parameters and updates the gradients with both reading and
writing operations. Setting the gradients to None instead skips the memset and
updates the gradients with writing operations only.
Example 12?
# Reset gradients before each step of the optimizer
for param in model.parameters():
    param.grad = None
# or (PyTorch >= 1.7)
model.zero_grad(set_to_none=True)
# or (PyTorch >= 1.7)
optimizer.zero_grad(set_to_none=True)
13. Gradient accumulation: update the weights only every x batches to mimic a
larger batch size
Why 13? With a larger effective batch size, the estimate of the gradients is more
accurate and the weight updates move more directly toward the local/global minimum.
Example 13?
for i, (features, target) in enumerate(dataloader):
    # Forward pass
    output = model(features)
    loss = criterion(output, target)
    # Backward pass
    loss.backward()
    # Only update weights every 2 iterations
    # Effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # Update weights
        optimizer.step()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)
Inference/Validation
14. Turn off gradient calculation
Example 14?
# torch.no_grad() as a context manager:
with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:
@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output
CNN (Convolutional Neural Network) specific
15. torch.backends.cudnn.benchmark = True
Why 15? Setting torch.backends.cudnn.benchmark = True before the training loop can
accelerate the computation. Because the performance of the cuDNN algorithms that
compute convolutions varies with the kernel size and input shape, the auto-tuner
can run a benchmark to find the best algorithm. It's recommended to turn on this
setting when your input size doesn't change often.
Example 15? torch.backends.cudnn.benchmark = True
16. Use channels_last memory format for 4D NCHW Tensors
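A minimal sketch of tip 16; note that both the model and the input should be converted, so that the convolution runs entirely in the NHWC physical layout:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
model = model.to(memory_format=torch.channels_last)

x = torch.randn(2, 3, 32, 32)                # logical NCHW shape is unchanged
x = x.to(memory_format=torch.channels_last)  # physical layout becomes NHWC

out = model(x)
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```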
17. Turn off bias for convolutional layers that are right before batch
normalization
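A sketch of tip 17: BatchNorm2d subtracts its own running mean and adds its own learned shift (beta), so a bias in the immediately preceding convolution is redundant and can be dropped.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    # bias=False: BatchNorm's beta makes the conv bias redundant
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
out = block(torch.randn(4, 3, 8, 8))
```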
Distributed optimizations
18. Use DistributedDataParallel instead of DataParallel
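A minimal single-file sketch of tip 18, intended to be launched with torchrun (e.g. `torchrun --nproc_per_node=N this_script.py`); the tiny linear model and random batch are placeholders for a real model and dataloader:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(
        f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(10, 2).to(device)
    # One process per GPU; unlike DataParallel, DDP overlaps the
    # gradient all-reduce with the backward pass
    ddp_model = DDP(
        model,
        device_ids=[local_rank] if torch.cuda.is_available() else None)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(32, 10, device=device)
    loss = ddp_model(x).sum()
    loss.backward()   # gradients are averaged across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__" and "RANK" in os.environ:  # only under torchrun
    main()
```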
Code
# Combining tips No. 7, 11, 12, 13: non-blocking transfers, AMP,
# setting gradients to None, and a larger effective batch size
model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()
for i, (features, target) in enumerate(dataloader):
    # These two calls are non-blocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # Forward pass with mixed precision
    with autocast():  # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
    # Backward pass without mixed precision
    # (it's not recommended to use mixed precision for the backward pass,
    # because we need more precise gradients)
    scaler.scale(loss).backward()
    # Only update weights every 2 iterations
    # Effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # scaler.step() first unscales the gradients.
        # If these gradients contain infs or NaNs,
        # optimizer.step() is skipped.
        scaler.step(optimizer)
        # If optimizer.step() was skipped, the scaling factor
        # is reduced by the backoff_factor in GradScaler()
        scaler.update()
        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)