
Optimize your Torch training code for LLMs

Pankaj Chouhan

Florida State University

2024



Common Floating Point Data Types

Figure: Comparison between float16, float32 and bfloat16 (source)



Comparison and Best Use Cases

Figure: Histogram of activation gradient magnitudes throughout FP32 training. From the NVIDIA blog.



Comparison and Best Use Cases

Figure: Loss of information when using the FP16 format. From the NVIDIA blog.



Comparison and Best Use Cases

Float16:
Advantages: Memory efficiency.
Disadvantages: Limited precision and narrow dynamic range; not ideal for training complex models.
Float32:
Advantages: Higher precision, widely supported.
Disadvantages: Higher memory usage compared to float16 and bfloat16.
bfloat16:
Advantages: Same dynamic range as float32 with reduced memory use.
Disadvantages: Lower precision than float32, but often sufficient for deep learning tasks; only available on recent hardware, e.g., A100 or RTX 30-series GPUs.
Conclusion: float16 for inference, bfloat16 for training LLMs on supported GPUs/TPUs, float32 for general use.
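
A minimal sketch (assuming a recent PyTorch build) that makes the trade-offs above concrete by printing the bit width, dynamic range, and precision of the three formats:

import torch

# Compare bit width, dynamic range, and precision of the three formats.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>14}  bits={info.bits:2d}  max={info.max:.2e}  "
          f"smallest normal={info.tiny:.2e}  eps={info.eps:.2e}")

bfloat16 matches float32's dynamic range almost exactly (same max and smallest normal) but has a far larger eps, which is exactly the precision trade-off described above.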



Improvement 1: Set up the matrix multiplication precision

PyTorch allows you to set the matrix multiplication precision for float32 operations.
Example code:
torch.set_float32_matmul_precision("high")

The precision can be one of the following:

highest – Maximum precision (default); always use full FP32.
high – Mixed precision; use TensorFloat32, or represent each float32 as the sum of two bfloat16 values.
medium – Reduced precision for speed; use bfloat16 where allowed.
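
A minimal sketch of how this might be toggled at the top of a training script; the IMPROVEMENT_1 flag is hypothetical, mirroring the IMPROVEMENT_2/IMPROVEMENT_3 flags on later slides:

import torch

IMPROVEMENT_1 = True  # hypothetical toggle, mirroring the flags on the following slides

if IMPROVEMENT_1:
    # Allow TF32 (or bfloat16-based) kernels for float32 matrix multiplications.
    torch.set_float32_matmul_precision("high")

# Returns "high" if the toggle is on, otherwise the default "highest".
print(torch.get_float32_matmul_precision())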



Improvement 2: Use AMP

# Enable AMP for forward pass if IMPROVEMENT_2 is toggled
if IMPROVEMENT_2:
    with torch.autocast(device_type=device, dtype=torch.float16):
        logits, loss = model(x, y)
else:
    logits, loss = model(x, y)

Mixed Precision: combines float16 and float32 to reduce memory usage while preserving accuracy.
Safe Operations: matrix multiplications, activation functions, and element-wise operations (add, subtract) run in float16.
Risky Operations: reductions (e.g., summing or averaging across tensors) and the loss computation run in float32.
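
A minimal sketch of a full training step with float16 autocast; since float16 gradients can underflow to zero, torch.cuda.amp.GradScaler is commonly paired with autocast. The model, optimizer, and batch names here are assumptions, not code from the slides:

import torch

# GradScaler scales the loss so that small gradients do not underflow in float16.
scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad(set_to_none=True)            # `optimizer`, `model`, `x`, `y` are assumed
with torch.autocast(device_type="cuda", dtype=torch.float16):
    logits, loss = model(x, y)                   # forward pass in mixed precision
scaler.scale(loss).backward()                    # backward pass on the scaled loss
scaler.step(optimizer)                           # unscales gradients, then calls optimizer.step()
scaler.update()                                  # adjusts the scale factor for the next iteration

With bfloat16 autocast the scaler is usually unnecessary, since bfloat16 has the same dynamic range as float32.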



Improvement 3: Use Torch compile

if IMPROVEMENT_3:
    model = torch.compile(model)

Consider this equation:

f(x) = sin^2(x) + cos^2(x)    (1)

In eager mode, each operation in the computational graph is sent to the GPU/CPU one by one.
torch.compile looks at the full computational graph at once and optimizes it by fusing consecutive operations.
This is a good blog on the topic.
torch.compile: graph-based optimization, operation fusion, and reduced memory and Python overhead.
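
A minimal sketch illustrating the fusion idea on the toy function from Eq. (1): once compiled, the sin, cos, squares, and add can be fused into fewer kernels instead of being dispatched one by one in eager mode. The sizes and device choice are assumptions:

import torch

def f(x):
    # In eager mode, sin, cos, the two squares, and the add are dispatched as separate kernels.
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_f = torch.compile(f)   # captures the whole graph and fuses the element-wise ops

x = torch.randn(1_000_000, device="cuda" if torch.cuda.is_available() else "cpu")
print(torch.allclose(compiled_f(x), f(x)))   # same result, fewer kernel launches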



Improvement 4: Use flash attention

# Perform attention
if self.use_flash_attention:
    # Use flash attention (optimized attention)
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
else:
    # Traditional scaled dot-product attention with causal masking
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
    att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    y = att @ v  # Apply attention weights to the values

The main idea is to move the bottleneck of computing self-attention away from reads and writes to slow HBM (High Bandwidth Memory) and into the ultra-fast on-chip SRAM of the GPU.
Self-attention is quadratic in sequence length, in both memory and compute time.
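
F.scaled_dot_product_attention is a drop-in replacement for the manual implementation above. A minimal standalone sketch, assuming a CUDA GPU and placeholder batch/head/sequence/head-dim sizes:

import torch
import torch.nn.functional as F

# Assumed sizes: batch, attention heads, sequence length, head dimension.
B, H, T, D = 4, 12, 1024, 64
q = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)

# PyTorch picks the fastest available backend (FlashAttention, memory-efficient,
# or plain math) based on device, dtype, and shapes.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # shape (B, H, T, D)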



Improvement 4: What is HBM and SRAM?

Figure: GPU memory. From NVIDIA blog.



Improvement 4: HBM vs. SRAM

Table: Comparison of HBM and SRAM in GPUs


Characteristic    | HBM (High Bandwidth Memory)        | SRAM (Static Random Access Memory)
Speed             | High                               | Extremely fast; used for cache
Capacity          | High (up to several GB per stack)  | Low (typically KB to MB)
Usage             | Main memory for GPU computation    | Cache memory on the GPU die
Power efficiency  | More power-efficient than GDDR     | Consumes more power per bit
Cost              | Expensive compared to GDDR         | Very expensive; takes up a lot of die space



Improvement 4: Flash attention

Figure: Minimizing calls to HBM in self-attention. From the HuggingFace blog.



Improvement 4: Flash attention

Introduced in 2022 by Tri Dao et al.

Figure: By using tiling and recomputation, they minimize the calls to HBM. Source: arXiv.

They keep track of two variables, m(x) and l(x), to compute the global softmax using only local blocks.
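
A sketch of that bookkeeping, following the FlashAttention paper's notation: for a block x, m(x) is the running maximum and l(x) the running sum of exponentials, and two blocks can be merged without ever materializing a full attention row:

\[ m(x) = \max_i x_i, \qquad l(x) = \sum_i e^{x_i - m(x)} \]

For the concatenation x = [x^{(1)}, x^{(2)}] of two blocks:

\[ m(x) = \max\big(m(x^{(1)}),\, m(x^{(2)})\big), \qquad
   l(x) = e^{m(x^{(1)}) - m(x)}\, l(x^{(1)}) + e^{m(x^{(2)}) - m(x)}\, l(x^{(2)}) \]

so that \( \operatorname{softmax}(x)_i = e^{x_i - m(x)} / l(x) \) can be assembled block by block.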
Improvement 5: Pad to a power of 2 for efficient GPU computation

The main reason is how the hardware is designed.

Supported by empirical results, this approach has become a widely accepted standard in the field (a small helper illustrating the idea is sketched after the list below).

ChatGPT’s answer:
Memory Access Efficiency: Aligning data to powers of 2 allows for coalesced memory
access, reducing memory transaction overhead.
Full Warp Utilization: Padding ensures that all threads in a warp (group of 32 threads)
are fully utilized, avoiding idle threads.
Cache Optimization: Data fits neatly into GPU cache lines (often power-of-2 sized),
minimizing cache misses and improving access speed.
Reduced Shared Memory Bank Conflicts: Ensures data is evenly distributed across shared
memory banks, improving parallel access.
SIMD Efficiency: Enables efficient execution of operations in GPU’s SIMD (Single
Instruction, Multiple Data) units.
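
As an illustration of the idea (a hypothetical helper, not code from the slides): in GPT-style models the vocabulary size is often padded up to the next multiple of a power of 2, such as 64, before building the embedding and output layers:

def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple` (a power of 2 works well on GPUs)."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50257                           # GPT-2's vocabulary size
padded_vocab = pad_to_multiple(vocab_size)   # 50304, a much friendlier shape for the GPU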



Improvement 6: Use fused Adam optimizer

if use_fused_adam:
    optimizer = torch.optim.AdamW(..., fused=True)
else:
    optimizer = torch.optim.AdamW(...)

During training, the GPU spins up separate kernels for gradient computation, momentum updates, parameter updates, etc.
Fused Adam: combines the per-parameter update operations of Adam (momentum, variance, and weight updates) into a single GPU kernel.
Faster Training: reduces kernel launches and memory accesses, speeding up model training.
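
A minimal sketch filling in the elided arguments above with placeholder hyperparameters (not from the slides); the fused kernel requires CUDA parameters, so the flag is gated on the device:

import torch

use_fused_adam = torch.cuda.is_available()   # the fused kernel requires CUDA tensors
optimizer = torch.optim.AdamW(
    model.parameters(),      # `model` is assumed to be defined and already on the GPU
    lr=3e-4,                 # placeholder hyperparameters, not from the slides
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=use_fused_adam,
)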



Other improvement

Distributed Data Parallel (DDP).

Model parallelism.
Tensor parallelism.
ZeRO optimization.
Quantization and KV caching (for inference).



Thank You
Questions?

