[AMD]: reimplement fast_tanhf() to avoid overflow #8551

xiaohuguo2023 · 2025-10-27T12:54:59Z

The Problem with the Original Formula

The original formula is:

tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)

Issue with large positive x:
- When x = 20: e^(40) ≈ 2.4 × 10^17 → manageable
- When x = 50: e^(100) ≈ 2.7 × 10^43 → exceeds the float32 maximum (≈ 3.4 × 10^38) and overflows to infinity
- Result: (∞ - 1)/(∞ + 1) = ∞/∞ = NaN

For negative x: the formula actually works fine because e^(2x) → 0, giving (-1)/(1) = -1
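
The overflow described above can be reproduced with a minimal Python sketch that emulates float32 using only the standard library. The helper names `f32` and `naive_tanh_f32` are illustrative and do not come from the Triton source; rounding each intermediate through `struct` is an assumption about evaluation order, not a copy of the generated GPU code.

```python
import math
import struct


def f32(v: float) -> float:
    """Round a Python float to float32; values too large become +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", v))[0]
    except OverflowError:
        # struct refuses finite values outside the float32 range,
        # whereas float32 hardware arithmetic would produce infinity.
        return math.copysign(math.inf, v)


def naive_tanh_f32(x: float) -> float:
    """Original formula (e^(2x) - 1) / (e^(2x) + 1), rounded to float32 at each step."""
    e = f32(math.exp(2.0 * x))
    num = f32(e - 1.0)
    den = f32(e + 1.0)
    return f32(num / den)  # inf / inf evaluates to NaN


if __name__ == "__main__":
    for x in (20.0, 50.0, -50.0):
        print(x, naive_tanh_f32(x))
```

In this emulation x = 20 and x = -50 return finite values, while x = 50 returns NaN because both numerator and denominator have already become infinite.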

The Numerically Stable Solution

For Positive x: Reformulation

tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) = (e^(2x) + 1 - 2) / (e^(2x) + 1) = 1 - 2/(e^(2x) + 1)

For Negative x: Using Symmetry

For x > 0: tanh(-x) = (e^(-2x) - 1) / (e^(-2x) + 1) = (1 - e^(2x)) / (1 + e^(2x)) = -tanh(x) = -1 × (1 - 2/(e^(2|x|) + 1))

Unified formulation:

tanh(x) = sign(x) × (1 - 2/(e^(2|x|) + 1))
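
As a rough illustration of the unified formulation, the following Python sketch evaluates it with emulated float32 rounding. The names `f32` and `stable_tanh_f32` are hypothetical, and this is a standard-library model of the formula, not the C++ lowering added by the PR.

```python
import math
import struct


def f32(v: float) -> float:
    """Round a Python float to float32; values too large become +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", v))[0]
    except OverflowError:
        return math.copysign(math.inf, v)


def stable_tanh_f32(x: float) -> float:
    """sign(x) * (1 - 2 / (e^(2|x|) + 1)) with float32 rounding at each step."""
    e = f32(math.exp(2.0 * abs(x)))  # may be +inf for large |x|
    frac = f32(2.0 / f32(e + 1.0))   # 2 / inf is 0, so no NaN can appear
    t = f32(1.0 - frac)
    return math.copysign(t, x)       # restore the sign of the input


if __name__ == "__main__":
    for x in (-50.0, -0.5, 0.0, 0.5, 50.0):
        print(x, stable_tanh_f32(x), math.tanh(x))
```

Because e^(2|x|) only ever appears in a denominator, an overflow to infinity turns the fraction into 0 and the result saturates at ±1. The sketch addresses overflow only: very close to x = 0 the subtraction 1 - 2/(e^(2|x|) + 1) cancels and loses relative precision.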

zhanglx13

lgtm!

antiagainst

Actually can you fix up the builder API following #8572?

AMD CDNA3 (MI300X/gfx942) does not have a hardware tanh instruction like NVIDIA's PTX tanh.approx. Instead of using PTX inline assembly (which doesn't work on ROCm), we use OCML's __ocml_tanh_f32 function. Triton's AMD backend lowers this using a numerically stable fast exp-based formula:

tanh(x) = sign(x) * (1 - 2/(e^(2|x|) + 1))

This implementation:
- For f32: calls __ocml_tanh_f32 directly via extern_elementwise
- For f16/bf16: extends to f32, calls __ocml_tanh_f32, truncates back

Also fixes the bf16 skip condition to only apply to CUDA (not ROCm).

References:
- Triton PR #7780: triton-lang/triton#7780
- Triton PR #8551: triton-lang/triton#8551
- NVIDIA PTX ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/
- AMD CDNA3 ISA: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf
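
The f16 path mentioned in the commit message above (extend to f32, compute, narrow back) can be modelled with a short Python sketch. Everything here is an assumption for illustration: `round_to` and `tanh_f16_via_f32` are hypothetical names, `math.tanh` stands in for `__ocml_tanh_f32`, and bf16 is omitted because the standard `struct` module has no bf16 format.

```python
import math
import struct


def round_to(fmt: str, v: float) -> float:
    """Round a Python float to the IEEE format of a struct code ('e' = f16, 'f' = f32)."""
    return struct.unpack(fmt, struct.pack(fmt, v))[0]


def tanh_f16_via_f32(x: float) -> float:
    """Model of: extend f16 to f32, take tanh in f32, narrow the result to f16."""
    x16 = round_to("e", x)               # the f16 input value
    x32 = round_to("f", x16)             # extending is exact: f32 holds every f16 value
    y32 = round_to("f", math.tanh(x32))  # stand-in for the f32 tanh call
    return round_to("e", y32)            # narrow the f32 result back to f16


if __name__ == "__main__":
    for x in (-10.0, -0.5, 0.0, 0.5, 10.0):
        print(x, tanh_f16_via_f32(x))
```

Since tanh is bounded by ±1, narrowing the result cannot overflow; only the input conversion to f16 is range-limited in this model.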

xiaohuguo2023 changed the title ~~reimplement fast_tanhf() to avoid overflow~~ [AMD]: reimplement fast_tanhf() to avoid overflow Oct 28, 2025

xiaohuguo2023 force-pushed the fix_fast_tanhf branch from 280b75b to ac2622f Compare October 28, 2025 18:33

xiaohuguo2023 marked this pull request as ready for review October 29, 2025 07:36

xiaohuguo2023 requested review from antiagainst and zhanglx13 as code owners October 29, 2025 07:36

zhanglx13 approved these changes Oct 29, 2025

antiagainst enabled auto-merge (squash) October 29, 2025 15:13

antiagainst approved these changes Oct 29, 2025

auto-merge was automatically disabled October 29, 2025 16:18
Head branch was pushed to by a user without write access

xiaohuguo2023 force-pushed the fix_fast_tanhf branch from 52cd0b7 to b736335 Compare October 29, 2025 16:18

zhanglx13 enabled auto-merge (squash) October 29, 2025 16:23

antiagainst disabled auto-merge October 29, 2025 16:38

antiagainst requested changes Oct 29, 2025

xiaohuguo2023 requested a review from ptillet as a code owner October 29, 2025 20:39

xiaohuguo2023 force-pushed the fix_fast_tanhf branch from 2a624d2 to 057b104 Compare October 29, 2025 21:00

xiaohuguo2023 added 2 commits October 29, 2025 16:11

reimplement fast_tanhf() to avoid overflow

e9fad50

update format

9db540a

xiaohuguo2023 force-pushed the fix_fast_tanhf branch from dec669a to 9db540a Compare October 29, 2025 21:12

update builder API

d86ec62

antiagainst approved these changes Oct 29, 2025

antiagainst enabled auto-merge (squash) October 29, 2025 22:02

antiagainst disabled auto-merge October 29, 2025 22:26

antiagainst merged commit 3f5eb50 into triton-lang:main Oct 29, 2025
8 of 9 checks passed

xiaohuguo2023 deleted the fix_fast_tanhf branch October 29, 2025 22:34

phambinhfin mentioned this pull request Jan 20, 2026

Implement approx_tanh for ROCm using OCML tanh function ROCm/jax#614

Merged
