Releases: triton-lang/triton
Triton 3.6.0 release
Triton 3.6 Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Proton Profiling
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
Dialect & Frontend
New Features
- Multidimensional Batch Support (#8542): Added support for multidimensional batches in `tl.trans` and `tl.dot` operations (see the sketch after this list)
- Ragged TMA Atomic Add (#8238): Added atomic add support for ragged TMA operations
- Integer Range Utility (#8753): Exposed an integer-range utility from AMD range analysis code for broader use
- Constexpr Through Min/Max (#8733): Propagate constexpr through builtin min/max functions (BC-breaking)
- Scales Dimension Checks (#8564): Added dimension checks for scales in `dot_scaled` operations
- Loop Bounds Verification (#8243): Added verification that loop bounds are scalars
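To illustrate the multidimensional batch support above, here is a minimal sketch of a kernel whose `tl.dot` and `tl.trans` operands carry two leading batch dimensions; the shapes, values, and the trivial store are illustrative assumptions, and the feature requires Triton ≥ 3.6:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def batched_dot_kernel(out_ptr, BLOCK: tl.constexpr):
    # Two leading batch dimensions in front of the (M, K) x (K, N) matmul dims.
    a = tl.full((2, 2, BLOCK, BLOCK), 1.0, tl.float16)
    b = tl.full((2, 2, BLOCK, BLOCK), 2.0, tl.float16)
    c = tl.dot(a, b)             # batched matmul over the two trailing dims
    t = tl.trans(c, 0, 1, 3, 2)  # transpose the matmul dims within each batch
    tl.store(out_ptr, tl.sum(t)) # collapse to one scalar just for the demo

out = torch.zeros(1, dtype=torch.float32, device="cuda")
batched_dot_kernel[(1,)](out, BLOCK=16)
```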
Bug Fixes
- For Loop Induction Variable (#8750): Fixed handling of modifications to for-loop induction variables
- Store Broadcasting (#8661): Fixed broadcasting issues in store operations
- Missing `dot_scaled` Handling (#8658): Fixed missing handling for `None` acc in `dot_scaled`
- AugAssign Line Information (#8703): Attached proper line number information to AugAssign nodes
- Starred Argument Handling (#8686): Made starred argument handling more robust
- Saved Exception Cloning (#8115): Fixed clone of saved exception before raising
- Tuple Mangling (#8060): Fixed mangling for tuples in JIT compilation
Improvements
- Optimized `tl.cdiv` (#8669): Optimized `tl.cdiv` for the common case of 32-bit divisors (see the sketch after this list)
- Un-deprecated min/max (#8734): Un-deprecated min/max on scalar tensors
- Warmup in KernelInterface (#8757): Moved warmup functionality into KernelInterface
- Verification with Diagnostics (#8074): Frontend always verifies with diagnostics enabled
- Constexpr with do_not_specialize Error (#8275): Added error when constexpr is combined with do_not_specialize
- Deprecated ast.Num Replacement (#8698): Replaced usage of deprecated `ast.Num`
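As a worked example of the `tl.cdiv` fast path and of constexpr propagation through the builtin `min`/`max` (#8733, listed under New Features above), here is a minimal sketch; the kernel name and body are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(x_ptr, n, BLOCK_A: tl.constexpr, BLOCK_B: tl.constexpr):
    # #8733: builtin min/max of constexpr values now stays constexpr, so
    # BLOCK can feed tl.arange, which requires a compile-time bound.
    BLOCK: tl.constexpr = min(BLOCK_A, BLOCK_B)
    num_tiles = tl.cdiv(n, BLOCK)  # ceil-div; fast path for 32-bit divisors (#8669)
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = (offs < n) & (pid < num_tiles)
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x + 1, mask=mask)

n = 1000
x = torch.zeros(n, device="cuda")
add_one_kernel[(triton.cdiv(n, 64),)](x, n, BLOCK_A=64, BLOCK_B=128)
```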
Backend & Compiler
LLVM Updates
- LLVM Bump (#8299): Bumped to llvm/llvm-project@f6ded0be897e
- LLVM Head Merge (#8612): Merged back changes from llvm-head with updated APIs
- Inliner Import (#8152): Import inliner in triton-opt for better optimization
Code Generation
- CTALayout as LinearLayout (#8770): Made CTALayout an honest-to-goodness LinearLayout for better representation
- Shared Layout Rank Check (#8772): Added a check that Shared layouts have rank equal to the tensor's rank or one less
- Backward Propagation Fixed Point (#8776): Run the backward-propagation removal until it reaches a fixed point, for correctness
- Generic `tcgen05.cp` Lowering (#8225): Implemented generic lowering for `tcgen05.cp`
- Generic Matrix Descriptors (#8321): Implemented shmem matrix descriptors generically
- LinearSharedEncoding Support (#8116): Added support for LinearSharedEncoding
- BF16x3 Trick (#7592): Implemented the BF16x3 trick for improved performance (see the sketch after this list)
- Padded Shared Linear Remapping (#7929): Added linear remapping to padded shared layout
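The BF16x3 trick above is, in its standard form, a two-term bf16 split of each fp32 operand with the negligible lo·lo product dropped, so one fp32 dot becomes three bf16 dots. Here is a torch sketch of the arithmetic under that assumption — an illustration of the general technique, not Triton's actual codegen:

```python
import torch

def bf16_split2(x: torch.Tensor):
    """Split fp32 into bf16 terms hi, lo so that x ~= hi + lo."""
    hi = x.to(torch.bfloat16)
    lo = (x - hi.float()).to(torch.bfloat16)
    return hi, lo

A = torch.randn(64, 64)
B = torch.randn(64, 64)
a_hi, a_lo = bf16_split2(A)
b_hi, b_lo = bf16_split2(B)

# Three bf16 products; the lo @ lo term is dropped as negligible.
approx = (a_hi.float() @ b_hi.float()
          + a_hi.float() @ b_lo.float()
          + a_lo.float() @ b_hi.float())

# The error is far smaller than a single plain-bf16 matmul would give.
print((approx - A @ B).abs().max())
```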
Optimizations
- Compilation Time Improvement (#8689): Improved compilation time in constant sanitizer pass
- AxisInfo Loop Removal (#8679): Removed unnecessary loop over roots in AxisInfo analysis
- Constant Analysis (#8502): Improved constant analysis in AxisInfo
- Combinatorial Explosion Prevention (#8477): Prevented combinatorial explosion when checking tmem_load uses
- Layout Conversion Vectorization (#8655): Fixed vectorization for convert_layout with ldmatrix and stmatrix
- MaybeDeduplicate Generalization (#8492): Generalized `maybeDeduplicate` to all layouts
Bug Fixes
- cp_async Alignment (#8752): Fixed cp_async in the pipeliner when alignment info is lost
- While Op Layout Propagation (#8751): Prevented backward layout propagation through while op
- AxisInfo Handling (#8723, #8754): Fixed handling of unvisited operands in AxisInfoAnalysis
- 64-bit Atomic CAS (#8105): Fixed 64-bit `atomic_cas` operation
- Memdesc of Pointers (#8515): Fixed memdesc handling for pointer types
- Alloc Shape Reset (#8537): Reset alloc_shape when doing memdesc_index
- Denorm Flushing (#8557): Don't flush denorms for precise div/sqrt
- Local Load Reordering (#8423): Prevented reordering local_load across side-effecting operations
- Pattern Reordering (#8266): Restricted pattern re-ordering of alloc and reshape
- Poison Op AxisInfo (#8489): Fixed AxisInfo handling of PoisonOp producing MemDesc
Analysis Improvements
- Trans Contiguity (#8226): Added tt.trans contiguity analysis support
- Hint Analysis (#5254): Fixed hint analysis in axis info
- Topological Sort Deprecation (#8596): Deprecated triton's custom topological sort in favor of MLIR's
AMD/HIP Backend
GFX1250 Support
- Initial Skeleton (#8131): Added gfx1250 skeleton support
- WMMA Support (#8174, #8283, #8312): Added initial and scaled WMMA support for gfx1250
- TDM Support (#8333, #8392, #8479): Added Tensor Data Movement (TDM) load/store support
- Async Copy (#8509, #8510, #8621, #8622): Added async copy and async wait support
- Buffer Ops (#8130, #8532): Enabled buffer atomics and exposed buffer ops
- Multicast Loads (#8719, #8759): Added async load to LDS multicast and multicast in `tt.LoadOp`
- ds_read_tr (#8461): Added gfx1250 support for ds_read_tr
- LDS Memory Barriers (#8681): Added support for LDS memory barriers
- Shared Memory Size (#8517): Updated shared memory size from TargetInfo
- num_cta > 1 (#8718): Support launches with num_cta > 1 on gfx1250
- Scale Preshuffling (#8576): Implemented scale preshuffling and opSel
MXFP & Scaled Dot
- Scale Preshuffling in Decomposed Dot (#8170): Support scale preshuffling in decomposed scaled dot
- Pipeline Scale via LDS (#8258): Pipeline scale in decomposed scaled dot via LDS
- Scaled Upcast Ops (#8088): Introduced scaled upcast ops for hardware upcasting
- FP4->BF16 Optimized Conversion (#8145): Added optimized fp4->bf16 conversion for MI300
- Scaled Dot Decomposition for GFX950 (#7839): Enabled f16 * mxfp scaled dot decomposition
Layout & Memory Optimizations
- Permlane Swap (#7947): Use permlane_swap for layout conversions between dot operations
- Padded Shared with AsyncCopy (#8365): Use PaddedLayout with AsyncCopy on gfx950 when pipelining
- LDS Layout Selection Redesign (#8053): Redesigned stream pipeliner LDS layout selection logic
- Padded Encoding Restrictions (#8583): Relaxed padded encoding block size restrictions
- Direct-to-LDS with Padded (#8185): Coalesce direct-to-lds loads with padded encodings
- Contiguity Hint for Direct-to-LDS (#8761): Use contiguity hint for direct-to-lds ops
- BypassLDS Feature (#7968): Added bypassLDS feature to StreamPipeline
Code Generation
- ds_read_tr with Linear Layout (#8235): Use linear layout to infer and emit ds_read_tr
- ds_read_tr Restrictions Lifted (#8442): Lift unneeded ds_read_tr lowering restrictions
- ds_read_tr Vec Size Limit (#8377): Limit vec size for ds_read_tr + padded layouts by min interval
- Wave ID Optimization (#8601): Optimized gfx9 wave id code generation
- MFMA Layout Refactor (#8213): Refactored MFMA layout implementation
- MFMA Select Replacement (#8320): Replaced mfma select in LLVM conversion
- FP8/BF8 WMMA Instruction Selection (#8649): Fixed instruction selection for fp8/bf8 wmma
- Chained WMMA Optimization (#7374): Optimized chained multiplications for WMMA
- BF16 v_dot (#8444): Use v_dot for bf16 multiplication on gfx11/gfx12
Build & Driver
- ROCm 7 Docker Image (#8224): Switched to use official ROCm 7 docker image
- HIP v6 Requirement (#8748): Require only HIP v6, which is all that is necessary
- HIP Header Update (#8709): Updated HIP header files to 7.1
- Optional Symbols Support (#8729): Support optional symbols in driver.py
- Uniform Workgroup Size (#8720): Indicate uniform workgroup size to LLVM
- MIR Dump Option (#8663): Added option to dump MIR
- Custom LLVM Scheduler (#8326, #8700): Added schedule hint for custom LLVM scheduler
Bug Fixes
- Pointer Canonicalization (#8465, #8276): Fixed ptr-canonicalization segfault and assertion
- Large Tensor Pointer Canonicalization (#8359): Disabled pointer-canonicalization for large tensors
- Padded Shared Local Load (#8683): Fixed padded shared when lowering local load
- Nondeterministic Atomic Tests (#8633): Fixed nondeterministic atomic tests failure on RDNA
- Buffer Cache Swizzling (#8264): Turned off buffer op cache swizzling temporarily
- Direct-to-LDS on CDNA1/2 (#8280): Disabled direct-to-lds loads on CDNA1 and CDNA2
- Floating-point Upcasting Rounding (#8268): Skip rounding mode for floating-point upcasting
- TilesPerWarp Boundary Cases (#8467): Fixed deduceTilesPerWarp boundary cases
- fast_tanhf Overflow (#8551): Reimplemented fast_tanhf() to avoid overflow
- MFMA Small K Selection (#8278): Avoid selecting MFMA with smaller K than problem size
NVIDIA Backend
Blackwell Features
Triton 3.5.1 release, bug fix release
This release fixes the following issue:
- Fix sm103 (GB300) support broken by the Triton 3.5.0 release (#8045)
Triton 3.5.0 release
Triton Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
Dialect & Frontend
New Features
- Warp Specialization Enhancements (#8005): Made warp specialization require at least 4 warps with proper error messaging to prevent compiler crashes
- Ragged TMA Support (#7792, #7783): Added support for write-only and general ragged TMAs with automatic bounds checking using higher-dimensional TMA descriptors
- Device Assert Mask Support (#7905): Added `mask` parameter to `tl.device_assert` for easier debugging with masked operations (see the sketch after this list)
- Padding Option for TMA Loads (#7993): Added support for padding option (including NaN) in TMA descriptor creation and fallback paths
- Implicit Downcast in TMA Descriptor Store (#6236): Fixed missing implicit downcast when storing blocks through TMA descriptors
- Mutations Disallowed (#7762): Disabled all mutations to address semantic issues in the language
- Specialized Recursion (#7468): Enabled functions to recurse on specialized versions of themselves
- Constexpr Function Cache Invalidation (#7802): Reworked `constexpr_function` to support cache invalidation and capability checks
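A minimal sketch of the `mask` parameter on `tl.device_assert`, assuming the keyword form described above; the kernel body and names are illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def checked_copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    # Out-of-bounds lanes are excluded from the assert via the mask (#7905);
    # device-side asserts fire only with debug enabled (e.g. TRITON_DEBUG=1).
    tl.device_assert(x >= 0, "negative input", mask=mask)
    tl.store(y_ptr + offs, x, mask=mask)

n = 100
x = torch.rand(n, device="cuda")
y = torch.empty_like(x)
checked_copy_kernel[(triton.cdiv(n, 64),)](x, y, n, BLOCK=64)
```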
Bug Fixes
- Floating Point Argument Passing (#7439): Fixed floating point argument passing for `tl.float16` and other FP types
- Non-Associative Reduce Rematerialization (#7272): Avoided rematerialization for non-associative reduce operations to prevent data consistency issues
- PDL Issue Fix (#7379): Fixed PDL-related issues in the frontend
- Constexpr in Tuples (#7442): Improved handling of constexpr in tuples, fixing type mismatches and in-place mutations
- Loop Carry Detection (#7200): Improved detection of loop carries when `@builtin` or `@core.extern` functions modify their arguments
- Liveouts in Conditionals (#7318): Fixed detection of liveouts in conditional blocks
Improvements
- MLIR Verifier After Parsing (#7999): Run MLIR verifier after parsing to catch errors early
- Better Error for num_cta > 1 on sm < 90 (#7812): Improved error messaging for unsupported configurations
- Extern Elementwise Type Handling (#7930): Fixed mismatched type handling for `core.extern_elementwise`
- Libdevice Exposure in Gluon (#7890): Exposed libdevice functions with improved layout propagation
Backend & Compiler
LLVM Updates
- LLVM Bump (#7881): Updated to llvm/llvm-project@bc773632355b with multiple API changes including:
  - Switched `Constant{Int|Float}Op` type and value order
  - Provided triple for `TargetLibraryInfoImpl`
  - Fixed atomic sync scope for NVIDIA
  - Updated MLIR lib names and ops
Code Generation
- Generic Swizzling for convert_layout (#6982, #7565): Implemented generalized swizzling algorithm for `convert_layout` that:
  - Finds optimal shared memory layout maximizing read/write vectorization
  - Minimizes bank conflicts
  - Supports `ldmatrix`/`stmatrix` and transpose versions
  - Uses columns and diagonals for better performance
- Warp-Local Layout Conversion (#7558): Improved warp-local layout conversion algorithm using shuffles with:
  - Better handling of broadcasting in layouts
  - Fewer `select` and `shuffle` instructions
  - Register packing for sub-32-bit data types
- Byte Permutes in Intra-Warp Conversion (#7809): Used byte permute instructions for better performance in layout conversions
- Tmem Alloc Hoisting (#7568): Hoisted tmem alloc outside of if statements to reduce register pressure
- CP.Async Lowering Improvements (#7314): Moved cp.async to better lowering sequence reusing previous optimizations
Optimizations
- Simpler Codegen for Linear Layouts (#7201): Simplified code generation for linear layouts
- Vectorization Fixes (#7845): Fixed vectorization for `PaddedSharedEncoding` with non-default order
- XOR Trick Refactoring (#7397): Refactored XOR trick into helper function for better code reuse
- Shared Memory Offset Fixes (#7949): Fixed various issues with smem base offsets
- Min/Max Redux Optimization for Blackwell (#7465): Implemented new redux.sync optimization
Bug Fixes
- Atomic RMW Broadcasting (#7460): Fixed atomic rmw ops to broadcast results when necessary
- TMA Load with Multiple Users (#7398): Fixed lowering of TMA load when users have differing encodings
- Subview Padding (#7404): Fixed subview padding for PaddedSharedEncoding
- Memdesc Subview Fixes (#7480, #7515): Properly handled memdesc_subview with slicing and offsets
- FP16 to FP32 Conversion (#7585): Fixed fp16 to fp32 conversion issues
- Barrier Synchronization (#7993): Added bar.sync before deallocating tmem to prevent race conditions
AMD/HIP Backend
New Features
- GFX950 (MI350) Support: Added comprehensive support for AMD's latest architecture including:
  - ChainedDot Schedule (#7601, #7638): Added new scheduling variant for loops with 2 chained dots
  - Ping-Pong Transformation (#7638, #7458): Added ping-pong support for:
    - Chained dot schedules
    - Async load with num_stages=3
    - MXFP types
- Buffer Atomic CAS (#7292): Added support for buffer atomic compare-and-swap
- FP64 MFMA Support (#7461): Added support for fp64 dot operations using MFMA intrinsics
Layout & Memory Optimizations
- General Swizzling Support (#7482, #7606): Enabled ConvertLayoutOp general swizzling
- Padded vs Swizzled Allocation (#7328, #7750): Introduced specialized allocation pass with proper layout selection strategy
- Improved LDS Usage (#7750, #7813): Optimized LDS usage by:
  - Preferring swizzle layouts when LDS limits allow
  - Using single LDS for both transposed and non-transposed access
  - Better layout selection in optimize-lds-usage pass
- TilesPerWarp Parameter (#7283): Added tilesPerWarp parameter to MFMA layout for contiguous tile computation
- Extract Slice Rewrite (#7128): Refactored extract_slice to support:
  - Arbitrary tensor ranks
  - Relaxed layout constraints
  - CTA tile boundary alignment
Code Generation Improvements
- PermlaneSwap Pattern (#7825, #7861): Added general permlane_swap pattern for ConvertLayoutOp
- Register Broadcast (#7407): Added support for register broadcast in slice/concat ops
- Shared Memory Ops for FP4 (#7626): Added support for M/N packed FP4 with transposition
- Direct-to-LDS Loads (#7829): Refactored lowering via common `lowerLdSt` path
- Local Load/Store Lowering (#7355): Enabled common code path for local_load/store operations
FP8 & Numeric Support
- FP8 Variant Support:
- Dot Scaled Support: Enabled on gfx11 (#7954) and gfx12 (#7644) with emulation via decomposition (see the sketch after this list)
- True16 Handling: Disabled on gfx11 due to test failures (#7953)
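The decomposition path above is driven by the ordinary `tl.dot_scaled` frontend. Here is a minimal sketch, assuming the `(lhs, lhs_scale, lhs_format, rhs, rhs_scale, rhs_format)` argument order and OCP encodings; the constants 0x38 (1.0 in e4m3) and 127 (2^0 in e8m0), the shapes, and the kernel name are all illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def mx_dot_kernel(out_ptr, M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    a = tl.full((M, K), 0x38, tl.uint8)            # e4m3 bit pattern for 1.0
    a_scale = tl.full((M, K // 32), 127, tl.uint8) # one e8m0 scale per 32 elems
    b = tl.full((K, N), 1.0, tl.bfloat16)
    c = tl.dot_scaled(a, a_scale, "e4m3", b, None, "bf16")  # (M, N) in fp32
    offs = tl.arange(0, M)[:, None] * N + tl.arange(0, N)[None, :]
    tl.store(out_ptr + offs, c)

M, N, K = 32, 32, 64
out = torch.empty((M, N), dtype=torch.float32, device="cuda")
mx_dot_kernel[(1,)](out, M=M, N=N, K=K)
```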
Stream Pipeliner Enhancements
- Refactoring (#7526, #7556): Refactored to use more common pipeliner functionality
- Async Wait Handling (#7577): Restricted merging async_wait when pipelining with num_stages=3
- Mask Operation Support (#7620): Added ttg.mask handling in stream pipeliner
Build & Driver
- LLD Library API (#7548): Replaced shell-out to lld with direct library API calls
- hipGetProcAddress (#7350): Switched to using hipGetProcAddress for querying HIP symbols
- Driver Version Check (#7501): Added runtime driver version check with descriptive errors
- AOT Compilation (#7007): Added HIP AOT compilation support to compile.py tool
Bug Fixes
- Pointer Canonicalizer (#7242): Fixed attribute propagation when ranks don't match
- Global Atomic Optimization (#7496): Optimized global atomic operations following memory model semantics
- FP32/FP16 to OCP FP8 (#7382): Fixed conversion for subnormal numbers
- Async Copy Vectorization (#7250): Fixed async load pipeline for less than 32-bit loads
- OptimizeLDSUtility Crash (#7434): Fixed nullptr crash in createTmpLayout
- Memrealtime on GFX11/12 (#7357): Added proper support using s_sendmsg_rtn_b64
NVIDIA Backend
Hopper/Blackwell Features
- Warp Specialization:
  - Enabled for persistent matmul and FA (#7642, #7623)
  - Assigned final try_wait to partition (#7757)
  - Tightened user critical section with accumulator (#7509)
  - Fixed rematerialization bug in partitioner (#7427)
  - Optimized partitioning by hoisting above broadcasts (#7692)
  - Enabled 1 buffer for SSA partition dependencies (#7686)
  - Control flow support in TMEM allocation (#7698)
- WGMMA Support in Gluon (#7300, #7313): Added Hopper WGMMA with async wait support
- Aref Operations (#7479, #7561, #7645): Updated aref ops and lower_aref pass with:
  - Multi-consumer support
  - Stage/cluster attribute passing
  - TMA load aref insertion
  - Control flow handling
- Partition Loops Rewrite (#7415): Reimplemented supporting general control flow using mutual recursion
Blackwell-Specific
- TMEM Support:
  - Fixed codegen for Nx1xf32 (#7234)
  - Fixed tmem_su...
Triton 3.4.0 Release
Highlights
Gluon Framework Comprehensive Enhancement
The Gluon framework has received major enhancements across all areas including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)
Hardware Support Expansion
- AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architectural-specific features (#7175, #7171, #7127, #6744, #6594)
- Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
- Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)
Performance Optimizations
- Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
- MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
- TMA Operations Enhancement - Improved tensor memory access with better layout support and reduced register pressure (#6725, #6238, #6580)
New Features
Language and Frontend
- Aggregate Type Support - Added `@tl.aggregate` decorator for autogenerating Triton types from Python classes (#6970)
- JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
- Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
- Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
- Masked Histograms - Added support for masked histogram operations (#6695)
- Syntactic Sugar Additions - Added `.item()` as syntactic sugar for `.reshape([])` (#6873) (see the sketch after this list)
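A minimal sketch of the `.item()` sugar, which turns a one-element tensor into a scalar exactly as `.reshape([])` does; the kernel body and names are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_by_flag_kernel(x_ptr, flag_ptr, BLOCK: tl.constexpr):
    # flag loads as a shape-(1,) tensor; .item() == .reshape([]) makes it scalar
    flag = tl.load(flag_ptr + tl.arange(0, 1)).item()
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    tl.store(x_ptr + offs, tl.where(flag > 0, x, -x))

x = torch.randn(64, device="cuda")
flag = torch.ones(1, dtype=torch.int32, device="cuda")
scale_by_flag_kernel[(1,)](x, flag, BLOCK=64)
```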
Backend and Compilation
- Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
- Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
- TMA Reduce Operations - Added TMA reduce operations for descriptor-based reducing stores (#6580)
- Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
- BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519) (see the sketch after this list)
- Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
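A minimal sketch of the BF16 atomic add (#6519), on hardware that supports bf16 atomics; the accumulation pattern and names are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def bf16_accumulate_kernel(acc_ptr, x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    # Every program atomically adds its bf16 tile into the same buffer.
    tl.atomic_add(acc_ptr + tl.arange(0, BLOCK), x, mask=mask)

n = 4096
x = torch.randn(n, dtype=torch.bfloat16, device="cuda")
acc = torch.zeros(128, dtype=torch.bfloat16, device="cuda")
bf16_accumulate_kernel[(triton.cdiv(n, 128),)](acc, x, n, BLOCK=128)
```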
Hardware-Specific Features
- AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
- AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
- AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
- NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
- AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)
Improvements
Performance
- Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
- Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
- SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
- Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)
Developer Experience
- Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
- Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
- Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
- Build System Improvements - Better CMake support and dependency management ([#6330](https://github.com/tri...