Releases: triton-lang/triton
Triton 3.6.0 release
Triton 3.6 Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Proton Profiling
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
Dialect & Frontend
New Features
- Multidimensional Batch Support (#8542): Added support for multidimensional batches in `tl.trans` and `tl.dot` operations (see the sketch after this list)
- Ragged TMA Atomic Add (#8238): Added atomic add support for ragged TMA operations
- Integer Range Utility (#8753): Exposed an integer-range utility from AMD range analysis code for broader use
- Constexpr Through Min/Max (#8733): Propagate constexpr through builtin min/max functions (BC-breaking)
- Scales Dimension Checks (#8564): Added dimension checks for scales in `dot_scaled` operations
- Loop Bounds Verification (#8243): Added verification that loop bounds are scalars
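To illustrate the multidimensional batch support above, here is a minimal sketch of a kernel whose `tl.dot` and `tl.trans` operands carry two leading batch dimensions; the shapes, values, and the trivial store are illustrative assumptions, and the feature requires Triton ≥ 3.6:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def batched_dot_kernel(out_ptr, BLOCK: tl.constexpr):
    # Two leading batch dimensions in front of the (M, K) x (K, N) matmul dims.
    a = tl.full((2, 2, BLOCK, BLOCK), 1.0, tl.float16)
    b = tl.full((2, 2, BLOCK, BLOCK), 2.0, tl.float16)
    c = tl.dot(a, b)             # batched matmul over the two trailing dims
    t = tl.trans(c, 0, 1, 3, 2)  # transpose the matmul dims within each batch
    tl.store(out_ptr, tl.sum(t)) # collapse to one scalar just for the demo

out = torch.zeros(1, dtype=torch.float32, device="cuda")
batched_dot_kernel[(1,)](out, BLOCK=16)
```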
Bug Fixes
- For Loop Induction Variable (#8750): Fixed handling of modifications to for-loop induction variables
- Store Broadcasting (#8661): Fixed broadcasting issues in store operations
- Missing `dot_scaled` Handling (#8658): Fixed missing handling for `None` acc in `dot_scaled`
- AugAssign Line Information (#8703): Attached proper line number information to AugAssign nodes
- Starred Argument Handling (#8686): Made starred argument handling more robust
- Saved Exception Cloning (#8115): Fixed clone of saved exception before raising
- Tuple Mangling (#8060): Fixed mangling for tuples in JIT compilation
Improvements
- Optimized `tl.cdiv` (#8669): Optimized `tl.cdiv` for the common case of 32-bit divisors (see the sketch after this list)
- Un-deprecated min/max (#8734): Un-deprecated min/max on scalar tensors
- Warmup in KernelInterface (#8757): Moved warmup functionality into KernelInterface
- Verification with Diagnostics (#8074): Frontend always verifies with diagnostics enabled
- Constexpr with do_not_specialize Error (#8275): Added error when constexpr is combined with do_not_specialize
- Deprecated ast.Num Replacement (#8698): Replaced usage of deprecated `ast.Num`
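As a worked example of the `tl.cdiv` fast path and of constexpr propagation through the builtin `min`/`max` (#8733, listed under New Features above), here is a minimal sketch; the kernel name and body are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(x_ptr, n, BLOCK_A: tl.constexpr, BLOCK_B: tl.constexpr):
    # #8733: builtin min/max of constexpr values now stays constexpr, so
    # BLOCK can feed tl.arange, which requires a compile-time bound.
    BLOCK: tl.constexpr = min(BLOCK_A, BLOCK_B)
    num_tiles = tl.cdiv(n, BLOCK)  # ceil-div; fast path for 32-bit divisors (#8669)
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = (offs < n) & (pid < num_tiles)
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x + 1, mask=mask)

n = 1000
x = torch.zeros(n, device="cuda")
add_one_kernel[(triton.cdiv(n, 64),)](x, n, BLOCK_A=64, BLOCK_B=128)
```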
Backend & Compiler
LLVM Updates
- LLVM Bump (#8299): Bumped to llvm/llvm-project@f6ded0be897e
- LLVM Head Merge (#8612): Merged back changes from llvm-head with updated APIs
- Inliner Import (#8152): Import inliner in triton-opt for better optimization
Code Generation
- CTALayout as LinearLayout (#8770): Made CTALayout an honest-to-goodness LinearLayout for better representation
- Shared Layout Rank Check (#8772): Added a check that Shared layouts have rank equal to the tensor's rank or one less
- Backward Propagation Fixed Point (#8776): Run the backward-propagation removal until it reaches a fixed point, for correctness
- Generic `tcgen05.cp` Lowering (#8225): Implemented generic lowering for `tcgen05.cp`
- Generic Matrix Descriptors (#8321): Implemented shmem matrix descriptors generically
- LinearSharedEncoding Support (#8116): Added support for LinearSharedEncoding
- BF16x3 Trick (#7592): Implemented the BF16x3 trick for improved performance (see the sketch after this list)
- Padded Shared Linear Remapping (#7929): Added linear remapping to padded shared layout
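The BF16x3 trick above is, in its standard form, a two-term bf16 split of each fp32 operand with the negligible lo·lo product dropped, so one fp32 dot becomes three bf16 dots. Here is a torch sketch of the arithmetic under that assumption — an illustration of the general technique, not Triton's actual codegen:

```python
import torch

def bf16_split2(x: torch.Tensor):
    """Split fp32 into bf16 terms hi, lo so that x ~= hi + lo."""
    hi = x.to(torch.bfloat16)
    lo = (x - hi.float()).to(torch.bfloat16)
    return hi, lo

A = torch.randn(64, 64)
B = torch.randn(64, 64)
a_hi, a_lo = bf16_split2(A)
b_hi, b_lo = bf16_split2(B)

# Three bf16 products; the lo @ lo term is dropped as negligible.
approx = (a_hi.float() @ b_hi.float()
          + a_hi.float() @ b_lo.float()
          + a_lo.float() @ b_hi.float())

# The error is far smaller than a single plain-bf16 matmul would give.
print((approx - A @ B).abs().max())
```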
Optimizations
- Compilation Time Improvement (#8689): Improved compilation time in constant sanitizer pass
- AxisInfo Loop Removal (#8679): Removed unnecessary loop over roots in AxisInfo analysis
- Constant Analysis (#8502): Improved constant analysis in AxisInfo
- Combinatorial Explosion Prevention (#8477): Prevented combinatorial explosion when checking tmem_load uses
- Layout Conversion Vectorization (#8655): Fixed vectorization for convert_layout with ldmatrix and stmatrix
- MaybeDeduplicate Generalization (#8492): Generalized `maybeDeduplicate` to all layouts
Bug Fixes
- cp_async Alignment (#8752): Fixed cp_async in the pipeliner when alignment info is lost
- While Op Layout Propagation (#8751): Prevented backward layout propagation through while op
- AxisInfo Handling (#8723, #8754): Fixed handling of unvisited operands in AxisInfoAnalysis
- 64-bit Atomic CAS (#8105): Fixed 64-bit `atomic_cas` operation
- Memdesc of Pointers (#8515): Fixed memdesc handling for pointer types
- Alloc Shape Reset (#8537): Reset alloc_shape when doing memdesc_index
- Denorm Flushing (#8557): Don't flush denorms for precise div/sqrt
- Local Load Reordering (#8423): Prevented reordering local_load across side-effecting operations
- Pattern Reordering (#8266): Restricted pattern re-ordering of alloc and reshape
- Poison Op AxisInfo (#8489): Fixed AxisInfo handling of PoisonOp producing MemDesc
Analysis Improvements
- Trans Contiguity (#8226): Added tt.trans contiguity analysis support
- Hint Analysis (#5254): Fixed hint analysis in axis info
- Topological Sort Deprecation (#8596): Deprecated triton's custom topological sort in favor of MLIR's
AMD/HIP Backend
GFX1250 Support
- Initial Skeleton (#8131): Added gfx1250 skeleton support
- WMMA Support (#8174, #8283, #8312): Added initial and scaled WMMA support for gfx1250
- TDM Support (#8333, #8392, #8479): Added Tensor Data Movement (TDM) load/store support
- Async Copy (#8509, #8510, #8621, #8622): Added async copy and async wait support
- Buffer Ops (#8130, #8532): Enabled buffer atomics and exposed buffer ops
- Multicast Loads (#8719, #8759): Added async load to LDS multicast and multicast in `tt.LoadOp`
- ds_read_tr (#8461): Added gfx1250 support for ds_read_tr
- LDS Memory Barriers (#8681): Added support for LDS memory barriers
- Shared Memory Size (#8517): Updated shared memory size from TargetInfo
- num_cta > 1 (#8718): Support launches with num_cta > 1 on gfx1250
- Scale Preshuffling (#8576): Implemented scale preshuffling and opSel
MXFP & Scaled Dot
- Scale Preshuffling in Decomposed Dot (#8170): Support scale preshuffling in decomposed scaled dot
- Pipeline Scale via LDS (#8258): Pipeline scale in decomposed scaled dot via LDS
- Scaled Upcast Ops (#8088): Introduced scaled upcast ops for hardware upcasting
- FP4->BF16 Optimized Conversion (#8145): Added optimized fp4->bf16 conversion for MI300
- Scaled Dot Decomposition for GFX950 (#7839): Enabled f16 * mxfp scaled dot decomposition
Layout & Memory Optimizations
- Permlane Swap (#7947): Use permlane_swap for layout conversions between dot operations
- Padded Shared with AsyncCopy (#8365): Use PaddedLayout with AsyncCopy on gfx950 when pipelining
- LDS Layout Selection Redesign (#8053): Redesigned stream pipeliner LDS layout selection logic
- Padded Encoding Restrictions (#8583): Relaxed padded encoding block size restrictions
- Direct-to-LDS with Padded (#8185): Coalesce direct-to-lds loads with padded encodings
- Contiguity Hint for Direct-to-LDS (#8761): Use contiguity hint for direct-to-lds ops
- BypassLDS Feature (#7968): Added bypassLDS feature to StreamPipeline
Code Generation
- ds_read_tr with Linear Layout (#8235): Use linear layout to infer and emit ds_read_tr
- ds_read_tr Restrictions Lifted (#8442): Lift unneeded ds_read_tr lowering restrictions
- ds_read_tr Vec Size Limit (#8377): Limit vec size for ds_read_tr + padded layouts by min interval
- Wave ID Optimization (#8601): Optimized gfx9 wave id code generation
- MFMA Layout Refactor (#8213): Refactored MFMA layout implementation
- MFMA Select Replacement (#8320): Replaced mfma select in LLVM conversion
- FP8/BF8 WMMA Instruction Selection (#8649): Fixed instruction selection for fp8/bf8 wmma
- Chained WMMA Optimization (#7374): Optimized chained multiplications for WMMA
- BF16 v_dot (#8444): Use v_dot for bf16 multiplication on gfx11/gfx12
Build & Driver
- ROCm 7 Docker Image (#8224): Switched to use official ROCm 7 docker image
- HIP v6 Requirement (#8748): Require only HIP v6, which is all that is necessary
- HIP Header Update (#8709): Updated HIP header files to 7.1
- Optional Symbols Support (#8729): Support optional symbols in driver.py
- Uniform Workgroup Size (#8720): Indicate uniform workgroup size to LLVM
- MIR Dump Option (#8663): Added option to dump MIR
- Custom LLVM Scheduler (#8326, #8700): Added schedule hint for custom LLVM scheduler
Bug Fixes
- Pointer Canonicalization (#8465, #8276): Fixed ptr-canonicalization segfault and assertion
- Large Tensor Pointer Canonicalization (#8359): Disabled pointer-canonicalization for large tensors
- Padded Shared Local Load (#8683): Fixed padded shared when lowering local load
- Nondeterministic Atomic Tests (#8633): Fixed nondeterministic atomic tests failure on RDNA
- Buffer Cache Swizzling (#8264): Turned off buffer op cache swizzling temporarily
- Direct-to-LDS on CDNA1/2 (#8280): Disabled direct-to-lds loads on CDNA1 and CDNA2
- Floating-point Upcasting Rounding (#8268): Skip rounding mode for floating-point upcasting
- TilesPerWarp Boundary Cases (#8467): Fixed deduceTilesPerWarp boundary cases
- fast_tanhf Overflow (#8551): Reimplemented fast_tanhf() to avoid overflow
- MFMA Small K Selection (#8278): Avoid selecting MFMA with smaller K than problem size
NVIDIA Backend
Blackwell Features
Triton 3.5.1 release, bug fix release
This release fixes the following issue:
- Fix sm103 (GB300) support broken by the Triton 3.5.0 release (#8045)
Triton 3.5.0 release
Triton Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
Dialect & Frontend
New Features
- Warp Specialization Enhancements (#8005): Made warp specialization require at least 4 warps with proper error messaging to prevent compiler crashes
- Ragged TMA Support (#7792, #7783): Added support for write-only and general ragged TMAs with automatic bounds checking using higher-dimensional TMA descriptors
- Device Assert Mask Support (#7905): Added `mask` parameter to `tl.device_assert` for easier debugging with masked operations (see the sketch after this list)
- Padding Option for TMA Loads (#7993): Added support for padding option (including NaN) in TMA descriptor creation and fallback paths
- Implicit Downcast in TMA Descriptor Store (#6236): Fixed missing implicit downcast when storing blocks through TMA descriptors
- Mutations Disallowed (#7762): Disabled all mutations to address semantic issues in the language
- Specialized Recursion (#7468): Enabled functions to recurse on specialized versions of themselves
- Constexpr Function Cache Invalidation (#7802): Reworked `constexpr_function` to support cache invalidation and capability checks
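A minimal sketch of the `mask` parameter on `tl.device_assert`, assuming the keyword form described above; the kernel body and names are illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def checked_copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    # Out-of-bounds lanes are excluded from the assert via the mask (#7905);
    # device-side asserts fire only with debug enabled (e.g. TRITON_DEBUG=1).
    tl.device_assert(x >= 0, "negative input", mask=mask)
    tl.store(y_ptr + offs, x, mask=mask)

n = 100
x = torch.rand(n, device="cuda")
y = torch.empty_like(x)
checked_copy_kernel[(triton.cdiv(n, 64),)](x, y, n, BLOCK=64)
```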
Bug Fixes
- Floating Point Argument Passing (#7439): Fixed floating point argument passing for `tl.float16` and other FP types
- Non-Associative Reduce Rematerialization (#7272): Avoided rematerialization for non-associative reduce operations to prevent data consistency issues
- PDL Issue Fix (#7379): Fixed PDL-related issues in the frontend
- Constexpr in Tuples (#7442): Improved handling of constexpr in tuples, fixing type mismatches and in-place mutations
- Loop Carry Detection (#7200): Improved detection of loop carries when `@builtin` or `@core.extern` functions modify their arguments
- Liveouts in Conditionals (#7318): Fixed detection of liveouts in conditional blocks
Improvements
- MLIR Verifier After Parsing (#7999): Run MLIR verifier after parsing to catch errors early
- Better Error for num_cta > 1 on sm < 90 (#7812): Improved error messaging for unsupported configurations
- Extern Elementwise Type Handling (#7930): Fixed mismatched type handling for `core.extern_elementwise`
- Libdevice Exposure in Gluon (#7890): Exposed libdevice functions with improved layout propagation
Backend & Compiler
LLVM Updates
- LLVM Bump (#7881): Updated to llvm/llvm-project@bc773632355b with multiple API changes including:
  - Switched `Constant{Int|Float}Op` type and value order
  - Provided triple for `TargetLibraryInfoImpl`
  - Fixed atomic sync scope for NVIDIA
  - Updated MLIR lib names and ops
Code Generation
- Generic Swizzling for convert_layout (#6982, #7565): Implemented generalized swizzling algorithm for `convert_layout` that:
  - Finds optimal shared memory layout maximizing read/write vectorization
  - Minimizes bank conflicts
  - Supports `ldmatrix`/`stmatrix` and transpose versions
  - Uses columns and diagonals for better performance
- Warp-Local Layout Conversion (#7558): Improved warp-local layout conversion algorithm using shuffles with:
  - Better handling of broadcasting in layouts
  - Fewer `select` and `shuffle` instructions
  - Register packing for sub-32-bit data types
- Byte Permutes in Intra-Warp Conversion (#7809): Used byte permute instructions for better performance in layout conversions
- Tmem Alloc Hoisting (#7568): Hoisted tmem alloc outside of if statements to reduce register pressure
- CP.Async Lowering Improvements (#7314): Moved cp.async to better lowering sequence reusing previous optimizations
Optimizations
- Simpler Codegen for Linear Layouts (#7201): Simplified code generation for linear layouts
- Vectorization Fixes (#7845): Fixed vectorization for `PaddedSharedEncoding` with non-default order
- XOR Trick Refactoring (#7397): Refactored XOR trick into helper function for better code reuse
- Shared Memory Offset Fixes (#7949): Fixed various issues with smem base offsets
- Min/Max Redux Optimization for Blackwell (#7465): Implemented new redux.sync optimization
Bug Fixes
- Atomic RMW Broadcasting (#7460): Fixed atomic rmw ops to broadcast results when necessary
- TMA Load with Multiple Users (#7398): Fixed lowering of TMA load when users have differing encodings
- Subview Padding (#7404): Fixed subview padding for PaddedSharedEncoding
- Memdesc Subview Fixes (#7480, #7515): Properly handled memdesc_subview with slicing and offsets
- FP16 to FP32 Conversion (#7585): Fixed fp16 to fp32 conversion issues
- Barrier Synchronization (#7993): Added bar.sync before deallocating tmem to prevent race conditions
AMD/HIP Backend
New Features
- GFX950 (MI350) Support: Added comprehensive support for AMD's latest architecture including:
  - ChainedDot Schedule (#7601, #7638): Added new scheduling variant for loops with 2 chained dots
  - Ping-Pong Transformation (#7638, #7458): Added ping-pong support for:
    - Chained dot schedules
    - Async load with num_stages=3
    - MXFP types
- Buffer Atomic CAS (#7292): Added support for buffer atomic compare-and-swap
- FP64 MFMA Support (#7461): Added support for fp64 dot operations using MFMA intrinsics
Layout & Memory Optimizations
- General Swizzling Support (#7482, #7606): Enabled ConvertLayoutOp general swizzling
- Padded vs Swizzled Allocation (#7328, #7750): Introduced specialized allocation pass with proper layout selection strategy
- Improved LDS Usage (#7750, #7813): Optimized LDS usage by:
  - Preferring swizzle layouts when LDS limits allow
  - Using single LDS for both transposed and non-transposed access
  - Better layout selection in optimize-lds-usage pass
- TilesPerWarp Parameter (#7283): Added tilesPerWarp parameter to MFMA layout for contiguous tile computation
- Extract Slice Rewrite (#7128): Refactored extract_slice to support:
  - Arbitrary tensor ranks
  - Relaxed layout constraints
  - CTA tile boundary alignment
Code Generation Improvements
- PermlaneSwap Pattern (#7825, #7861): Added general permlane_swap pattern for ConvertLayoutOp
- Register Broadcast (#7407): Added support for register broadcast in slice/concat ops
- Shared Memory Ops for FP4 (#7626): Added support for M/N packed FP4 with transposition
- Direct-to-LDS Loads (#7829): Refactored lowering via common `lowerLdSt` path
- Local Load/Store Lowering (#7355): Enabled common code path for local_load/store operations
FP8 & Numeric Support
- FP8 Variant Support:
- Dot Scaled Support: Enabled on gfx11 (#7954) and gfx12 (#7644) with emulation via decomposition (see the sketch after this list)
- True16 Handling: Disabled on gfx11 due to test failures (#7953)
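The decomposition path above is driven by the ordinary `tl.dot_scaled` frontend. Here is a minimal sketch, assuming the `(lhs, lhs_scale, lhs_format, rhs, rhs_scale, rhs_format)` argument order and OCP encodings; the constants 0x38 (1.0 in e4m3) and 127 (2^0 in e8m0), the shapes, and the kernel name are all illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def mx_dot_kernel(out_ptr, M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    a = tl.full((M, K), 0x38, tl.uint8)            # e4m3 bit pattern for 1.0
    a_scale = tl.full((M, K // 32), 127, tl.uint8) # one e8m0 scale per 32 elems
    b = tl.full((K, N), 1.0, tl.bfloat16)
    c = tl.dot_scaled(a, a_scale, "e4m3", b, None, "bf16")  # (M, N) in fp32
    offs = tl.arange(0, M)[:, None] * N + tl.arange(0, N)[None, :]
    tl.store(out_ptr + offs, c)

M, N, K = 32, 32, 64
out = torch.empty((M, N), dtype=torch.float32, device="cuda")
mx_dot_kernel[(1,)](out, M=M, N=N, K=K)
```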
Stream Pipeliner Enhancements
- Refactoring (#7526, #7556): Refactored to use more common pipeliner functionality
- Async Wait Handling (#7577): Restricted merging async_wait when pipelining with num_stages=3
- Mask Operation Support (#7620): Added ttg.mask handling in stream pipeliner
Build & Driver
- LLD Library API (#7548): Replaced shell-out to lld with direct library API calls
- hipGetProcAddress (#7350): Switched to using hipGetProcAddress for querying HIP symbols
- Driver Version Check (#7501): Added runtime driver version check with descriptive errors
- AOT Compilation (#7007): Added HIP AOT compilation support to compile.py tool
Bug Fixes
- Pointer Canonicalizer (#7242): Fixed attribute propagation when ranks don't match
- Global Atomic Optimization (#7496): Optimized global atomic operations following memory model semantics
- FP32/FP16 to OCP FP8 (#7382): Fixed conversion for subnormal numbers
- Async Copy Vectorization (#7250): Fixed async load pipeline for less than 32-bit loads
- OptimizeLDSUtility Crash (#7434): Fixed nullptr crash in createTmpLayout
- Memrealtime on GFX11/12 (#7357): Added proper support using s_sendmsg_rtn_b64
NVIDIA Backend
Hopper/Blackwell Features
- Warp Specialization:
  - Enabled for persistent matmul and FA (#7642, #7623)
  - Assigned final try_wait to partition (#7757)
  - Tightened user critical section with accumulator (#7509)
  - Fixed rematerialization bug in partitioner (#7427)
  - Optimized partitioning by hoisting above broadcasts (#7692)
  - Enabled 1 buffer for SSA partition dependencies (#7686)
  - Control flow support in TMEM allocation (#7698)
- WGMMA Support in Gluon (#7300, #7313): Added Hopper WGMMA with async wait support
- Aref Operations (#7479, #7561, #7645): Updated aref ops and lower_aref pass with:
  - Multi-consumer support
  - Stage/cluster attribute passing
  - TMA load aref insertion
  - Control flow handling
- Partition Loops Rewrite (#7415): Reimplemented supporting general control flow using mutual recursion
Blackwell-Specific
- TMEM Support:
  - Fixed codegen for Nx1xf32 (#7234)
  - Fixed tmem_su...
Triton 3.4.0 Release
Highlights
Gluon Framework Comprehensive Enhancement
The Gluon framework has received major enhancements across all areas including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)
Hardware Support Expansion
- AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architectural-specific features (#7175, #7171, #7127, #6744, #6594)
- Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
- Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)
Performance Optimizations
- Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
- MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
- TMA Operations Enhancement - Improved tensor memory access with better layout support and reduced register pressure (#6725, #6238, #6580)
New Features
Language and Frontend
- Aggregate Type Support - Added `@tl.aggregate` decorator for autogenerating Triton types from Python classes (#6970)
- JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
- Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
- Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
- Masked Histograms - Added support for masked histogram operations (#6695)
- Syntactic Sugar Additions - Added `.item()` as syntactic sugar for `.reshape([])` (#6873) (see the sketch after this list)
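A minimal sketch of the `.item()` sugar, which turns a one-element tensor into a scalar exactly as `.reshape([])` does; the kernel body and names are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_by_flag_kernel(x_ptr, flag_ptr, BLOCK: tl.constexpr):
    # flag loads as a shape-(1,) tensor; .item() == .reshape([]) makes it scalar
    flag = tl.load(flag_ptr + tl.arange(0, 1)).item()
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    tl.store(x_ptr + offs, tl.where(flag > 0, x, -x))

x = torch.randn(64, device="cuda")
flag = torch.ones(1, dtype=torch.int32, device="cuda")
scale_by_flag_kernel[(1,)](x, flag, BLOCK=64)
```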
Backend and Compilation
- Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
- Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
- TMA Reduce Operations - Added TMA reduce operations for descriptor-based reducing stores (#6580)
- Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
- BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519) (see the sketch after this list)
- Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
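A minimal sketch of the BF16 atomic add (#6519), on hardware that supports bf16 atomics; the accumulation pattern and names are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def bf16_accumulate_kernel(acc_ptr, x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    # Every program atomically adds its bf16 tile into the same buffer.
    tl.atomic_add(acc_ptr + tl.arange(0, BLOCK), x, mask=mask)

n = 4096
x = torch.randn(n, dtype=torch.bfloat16, device="cuda")
acc = torch.zeros(128, dtype=torch.bfloat16, device="cuda")
bf16_accumulate_kernel[(triton.cdiv(n, 128),)](acc, x, n, BLOCK=128)
```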
Hardware-Specific Features
- AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
- AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
- AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
- NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
- AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)
Improvements
Performance
- Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
- Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
- SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
- Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)
Developer Experience
- Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
- Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
- Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
- Build System Improvements - Better CMake support and dependency management ([#6330](https://github.com/tri...