
Conversation

@niconunezz (Contributor)

PR Description

What changed

This PR optimizes the optimize-lds-usage pass by improving the layout selection strategy for convert-layout operations. The changes include:

  • Removed hardcoded preference for padded layouts in the AMD-specific LDS optimization pass
  • Enhanced estimateResourcesForReplacement method to calculate memory requirements for both padded and swizzled layouts
  • Modified layout selection logic to prefer swizzled layouts when LDS limits allow, falling back to padded layouts only when necessary

Why this change was needed

The optimize-lds-usage pass previously enforced padded layouts based on the assumption that "padded conversion seems more friendly with this optimization." While this approach reduced LDS consumption, it came with significant performance penalties:

  • Hundreds of additional LLVM IR lines generated for padded versions
  • Reduced vectorization

The New Approach:

  • Prioritizes swizzled layouts for better performance and vectorization
  • Only falls back to padded layouts when swizzled versions exceed LDS limits (see the sketch after this list)
  • Maintains the pass's core functionality of reducing shared memory consumption through intermediate buffering
  • Achieves better balance between LDS usage and execution performance
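
At a high level, the selection logic described above boils down to the sketch below. This is a minimal illustration in Python rather than the pass's actual C++ implementation; the helpers estimate_swizzled_lds and estimate_padded_lds and the lds_limit parameter are hypothetical stand-ins for the estimates produced by estimateResourcesForReplacement and the target's LDS budget.

# Minimal sketch of the new layout selection, assuming hypothetical helpers.
def choose_intermediate_layout(cvt_op, lds_limit,
                               estimate_swizzled_lds, estimate_padded_lds):
    # Prefer a swizzled intermediate layout when it fits within the LDS budget:
    # it vectorizes better and produces far less LLVM IR.
    if estimate_swizzled_lds(cvt_op) <= lds_limit:
        return "swizzled"
    # Otherwise fall back to a padded intermediate layout if that one fits;
    # it still reduces peak shared-memory usage.
    if estimate_padded_lds(cvt_op) <= lds_limit:
        return "padded"
    # Neither variant fits: leave the convert_layout op unchanged.
    return None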

@niconunezz
Copy link
Contributor Author

@antiagainst Hi! Just wanted to follow up on this PR. I'm happy to address any concerns or make changes if needed. Thanks!

@antiagainst (Collaborator) left a comment:

Thanks for the patch! I left some comments. @alefimov-amd, who touches this part frequently, is out of office now; I'd prefer to have him take a look too next week. BTW, do you have some statistics regarding the improvements?

@guacamoleo (Contributor)

Just a general question about swizzled vs padded layouts. It looks like the heuristic should generally choose the right option, but do we have any way for kernels to specify exactly which one they want so they can explore the LDS storage options?

@niconunezz (Contributor, Author)

@antiagainst Thanks for the suggestions. Regarding the benchmarks, I prepared the file below with several scenarios where this pass is required and compared the outputs of triton-opt %s -split-input-file -optimize-amd-lds-usage=target-arch=gfx90a --allocate-shared-memory --convert-triton-amdgpu-to-llvm=arch=gfx950. This leads to the following reductions in the number of lines: 15.21%, 0.0%, 16.9%, 0.0%, 19.78%, 27.49%, 28.02%, where each number corresponds to its respective function. Applying the same method to the existing optimize-lds-usage.mlir file yields reductions of up to 36.74%.

#blocked = #ttg.blocked<{sizePerThread = [4, 1], threadsPerWarp = [16, 4], warpsPerCTA = [2, 4], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [2, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----
#blocked = #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}
// -----
#blocked = #ttg.blocked<{sizePerThread = [2, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [4, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [8, 4], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}
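
The percentages quoted above can be reproduced by running the same triton-opt pipeline on this file with a baseline and a patched build and comparing the size of the lowered output. The sketch below illustrates the idea; the binary paths and the file name are assumptions, and it compares whole files rather than the per-function counts reported above.

# Rough sketch of the line-count comparison; paths and file name are assumed.
import subprocess

PIPELINE = [
    "-split-input-file",
    "-optimize-amd-lds-usage=target-arch=gfx90a",
    "--allocate-shared-memory",
    "--convert-triton-amdgpu-to-llvm=arch=gfx950",
]

def output_lines(triton_opt, mlir_file):
    # Run the pipeline and count the lines of the lowered output.
    result = subprocess.run([triton_opt, mlir_file, *PIPELINE],
                            capture_output=True, text=True, check=True)
    return len(result.stdout.splitlines())

base = output_lines("baseline/triton-opt", "scenarios.mlir")
new = output_lines("patched/triton-opt", "scenarios.mlir")
print(f"line reduction: {100.0 * (base - new) / base:.2f}%")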

@niconunezz (Contributor, Author)

@guacamoleo If you're asking whether users have control over which layout is used, there currently isn't a way to specify this manually. The system defaults to swizzled layouts since they typically provide better performance and more efficient memory access patterns. It's worth noting that padded layouts have been phased out in NVIDIA's backend. For the AMD backend, padded layouts exist for LDS consumption reasons, which you can read more about in this PR.

@antiagainst (Collaborator)

> Just a general question about swizzled vs padded layouts. It looks like the heuristic should generally choose the right option, but do we have any way for kernels to specify exactly which one they want so they can explore the LDS storage options?

With Gluon we will have that ability: you can directly program shared memory, and thus the layout used therein.

@antiagainst enabled auto-merge (squash) August 15, 2025 00:27
@antiagainst merged commit 1049858 into triton-lang:main Aug 15, 2025
9 checks passed
nicolasvasilache pushed a commit to nicolasvasilache/triton that referenced this pull request Aug 19, 2025