
Conversation

@niconunezz (Contributor)

PR Description

What changed

This PR optimizes the optimize-lds-usage pass by improving the layout selection strategy for convert-layout operations. The changes include:

  • Removed hardcoded preference for padded layouts in the AMD-specific LDS optimization pass
  • Enhanced estimateResourcesForReplacement method to calculate memory requirements for both padded and swizzled layouts
  • Modified layout selection logic to prefer swizzled layouts when LDS limits allow, falling back to padded layouts only when necessary

Why this change was needed

The optimize-lds-usage pass previously enforced padded layouts based on the assumption that "padded conversion seems more friendly with this optimization." While this approach reduced LDS consumption, it came with significant performance penalties:

  • Hundreds of additional LLVM IR lines generated for padded versions
  • Reduced vectorization

The New Approach:

  • Prioritizes swizzled layouts for better performance and vectorization
  • Only falls back to padded layouts when swizzled versions exceed LDS limits (see the sketch after this list)
  • Maintains the pass's core functionality of reducing shared memory consumption through intermediate buffering
  • Achieves better balance between LDS usage and execution performance
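
At a high level, the selection logic described above boils down to the sketch below. This is a minimal illustration in Python rather than the pass's actual C++ implementation; the helpers estimate_swizzled_lds and estimate_padded_lds and the lds_limit parameter are hypothetical stand-ins for the estimates produced by estimateResourcesForReplacement and the target's LDS budget.

# Minimal sketch of the new layout selection, assuming hypothetical helpers.
def choose_intermediate_layout(cvt_op, lds_limit,
                               estimate_swizzled_lds, estimate_padded_lds):
    # Prefer a swizzled intermediate layout when it fits within the LDS budget:
    # it vectorizes better and produces far less LLVM IR.
    if estimate_swizzled_lds(cvt_op) <= lds_limit:
        return "swizzled"
    # Otherwise fall back to a padded intermediate layout if that one fits;
    # it still reduces peak shared-memory usage.
    if estimate_padded_lds(cvt_op) <= lds_limit:
        return "padded"
    # Neither variant fits: leave the convert_layout op unchanged.
    return None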

@niconunezz
Copy link
Contributor Author

@antiagainst Hi! Just wanted to follow up on this PR. I'm happy to address any concerns or make changes if needed. Thanks!

@antiagainst (Collaborator) left a comment:

Thanks for the patch! I left some comments. @alefimov-amd, who touches this part frequently, is out of office now; I'd prefer to have him take a look too next week. BTW, do you have some statistics regarding the improvements?

@guacamoleo (Contributor)

Just a general question about swizzled vs padded layouts. It looks like the heuristic should generally choose the right option, but do we have any way for kernels to specify exactly which one they want so they can explore the LDS storage options?

@niconunezz (Contributor, Author)

@antiagainst Thanks for the suggestions. Regarding the benchmarks, I prepared the file below with several scenarios where this pass is required and compared the outputs of triton-opt %s -split-input-file -optimize-amd-lds-usage=target-arch=gfx90a --allocate-shared-memory --convert-triton-amdgpu-to-llvm=arch=gfx950. This leads to the following reductions in the number of lines: 15.21%, 0.0%, 16.9%, 0.0%, 19.78%, 27.49%, 28.02%, where each number corresponds to its respective function. Applying the same method to the existing optimize-lds-usage.mlir file yields reductions of up to 36.74%.

#blocked = #ttg.blocked<{sizePerThread = [4, 1], threadsPerWarp = [16, 4], warpsPerCTA = [2, 4], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [2, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----
#blocked = #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}
// -----
#blocked = #ttg.blocked<{sizePerThread = [2, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [4, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}

// -----

#blocked = #ttg.blocked<{sizePerThread = [8, 4], threadsPerWarp = [16, 4], warpsPerCTA = [4, 2], order = [0, 1]}>
#mma = #ttg.amd_mfma<{version = 2, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = false}>
#shared = #ttg.swizzled_shared<{vec = 4, perPhase = 1, maxPhase = 16, order = [0, 1]}>
#smem = #ttg.shared_memory
module attributes {"ttg.num-warps" = 8 : i32, "ttg.threads-per-warp" = 64 : i32} {
  tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) {
    %1 = ttg.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !ttg.memdesc<128x128xf16, #shared, #smem>
    %2 = ttg.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
    %3 = ttg.local_load %1 : !ttg.memdesc<128x128xf16, #shared, #smem> -> tensor<128x128xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 4}>>
    tt.return
  }
}
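
The percentages quoted above can be reproduced by running the same triton-opt pipeline on this file with a baseline and a patched build and comparing the size of the lowered output. The sketch below illustrates the idea; the binary paths and the file name are assumptions, and it compares whole files rather than the per-function counts reported above.

# Rough sketch of the line-count comparison; paths and file name are assumed.
import subprocess

PIPELINE = [
    "-split-input-file",
    "-optimize-amd-lds-usage=target-arch=gfx90a",
    "--allocate-shared-memory",
    "--convert-triton-amdgpu-to-llvm=arch=gfx950",
]

def output_lines(triton_opt, mlir_file):
    # Run the pipeline and count the lines of the lowered output.
    result = subprocess.run([triton_opt, mlir_file, *PIPELINE],
                            capture_output=True, text=True, check=True)
    return len(result.stdout.splitlines())

base = output_lines("baseline/triton-opt", "scenarios.mlir")
new = output_lines("patched/triton-opt", "scenarios.mlir")
print(f"line reduction: {100.0 * (base - new) / base:.2f}%")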

@niconunezz (Contributor, Author)

@guacamoleo If you're asking whether users have control over which layout is used, there currently isn't a way to specify this manually. The system defaults to swizzled layouts since they typically provide better performance and more efficient memory access patterns. It's worth noting that padded layouts have been phased out in NVIDIA's backend. For the AMD backend, padded layouts exist for LDS consumption reasons, which you can read more about in this PR.

@antiagainst (Collaborator)

> Just a general question about swizzled vs padded layouts. It looks like the heuristic should generally choose the right option, but do we have any way for kernels to specify exactly which one they want so they can explore the LDS storage options?

With Gluon we will have that ability: you can directly program shared memory, and thus the layout used therein.

@antiagainst enabled auto-merge (squash) August 15, 2025 00:27
@antiagainst merged commit 1049858 into triton-lang:main Aug 15, 2025
9 checks passed
nicolasvasilache pushed a commit to nicolasvasilache/triton that referenced this pull request Aug 19, 2025