[AMD] Use v_perm instruction for convert_layout acceleration #9014

alefimov-amd · 2025-12-16T21:07:33Z

This PR introduces an AMD-specific ttg->llvm pattern that uses v_perm instructions instead of combinations of shifts and logical operations.

Limitations of this pattern:

- Applied only to 8-bit data types;
- The conversion must be bijective;
- No permutation across threads in a workgroup.

lezcano

@antiagainst happy to review this one if you guys want, but I'd need a bit of context on the semantics of the instruction.

lezcano · 2025-12-17T11:01:16Z

third_party/amd/lib/TritonAMDGPUToLLVM/ConvertLayoutOpToLLVM.cpp

+    if (srcTy.getElementType().getIntOrFloatBitWidth() != 8)
+      return failure();
+    // TODO: broadcasting is not supported at the moment.
+    if (!conversion.isInvertible())


You probably support warp / cta broadcasting tho, right?

If it was present in the input layout, yes.

This pattern improves cases where no inter-warp/thread communication happens, only permutations between registers. In such cases "conversion" contains only the "register" input dim, and we do not care whether there is broadcasting in warps/lanes.

ah, right, I see. Then I would assume this would be better suited as an optimisation in LLVM, as we do tons of register packing and unpacking all across triton and we rely on LLVM / ptxas (in nvidia) to optimise it

Yes, ideally LLVM should be able to combine everything in an optimal way, but in practice it does not always do so. I've experimented and found that in some simple cases it succeeds, in other cases it combines permutations only partially, and sometimes LLVM falls back to a series of bit operations, which requires 3x more instructions than we expect with the "optimal" approach.

For more details about the "optimal pattern", see the message below.

Right, there are a lot of loosely coupled patterns and passes in LLVM, which can make it hard to sustain such a lowering flow end-to-end consistently. So having a dedicated pattern to make sure we always emit the optimal code sounds good to me.

@FrederickVu will also take a look at this given he had some thoughts in #7809.

My point here is that sure, this is one of the cases where this pattern could be used, and if it's really useful for some real cases, it makes sense to add it. Now, more generally, this whole "unpack, reorder, pack again" pattern we do in quite a few places in triton and we just expect LLVM to generate nice code for us. This is why I was hinting at potentially having a look at improving the instcombine heuristics.

Will review the PR tomorrow tho.

Yup; actively looking at improving LLVM alongside too. :)

alefimov-amd · 2025-12-22T17:31:27Z

Details about this change for reviewers

This pattern aims to optimally lower intra-thread ttg.convert_layout operations on tensors with 8-bit elements (fp8, int8). For example, from
linearLayout(registers=[[0, 1], [0, 2], [1, 0], [2, 0]], lanes=[[0, 4], [0, 8], [4, 0], [8, 0], [0, 0]], warps = [])
to
linearLayout(registers=[[1, 0], [2, 0], [0, 1], [0, 2]], lanes=[[0, 4], [0, 8], [4, 0], [8, 0], [0, 0]], warps = [])
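To make the example concrete, here is a minimal Python sketch of the register permutation implied by these two layouts. It assumes the usual linear-layout reading, where the coordinate held by register r is the XOR of the basis vectors selected by the set bits of r. The helper names (`apply_bases`, `register_permutation`) are hypothetical and are not part of the PR.

```python
# Register bases copied from the two layouts above.
SRC_REGISTERS = [[0, 1], [0, 2], [1, 0], [2, 0]]
DST_REGISTERS = [[1, 0], [2, 0], [0, 1], [0, 2]]


def apply_bases(bases, index):
    """Map a register index to a tensor coordinate by XOR-ing the bases
    selected by the set bits of the index."""
    coord = [0] * len(bases[0])
    for bit, basis in enumerate(bases):
        if (index >> bit) & 1:
            coord = [c ^ b for c, b in zip(coord, basis)]
    return tuple(coord)


def register_permutation(src_bases, dst_bases):
    """For every destination register, return the source register that
    holds the same tensor coordinate."""
    num_regs = 1 << len(src_bases)
    src_of_coord = {apply_bases(src_bases, r): r for r in range(num_regs)}
    return [src_of_coord[apply_bases(dst_bases, r)] for r in range(num_regs)]


perm = register_permutation(SRC_REGISTERS, DST_REGISTERS)
print(perm)
```

Because the lane bases are identical and there are no warp bases, every element stays in the same thread: the conversion is a permutation of the 16 one-byte values each thread holds, which in this case is a 4x4 transpose.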

V_PERM instruction

AMD GPUs (both CDNA and RDNA) have a v_perm instruction, whose semantics are very similar to LLVM's shufflevector. It takes two 32-bit registers and a "mask" immediate value that selects bytes from these two inputs and concatenates them into the output register.

Let's look at an example: v_perm dst, src0, src1, mask.
First we concatenate src0 and src1 into an int64, and then use each byte of mask as an index into this concatenated integer.

Let's take src0 = 0x11 22 33 44, src1 = 0x55 66 77 88, mask = 0x01 02 04 05.
The concatenated int64 is

int64 | 0x11 22 33 44 55 66 77 88 
idx   |    7  6  5  4  3  2  1  0

Indexing with the mask bytes, we get dst = 0x77 66 44 33.
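The byte selection described above can be emulated in a few lines of Python. This is only a sketch of the behaviour in this comment: it handles selector values 0..7 and rejects anything else, and the function name `v_perm` is chosen here for illustration.

```python
def v_perm(src0: int, src1: int, mask: int) -> int:
    """Emulate the byte selection described above.

    src0 supplies bytes 4..7 and src1 supplies bytes 0..3 of the
    concatenated 64-bit value. Only selectors 0..7 are modelled here.
    """
    combined = ((src0 & 0xFFFFFFFF) << 32) | (src1 & 0xFFFFFFFF)
    dst = 0
    for k in range(4):
        selector = (mask >> (8 * k)) & 0xFF
        if selector > 7:
            raise ValueError(f"selector {selector} is not modelled in this sketch")
        byte = (combined >> (8 * selector)) & 0xFF
        dst |= byte << (8 * k)
    return dst


result = v_perm(0x11223344, 0x55667788, 0x01020405)
print(hex(result))  # 0x77664433
```

With this convention, mask 0x03020100 returns src1 unchanged and mask 0x07060504 returns src0 unchanged.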

Fast way to shuffle data

v_perm is very efficient when we need to shuffle bytes in registers, for example for the in_thread_transpose optimization.

Let's try to transpose a 4x4xfp8 tensor. Each register contains 4 packed elements, so we need only 4 registers in total.

We can do this with v_perm instructions in two steps: first we combine halves of the input registers into temporary values, and then we build the output values from pairs of these temporary values.

Each temporary value requires one v_perm instruction to construct, and each output register requires one instruction to combine two temporary values, for a total of 8 v_perm instructions to transpose a 4x4 tensor.
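The two-step scheme can be checked with a small Python sketch. It assumes that row i lives in register i with element j in byte j (byte 0 being the least significant), and the four mask constants are worked out here for that assumption; they are illustrative and not taken from the PR. `v_perm`, `pack`, `unpack` and `transpose_4x4` are hypothetical helper names.

```python
def v_perm(src0: int, src1: int, mask: int) -> int:
    """Byte select from the 64-bit value (src0 << 32) | src1, selectors 0..7."""
    combined = ((src0 & 0xFFFFFFFF) << 32) | (src1 & 0xFFFFFFFF)
    dst = 0
    for k in range(4):
        selector = (mask >> (8 * k)) & 0x07
        dst |= ((combined >> (8 * selector)) & 0xFF) << (8 * k)
    return dst


def pack(row):
    """Pack four 8-bit values into one 32-bit register, element j in byte j."""
    return sum((v & 0xFF) << (8 * j) for j, v in enumerate(row))


def unpack(reg):
    """Inverse of pack."""
    return [(reg >> (8 * j)) & 0xFF for j in range(4)]


# Masks for the assumed byte order (illustrative, not from the PR).
INTERLEAVE_LO = 0x05010400  # [lo.b0, hi.b0, lo.b1, hi.b1]
INTERLEAVE_HI = 0x07030602  # [lo.b2, hi.b2, lo.b3, hi.b3]
HALVES_LO = 0x05040100      # [lo.b0, lo.b1, hi.b0, hi.b1]
HALVES_HI = 0x07060302      # [lo.b2, lo.b3, hi.b2, hi.b3]


def transpose_4x4(r0, r1, r2, r3):
    """Transpose a 4x4 tile of bytes using 8 v_perm operations."""
    # Step 1: one v_perm per temporary, mixing halves of two input rows.
    t0 = v_perm(r1, r0, INTERLEAVE_LO)  # a00 a10 a01 a11
    t1 = v_perm(r1, r0, INTERLEAVE_HI)  # a02 a12 a03 a13
    t2 = v_perm(r3, r2, INTERLEAVE_LO)  # a20 a30 a21 a31
    t3 = v_perm(r3, r2, INTERLEAVE_HI)  # a22 a32 a23 a33
    # Step 2: one v_perm per output, combining two temporaries.
    o0 = v_perm(t2, t0, HALVES_LO)      # column 0
    o1 = v_perm(t2, t0, HALVES_HI)      # column 1
    o2 = v_perm(t3, t1, HALVES_LO)      # column 2
    o3 = v_perm(t3, t1, HALVES_HI)      # column 3
    return o0, o1, o2, o3


matrix = [[16 * i + j for j in range(4)] for i in range(4)]
outputs = transpose_4x4(*(pack(row) for row in matrix))
transposed = [unpack(reg) for reg in outputs]
print(transposed)
```

Each output register ends up holding one column of the input tile, which matches the count of 4 + 4 = 8 v_perm instructions given above.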

If we do the same with bit operations, this requires 24 instructions. For every output register we need to extract bytes from each input register (3 instructions) and then combine them back (3 fused bitwise or+shift instructions).

The gap grows as registers hold more values. We have a kernel that needs to shuffle 16x4xf8 tiles, and this takes significant time.

alefimov-amd · 2025-12-23T14:29:29Z

@lezcano This PR is ready for review, while we look for ways to do this in LLVM.

lezcano

Sadly, I didn't get to review this one, and I'm starting holidays tomorrow for a few weeks, so feel free to review internally.

alefimov-amd · 2025-12-24T16:49:57Z

@lezcano
Have a great holiday!

…_perm

- Assertion in transferWithVPerm that regBytes | numValues; this would fail if the layout conversion only has 2 8-bit elements per register. matchAndRewrite doesn't have a check to bail out in this case.
- I cannot construct an in-thread layout conversion that requires permutation, so I just added a test to cover the simple case of 2 elements per thread.
- Small typos like "indeces" instead of "indices" and "mergable" instead of "mergeable".
- Fixed typos, reworked the comment describing the 4-way algorithm.
- Redundant checks, like in processOneWayDependencies: needBytePermute should always evaluate to true, since otherwise kRegister would not be present in the output of minimalCvtLayout(srcTy, dstTy).
- This check is redundant for now, because we do not permit swizzling in registers. I've added them for generalization in case we permit such layouts. Consider the following example: [[0, 1], [0, 2], [1, 1]]
- Broadcasting can be handled as in the general layout conversion pathways. https://github.com/triton-lang/triton/blob/0cd582fe4645a146bd7c140806ecaae334fd676b/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L149
- The approach from convert layout is not always applicable here, because broadcasting can happen inside one register. Applying the layout transformation will probably lead to generation of the bit operations this conversion pattern aims to avoid. I've tried to implement this generalization, but it makes the code even more complex, and I have not seen live examples, so I will leave this for the future if we need it.

antiagainst

Largely looks good to me; just some minor comments.

@FrederickVu is looking at doing the optimization in LLVM proper. But it would take some time to fully land and be incorporated. In the meantime we can have this to improve perf. @lezcano would you mind taking another look?

third_party/amd/lib/TritonAMDGPUToLLVM/ConvertLayoutOpToLLVM.cpp

python/test/gluon/test_core.py

lezcano

I didn't go carefully through the maths, but overall looks reasonable to me.

…riton-lang#9014) This PR introduces AMD specific ttg->llvm pattern which uses v_perm instructions instead of combinations of shifts and logical operations. Limitations of this pattern: - Applied only for 8 bit data types; - Conversion required to be bijective; - No permutation across threads in workgroup. --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

alefimov-amd force-pushed the convert_layout_with_v_perm branch from d67fc4d to 72304fa Compare December 16, 2025 21:07

lezcano reviewed Dec 17, 2025

alefimov-amd changed the title ~~[WIP][AMD] Use v_perm instruction for convert_layout acceleration~~ [AMD] Use v_perm instruction for convert_layout acceleration Dec 23, 2025

Merge branch 'main' into convert_layout_with_v_perm

52a2df8

alefimov-amd requested review from antiagainst and lezcano December 23, 2025 14:21

alefimov-amd marked this pull request as ready for review December 23, 2025 14:33

alefimov-amd requested review from peterbell10, ptillet and zhanglx13 as code owners December 23, 2025 14:33

lezcano reviewed Dec 23, 2025

binarman added 2 commits January 22, 2026 16:59

Merge remote-tracking branch 'openai/main' into convert_layout_with_v…

0c27706

…_perm

alefimov-amd mentioned this pull request Jan 26, 2026

[AMD][ITT][GLUON] In thread transpose optimization #9304

Closed

binarman added a commit to binarman/triton that referenced this pull request Jan 28, 2026

v_perm related changes from PR triton-lang#9014

b921383

alefimov-amd mentioned this pull request Jan 28, 2026

Explicit itt for mla decode without rope ROCm/aiter#1924

Merged

antiagainst reviewed Jan 29, 2026

binarman and others added 2 commits January 29, 2026 15:51

review comments

38412e5

Merge branch 'main' into convert_layout_with_v_perm

33b74b6

lezcano approved these changes Jan 30, 2026

antiagainst approved these changes Jan 30, 2026

antiagainst merged commit 71e3c11 into triton-lang:main Jan 30, 2026
9 checks passed
