@alefimov-amd
Contributor

This PR introduces an AMD-specific ttg->llvm pattern that uses v_perm instructions instead of combinations of shifts and logical operations.

Limitations of this pattern:

- Applies only to 8-bit data types;
- The conversion must be bijective;
- No permutation across threads in a workgroup.

@alefimov-amd alefimov-amd force-pushed the convert_layout_with_v_perm branch from d67fc4d to 72304fa Compare December 16, 2025 21:07
Contributor

@lezcano lezcano left a comment

@antiagainst happy to review this one if you guys want, but I'd need a bit of context on the semantics of the instruction.

if (srcTy.getElementType().getIntOrFloatBitWidth() != 8)
  return failure();
// TODO: broadcasting is not supported at the moment.
if (!conversion.isInvertible())
Contributor

You probably support warp / cta broadcasting tho, right?

Contributor Author

If it was present in the input layout, yes.

This pattern improves cases where no inter-warp or inter-thread communication happens, only permutations between registers. In such cases "conversion" contains only the "register" input dim, and we do not care whether there is broadcasting across warps/lanes.

Contributor

ah, right, I see. Then I would assume this would be better suited as an optimisation in LLVM, as we do tons of register packing and unpacking all across triton and we rely on LLVM / ptxas (in nvidia) to optimise it

Contributor Author

Yes, ideally LLVM should be able to combine everything in an optimal way, but in practice it does not always do so. I've experimented and found that in some simple cases it succeeds, in other cases it combines permutations only partially, and sometimes LLVM falls back to a series of bit operations, which requires 3x more instructions than we expect with the "optimal" approach.

For more details about the "optimal pattern", see the message below.

Collaborator

Right, there are a lot of loosely coupled patterns and passes in LLVM, which can make it hard to sustain such a lowering flow end-to-end in all cases. So having a dedicated pattern to make sure we always emit the optimal code sounds good to me.

@FrederickVu will also take a look at this given he had some thoughts in #7809.

Contributor

My point here is that sure, this is one of the cases where this pattern could be used, and if it's really useful for some real cases, it makes sense to add it. Now, more generally, we do this whole "unpack, reorder, pack again" pattern in quite a few places in Triton, and we just expect LLVM to generate nice code for us. This is why I was hinting at potentially having a look at improving the instcombine heuristics.

Will review the PR tomorrow tho.

Collaborator

Yup; actively looking at improving LLVM alongside too. :)

@alefimov-amd
Contributor Author

alefimov-amd commented Dec 22, 2025

Details about this change for reviewers

This pattern aims to optimally convert an intra-thread ttg.convert_layout operation on a tensor with 8-bit elements (fp8, int8). For example, from
linearLayout(registers=[[0, 1], [0, 2], [1, 0], [2, 0]], lanes=[[0, 4], [0, 8], [4, 0], [8, 0], [0, 0]], warps = [])
to
linearLayout(registers=[[1, 0], [2, 0], [0, 1], [0, 2]], lanes=[[0, 4], [0, 8], [4, 0], [8, 0], [0, 0]], warps = [])
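
To see what this conversion does inside a thread, the register bases of the two layouts can be compared directly. The sketch below is a simplified model, not Triton's LinearLayout API: it assumes that a register index maps to a tensor coordinate by XOR-ing the basis vectors of its set bits, and that each register index addresses one 8-bit element slot. The helper names `apply_bases` and `register_permutation` are made up for illustration.

```python
def apply_bases(bases, index):
    """Map an index to a coordinate by XOR-ing the bases of its set bits."""
    coord = [0] * len(bases[0])
    for bit, basis in enumerate(bases):
        if (index >> bit) & 1:
            coord = [c ^ b for c, b in zip(coord, basis)]
    return tuple(coord)


def register_permutation(src_bases, dst_bases):
    """For every destination slot, find the source slot holding the same element."""
    num_slots = 1 << len(src_bases)
    # Assumption: the register bases are bijective, so this dict has no collisions.
    src_slot_of_coord = {apply_bases(src_bases, r): r for r in range(num_slots)}
    return [src_slot_of_coord[apply_bases(dst_bases, r)] for r in range(num_slots)]


# Register bases copied from the two layouts above.
src_regs = [[0, 1], [0, 2], [1, 0], [2, 0]]
dst_regs = [[1, 0], [2, 0], [0, 1], [0, 2]]

perm = register_permutation(src_regs, dst_regs)
print(perm)
```

Under these assumptions destination slot 1 reads source slot 4, slot 2 reads slot 8, and so on, which is a 4x4 transpose of the elements held by one thread. The lane bases are identical in both layouts, so no data moves between threads.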

V_PERM instruction

AMD GPUs (both CDNA and RDNA) have a v_perm instruction whose semantics are very similar to LLVM's shufflevector. It takes two 32-bit registers and a "mask" immediate value that selects bytes from these two inputs and concatenates them in the output register.

Let's take a look at an example: v_perm dst, src0, src1, mask.
First we concatenate src0 and src1 into an int64, and then use each byte of mask as an index into this concatenated integer.

Let's take src0 = 0x11 22 33 44, src1 = 0x55 66 77 88, mask = 0x01 02 04 05.
The concatenated int64 is

int64 | 0x11 22 33 44 55 66 77 88 
idx   |    7  6  5  4  3  2  1  0

Indexing with the mask bytes, we get dst = 0x77 66 44 33.
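
The byte selection described above can be reproduced with a small Python model. This is only a sketch of the behavior as explained in this comment: the function name `v_perm_model` is hypothetical, and selector values above 7 are rejected here instead of being modeled.

```python
def v_perm_model(src0, src1, mask):
    """Model of v_perm byte selection for selector values 0..7 only."""
    combined = (src0 << 32) | src1  # src0 is the high half, src1 the low half
    dst = 0
    for i in range(4):
        selector = (mask >> (8 * i)) & 0xFF  # byte i of the mask
        if selector > 7:
            raise ValueError("selector values above 7 are not modeled")
        selected = (combined >> (8 * selector)) & 0xFF
        dst |= selected << (8 * i)  # becomes byte i of the result
    return dst


dst = v_perm_model(0x11223344, 0x55667788, 0x01020405)
print(hex(dst))  # 0x77664433
```

A mask of 0x03020100 returns src1 unchanged and 0x07060504 returns src0 unchanged, which is a quick sanity check for the indexing convention.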

Fast way to shuffle data

v_perm is very efficient when we need to shuffle bytes in registers, for example for the in_thread_transpose optimization.

Let's try to transpose a 4x4xfp8 tensor. Each register contains 4 packed elements, so in total we need only 4 registers:
image

We can do this with v_perm instructions in two steps: first we combine halves of the input registers into temporary values, and then we build the output values from pairs of these temporary values:
image

Each temporary value requires one v_perm instruction to construct, and each output register then requires one instruction to combine two temporary values, for a total of 8 v_perm instructions (4 temporaries + 4 outputs) to transpose a 4x4 tensor.
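
One possible way to realize the two steps is sketched below in Python. The masks and the pairing of registers are an assumption chosen for illustration and may differ from what the pattern in this PR emits. `v_perm_model` is a hypothetical stand-in for the instruction that handles selector values 0..7 only.

```python
def v_perm_model(src0, src1, mask):
    """Model of v_perm byte selection for selector values 0..7 only."""
    combined = (src0 << 32) | src1
    dst = 0
    for i in range(4):
        selector = (mask >> (8 * i)) & 0xFF
        assert selector < 8, "selector values above 7 are not modeled"
        dst |= ((combined >> (8 * selector)) & 0xFF) << (8 * i)
    return dst


# Step 1 masks: interleave the low (or high) halves of two registers.
INTERLEAVE_LO = 0x05010400
INTERLEAVE_HI = 0x07030602
# Step 2 masks: join the low (or high) halves of two temporaries.
PAIR_LO = 0x05040100
PAIR_HI = 0x07060302


def transpose_4x4_bytes(r0, r1, r2, r3):
    """Transpose a 4x4 byte tile held in four 32-bit registers with 8 v_perm calls."""
    # Step 1: four temporaries, one v_perm each.
    t0 = v_perm_model(r1, r0, INTERLEAVE_LO)
    t1 = v_perm_model(r1, r0, INTERLEAVE_HI)
    t2 = v_perm_model(r3, r2, INTERLEAVE_LO)
    t3 = v_perm_model(r3, r2, INTERLEAVE_HI)
    # Step 2: four outputs, one v_perm each.
    out0 = v_perm_model(t2, t0, PAIR_LO)
    out1 = v_perm_model(t2, t0, PAIR_HI)
    out2 = v_perm_model(t3, t1, PAIR_LO)
    out3 = v_perm_model(t3, t1, PAIR_HI)
    return out0, out1, out2, out3


# Byte j of row i is 0x<i><j>, so the transposed result is easy to read.
rows = (0x03020100, 0x13121110, 0x23222120, 0x33323130)
cols = transpose_4x4_bytes(*rows)
print([hex(c) for c in cols])
```

Applying the function to its own output returns the original rows, since transposing twice is the identity.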

If we do the same with bit operations, this requires 24 instructions: for each of the 4 output registers we need to extract bytes from each input register (3 instructions) and then combine them back (3 fused bitwise or+shift instructions), i.e. 4 × (3 + 3) = 24.
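
For contrast, the shift-and-mask formulation of the 4x4 byte transpose might look like the following sketch. Python operators do not map one-to-one onto GPU instructions, so this shows only the shape of the computation, not the instruction count; the function name is hypothetical.

```python
def transpose_4x4_bytes_bitops(rows):
    """Transpose a 4x4 byte tile using only shifts, masks and ORs."""
    outputs = []
    for j in range(4):
        out = 0
        for i, row in enumerate(rows):
            byte = (row >> (8 * j)) & 0xFF  # extract byte j of input register i
            out |= byte << (8 * i)          # place it at byte i of output register j
        outputs.append(out)
    return outputs


# Byte j of row i is 0x<i><j>.
rows = [0x03020100, 0x13121110, 0x23222120, 0x33323130]
cols = transpose_4x4_bytes_bitops(rows)
print([hex(c) for c in cols])
```

Every output register touches all four inputs here, which is why the extract-and-combine work grows with the number of packed values.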

This difference becomes even worse if registers hold more values. We have a kernel that needs to shuffle 16x4xf8 tiles, which takes significant time.

@alefimov-amd alefimov-amd changed the title [WIP][AMD] Use v_perm instruction for convert_layout acceleration [AMD] Use v_perm instruction for convert_layout acceleration Dec 23, 2025
@alefimov-amd
Contributor Author

@lezcano This PR is ready for review, while we are looking for ways to do this in LLVM.

@alefimov-amd alefimov-amd marked this pull request as ready for review December 23, 2025 14:33
Contributor

@lezcano lezcano left a comment

Sadly, I didn't get to review this one, and I'm starting holidays tomorrow for a few weeks, so feel free to review internally.

@alefimov-amd
Contributor Author

alefimov-amd commented Dec 24, 2025

@lezcano
Have a great holiday!

- Assertion in transferWithVPerm that regBytes | numValues: this would fail if the layout conversion only has two 8-bit elements per register. matchAndRewrite doesn't have a check to bail out in this case.
  - I could not construct an in-thread layout conversion that requires a permutation in this case, so I just added a test covering the simple 2-elements-per-thread case.
- Small typos like "indeces" instead of "indices" and "mergable" instead of "mergeable".
  - Fixed the typos and reworked the comment describing the 4-way algorithm.
- Redundant checks, like in processOneWayDependencies: needBytePermute should always evaluate to true, since otherwise kRegister would not be present in the output of minimalCvtLayout(srcTy, dstTy).
  - These checks are redundant for now, because we do not permit swizzling in registers. I've added them for generalization in case we permit such layouts. Consider the following example: [[0, 1], [0, 2], [1, 1]]
- Broadcasting can be handled as in the general layout conversion pathways. https://github.com/triton-lang/triton/blob/0cd582fe4645a146bd7c140806ecaae334fd676b/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L149
  - The approach from the general layout conversion is not always applicable here, because broadcasting can happen inside one register. Applying the layout transformation will probably lead to the generation of the bit operations this conversion pattern aims to avoid.
    I've tried to implement this generalization, but it makes the code even more complex, and I did not see live examples, so I will leave this for the future if we need it.
Collaborator

@antiagainst antiagainst left a comment

Largely looks good to me; just some minor comments.

@FrederickVu is looking at doing the optimization in LLVM proper, but it will take some time for that to fully land and be incorporated. In the meantime we can have this to improve perf. @lezcano would you mind taking another look?

Contributor

@lezcano lezcano left a comment

I didn't go carefully through the maths, but overall looks reasonable to me.

@antiagainst antiagainst merged commit 71e3c11 into triton-lang:main Jan 30, 2026
9 checks passed
alefimov-amd added a commit to ROCm/triton that referenced this pull request Feb 4, 2026
…riton-lang#9014)

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
alefimov-amd added a commit to ROCm/triton that referenced this pull request Feb 4, 2026
…lang#9014)
