
Conversation

@lezcano lezcano commented May 29, 2025

We implement a generic swizzling algorithm by @apgoucher that, given two linear layouts, finds the optimal shared memory layout: it maximises read/write vectorisation and, subject to that, minimises bank conflicts.

We also implement an algorithm to find the minimum tile size necessary to perform the convert_layout given the restrictions above, and we use it to perform the convert_layout iteratively.

This PR does not yet implement a lowering to ldmatrix/stmatrix, we'll do that in a future PR.
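For readers unfamiliar with swizzling, here is a minimal illustrative sketch (a generic textbook xor-swizzle, not the layout this PR's algorithm computes) of why xor-ing the column index avoids bank conflicts while keeping the mapping a bijection:

#include <cstdio>

// Illustrative only: a fixed xor-swizzle, not the optimal layout computed by
// this PR. Assumes one 4-byte element per 32-bit shared-memory bank.
constexpr int kCols = 32;

int swizzledOffset(int row, int col) {
  // Xor-ing with a per-row value permutes the columns within the row, so the
  // tile keeps the same shared-memory footprint.
  return row * kCols + (col ^ (row % kCols));
}

int main() {
  // Reading column 0 of 32 consecutive rows now touches 32 distinct banks; the
  // unswizzled offset row * kCols would hit bank 0 for every row.
  for (int row = 0; row < kCols; ++row)
    std::printf("row %2d -> bank %2d\n", row, swizzledOffset(row, 0) % 32);
  return 0;
}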

@lezcano lezcano requested review from Jokeren and ptillet as code owners May 29, 2025 16:29
@lezcano lezcano marked this pull request as draft May 29, 2025 16:30
@lezcano lezcano changed the title [BACKEND][DNR] Implement generic swizzling when lowering convert_layout [BACKEND][WIP] Implement generic swizzling when lowering convert_layout May 29, 2025
@lezcano lezcano closed this May 29, 2025
@lezcano lezcano reopened this May 29, 2025
@lezcano lezcano marked this pull request as ready for review May 29, 2025 21:38
@lezcano lezcano changed the title [BACKEND][WIP] Implement generic swizzling when lowering convert_layout [BACKEND] Implement generic swizzling when lowering convert_layout May 29, 2025
@lezcano lezcano requested a review from ThomasRaoux May 29, 2025 21:38

lezcano commented May 29, 2025

I'll run benchmarks and do a couple of minor clean-ups tomorrow. I will also add a couple of lit tests, although there is already one for the fp8 transpose which shows that we can indeed vectorise it.

// Shared memory is available after a tensor's liveness range ends
// expected-remark @below {{reusable}}
// expected-remark @below {{size = 4608}}
// expected-remark @below {{size = 8192}}
Contributor

Seems like shared memory usage has increased a lot here.

Contributor Author

These often come from being able to vectorise more than before (and, as such, not being able to do as many reps).

Contributor

So you haven't seen any cases in internal benchmarks that either slow down or run out of shared memory?

Contributor Author

I'm tracking some regressions, but no, I have not seen any OOM

Contributor

Alright, should I approve the PR now or after you've finished debugging?

Contributor Author

Either works; I have to figure out what's going on in those regressions before landing either way.

auto logBankConflicts = std::min<int32_t>(
std::max<int32_t>(0, lenSegment - A.size() - segment.size()), A.size());
// Conflict-free
for (int i = logBankConflicts; i < A.size(); ++i)
Contributor

This ^ operator here isn't clear to me, but we can chat offline

Contributor Author

@lezcano lezcano Jun 1, 2025

This part is covered in the explanation of the algorithm in the paper, but yes, I agree it is quite a tricky part.

Contributor

Oh, I remember now, this is the union of the two subspaces.

Contributor Author

yep! This code finds the largest subspace that's not in the union of the two given ones.
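For intuition, a minimal standalone sketch (my reading of the idea, not the PR's implementation) over F_2: encode basis vectors as bit masks, Gaussian-eliminate the union of the two bases, and return the unit vectors whose coordinates never became pivots; they span the largest subspace that meets the span of the two inputs only at zero.

#include <cstdint>
#include <vector>

// Sketch only: vectors of F_2^n are encoded as bit masks.
std::vector<uint32_t> complementBasis(const std::vector<uint32_t> &A,
                                      const std::vector<uint32_t> &B, int n) {
  // pivot[i] holds the basis vector of the combined span whose leading bit is i.
  std::vector<uint32_t> pivot(n, 0);
  auto insert = [&](uint32_t v) {
    for (int i = n - 1; i >= 0; --i) {
      if (!((v >> i) & 1))
        continue;
      if (!pivot[i]) {
        pivot[i] = v;
        return;
      }
      v ^= pivot[i]; // eliminate bit i and keep reducing
    }
  };
  for (uint32_t v : A)
    insert(v);
  for (uint32_t v : B)
    insert(v);
  // Unit vectors at non-pivot coordinates are independent of every pivot, so
  // together they span a complement of the combined span inside F_2^n.
  std::vector<uint32_t> comp;
  for (int i = 0; i < n; ++i)
    if (!pivot[i])
      comp.push_back(1u << i);
  return comp;
}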

@lezcano lezcano force-pushed the cvt_ldst branch 3 times, most recently from b278e9f to bed6ee3 Compare June 5, 2025 10:24

lezcano commented Jun 5, 2025

Added in the last two commits a couple of optimisations to reduce the arithmetic intensity of convert_layouts.
The first one computes addresses directly in i8. When we computed them in bf16 units, we ended up with many shl 1 instructions that none of the compilers further down the stack were able to optimise away.
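For example (an illustrative sketch, not the PR's code), an offset kept in bf16-element units has to be scaled by the element size at every access, which is where the shl 1 comes from, whereas a byte offset is added to the base directly:

#include <cstdint>

// Illustrative only: offsets in bf16-element units need a scale at every
// access (the shl 1 seen in the generated code)...
uint8_t *addrFromElementOffset(uint8_t *base, uint32_t elemOff) {
  return base + (elemOff << 1); // one shift per load/store for 2-byte elements
}

// ...whereas offsets computed in bytes (i8) from the start fold the shift away.
uint8_t *addrFromByteOffset(uint8_t *base, uint32_t byteOff) {
  return base + byteOff;
}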

The second optimisation was proposed by @apgoucher and it's a bit more complex to explain, but it basically boils down to the following:
We have a computation of the form b + (a ^ c), where b is the base pointer, a is a variable that depends on the thread and warp id, and c is a constant. We split c = c1 + c2, where c2 has the property that none of its set bits is ever set in a for any thread or warp. With this property we can rewrite the computation above as

b + (a ^ c) = b + (a ^ (c1 + c2)) = b + (a ^ c1) + c2

Then, b + (a ^ c1) does not need to be recomputed for every load/store, and the + c2 can be passed as an immediate offset to LDS/STS. In short, when this optimisation applies (and it mostly applies to larger tensors, which is exactly when we want to reduce the arithmetic pressure), we generate code like:

	bar.sync 	0;
	and.b32 	%r1099, %r1124, 112;
	or.b32 	%r1100, %r1099, %r5;
	selp.b32 	%r1101, 0, 2064, %p218;
	xor.b32 	%r1102, %r1100, %r1101;
	or.b32 	%r1103, %r32, %r1102;
	or.b32 	%r1104, %r1103, %r33;
	add.s32 	%r1105, %r243, %r1104;
	ld.shared.v2.b32 	{%r1045, %r1061}, [%r1105];
	ld.shared.v2.b32 	{%r1049, %r1065}, [%r1105+512];
	ld.shared.v2.b32 	{%r1053, %r1069}, [%r1105+1024];
	ld.shared.v2.b32 	{%r1057, %r1073}, [%r1105+1536];
	xor.b32 	%r1106, %r1104, 4128;
	add.s32 	%r1107, %r243, %r1106;
	ld.shared.v2.b32 	{%r1046, %r1062}, [%r1107];
	ld.shared.v2.b32 	{%r1050, %r1066}, [%r1107+512];
	ld.shared.v2.b32 	{%r1054, %r1070}, [%r1107+1024];
	ld.shared.v2.b32 	{%r1058, %r1074}, [%r1107+1536];
	xor.b32 	%r1108, %r1104, 8256;
	add.s32 	%r1109, %r243, %r1108;
	ld.shared.v2.b32 	{%r1047, %r1063}, [%r1109];
	ld.shared.v2.b32 	{%r1051, %r1067}, [%r1109+512];
	ld.shared.v2.b32 	{%r1055, %r1071}, [%r1109+1024];
	ld.shared.v2.b32 	{%r1059, %r1075}, [%r1109+1536];
	xor.b32 	%r1110, %r1104, 12384;
	add.s32 	%r1111, %r243, %r1110;
	ld.shared.v2.b32 	{%r1048, %r1064}, [%r1111];
	ld.shared.v2.b32 	{%r1052, %r1068}, [%r1111+512];
	ld.shared.v2.b32 	{%r1056, %r1072}, [%r1111+1024];
	ld.shared.v2.b32 	{%r1060, %r1076}, [%r1111+1536];
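
To see why the rewrite is valid, here is a small standalone check (not part of the PR; the masks are made up for illustration) that a ^ (c1 + c2) == (a ^ c1) + c2 whenever c2's set bits appear neither in a nor in c1:

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical masks, purely for illustration: 'a' only ever uses the bits
  // in aMask, and c2's set bits are disjoint from both aMask and c1.
  const uint32_t aMask = 0x000000ff;
  const uint32_t c1 = 0x00000030;
  const uint32_t c2 = 0x00001100;
  assert((c2 & aMask) == 0 && (c2 & c1) == 0);
  for (uint32_t a = 0; a <= aMask; ++a) {
    assert((a & ~aMask) == 0);                  // only values 'a' can take
    assert((a ^ (c1 + c2)) == ((a ^ c1) + c2)); // the rewrite is exact
  }
  std::puts("identity holds for every admissible a");
  return 0;
}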

@lezcano lezcano force-pushed the cvt_ldst branch 3 times, most recently from d73a965 to 35eb1f2 Compare June 5, 2025 14:55
@lezcano lezcano force-pushed the cvt_ldst branch 4 times, most recently from 1273f87 to 1c8b65e Compare June 17, 2025 08:20
@lezcano lezcano enabled auto-merge (squash) June 17, 2025 12:44
@lezcano lezcano merged commit b78022a into main Jun 17, 2025
9 checks passed
@lezcano lezcano deleted the cvt_ldst branch June 17, 2025 12:47
tie-pilot-qxw pushed a commit to tie-pilot-qxw/triton that referenced this pull request Aug 30, 2025
…riton-lang#6982)

Co-authored-by: Adam P. Goucher <apgoucher@openai.com>