
Conversation

@lezcano lezcano commented May 29, 2025

We implement a generic swizzling algorithm by @apgoucher that, given two linear layouts, finds the optimal shared memory layout: it maximises read/write vectorisation and, subject to that, minimises bank conflicts.

We also implement an algorithm to find the minimum tile size necessary to perform the convert_layout given the restrictions above, and we use it to perform the convert_layout iteratively.

This PR does not yet implement a lowering to ldmatrix/stmatrix, we'll do that in a future PR.
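For readers unfamiliar with swizzling, here is a minimal illustrative sketch (a generic textbook xor-swizzle, not the layout this PR's algorithm computes) of why xor-ing the column index avoids bank conflicts while keeping the mapping a bijection:

#include <cstdio>

// Illustrative only: a fixed xor-swizzle, not the optimal layout computed by
// this PR. Assumes one 4-byte element per 32-bit shared-memory bank.
constexpr int kCols = 32;

int swizzledOffset(int row, int col) {
  // Xor-ing with a per-row value permutes the columns within the row, so the
  // tile keeps the same shared-memory footprint.
  return row * kCols + (col ^ (row % kCols));
}

int main() {
  // Reading column 0 of 32 consecutive rows now touches 32 distinct banks; the
  // unswizzled offset row * kCols would hit bank 0 for every row.
  for (int row = 0; row < kCols; ++row)
    std::printf("row %2d -> bank %2d\n", row, swizzledOffset(row, 0) % 32);
  return 0;
}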

@lezcano lezcano requested review from Jokeren and ptillet as code owners May 29, 2025 16:29
@lezcano lezcano marked this pull request as draft May 29, 2025 16:30
@lezcano lezcano changed the title [BACKEND][DNR] Implement generic swizzling when lowering convert_layout [BACKEND][WIP] Implement generic swizzling when lowering convert_layout May 29, 2025
@lezcano lezcano closed this May 29, 2025
@lezcano lezcano reopened this May 29, 2025
@lezcano lezcano marked this pull request as ready for review May 29, 2025 21:38
@lezcano lezcano changed the title [BACKEND][WIP] Implement generic swizzling when lowering convert_layout [BACKEND] Implement generic swizzling when lowering convert_layout May 29, 2025
@lezcano lezcano requested a review from ThomasRaoux May 29, 2025 21:38

lezcano commented May 29, 2025

I'll run benchmarks and do a couple of minor clean-ups tomorrow. I will also add a couple of lit tests, although there is already one for the fp8 transpose which shows that we can indeed vectorise it.

// Shared memory is available after a tensor's liveness range ends
// expected-remark @below {{reusable}}
// expected-remark @below {{size = 4608}}
// expected-remark @below {{size = 8192}}
Contributor

Seems like shared memory usage has increased a lot here.

Contributor Author

These often come from being able to vectorise more than before (and, as such, not being able to do as many reps).

Contributor

So you haven't seen any cases in internal benchmarks that either slow down or run out of shared memory?

Contributor Author

I'm tracking some regressions, but no, I have not seen any OOM

Contributor

Alright, should I approve the PR now or after you've finished debugging?

Contributor Author

Either works; I have to figure out what's going on in those regressions before landing either way.

auto logBankConflicts = std::min<int32_t>(
std::max<int32_t>(0, lenSegment - A.size() - segment.size()), A.size());
// Conflict-free
for (int i = logBankConflicts; i < A.size(); ++i)
Contributor

This ^ operator here isn't clear to me, but we can chat offline

Contributor Author

@lezcano lezcano Jun 1, 2025

This part is covered in the explanation of the algorithm in the paper, but yes, I agree it is quite a tricky part.

Contributor

Oh, I remember now, this is the union of the two subspaces.

Contributor Author

yep! This code finds the largest subspace that's not in the union of the two given ones.
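For intuition, a minimal standalone sketch (my reading of the idea, not the PR's implementation) over F_2: encode basis vectors as bit masks, Gaussian-eliminate the union of the two bases, and return the unit vectors whose coordinates never became pivots; they span the largest subspace that meets the span of the two inputs only at zero.

#include <cstdint>
#include <vector>

// Sketch only: vectors of F_2^n are encoded as bit masks.
std::vector<uint32_t> complementBasis(const std::vector<uint32_t> &A,
                                      const std::vector<uint32_t> &B, int n) {
  // pivot[i] holds the basis vector of the combined span whose leading bit is i.
  std::vector<uint32_t> pivot(n, 0);
  auto insert = [&](uint32_t v) {
    for (int i = n - 1; i >= 0; --i) {
      if (!((v >> i) & 1))
        continue;
      if (!pivot[i]) {
        pivot[i] = v;
        return;
      }
      v ^= pivot[i]; // eliminate bit i and keep reducing
    }
  };
  for (uint32_t v : A)
    insert(v);
  for (uint32_t v : B)
    insert(v);
  // Unit vectors at non-pivot coordinates are independent of every pivot, so
  // together they span a complement of the combined span inside F_2^n.
  std::vector<uint32_t> comp;
  for (int i = 0; i < n; ++i)
    if (!pivot[i])
      comp.push_back(1u << i);
  return comp;
}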

@lezcano lezcano force-pushed the cvt_ldst branch 3 times, most recently from b278e9f to bed6ee3 Compare June 5, 2025 10:24

lezcano commented Jun 5, 2025

Added in the last two commits a couple of optimisations to reduce the arithmetic intensity of convert_layouts.
The first one computes addresses directly in i8. When we computed them in bf16 units, we ended up with many shl 1 instructions that none of the compilers further down the stack were able to optimise away.
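For example (an illustrative sketch, not the PR's code), an offset kept in bf16-element units has to be scaled by the element size at every access, which is where the shl 1 comes from, whereas a byte offset is added to the base directly:

#include <cstdint>

// Illustrative only: offsets in bf16-element units need a scale at every
// access (the shl 1 seen in the generated code)...
uint8_t *addrFromElementOffset(uint8_t *base, uint32_t elemOff) {
  return base + (elemOff << 1); // one shift per load/store for 2-byte elements
}

// ...whereas offsets computed in bytes (i8) from the start fold the shift away.
uint8_t *addrFromByteOffset(uint8_t *base, uint32_t byteOff) {
  return base + byteOff;
}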

The second optimisation was proposed by @apgoucher and it's a bit more complex to explain, but it basically boils down to the following:
We have a computation of the form b + (a ^ c), where b is the base pointer, a is a variable that depends on the thread and warp id, and c is a constant. We split c = c1 + c2, where c2 has the property that none of its set bits is ever set in a for any thread or warp. With this property we can rewrite the computation above as

b + (a ^ c) = b + (a ^ (c1 + c2)) = b + (a ^ c1) + c2

Then, b + (a ^ c1) does not need to be recomputed for every load/store, and the + c2 can be passed as an immediate offset to LDS/STS. In short, when this optimisation applies (and it mostly applies to larger tensors, which is exactly when we want to reduce the arithmetic pressure), we generate code like:

	bar.sync 	0;
	and.b32 	%r1099, %r1124, 112;
	or.b32 	%r1100, %r1099, %r5;
	selp.b32 	%r1101, 0, 2064, %p218;
	xor.b32 	%r1102, %r1100, %r1101;
	or.b32 	%r1103, %r32, %r1102;
	or.b32 	%r1104, %r1103, %r33;
	add.s32 	%r1105, %r243, %r1104;
	ld.shared.v2.b32 	{%r1045, %r1061}, [%r1105];
	ld.shared.v2.b32 	{%r1049, %r1065}, [%r1105+512];
	ld.shared.v2.b32 	{%r1053, %r1069}, [%r1105+1024];
	ld.shared.v2.b32 	{%r1057, %r1073}, [%r1105+1536];
	xor.b32 	%r1106, %r1104, 4128;
	add.s32 	%r1107, %r243, %r1106;
	ld.shared.v2.b32 	{%r1046, %r1062}, [%r1107];
	ld.shared.v2.b32 	{%r1050, %r1066}, [%r1107+512];
	ld.shared.v2.b32 	{%r1054, %r1070}, [%r1107+1024];
	ld.shared.v2.b32 	{%r1058, %r1074}, [%r1107+1536];
	xor.b32 	%r1108, %r1104, 8256;
	add.s32 	%r1109, %r243, %r1108;
	ld.shared.v2.b32 	{%r1047, %r1063}, [%r1109];
	ld.shared.v2.b32 	{%r1051, %r1067}, [%r1109+512];
	ld.shared.v2.b32 	{%r1055, %r1071}, [%r1109+1024];
	ld.shared.v2.b32 	{%r1059, %r1075}, [%r1109+1536];
	xor.b32 	%r1110, %r1104, 12384;
	add.s32 	%r1111, %r243, %r1110;
	ld.shared.v2.b32 	{%r1048, %r1064}, [%r1111];
	ld.shared.v2.b32 	{%r1052, %r1068}, [%r1111+512];
	ld.shared.v2.b32 	{%r1056, %r1072}, [%r1111+1024];
	ld.shared.v2.b32 	{%r1060, %r1076}, [%r1111+1536];
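
To see why the rewrite is valid, here is a small standalone check (not part of the PR; the masks are made up for illustration) that a ^ (c1 + c2) == (a ^ c1) + c2 whenever c2's set bits appear neither in a nor in c1:

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical masks, purely for illustration: 'a' only ever uses the bits
  // in aMask, and c2's set bits are disjoint from both aMask and c1.
  const uint32_t aMask = 0x000000ff;
  const uint32_t c1 = 0x00000030;
  const uint32_t c2 = 0x00001100;
  assert((c2 & aMask) == 0 && (c2 & c1) == 0);
  for (uint32_t a = 0; a <= aMask; ++a) {
    assert((a & ~aMask) == 0);                  // only values 'a' can take
    assert((a ^ (c1 + c2)) == ((a ^ c1) + c2)); // the rewrite is exact
  }
  std::puts("identity holds for every admissible a");
  return 0;
}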

@lezcano lezcano force-pushed the cvt_ldst branch 3 times, most recently from d73a965 to 35eb1f2 Compare June 5, 2025 14:55
@lezcano lezcano force-pushed the cvt_ldst branch 4 times, most recently from 1273f87 to 1c8b65e Compare June 17, 2025 08:20
@lezcano lezcano enabled auto-merge (squash) June 17, 2025 12:44
@lezcano lezcano merged commit b78022a into main Jun 17, 2025
9 checks passed
@lezcano lezcano deleted the cvt_ldst branch June 17, 2025 12:47
tie-pilot-qxw pushed a commit to tie-pilot-qxw/triton that referenced this pull request Aug 30, 2025
…riton-lang#6982)

Co-authored-by: Adam P. Goucher <apgoucher@openai.com>