[BACKEND] Implement generic swizzling when lowering convert_layout
#6982
I'll run benchmarks and do a couple of minor clean-ups tomorrow. I will also add a couple of lit tests, although there is already one for the fp8 transpose which shows that we can indeed vectorise it.
// Shared memory is available after a tensor's liveness range ends
// expected-remark @below {{reusable}}
// expected-remark @below {{size = 4608}}
// expected-remark @below {{size = 8192}}
Seems like shared memory usage has increased a lot.
These often come from being able to vectorise more than before (and, as such, not being able to do as many reps).
So you haven't seen any cases in internal benchmarks that either slow down or run out of shared memory?
I'm tracking some regressions, but no, I have not seen any OOM
Alright, should I approve the PR now or after you've finished debugging?
Either works; I have to figure out what's going on in those regressions before landing either way.
auto logBankConflicts = std::min<int32_t>(
    std::max<int32_t>(0, lenSegment - A.size() - segment.size()), A.size());
// Conflict-free
for (int i = logBankConflicts; i < A.size(); ++i)
This ^ operator here isn't clear to me, but we can chat offline
this part is in the explanation of the algorithm in the paper, but yes, I agree it is quite a tricky part
Oh, I remember now: this is the union of the two subspaces.
yep! This code finds the largest subspace that's not in the union of the two given ones.
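To make the exchange above concrete, here is a minimal sketch (not the PR's code) of one way to build a subspace over F_2 that meets the union of two given subspaces only in the zero vector. It assumes vectors of F_2^n are encoded as bitmasks with n < 32, so that vector addition is the `^` (XOR) operator. `XorBasis` and `complementOfUnion` are hypothetical names, and the brute-force scan over all 2^n vectors is for illustration only.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Basis of a subspace of F_2^n, kept in decreasing order so that every
// stored vector has a distinct leading bit (hypothetical helper).
struct XorBasis {
  std::vector<uint32_t> vecs;

  // Remainder of v after elimination; zero means v lies in the span.
  uint32_t reduce(uint32_t v) const {
    for (uint32_t b : vecs)
      v = std::min(v, v ^ b); // XOR with b only if that clears b's leading bit
    return v;
  }

  // Adds v to the basis if it is linearly independent of the span.
  bool insert(uint32_t v) {
    v = reduce(v);
    if (v == 0)
      return false;
    vecs.push_back(v);
    std::sort(vecs.begin(), vecs.end(), std::greater<uint32_t>());
    return true;
  }
};

// Greedily collects a basis C such that span(C) meets span(A) and span(B)
// only in zero. A candidate v is accepted when it lies outside both
// span(A) + span(C) and span(B) + span(C). Assumes n < 32.
std::vector<uint32_t> complementOfUnion(const std::vector<uint32_t> &A,
                                        const std::vector<uint32_t> &B,
                                        int n) {
  XorBasis ac, bc; // track span(A) + span(C) and span(B) + span(C)
  for (uint32_t a : A)
    ac.insert(a);
  for (uint32_t b : B)
    bc.insert(b);
  std::vector<uint32_t> C;
  for (uint32_t v = 1; v < (1u << n); ++v) {
    if (ac.reduce(v) != 0 && bc.reduce(v) != 0) {
      C.push_back(v);
      ac.insert(v);
      bc.insert(v);
    }
  }
  return C;
}
```

For example, with A spanned by `0b01` and B spanned by `0b10` in F_2^2, the only nonzero vector outside both subspaces is `0b11`, so the result is that single basis vector. Scanning only the standard basis vectors would miss it, which is why the sketch enumerates every vector.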
b278e9f to bed6ee3
|
In the last two commits I added a couple of optimisations to alleviate the arithmetic intensity of [...]. The second optimisation was proposed by @apgoucher and it's a bit more complex to explain, but it basically boils down to the following: [...] Then, [...]
d73a965 to 35eb1f2
1273f87 to 1c8b65e
This reverts commit d7a2fa8.
…riton-lang#6982)

We implement a generic swizzling algorithm by @apgoucher that, given two linear layouts, finds the optimal shared memory layout that maximises read/write vectorisation and, provided that, minimises bank conflicts.

We also implement an algorithm to find the minimum tile size necessary to perform the `convert_layout` given the restrictions above, and we use it to perform the `convert_layout` iteratively.

This PR does not yet implement a lowering to ldmatrix/stmatrix, we'll do that in a future PR.

---------

Co-authored-by: Adam P. Goucher <apgoucher@openai.com>
We implement a generic swizzling algorithm by @apgoucher that, given two linear layouts, finds the optimal shared memory layout that maximises read/write vectorisation and, provided that, minimises bank conflicts.
We also implement an algorithm to find the minimum tile size necessary to perform the `convert_layout` given the restrictions above, and we use it to perform the `convert_layout` iteratively.
This PR does not yet implement a lowering to ldmatrix/stmatrix; we'll do that in a future PR.
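As background for why swizzling affects bank conflicts, here is a small host-side sketch of the classic XOR swizzle on a 32x32 tile of 32-bit words. It is not the generic algorithm from this PR. It assumes 32 shared-memory banks of one word each, and `swizzledOffset` and `columnReadConflicts` are hypothetical names. The sketch counts how many lanes of a warp land in the same bank when each lane reads one element of a single column.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 32; // assumed: 32 banks, one 32-bit word per bank

// Word offset of element (row, col) in a 32x32 tile. With swizzling, the
// column index is XORed with the row index before being stored.
int swizzledOffset(int row, int col, bool swizzle) {
  int physCol = swizzle ? (col ^ row) : col;
  return row * kBanks + physCol;
}

// Largest number of lanes that hit the same bank when lane i reads
// element (i, col), i.e. when a warp reads one column of the tile.
int columnReadConflicts(int col, bool swizzle) {
  std::array<int, kBanks> hits{};
  for (int lane = 0; lane < kBanks; ++lane)
    ++hits[swizzledOffset(lane, col, swizzle) % kBanks];
  return *std::max_element(hits.begin(), hits.end());
}
```

Without the swizzle every lane maps to bank `col`, so the read is serialised 32 ways. With it, `col ^ row` takes 32 distinct values as `row` ranges over 0..31, so each lane gets its own bank.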