[AMD] Rewrite extract_slice op implementation #7128
Review threads (outdated, resolved):
third_party/amd/include/Dialect/TritonAMDGPU/IR/TritonAMDGPUOps.td (3 threads)
third_party/amd/lib/TritonAMDGPUDialectToLLVM/ExtractSliceOpToLLVM.cpp (3 threads)
680299e to f961aff
@antiagainst I addressed the comments. Thanks for the review!
antiagainst
left a comment
LGTM now. Thanks for tidying it up! Adding @ThomasRaoux to take another look to make sure this also looks good.
Hi @ThomasRaoux. Would you have time to have a look at this PR?
ThomasRaoux
left a comment
LGTM, thanks for improving this
This PR refactors the extract_slice operation to support two major
improvements:
1) Relaxed Layout Constraints
The operation now allows more flexible source and destination layouts,
aligning better with linear layouts.
2) Support for Arbitrary Tensor Ranks
extract_slice is no longer limited to 2D tensors and can now handle
tensors of any rank.
The "extract_slice" operation enables extracting a slice of a tensor in
registers.
It supports the following arguments:
* source: the base tensor on which to create a view tensor
* offsets: offsets into the base tensor at which to create the view
In distributed layouts, tensors are divided into CTA tiles.
A CTA tile represents the smallest contiguous portion of a tensor that
is distributed across all threads and warps within a workgroup. The
extract_slice operation extracts a portion of the tensor that aligns with
CTA tile boundaries.
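
A minimal sketch of what "aligns with CTA tile boundaries" could mean for a
tensor of any rank. This is not the verifier from this PR: the function name,
the explicit cta_tile_shape argument (in Triton it would be derived from the
layout), and the exact rules (offsets and result sizes are multiples of the
CTA tile size in every dimension, and the slice stays in bounds) are
assumptions made for illustration.

```python
from typing import Sequence


def is_cta_tile_aligned(
    src_shape: Sequence[int],
    cta_tile_shape: Sequence[int],  # hypothetical input, not read from a layout
    offsets: Sequence[int],
    result_shape: Sequence[int],
) -> bool:
    """Return True if the slice starts and ends on CTA tile boundaries."""
    rank = len(src_shape)
    # All per-dimension descriptions must have the same rank as the source.
    if not (len(cta_tile_shape) == len(offsets) == len(result_shape) == rank):
        return False
    for size, tile, off, res in zip(src_shape, cta_tile_shape, offsets, result_shape):
        if tile <= 0 or res <= 0 or off < 0:
            return False
        # The slice must lie inside the source tensor.
        if off + res > size:
            return False
        # Assumed rule: both the start and the extent are whole CTA tiles.
        if off % tile != 0 or res % tile != 0:
            return False
    return True


# Rank-3 example: a 128x64x32 tensor covered by 32x16x32 CTA tiles.
aligned = is_cta_tile_aligned((128, 64, 32), (32, 16, 32), (32, 16, 0), (64, 32, 32))
# Same slice, but starting 8 elements into a tile along dimension 0.
misaligned = is_cta_tile_aligned((128, 64, 32), (32, 16, 32), (8, 16, 0), (64, 32, 32))
```

Here aligned is True and misaligned is False, because offset 8 is not a
multiple of the tile size 32 along dimension 0.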
This op is designed to work on logical tensors directly, avoiding the
need for complex layout reinterpretation or reshaping.
For example, the tt.split operation only supports splitting along the
innermost dimension, and requires that the resulting innermost dimension
provide 2 elements per thread, distributed across registers.
In contrast, the extract_slice op imposes no constraints on the extraction
dimension or on the sizes of the dimensions.
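
The contrast can be sketched on logical tensors, modeled here as nested
Python lists. This only illustrates the logical result, not how registers are
handled; split_innermost and extract_slice are hypothetical helpers, and the
sizes parameter stands in for the shape that the real op takes from its
result type.

```python
def split_innermost(tensor):
    """Model a split along the innermost dimension, which must have size 2."""
    if isinstance(tensor[0], list):
        pairs = [split_innermost(sub) for sub in tensor]
        return [p[0] for p in pairs], [p[1] for p in pairs]
    if len(tensor) != 2:
        raise ValueError("innermost dimension must have size 2")
    return tensor[0], tensor[1]


def extract_slice(tensor, offsets, sizes):
    """Model a slice at `offsets` with extent `sizes`, for any rank."""
    if not offsets:
        return tensor
    off, size = offsets[0], sizes[0]
    return [
        extract_slice(sub, offsets[1:], sizes[1:])
        for sub in tensor[off:off + size]
    ]


# Split: only the innermost dimension, and only when it has size 2.
lo, hi = split_innermost([[1, 2], [3, 4]])

# Slice: a 2x4x4 tensor cut along the middle dimension (rows 2..3).
src = [[[100 * i + 10 * j + k for k in range(4)] for j in range(4)] for i in range(2)]
middle = extract_slice(src, offsets=(0, 2, 0), sizes=(2, 2, 4))
```

The split yields lo == [1, 3] and hi == [2, 4], while middle has shape 2x2x4
and keeps the innermost dimension intact.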
---------
Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>