Contributor

@zwu-2025 commented Aug 1, 2025

Expose AMD buffer_load and buffer_store in Gluon. Example usage looks like:

def buffer_ldst_kernel(x, y):
    layout: ttgl.constexpr = ttgl.BlockedLayout(size_per_thread=[1, 1], threads_per_warp=[1, 64], warps_per_cta=[4, 1], order=[1, 0])
    offsets = ttgl.arange(0, 64 * 64, layout=layout)
    a = ttgl.amd.cdna3.buffer_load(ptr=x, offsets=offsets)
    ttgl.amd.cdna3.buffer_store(stored_value=a, ptr=y, offsets=offsets)

layout = ttgl._unwrap_if_constexpr(layout)

ret_ty = ttgl.distributed_type(element_type, shape, layout)
handle = self.builder.create_buffer_load(ret_ty.to_ir(self.builder), ptr, offsets, cache_modifier, mask, other)
Contributor

Can you infer the layout from the offsets, the dtype from the pointer, and add defaults for mask and other? You probably also need to add broadcasting.

My general feedback would be to think about this as a language API, not just a wrapper to emit IR.
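
A minimal sketch of what this feedback could look like, in plain Python rather than Gluon. `Pointer`, `Tensor`, and this `buffer_load` are hypothetical stand-ins, not the Triton API: the result dtype comes from the pointer, the shape and layout come from `offsets`, and `mask` and `other` default to `None`. Broadcasting is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Pointer:
    """Hypothetical stand-in for a typed pointer (not a Triton type)."""
    element_ty: str


@dataclass(frozen=True)
class Tensor:
    """Hypothetical stand-in for a distributed tensor (not a Triton type)."""
    dtype: str
    shape: Tuple[int, ...]
    layout: str


def buffer_load(ptr: Pointer, offsets: Tensor,
                mask: Optional[Tensor] = None,
                other: Optional[Tensor] = None,
                cache: Optional[str] = None) -> Tensor:
    """Describe the result of a load without asking for dtype, shape, or layout."""
    if mask is not None and mask.shape != offsets.shape:
        raise ValueError("mask must have the same shape as offsets")
    if other is not None and other.shape != offsets.shape:
        raise ValueError("other must have the same shape as offsets")
    # dtype is inferred from the pointer; shape and layout from the offsets.
    return Tensor(dtype=ptr.element_ty, shape=offsets.shape, layout=offsets.layout)


x = Pointer(element_ty="fp16")
offsets = Tensor(dtype="int32", shape=(64, 64), layout="blocked")
loaded = buffer_load(x, offsets)  # no element_type or layout argument needed
```

With this shape of signature, the common call site only passes `ptr` and `offsets`, which matches the example usage in the PR description.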

Comment on lines 358 to 359
handle = self.builder.create_buffer_store(stored_value, ptr, offsets, cache_modifier, mask)
return ttgl.tensor(handle, ttgl.void)
Contributor

I want to move away from tl.void. I don't think it adds any value.

Suggested change
- handle = self.builder.create_buffer_store(stored_value, ptr, offsets, cache_modifier, mask)
- return ttgl.tensor(handle, ttgl.void)
+ self.builder.create_buffer_store(stored_value, ptr, offsets, cache_modifier, mask)
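
As a rough illustration of the suggestion, the store can call the builder purely for its side effect and return nothing. `RecordingBuilder` and this free-standing `buffer_store` are hypothetical stand-ins written in plain Python, not the Triton builder or the Gluon semantic class.

```python
class RecordingBuilder:
    """Hypothetical stand-in that records the ops it is asked to create."""

    def __init__(self):
        self.ops = []

    def create_buffer_store(self, stored_value, ptr, offsets, cache_modifier, mask):
        self.ops.append(("buffer_store", stored_value, ptr, offsets, cache_modifier, mask))


def buffer_store(builder, stored_value, ptr, offsets, mask=None, cache_modifier=""):
    """Emit the store; there is no result value, so nothing is wrapped or returned."""
    builder.create_buffer_store(stored_value, ptr, offsets, cache_modifier, mask)


builder = RecordingBuilder()
result = buffer_store(builder, stored_value="a", ptr="y", offsets="offsets")
```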



@builtin
def create_buffer_load(ptr, element_type, offsets, cache, mask, layout, other, _semantic=None):
Contributor

The create_ prefix is only used by the builder, not in the language.

Suggested change
- def create_buffer_load(ptr, element_type, offsets, cache, mask, layout, other, _semantic=None):
+ def buffer_load(ptr, element_type, offsets, cache, mask, layout, other, _semantic=None):

@zwu-2025 changed the title from "Expose buffer_load and buffer_store in Gluon" to "Expose buffer_load and buffer_store to Gluon" on Aug 1, 2025
@antiagainst changed the title from "Expose buffer_load and buffer_store to Gluon" to "[AMD][Gluon] Expose buffer_load and buffer_store to Gluon" on Aug 2, 2025
@zwu-2025 force-pushed the buffer_ldst branch 3 times, most recently from d6fcb54 to 79fdfea on August 4, 2025 02:13
same style as the implementation in other backend
element_type = ptr.type.scalar.element_ty

if mask is not None:
    assert mask.shape == shape, "offsets must have the same shape as offsets"
Collaborator

"mask must have .." :)

Collaborator

This is not addressed yet?
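
One way to avoid this kind of copy-paste slip is to build the message from the argument name. The helper below is a hypothetical plain-Python sketch, not code from the PR; `check_same_shape` is an assumed name.

```python
def check_same_shape(name, shape, offsets_shape):
    """Raise if `shape` differs from `offsets_shape`, naming the offending argument."""
    if tuple(shape) != tuple(offsets_shape):
        raise ValueError(
            f"{name} must have the same shape as offsets: "
            f"got {tuple(shape)}, expected {tuple(offsets_shape)}"
        )


# Matching shapes pass silently.
check_same_shape("mask", (64, 64), (64, 64))

# A mismatch reports the argument that is actually wrong.
try:
    check_same_shape("mask", (32, 64), (64, 64))
except ValueError as err:
    message = str(err)
```

The same helper can then be reused for `other` without writing a second message by hand.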

@antiagainst marked this pull request as ready for review on August 5, 2025 00:44
Collaborator

@antiagainst left a comment

Nice, just one final comment from me. Also, @peterbell10 to take another look.

Collaborator

@antiagainst left a comment

LGTM. I'll land once @peterbell10 is okay.

@antiagainst merged commit 376b9b9 into triton-lang:main on Aug 5, 2025
9 checks passed
ptillet pushed a commit that referenced this pull request Aug 7, 2025
@zwu-2025 deleted the buffer_ldst branch on August 14, 2025 18:09