
Conversation

@Mogball Mogball commented May 31, 2025

No description provided.

@Mogball Mogball force-pushed the mogball/attn branch 4 times, most recently from 784d43d to 2096971 Compare June 1, 2025 03:52
@Mogball Mogball changed the base branch from main to merge_base June 1, 2025 03:52
@Mogball Mogball force-pushed the mogball/attn branch 4 times, most recently from b8cde4d to 9fed043 Compare June 2, 2025 21:01
@Mogball Mogball changed the base branch from merge_base to main June 2, 2025 21:02
@Mogball Mogball changed the base branch from main to merge_base June 2, 2025 21:02
@Mogball Mogball force-pushed the mogball/attn branch 3 times, most recently from 97ed0fd to 984460f Compare June 3, 2025 18:52
@Mogball Mogball changed the base branch from merge_base to main June 3, 2025 18:53
@Mogball Mogball force-pushed the mogball/attn branch 14 times, most recently from 40eea56 to e98c5ff Compare June 12, 2025 07:53
@Mogball Mogball changed the title from "[WIP][Gluon] Naive attention implementation" to "[Gluon] Implement attention kernels for d64 and d128" Jun 13, 2025
@Mogball Mogball marked this pull request as ready for review June 13, 2025 05:33
@Mogball Mogball requested a review from ptillet as a code owner June 13, 2025 05:33
```
self.ready_bars = ready_bars
self.empty_bars = empty_bars
self.num_buffers = gl.constexpr(num_buffers)
self.num_consumers = gl.constexpr(num_consumers)
```
Contributor:

I wonder, would an annotated assignment work here?

Collaborator Author:

No because this is being executed as Python code :/

Collaborator Author:

Would love ideas on how to improve the syntax here. There are a couple of other places where explicit wrapping of values with gl.constexpr is needed, even inside Gluon code
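For reference, a minimal sketch of the two styles under discussion, assuming `gl` is the Gluon language module; the import path and the `Channel` class below are illustrative assumptions, not code from this PR:

```
# Sketch only: contrasts explicit gl.constexpr wrapping with the
# annotated-assignment form asked about above.
from triton.experimental.gluon import language as gl  # assumed import path


class Channel:  # illustrative stand-in for the class in this PR
    def __init__(self, num_buffers, num_consumers):
        # Current style: explicit wrapping. Since __init__ executes as
        # ordinary Python, only an actual gl.constexpr(...) call produces
        # a constexpr value.
        self.num_buffers = gl.constexpr(num_buffers)

        # Annotated-assignment style: in plain Python the annotation is
        # only a hint, so the value is stored unwrapped.
        self.num_consumers: gl.constexpr = num_consumers
```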

```
@gluon.jit
def release(self):
    if isinstance(self.mem, gl.shared_memory_descriptor):
        self.mem._keep_alive()
```
Contributor:

Do we need to add _keep_alive for tensor memory descriptors as well?

Collaborator Author:

There is no corresponding operation for it, but thinking about it, we might need one

Mogball added 13 commits June 16, 2025 22:25
```
fused-attention-batch4-head32-d64-fwd-causal=True-config=Config(warp_specialize=True, use_gluon=False):
     N_CTX  Triton [FP16]  Triton [FP8]
0   1024.0     206.903414    226.941886
1   2048.0     397.637820    397.894749
2   4096.0     538.967275    537.617627
3   8192.0     633.806800    630.562116
4  16384.0     690.005849    685.848451
fused-attention-batch4-head32-d64-fwd-causal=True-config=Config(warp_specialize=False, use_gluon=True):
     N_CTX  Triton [FP16]  Triton [FP8]
0   1024.0     158.854438    167.127676
1   2048.0     430.994165    445.445875
2   4096.0     574.152492    593.125123
3   8192.0     669.977292    692.156473
4  16384.0     725.290898    749.413157
```
```
fused-attention-batch4-head32-d128-fwd-causal=True-config=Config(warp_specialize=True, use_gluon=False):
     N_CTX  Triton [FP16]  Triton [FP8]
0   1024.0     168.101257    180.147833
1   2048.0     232.412164    251.379866
2   4096.0     282.099851    306.706495
3   8192.0     313.280068    342.515038
4  16384.0     331.734557    362.578482
fused-attention-batch4-head32-d128-fwd-causal=True-config=Config(warp_specialize=False, use_gluon=True):
     N_CTX  Triton [FP16]  Triton [FP8]
0   1024.0     153.825955    303.351930
1   2048.0     504.991716    532.393691
2   4096.0     723.003324    750.103002
3   8192.0     888.283770    924.001097
4  16384.0     979.523873   1045.632286
```
@peterbell10 peterbell10 left a comment

🚀

Mogball commented Jun 17, 2025

```
Attention Z=4 H=32 D=64 causal=False:
     N_CTX  triton-fp16  cudnn-fp16
0   1024.0   439.260007  567.950997
1   2048.0   624.868309  805.275085
2   4096.0   702.756313  888.467925
3   8192.0   744.283398  942.050952
4  16384.0   762.487030  929.791875
5  32768.0   765.736062  929.684688
6  65536.0   758.262973  900.141163
Attention Z=4 H=32 D=64 causal=True:
     N_CTX  triton-fp16  cudnn-fp16
0   1024.0   222.745914  359.200951
1   2048.0   433.364914  625.464007
2   4096.0   575.664633  723.858504
3   8192.0   666.592775  812.912723
4  16384.0   706.420267  811.271677
5  32768.0   751.074616  913.670943
6  65536.0   743.274618  920.962759
Attention Z=4 H=32 D=128 causal=False:
     N_CTX  triton-fp16   cudnn-fp16
0   1024.0   636.543972   921.466612
1   2048.0   945.496280  1216.067987
2   4096.0  1055.771847  1404.380647
3   8192.0  1126.162364  1357.966027
4  16384.0  1147.708498  1221.496024
5  32768.0  1209.871280  1243.138221
6  65536.0  1189.113742  1234.391085
Attention Z=4 H=32 D=128 causal=True:
     N_CTX  triton-fp16   cudnn-fp16
0   1024.0   290.267626   557.898058
1   2048.0   491.085408   847.120446
2   4096.0   720.423634  1070.995853
3   8192.0   864.481281  1213.994755
4  16384.0  1026.677104  1202.142195
5  32768.0  1139.247450  1222.141755
6  65536.0  1126.982660  1244.151394
```

Final perf numbers for this. The gap for D64 is quite big now, since cuDNN appears to have improved its perf a lot, from 720 -> 920. For D128, most of the remaining gap, especially at small context lengths, comes down to making the kernel persistent.

@Mogball Mogball enabled auto-merge (squash) June 17, 2025 17:09
@Mogball Mogball merged commit 5c63c72 into main Jun 17, 2025
9 checks passed
@Mogball Mogball deleted the mogball/attn branch June 17, 2025 17:14
tie-pilot-qxw pushed a commit to tie-pilot-qxw/triton that referenced this pull request Aug 30, 2025