[Gluon] Implement attention kernels for d64 and d128 #7009
Conversation
```python
self.ready_bars = ready_bars
self.empty_bars = empty_bars
self.num_buffers = gl.constexpr(num_buffers)
self.num_consumers = gl.constexpr(num_consumers)
```
I wonder, would an annotated assignment work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because this is being executed as Python code :/
Would love ideas on how to improve the syntax here. There are a couple of other places where explicit wrapping of values with `gl.constexpr` is needed, even inside Gluon code.
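To make the constraint concrete, here is a minimal sketch. The class name `Channel` and the import path are illustrative assumptions; only `gl.constexpr` and the attribute names come from the diff above. Because the constructor runs as ordinary Python rather than being traced by the Gluon compiler, an annotated assignment is never recorded anywhere the frontend could read, so explicit wrapping is the only thing that marks the value as a compile-time constant:

```python
# Illustrative sketch, not the PR's actual class. Assumes the standard Gluon
# import path; `gl.constexpr` is the wrapper discussed in the comments above.
from triton.experimental.gluon import language as gl


class Channel:
    def __init__(self, ready_bars, empty_bars, num_buffers, num_consumers):
        self.ready_bars = ready_bars
        self.empty_bars = empty_bars
        # An annotated assignment (`self.num_buffers: gl.constexpr = num_buffers`)
        # would be a no-op here: attribute annotations in plain Python are not
        # stored anywhere, so the attribute would still hold a plain int.
        # Explicit wrapping is what lets @gluon.jit code treat it as constexpr.
        self.num_buffers = gl.constexpr(num_buffers)
        self.num_consumers = gl.constexpr(num_consumers)
```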
```python
@gluon.jit
def release(self):
    if isinstance(self.mem, gl.shared_memory_descriptor):
        self.mem._keep_alive()
```
Do we need to add `_keep_alive` for tensor memory descriptors as well?
There is no corresponding operation for it, but thinking about it, we might need one.
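A hedged sketch of how `release` might eventually cover both cases, building on the diff above. The tensor-memory branch is only a placeholder, since (per the reply) no corresponding keep-alive operation exists for tensor memory descriptors today; `_keep_alive` on shared memory descriptors comes from the diff itself.

```python
# Assumes `from triton.experimental import gluon` and
# `from triton.experimental.gluon import language as gl`, as in the kernel code.
@gluon.jit
def release(self):
    # Shared memory: presumably extends the buffer's lifetime to this point
    # (this branch is taken verbatim from the diff above).
    if isinstance(self.mem, gl.shared_memory_descriptor):
        self.mem._keep_alive()
    # Tensor memory: no keep-alive operation exists yet. If one were added,
    # this is presumably where release() would dispatch on the descriptor type,
    # e.g. `elif isinstance(self.mem, <tensor memory descriptor type>): ...`.
```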
```
fused-attention-batch4-head32-d64-fwd-causal=True-config=Config(warp_specialize=True, use_gluon=False):
N_CTX Triton [FP16] Triton [FP8]
0 1024.0 206.903414 226.941886
1 2048.0 397.637820 397.894749
2 4096.0 538.967275 537.617627
3 8192.0 633.806800 630.562116
4 16384.0 690.005849 685.848451
fused-attention-batch4-head32-d64-fwd-causal=True-config=Config(warp_specialize=False, use_gluon=True):
N_CTX Triton [FP16] Triton [FP8]
0 1024.0 158.854438 167.127676
1 2048.0 430.994165 445.445875
2 4096.0 574.152492 593.125123
3 8192.0 669.977292 692.156473
4 16384.0 725.290898 749.413157
```
```
fused-attention-batch4-head32-d128-fwd-causal=True-config=Config(warp_specialize=True, use_gluon=False):
N_CTX Triton [FP16] Triton [FP8]
0 1024.0 168.101257 180.147833
1 2048.0 232.412164 251.379866
2 4096.0 282.099851 306.706495
3 8192.0 313.280068 342.515038
4 16384.0 331.734557 362.578482
fused-attention-batch4-head32-d128-fwd-causal=True-config=Config(warp_specialize=False, use_gluon=True):
N_CTX Triton [FP16] Triton [FP8]
0 1024.0 153.825955 303.351930
1 2048.0 504.991716 532.393691
2 4096.0 723.003324 750.103002
3 8192.0 888.283770 924.001097
4 16384.0 979.523873 1045.632286
```
peterbell10 left a comment:
🚀
Final perf numbers for this. The gap for D64 is quite big now, since cuBLAS appears to have improved perf a lot (from 720 to 920). For D128, most of the remaining gap, especially for small contexts, comes down to making the kernel persistent.