Changelog

0.0.4 (2024-05-01)

Features

PyTorch 2.3 support
GPU sampling kernels (top-p, top-k)
more GQA group sizes
add mma instructions for fp8 (#179) (d305798)
mma rowsum for fp8 (#180) (5af935c)
support any num_heads for get_alibi_slope (#200) (b217a6f)

Bug Fixes

fix python package dispatch error message (#182) (8eed01c)

0.0.3 (2024-03-08)

Features

adding sm_scale field for all attention APIs (#145) (85d4018)
enable head_dim=256 for attention kernels (#132) (0372acc)
PyTorch API of fp8 kv-cache (#156) (66ee066)
support ALiBi (#146) (383518b)

Bug Fixes

bugfix to pr 135 (#136) (3d55c71)
fix bugs introduced in #132 (#135) (9b7b0b9)
fix FindThrust.cmake (#161) (30fa584)

Misc

add stream argument in BeginForwardFunction of TVMWrapper (#164) (fabfcb5)

Performance Improvements

multiply q by sm_scale in decode kernels (#144) (660c559)

0.0.2 (2024-02-17)

Bug Fixes

add python 3.9 wheels to ci/cd (#114) (2d8807d)
version names cannot include multiple + (#118) (af6bd10)
version naming issue (#117) (c849a90)