feat: Add FP4 (E2M1) KV Cache Support with Quantization Utilities for MLA #10078
Conversation
Summary of Changes
Hello @JackChuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances SGLang's memory efficiency and inference performance by enabling the use of FP4 (E2M1) precision for the KV cache in Multi-Head Latent Attention. This low-precision caching mechanism allows for substantial reductions in GPU memory consumption while largely preserving model accuracy, and it maintains full backward compatibility with existing FP16 and FP8 workflows.
Highlights
- FP4 (E2M1) KV Cache Support: Added support for an FP4 (E2M1) KV cache for Multi-Head Latent Attention (MLA) to reduce memory usage and improve inference efficiency.
- FP4 Quantization Utilities: Introduced `KVFP4QuantizeUtil` for efficient block-wise FP4 quantization and dequantization of tensors (a minimal sketch follows this list).
- Core KV Cache Integration: Integrated the FP4 KV cache into `ModelRunner` and `MLATokenToKVPool`, including a new `kv_scale_buffer` and a Triton kernel (`set_mla_kv_scale_buffer_kernel`) for handling scale factors.
- Server Argument Extension: Extended the server arguments with `--kv-cache-dtype=fp4_e2m1` for easy activation.
- Unit Tests and Benchmarks: Included comprehensive unit tests to validate FP4 correctness (MSE, MAE, PSNR, relative error) and benchmarks comparing FP4 and FP8 performance.
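For intuition, here is a minimal, hypothetical sketch of block-wise FP4 (E2M1) quantization and dequantization in plain PyTorch. It only illustrates the idea behind `KVFP4QuantizeUtil`; the function names, the unpacked uint8 storage, and the per-block absolute-max scaling below are assumptions, not the PR's implementation (which packs two 4-bit codes per byte and uses GPU kernels).

```python
# Hypothetical sketch of block-wise FP4 (E2M1) quant/dequant, not the PR's code.
# Assumes the last dimension is divisible by the 16-element block size.
import torch

E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # elements per scaling block


def quantize_fp4_blockwise(x: torch.Tensor):
    """Return (codes, scales): one 4-bit code per element, one scale per block."""
    shape = x.shape
    blocks = x.reshape(-1, BLOCK).float()
    # Scale each block so its largest magnitude maps to E2M1's maximum (6.0).
    scales = blocks.abs().amax(dim=-1, keepdim=True) / 6.0
    scales = torch.where(scales == 0, torch.ones_like(scales), scales)
    scaled = blocks / scales
    # Round each scaled value to the nearest representable E2M1 magnitude.
    dist = (scaled.abs().unsqueeze(-1) - E2M1_VALUES.to(x.device)).abs()
    mag_idx = dist.argmin(dim=-1).to(torch.uint8)   # 3-bit magnitude code
    sign = (scaled < 0).to(torch.uint8)             # 1-bit sign
    codes = (sign << 3) | mag_idx                   # 4-bit code (stored unpacked)
    return codes.reshape(shape), scales.reshape(*shape[:-1], -1)


def dequantize_fp4_blockwise(codes: torch.Tensor, scales: torch.Tensor):
    """Inverse of quantize_fp4_blockwise."""
    mag = E2M1_VALUES.to(codes.device)[(codes & 0x7).long()]
    sign = 1.0 - 2.0 * ((codes >> 3) & 1).float()   # sign bit set -> negative
    vals = (sign * mag).reshape(-1, BLOCK) * scales.reshape(-1, 1)
    return vals.reshape(codes.shape)
```

Storing one 4-bit code per element plus one scale per 16 elements is where the memory savings over FP16/FP8 caches come from.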
Code Review
This pull request introduces FP4 (E2M1) KV cache support, which is a great feature for reducing memory usage. The implementation of the quantization utilities and their integration into the memory pool and model runner looks mostly correct. However, I've found a few issues that need to be addressed:
- There is a critical bug in the memory footprint calculation for the FP4 KV cache in `model_runner.py`, which could lead to incorrect memory allocation (see the rough sizing sketch below).
- The new test file for FP4 quantization has a typo in an import path, which will cause it to fail.
- There are some repeated imports in `memory_pool.py` that could be consolidated for better code clarity.
I have provided specific comments and suggestions for these points. Once these are addressed, this PR should be in good shape.
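On the first point, the FP4 footprint is essentially packed 4-bit data plus a small scale buffer. A rough, hypothetical sizing helper follows; the function names, block size, and dimensions are illustrative assumptions, not the actual `model_runner.py` logic.

```python
# Hypothetical back-of-the-envelope sizing for an FP4 KV-cache entry:
# 4-bit packed data (2 elements per byte) plus one 1-byte scale per
# 16-element block. Not the PR's model_runner.py code.

def fp4_kv_bytes_per_token(dim: int, block_size: int = 16) -> int:
    data_bytes = dim // 2                    # two FP4 values per byte
    scale_bytes = dim // block_size          # one 1-byte scale per block
    return data_bytes + scale_bytes


def fp16_kv_bytes_per_token(dim: int) -> int:
    return dim * 2                           # 2 bytes per element


if __name__ == "__main__":
    dim = 576  # illustrative MLA latent + rope width for a DeepSeek-style model
    print(fp4_kv_bytes_per_token(dim), fp16_kv_bytes_per_token(dim))
    # 324 vs 1152 bytes per token: roughly a 3.5x reduction
    # once the scale buffer is accounted for.
```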
(Force-pushed 8995fd7 to 66f1dd7.)
(Force-pushed 66f1dd7 to 087a4a1.)
Great work! Looks solid overall.
Are there any more tests to evaluate the accuracy drop in long-context scenarios?
Please fix CI and lint.
Thanks @AniZpZ for the prompt review and helpful feedback! We will fix them and let you know.
Do you have a specific dataset result you’d like to see? We can collect the data.
I've observed a significantly larger drop in accuracy on AIME 25 compared to GSM8K, which leads me to hypothesize that the accuracy loss is related to context length.
(Force-pushed 087a4a1 to 6d3f263.)
If AIME25 is considered a long-context dataset, then GPQA_Diamond should also fall into the long-context category. If you have any more representative long-context datasets to recommend, I can give them a try.
Hi @Fridge003 @zhyncs, I've rebased to v0.5.3 and fixed the conflicts. Can you please launch the CI and check again? Thank you~
@JackChuang It shows there are still conflicts with the latest main.
@Fridge003 Oops, you are right. It seems like v0.5.3 is also outdated. I will rebase to main then. Thanks, and sorry for the inconvenience.
While merging with the main branch, I noticed that the latest main branch includes the assumption:
I’ll need to look into this further and run some tests before merging. Thanks.
(Force-pushed 56abc5c to bdf4871.)
Hi @Fridge003 @zhyncs, I’ve rebased to main and fixed the conflicts. Note: it turns out the current main and v0.5.4.post1 branches cannot run. As a result, I’ve tested my code on v0.5.4, and it’s working.
(Force-pushed bdf4871 to d2cc365.)
Do you mean v0.5.4.post1 cannot run with this PR, or are there other bugs?
@Fridge003 What I mean is: I was originally developing on the main and v0.5.4.post1 branches, and I found that my KV4 setup couldn’t run. Even after removing the KV cache quantization option, it still didn’t work. Then I switched to v0.5.4, and everything worked fine, so I confirmed that the issue wasn’t caused by my code.
Hi @Fridge003 @zhyncs, when you get a chance, could you please check again if this is ready to merge? Thanks a lot!
```python
    dtype=self.data_type,
    device=self.device,
)
if self.data_type == getattr(torch, "float4_e2m1fn_x2", None) and _is_cuda:
```
Can we change `self.data_type == getattr(torch, "float4_e2m1fn_x2", None) and _is_cuda` to a utils function for better readability?
One last comment
Extend the `--kv-cache-dtype` argument in ServerArgs to support "fp4_e2m1".

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
- Introduce `KVFP4QuantizeUtil` for FP4 (E2M1) quantization and dequantization.
- Provides `batched_quantize` and `batched_dequantize` methods for block-wise (16) processing of [M, N, K] tensors.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
- Introduce `test_kvfp4_quant_dequant.py` to validate correctness and performance of KVFP4 quantization.
- Provides metrics calculation (MSE, MAE, PSNR, Relative Error) to compare original and dequantized tensors.
- Benchmarks KVFP4 vs FP8 quant/dequant performance on GPU with large tensors.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
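For reference, the metrics named in the commit above could be computed along these lines. This is a hedged sketch, not the PR's test code, and the function name is made up.

```python
# Hypothetical error metrics between an original tensor and its FP4
# quantize -> dequantize round trip (MSE, MAE, PSNR, relative error).
import torch


def error_metrics(original: torch.Tensor, dequantized: torch.Tensor) -> dict:
    diff = original.float() - dequantized.float()
    mse = diff.pow(2).mean()
    mae = diff.abs().mean()
    # PSNR measured against the original tensor's peak magnitude.
    peak = original.float().abs().max()
    psnr = 10.0 * torch.log10(peak.pow(2) / (mse + 1e-12))
    rel_err = diff.norm() / (original.float().norm() + 1e-12)
    return {
        "mse": mse.item(),
        "mae": mae.item(),
        "psnr_db": psnr.item(),
        "relative_error": rel_err.item(),
    }
```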
Core change enabling low-precision FP4 KV caching for MLA, improving inference efficiency while keeping existing workflows intact.

- Introduce FP4 KV cache support in MLATokenToKVPool for reduced memory usage.
- Add kv_scale_buffer to store FP4 scaling factors and update the allocation logic.
- Implement a Triton kernel to combine nope + rope tensors and write to the KV and scale buffers.
- Modify ModelRunner to account for FP4 buffer sizing and dtype.
- Maintain backward compatibility with the FP16/FP8 KV cache.

Also, move Triton kernels from mem_cache/memory_pool.py to srt/mem_cache/utils.py.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
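As a rough illustration of the kernel described above, here is a stripped-down Triton kernel that writes per-token nope and rope slices into a contiguous KV pool, one program per token. The names, layout, and the omission of the FP4 scale-buffer write are simplifications; this is not the PR's `set_mla_kv_scale_buffer_kernel`.

```python
# Simplified, hypothetical Triton kernel: concatenate a token's "nope" and
# "rope" parts and store them at its slot in the pooled KV buffer.
import triton
import triton.language as tl


@triton.jit
def set_mla_kv_buffer_sketch(
    kv_buffer_ptr,            # [num_slots, nope_dim + rope_dim]
    nope_ptr,                 # [num_tokens, nope_dim]
    rope_ptr,                 # [num_tokens, rope_dim]
    loc_ptr,                  # [num_tokens] destination slot per token
    nope_dim: tl.constexpr,
    rope_dim: tl.constexpr,
    BLOCK: tl.constexpr,      # power of two >= max(nope_dim, rope_dim)
):
    pid = tl.program_id(0)
    slot = tl.load(loc_ptr + pid)
    offs = tl.arange(0, BLOCK)
    row = kv_buffer_ptr + slot * (nope_dim + rope_dim)

    # Copy the nope part into the front of the row.
    mask = offs < nope_dim
    vals = tl.load(nope_ptr + pid * nope_dim + offs, mask=mask)
    tl.store(row + offs, vals, mask=mask)

    # Copy the rope part right after it.
    mask = offs < rope_dim
    vals = tl.load(rope_ptr + pid * rope_dim + offs, mask=mask)
    tl.store(row + nope_dim + offs, vals, mask=mask)

# Example launch (dimensions are assumptions):
# set_mla_kv_buffer_sketch[(num_tokens,)](
#     kv_buffer, nope, rope, loc, nope_dim=512, rope_dim=64, BLOCK=512)
```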
(Force-pushed d2cc365 to 46d44aa.)
@JackChuang Please fix the CI bugs.
- Added `is_cuda()` and `is_float4_e2m1fn_x2()` in `sglang/srt/utils/torch_utils.py`
- Replaced inline checks in relevant modules

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
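A sketch of what those helpers might look like (the real implementations in `torch_utils.py` may differ; the defensive `getattr` reflects that `torch.float4_e2m1fn_x2` only exists in recent PyTorch builds):

```python
# Hypothetical shape of the helpers; not necessarily the exact sglang code.
import torch


def is_cuda() -> bool:
    # True when a CUDA device is available to this torch build.
    return torch.cuda.is_available()


def is_float4_e2m1fn_x2(dtype) -> bool:
    # Look the dtype up defensively: older torch builds do not define it.
    fp4_dtype = getattr(torch, "float4_e2m1fn_x2", None)
    return fp4_dtype is not None and dtype == fp4_dtype
```

With helpers like these, the inline check in the reviewed diff becomes something like `is_float4_e2m1fn_x2(self.data_type) and is_cuda()`.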
(Force-pushed 46d44aa to c128f91.)
Based on PR sgl-project#10078, this patch
- introduces FP4 KV cache support in MHATokenToKVPool with uint8 storage.
- adds k_scale_buffer and v_scale_buffer to store FP4 scaling factors.
- implements batched quantization on cache update and dequantization on access.
- updates ModelRunner memory estimation to account for FP4 scale buffers.
- maintains backward compatibility with FP16/FP8 KV cache.

Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
Summary
This PR introduces FP4 (E2M1) support for Multi-Head Latent Attention (MLA) KV cache in SGLang, enabling low-precision caching to reduce memory usage and improve inference efficiency. It integrates FP4 quantization utilities, Triton kernels, and unit tests while remaining backward compatible with FP16/FP8. See #10083, points 1-1, for more context.
Co-authored-by: @yicwang Yichen Wang [email protected]
Usage
Added the `--kv-cache-dtype=fp4_e2m1` option.

Key Changes

Server Argument Extension
- Added the `--kv-cache-dtype=fp4_e2m1` option to ServerArgs.

FP4 Quantization Utility
- Introduced `KVFP4QuantizeUtil` with `batched_quantize` and `batched_dequantize` methods for block-wise (16) processing of [M, N, K] tensors.

Core KV Cache Integration
- Updated `ModelRunner` and `MLATokenToKVPool` to support the FP4 KV cache.
- Added `kv_scale_buffer` for FP4 scaling factors.
- Added `set_mla_kv_scale_buffer_kernel` for efficient nope+rope tensor handling.

Unit Test & Benchmark
- Added `test_kvfp4_quant_dequant.py` to validate FP4 correctness (MSE, MAE, PSNR, Relative Error) and benchmark FP4 vs FP8 quant/dequant performance.
Impact
Accuracy Tests
The results show that on simpler datasets, the accuracy is nearly lossless compared to the baseline. On more challenging datasets, there is some accuracy degradation, but it remains within an acceptable range.
Performance Results
Tested on B200. The server is running with `--model DeepSeek-R1-0528-FP4 --tp-size 4 --moe-runner-backend flashinfer_trtllm --disable-radix-cache`. The client is running with `--goodput ttft:5000 tpot:50 --random-input-len 3500 --random-output-len 1500 --max-concurrency 50` and `--num-prompts 100`.
The baseline has the best performance, as the trtllm kernel accepts the BF16 KV cache directly, so there is no dequantization overhead at all.
When adding `--kv-cache-dtype fp8_e4m3`, performance drops significantly because the KV cache is quantized on write and dequantized on read for attention.
With `--kv-cache-dtype fp4_e2m1`, throughput is 17.8% higher than with fp8_e4m3, mostly because quantization/dequantization to/from FP4 is faster than for FP8.
Future Work
We plan to support Multi-Head Attention (MHA) next.