
- Shanghai
Starred repositories
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
16-fold memory access reduction with nearly no loss
MAGI-1: Autoregressive Video Generation at Scale
MPI programming lessons in C and executable code examples
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…
Minimalistic 4D-parallelism distributed training framework for educational purposes
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, Du…
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
A cheatsheet of modern C++ language and library features.
Quantized Attention achieves speedups of 2-3x and 3-5x compared to FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
Efficient and easy multi-instance LLM serving
FlatBuffers: Memory Efficient Serialization Library
Collective communications library with various primitives for multi-machine training.
Flash Attention in ~100 lines of CUDA (forward pass only)
Ongoing research training transformer models at scale
verl: Volcano Engine Reinforcement Learning for LLMs
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
An extremely fast Python package and project manager, written in Rust.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Distributed Triton for Parallel Systems