hhy3

Organizations

@milvus-io


Starred repositories

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

Python 440 56 Updated Aug 1, 2024

LLM Inference on consumer devices

Python 113 16 Updated Mar 17, 2025

[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation

Python 203 14 Updated Dec 16, 2024

16-fold memory access reduction with nearly no loss

Python 91 7 Updated Mar 26, 2025

MAGI-1: Autoregressive Video Generation at Scale

Python 2,611 117 Updated Apr 25, 2025

Low-bit LLM inference on CPU with lookup table

C++ 752 60 Updated Apr 22, 2025

MPI programming lessons in C and executable code examples

C 2,259 759 Updated Aug 14, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…

C++ 10,335 1,384 Updated Apr 26, 2025

Minimalistic 4D-parallelism distributed training framework for educational purposes

Python 1,008 85 Updated Mar 7, 2025

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 809 65 Updated Sep 4, 2024

Fastest kernels written from scratch

Cuda 245 31 Updated Apr 3, 2025

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Python 837 93 Updated Apr 25, 2025

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, Du…

Rust 4,543 287 Updated Apr 25, 2025

Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs

Python 277 32 Updated Apr 25, 2025

A cheatsheet of modern C++ language and library features.

20,357 2,155 Updated Apr 5, 2025

Quantized attention that achieves speedups of 2-3x and 3-5x compared to FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.

Cuda 1,399 97 Updated Apr 21, 2025

Python 92 8 Updated Sep 9, 2024

Efficient and easy multi-instance LLM serving

Python 387 31 Updated Apr 25, 2025

FlatBuffers: Memory Efficient Serialization Library

C++ 24,117 3,331 Updated Apr 14, 2025

KV cache store for distributed LLM inference

C++ 158 14 Updated Apr 2, 2025

Collective communications library with various primitives for multi-machine training.

C++ 1,289 324 Updated Apr 25, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 790 77 Updated Dec 30, 2024

Ongoing research training transformer models at scale

Python 12,175 2,731 Updated Apr 26, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 7,139 788 Updated Apr 26, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 1,061 79 Updated Apr 26, 2025

LongBench v2 and LongBench (ACL 2024)

Python 850 83 Updated Jan 15, 2025

[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Cuda 1,536 86 Updated Apr 26, 2025

An extremely fast Python package and project manager, written in Rust.

Rust 51,422 1,448 Updated Apr 26, 2025

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 6,222 536 Updated Apr 25, 2025

Distributed Triton for Parallel Systems

MLIR 566 33 Updated Apr 24, 2025