zh Wang hhy3

50 followers · 46 following

@zilliztech
Shanghai

Starred repositories

FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

Python 440 56 Updated Aug 1, 2024

Infini-AI-Lab / UMbreLLa

LLM Inference on consumer devices

Python 113 16 Updated Mar 17, 2025

Infini-AI-Lab / MagicPIG

[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation

Python 203 14 Updated Dec 16, 2024

andy-yang-1 / DoubleSparse

16-fold memory access reduction with nearly no loss

Python 91 7 Updated Mar 26, 2025

SandAI-org / MAGI-1

MAGI-1: Autoregressive Video Generation at Scale

Python 2,611 117 Updated Apr 25, 2025

microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table

C++ 752 60 Updated Apr 22, 2025

mpitutorial / mpitutorial

MPI programming lessons in C and executable code examples

C 2,259 759 Updated Aug 14, 2024

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…

C++ 10,335 1,384 Updated Apr 26, 2025

huggingface / picotron

Minimalistic 4D-parallelism distributed training framework for educational purposes

Python 1,008 85 Updated Mar 7, 2025

IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 809 65 Updated Sep 4, 2024

pranjalssh / fast.cu

Fastest kernels written from scratch

Cuda 245 31 Updated Apr 3, 2025

modelscope / evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Python 837 93 Updated Apr 25, 2025

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, Du…

Rust 4,543 287 Updated Apr 25, 2025

neuralmagic / guidellm

Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs

Python 277 32 Updated Apr 25, 2025

AnthonyCalandra / modern-cpp-features

A cheatsheet of modern C++ language and library features.

20,357 2,155 Updated Apr 5, 2025

thu-ml / SageAttention

Quantized attention that achieves speedups of 2-3x and 3-5x compared to FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.

Cuda 1,399 97 Updated Apr 21, 2025