Huge congratulations to the community on shipping vLLM-Omni v0.12.0rc1! 🎉 This release marks a shift from experimental multimodal support to production-grade serving stability. We are moving from "it runs" to "it runs fast and fits your existing stack." Here is what changed in this milestone (187 commits):
🚀 Diffusion Engine Overhaul
We refactored the diffusion stack to support state-of-the-art acceleration techniques natively. Integrated TeaCache and Cache-DiT to reduce redundant computation. Added Extended Attention support (Sage, Ulysses, Ring Attention) for high-resolution workloads. Optimized torch.compile for DiT and RoPE kernels.
🔌 OpenAI-Compatible Serving
One of the biggest friction points for adoption has been custom APIs. Now shipping native OpenAI-compatible endpoints for Image (/v1/images/generations) and Speech. Includes production essentials like Streaming Output and Request Abort to save compute resources.
📹 Expanded Model Support
Video: Wan2.2 (T2V/I2V/TI2V)
Image: Qwen-Image-2512, Stable Diffusion 3, Ovis Image
Hardware: Official AMD ROCm Docker and CI support.
A massive thank you to the 45 contributors (including 34 new faces!) who drove this release. Check out the full release notes and documentation below.
Release: https://lnkd.in/gjF9Te9C
Docs: https://lnkd.in/g8Sk-6CC
#vLLM #Multimodal #AIInference #OpenSource #DiffusionModels
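A minimal client sketch for the /v1/images/generations endpoint mentioned above, using the standard OpenAI Python client. The base URL, the served model name, and the assumption that the response carries base64-encoded image data follow the OpenAI API convention rather than anything confirmed in this post; check the vLLM-Omni docs for the exact behavior.

```python
# Hedged sketch: calling a vLLM-Omni OpenAI-compatible image generation endpoint.
# Base URL, model name, and response format below are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.images.generate(
    model="Qwen/Qwen-Image",        # assumed served model name
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
    response_format="b64_json",     # assumed: request base64 image data, as in the OpenAI API
)

# Decode and save the first generated image.
with open("lighthouse.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))
```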
vLLM
Software Development
An open source, high-throughput and memory-efficient inference and serving engine for LLMs.
About us
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
- Website: https://github.com/vllm-project/vllm
- Industry: Software Development
- Company size: 51-200 employees
- Type: Nonprofit
Updates
-
vLLM reposted this
Today, we’re shipping our first major release: 𝘃𝗟𝗟𝗠 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗥𝗼𝘂𝘁𝗲𝗿 𝘃𝟬.𝟭 — 𝗰𝗼𝗱𝗲𝗻𝗮𝗺𝗲 𝗜𝗿𝗶𝘀 🌈
In just ~3 months since our experimental launch, the vLLM Semantic Router has grown insanely fast:
✅ 600+ PRs merged
✅ 300+ issues addressed
✅ 50+ engineers contributing worldwide
We are building the 𝗦𝘆𝘀𝘁𝗲𝗺-𝗹𝗲𝘃𝗲𝗹 𝗶𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲 for 𝗠𝗶𝘅𝘁𝘂𝗿𝗲-𝗼𝗳-𝗠𝗼𝗱𝗲𝗹𝘀 (MoM): it sits between users and models, extracts signals from requests, responses, and context, and makes routing decisions—model selection, guardrails (jailbreak, PII, hallucination, tool-call), semantic caching, and more.
What's new in Iris:
• 𝗦𝗶𝗴𝗻𝗮𝗹–𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝗣𝗹𝘂𝗴𝗶𝗻 𝗖𝗵𝗮𝗶𝗻 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 (from “14 fixed categories” → “unlimited decisions”)
• 𝗠𝗼𝗱𝘂𝗹𝗮𝗿 𝗟𝗼𝗥𝗔 𝗸𝗲𝗿𝗻𝗲𝗹 (shared base compute + lightweight adapters)
• 𝗛𝗮𝗹𝘂𝗚𝗮𝘁𝗲: 3-stage hallucination detection (sentinel → token detector → explainer), alongside jailbreak and PII detection
• 𝗢𝗻𝗲-𝗰𝗼𝗺𝗺𝗮𝗻𝗱 𝗶𝗻𝘀𝘁𝗮𝗹𝗹 + production Helm charts + Dashboard
• 𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲𝘀 𝗔𝗣𝗜 𝘀𝘂𝗽𝗽𝗼𝗿𝘁 + 𝗶𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝘁𝗼𝗼𝗹 𝘀𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻 for agentic workflows
• more details in the blog...
Next milestone: 𝘃𝟬.𝟮 “𝗔𝘁𝗵𝗲𝗻𝗮”
We're aiming for the 𝗦𝗶𝗴𝗻𝗮𝗹 𝗖𝗼𝗺𝗽𝗼𝘀𝗲𝗿, advanced ML-powered 𝗠𝗼𝗱𝗲𝗹 𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗔𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺𝘀 (k-means/MLP/Graph/Ratings/RL), out-of-the-box 𝗠𝗲𝗺𝗼𝗿𝘆 + 𝗥𝗼𝘂𝘁𝗲𝗿 𝗥𝗲𝗽𝗹𝗮𝘆, and a deep dive into the 𝗺𝘂𝗹𝘁𝗶-𝘁𝘂𝗿𝗻 + 𝘀𝗮𝗳𝗲𝘁𝘆 areas.
Builders wanted — research + infra folks: 𝘤𝘰𝘮𝘦 𝘤𝘰-𝘣𝘶𝘪𝘭𝘥 𝘵𝘩𝘦 𝘧𝘶𝘵𝘶𝘳𝘦 𝘰𝘧 𝘪𝘯𝘵𝘦𝘭𝘭𝘪𝘨𝘦𝘯𝘵 𝘳𝘰𝘶𝘵𝘪𝘯𝘨.
Read the blog: https://lnkd.in/gXyX6Mbn
Shout out to the folks who have helped so much so far; we could not have made it without the wider community: Huamin Chen Simon Mo Chen Wang Yue Zhu Haichen Zhang Andy Luo Senan Zedan Yossi Ovadia Anish Maddipoti Tyler Hutcherson Ivar Flakstad
Website: https://lnkd.in/gQu2pHt3
GitHub: https://lnkd.in/gDnEJjvi
Slack: https://lnkd.in/g-7XRtwE
#vLLM #OpenSource #LLM #GenAI #Inference #MLOps #Agents #Routing #MoM #SemanticRouter #LLMRouting #IntelligentRouting #Agentic #MLSys
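A minimal sketch of the "sits between users and models" idea described above: the client keeps speaking the OpenAI API but points at the router, which selects the backend model per request. The port and the "auto" model alias here are assumptions for illustration only; see the linked docs and blog for the real deployment values.

```python
# Hedged sketch: sending an ordinary OpenAI-style request through the Semantic Router.
# Base URL and the "auto" model alias are assumed, not taken from this post.
from openai import OpenAI

# Point the client at the router instead of a single model server.
client = OpenAI(base_url="http://localhost:8801/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",   # assumed alias: let the router pick the model for this request
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(resp.choices[0].message.content)
```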
-
🚀 Update: The vLLM Talent Pool is delivering results! In just over one month, we've already helped several students and engineers land internships at top AI labs and infra teams via our referrals. The industry demand for vLLM expertise is massive.
We are still actively collecting resumes for roles in the US & China, specifically looking for:
✅ Custom Kernels (CUDA/CUTLASS)
✅ Distributed Systems (Ray/K8s)
✅ Speculative Decoding & RL
✅ Multimodal (Diffusion/Audio/Video)
✅ Model Architecture Support
✅ Other inference skills
If you build on vLLM, we want to hear from you. 🤝
Apply: talentpool@vllm.ai (Sending your resume means you agree to share it with partner companies.)
Let's get you hired! 👜
🚀 vLLM Talent Pool is Open!
As LLM adoption accelerates, vLLM has become the mainstream inference engine used across major cloud providers (AWS, Google Cloud, Azure, Alibaba Cloud, ByteDance, Tencent, Baidu…) and leading model labs (DeepSeek, Moonshot, Qwen…). To meet the strong demand from top companies, the vLLM community is now collecting resumes year-round and helping with referrals (internships & full-time).
If you have experience in any of the following areas, we'd love to hear from you:
• RL frameworks & algorithms for LLMs
• Tool calling, MCP, Harmony format, OpenAI/Anthropic API
• Structured output / constrained decoding
• High-performance kernels: attention, GEMM, sampling, sorting
• CUTLASS / CuTe DSL / TileLang
• Distributed systems: Ray, multiprocessing
• vLLM + Kubernetes
• Tensor / expert / context parallelism
• NCCL, DeepEP, NVSHMEM, RDMA, NVLink
• Prefill/decode separation, KV-cache transport
• Speculative decoding (Eagle, MTP, …)
• MoE optimization
• KV-cache memory management (hybrid models, prefix caching)
• Multimodal inference (audio/image/video/text)
• LoRA
• Rust / Go / C++ / Python serving stacks
• Attention mechanisms (MLA, MQA, SWA, linear attention)
• Position encodings (RoPE, mRoPE)
• Model architectures (DeepSeek, Qwen, etc.)
• Embedding model support
• torch.compile integration
…or any other LLM inference engineering experience.
Bonus points if you have:
• Implemented core features in vLLM
• Contributed to vLLM integrations (verl, OpenRLHF, Unsloth, LlamaFactory…)
• Written widely shared technical blogs on vLLM
💰 Compensation: Highly competitive, with no upper limit for exceptional inference engineers.
📍 Locations: Major cities in the US (SF Bay Area, etc.) and major cities in China (Beijing / Shanghai / Shenzhen / Guangzhou / Chengdu…)
📨 Apply: Send your resume to talentpool@vllm.ai (Sending your resume means you agree to share it with partner companies.)
🌱 Join the vLLM community:
Slack: Apply at http://slack.vllm.ai
Chinese community (WeChat): Add vllm_project with your name & affiliation
Let's build easy, fast, and cheap LLM serving for everyone — together! ⚡
-
✨ What a way to end 2025! 🎉 vllm-project/vllm just hit 2,000 contributors on GitHub.
This milestone belongs to all of you. From the first-time contributor fixing a doc typo to the systems engineer rewriting kernels—you are the reason vLLM evolves so fast. Thank you for every PR, every issue, and every debate. We built this engine together.
With this speed comes complexity. To help you track your code, we added a little utility. 👇
**New on http://vllm.ai: PR Release Finder**
Ever wondered "Which release first included my PR?" Now you can find out instantly on our new website. Just enter your PR number or URL to track your code's journey into production.
Try it out: https://vllm.ai/pr-lookup
Together, let's keep making LLM serving easy, fast, and cheap for everyone.
#vLLM #OpenSource #Milestone #AIInfrastructure
-
🎉 Big news: The official vLLM website is LIVE! https://vllm.ai
We've built a dedicated hub for our growing community, separating logistics from code so the GitHub repo can focus purely on development.
Highlights:
✨ Interactive vLLM Install Selector (GPU, CPU, etc.)
📆 Community Events Calendar (office hours, meetups, etc.)
📚 Centralized Resources (docs, recipes, etc.)
In addition to the new website:
🔖 New dedicated channels for Talent, Collaboration, and Social Promotions (details on site!)
📈 Track dev velocity with the new vLLM Daily repo: https://lnkd.in/gKHb3bCV
More details: https://lnkd.in/gsV92uMb
Let's build the future of inference together!
-
Huge congrats to MiniMax on shipping M2.1! 🎉 A massive step forward for open-source agents. 🚀
vLLM provides Day-0 support for this release. We are excited to empower the community to run this full-stack development powerhouse with maximum efficiency.
Model Repo: https://lnkd.in/gWwEbV9q
Deploy Recipe: https://lnkd.in/g8AJ3EnF
Commands below 👇
#vLLM #MiniMax #OpenSource #AI #LLM
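The official commands live in the deploy recipe linked above; the following is only a hedged sketch of offline generation with vLLM's Python API. The model ID, tensor-parallel size, and trust_remote_code flag are assumptions and should be replaced with the values from the recipe and your hardware.

```python
# Hedged sketch: running a large agentic model with vLLM's offline API.
# Model id, parallelism, and flags below are assumed, not the official recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2.1",   # assumed Hugging Face repo id
    tensor_parallel_size=8,           # assumed: shard across 8 GPUs for a model this size
    trust_remote_code=True,           # assumed: custom model code may be required
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a CLI todo app in Python."], params)
print(outputs[0].outputs[0].text)
```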
-
🎉 The vLLM community has added support for LongCat-Image-Edit (from the Meituan LongCat team) in vLLM-Omni.
- Simpler path to serve instruction-following image edits
- Supports common operations like object add/replace, background changes, and style adjustments
- Useful for retouching tools and creative editing pipelines
Recipe: https://lnkd.in/gSaD6r_h
Repo: https://lnkd.in/gKU2v9e9
-
🎉 Congrats to the GLM team on GLM-4.7 — a step up in the GLM-4.x series, with day-0 serving support in vLLM!
⚡ Supports MTP (multi-token prediction) speculative decoding for higher throughput.
⚙️ Tool/function calling.
🧠 Thinking controls: interleaved/preserved/per-turn.
Read more: https://lnkd.in/gu3cZw_4
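A hedged sketch of exercising the tool/function calling mentioned above through vLLM's OpenAI-compatible server. It assumes a server is already running with tool calling enabled per the linked recipe; the base URL, served model name, and the get_weather tool are hypothetical, and the tools payload simply follows the standard OpenAI chat-completions schema.

```python
# Hedged sketch: tool calling against a vLLM OpenAI-compatible endpoint.
# Base URL, model name, and the example tool are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.7",                                # assumed served model name
    messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)           # tool calls emitted by the model, if any
```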
-
Diffusion serving is expensive: dozens of timesteps per image, and a lot of redundant compute between adjacent steps.
⚡ vLLM-Omni now supports diffusion cache acceleration backends (TeaCache + Cache-DiT) to reuse intermediate Transformer computations — no retraining, minimal quality impact!
🚀 Benchmarks (NVIDIA H200, Qwen-Image 1024x1024): TeaCache 1.91x, Cache-DiT 1.85x. For Qwen-Image-Edit, Cache-DiT hits 2.38x!
Blog: https://lnkd.in/gV2-SPD9
Docs: https://lnkd.in/gPxMsp6w
#vLLM #vLLMOmni #DiffusionModels #AIInference
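To make the "redundant compute between adjacent steps" point concrete, here is an illustrative sketch of the caching idea behind TeaCache-style acceleration, not vLLM-Omni's actual implementation or API: when the denoiser's input changes little from one timestep to the next, reuse the previously computed output instead of running the full DiT forward pass. The function name, threshold, and change metric are invented for this example.

```python
# Illustrative sketch of timestep output caching for diffusion transformers.
# Names and the change metric are assumptions, not the TeaCache/Cache-DiT code.
import torch

def cached_denoise(dit, latents_per_step, threshold=0.05):
    cached_input, cached_output = None, None
    accumulated_change = 0.0
    outputs = []
    for x in latents_per_step:                      # one latent per diffusion timestep
        if cached_input is not None:
            # Relative input change vs. the last step we actually computed.
            rel = (x - cached_input).abs().mean() / (cached_input.abs().mean() + 1e-8)
            accumulated_change += rel.item()
        if cached_output is not None and accumulated_change < threshold:
            outputs.append(cached_output)           # cache hit: skip the expensive forward
            continue
        out = dit(x)                                # cache miss: full transformer forward
        cached_input, cached_output = x, out
        accumulated_change = 0.0
        outputs.append(out)
    return outputs
```

The accumulated-change threshold trades a small amount of fidelity for fewer full forward passes, which is why speedups like the 1.85x to 2.38x figures above are possible without retraining.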
-
vLLM reposted this
In this PyTorch Foundation Spotlight, Simon Mo shares why vLLM became a PyTorch Foundation project in May 2025 and how the project has been built on PyTorch from the beginning. He discusses vLLM's role within a broader ecosystem of model builders and hardware providers, its focus on ease of use and efficiency, and how the project helps users get value quickly while continuing to improve inference efficiency and drive down cost.
🎥 Watch the Spotlight: https://lnkd.in/gUNDgi5z
#PyTorch #vLLM #AIInfrastructure #OpenSourceAI