NVIDIA Dynamo is an open-source modular inference framework for serving generative AI models in distributed environments. It enables seamless scaling of inference workloads across large GPU fleets with dynamic resource scheduling, intelligent request routing, optimized memory management, and accelerated data transfer.
When serving the open-source DeepSeek-R1 671B reasoning model on NVIDIA GB200 NVL72, NVIDIA Dynamo increased the number of requests served by up to 30x, making it an ideal solution for AI factories aiming to run at the lowest possible cost while maximizing token revenue.
NVIDIA Dynamo supports all major AI inference backends and features large language model (LLM)-specific optimizations, such as disaggregated serving, which accelerate and scale AI reasoning models at the lowest cost and highest efficiency. It will be supported as part of NVIDIA AI Enterprise in a future release.