AI Platforms / Deployment

NVIDIA Dynamo Adds Support for AWS Services to Deliver Cost-Efficient Inference at Scale


Amazon Web Services (AWS) developers and solution architects can now take advantage of NVIDIA Dynamo on NVIDIA GPU-based Amazon EC2 instances, including Amazon EC2 P6 instances accelerated by NVIDIA Blackwell, with added support for Amazon Simple Storage Service (Amazon S3), in addition to existing integrations with Amazon Elastic Kubernetes Service (Amazon EKS) and AWS Elastic Fabric Adapter (EFA). This update unlocks a new level of performance, scalability, and cost efficiency for serving large language models (LLMs) at scale.

NVIDIA Dynamo scales and serves generative AI

NVIDIA Dynamo is an open-source inference-serving framework purpose-built for large-scale distributed environments. It supports all major inference frameworks, including PyTorch, SGLang, TensorRT-LLM, and vLLM, and provides advanced optimization capabilities such as:

  • Disaggregated serving: Separates the prefill and decode inference stages onto distinct GPUs to increase throughput.
  • LLM-aware routing: Routes requests to maximize KV cache hit rates and avoid recomputation costs.
  • KV cache offloading: Offloads KV cache across cost-efficient memory hierarchies to reduce inference costs.

Together, these features enable NVIDIA Dynamo to deliver best-in-class inference performance and cost efficiency for large-scale, multi-node LLM deployments.
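To make the routing idea concrete, here is a minimal Python sketch of KV-cache-aware routing: a router scores each worker by how much of the request's token prefix it already holds in cache and sends the request to the best match. This is a conceptual illustration only; the class names, block size, and hashing scheme are assumptions and do not reflect Dynamo's actual implementation.

```python
from dataclasses import dataclass, field

# Conceptual sketch of KV-cache-aware ("LLM-aware") routing; not Dynamo's actual code.
# Each worker tracks which token-prefix blocks it already holds in its KV cache, and
# the router sends a request to the worker with the longest matching prefix so that
# prefill work is reused instead of recomputed.

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks

def prefix_block_hashes(tokens):
    """Hash every block-aligned prefix of the token sequence."""
    return [hash(tuple(tokens[:end])) for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)]

def route(tokens, workers):
    """Pick the worker whose cache overlaps most with this request's prefix."""
    blocks = prefix_block_hashes(tokens)

    def overlap(worker):
        count = 0
        for block_hash in blocks:  # count contiguous leading blocks already cached
            if block_hash not in worker.cached_blocks:
                break
            count += 1
        return count

    best = max(workers, key=overlap)
    best.cached_blocks.update(blocks)  # after serving, the worker holds this prefix
    return best

workers = [Worker("gpu-0"), Worker("gpu-1")]
prompt = list(range(64))            # stand-in for a tokenized prompt
print(route(prompt, workers).name)  # first request: no cache anywhere, arbitrary pick
print(route(prompt, workers).name)  # same prompt again: routed to the same worker (cache hit)
```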

Seamless integration with AWS Services

If you're serving LLMs on AWS as a developer or solution architect, Dynamo integrates seamlessly into your existing inference architecture:

  • Amazon S3: Dynamo NIXL now supports Amazon S3, an object storage service that offers virtually unlimited scalability, high performance, and low costs. 

    Computing KV cache is resource-intensive and costly, so it's common to reuse cached values instead of recomputing them. However, as AI workloads grow, the amount of KV cache required for reuse can quickly overwhelm GPU and even host memory. By offloading KV cache to S3, developers free up valuable GPU memory for serving new requests. The integration also spares developers from building custom plug-ins: they can offload KV cache to S3 seamlessly and reduce overall inference costs (a conceptual sketch of the underlying storage pattern follows this list).
  • Amazon EKS: Dynamo runs on Amazon EKS, a fully managed Kubernetes service that enables developers to run and scale containerized applications without having to manage Kubernetes infrastructure.

    As LLMs grow in size and complexity, production inference deployments require advanced components such as LLM-aware request routing, disaggregated serving, and KV cache offloading. These tightly integrated components add complexity when deploying in Kubernetes environments. With this support, developers can deploy Dynamo directly into their EKS-managed Kubernetes clusters and quickly spin up new Dynamo replicas on demand to handle inference workload spikes.
Figure 1: Dynamo on AWS deployment architecture using Amazon EKS, showing the Availability Zone, Virtual Private Cloud, EKS control plane, and CPU and GPU nodes
  • AWS Elastic Fabric Adapter (EFA): Dynamo's NIXL data transfer library supports EFA, a network interface that provides low-latency internode communication between Amazon EC2 instances.

    As LLMs grow in size and adopt sparse Mixture of Experts (MoE) architectures, sharding them across multiple GPUs boosts throughput while maintaining low latency. In these setups, inference data transferred between GPU nodes on AWS travels over EFA. With Dynamo's EFA support, developers can easily move KV cache across nodes using simple get, push, and delete commands through NIXL's front-end API. This unlocks Dynamo's advanced features, such as disaggregated serving, without custom plug-ins, accelerating time to production for AI applications.
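Returning to the Amazon S3 integration above, the snippet below illustrates the storage pattern behind KV cache offloading: serialize a cached KV block, write it to an S3 bucket with boto3, and read it back on a later cache hit. The bucket name, key scheme, block shape, and serialization format are assumptions for illustration; in practice, Dynamo's NIXL integration performs these transfers for you.

```python
import io

import boto3
import numpy as np

# Illustration of the storage pattern behind KV cache offload to Amazon S3.
# Dynamo/NIXL automates this; the bucket name and key scheme here are hypothetical.

s3 = boto3.client("s3")
BUCKET = "my-kv-cache-bucket"  # hypothetical bucket name

def offload_kv_block(block_hash: str, kv_block: np.ndarray) -> None:
    """Serialize a KV cache block and push it to S3, freeing GPU/host memory."""
    buf = io.BytesIO()
    np.save(buf, kv_block)
    s3.put_object(Bucket=BUCKET, Key=f"kv/{block_hash}.npy", Body=buf.getvalue())

def fetch_kv_block(block_hash: str):
    """Fetch a previously offloaded KV block on a cache hit; return None if absent."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"kv/{block_hash}.npy")
    except s3.exceptions.NoSuchKey:
        return None
    return np.load(io.BytesIO(obj["Body"].read()))

# Example: offload an illustrative (layers, k/v, heads, tokens, head_dim) block, reload it later.
block = np.zeros((32, 2, 8, 16, 128), dtype=np.float16)
offload_kv_block("abc123", block)
restored = fetch_kv_block("abc123")
```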

Optimizing Inference with Dynamo on Blackwell-powered Amazon P6 Instances 

Dynamo is compatible with any NVIDIA GPU-accelerated AWS instance, but when paired with Amazon EC2 P6 instances powered by Blackwell, it delivers a significant performance boost when deploying advanced reasoning models like DeepSeek R1 and the latest Llama 4. Dynamo streamlines and automates the complexities of serving disaggregated MoE models by managing critical tasks such as prefill and decode autoscaling, along with rate matching.
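To see why rate matching between the prefill and decode pools matters, here is a back-of-the-envelope calculation. All throughput and sequence-length numbers are illustrative assumptions, not measured figures; Dynamo automates this balancing for you.

```python
# Back-of-the-envelope rate matching between prefill and decode worker pools in a
# disaggregated deployment. Every number below is an illustrative assumption.

avg_input_tokens = 2048         # average prompt length (assumed)
avg_output_tokens = 256         # average generation length (assumed)
prefill_tokens_per_s = 40_000   # throughput of one prefill worker (assumed)
decode_tokens_per_s = 4_000     # throughput of one decode worker (assumed)

# Requests per second each worker type can sustain on its own.
prefill_rps = prefill_tokens_per_s / avg_input_tokens  # ~19.5 req/s
decode_rps = decode_tokens_per_s / avg_output_tokens   # ~15.6 req/s

# To keep both stages busy without queueing, size the pools so their request
# rates match: prefill_workers * prefill_rps ≈ decode_workers * decode_rps.
decode_per_prefill = prefill_rps / decode_rps
print(f"Decode workers needed per prefill worker: {decode_per_prefill:.2f}")  # 1.25
```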

At the same time, Amazon EC2 P6-B200 instances feature fifth-generation Tensor Cores, FP4 acceleration, and 2x the NVIDIA NVLink bandwidth of the prior generation. The P6e-GB200 UltraServers, powered by the NVIDIA GB200 NVL72, add a unique scale-up architecture that delivers 130 TB/s of aggregate all-to-all bandwidth, designed to accelerate the intensive communication patterns of wide expert-parallel decode in MoE deployments. Together, Dynamo and Blackwell-powered P6 instances improve GPU utilization, increase request throughput per dollar, and drive sustainable margin growth for production-scale AI workloads.

Get started with NVIDIA Dynamo 

Deepening Dynamo’s integrations with AWS helps developers seamlessly scale their inference workloads. 
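As a simple example of what scaling looks like on the EKS side, the sketch below bumps the replica count of a worker Deployment with the official Kubernetes Python client. The namespace and Deployment name are hypothetical, and in a real installation Dynamo's own autoscaling and Kubernetes tooling would typically manage replica counts for you.

```python
from kubernetes import client, config

# Generic illustration of scaling a Deployment in an EKS cluster with the official
# Kubernetes Python client. The "dynamo" namespace and "dynamo-worker" Deployment
# name are hypothetical placeholders for whatever your Dynamo install creates.

config.load_kube_config()  # e.g. a kubeconfig created by `aws eks update-kubeconfig`
apps = client.AppsV1Api()

def scale_workers(replicas: int, name: str = "dynamo-worker", namespace: str = "dynamo") -> None:
    """Patch the replica count of a worker Deployment to absorb a traffic spike."""
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_workers(8)  # scale out ahead of an anticipated load spike
```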

NVIDIA Dynamo runs on any NVIDIA GPU-accelerated AWS instance. Start optimizing your inference stack today by deploying with NVIDIA Dynamo.
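Once a Dynamo deployment is running, a quick way to exercise it is to send a request to its HTTP frontend, which serves an OpenAI-compatible API. The host, port, and model name below are assumptions for illustration and will differ in your deployment.

```python
import requests

# Minimal smoke test against a running Dynamo frontend (OpenAI-compatible API).
# The endpoint URL and model name are placeholders; substitute your deployment's values.

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        "messages": [{"role": "user", "content": "Hello from Dynamo on AWS!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```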
