Best Practices for AI Inference Infrastructure Performance
E-book
With insights, frameworks, and examples, The IT Leader’s Guide to AI Inference and
Performance equips you with the knowledge to evaluate, deploy, and scale AI solutions
effectively — making it a must-read for decision-makers who want to lead with confidence
in the AI era.
Inference is what brings AI to the real world, solving advanced deployment challenges to
unlock applications — from writing code to summarizing long documents, and even
generating images and short videos. The potential for inference to automate workflows,
create new business models, and power groundbreaking products is transformative —
especially for CIOs and IT leaders tasked with driving AI innovation within their enterprise.
However, with this opportunity comes a key challenge: managing the recurring costs
associated with scaling inference workloads while balancing latency with performance.
Inference costs are tightly correlated with two key factors. The first is the performance, or throughput, of the AI platform itself. The more requests a system can handle efficiently, the better organizations can spread costs across incoming requests, lowering the cost per request. However, as AI models generate tokens in response to requests, whose complexity and type can vary significantly, cost per token becomes an essential metric for assessing both system performance and overall cost efficiency.
However, lowering cost per token must be balanced with maintaining a high-quality user
experience. Maximizing request volume at the expense of user experience can reduce
adoption of the AI-enabled application or service. A poor user experience can lead to
customer churn, which, in turn, erodes revenue. Therefore, it’s essential to measure both
cost per token and system latency.
Latency is typically measured across two dimensions — Time to First Token (TTFT) and Time Per Output Token (TPOT). As we will explore later in more detail, optimizing these dimensions often presents trade-offs. Improving one can lead to the deterioration of the other, so it’s crucial for organizations to strike the right balance.
Figure 1. Optimizing Time to First Token and Time Between Tokens Often Presents Three Trade-offs If Not Correctly Balanced
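As a concrete illustration, the short Python sketch below computes TTFT, average TPOT, and cost per token from client-side timestamps. The streaming iterator, instance price, and throughput figures are hypothetical stand-ins rather than measurements from any specific platform.

```python
import time

def measure_latency(token_stream):
    """Compute TTFT and average TPOT from a streaming token iterator.

    `token_stream` is any iterable that yields output tokens as they arrive
    (for example, a streaming client response). Timestamps are taken on the
    client, so network latency is included in the measurement.
    """
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in token_stream]

    ttft = token_times[0] - start                    # Time to First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0    # average Time Per Output Token
    return ttft, tpot

def cost_per_token(hourly_instance_cost: float, tokens_per_second: float) -> float:
    """Cost per generated token for an instance running at a given throughput."""
    return hourly_instance_cost / (tokens_per_second * 3600)

# Hypothetical numbers: a $4.00/hour instance sustaining 5,000 tokens/second.
print(f"${cost_per_token(4.00, 5000):.8f} per token")
```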
The second critical factor driving inference costs is time to market. The AI landscape is evolving rapidly, and being first to market can provide a significant competitive edge. However, an inference software stack that is unreliable, poorly tested, or lacks comprehensive documentation and strong community support can lead to costly delays, integration challenges, and steep learning curves for internal development teams. While quantifying the cost of these delays can be challenging, the effect is tangible, and IT leaders should incorporate it into their overall inference cost management strategy.
As you continue reading, we will dive deeper into how you can address these factors to
optimize AI inference performance, reduce costs, and maximize the long-term value of your
AI investments.
Chatbot
Consider, for instance, LLM-powered chatbots commonly used in e-commerce and
customer service workflows. These chatbots are responsible for providing quick responses
to short user inquiries and concerns, making responsiveness imperative to prevent user
churn. In this context, a fast TTFT – a measure of how quickly the chatbot begins
outputting a response – is essential for delivering a positive user experience, while a
moderate TPOT that matches or slightly surpasses human reading speed is considered
acceptable. Typical production implementations of chatbots serve very large user bases and use large batch sizes, grouping multiple user requests together and sending them to the AI system for inference as a single batch.
Summarization
On the other end of the spectrum are document summarization use cases, which are
gaining traction across various industries, such as healthcare for summarizing medical
research papers, news and media for summarizing key events and developments, and web
conferencing providers for generating action items from video meetings. In these
scenarios, the input sequences are lengthy, ranging from a few pages to hundreds, with
the focus on quickly generating the full summary rather than instant responsiveness. As a
result, the system must be optimized for fast TPOT, far exceeding human reading speed, as
users are more tolerant of longer TTFT — particularly at long input sequence lengths. To
achieve fast TPOT, the batch sizes in these cases tend to be smaller, requiring advanced
software stack optimizations to maximize throughput at low batch sizes.
Question-Answering
Question-answering bots share many similarities with chatbots, particularly in terms of
fast responsiveness and handling short user queries. However, they differ in that question-answering bots are required to generate answers from a predefined knowledge base or dataset. While the user query itself may be brief, the actual inference request sent to the LLM is much longer, as it includes relevant data in the form of embeddings pulled from the knowledge base. This workflow is commonly referred to as Retrieval Augmented
Generation or RAG. In this scenario, it’s essential to choose an AI system and accelerator
that performs optimally with long input sequence lengths and short to moderate output
sequence lengths to achieve the best TCO and user experience. Additionally, since the
knowledge base data is typically stored outside the GPU memory, in a much larger CPU
memory, deploying systems with fast interconnects between the CPU and GPU —
surpassing traditional PCIe interconnect speeds — can further optimize the user
experience and reduce TCO.
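To see why the inference request grows so much larger than the user's question, consider the minimal RAG sketch below. The embedding function, vector store, and LLM client are hypothetical placeholders rather than a specific NVIDIA API.

```python
# Minimal RAG prompt-assembly sketch. `embed`, `vector_db.search`, and
# `llm.generate` are hypothetical placeholders for an embedding model, a
# vector store, and an inference endpoint.

def answer(question: str, embed, vector_db, llm, top_k: int = 5) -> str:
    # 1. Embed the short user query.
    query_vector = embed(question)

    # 2. Retrieve relevant passages from the knowledge base, which typically
    #    lives in (much larger) CPU memory rather than on the GPU.
    passages = vector_db.search(query_vector, top_k=top_k)

    # 3. The actual inference request is far longer than the user's question:
    #    the retrieved context dominates the input sequence length.
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(p.text for p in passages)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

    # 4. Output sequences are typically short to moderate in length.
    return llm.generate(prompt, max_new_tokens=256)
```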
AI Agents
AI chatbots currently leverage generative AI to provide responses based on single
interactions, utilizing natural language processing to reply to user queries. The next
evolution in AI is agentic AI, which goes beyond simple responses by employing advanced
reasoning and iterative planning to solve complex, multi-step problems. This type of AI
ingests large amounts of data from various sources, allowing it to independently analyze
challenges, develop strategies, and execute tasks. By continuously learning and improving
through a feedback loop, agentic AI systems enhance decision-making and operational
efficiency.
Agentic AI necessitates function calling and the coordination of multiple models that result
in the generation of additional tokens during inference. For leaders leveraging AI agents
within their organization, selecting the right inference platform stack — one that offers the necessary abstraction tools and blueprints to effectively orchestrate these interactions —
is vital to ensure performant inference operations.
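At its core, this orchestration is an iterative loop of model calls and tool executions. The sketch below is a deliberately simplified illustration of that loop, not a specific NVIDIA framework; the llm interface and the tool functions are hypothetical.

```python
import json

# `llm.chat` is a hypothetical client that returns a dict containing either a
# final "content" string or a "tool_call" describing a function to execute.

def run_agent(llm, tools: dict, task: str, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Each planning step is an additional inference call, which is why
        # agentic workflows generate more tokens than single-turn chat.
        reply = llm.chat(history, tools=list(tools))
        history.append({"role": "assistant", **reply})
        if reply.get("tool_call") is None:
            return reply["content"]                  # final answer
        name = reply["tool_call"]["name"]
        args = reply["tool_call"]["arguments"]
        result = tools[name](**args)                 # execute the requested tool
        history.append({"role": "tool", "name": name,
                        "content": json.dumps(result)})
    return "Stopped after max_steps without a final answer."
```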
As demonstrated, different use cases have distinct requirements for Input Sequence Lengths, Batch Sizes, Time to First Token, Time Per Output Token, and CPU to GPU
interconnect. Focusing on benchmarking that closely mirrors your target production use
case when selecting the right instance is crucial for maintaining consistent user service
level agreements and preventing infrastructure costs from gradually increasing due to
under-provisioned hardware.
In the previous section, we examined the different performance metrics you can measure and optimize within an inference solution, and how these metrics are influenced by the specific use case. In this section, we will focus on how deployment factors and architectural considerations affect performance, such as GPU selection, model size, batching strategies, multi-GPU parallelism, autoscaling, and model ensembles.
The aim is not to provide detailed guidance on deploying or architecting your particular
solution, but to equip you with the knowledge needed to design it for optimal performance.
Selecting the right GPU is a crucial factor that significantly impacts performance. However, relying solely on peak chip or instance metrics like rated FLOPS or memory specifications may not tell the full story, especially as delivered workload performance depends greatly on many factors, including the efficiency of the software stack. Similarly, the absolute price of a GPU isn't a meaningful metric. What truly matters is the performance delivered per unit of power or dollar spent. This means that a GPU with a higher per-hour cost may still deliver a lower overall cost per token than a cheaper alternative.
Data centers are increasingly power limited, making it critical to maximize data center
inference throughput within a given power budget. Delivered inference throughput for a
given amount of energy use – in other words, energy efficiency – is critical to maximizing
data center inference throughput and ultimately revenue potential.
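A back-of-the-envelope comparison makes the point. The numbers below are purely hypothetical, but they show how a GPU with a higher hourly price can still deliver a lower cost per million tokens, and more tokens per joule, once delivered throughput is taken into account.

```python
# Hypothetical prices, throughputs, and power draws for two GPU instances.

def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000

def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    return tokens_per_second / power_watts  # 1 W = 1 J/s

gpu_a = {"price": 2.00, "tps": 1500, "watts": 400}   # cheaper per hour
gpu_b = {"price": 5.00, "tps": 6000, "watts": 700}   # pricier, higher throughput

for name, g in {"GPU A": gpu_a, "GPU B": gpu_b}.items():
    print(name,
          f"${cost_per_million_tokens(g['price'], g['tps']):.2f}/M tokens,",
          f"{tokens_per_joule(g['tps'], g['watts']):.2f} tokens/J")
# GPU B ends up cheaper per token and more energy efficient despite its price.
```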
Since their debut in data centers in 2012, NVIDIA GPUs have transformed the industry by
enabling parallel processing and significantly cutting down the time required for resource-intensive tasks. This shift has led to dramatic improvements, offering up to 30x
more performance per watt and 60x more performance per dollar compared to traditional
CPU-based systems.
Figure 3. NVIDIA Blackwell GPU Built with 208 Billion Transistors to Deliver
Unparalleled Inference Performance
Blackwell, built with 208 billion transistors, over 2.5x the transistors of its predecessor, is
the largest GPU ever made. It introduces the second-generation Transformer Engine,
combining custom Blackwell Tensor Cores with TensorRT-LLM to accelerate inference for
LLMs and Mixture of Experts (MoE) models. Blackwell Tensor Cores offer new precisions
and microscaling formats for improved accuracy and throughput. The Transformer Engine
enhances performance with micro-tensor scaling and enables FP4 AI, doubling
performance, HBM bandwidth, and model size per GPU, compared to FP8.
The NVIDIA GB200 Grace Blackwell Superchip connects two high-performance NVIDIA
Blackwell Tensor Core GPUs and an NVIDIA Grace CPU using the NVIDIA® NVLink®-C2C
interconnect that delivers 900 gigabytes per second (GB/s) of bidirectional bandwidth to
the two GPUs.
Figure 4. NVIDIA GB200 Superchip Includes Two Blackwell GPUs and One
Grace CPU
The NVIDIA Grace CPU combines 72 high-performance and energy-efficient Arm Neoverse
V2 cores, connected with the NVIDIA Scalable Coherency Fabric (SCF). The NVIDIA SCF is a
high-bandwidth, on-chip fabric that provides a total of 3.2 TB/s of bisection bandwidth —
double that of traditional CPUs.
Grace is the first data center CPU to use high-speed LPDDR5X memory with server-class reliability through mechanisms like error-correcting code (ECC). Grace delivers up to 500 GB/s of memory bandwidth while consuming just one-fifth the energy of traditional DDR
memory at similar cost.
These numerous innovations mean that NVIDIA Grace delivers outstanding performance,
memory bandwidth, and data-movement capabilities with breakthrough performance per
watt.
While selecting the right GPU is critical for AI deployments, exascale computing and
trillion-parameter AI models require seamless GPU-to-GPU communication, allowing multiple GPUs to work in tandem as a single massive GPU during inference.
The NVIDIA GB200 NVL72 connects 36 GB200 Superchips (36 Grace CPUs and 72
Blackwell GPUs) in a rack-scale design. The GB200 NVL72 is a liquid-cooled, rack-scale, 72-GPU NVLink domain that can act as a single massive GPU. The design is built on the NVIDIA NVLink Switch, which enables 130 TB/s of GPU bandwidth in the 72-GPU NVLink domain (NVL72) for model parallelism. When combined, NVLink and the NVLink Switch support multi-server clusters with 1.8 TB/s of interconnect bandwidth, scaling GPU communications and computing.
GB200 NVL72 delivers a 30X inference speedup compared to the prior generation, with 25X lower TCO and 25X less energy with the same number of GPUs, for massive models such as GPT-MoE-1.8T.
Model Size
To serve a broad set of use cases, deployment scenarios, and budgets, model developers
will often provide variants of their models in a range of sizes. For example, the Llama 3.1
family of open models is available in sizes of 8B, 70B, and 405B parameters. Generally,
within a model family, larger models will provide more accurate responses. It is important to
keep in mind, however, that larger models also require more computational resources to
generate tokens than smaller ones do.
This means that model size selection depends on both the intended use case as well as
available compute resources. For use cases that demand the highest response accuracy for
complex tasks, larger models may be the preferred choice. However, for scenarios where
output token generation speed is critical or available compute resources are limited, using
a smaller model may be preferred.
As data center GPUs continue to evolve, the compute and memory resources they offer are growing rapidly with each new generation. These next-generation GPUs are highly effective
at accelerating massive-scale inferencing, such as serving trillion-parameter models.
However, this power comes with a caveat: when serving smaller models, the large compute
and memory resources often remain underutilized, leading to higher Total Cost of
Ownership (TCO).
In these scenarios, the GPU’s resources are not being efficiently used, as large portions of
the compute capacity and memory are left idle. This results in unnecessary infrastructure
costs, creating an imbalance between deployed resources and actual workload
requirements.
NVIDIA Triton Inference Server streamlines the process of deploying models with
concurrency enabled. Developers can easily implement model concurrency by switching on
a single flag in the model’s configuration file. From there, they can specify how many concurrent model instances to deploy and which target GPUs to use. This simplicity empowers developers to leverage the full power of their GPU infrastructure without complex configuration or manual scaling processes.
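For illustration, the sketch below shows the kind of instance_group entry this refers to, written from Python so the snippet is self-contained. The model name, instance count, and GPU IDs are placeholders, and a complete config.pbtxt would also define the backend and input/output tensors; consult the Triton model configuration documentation for the full schema.

```python
from pathlib import Path

# Sketch of the instance_group section that controls model concurrency in a
# Triton config.pbtxt. Values below are placeholders.
instance_group_config = """
instance_group [
  {
    count: 2          # run two concurrent instances of this model
    kind: KIND_GPU
    gpus: [0, 1]      # target GPUs for those instances
  }
]
"""

config_path = Path("model_repository/my_model/config.pbtxt")
config_path.parent.mkdir(parents=True, exist_ok=True)
# In a real repository you would merge this into the existing configuration;
# it is written whole here only to keep the sketch runnable.
config_path.write_text(instance_group_config)
```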
In scenarios where multiple independent models need to be served but cannot all fit
concurrently on the same instance or node, NVIDIA Triton can function as a multiplexer. It
dynamically loads and unloads models based on incoming user requests, ensuring that the
required models are available at the right time. This allows organizations to maintain high
service availability and performance without the need for additional infrastructure
investments.
Figure 8. How NVIDIA Triton Can Load and Unload Models on a GPU,
Increasing Service Availability and Performance
In addition to improving throughput and reducing cost, model concurrency also addresses
situations where user demand for a service is not predictable in advance. IT teams can
begin with a single instance of a model and scale up gradually by deploying multiple
concurrent instances of the model as demand increases. The ability to adjust the number
of concurrent models makes it possible to meet changing demands while minimizing the
cost of unused compute resources.
AI models capable of processing and integrating information from multiple data modalities simultaneously, such as text, images, audio, and video, are called multimodal models. Unlike traditional unimodal AI models that work with a single input modality, multimodal models have complex inference serving requirements: the serving stack must be able to process inputs and generate outputs across multiple modalities.
The NVIDIA AI inference platform provides optimized support for multimodal models, with highly performant input encoders for audio and images built using the NVIDIA TensorRT library and text decoders optimized using the TensorRT-LLM library. These models can be deployed using NVIDIA Triton Inference Server, which supports several advanced features, such as multi-image inference, where a single request can contain multiple images, and effective KV cache reuse, where images can be shared across multiple text input tokens to help reduce latency, with performance gains proportional to the length of the reused KV cache.
Data Center AI accelerators, such as NVIDIA GPUs, utilize multiple cores to process
requests in parallel. To maximize the potential of these cores, requests are often grouped
together into batches and sent for inference. However, a critical trade-off exists between
batch size, throughput, and latency. Adjusting any two of these factors can negatively
impact the third, creating challenges for businesses seeking optimal performance.
For instance, a smaller batch size can reduce latency but leads to underutilized GPU
resources, resulting in inefficiencies and higher operational costs. On the other hand, larger
batches maximize the use of GPU resources but come at the cost of increased latency,
which can harm the user experience.
IT teams often perform A/B testing to determine the ideal balance for their use case, but
this can be a complex and time-consuming process. Leveraging an advanced inference
software platform, such as NVIDIA Triton Inference Server and NVIDIA TensorRT-LLM, can
mitigate this trade-off. The NVIDIA AI inference platform deploys sophisticated batching
techniques to help organizations strike a better balance between throughput and latency,
improving overall system performance.
Dynamic Batching
Dynamic batching creates batches based on specific criteria, such as a maximum or preferred batch size and a maximum waiting time. If a batch can be formed at the preferred size, it will be processed at that size; otherwise, the system will form the largest possible batch that meets the configured maximum batch size. By allowing requests to remain in the queue for a short time, dynamic batching can wait until additional requests arrive to create a larger batch, enhancing throughput and resource utilization and ultimately lowering cost per token.
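The policy itself is straightforward to illustrate. The sketch below is a framework-agnostic approximation of the idea, not Triton's implementation: requests accumulate in a queue until the preferred batch size is reached or the maximum queue delay expires.

```python
import queue
import time

def form_batch(request_queue: "queue.Queue",
               preferred_batch_size: int = 16,
               max_batch_size: int = 32,
               max_queue_delay_s: float = 0.005):
    """Collect requests until the preferred size is reached or the delay expires."""
    batch = [request_queue.get()]            # block until the first request arrives
    deadline = time.monotonic() + max_queue_delay_s
    while len(batch) < max_batch_size:
        if len(batch) >= preferred_batch_size:
            break                            # preferred size reached: dispatch now
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                            # max delay expired: dispatch what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```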
Sequence Batching
For use cases that require specific request sequencing, such as video streaming, sequence
batching ensures that related requests (e.g., frames in a video) are processed in a
meaningful order. When requests are correlated, such as those in a video stream where the
model needs to maintain state between frames, sequence batching ensures that requests
are processed sequentially on the same instance. This guarantees the correct output while
optimizing performance.
Inflight Batching
Inflight batching enhances traditional batch processing by enabling continuous request handling. With inflight batching, the TensorRT-LLM runtime (an NVIDIA SDK that optimizes
LLMs for inference) processes requests as soon as they are ready, without waiting for the
entire batch to complete. This allows for faster processing and higher throughput by
immediately initiating new requests while others are still being processed.
Figure 12. How Inflight Batching Can Evict Requests within a Batch Once
Completed and Replace them with New Ones
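Conceptually, the scheduler maintains a pool of active sequences, frees a slot the moment a sequence finishes, and admits waiting requests into the next iteration. The sketch below illustrates that loop in simplified form; it is not the TensorRT-LLM runtime itself.

```python
from collections import deque

def continuous_batching(pending, step_fn, max_batch_size: int = 8):
    """Simplified in-flight batching loop.

    `pending` holds waiting requests; `step_fn(request)` generates one token
    for a request and returns True when that request has finished.
    """
    active, completed = [], []
    pending = deque(pending)
    while pending or active:
        # Fill freed slots immediately instead of waiting for the whole
        # batch to finish.
        while pending and len(active) < max_batch_size:
            active.append(pending.popleft())
        # One decode iteration over the current batch.
        still_running = []
        for request in active:
            if step_fn(request):
                completed.append(request)    # finished: its slot is freed
            else:
                still_running.append(request)
        active = still_running
    return completed
```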
Efficient use of compute resources requires careful management of batch size, throughput, and latency. Advanced batching techniques offered by the NVIDIA AI inference platform, such as dynamic batching, sequence batching, and inflight batching, provide organizations with the flexibility to optimize performance, reduce costs, and improve the user experience. These technologies help strike a balance between throughput and latency, enabling businesses to scale AI-driven solutions more effectively.
As the size and complexity of large language models (LLMs) increase, ensuring that these
models can be deployed e ciently across multiple GPUs becomes crucial. When models
exceed the memory capacity of a single GPU, parallelization techniques are employed to
split the workload and optimize performance. For executives and IT leaders, understanding
the primary parallelism methods and their impact on throughput and user interactivity is
key to making informed infrastructure decisions. Below are the primary methods for
parallelizing inference in large models and examples from leading case studies.
Data parallelism involves duplicating the model across multiple GPUs or GPU clusters. Each
GPU independently processes groups of user requests, ensuring that no communication is
required between these request groups. This method scales linearly with the number of
GPUs used, meaning the number of requests served increases directly with the GPU
resources allocated. However, it is important to note that data parallelism alone is usually
insufficient for the latest large-scale LLMs, as their model weights often cannot fit onto a single GPU. Consequently, data parallelism (DP) is commonly used alongside other parallelism techniques.
Impact on Performance:
● GPU Throughput: Scales roughly linearly with the number of GPUs, since each replica serves its own group of requests independently.
● User Interactivity: Largely unchanged, because each individual request is still processed by a single model replica.
In tensor parallelism, the model parameters are split across multiple GPUs, with user
requests shared across GPUs or GPU clusters. The results of the computations performed
on different GPUs are combined over the GPU-to-GPU network. This method can improve
user interactivity, especially for transformer-based models like Llama 405B and GPT4 1.8T
MoE (1.8 Trillion Parameters Mixture of Expert Model), by ensuring that each request gets
processed with more GPU resources, thus speeding up processing time.
Impact on Performance:
● GPU Throughput: Scaling TP to large GPU counts, without a fast interconnect like
NVIDIA NVLink, can reduce throughput due to increased communication overhead.
● User Interactivity: Enhanced as user requests are processed faster with more GPU
resources allocated to each request.
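A toy NumPy example illustrates the mechanics (arrays standing in for GPUs in a single process): the weight matrix is split column-wise, each shard produces a partial result, and the shards are gathered over the interconnect, which is exactly the communication step that benefits from a fast fabric like NVLink.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, num_gpus = 4, 1024, 4096, 4

x = rng.standard_normal((batch, d_in))     # activations (replicated on every "GPU")
w = rng.standard_normal((d_in, d_out))     # full weight matrix

# Split the weight columns across "GPUs"; each shard holds d_out / num_gpus columns.
shards = np.split(w, num_gpus, axis=1)

# Each GPU computes its partial output independently...
partials = [x @ shard for shard in shards]

# ...and the shards are gathered over the GPU-to-GPU interconnect (all-gather).
y_tp = np.concatenate(partials, axis=1)

assert np.allclose(y_tp, x @ w)            # matches the single-GPU result
```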
Pipeline parallelism divides the model layers, with each group of layers assigned to different GPUs. The model processes requests sequentially across GPUs, with each GPU performing computations on its assigned portion of the model. This method is beneficial for distributing large model weights that do not fit on a single GPU but has limitations in terms of efficiency.
Impact on Performance:
● GPU Throughput: May result in lower efficiency, as model weights are distributed across GPUs and each pipeline stage must wait for the preceding one to finish its portion of the work.
● User Interactivity: Does not enable significant optimization of user interactivity, as processing must proceed sequentially across GPUs.
Expert parallelism involves routing user requests to specialized "experts" within the model.
By limiting each request to a smaller set of model parameters (i.e., specific experts), the system reduces computational overhead and optimizes processing. Requests are processed by individual experts before being recombined at the final output stage, which
requires high-bandwidth GPU-to-GPU communication.
Impact on Performance:
● GPU Throughput: Improved, since each request activates only a subset of the model's parameters, reducing the compute required per request.
● User Interactivity: Depends on high-bandwidth GPU-to-GPU communication, since expert outputs must be recombined at the final output stage.
Figure 16. Using Expert Parallelism on a Deep Neural Network that Consists
of Four Experts
The most effective way to optimize the performance of LLMs across multiple GPUs is by combining multiple parallelism techniques. By doing so, the trade-off between throughput
and user interactivity can be minimized, ensuring both high performance and responsive
user experiences.
For instance, when serving the GPT 1.8T MoE model with 16 experts, using 64 GPUs, each with 192 GB of memory, we observe that Expert + Pipeline Parallelism (EP16PP4) offers a 2x improvement in user interactivity with a negligible reduction in GPU throughput compared to expert-only parallelism, while Tensor + Expert + Pipeline Parallelism (TP4EP4PP4) delivers 3x more GPU throughput compared to tensor-only parallelism while maintaining user interactivity.
For CIOs and IT leaders overseeing the deployment of LLMs, it is essential to understand
the trade-offs and benefits of various parallelization strategies. Combining parallelism methods like data, tensor, pipeline, and expert parallelism allows organizations to optimize GPU resource utilization and improve both throughput and user interactivity. In practice, well-planned configurations can deliver optimal results, balancing high performance with responsive user experiences across multiple GPUs. As demand for larger and more complex models increases, these parallelization techniques will play a critical role in ensuring scalable and efficient AI model deployment.
As IT leaders roll out AI applications, one of the most challenging decisions they face is
forecasting user demand and understanding how that demand will fluctuate over time. These forecasts significantly influence infrastructure decisions, particularly in relation to
provisioned resources like GPUs, which directly impact both cost and performance.
Balancing these elements is key to ensuring that AI systems can scale dynamically while
controlling overhead.
The NVIDIA AI inference platform, including NVIDIA Triton as well as NVIDIA NIM, supports
Kubernetes to facilitate the scaling process. As inference requests increase, Triton metrics
are scraped by Prometheus and sent to Kubernetes Horizontal Pod Autoscaler (HPA), which
adds more pods to the deployment, each with one or more GPUs. One example of a custom metric that can be scraped from Triton using Prometheus and sent to the Kubernetes HPA to inform scaling decisions is the queue-to-compute ratio. This ratio reflects the response time of inference requests. It’s defined as the queue time divided by the compute time for an inference request.
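As an illustration, the sketch below polls Triton's Prometheus metrics endpoint (port 8002 by default) and derives the queue-to-compute ratio from the cumulative nv_inference_queue_duration_us and nv_inference_compute_infer_duration_us counters. The URL is a placeholder, and in a real deployment this calculation would typically live in Prometheus recording rules consumed by the HPA through a metrics adapter rather than in application code.

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"   # Triton's default metrics endpoint

def scrape(metric_name: str, text: str) -> float:
    """Sum all labeled samples of a cumulative Triton counter (microseconds)."""
    pattern = rf'^{metric_name}{{[^}}]*}} ([0-9.e+]+)$'
    return sum(float(v) for v in re.findall(pattern, text, flags=re.MULTILINE))

def queue_to_compute_ratio() -> float:
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    queue_us = scrape("nv_inference_queue_duration_us", text)
    compute_us = scrape("nv_inference_compute_infer_duration_us", text)
    # Counters are cumulative since server start; production setups use rate
    # deltas. A ratio well above ~1 means requests spend more time waiting
    # than computing: a signal for the HPA to add replicas.
    return queue_us / compute_us if compute_us else 0.0
```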
For further details on optimizing AI deployments with NVIDIA Triton and Kubernetes, visit
Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes. For further
details on optimizing AI deployments with NVIDIA NIM and Kubernetes, visit Autoscaling of
NVIDIA NIM on Kubernetes and Managing AI Inference Pipelines on Kubernetes with
NVIDIA NIM Operator.
AI and ML systems are rarely deployed as standalone models. Instead, they are often part
of a broader, more complex pipeline that integrates various models, pre-processing steps,
and post-processing tasks. This approach, known as Model Ensembles, has become a
cornerstone of modern AI workflows, especially in enterprise applications.
By stitching together these individual steps into a cohesive pipeline, businesses can ensure the smooth flow of data through the models while reducing latency and optimizing
resource usage.
Building and managing these complex model pipelines can be challenging. However, NVIDIA AI inference platform tools like the Triton Inference Server offer powerful solutions for automating the process, providing advanced capabilities for orchestrating model ensembles.
Triton’s Model Ensembles feature eliminates the need for writing manual code to manage
each step, reducing complexity and minimizing the risk of errors. Additionally, Triton
supports running pre- and post-processing on CPUs, while the core AI model can run on
GPUs, providing flexibility in balancing processing power. Triton also supports adding advanced features to the ensemble, such as conditional logic and loops, allowing developers to build more sophisticated and flexible AI workflows.
The use of Model Ensembles comes with several key benefits, especially when integrated
with Triton:
To better understand how Model Ensembles work, let's consider an example involving
generative AI. Imagine a text-to-image application where an input text is converted into a
synthesized image. The pipeline for such an application typically consists of two main
components: an LLM for encoding the input text and a diffusion model for generating the
image.
Before the input text is fed to the LLM, some pre-processing is needed. This could involve
cleaning the text, tokenizing it, or formatting it in a way that’s compatible with the LLM.
Similarly, the output image might require post-processing, such as resizing or adding
effects, before it can be used in the final application.
Using Model Ensembles in this case would allow the text-to-image process to be fully
automated and optimized. Triton could seamlessly connect each step, from text
preprocessing to image generation, and even handle the post-processing, all within one
unified pipeline.
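For illustration only, the sketch below lays out those stages as plain Python. Each function is a hypothetical placeholder for a model or processing step that Triton would chain server-side through its ensemble definition, with pre- and post-processing free to run on CPU while the models run on GPU.

```python
def text_to_image_pipeline(raw_text: str, text_encoder, diffusion_model):
    """Hypothetical stages of the ensemble; Triton would chain these server-side."""
    # 1. Pre-processing (can run on CPU): clean and tokenize the prompt.
    tokens = raw_text.strip().lower().split()

    # 2. Text encoder / LLM (GPU): turn tokens into a conditioning embedding.
    text_embedding = text_encoder(tokens)

    # 3. Diffusion model (GPU): generate the image from the embedding.
    image = diffusion_model(text_embedding)

    # 4. Post-processing (CPU): resize or apply effects before returning.
    return postprocess(image)

def postprocess(image):
    # Placeholder for resizing or adding effects.
    return image
```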
By integrating multiple models and pre- and post-processing steps into a single, optimized
pipeline, IT leaders can enhance the efficiency of their AI applications, reduce latency, and
minimize resource usage.
The NVIDIA AI inference platform provides powerful capabilities for managing these ensembles, offering flexibility in resource allocation and simplifying the integration of different components. With its low-code approach and automated optimization, it enables
teams to build and scale AI systems with greater ease, making it an essential tool for
enterprises looking to increase the performance of their AI models.
LLM inference typically involves two key phases: prefill and decode. In the prefill phase, the system computes the contextual understanding of the user's input (KV cache), which is computationally intensive. The decode phase follows, generating tokens sequentially, with the first token derived from the KV cache. However, traditional methods often struggle with balancing the heavy computational demand of the prefill phase and the lighter load of the decode phase.
Chunked Prefill is an optimization that divides the prefill phase into smaller chunks, improving parallelization with the decode phase and reducing bottlenecks. This approach helps in handling longer contexts and higher concurrency levels while maximizing GPU memory and compute resources. It also offers flexibility by decoupling memory usage from input sequence length, making it easier to process large requests without straining memory capacity. With dynamic chunk sizing, TensorRT-LLM intelligently adjusts chunk sizes based on GPU utilization, simplifying deployment and eliminating the need for manual configuration.
As AI models evolve, the size of context windows — allowing for better cognitive
understanding — has grown exponentially. Llama 2 started with 4K tokens, and the recent
Llama 3.1 expanded this to an impressive 128K tokens. Handling these long sequences in
real-time inference scenarios presents unique challenges, particularly with GPU resource
allocation.
KV Cache Early Reuse optimizes this process by allowing portions of the cache to be reused
as they are being generated, rather than waiting for full completion. This technique
significantly accelerates inference, especially in scenarios where system prompts or predefined instructions are required. For instance, in enterprise chatbots, where user
requests often share the same system prompt, this feature can accelerate TTFT by up to
5x during periods of high demand, providing a faster and more responsive user experience.
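The underlying idea can be illustrated with a simple prefix cache. This is a conceptual sketch rather than the TensorRT-LLM implementation; prefill and decode here are hypothetical functions.

```python
# Conceptual prefix-cache illustration of KV cache reuse. `prefill` returns
# the KV cache for a token prefix (optionally continuing from an existing
# cache), and `decode` generates the response from that cache.

kv_cache_store = {}  # shared system prompt -> precomputed KV cache

def generate_with_prefix_reuse(system_prompt, user_request, prefill, decode):
    # Requests that share the same system prompt reuse its KV cache instead
    # of recomputing it, which is what shortens TTFT under high load.
    if system_prompt not in kv_cache_store:
        kv_cache_store[system_prompt] = prefill(system_prompt)
    shared_kv = kv_cache_store[system_prompt]

    # Only the user-specific portion of the prompt still needs prefill work.
    request_kv = prefill(user_request, past_kv=shared_kv)
    return decode(request_kv)
```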
Traditional inference setups often co-locate the prefill and decode phases on the same GPU, leading to inefficient resource allocation and suboptimal throughput. The NVIDIA Triton Disaggregated Serving (DistServe) strategy decouples these phases, allowing AI inference teams to allocate resources independently based on the specific needs of each phase. This enables independent resourcing and decoupled scaling, meaning that more GPUs can be allocated to the prefill phase to optimize TTFT, while additional GPUs can be dedicated to the decode phase to improve Time Between Tokens (TBT).
The process of generating tokens in an autoregressive manner (one at a time) can be slow
and inefficient. Speculative Decoding optimizes this by generating multiple potential token
sequences in parallel, reducing the time required for token generation. TensorRT-LLM
integrates various speculative decoding methods, such as Draft Target and Eagle Decoding,
which allow the system to predict multiple tokens and select the most appropriate one.
For models like Llama 3.3 70B, speculative decoding leads to a significant 3.5x increase in tokens per second, improving throughput and user experience without compromising output quality. This technique is particularly beneficial in low-latency, high-throughput environments, where maximizing the efficiency of each computational step is essential.
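The draft-and-verify idea behind speculative decoding can be sketched in a few lines. The version below is a simplified, greedy illustration; production implementations such as those in TensorRT-LLM verify all draft tokens in a single target-model pass rather than one at a time.

```python
# `draft_model.next_token` and `target_model.next_token` are hypothetical
# greedy single-token predictors; `prompt` is a list of token IDs.

def speculative_decode(prompt, draft_model, target_model,
                       num_draft_tokens: int = 4, max_new_tokens: int = 64):
    output = list(prompt)
    while len(output) - len(prompt) < max_new_tokens:
        # 1. The small draft model cheaply proposes several tokens ahead.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_model.next_token(output + draft))

        # 2. The large target model checks the proposals and keeps the longest
        #    prefix it agrees with, so several tokens can be accepted per step.
        accepted = 0
        for i, token in enumerate(draft):
            if target_model.next_token(output + draft[:i]) == token:
                accepted += 1
            else:
                break
        output.extend(draft[:accepted])

        # 3. If nothing was accepted, take one target-model token to guarantee
        #    forward progress while preserving output quality.
        if accepted == 0:
            output.append(target_model.next_token(output))
    return output
```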
The NVIDIA AI inference platform is built to support organizations at any stage of their AI
inference journey. For those still in the experimentation phase or focused on faster time to
market, the strategies outlined in the chapter, Deployment Factors Impacting Inference
Deployment, provide an excellent starting point. However, for organizations where AI
inference represents a major cost driver impacting gross margins, the advanced performance and cost-saving techniques covered in the chapter, Unlocking AI Inference Performance and Cost Efficiency in the Cloud, will help maximize system efficiency and performance while minimizing costs.
The NVIDIA AI inference platform offers flexible deployment options for inference to meet a broad range of business and IT requirements.
Enterprises seeking the fastest time to value can leverage NVIDIA NIM, which offers
prepackaged, optimized inference microservices for running the latest AI foundation
models on NVIDIA accelerated infrastructure anywhere.
For maximum flexibility, configurability, and extensibility to fit your unique AI inference needs, NVIDIA offers NVIDIA Triton Inference Server and NVIDIA TensorRT, which provide the ability to customize and optimize your inference serving platform for your specific requirements.
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality,
condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or
implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility
for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any
infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to
develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this
document, at any time without notice.
Customers should obtain the latest relevant information before placing orders and should verify that such information is
current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the
time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized
representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer
general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No
contractual obligations are formed either directly or indirectly by this document.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual
property right under this document. Information published by NVIDIA regarding third-party products or services does not
constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such
information may require a license from a third party under the patents or other intellectual property rights of the third
party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced
without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated
conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS,
AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO
WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY
DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL
DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS
DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative
liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the
product.
Trademarks
NVIDIA, the NVIDIA logo, NVIDIA Grace GPU, CUDA, NVLink, NVIDIA GPU Cloud, and NSight are trademarks and/or registered
trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks
of the respective companies with which they are associated.
VESA DisplayPort
DisplayPort and DisplayPort Compliance Logo, DisplayPort Compliance Logo for Dual-mode Sources, and DisplayPort
Compliance Logo for Active Cables are trademarks owned by the Video Electronics Standards Association in the United
States and other countries.
HDMI
HDMI, the HDMI logo, and High-Definition Multimedia Interface are trademarks or registered trademarks of HDMI Licensing
LLC.
Arm
Arm, AMBA, and Arm Powered are registered trademarks of Arm Limited. Cortex, MPCore, and Mali are trademarks of Arm
Limited. All other brands or product names are the property of their respective holders. “Arm” is used to represent Arm
Holdings plc; its operating company Arm Limited; and the regional subsidiaries Arm Inc.; Arm KK; Arm Korea Limited.; Arm
Taiwan Limited; Arm France SAS; Arm Consulting (Shanghai) Co. Ltd.; Arm Germany GmbH; Arm Embedded Technologies
Pvt. Ltd.; Arm Norway, AS, and Arm Sweden AB.
OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Copyright