AI Inference

Inference can be deployed in many ways, depending on the use case. Offline processing of data is best done at large batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and Large Language Model (LLM) deployments seek to deliver great experiences by lowering latency, so developers and infrastructure managers must strike a balance between throughput and latency to deliver responsive user experiences and the best possible throughput while containing deployment costs.


When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency sections below show the best measured throughput under a one-second time-to-first-token limit, which delivers low latency for most users while making efficient use of compute resources.
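This tuning loop can be sketched in a few lines. The sweep values below are illustrative placeholders, not numbers from the tables that follow, and `best_config` is a hypothetical helper, not part of any NVIDIA tooling:

```python
# Sketch: pick the highest-throughput batch size whose time-to-first-token
# (TTFT) stays within a latency budget. All measurements are illustrative.

def best_config(measurements, ttft_limit_s):
    """measurements: list of (batch_size, ttft_s, tokens_per_sec) tuples."""
    feasible = [m for m in measurements if m[1] <= ttft_limit_s]
    if not feasible:
        return None  # no batch size meets the latency budget
    return max(feasible, key=lambda m: m[2])

# Illustrative sweep: larger batches raise throughput but also raise TTFT.
sweep = [(8, 0.12, 2_100), (32, 0.35, 6_400), (128, 0.9, 14_800), (512, 2.4, 21_000)]
print(best_config(sweep, ttft_limit_s=1.0))  # (128, 0.9, 14800)
```

Under a one-second budget the sweep settles on the largest batch size that still answers quickly, which is exactly the trade-off described above.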



MLPerf Inference v4.1 Performance Benchmarks

Offline Scenario, Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
Llama2 70B | 11,264 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 34,864 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 24,525 tokens/sec | 8x H100 | NVIDIA DGX H100 | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 4,068 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Mixtral 8x7B | 59,335 tokens/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 52,818 tokens/sec | 8x H100 | SMC H100 | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 8,021 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 18 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 2.3 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
ResNet-50 | 768,235 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 710,521 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 95,105 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | ImageNet (224x224)
RetinaNet | 15,015 samples/sec | 8x H200 | ThinkSystem SR685a V3 | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 14,538 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 1,923 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | OpenImages (800x800)
BERT | 73,791 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | SQuAD v1.1
BERT | 72,876 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | SQuAD v1.1
BERT | 9,864 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | SQuAD v1.1
GPT-J | 20,552 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
GPT-J | 19,878 tokens/sec | 8x H100 | ESC-N8-E11 | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
GPT-J | 2,804 tokens/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
DLRMv2 | 639,512 samples/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 602,108 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 86,731 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
3D-UNET | 55 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | 0.863 DICE mean | KiTS 2019
3D-UNET | 52 samples/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.863 DICE mean | KiTS 2019
3D-UNET | 7 samples/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.863 DICE mean | KiTS 2019

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints | Dataset
Llama2 70B | 10,756 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 32,790 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 23,700 tokens/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 3,884 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Mixtral 8x7B | 57,177 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 51,028 tokens/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 7,450 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 17 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 2.02 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
ResNet-50 | 681,328 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 634,193 queries/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 77,012 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
RetinaNet | 14,012 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 13,979 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 1,731 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
BERT | 58,091 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | 130 ms | SQuAD v1.1
BERT | 58,929 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1
BERT | 7,103 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | 130 ms | SQuAD v1.1
GPT-J | 20,139 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
GPT-J | 19,811 queries/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
GPT-J | 2,513 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
DLRMv2 | 585,209 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 556,101 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 81,010 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
Llama2 70B | 25,262 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca
Mixtral 8x7B | 48,988 tokens/sec | 8 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 13 samples/sec | 0.002 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val
ResNet-50 | 556,234 samples/sec | 112 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224)
RetinaNet | 10,803 samples/sec | 2 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800)
BERT | 54,063 samples/sec | 10 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1
GPT-J | 13,097 samples/sec | 3 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail
DLRMv2 | 503,719 samples/sec | 84 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset
3D-UNET | 42 samples/sec | 0.009 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | KiTS 2019

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
Llama2 70B | 23,113 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca
Mixtral 8x7B | 45,497 tokens/sec | 7 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 13 queries/sec | 0.002 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val
ResNet-50 | 480,131 queries/sec | 96 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224)
RetinaNet | 9,603 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800)
BERT | 41,599 queries/sec | 8 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1
GPT-J | 11,701 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail
DLRMv2 | 420,107 queries/sec | 69 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset

MLPerf™ v4.1 Inference Closed: Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP32 and 99.9% of FP32, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 4.1-0005, 4.1-0021, 4.1-0027, 4.1-0037, 4.1-0038, 4.1-0043, 4.1-0044, 4.1-0046, 4.1-0048, 4.1-0049, 4.1-0053, 4.1-0057, 4.1-0060, 4.1-0063, 4.1-0064, 4.1-0065, 4.1-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA B200 is a preview submission.
Llama2 70B Max Sequence Length = 1,024.
Mixtral 8x7B Max Sequence Length = 2,048.
BERT-Large Max Sequence Length = 384.
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here
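As a rough sketch of how a TTFT/TPOT constraint like the server scenario's 2000 ms/200 ms limit applies to a single request, the helper below and its sample values are hypothetical, not the MLPerf harness:

```python
# Sketch: check whether one request satisfies MLPerf-style server constraints
# of TTFT <= 2000 ms and TPOT <= 200 ms. Sample values are illustrative.

def meets_constraints(ttft_ms, total_latency_ms, output_tokens,
                      ttft_limit_ms=2000, tpot_limit_ms=200):
    # Time per output token (TPOT) is measured over the decode phase,
    # i.e. the time after the first token has been produced.
    tpot_ms = (total_latency_ms - ttft_ms) / max(output_tokens - 1, 1)
    return ttft_ms <= ttft_limit_ms and tpot_ms <= tpot_limit_ms

print(meets_constraints(ttft_ms=1500, total_latency_ms=21000, output_tokens=128))  # True
print(meets_constraints(ttft_ms=2500, total_latency_ms=21000, output_tokens=128))  # False
```

The first request passes (TPOT ≈ 153.5 ms); the second fails on the time-to-first-token limit alone.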

LLM Inference Performance of NVIDIA Data Center Products

H200 Inference Performance - High Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,953 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,974 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 4,947 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 679 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 5,066 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,481 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,927 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 482 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,924 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 128 | 2048 | 7,939 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 128 | 4096 | 6,297 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 460 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 560 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 500 | 2000 | 6,683 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,704 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 3,835 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 633 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 128 | 28,126 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 24,158 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 16,460 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,661 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,836 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 20,345 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,801 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 11,073 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,741 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,796 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 14,830 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,520 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,995 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,295 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 11,983 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,254 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,531 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,396 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 13,796 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog.
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency).
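The throughput definition in the note above, tokens/s = total generated tokens / total latency, can be checked with a worked example; all numbers here are made up for illustration:

```python
# Worked example of the note above:
#   tokens/s = total generated tokens / total latency (inclusive of first token).
# All numbers are illustrative, not taken from the tables.

concurrent_requests = 64
output_len = 2048            # generated tokens per request
total_latency_s = 32.0       # wall-clock time for the whole batch, incl. first token

total_generated = concurrent_requests * output_len   # 131072 tokens
throughput = total_generated / total_latency_s       # tokens/s
print(round(throughput))  # 4096
```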

H100 Inference Performance - High Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 70B | 1 | 2 | 128 | 128 | 6,399 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 128 | 4096 | 3,581 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 2048 | 128 | 774 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 500 | 2000 | 4,776 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,247 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 5,166 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 915 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 128 | 128 | 27,156 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 128 | 2048 | 23,010 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 8 | 128 | 4096 | 47,834 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 2048 | 128 | 3,368 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 5000 | 500 | 3,592 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 500 | 2000 | 18,186 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 1000 | 1000 | 15,932 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 10,465 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 1,739 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - High Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 8B | 1 | 1 | 128 | 128 | 8,983 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,297 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 2,989 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,056 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 972 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,264 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,014 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,163 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 326 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,655 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism

H200 Inference Performance - High Throughput at Low Latency Under 1 Second

Model | Batch Size | TP | Input Length | Output Length | Time to 1st Token | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 512 | 1 | 128 | 128 | 0.64 seconds | 25,126 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
GPT-J 6B | 64 | 1 | 128 | 2048 | 0.08 seconds | 7,719 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
GPT-J 6B | 32 | 1 | 2048 | 128 | 0.68 seconds | 2,469 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
GPT-J 6B | 32 | 1 | 2048 | 2048 | 0.68 seconds | 3,167 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 512 | 1 | 128 | 128 | 0.84 seconds | 19,975 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 64 | 1 | 128 | 2048 | 0.11 seconds | 7,149 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 32 | 1 | 2048 | 128 | 0.9 seconds | 2,101 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 32 | 1 | 2048 | 2048 | 0.9 seconds | 3,008 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 64 | 1 | 128 | 128 | 0.92 seconds | 2,044 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 64 | 1 | 128 | 2048 | 0.93 seconds | 2,238 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 4 | 1 | 2048 | 128 | 0.95 seconds | 128 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 16 | 8 | 2048 | 2048 | 0.97 seconds | 173 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 32 | 4 | 128 | 128 | 0.36 seconds | 365 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 64 | 8 | 128 | 2048 | 0.43 seconds | 408 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 4 | 4 | 2048 | 128 | 0.71 seconds | 43 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 4 | 4 | 2048 | 2048 | 0.71 seconds | 53 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200

TP: Tensor Parallelism
Batch size per GPU
Low Latency Target: Highest measured throughput with less than 1 second 1st token latency

H100 Inference Performance - High Throughput at Low Latency Under 1 Second

Model | Batch Size | TP | Input Length | Output Length | Time to 1st Token | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 512 | 1 | 128 | 128 | 0.63 seconds | 24,167 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
GPT-J 6B | 120 | 1 | 128 | 2048 | 0.16 seconds | 7,351 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
GPT-J 6B | 32 | 1 | 2048 | 128 | 0.67 seconds | 2,257 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
GPT-J 6B | 32 | 1 | 2048 | 2048 | 0.68 seconds | 2,710 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 512 | 1 | 128 | 128 | 0.83 seconds | 19,258 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 120 | 1 | 128 | 2048 | 0.2 seconds | 6,944 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 32 | 1 | 2048 | 128 | 0.89 seconds | 1,904 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 32 | 1 | 2048 | 2048 | 0.89 seconds | 2,484 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 64 | 1 | 128 | 128 | 0.92 seconds | 1,702 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 128 | 4 | 128 | 2048 | 0.73 seconds | 1,494 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 4 | 8 | 2048 | 128 | 0.74 seconds | 105 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 8 | 4 | 2048 | 2048 | 0.74 seconds | 141 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 64 | 4 | 128 | 128 | 0.71 seconds | 372 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 64 | 4 | 128 | 2048 | 0.7 seconds | 351 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 8 | 8 | 2048 | 128 | 0.87 seconds | 45 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 8 | 8 | 2048 | 2048 | 0.87 seconds | 61 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB

TP: Tensor Parallelism
Batch size per GPU
Low Latency Target: Highest measured throughput with less than 1 second 1st token latency

Inference Performance of NVIDIA Data Center Products

H200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 4.33 images/sec | - | 231.26 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200
Stable Diffusion v2.1 (512x512) | 4 | 6.8 images/sec | - | 588.08 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200
Stable Diffusion XL | 1 | 0.86 images/sec | - | 1157.27 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA H200
ResNet-50v1.5 | 8 | 21,347 images/sec | 70 images/sec/watt | 0.37 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
ResNet-50v1.5 | 128 | 63,356 images/sec | 104 images/sec/watt | 2.02 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
BERT-BASE | 8 | 9,390 sequences/sec | 21 sequences/sec/watt | 0.85 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
BERT-BASE | 128 | 25,341 sequences/sec | 38 sequences/sec/watt | 5.05 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
BERT-LARGE | 8 | 4,034 sequences/sec | 6 sequences/sec/watt | 1.98 | 1x H200 | DGX H200 | 24.09-py3 | Mixed | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
BERT-LARGE | 128 | 8,374 sequences/sec | 13 sequences/sec/watt | 15.28 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
EfficientNet-B0 | 8 | 16,634 images/sec | 76 images/sec/watt | 0.48 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
EfficientNet-B0 | 128 | 56,960 images/sec | 122 images/sec/watt | 2.25 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
EfficientNet-B4 | 8 | 4,525 images/sec | 14 images/sec/watt | 1.77 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
EfficientNet-B4 | 128 | 8,940 images/sec | 15 images/sec/watt | 14.32 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Base | 8 | 5,083 samples/sec | 11 samples/sec/watt | 1.57 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Base | 32 | 8,304 samples/sec | 12 samples/sec/watt | 3.85 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Large | 8 | 3,435 samples/sec | 6 samples/sec/watt | 2.33 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Large | 32 | 4,732 samples/sec | 7 samples/sec/watt | 6.76 | 1x H200 | DGX H200 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Base | 8 | 8,948 samples/sec | 19 samples/sec/watt | 0.89 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Base | 64 | 15,403 samples/sec | 23 samples/sec/watt | 4.16 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Large | 8 | 3,743 samples/sec | 6 samples/sec/watt | 2.14 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Large | 64 | 5,415 samples/sec | 8 samples/sec/watt | 11.82 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
Megatron BERT Large QAT | 8 | 4,966 sequences/sec | 13 sequences/sec/watt | 1.61 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
Megatron BERT Large QAT | 128 | 12,481 sequences/sec | 18 sequences/sec/watt | 10.26 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
QuartzNet | 8 | 6,780 samples/sec | 24 samples/sec/watt | 1.18 | 1x H200 | DGX H200 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA H200
QuartzNet | 128 | 33,906 samples/sec | 89 samples/sec/watt | 3.78 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
RetinaNet-RN34 | 8 | 2,967 images/sec | 9 images/sec/watt | 2.7 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

GH200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 4.27 images/sec | - | 234.4 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
Stable Diffusion v2.1 (512x512) | 4 | 5.82 images/sec | - | 687.91 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
Stable Diffusion XL | 1 | 0.68 images/sec | - | 1149.44 | 1x GH200 | NVIDIA P3880 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | GH200 96GB
ResNet-50v1.5 | 8 | 20,979 images/sec | 61 images/sec/watt | 0.38 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
ResNet-50v1.5 | 128 | 63,043 images/sec | 99 images/sec/watt | 2.03 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
BERT-BASE | 8 | 9,593 sequences/sec | 22 sequences/sec/watt | 0.83 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
BERT-BASE | 128 | 20,399 sequences/sec | 45 sequences/sec/watt | 6.27 | 1x GH200 | NVIDIA P3880 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | GH200 96GB
BERT-LARGE | 8 | 3,625 sequences/sec | 7 sequences/sec/watt | 2.21 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
BERT-LARGE | 128 | 7,285 sequences/sec | 14 sequences/sec/watt | 17.57 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
EfficientNet-B0 | 8 | 16,695 images/sec | 67 images/sec/watt | 0.48 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
EfficientNet-B0 | 128 | 56,674 images/sec | 113 images/sec/watt | 2.26 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
EfficientNet-B4 | 8 | 4,531 images/sec | 13 images/sec/watt | 1.77 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
EfficientNet-B4 | 128 | 8,784 images/sec | 14 images/sec/watt | 14.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Base | 8 | 5,106 samples/sec | 10 samples/sec/watt | 1.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Base | 32 | 8,197 samples/sec | 12 samples/sec/watt | 3.9 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Large | 8 | 3,403 samples/sec | 6 samples/sec/watt | 2.35 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Large | 32 | 4,846 samples/sec | 6 samples/sec/watt | 6.6 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Base | 8 | 8,990 samples/sec | 18 samples/sec/watt | 0.89 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Base | 64 | 15,562 samples/sec | 21 samples/sec/watt | 4.11 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Large | 8 | 3,707 samples/sec | 6 samples/sec/watt | 2.16 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Large | 64 | 5,703 samples/sec | 7 samples/sec/watt | 11.22 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
Megatron BERT Large QAT | 8 | 4,927 sequences/sec | 12 sequences/sec/watt | 1.62 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
Megatron BERT Large QAT | 128 | 10,896 sequences/sec | 19 sequences/sec/watt | 11.75 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
QuartzNet | 8 | 6,688 samples/sec | 22 samples/sec/watt | 1.2 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
QuartzNet | 128 | 34,272 samples/sec | 85 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
RetinaNet-RN34 | 8 | 2,945 images/sec | 4 images/sec/watt | 2.72 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

H100 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 4.22 images/sec | - | 236.8 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
| | 4 | 6.41 images/sec | - | 624.6 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 0.83 images/sec | - | 1210.08 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100 SXM5-80GB |
| ResNet-50v1.5 | 8 | 21,136 images/sec | 70 images/sec/watt | 0.38 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 128 | 58,139 images/sec | 102 images/sec/watt | 2.2 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| BERT-BASE | 8 | 9,505 sequences/sec | 19 sequences/sec/watt | 0.84 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 128 | 23,883 sequences/sec | 36 sequences/sec/watt | 5.36 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| BERT-LARGE | 8 | 3,979 sequences/sec | 8 sequences/sec/watt | 2.01 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100 SXM5-80GB |
| | 128 | 7,999 sequences/sec | 12 sequences/sec/watt | 16 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| EfficientNet-B0 | 8 | 16,279 images/sec | 62 images/sec/watt | 0.49 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 128 | 55,100 images/sec | 113 images/sec/watt | 2.32 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| EfficientNet-B4 | 8 | 4,542 images/sec | 13 images/sec/watt | 1.76 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 128 | 8,519 images/sec | 15 images/sec/watt | 15.03 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| HF Swin Base | 8 | 5,055 samples/sec | 9 samples/sec/watt | 1.58 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 32 | 7,819 samples/sec | 12 samples/sec/watt | 4.09 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| HF Swin Large | 8 | 3,313 samples/sec | 6 samples/sec/watt | 2.41 | 1x H100 | DGX H100 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 32 | 4,446 samples/sec | 6 samples/sec/watt | 7.2 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| HF ViT Base | 8 | 9,027 samples/sec | 19 samples/sec/watt | 0.89 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 64 | 14,992 samples/sec | 22 samples/sec/watt | 4.27 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| HF ViT Large | 8 | 3,707 samples/sec | 6 samples/sec/watt | 2.16 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 64 | 5,348 samples/sec | 8 samples/sec/watt | 11.97 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| Megatron BERT Large QAT | 8 | 4,571 sequences/sec | 12 sequences/sec/watt | 1.75 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 128 | 12,005 sequences/sec | 17 sequences/sec/watt | 10.66 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| QuartzNet | 8 | 6,697 samples/sec | 22 samples/sec/watt | 1.19 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 128 | 34,597 samples/sec | 81 samples/sec/watt | 3.7 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| RetinaNet-RN34 | 8 | 2,780 images/sec | 8 images/sec/watt | 2.88 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
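
These single-GPU TensorRT numbers can be cross-checked with a little arithmetic: with one batch in flight, throughput is approximately batch size divided by latency. A minimal sketch of that check, using figures from the H100 ResNet-50v1.5 rows (the function name `implied_throughput` is illustrative, not part of any NVIDIA tooling):

```python
# Sanity check for the TensorRT tables: with a single batch in flight,
# throughput (inferences/sec) is roughly batch_size / latency.
def implied_throughput(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by one in-flight batch of `batch_size`."""
    return batch_size / (latency_ms / 1000.0)

# H100 ResNet-50v1.5 rows: batch 8 at 0.38 ms and batch 128 at 2.2 ms.
print(round(implied_throughput(8, 0.38)))   # ~21,000 images/sec (table: 21,136)
print(round(implied_throughput(128, 2.2)))  # ~58,000 images/sec (table: 58,139)
```

The small residual between the implied and reported values comes from rounding in the published latency figures.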

L40S Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 1 | 0.37 images/sec | - | 2678.19 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| ResNet-50v1.5 | 8 | 23,472 images/sec | 78 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 32 | 37,069 images/sec | 109 images/sec/watt | 0.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-BASE | 8 | 8,412 sequences/sec | 26 sequences/sec/watt | 0.95 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 128 | 13,169 sequences/sec | 38 sequences/sec/watt | 9.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-LARGE | 8 | 3,188 sequences/sec | 10 sequences/sec/watt | 2.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 128 | 4,034 sequences/sec | 12 sequences/sec/watt | 31.73 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| EfficientDet-D0 | 8 | 4,716 images/sec | 17 images/sec/watt | 1.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |
| EfficientNet-B0 | 8 | 20,534 images/sec | 106 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 32 | 41,526 images/sec | 140 images/sec/watt | 0.77 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| EfficientNet-B4 | 8 | 5,149 images/sec | 17 images/sec/watt | 1.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 16 | 6,116 images/sec | 18 images/sec/watt | 2.62 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| HF Swin Base | 8 | 3,825 samples/sec | 11 samples/sec/watt | 2.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |
| | 16 | 4,371 samples/sec | 13 samples/sec/watt | 3.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | Mixed | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |
| HF Swin Large | 8 | 1,932 samples/sec | 6 samples/sec/watt | 4.14 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | Mixed | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| | 16 | 2,141 samples/sec | 6 samples/sec/watt | 7.47 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF ViT Base | 8 | 5,799 samples/sec | 17 samples/sec/watt | 1.38 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF ViT Large | 8 | 1,926 samples/sec | 6 samples/sec/watt | 4.15 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| Megatron BERT Large QAT | 8 | 4,213 sequences/sec | 13 sequences/sec/watt | 1.9 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| | 24 | 5,097 sequences/sec | 15 sequences/sec/watt | 4.71 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| QuartzNet | 8 | 7,643 samples/sec | 32 samples/sec/watt | 1.05 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 128 | 22,582 samples/sec | 65 samples/sec/watt | 5.67 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |

1,024 x 1,024 image size, 50 denoising steps for Stable Diffusion XL
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
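
The Efficiency column also lets you back out the approximate board power under load: throughput divided by throughput-per-watt is watts. A small illustrative sketch (the helper name `implied_power_watts` is ours):

```python
# The Efficiency column is throughput per watt, so throughput divided by
# efficiency approximates the measured GPU board power during the run.
def implied_power_watts(throughput: float, perf_per_watt: float) -> float:
    return throughput / perf_per_watt

# L40S ResNet-50v1.5, batch 8: 23,472 images/sec at 78 images/sec/watt.
print(round(implied_power_watts(23472, 78)))  # ~300 W
```

Note that measured power during a benchmark run is typically below the board's rated maximum, so these implied figures need not match the TDP.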

L4 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 0.82 images/sec | - | 1221.73 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| Stable Diffusion XL | 1 | 0.11 images/sec | - | 9098.4 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| ResNet-50v1.5 | 8 | 9,911 images/sec | 138 images/sec/watt | 0.81 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| | 32 | 10,101 images/sec | 111 images/sec/watt | 16.27 | 1x L4 | GIGABYTE G482-Z54-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| BERT-BASE | 8 | 3,323 sequences/sec | 46 sequences/sec/watt | 2.41 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| | 24 | 4,052 sequences/sec | 56 sequences/sec/watt | 5.92 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| BERT-LARGE | 8 | 1,081 sequences/sec | 15 sequences/sec/watt | 7.4 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| | 13 | 1,314 sequences/sec | 19 sequences/sec/watt | 9.9 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| EfficientNet-B4 | 8 | 1,831 images/sec | 25 images/sec/watt | 4.37 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF Swin Base | 8 | 1,215 samples/sec | 17 samples/sec/watt | 6.58 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF Swin Large | 8 | 621 samples/sec | 9 samples/sec/watt | 12.88 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF ViT Base | 16 | 1,899 samples/sec | 26 samples/sec/watt | 8.42 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF ViT Large | 8 | 613 samples/sec | 9 samples/sec/watt | 13.06 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| Megatron BERT Large QAT | 24 | 1,789 sequences/sec | 25 sequences/sec/watt | 13.42 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| QuartzNet | 8 | 4,063 samples/sec | 57 samples/sec/watt | 1.97 | 1x L4 | GIGABYTE G482-Z52-00 | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L4 |
| | 128 | 6,083 samples/sec | 84 samples/sec/watt | 21.04 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| RetinaNet-RN34 | 8 | 364 images/sec | 5 images/sec/watt | 21.95 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256


A40 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 11,172 images/sec | 40 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | A40 |
| | 128 | 15,401 images/sec | 51 images/sec/watt | 8.31 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | A40 |
| BERT-BASE | 8 | 4,257 sequences/sec | 15 sequences/sec/watt | 1.88 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| | 128 | 5,667 sequences/sec | 19 sequences/sec/watt | 22.59 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| BERT-LARGE | 8 | 1,573 sequences/sec | 5 sequences/sec/watt | 5.08 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| | 128 | 1,966 sequences/sec | 7 sequences/sec/watt | 65.11 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| EfficientNet-B0 | 8 | 11,142 images/sec | 59 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 128 | 20,068 images/sec | 67 images/sec/watt | 6.38 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| EfficientNet-B4 | 8 | 2,138 images/sec | 8 images/sec/watt | 3.74 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 128 | 2,700 images/sec | 9 images/sec/watt | 47.41 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF Swin Base | 8 | 1,694 samples/sec | 6 samples/sec/watt | 4.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 32 | 1,838 samples/sec | 6 samples/sec/watt | 17.41 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF Swin Large | 8 | 956 samples/sec | 3 samples/sec/watt | 8.37 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 32 | 1,008 samples/sec | 3 samples/sec/watt | 31.76 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF ViT Base | 8 | 2,170 samples/sec | 7 samples/sec/watt | 3.69 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 64 | 2,330 samples/sec | 8 samples/sec/watt | 27.47 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF ViT Large | 8 | 693 samples/sec | 2 samples/sec/watt | 11.54 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 64 | 746 samples/sec | 2 samples/sec/watt | 85.78 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| Megatron BERT Large QAT | 8 | 2,059 sequences/sec | 7 sequences/sec/watt | 3.89 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| | 128 | 2,650 sequences/sec | 9 sequences/sec/watt | 48.31 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| QuartzNet | 8 | 4,380 samples/sec | 21 samples/sec/watt | 1.83 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 128 | 8,468 samples/sec | 29 samples/sec/watt | 15.12 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| RetinaNet-RN34 | 8 | 705 images/sec | 2 images/sec/watt | 11.34 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256


A30 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 10,243 images/sec | 71 images/sec/watt | 0.8 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 16,633 images/sec | 101 images/sec/watt | 7.7 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| BERT-BASE | 8 | 4,334 sequences/sec | 26 sequences/sec/watt | 1.85 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| | 128 | 5,820 sequences/sec | 35 sequences/sec/watt | 21.99 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| BERT-LARGE | 8 | 1,500 sequences/sec | 10 sequences/sec/watt | 5.33 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| | 128 | 2,053 sequences/sec | 13 sequences/sec/watt | 62.34 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| EfficientNet-B0 | 8 | 8,997 images/sec | 81 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 17,252 images/sec | 106 images/sec/watt | 7.4 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| EfficientNet-B4 | 8 | 1,877 images/sec | 13 images/sec/watt | 4.3 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 2,416 images/sec | 15 images/sec/watt | 53 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF Swin Base | 8 | 1,647 samples/sec | 10 samples/sec/watt | 4.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 32 | 1,846 samples/sec | 11 samples/sec/watt | 17.3 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF Swin Large | 8 | 910 samples/sec | 5 samples/sec/watt | 8.8 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 32 | 1,003 samples/sec | 6 samples/sec/watt | 31.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF ViT Base | 8 | 2,060 samples/sec | 12 samples/sec/watt | 3.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 64 | 2,328 samples/sec | 14 samples/sec/watt | 27.5 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF ViT Large | 8 | 674 samples/sec | 4 samples/sec/watt | 11.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 64 | 709 samples/sec | 4 samples/sec/watt | 90.2 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| Megatron BERT Large QAT | 8 | 1,802 sequences/sec | 12 sequences/sec/watt | 4.44 | 1x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA A30 |
| | 128 | 2,724 sequences/sec | 17 sequences/sec/watt | 46.99 | 1x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA A30 |
| QuartzNet | 8 | 3,460 samples/sec | 30 samples/sec/watt | 2.3 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 9,998 samples/sec | 73 samples/sec/watt | 12.8 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| RetinaNet-RN34 | 8 | 702 images/sec | 4 images/sec/watt | 11.4 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256


A10 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 8,562 images/sec | 57 images/sec/watt | 0.93 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 10,657 images/sec | 71 images/sec/watt | 12.01 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| BERT-BASE | 8 | 3,109 sequences/sec | 21 sequences/sec/watt | 2.57 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
| | 128 | 3,822 sequences/sec | 26 sequences/sec/watt | 33.49 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
| BERT-LARGE | 8 | 1,086 sequences/sec | 7 sequences/sec/watt | 7.36 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
| | 128 | 1,265 sequences/sec | 8 sequences/sec/watt | 101.17 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
| EfficientNet-B0 | 8 | 9,616 images/sec | 64 images/sec/watt | 0.83 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 14,494 images/sec | 97 images/sec/watt | 8.83 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| EfficientNet-B4 | 8 | 1,625 images/sec | 11 images/sec/watt | 4.92 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 1,897 images/sec | 13 images/sec/watt | 67.49 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| HF Swin Base | 8 | 1,223 samples/sec | 8 samples/sec/watt | 6.54 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 32 | 1,283 samples/sec | 9 samples/sec/watt | 24.93 | 1x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA A10 |
| HF Swin Large | 8 | 622 samples/sec | 4 samples/sec/watt | 12.86 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 32 | 668 samples/sec | 4 samples/sec/watt | 47.9 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| HF ViT Base | 8 | 1,395 samples/sec | 9 samples/sec/watt | 5.74 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 64 | 1,526 samples/sec | 10 samples/sec/watt | 41.93 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| HF ViT Large | 8 | 460 samples/sec | 3 samples/sec/watt | 17.38 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| Megatron BERT Large QAT | 8 | 1,566 sequences/sec | 10 sequences/sec/watt | 5.11 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 1,801 sequences/sec | 12 sequences/sec/watt | 71.06 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| QuartzNet | 8 | 3,851 samples/sec | 26 samples/sec/watt | 2.08 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 5,924 samples/sec | 40 samples/sec/watt | 21.61 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| RetinaNet-RN34 | 8 | 505 images/sec | 3 images/sec/watt | 15.83 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

NVIDIA Performance with Triton Inference Server

H200 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA H200 | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 0.77 | 3,182 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 16 | 17.996 | 1,777 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 2 | 32 | 35.862 | 1,784 inf/sec | 24.09-py3 |
| DLRM | NVIDIA H200 | ts-trace | PyTorch | Mixed | 4 | 1 | 32 | 0.868 | 36,852 inf/sec | 24.02-py3 |
| DLRM | NVIDIA H200 | ts-trace | PyTorch | Mixed | 1 | 2 | 32 | 1.504 | 72,006 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA H200 | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 108.056 | 4,736 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA H200 | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 108.477 | 4,717 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 32 | 3.992 | 7,930 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H200 | onnx | PyTorch | Mixed | 2 | 2 | 64 | 11.55 | 11,011 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 64 | 7.951 | 8,012 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 2 | 64 | 14.269 | 8,943 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 32 | 3.801 | 8,370 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H200 | onnx | PyTorch | Mixed | 2 | 2 | 64 | 7.482 | 17,037 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA H200 | tensorrt | PyTorch | Mixed | 2 | 1 | 4 | 2.751 | 32,970 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA H200 | tensorrt | PyTorch | Mixed | 1 | 2 | 512 | 42.754 | 40,098 inf/sec | 24.09-py3 |
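
The Triton columns relate to each other through Little's law: the sustained rate is roughly the number of in-flight requests times the client batch size, divided by the end-to-end latency. A minimal sketch of that cross-check (the helper name `triton_throughput` is illustrative, not a Triton API):

```python
# Reading the Triton tables: with C concurrent client requests, each of
# client batch size B, and end-to-end latency L, sustained throughput is
# roughly C * B / L inferences per second (Little's law).
def triton_throughput(concurrency: int, client_batch: int, latency_ms: float) -> float:
    return concurrency * client_batch / (latency_ms / 1000.0)

# FastPitch on H200: 512 concurrent requests, batch 1, 108.056 ms latency.
print(round(triton_throughput(512, 1, 108.056)))  # ~4,700 inf/sec (table: 4,736)
```

The relationship holds only approximately, since queuing inside Triton and client-side overheads also contribute to the measured latency.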

GH200 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA GH200 96GB | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.153 | 3,458 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 2 | 1 | 64 | 41.714 | 1,534 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 2 | 128 | 166.125 | 1,540 inf/sec | 24.09-py3 |
| DLRM | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 2 | 1 | 64 | 1.241 | 51,529 inf/sec | 24.02-py3 |
| DLRM | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 4 | 2 | 16 | 1.189 | 74,741 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 2 | 1 | 1024 | 257.727 | 3,968 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 2 | 2 | 1024 | 524.694 | 3,893 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 1 | 32 | 2.489 | 12,701 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 2 | 16 | 2.314 | 13,651 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 2 | 1 | 32 | 2.746 | 11,560 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 1 | 2 | 128 | 23.598 | 10,837 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 1 | 512 | 61.929 | 8,262 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 2 | 64 | 5.945 | 21,469 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 4 | 1 | 256 | 12.583 | 20,330 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 4 | 2 | 128 | 6.362 | 40,179 inf/sec | 24.09-py3 |

H100 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | H100 SXM5-80GB | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.207 | 3,311 inf/sec | 24.02-py3 |
| BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 14.784 | 1,082 inf/sec | 24.02-py3 |
| BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 8 | 12.715 | 1,258 inf/sec | 24.02-py3 |
| DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 1 | 1 | 32 | 0.943 | 34,027 inf/sec | 24.02-py3 |
| DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 4 | 2 | 32 | 0.913 | 70,071 inf/sec | 24.02-py3 |
| FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 119.531 | 4,281 inf/sec | 24.02-py3 |
| FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 119.36 | 4,287 inf/sec | 24.02-py3 |
| ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 1.977 | 8,090 inf/sec | 24.02-py3 |
| ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 4.101 | 7,801 inf/sec | 24.02-py3 |
| TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 1 | 1024 | 33.027 | 30,996 inf/sec | 24.02-py3 |
| TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 2 | 512 | 25.522 | 40,114 inf/sec | 24.02-py3 |

H100 NVL Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA H100 NVL | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.365 | 2,919 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 32 | 25.76 | 1,242 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 2 | 2 | 32 | 50.884 | 1,257 inf/sec | 24.09-py3 |
| DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 32 | 0.804 | 39,745 inf/sec | 24.02-py3 |
| DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 32 | 1.071 | 59,691 inf/sec | 24.02-py3 |
| FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 256 | 70.915 | 3,609 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 149.333 | 3,426 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 32 | 4.218 | 7,492 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 2 | 2 | 32 | 5.585 | 11,355 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 64 | 7.851 | 8,105 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 2 | 32 | 6.647 | 9,561 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 64 | 6.673 | 9,546 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 2 | 2 | 64 | 7.446 | 17,116 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 16.846 | 30,387 inf/sec | 24.02-py3 |
| TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 4 | 2 | 256 | 21.733 | 23,544 inf/sec | 24.09-py3 |

L40S Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA L40S | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.398 | 2,853 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 1 | 16 | 21.281 | 751 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA L40S | onnx | PyTorch | Mixed | 1 | 2 | 8 | 20.42 | 783 inf/sec | 24.09-py3 |
| DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 1 | 64 | 1.545 | 41,403 inf/sec | 24.02-py3 |
| DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 2 | 32 | 0.929 | 68,867 inf/sec | 24.02-py3 |
| FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 1 | 256 | 106.583 | 2,401 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 64 | 52.861 | 2,421 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 1 | 32 | 3.88 | 8,118 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 2 | 32 | 7.009 | 9,061 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 1 | 32 | 3.59 | 8,808 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 2 | 16 | 3.851 | 8,217 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA L40S | onnx | PyTorch | Mixed | 4 | 1 | 512 | 57.95 | 8,807 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA L40S | tensorrt | PyTorch | Mixed | 2 | 2 | 32 | 5.878 | 10,836 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 1 | 128 | 9.371 | 13,629 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 128 | 9.792 | 26,099 inf/sec | 24.09-py3 |

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 13,768 images/sec | - | 0.58 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| | 128 | 30,338 images/sec | - | 4.22 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| BERT-LARGE | 8 | 2,308 sequences/sec | - | 3.47 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| | 128 | 4,045 sequences/sec | - | 31.64 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |

BERT-Large: Sequence Length = 128

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most rigorous way to test whether an AI system is ready to be deployed in the field and deliver meaningful results.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More