AI Inference

Inference can be deployed in many ways, depending on the use case. Offline processing of data is best done at large batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and Large Language Model (LLM) deployments seek to deliver great experiences by lowering latency, so developers and infrastructure managers must strike a balance between throughput and latency to deliver responsive user experiences and the best possible throughput while containing deployment costs.


When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency sections below show the best measured throughput under a one-second time-to-first-token limit, which delivers low latency for most users while making efficient use of compute resources.
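This tuning loop can be sketched in a few lines. The sweep values below are illustrative placeholders, not numbers from the tables that follow, and `best_config` is a hypothetical helper, not part of any NVIDIA tooling:

```python
# Sketch: pick the highest-throughput batch size whose time-to-first-token
# (TTFT) stays within a latency budget. All measurements are illustrative.

def best_config(measurements, ttft_limit_s):
    """measurements: list of (batch_size, ttft_s, tokens_per_sec) tuples."""
    feasible = [m for m in measurements if m[1] <= ttft_limit_s]
    if not feasible:
        return None  # no batch size meets the latency budget
    return max(feasible, key=lambda m: m[2])

# Illustrative sweep: larger batches raise throughput but also raise TTFT.
sweep = [(8, 0.12, 2_100), (32, 0.35, 6_400), (128, 0.9, 14_800), (512, 2.4, 21_000)]
print(best_config(sweep, ttft_limit_s=1.0))  # (128, 0.9, 14800)
```

Under a one-second budget the sweep settles on the largest batch size that still answers quickly, which is exactly the trade-off described above.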



MLPerf Inference v4.1 Performance Benchmarks

Offline Scenario, Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
Llama2 70B | 11,264 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 34,864 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 24,525 tokens/sec | 8x H100 | NVIDIA DGX H100 | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 4,068 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Mixtral 8x7B | 59,335 tokens/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 52,818 tokens/sec | 8x H100 | SMC H100 | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 8,021 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 18 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 2.3 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
ResNet-50 | 768,235 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 710,521 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 95,105 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | ImageNet (224x224)
RetinaNet | 15,015 samples/sec | 8x H200 | ThinkSystem SR685a V3 | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 14,538 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 1,923 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | OpenImages (800x800)
BERT | 73,791 samples/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | SQuAD v1.1
BERT | 72,876 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | SQuAD v1.1
BERT | 9,864 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | SQuAD v1.1
GPT-J | 20,552 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
GPT-J | 19,878 tokens/sec | 8x H100 | ESC-N8-E11 | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
GPT-J | 2,804 tokens/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
DLRMv2 | 639,512 samples/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 602,108 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 86,731 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
3D-UNET | 55 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | 0.863 DICE mean | KiTS 2019
3D-UNET | 52 samples/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.863 DICE mean | KiTS 2019
3D-UNET | 7 samples/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.863 DICE mean | KiTS 2019

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints | Dataset
Llama2 70B | 10,756 tokens/sec | 1x B200 | NVIDIA B200 | NVIDIA B200-SXM-180GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 32,790 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 23,700 tokens/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 3,884 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Mixtral 8x7B | 57,177 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 51,028 tokens/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP
Mixtral 8x7B | 7,450 tokens/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 17 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 16 samples/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 2.02 samples/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
ResNet-50 | 681,328 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 634,193 queries/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 77,012 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200 Grace Hopper Superchip 96GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
RetinaNet | 14,012 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 13,979 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 1,731 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
BERT | 58,091 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | 90.87% f1 | 130 ms | SQuAD v1.1
BERT | 58,929 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1
BERT | 7,103 queries/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200 Grace Hopper Superchip 96GB | 90.87% f1 | 130 ms | SQuAD v1.1
GPT-J | 20,139 queries/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200-SXM-141GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
GPT-J | 19,811 queries/sec | 8x H100 | AS-4125GS-TNHR2-LCC | NVIDIA H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
GPT-J | 2,513 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
DLRMv2 | 585,209 queries/sec | 8x H200 | GIGABYTE G593-SD1 | NVIDIA H200-SXM-141GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 556,101 queries/sec | 8x H100 | SYS-421GE-TNHR2-LCC | NVIDIA H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 81,010 queries/sec | 1x GH200 | NVIDIA GH200 NVL2 Platform | NVIDIA GH200 Grace Hopper Superchip 144GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
Llama2 70B | 25,262 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca
Mixtral 8x7B | 48,988 tokens/sec | 8 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 13 samples/sec | 0.002 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val
ResNet-50 | 556,234 samples/sec | 112 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224)
RetinaNet | 10,803 samples/sec | 2 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800)
BERT | 54,063 samples/sec | 10 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1
GPT-J | 13,097 samples/sec | 3 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail
DLRMv2 | 503,719 samples/sec | 84 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset
3D-UNET | 42 samples/sec | 0.009 samples/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | KiTS 2019

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
Llama2 70B | 23,113 tokens/sec | 4 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca
Mixtral 8x7B | 45,497 tokens/sec | 7 tokens/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenOrca, GSM8K, MBXP
Stable Diffusion XL | 13 queries/sec | 0.002 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Subset of coco-2014 val
ResNet-50 | 480,131 queries/sec | 96 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | ImageNet (224x224)
RetinaNet | 9,603 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | OpenImages (800x800)
BERT | 41,599 queries/sec | 8 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | SQuAD v1.1
GPT-J | 11,701 queries/sec | 2 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | CNN Dailymail
DLRMv2 | 420,107 queries/sec | 69 queries/sec/watt | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB | Synthetic Multihot Criteo Dataset

MLPerf™ v4.1 Inference Closed: Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP32 and 99.9% of FP32, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 4.1-0005, 4.1-0021, 4.1-0027, 4.1-0037, 4.1-0038, 4.1-0043, 4.1-0044, 4.1-0046, 4.1-0048, 4.1-0049, 4.1-0053, 4.1-0057, 4.1-0060, 4.1-0063, 4.1-0064, 4.1-0065, 4.1-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA B200 is a preview submission.
Llama2 70B Max Sequence Length = 1,024.
Mixtral 8x7B Max Sequence Length = 2,048.
BERT-Large Max Sequence Length = 384.
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here
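As a rough sketch of how a TTFT/TPOT constraint like the server scenario's 2000 ms/200 ms limit applies to a single request, the helper below and its sample values are hypothetical, not the MLPerf harness:

```python
# Sketch: check whether one request satisfies MLPerf-style server constraints
# of TTFT <= 2000 ms and TPOT <= 200 ms. Sample values are illustrative.

def meets_constraints(ttft_ms, total_latency_ms, output_tokens,
                      ttft_limit_ms=2000, tpot_limit_ms=200):
    # Time per output token (TPOT) is measured over the decode phase,
    # i.e. the time after the first token has been produced.
    tpot_ms = (total_latency_ms - ttft_ms) / max(output_tokens - 1, 1)
    return ttft_ms <= ttft_limit_ms and tpot_ms <= tpot_limit_ms

print(meets_constraints(ttft_ms=1500, total_latency_ms=21000, output_tokens=128))  # True
print(meets_constraints(ttft_ms=2500, total_latency_ms=21000, output_tokens=128))  # False
```

The first request passes (TPOT ≈ 153.5 ms); the second fails on the time-to-first-token limit alone.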

LLM Inference Performance of NVIDIA Data Center Products

H200 Inference Performance - High Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,953 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,974 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 4,947 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 679 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 5,066 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,481 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,927 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 482 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,924 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 128 | 2048 | 7,939 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 128 | 4096 | 6,297 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 460 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 560 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 500 | 2000 | 6,683 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,704 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 2048 | 2048 | 3,835 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 633 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 128 | 28,126 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 24,158 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 16,460 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,661 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,836 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 20,345 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,801 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 11,073 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,741 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,796 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 14,830 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,520 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,995 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,295 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 11,983 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,254 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,531 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,396 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 13,796 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog.
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency).
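The throughput definition in the note above, tokens/s = total generated tokens / total latency, can be checked with a worked example; all numbers here are made up for illustration:

```python
# Worked example of the note above:
#   tokens/s = total generated tokens / total latency (inclusive of first token).
# All numbers are illustrative, not taken from the tables.

concurrent_requests = 64
output_len = 2048            # generated tokens per request
total_latency_s = 32.0       # wall-clock time for the whole batch, incl. first token

total_generated = concurrent_requests * output_len   # 131072 tokens
throughput = total_generated / total_latency_s       # tokens/s
print(round(throughput))  # 4096
```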

H100 Inference Performance - High Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 70B | 1 | 2 | 128 | 128 | 6,399 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 128 | 4096 | 3,581 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 2048 | 128 | 774 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 500 | 2000 | 4,776 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,247 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 5,166 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 915 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 128 | 128 | 27,156 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 128 | 2048 | 23,010 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 8 | 128 | 4096 | 47,834 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 2048 | 128 | 3,368 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 5000 | 500 | 3,592 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 500 | 2000 | 18,186 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 1000 | 1000 | 15,932 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.14.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 10,465 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 1,739 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.15.0 | H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - High Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 8B | 1 | 1 | 128 | 128 | 8,983 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,297 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 2,989 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,056 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 972 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,264 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,014 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,163 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 326 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,655 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism

H200 Inference Performance - High Throughput at Low Latency Under 1 Second

Model | Batch Size | TP | Input Length | Output Length | Time to 1st Token | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 512 | 1 | 128 | 128 | 0.64 seconds | 25,126 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
GPT-J 6B | 64 | 1 | 128 | 2048 | 0.08 seconds | 7,719 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
GPT-J 6B | 32 | 1 | 2048 | 128 | 0.68 seconds | 2,469 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
GPT-J 6B | 32 | 1 | 2048 | 2048 | 0.68 seconds | 3,167 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 512 | 1 | 128 | 128 | 0.84 seconds | 19,975 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 64 | 1 | 128 | 2048 | 0.11 seconds | 7,149 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 32 | 1 | 2048 | 128 | 0.9 seconds | 2,101 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 7B | 32 | 1 | 2048 | 2048 | 0.9 seconds | 3,008 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 64 | 1 | 128 | 128 | 0.92 seconds | 2,044 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 64 | 1 | 128 | 2048 | 0.93 seconds | 2,238 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 4 | 1 | 2048 | 128 | 0.95 seconds | 128 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Llama v2 70B | 16 | 8 | 2048 | 2048 | 0.97 seconds | 173 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 32 | 4 | 128 | 128 | 0.36 seconds | 365 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 64 | 8 | 128 | 2048 | 0.43 seconds | 408 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 4 | 4 | 2048 | 128 | 0.71 seconds | 43 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200
Falcon 180B | 4 | 4 | 2048 | 2048 | 0.71 seconds | 53 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200

TP: Tensor Parallelism
Batch size per GPU
Low Latency Target: Highest measured throughput with less than 1 second 1st token latency

H100 Inference Performance - High Throughput at Low Latency Under 1 Second

Model | Batch Size | TP | Input Length | Output Length | Time to 1st Token | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 512 | 1 | 128 | 128 | 0.63 seconds | 24,167 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
GPT-J 6B | 120 | 1 | 128 | 2048 | 0.16 seconds | 7,351 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
GPT-J 6B | 32 | 1 | 2048 | 128 | 0.67 seconds | 2,257 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
GPT-J 6B | 32 | 1 | 2048 | 2048 | 0.68 seconds | 2,710 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 512 | 1 | 128 | 128 | 0.83 seconds | 19,258 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 120 | 1 | 128 | 2048 | 0.2 seconds | 6,944 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 32 | 1 | 2048 | 128 | 0.89 seconds | 1,904 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 7B | 32 | 1 | 2048 | 2048 | 0.89 seconds | 2,484 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 64 | 1 | 128 | 128 | 0.92 seconds | 1,702 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 128 | 4 | 128 | 2048 | 0.73 seconds | 1,494 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 4 | 8 | 2048 | 128 | 0.74 seconds | 105 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Llama v2 70B | 8 | 4 | 2048 | 2048 | 0.74 seconds | 141 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 64 | 4 | 128 | 128 | 0.71 seconds | 372 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 64 | 4 | 128 | 2048 | 0.7 seconds | 351 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 8 | 8 | 2048 | 128 | 0.87 seconds | 45 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB
Falcon 180B | 8 | 8 | 2048 | 2048 | 0.87 seconds | 61 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB

TP: Tensor Parallelism
Batch size per GPU
Low Latency Target: Highest measured throughput with less than 1 second 1st token latency

Inference Performance of NVIDIA Data Center Products

H200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 4.33 images/sec | - | 231.26 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200
Stable Diffusion v2.1 (512x512) | 4 | 6.8 images/sec | - | 588.08 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | NVIDIA H200
Stable Diffusion XL | 1 | 0.86 images/sec | - | 1157.27 | 1x H200 | DGX H200 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA H200
ResNet-50v1.5 | 8 | 21,347 images/sec | 70 images/sec/watt | 0.37 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
ResNet-50v1.5 | 128 | 63,356 images/sec | 104 images/sec/watt | 2.02 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
BERT-BASE | 8 | 9,390 sequences/sec | 21 sequences/sec/watt | 0.85 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
BERT-BASE | 128 | 25,341 sequences/sec | 38 sequences/sec/watt | 5.05 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
BERT-LARGE | 8 | 4,034 sequences/sec | 6 sequences/sec/watt | 1.98 | 1x H200 | DGX H200 | 24.09-py3 | Mixed | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
BERT-LARGE | 128 | 8,374 sequences/sec | 13 sequences/sec/watt | 15.28 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
EfficientNet-B0 | 8 | 16,634 images/sec | 76 images/sec/watt | 0.48 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
EfficientNet-B0 | 128 | 56,960 images/sec | 122 images/sec/watt | 2.25 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
EfficientNet-B4 | 8 | 4,525 images/sec | 14 images/sec/watt | 1.77 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
EfficientNet-B4 | 128 | 8,940 images/sec | 15 images/sec/watt | 14.32 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Base | 8 | 5,083 samples/sec | 11 samples/sec/watt | 1.57 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Base | 32 | 8,304 samples/sec | 12 samples/sec/watt | 3.85 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Large | 8 | 3,435 samples/sec | 6 samples/sec/watt | 2.33 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF Swin Large | 32 | 4,732 samples/sec | 7 samples/sec/watt | 6.76 | 1x H200 | DGX H200 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Base | 8 | 8,948 samples/sec | 19 samples/sec/watt | 0.89 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Base | 64 | 15,403 samples/sec | 23 samples/sec/watt | 4.16 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Large | 8 | 3,743 samples/sec | 6 samples/sec/watt | 2.14 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
HF ViT Large | 64 | 5,415 samples/sec | 8 samples/sec/watt | 11.82 | 1x H200 | DGX H200 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
Megatron BERT Large QAT | 8 | 4,966 sequences/sec | 13 sequences/sec/watt | 1.61 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
Megatron BERT Large QAT | 128 | 12,481 sequences/sec | 18 sequences/sec/watt | 10.26 | 1x H200 | DGX H200 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA H200
QuartzNet | 8 | 6,780 samples/sec | 24 samples/sec/watt | 1.18 | 1x H200 | DGX H200 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA H200
QuartzNet | 128 | 33,906 samples/sec | 89 samples/sec/watt | 3.78 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200
RetinaNet-RN34 | 8 | 2,967 images/sec | 9 images/sec/watt | 2.7 | 1x H200 | DGX H200 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA H200

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

GH200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 4.27 images/sec | - | 234.4 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
Stable Diffusion v2.1 (512x512) | 4 | 5.82 images/sec | - | 687.91 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
Stable Diffusion XL | 1 | 0.68 images/sec | - | 1149.44 | 1x GH200 | NVIDIA P3880 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | GH200 96GB
ResNet-50v1.5 | 8 | 20,979 images/sec | 61 images/sec/watt | 0.38 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
ResNet-50v1.5 | 128 | 63,043 images/sec | 99 images/sec/watt | 2.03 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
BERT-BASE | 8 | 9,593 sequences/sec | 22 sequences/sec/watt | 0.83 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
BERT-BASE | 128 | 20,399 sequences/sec | 45 sequences/sec/watt | 6.27 | 1x GH200 | NVIDIA P3880 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | GH200 96GB
BERT-LARGE | 8 | 3,625 sequences/sec | 7 sequences/sec/watt | 2.21 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
BERT-LARGE | 128 | 7,285 sequences/sec | 14 sequences/sec/watt | 17.57 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
EfficientNet-B0 | 8 | 16,695 images/sec | 67 images/sec/watt | 0.48 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
EfficientNet-B0 | 128 | 56,674 images/sec | 113 images/sec/watt | 2.26 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
EfficientNet-B4 | 8 | 4,531 images/sec | 13 images/sec/watt | 1.77 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
EfficientNet-B4 | 128 | 8,784 images/sec | 14 images/sec/watt | 14.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Base | 8 | 5,106 samples/sec | 10 samples/sec/watt | 1.57 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Base | 32 | 8,197 samples/sec | 12 samples/sec/watt | 3.9 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Large | 8 | 3,403 samples/sec | 6 samples/sec/watt | 2.35 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF Swin Large | 32 | 4,846 samples/sec | 6 samples/sec/watt | 6.6 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Base | 8 | 8,990 samples/sec | 18 samples/sec/watt | 0.89 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Base | 64 | 15,562 samples/sec | 21 samples/sec/watt | 4.11 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Large | 8 | 3,707 samples/sec | 6 samples/sec/watt | 2.16 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
HF ViT Large | 64 | 5,703 samples/sec | 7 samples/sec/watt | 11.22 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
Megatron BERT Large QAT | 8 | 4,927 sequences/sec | 12 sequences/sec/watt | 1.62 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
Megatron BERT Large QAT | 128 | 10,896 sequences/sec | 19 sequences/sec/watt | 11.75 | 1x GH200 | NVIDIA P3880 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | GH200 96GB
QuartzNet | 8 | 6,688 samples/sec | 22 samples/sec/watt | 1.2 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
QuartzNet | 128 | 34,272 samples/sec | 85 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB
RetinaNet-RN34 | 8 | 2,945 images/sec | 4 images/sec/watt | 2.72 | 1x GH200 | NVIDIA P3880 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | GH200 96GB

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

H100 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 4.22 images/sec | - | 236.8 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
| | 4 | 6.41 images/sec | - | 624.6 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0.26 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 0.83 images/sec | - | 1210.08 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100 SXM5-80GB |
| ResNet-50v1.5 | 8 | 21,136 images/sec | 70 images/sec/watt | 0.38 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 128 | 58,139 images/sec | 102 images/sec/watt | 2.2 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| BERT-BASE | 8 | 9,505 sequences/sec | 19 sequences/sec/watt | 0.84 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 128 | 23,883 sequences/sec | 36 sequences/sec/watt | 5.36 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| BERT-LARGE | 8 | 3,979 sequences/sec | 8 sequences/sec/watt | 2.01 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100 SXM5-80GB |
| | 128 | 7,999 sequences/sec | 12 sequences/sec/watt | 16 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| EfficientNet-B0 | 8 | 16,279 images/sec | 62 images/sec/watt | 0.49 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 128 | 55,100 images/sec | 113 images/sec/watt | 2.32 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| EfficientNet-B4 | 8 | 4,542 images/sec | 13 images/sec/watt | 1.76 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 128 | 8,519 images/sec | 15 images/sec/watt | 15.03 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| HF Swin Base | 8 | 5,055 samples/sec | 9 samples/sec/watt | 1.58 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 32 | 7,819 samples/sec | 12 samples/sec/watt | 4.09 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| HF Swin Large | 8 | 3,313 samples/sec | 6 samples/sec/watt | 2.41 | 1x H100 | DGX H100 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| | 32 | 4,446 samples/sec | 6 samples/sec/watt | 7.2 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |
| HF ViT Base | 8 | 9,027 samples/sec | 19 samples/sec/watt | 0.89 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 64 | 14,992 samples/sec | 22 samples/sec/watt | 4.27 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| HF ViT Large | 8 | 3,707 samples/sec | 6 samples/sec/watt | 2.16 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 64 | 5,348 samples/sec | 8 samples/sec/watt | 11.97 | 1x H100 | DGX H100 | 24.10-py3 | FP8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| Megatron BERT Large QAT | 8 | 4,571 sequences/sec | 12 sequences/sec/watt | 1.75 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 128 | 12,005 sequences/sec | 17 sequences/sec/watt | 10.66 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| QuartzNet | 8 | 6,697 samples/sec | 22 samples/sec/watt | 1.19 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| | 128 | 34,597 samples/sec | 81 samples/sec/watt | 3.7 | 1x H100 | DGX H100 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | H100-SXM5-80GB |
| RetinaNet-RN34 | 8 | 2,780 images/sec | 8 images/sec/watt | 2.88 | 1x H100 | DGX H100 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | H100-SXM5-80GB |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
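
These single-GPU TensorRT numbers can be cross-checked with a little arithmetic: with one batch in flight, throughput is approximately batch size divided by latency. A minimal sketch of that check, using figures from the H100 ResNet-50v1.5 rows (the function name `implied_throughput` is illustrative, not part of any NVIDIA tooling):

```python
# Sanity check for the TensorRT tables: with a single batch in flight,
# throughput (inferences/sec) is roughly batch_size / latency.
def implied_throughput(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by one in-flight batch of `batch_size`."""
    return batch_size / (latency_ms / 1000.0)

# H100 ResNet-50v1.5 rows: batch 8 at 0.38 ms and batch 128 at 2.2 ms.
print(round(implied_throughput(8, 0.38)))   # ~21,000 images/sec (table: 21,136)
print(round(implied_throughput(128, 2.2)))  # ~58,000 images/sec (table: 58,139)
```

The small residual between the implied and reported values comes from rounding in the published latency figures.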

L40S Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 1 | 0.37 images/sec | - | 2678.19 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| ResNet-50v1.5 | 8 | 23,472 images/sec | 78 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 32 | 37,069 images/sec | 109 images/sec/watt | 0.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-BASE | 8 | 8,412 sequences/sec | 26 sequences/sec/watt | 0.95 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 128 | 13,169 sequences/sec | 38 sequences/sec/watt | 9.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| BERT-LARGE | 8 | 3,188 sequences/sec | 10 sequences/sec/watt | 2.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 128 | 4,034 sequences/sec | 12 sequences/sec/watt | 31.73 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| EfficientDet-D0 | 8 | 4,716 images/sec | 17 images/sec/watt | 1.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |
| EfficientNet-B0 | 8 | 20,534 images/sec | 106 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 32 | 41,526 images/sec | 140 images/sec/watt | 0.77 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L40S |
| EfficientNet-B4 | 8 | 5,149 images/sec | 17 images/sec/watt | 1.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 16 | 6,116 images/sec | 18 images/sec/watt | 2.62 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| HF Swin Base | 8 | 3,825 samples/sec | 11 samples/sec/watt | 2.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |
| | 16 | 4,371 samples/sec | 13 samples/sec/watt | 3.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | Mixed | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |
| HF Swin Large | 8 | 1,932 samples/sec | 6 samples/sec/watt | 4.14 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | Mixed | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| | 16 | 2,141 samples/sec | 6 samples/sec/watt | 7.47 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF ViT Base | 8 | 5,799 samples/sec | 17 samples/sec/watt | 1.38 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| HF ViT Large | 8 | 1,926 samples/sec | 6 samples/sec/watt | 4.15 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | FP8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| Megatron BERT Large QAT | 8 | 4,213 sequences/sec | 13 sequences/sec/watt | 1.9 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| | 24 | 5,097 sequences/sec | 15 sequences/sec/watt | 4.71 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L40S |
| QuartzNet | 8 | 7,643 samples/sec | 32 samples/sec/watt | 1.05 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L40S |
| | 128 | 22,582 samples/sec | 65 samples/sec/watt | 5.67 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA L40S |

1,024 x 1,024 image size, 50 denoising steps for Stable Diffusion XL
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
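
The Efficiency column also lets you back out the approximate board power under load: throughput divided by throughput-per-watt is watts. A small illustrative sketch (the helper name `implied_power_watts` is ours):

```python
# The Efficiency column is throughput per watt, so throughput divided by
# efficiency approximates the measured GPU board power during the run.
def implied_power_watts(throughput: float, perf_per_watt: float) -> float:
    return throughput / perf_per_watt

# L40S ResNet-50v1.5, batch 8: 23,472 images/sec at 78 images/sec/watt.
print(round(implied_power_watts(23472, 78)))  # ~300 W
```

Note that measured power during a benchmark run is typically below the board's rated maximum, so these implied figures need not match the TDP.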

L4 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 0.82 images/sec | - | 1221.73 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| Stable Diffusion XL | 1 | 0.11 images/sec | - | 9098.4 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| ResNet-50v1.5 | 8 | 9,911 images/sec | 138 images/sec/watt | 0.81 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| | 32 | 10,101 images/sec | 111 images/sec/watt | 16.27 | 1x L4 | GIGABYTE G482-Z54-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| BERT-BASE | 8 | 3,323 sequences/sec | 46 sequences/sec/watt | 2.41 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| | 24 | 4,052 sequences/sec | 56 sequences/sec/watt | 5.92 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| BERT-LARGE | 8 | 1,081 sequences/sec | 15 sequences/sec/watt | 7.4 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| | 13 | 1,314 sequences/sec | 19 sequences/sec/watt | 9.9 | 1x L4 | GIGABYTE G482-Z54-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| EfficientNet-B4 | 8 | 1,831 images/sec | 25 images/sec/watt | 4.37 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF Swin Base | 8 | 1,215 samples/sec | 17 samples/sec/watt | 6.58 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF Swin Large | 8 | 621 samples/sec | 9 samples/sec/watt | 12.88 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF ViT Base | 16 | 1,899 samples/sec | 26 samples/sec/watt | 8.42 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | FP8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| HF ViT Large | 8 | 613 samples/sec | 9 samples/sec/watt | 13.06 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| Megatron BERT Large QAT | 24 | 1,789 sequences/sec | 25 sequences/sec/watt | 13.42 | 1x L4 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA L4 |
| QuartzNet | 8 | 4,063 samples/sec | 57 samples/sec/watt | 1.97 | 1x L4 | GIGABYTE G482-Z52-00 | 24.11-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA L4 |
| | 128 | 6,083 samples/sec | 84 samples/sec/watt | 21.04 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |
| RetinaNet-RN34 | 8 | 364 images/sec | 5 images/sec/watt | 21.95 | 1x L4 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA L4 |

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256


A40 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 11,172 images/sec | 40 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | A40 |
| | 128 | 15,401 images/sec | 51 images/sec/watt | 8.31 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | A40 |
| BERT-BASE | 8 | 4,257 sequences/sec | 15 sequences/sec/watt | 1.88 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| | 128 | 5,667 sequences/sec | 19 sequences/sec/watt | 22.59 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| BERT-LARGE | 8 | 1,573 sequences/sec | 5 sequences/sec/watt | 5.08 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| | 128 | 1,966 sequences/sec | 7 sequences/sec/watt | 65.11 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| EfficientNet-B0 | 8 | 11,142 images/sec | 59 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 128 | 20,068 images/sec | 67 images/sec/watt | 6.38 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| EfficientNet-B4 | 8 | 2,138 images/sec | 8 images/sec/watt | 3.74 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 128 | 2,700 images/sec | 9 images/sec/watt | 47.41 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF Swin Base | 8 | 1,694 samples/sec | 6 samples/sec/watt | 4.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 32 | 1,838 samples/sec | 6 samples/sec/watt | 17.41 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF Swin Large | 8 | 956 samples/sec | 3 samples/sec/watt | 8.37 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 32 | 1,008 samples/sec | 3 samples/sec/watt | 31.76 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF ViT Base | 8 | 2,170 samples/sec | 7 samples/sec/watt | 3.69 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 64 | 2,330 samples/sec | 8 samples/sec/watt | 27.47 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| HF ViT Large | 8 | 693 samples/sec | 2 samples/sec/watt | 11.54 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 64 | 746 samples/sec | 2 samples/sec/watt | 85.78 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| Megatron BERT Large QAT | 8 | 2,059 sequences/sec | 7 sequences/sec/watt | 3.89 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| | 128 | 2,650 sequences/sec | 9 sequences/sec/watt | 48.31 | 1x A40 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A40 |
| QuartzNet | 8 | 4,380 samples/sec | 21 samples/sec/watt | 1.83 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| | 128 | 8,468 samples/sec | 29 samples/sec/watt | 15.12 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |
| RetinaNet-RN34 | 8 | 705 images/sec | 2 images/sec/watt | 11.34 | 1x A40 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A40 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256


A30 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 10,243 images/sec | 71 images/sec/watt | 0.8 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 16,633 images/sec | 101 images/sec/watt | 7.7 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| BERT-BASE | 8 | 4,334 sequences/sec | 26 sequences/sec/watt | 1.85 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| | 128 | 5,820 sequences/sec | 35 sequences/sec/watt | 21.99 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| BERT-LARGE | 8 | 1,500 sequences/sec | 10 sequences/sec/watt | 5.33 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| | 128 | 2,053 sequences/sec | 13 sequences/sec/watt | 62.34 | 1x A30 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A30 |
| EfficientNet-B0 | 8 | 8,997 images/sec | 81 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 17,252 images/sec | 106 images/sec/watt | 7.4 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| EfficientNet-B4 | 8 | 1,877 images/sec | 13 images/sec/watt | 4.3 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 2,416 images/sec | 15 images/sec/watt | 53 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF Swin Base | 8 | 1,647 samples/sec | 10 samples/sec/watt | 4.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 32 | 1,846 samples/sec | 11 samples/sec/watt | 17.3 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF Swin Large | 8 | 910 samples/sec | 5 samples/sec/watt | 8.8 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 32 | 1,003 samples/sec | 6 samples/sec/watt | 31.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF ViT Base | 8 | 2,060 samples/sec | 12 samples/sec/watt | 3.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 64 | 2,328 samples/sec | 14 samples/sec/watt | 27.5 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| HF ViT Large | 8 | 674 samples/sec | 4 samples/sec/watt | 11.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 64 | 709 samples/sec | 4 samples/sec/watt | 90.2 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| Megatron BERT Large QAT | 8 | 1,802 sequences/sec | 12 sequences/sec/watt | 4.44 | 1x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA A30 |
| | 128 | 2,724 sequences/sec | 17 sequences/sec/watt | 46.99 | 1x A30 | GIGABYTE G482-Z52-00 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA A30 |
| QuartzNet | 8 | 3,460 samples/sec | 30 samples/sec/watt | 2.3 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| | 128 | 9,998 samples/sec | 73 samples/sec/watt | 12.8 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |
| RetinaNet-RN34 | 8 | 702 images/sec | 4 images/sec/watt | 11.4 | 1x A30 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A30 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256


A10 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 8,562 images/sec | 57 images/sec/watt | 0.93 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 10,657 images/sec | 71 images/sec/watt | 12.01 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| BERT-BASE | 8 | 3,109 sequences/sec | 21 sequences/sec/watt | 2.57 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
| | 128 | 3,822 sequences/sec | 26 sequences/sec/watt | 33.49 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.5.0 | NVIDIA A10 |
| BERT-LARGE | 8 | 1,086 sequences/sec | 7 sequences/sec/watt | 7.36 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
| | 128 | 1,265 sequences/sec | 8 sequences/sec/watt | 101.17 | 1x A10 | GIGABYTE G482-Z52-00 | 24.10-py3 | INT8 | Synthetic | TensorRT 10.6.0 | NVIDIA A10 |
| EfficientNet-B0 | 8 | 9,616 images/sec | 64 images/sec/watt | 0.83 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 14,494 images/sec | 97 images/sec/watt | 8.83 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| EfficientNet-B4 | 8 | 1,625 images/sec | 11 images/sec/watt | 4.92 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 1,897 images/sec | 13 images/sec/watt | 67.49 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| HF Swin Base | 8 | 1,223 samples/sec | 8 samples/sec/watt | 6.54 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 32 | 1,283 samples/sec | 9 samples/sec/watt | 24.93 | 1x A10 | GIGABYTE G482-Z52-00 | 24.09-py3 | INT8 | Synthetic | TensorRT 10.4.0.26 | NVIDIA A10 |
| HF Swin Large | 8 | 622 samples/sec | 4 samples/sec/watt | 12.86 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 32 | 668 samples/sec | 4 samples/sec/watt | 47.9 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| HF ViT Base | 8 | 1,395 samples/sec | 9 samples/sec/watt | 5.74 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 64 | 1,526 samples/sec | 10 samples/sec/watt | 41.93 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| HF ViT Large | 8 | 460 samples/sec | 3 samples/sec/watt | 17.38 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| Megatron BERT Large QAT | 8 | 1,566 sequences/sec | 10 sequences/sec/watt | 5.11 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 1,801 sequences/sec | 12 sequences/sec/watt | 71.06 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| QuartzNet | 8 | 3,851 samples/sec | 26 samples/sec/watt | 2.08 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | Mixed | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| | 128 | 5,924 samples/sec | 40 samples/sec/watt | 21.61 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |
| RetinaNet-RN34 | 8 | 505 images/sec | 3 images/sec/watt | 15.83 | 1x A10 | GIGABYTE G482-Z52-00 | 24.12-py3 | INT8 | Synthetic | TensorRT 10.7.0 | NVIDIA A10 |

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

NVIDIA Performance with Triton Inference Server

H200 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA H200 | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 0.77 | 3,182 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 16 | 17.996 | 1,777 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 2 | 32 | 35.862 | 1,784 inf/sec | 24.09-py3 |
| DLRM | NVIDIA H200 | ts-trace | PyTorch | Mixed | 4 | 1 | 32 | 0.868 | 36,852 inf/sec | 24.02-py3 |
| DLRM | NVIDIA H200 | ts-trace | PyTorch | Mixed | 1 | 2 | 32 | 1.504 | 72,006 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA H200 | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 108.056 | 4,736 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA H200 | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 108.477 | 4,717 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 32 | 3.992 | 7,930 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H200 | onnx | PyTorch | Mixed | 2 | 2 | 64 | 11.55 | 11,011 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 64 | 7.951 | 8,012 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 2 | 64 | 14.269 | 8,943 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H200 | onnx | PyTorch | Mixed | 1 | 1 | 32 | 3.801 | 8,370 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H200 | onnx | PyTorch | Mixed | 2 | 2 | 64 | 7.482 | 17,037 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA H200 | tensorrt | PyTorch | Mixed | 2 | 1 | 4 | 2.751 | 32,970 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA H200 | tensorrt | PyTorch | Mixed | 1 | 2 | 512 | 42.754 | 40,098 inf/sec | 24.09-py3 |
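
The Triton columns relate to each other through Little's law: the sustained rate is roughly the number of in-flight requests times the client batch size, divided by the end-to-end latency. A minimal sketch of that cross-check (the helper name `triton_throughput` is illustrative, not a Triton API):

```python
# Reading the Triton tables: with C concurrent client requests, each of
# client batch size B, and end-to-end latency L, sustained throughput is
# roughly C * B / L inferences per second (Little's law).
def triton_throughput(concurrency: int, client_batch: int, latency_ms: float) -> float:
    return concurrency * client_batch / (latency_ms / 1000.0)

# FastPitch on H200: 512 concurrent requests, batch 1, 108.056 ms latency.
print(round(triton_throughput(512, 1, 108.056)))  # ~4,700 inf/sec (table: 4,736)
```

The relationship holds only approximately, since queuing inside Triton and client-side overheads also contribute to the measured latency.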

GH200 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA GH200 96GB | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.153 | 3,458 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 2 | 1 | 64 | 41.714 | 1,534 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 2 | 128 | 166.125 | 1,540 inf/sec | 24.09-py3 |
| DLRM | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 2 | 1 | 64 | 1.241 | 51,529 inf/sec | 24.02-py3 |
| DLRM | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 4 | 2 | 16 | 1.189 | 74,741 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 2 | 1 | 1024 | 257.727 | 3,968 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 2 | 2 | 1024 | 524.694 | 3,893 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 1 | 32 | 2.489 | 12,701 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 2 | 16 | 2.314 | 13,651 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 2 | 1 | 32 | 2.746 | 11,560 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 1 | 2 | 128 | 23.598 | 10,837 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 1 | 512 | 61.929 | 8,262 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA GH200 96GB | onnx | PyTorch | Mixed | 4 | 2 | 64 | 5.945 | 21,469 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 4 | 1 | 256 | 12.583 | 20,330 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA GH200 96GB | ts-trace | PyTorch | Mixed | 4 | 2 | 128 | 6.362 | 40,179 inf/sec | 24.09-py3 |

H100 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | H100 SXM5-80GB | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.207 | 3,311 inf/sec | 24.02-py3 |
| BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 14.784 | 1,082 inf/sec | 24.02-py3 |
| BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 8 | 12.715 | 1,258 inf/sec | 24.02-py3 |
| DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 1 | 1 | 32 | 0.943 | 34,027 inf/sec | 24.02-py3 |
| DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 4 | 2 | 32 | 0.913 | 70,071 inf/sec | 24.02-py3 |
| FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 119.531 | 4,281 inf/sec | 24.02-py3 |
| FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 119.36 | 4,287 inf/sec | 24.02-py3 |
| ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 1.977 | 8,090 inf/sec | 24.02-py3 |
| ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 4.101 | 7,801 inf/sec | 24.02-py3 |
| TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 1 | 1024 | 33.027 | 30,996 inf/sec | 24.02-py3 |
| TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 2 | 512 | 25.522 | 40,114 inf/sec | 24.02-py3 |

H100 NVL Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA H100 NVL | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.365 | 2,919 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 32 | 25.76 | 1,242 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 2 | 2 | 32 | 50.884 | 1,257 inf/sec | 24.09-py3 |
| DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 32 | 0.804 | 39,745 inf/sec | 24.02-py3 |
| DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 32 | 1.071 | 59,691 inf/sec | 24.02-py3 |
| FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 256 | 70.915 | 3,609 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 149.333 | 3,426 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 32 | 4.218 | 7,492 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 2 | 2 | 32 | 5.585 | 11,355 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 64 | 7.851 | 8,105 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 2 | 32 | 6.647 | 9,561 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 1 | 1 | 64 | 6.673 | 9,546 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA H100 NVL | onnx | PyTorch | Mixed | 2 | 2 | 64 | 7.446 | 17,116 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 16.846 | 30,387 inf/sec | 24.02-py3 |
| TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 4 | 2 | 256 | 21.733 | 23,544 inf/sec | 24.09-py3 |

L40S Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT Base Inference | NVIDIA L40S | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.398 | 2,853 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 1 | 16 | 21.281 | 751 inf/sec | 24.09-py3 |
| BERT Large Inference | NVIDIA L40S | onnx | PyTorch | Mixed | 1 | 2 | 8 | 20.42 | 783 inf/sec | 24.09-py3 |
| DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 1 | 64 | 1.545 | 41,403 inf/sec | 24.02-py3 |
| DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 2 | 32 | 0.929 | 68,867 inf/sec | 24.02-py3 |
| FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 1 | 256 | 106.583 | 2,401 inf/sec | 24.09-py3 |
| FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 64 | 52.861 | 2,421 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 1 | 32 | 3.88 | 8,118 inf/sec | 24.09-py3 |
| GPUNet-0 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 2 | 32 | 7.009 | 9,061 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 1 | 32 | 3.59 | 8,808 inf/sec | 24.09-py3 |
| GPUNet-1 | NVIDIA L40S | onnx | PyTorch | Mixed | 2 | 2 | 16 | 3.851 | 8,217 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA L40S | onnx | PyTorch | Mixed | 4 | 1 | 512 | 57.95 | 8,807 inf/sec | 24.09-py3 |
| ResNet-50 v1.5 | NVIDIA L40S | tensorrt | PyTorch | Mixed | 2 | 2 | 32 | 5.878 | 10,836 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 1 | 128 | 9.371 | 13,629 inf/sec | 24.09-py3 |
| TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 128 | 9.792 | 26,099 inf/sec | 24.09-py3 |

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 13,768 images/sec | - | 0.58 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| | 128 | 30,338 images/sec | - | 4.22 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| BERT-LARGE | 8 | 2,308 sequences/sec | - | 3.47 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| | 128 | 4,045 sequences/sec | - | 31.64 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |

BERT-Large: Sequence Length = 128

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most rigorous way to test whether an AI system is ready to be deployed in the field and deliver meaningful results.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More