Problem: slow LLM inference speed on Jetson AGX Orin 64GB
On an “NVIDIA Jetson AGX Orin 64GB”, I deployed an LLM and ran an inference service with the official “Ollama” Docker image, but found that the inference speed was slow, only about 50% of NVIDIA’s published numbers (Benchmarks - NVIDIA Jetson AI Lab).
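For context, the container was started roughly like this (a sketch; my exact flags and the “/ssd/ollama” mount path are illustrative):

```bash
# Launch the official Ollama image with GPU access on Jetson.
# --runtime nvidia is needed on JetPack so the container can use
# the integrated GPU; /ssd/ollama is where the models are kept.
docker run -d --runtime nvidia \
  -v /ssd/ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```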
I have tried to find the cause and improve the speed, but nothing has worked so far.
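For reference, this is how the throughput can be measured; the “eval rate” line (tokens/s) is what I compare against the benchmark table (the model name here is just an example):

```bash
# Print Ollama's timing stats; the "eval rate" line is the
# generation throughput in tokens/s.
ollama run llama2 --verbose "Why is the sky blue?"

# Alternatively, query the HTTP API: eval_count / eval_duration
# (duration is in nanoseconds) gives tokens/s.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}'
```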
Some environment info from my Orin system:
LSB_RELEASE: Ubuntu 20.04
CUDA_VERSION: 12.2
L4T_VERSION: 35.4.1
JETPACK_VERSION: 5.1
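(These values can be read back as follows; paths assume a standard JetPack install.)

```bash
lsb_release -ds                                # Ubuntu release
cat /etc/nv_tegra_release                      # L4T version string
apt-cache show nvidia-jetpack | grep Version   # JetPack version
nvcc --version                                 # CUDA toolkit version
```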
Some of the things I’ve tried:
Changed the “Power Mode” of the Jetson AGX Orin to MAXN (see the first sketch below).
Migrated the Docker data directory (data-root) to the SSD, and the LLM models are also stored on the SSD (second sketch below).
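The power-mode change was done with “nvpmodel”; “jetson_clocks” can additionally pin the clocks at maximum:

```bash
sudo nvpmodel -m 0     # mode 0 = MAXN on the AGX Orin
sudo nvpmodel -q       # confirm the active power mode
sudo jetson_clocks     # lock clocks at max (does not persist across reboots)
```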
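And the data-root migration, roughly (“/ssd/docker” is illustrative for my mount point):

```bash
# Move Docker's data directory to the SSD, then point "data-root"
# at it. Merge the key into the existing /etc/docker/daemon.json
# rather than overwriting it, since JetPack ships the NVIDIA
# runtime configuration in that file.
sudo systemctl stop docker
sudo rsync -a /var/lib/docker/ /ssd/docker/
#   /etc/docker/daemon.json should then contain:
#   { ..., "data-root": "/ssd/docker" }
sudo systemctl start docker
docker info | grep "Docker Root Dir"   # verify the new root
```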
And I applied some tricks to improve “Ollama” inference speed: