Problem: slow LLM inference speed on Jetson AGX Orin 64GB
On an “NVIDIA Jetson AGX Orin 64GB”, I deployed an LLM and ran an inference service with the official “Ollama” Docker image, but found that the inference speed was slow, only about 50% of NVIDIA’s published numbers (Benchmarks - NVIDIA Jetson AI Lab).
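For context, the container was started roughly like this (a sketch; my exact flags and the “/ssd/ollama” mount path are illustrative):

```bash
# Launch the official Ollama image with GPU access on Jetson.
# --runtime nvidia is needed on JetPack so the container can use
# the integrated GPU; /ssd/ollama is where the models are kept.
docker run -d --runtime nvidia \
  -v /ssd/ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```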
I have tried to find the cause and improve the speed, but nothing has worked so far.
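For reference, this is how the throughput can be measured; the “eval rate” line (tokens/s) is what I compare against the benchmark table (the model name here is just an example):

```bash
# Print Ollama's timing stats; the "eval rate" line is the
# generation throughput in tokens/s.
ollama run llama2 --verbose "Why is the sky blue?"

# Alternatively, query the HTTP API: eval_count / eval_duration
# (duration is in nanoseconds) gives tokens/s.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sky blue?", "stream": false}'
```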
Some environment info from my Orin system:
LSB_RELEASE: Ubuntu 20.04
CUDA_VERSION: 12.2
L4T_VERSION: 35.4.1
JETPACK_VERSION: 5.1
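(These values can be read back as follows; paths assume a standard JetPack install.)

```bash
lsb_release -ds                                # Ubuntu release
cat /etc/nv_tegra_release                      # L4T version string
apt-cache show nvidia-jetpack | grep Version   # JetPack version
nvcc --version                                 # CUDA toolkit version
```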
Some of the things I’ve tried:
Changed the “Power Mode” of the Jetson AGX Orin to MAXN (see the first sketch below).
Migrated the Docker data directory (data-root) to the SSD, and the LLM models are also stored on the SSD (second sketch below).
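The power-mode change was done with “nvpmodel”; “jetson_clocks” can additionally pin the clocks at maximum:

```bash
sudo nvpmodel -m 0     # mode 0 = MAXN on the AGX Orin
sudo nvpmodel -q       # confirm the active power mode
sudo jetson_clocks     # lock clocks at max (does not persist across reboots)
```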
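And the data-root migration, roughly (“/ssd/docker” is illustrative for my mount point):

```bash
# Move Docker's data directory to the SSD, then point "data-root"
# at it. Merge the key into the existing /etc/docker/daemon.json
# rather than overwriting it, since JetPack ships the NVIDIA
# runtime configuration in that file.
sudo systemctl stop docker
sudo rsync -a /var/lib/docker/ /ssd/docker/
#   /etc/docker/daemon.json should then contain:
#   { ..., "data-root": "/ssd/docker" }
sudo systemctl start docker
docker info | grep "Docker Root Dir"   # verify the new root
```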
And I applied some tricks to improve “Ollama” inference speed: