Jetson Orin Nano: small local models run extremely slowly

Hi community!

We recently purchased the Jetson Orin Nano (Developer Kit) and installed JetPack 6. After launching it, we proceeded with the initial tutorials, such as this text-to-text tutorial: text-generation-webui - NVIDIA Jetson AI Lab.

I followed the tutorial instructions precisely. However, I encountered an issue where the chatbot responds extremely slowly:

Model: llama-2-7b-chat.Q4_0.gguf
Model loader: llama.cpp
n_batch: 512

As the tutorial suggested, I set n-gpu-layers to 128. When I did this, the Jetson froze, and I had to restart it by unplugging it and plugging it back in.

With n-gpu-layers: 0, the chatbot at least works, but it is still extremely slow, and the Jetson becomes very sluggish overall.

Am I missing something in my setup? It seems like the model isn’t running on the GPU, given how slow it is.

Here is the terminal output for reference:

Terminal:

Output generated in 1211.71 seconds (0.02 tokens/s, 29 tokens, context 68, seed 806630856)
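For scale, the figures in that log line can be converted to seconds per token. This is only a small arithmetic sketch using the two numbers reported above (29 tokens, 1211.71 seconds); the function name is made up for illustration.

```python
def throughput(tokens, seconds):
    """Return (tokens per second, seconds per token) for one generation."""
    return tokens / seconds, seconds / tokens

# Values copied from the log line above.
rate, latency = throughput(29, 1211.71)
print(f"{rate:.3f} tokens/s")      # 0.024 tokens/s
print(f"{latency:.1f} s per token")  # 41.8 s per token
```

So each token took roughly 42 seconds, which is consistent with the 0.02 tokens/s that the web UI printed.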

I would appreciate any advice or insights on how to resolve this issue. Thank you!

Hi @vmikala, setting n-gpu-layers=0 disables CUDA, so the model is running on the CPU only. See here for steps for mounting swap and freeing up more memory:
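Before retrying with GPU offload, it may help to check how much RAM and swap are actually free, since a freeze like the one described could be the board running out of memory (the Jetson's CPU and GPU share the same RAM). Below is a minimal sketch, assuming the standard Linux /proc/meminfo format. The helper names and the 1 GiB headroom figure are assumptions for illustration, not values from the tutorial.

```python
import re

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kiB."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"([^:]+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

def model_fits(model_bytes, meminfo, headroom_bytes=1 << 30):
    """Rough check: does the model file plus some headroom fit in
    available RAM plus free swap? The headroom (1 GiB by default) is an
    assumed allowance for the KV cache and runtime, not a measured value."""
    available_kib = meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)
    return model_bytes + headroom_bytes <= available_kib * 1024

# Example usage on the Jetson (the model path is hypothetical):
#   import os
#   with open("/proc/meminfo") as f:
#       info = parse_meminfo(f.read())
#   print(model_fits(os.path.getsize("llama-2-7b-chat.Q4_0.gguf"), info))
```

If this returns False, adding swap or closing other applications (for example the desktop session or a browser) before loading the model is probably worth trying first.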

Also, llama.cpp is not the most optimized backend, so I wouldn’t get too hung up on it before moving on. You might want to give Ollama a shot next (there is a page for it on Jetson AI Lab too). It also uses llama.cpp underneath, but it is generally easier to use and works better out of the box.
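If you do try Ollama, one way to compare speeds is to read the timing fields it returns. The sketch below assumes Ollama is running locally on its default port (11434) and uses its /api/generate endpoint with streaming turned off. The model name in the usage comment is only a placeholder, so substitute whichever model you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def generate(model, prompt, url=OLLAMA_URL):
    """Send one non-streaming generation request and return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def tokens_per_second(result):
    """Compute generation speed from Ollama's reply.
    eval_count is the number of generated tokens and eval_duration is
    the time spent generating them, reported in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)

# Example usage (placeholder model name; requires a running Ollama server):
#   reply = generate("some-small-model", "Hello, who are you?")
#   print(f"{tokens_per_second(reply):.2f} tokens/s")
```

That gives a number you can compare directly with the tokens/s figure printed by text-generation-webui.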

There are also smaller language models on this page that can be a better fit for the Nano:
