NVIDIA NIMs are ready-to-run, pre-packaged containerized models. Each NIM and its included model is available in a variety of profiles supporting different compute hardware configurations. You can run a NIM in an interrogatory mode that reports which of its profiles are compatible with your GPU hardware, and then run the NIM with the matching profile.
Sometimes there are still problems, and we have to add extra tuning parameters to fit in memory or to change data types. In my case, the data-type change works around what appears to be a bug in the NIM startup detection code.
This article requires additional polish. It has more than a few rough edges.
NVIDIA NIMs are semi-opaque: you cannot build your own NIM, and NVIDIA does not describe how they are constructed.
Examining NVIDIA Model Container Images
The first step is to select models we think can fit and run on our NVIDIA GPU hardware. Start by investigating the different model types in the appropriate NVIDIA NIM docs.
Test platform and plan
Our basic plan is:
- Run this test on Ubuntu Linux
- Host the models locally on my single-card system with an NVIDIA Titan RTX 24GB
- Use the CLI for all testing
Because of hardware limitations, I will be using non-optimized models, as described on the support matrix page.
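Before pulling any images, it is worth confirming what the NIM containers will see on the host. This is just a sanity check with the standard nvidia-smi tool; the query fields below are ones I happen to use, not anything NIM-specific.
# Confirm the GPU, its total VRAM, and the driver version before pulling NIM images
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv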
Some Models
Prerequisites
You need to create credentials that can be used by the various commands. Without them, you will get a 403 Forbidden when trying to pull down a container image.
- This assumes you have an nvapi- type key that gives you access
- Log into nvcr.io using the Docker CLI. You may see logins in some of the script logs below.
(base) joe@rocks:~$ docker login nvcr.io
Username: $oauthtoken
Password: <nvapi key here>
WARNING! Your password will be stored unencrypted in /home/joe/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
A Video demonstration
https://youtu.be/JtufRBLFwxM
Validating compatibility
We can ask the models to validate themselves against our hardware. Note that each one of these tests downloads the model image to your machine. The downloads can be large and consume a lot of disk space; in my case, the intermediate layers alone took 29.8GB.
REPOSITORY SIZE
project-hybrid-rag 29.7GB
nvcr.io/nim/meta/llama-3.1-405b-instruct 12.9GB
nvcr.io/nim/meta/llama-3.1-8b-instruct 12.9GB
nvcr.io/nim/meta/llama-3.1-70b-instruct 12.9GB
nvcr.io/nim/meta/llama-3.1-8b-base 12.9GB
project-nim-anywhere 4.59GB
nvcr.io/nim/nvidia/megatron-1b-nmt 14.3GB
nvcr.io/nim/nvidia/parakeet-ctc-1.1b-asr 14.3GB
nvcr.io/nim/nvidia/fastpitch-hifigan-tts 14.3GB
nvcr.io/nim/snowflake/arctic-embed-l 15.7GB
nvcr.io/nim/nvidia/nv-embedqa-e5-v5 15.7GB
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2 15.7GB
nvcr.io/nim/snowflake/arctic-embed-l 15.7GB
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2 15.7GB
nvcr.io/nim/nvidia/nv-embedqa-e5-v5 15.7GB
<none> 4.34GB
nvcr.io/nim/mistralai/mixtral-8x22b-instruct-v01 12.5GB
nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v01 12.5GB
project-gpu-sample 22.1GB
nvcr.io/nim/mistralai/mistral-7b-instruct-v03 12.5GB
covid-vaccinations-python 1.88GB
project-covid-vaccinations-python 1.23GB
project-dogfood 1.23GB
nvcr.io/nim/meta/llama3-70b-instruct 12.5GB
nvcr.io/nim/meta/llama3-8b-instruct 12.5GB
redis 116MB
milvusdb/milvus 1.71GB
traefik 153MB
The above output shows the various images I downloaded during testing. All of them were pulled as a side effect of running
docker run ... list-model-profiles
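For reference, the general form of the probe looks like the sketch below (any of the image names above can be substituted), and standard Docker housekeeping commands can reclaim the space afterwards.
# Probe one image for compatible profiles (example image; substitute the NIM you care about)
docker run --rm --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles
# See how much space images and layers are using, then clean up
docker system df
docker image prune                 # removes dangling layers
docker rmi <repository:tag>        # removes a specific image you no longer need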
Local RTX 3060 Ti 8GB
None of these models will fit on this card.
nvcr.io/nim/meta/llama3-70b-instruct:1.0.0
Note that this model is reported as not compatible with this system. I'm not sure why it shows up as "incompatible" rather than "needs more memory".
$ docker run --gpus all nvcr.io/nim/meta/llama3-70b-instruct:1.0.0 list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
SYSTEM INFO
- Free GPUs: <None>
- Non-free GPUs:
- [2489:10de] (0) NVIDIA GeForce RTX 3060 Ti [current utilization: 5%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
- 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
- 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
- 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
- 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
- a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
- 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
- b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
- abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
- 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
- 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
- df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
- 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
- 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
- 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
- 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
- a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
Used by NIM Anywhere as of 2024/07
There is a profile that is compatible with the RTX 3060 Ti, but it can't be run because there is not enough VRAM.
$ docker run --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
SYSTEM INFO
- Free GPUs: <None>
- Non-free GPUs:
- [2489:10de] (0) NVIDIA GeForce RTX 3060 Ti [current utilization: 5%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Compatible with system but not runnable due to low GPU free memory
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
- With LoRA support:
- 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
- dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
- f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
- 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
- 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
- a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
- e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
- 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
- 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
- 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
- c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
- cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
- d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
- cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
- 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
- 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
- c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
Validating compatibility with local Titan RTX 24GB
nvcr.io/nim/meta/llama3-70b-instruct:1.0.0
This system is not compatible with the model. I'm not sure why.
$ docker run --gpus all nvcr.io/nim/meta/llama3-70b-instruct:1.0.0 list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
SYSTEM INFO
- Free GPUs:
- [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
- 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
- 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
- 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
- 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
- a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
- 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
- b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
- abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
- 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
- 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
- df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
- 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
- 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
- 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
- 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
- a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
Used by NIM Anywhere as of 2024/07
There is a profile of this model that is compatible and that fits inside our 24GB of VRAM.
$ docker run --gpus all nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
SYSTEM INFO
- Free GPUs:
- [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
- 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
- With LoRA support:
- 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
- dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
- f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
- 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
- 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
- a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
- e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
- 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
- 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
- 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
- c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
- cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
- d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
- 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
- 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
- cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
- 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
- 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
- c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
mistral-7B-instruct-v0.3
There is a profile that will run on my Titan RTX.
$ docker run --gpus all nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).
SYSTEM INFO
- Free GPUs:
- [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable:
- 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
- With LoRA support:
- 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
- Incompatible with system:
- 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
- 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
- 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
- cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
- dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
- 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
- ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
- c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
- 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
- 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
- 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
- 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
- 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
- eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
- 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)
mixtral-8x7b-instruct-v0.1
This will not run on my hardware.
$ docker run --gpus all nvcr.io/nim/mistralai/mixtral-8x7b-instruct-v01:latest list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mixtral-8x7b-instruct-v0.1
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).
SYSTEM INFO
- Free GPUs:
- [1e02:10de] (0) NVIDIA TITAN RTX [current utilization: 2%]
MODEL PROFILES
- Compatible with system and runnable: <None>
- Incompatible with system:
- d37580fa5deabc5a4cb17a2337e8cc672b19eaf2791cf319fd16582065e40816 (tensorrt_llm-h100-fp8-tp4-latency)
- 00056b81c2e41eb9b847342ed553ae88614f450f3f15eebfd2ae56174484bacd (tensorrt_llm-h100-fp8-tp2-throughput)
- e249e70e3ee390e606782eab19e7a9cf2aeb865bdbc638aaf0fc580901492841 (tensorrt_llm-a100-fp16-tp4-latency)
- 9972482479f39ecacc3f470aaa7d0de7b982a1b18f907aafdb8517db5643e05a (tensorrt_llm-h100-fp16-tp4-latency)
- ad3c46c1c8d71bb481205732787f2c157a9cfc9b6babef5860518a047e155639 (tensorrt_llm-l40s-fp16-tp4-throughput)
- 9865374899b6ac3a1e25e47644f3d66753288e9d949d883b14c3f55b98fb2ebc (tensorrt_llm-a100-fp16-tp2-throughput)
- 1f859af2be6c57528dc6d32b6062c9852605d8f2d68bbe76a43b65ebc5ac738d (tensorrt_llm-h100-fp16-tp2-throughput)
- ee616a54bea8e869009748eefb0d905b168d2095d0cdf66d40f3a5612194d170 (tensorrt_llm-h100-int8wo-tp4-latency)
- 01f1ad019f55abb76f10f1687f76ea8e5d2f3d51d6831ddc582d979ff210b4cb (tensorrt_llm-h100-int8wo-tp2-throughput)
- da767e18d66e067f2c5c2c2257171b8b8331801fffdea98fc8e48b8958549388 (vllm-fp16-tp4)
- 37d1a6210357770f7f6fe5fcdb5f8da11e3863a7274ccde8ff97e4ffc7d17006 (vllm-fp16-tp2)
- 2289b29507d987154efb5ff12b41378323147e28ba2660490737cb8d8544d039 (vllm-fp16-tp4-lora)
- a0aff0e3bf2062cc42f13556b28eb66a0764d8d57c42c20dd8814f919118a127 (vllm-fp16-tp2-lora)
Running the model containers
I want to run the models on my Turing generation 24GB card.
You can find the model container image run commands on the NVIDIA NIM Large Language Models documentation page. The container run command uses the image we previously downloaded as part of validation.
Note: Some of the images download additional model information on startup. In my experience, this happens once.
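To keep those one-time downloads across container restarts, I create the cache directory up front; the run commands below mount it into the container as /opt/nim/.cache.
# One-time: create a host-side cache directory that the containers can write to
mkdir -p ~/.cache/nim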
The following command runs a model with a specific profile. It fails on my machine because it tries to grab 32GB of video memory; the profile is compatible, but the card is slightly too small.
Note: MY_API_KEY is the nvapi- key you got for the model
docker run -it --rm --gpus all --shm-size=16GB \
-e NGC_API_KEY=$MY_API_KEY \
-e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 \
-v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest
The following command runs a model with a specific profile and data type, and sets the maximum model length so it fits on the 24GB card.
docker run -it --rm --gpus all --shm-size=16GB \
-e NGC_API_KEY=$MY_API_KEY \
-e NIM_MODEL_PROFILE=7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 \
-v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest \
python3 -m vllm_nvext.entrypoints.openai.api_server --dtype half --max-model-len 26688
Why 26688? I first tried 26000, then 28000. The 28000 run failed and reported that 26688 was the maximum value I could use. It looks like I could have set it as low as 18000. This model uses 19.5GB of GPU memory.
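To see the memory numbers yourself, you can watch the card while the NIM starts up; this is just plain nvidia-smi polling, nothing NIM-specific.
# Poll GPU memory use every 5 seconds while the model loads
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5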
Impact of shrinking the model length
From an NVIDIA forum post:
Shrinking the sequence length is a good way of decreasing the memory requirements – basically, it means that the size of the KV cache is limited, which can be a very large portion of the memory usage. The downside is that you won’t be able to send/generate messages that are quite as long. Otherwise, the model accuracy shouldn’t be affected.
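A rough back-of-the-envelope calculation shows why the sequence length matters so much. Assuming the commonly published Mistral-7B-v0.3 shape (32 layers, 8 KV heads, head dimension 128) and fp16 (2-byte) cache entries, the KV cache per token and at the 26688-token limit works out roughly as follows; treat the architecture numbers as assumptions, not something the NIM reports.
# 2 (one K + one V) * layers * kv_heads * head_dim * bytes_per_element
echo $(( 2 * 32 * 8 * 128 * 2 ))           # 131072 bytes, about 128 KiB per token
echo $(( 2 * 32 * 8 * 128 * 2 * 26688 ))   # about 3.5 GB at a 26688-token limit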
Testing
A simple call to the deployed NIM endpoint
$ curl -X 'POST' 'http://localhost:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "mistralai/mistral-7b-instruct-v0.3",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"top_p": 1,
"n": 1,
"max_tokens": 15,
"stream": true,
"frequency_penalty": 1.0,
"stop": ["hello"]
}'
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"S"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ur"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"e! I "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"will w"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ri"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"te a s"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"hort"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" and si"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"mple "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"song"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" for"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"y"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-36fcf3b21b004d07b5959e3473f0ef70","object":"chat.completion.chunk","created":1723553206,"model":"mistralai/mistral-7b-instruct-v0.3","choices":[{"index":0,"delta":{"content":"ou:\n\n"},"finish_reason":"length"}],"usage":{"prompt_tokens":36,"total_tokens":51,"completion_tokens":15}}
Running embedding and reranker models locally
We can run the NIM Anywhere text embedding and re-ranker models locally if we have enough GPU space; we'll have to run the Docker containers manually. Both of these containers fetch model components on startup, and that download requires an API key. It is not the "nvapi-..." key you used to pull the containers or log into the image repository. There are two keys.
NVIDIA Embedding QA E5 Embedding Model
Purpose: GPU-accelerated generation of text embeddings used for question-answering retrieval.
export NGC_API_KEY=<an api key>
export NIM_MODEL_NAME=nvidia/nv-embedqa-e5-v5
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)
# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
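Once it is up, the embedding endpoint can be exercised with a request like the sketch below. The field names (input, input_type) follow my reading of the embedding NIM docs and should be treated as assumptions.
curl -X 'POST' 'http://localhost:8000/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nvidia/nv-embedqa-e5-v5",
        "input": ["How do NIM model profiles work?"],
        "input_type": "query"
      }'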
Make the following changes when deploying rather than testing; we want to run detached and will use the container only from within the container cluster:
- Remove the line with "-it"
- Remove the line with "-p 8000:8000"
NVIDIA rerankqa-mistral-4b-v3
Purpose: GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.
export NGC_API_KEY=<an api key>
export NIM_MODEL_NAME=nvidia/nv-rerankqa-mistral-4b-v3
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)
# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
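A quick smoke test of the reranker can look like the sketch below; the /v1/ranking path and the query/passages shape reflect my understanding of the reranking NIM API and are assumptions worth checking against the current docs.
curl -X 'POST' 'http://localhost:8000/v1/ranking' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nvidia/nv-rerankqa-mistral-4b-v3",
        "query": {"text": "which profile runs on a 24GB card?"},
        "passages": [
          {"text": "the vllm-fp16-tp1 profile fits on a single 24GB GPU"},
          {"text": "the weather today is sunny with a light breeze"}
        ]
      }'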
Make the following changes when deploying rather than testing; we want to run detached and will use the container only from within the container cluster (a sketch of this variant follows the list):
- Remove the line with "-it"
- Remove the line with "-p 8000:8000"
A script that uses compatible profiles
This is a script I used to try different profiles on my NVIDIA Turing-generation hardware. The profiles were verified using the commands above.
export MY_API_KEY="nvapi-YOUR-KEY"
MODEL="nvcr.io/nim/meta/llama-3.1-8b-instruct:latest"
MODEL_PROFILE="3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5"
MODEL_MAX_LEN=19456
# MODEL="nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest"
# MODEL_PROFILE="7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789"
# MODEL_MAX_LEN=26688
# MODEL="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
# MODEL_PROFILE="8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
# MODEL_MAX_LEN=8192
#docker login --username $oauthtoken --password $MY_API_KEY nvcr.io
docker login nvcr.io
docker run --rm --gpus all --shm-size=16GB \
-e NGC_API_KEY=$MY_API_KEY \
-e NIM_MODEL_PROFILE=$MODEL_PROFILE \
-v "/home/joe/.cache/nim:/opt/nim/.cache" -u $(id -u) -p 8000:8000 \
$MODEL \
python3 -m vllm_nvext.entrypoints.openai.api_server \
--dtype half \
--max-model-len $MODEL_MAX_LEN
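When adding a new model to this script, I look up its runnable profile ID the same way as before; the grep below just trims the long listing down to the lines that matter.
# Find the profile IDs reported as runnable on this machine for the selected image
docker run --rm --gpus all "$MODEL" list-model-profiles | grep -A 3 "runnable"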
Links
Revision History
Created 2024/08
Added text and Riva support matrix links 2024/08
Corrected NVIDIA capitalization 2025/07