Release Notes for NVIDIA NIM for LLMs#

This documentation contains the release notes for NVIDIA NIM for Large Language Models (LLMs).

Release 1.13.0#

Summary#

NVIDIA NIM for LLMs 1.13.0 delivers significant new features and improvements. Key updates include the introduction of thinking budget control for more efficient management of reasoning model resources, the addition of the best_of sampling parameter for enhanced result selection on the completions endpoint, and new hardware support for NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs.

New Features#

The following are the new features in 1.13.0:

  • Support for NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs

  • Thinking budget control: Limit how many thinking tokens a model can generate before it must produce its final answer. This feature helps manage latency and cost for reasoning models. For setup instructions and more details, see Thinking Budget Control (Thinking-Token Limiter).

    • Thinking budget control is not supported on the SGLang backend.

  • Best of N completions: Use the best_of parameter to generate multiple candidate completions server-side and return the one(s) with the highest cumulative log probability. For usage details, see API Reference for NVIDIA NIM for LLMs and the sketch after this list.

    • Supported only on the TRT-LLM backend and only for the /v1/completions endpoint.

    • best_of cannot be used when stream=true.

    • The n parameter now allows values greater than 1, enabling you to request multiple top results in a single call (best_of must be >= n).
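
The following is a minimal sketch of requesting best-of-N completions through the OpenAI-compatible /v1/completions endpoint with the openai Python client. The base URL, API key, model name, and prompt are placeholders; adjust them for your deployment.

    from openai import OpenAI

    # Placeholder base URL, API key, and model name for a locally deployed NIM.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

    # Generate 4 candidate completions server-side and return the 2 with the
    # highest cumulative log probability. best_of must be >= n, and streaming
    # must be disabled (best_of cannot be combined with stream=true).
    response = client.completions.create(
        model="meta/llama-3.1-8b-instruct",
        prompt="Write a one-sentence summary of what an LLM inference server does.",
        max_tokens=64,
        n=2,
        best_of=4,
    )

    for choice in response.choices:
        print(choice.index, choice.text)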

Known Issues Fixed in 1.13.0#

The following are the previous known issues that were fixed in 1.13.0:

  • No known issues were fixed in this release.

New Known Issues in 1.13.0#

The following are the new known issues discovered in 1.13.0:

  • LoRA deployments on TRT-LLM and vLLM backends can experience significant latency and throughput degradation. The performance impact can vary by model size and configuration.

Previous Releases#

The following are links to the previous release notes.

1.12 | 1.11 | 1.10 | 1.8 | 1.7 | 1.6 | 1.5 | 1.4 | 1.3 | 1.2 | 1.1 | 1.0

All Current Known Issues#

The following are the current (unfixed) known issues from all previous versions:

Tip

For related information, see Troubleshoot NVIDIA NIM for LLMs.

General#

  • The top_logprobs parameter is not supported.

  • All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.

  • If you provide a custom fine-tuned model directory, filenames in the directory must not contain spaces.

  • Some stop words might not work as expected and might appear in the output.

  • The maximum supported context length may decrease based on memory availability.

  • Structured generation with regular expressions may produce unexpected responses. We recommend that you provide a strict answer format, such as \boxed{}, to get the correct response.

  • For fp8-quantized models, the logs incorrectly display the quantization as bf16.

Deployment and Environment#

  • Deploying with KServe can require changing permissions for the cache directory. See the Serving models from local assets section for details.

  • On GH200, NVIDIA drivers earlier than 560.35.03 can cause a segmentation fault or a hang during deployment. This is fixed in GPU driver 560.35.03.

  • Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.

  • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.

  • The container may crash when building local TensorRT LLM engines if there isn’t enough host memory. If that happens, try setting NIM_LOW_MEMORY_MODE=1.

  • The out-of-the-box (OOB) sequence length with tensorrt_llm-local_build is 8K. Use the NIM_MAX_MODEL_LEN environment variable to modify the sequence length within the range of values supported by a model.

  • NIM does not support Multi-instance GPU mode (MIG).

  • vGPU related issues:

    • trtllm_buildable profiles might encounter an Out of Memory (OOM) error on vGPU systems, which can be fixed by setting NIM_LOW_MEMORY_MODE=1.

    • When using vGPU systems with trtllm_buildable profiles, you might still encounter a broken connection error. For example, client_loop: send disconnect: Broken pipe.

  • vLLM for A100 and H200 is not supported.

  • NIM with vLLM backend may intermittently enter a state where the API returns a “Service in unhealthy” message. This is a known issue with vLLM (vllm-project/vllm#5060). You must restart the NIM in this case.

  • You can’t deploy fp8 quantized engines on H100-NVL GPUs with deterministic generation mode on. For more information, refer to Deterministic Generation Mode in NVIDIA NIM for LLMs.

  • INT4/INT8 quantized profiles are not supported for Blackwell GPUs.

  • When using the Native TLS Stack to download the model, set --ulimit nofile=1048576 in the docker run command. If a Helm deployment runs behind a proxy, the limit must be increased on the host nodes or a custom command must be provided. See Deploying Behind a TLS Proxy for details.

  • Air Gap Deployments of a model like Llama 3.3 Nemotron Super 49B that uses the model directory option might not work if the model directory is in the HuggingFace format. Switch to using NIM_FT_MODEL in those cases. For more information, refer to Air Gap Deployment.

  • Llama-3.1-Nemotron-Ultra-253B-v1 does not work on H100s and A100s. Use H200s and B200s to deploy successfully.

  • Models with 8 billion parameters require NIM_KV_CACHE_PERCENT=0.8 for tp=1 profiles.

  • NIM_ENABLE_PROMPT_LOGPROBS=1 is not supported for the TRTLLM backend.

Model Support and Functionality#

  • Many models return a 500 error when using structured generation with context-free grammar.

  • Function calling and structured generation are not supported for pipeline parallelism greater than 1.

  • Locally built fine-tuned models are not supported with FP8 profiles.

  • P-Tuning isn’t supported.

  • A No tokenizer found error can appear when running PEFT. This warning can be safely ignored.

  • vLLM + LoRA profiles for long-context models (model_max_len > 65528) will not load, resulting in ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled. As a workaround, set NIM_MAX_MODEL_LEN=65525 or lower.

  • When repetition_penalty=2, the response time for larger models is greater. Use repetition_penalty=1 on larger models.

  • The following are the known issues with function calling:

    • Format enforcement is not guaranteed by default. The tool_choice parameter no longer supports required as a value, despite its presence in the OpenAPI spec. This might impact the accuracy of tool calling for some models.

    • Function calling no longer uses guided decoding, resulting in lower accuracy for smaller models like Llama 3.2 1B/3B Instruct.

    • Smaller parameter models (<= 3 billion parameters) can have tool calling enabled but are highly inaccurate due to their limited parameter count. We don’t recommend using such models for tool calling use cases.

  • The following are the known issues with the custom guided decoding backend:

    • The fast_outlines backend is deprecated.

    • Guided decoding now defaults to xgrammar instead of outlines. For more information, refer to Structured Generation with NVIDIA NIM for LLMs.

    • The guided decoding backend cannot be accessed without using constraint fields. Set guided_regex to ".*" to act as a minimal trigger for the guided decoding backend, as shown in the sketch after this list.

    • The outlines guided decoding backend is not supported for sglang profiles.

    • Custom guided decoding requires backends that implement the set_custom_guided_decoding_parameters method, as defined in the backend file.

    • Guided decoding does not work for TP > 1 for sglang profiles.

    • Deepseek R1 may produce less accurate results when using guided decoding.

  • DeepSeek models do not support tool calling.

  • LoRA does not work for mistral-nemo-12b-instruct.

  • The vLLM backend is not supported on Llama Nemotron models.
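
As noted in the custom guided decoding item above, the backend is engaged only when a constraint field is present. The following is a minimal sketch of using guided_regex=".*" as that minimal trigger, assuming the constraint is passed through the nvext extension field described in Structured Generation with NVIDIA NIM for LLMs; the base URL and model name are placeholders for your deployment.

    from openai import OpenAI

    # Placeholder base URL, API key, and model name for a locally deployed NIM.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

    # guided_regex=".*" places no real constraint on the output, but its
    # presence routes the request through the guided decoding backend.
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": "Name one primary color."}],
        extra_body={"nvext": {"guided_regex": ".*"}},
    )

    print(response.choices[0].message.content)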

API and Metrics#

  • The GET /v1/metrics API is missing from the docs page (http://HOST-IP:8000/docs, where HOST-IP is the IP address of your host).

  • Log probabilities (logprobs) support with echo:

    • The TRT-LLM engine must be built explicitly with --gather_generation_logits.

    • Enabling this may impact model throughput and inter-token latency.

    • NIM_MODEL_NAME must be set to the generated model repository.

  • logit_bias is not available for any model using the TRT-LLM backend.

  • logprobs=2 is supported only for TRT-LLM (optimized) configurations of Reward models; it is supported for vLLM (non-optimized) configurations of all models.

  • Empty metrics values on multi-GPU TensorRT-LLM models. The metrics items gpu_cache_usage_perc, num_request_max, num_requests_running, num_requests_waiting, and prompt_tokens_total are not reported for multi-GPU TensorRT-LLM models because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.

  • If ignore_eos=true, the model ignores EOS tokens and keeps generating until a custom stop token is encountered or the max token limit is reached (if the limit is not set, the default is 128 tokens). For vLLM and simple queries, we recommend using ignore_eos=false (the default); if you do enable it, bound generation explicitly, as shown in the sketch after this list.
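
The following is a minimal sketch of the recommendation above: leave ignore_eos at its default of false, or, if you enable it (for example, for benchmarking), bound generation explicitly with max_tokens and a stop sequence. It assumes ignore_eos is accepted as an extra request body field; the base URL, model name, and values are placeholders.

    from openai import OpenAI

    # Placeholder base URL, API key, and model name for a locally deployed NIM.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

    # With ignore_eos=true the model keeps generating past EOS, so cap the
    # output with max_tokens and a stop sequence to keep requests bounded.
    response = client.completions.create(
        model="meta/llama-3.1-8b-instruct",
        prompt="Benchmark prompt:",
        max_tokens=256,
        stop=["###"],
        extra_body={"ignore_eos": True},
    )

    print(response.choices[0].text)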

All Current Known Issues for Specific Models#

The following are the current (unfixed) known issues from all previous versions that are specific to a model:

Tip

For related information, see Troubleshoot NVIDIA NIM for LLMs.

  • Code Llama

    • FP8 profiles are not released due to accuracy degradations.

    • LoRA is not supported.

  • DeepSeek

    • The min_p sampling parameter is not compatible with DeepSeek models and will be set to 0.0.

    • The following are not supported for DeepSeek models:

      • LoRA

      • Guided Decoding

      • FT (fine-tuning)

    • DeepSeek models require setting --trust-remote-code. This is handled automatically in DeepSeek NIMs.

    • Only profiles matching the following hardware topologies are supported for the DeepSeek R1 model:

      • 2 nodes of 8xH100

      • 1 node of 8xH200

    • DeepSeek-R1 profiles disable DP attention by default to avoid crashes at higher concurrency. To turn on DP attention you can set NIM_ENABLE_DP_ATTENTION.

  • DeepSeek Coder V2 Lite Instruct does not support kv_cache_reuse for vLLM.

  • DeepSeek R1 Distill Llama 70B

    • This model does not include pre-built engines for TP8, A10G, and H100.

    • To deploy, set -e NIM_MAX_MODEL_LEN=131072.

  • DeepSeek R1 Distill Qwen 32B

    • BF16 profiles require at least 64GB of GPU memory to launch. For example, the vllm-bf16-tp1-pp1 profile does not launch successfully on a single L20 or other supported GPUs with less than 80GB of GPU memory.

    • Structured generation has unexpected behavior due to CoT output. Despite this, the guided_json parameter works as expected when used with a JSON schema prompt.

    • When running the vLLM engine on a GPU with less memory, you might encounter a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 or less when using a vLLM profile.

    • Using a trtllm_buildable profile with a fine-tuned model can crash on H100.

    • We recommend at least 80GB of CPU memory.

  • DeepSeek-R1-Distill-Qwen-14B

    • When running the vLLM engine with less than 48GB of GPU memory, you might encounter a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.

  • DeepSeek-R1-Distill-Qwen-7B

    • When running the vLLM engine on an A10G, you might encounter a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.

    • kv_cache_reuse is not supported.

    • The suffix parameter is not supported in API calls.

  • Gemma 2 9B

    • LoRA is not supported.

  • Gemma 2 2B

    • Does not support the System role in a chat or completions API call.

  • Gemma2 9B CPT Sahabat-AI v1 Instruct

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

    • Logs for this model can contain spurious Python errors. You can safely ignore them.

  • GPT-OSS-20B and GPT-OSS-120B

    • These models can only be deployed using the vLLM backend.

    • Custom decoding is not supported.

    • Tool-calling is not supported on the Chat Completions API due to an issue with the OpenAI Harmony API.

    • The Chat Completions API may return a 500 error if the request uses "ignore_eos": true due to an issue with the OpenAI Harmony API.

    • GH200 and GB200 GPUs are not supported for this model release.

    • The Responses API is experimental.

    • When passing the payload using the Responses API, background fill is disabled.

    • GPT-OSS-120B requires at least 2 GPUs when using an H100 SXM.

  • Granite 3.3 8B Instruct

    • Thinking is not supported.

    • Tool calling is not supported.

  • Kanana 1.5 8B Instruct 2505

    • Setting NIM_TOKENIZER_MODE=slow is not supported.

    • The server returns a 500 status code (or a 200 status code and a BadRequest error) when logprobs is set to 0 in the request.

  • Llama 3.3 Nemotron Super 49B V1

    • If you set NIM_MANIFEST_ALLOW_UNSAFE to 1, deployment fails.

    • Throughput and latency degradation in the 5–10% range observed for BF16 profiles compared to previous NIM releases, and slight degradation compared to OS vLLM, specifically for ISL/OSL=5k/500 at concurrencies greater than 100. Set NIM_DISABLE_CUDA_GRAPH=1 when running BF16 profiles.

    • Caching engines built for supervised fine-tuning (SFT) models don’t work.

    • The model might occasionally bypass its typical thinking patterns for certain queries, especially in multi-turn conversations (for example, <think> \n\n </think>).

    • Listing the profiles for this model when the local cache is enabled can result in log warnings, which do not impact NIM functionality.

    • Logs for this model can contain spurious warnings. You can safely ignore them.

    • Avoid using the logit_bias parameter with this model because the results are unpredictable.

    • If you send more than 15 concurrent requests with detailed thinking on, the container may crash.

  • Llama 3.3 Nemotron Super 49B V1.5

    • The log indicates errors when listing profiles with the NIM cache disabled. This doesn’t impact NIM functionality.

    • By default, the model responds in reasoning ON mode. Add /no_think to the system prompt to enable reasoning OFF mode.

  • Llama 3.3 70B Instruct

    • At least 400GB of CPU memory is required.

    • Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.

    • Accuracy was noted to be lower than the expected range with profiles vllm-bf16-tp4-pp1-lora and vllm-bf16-tp8-pp1.

    • The suffix parameter isn’t supported in API calls.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

    • Performance degradation observed for the following profiles: tensorrt_llm-h100-bf16-8, tensorrt_llm-h100_nvl-bf16-8, and tensorrt_llm-h100-bf16-8-latency.

  • Llama 3.2 1B Instruct

    • LoRA is not supported for vLLM and TRT-LLM buildable profiles.

    • Accuracy degradation observed for profiles tensorrt_llm-h200-fp8-2-latency and tensorrt_llm-l40s-fp8-tp1-pp1-throughput-lora.

    • Performance degradation observed on the following profiles: tensorrt_llm-b200-fp8-2-latency, tensorrt_llm-a100-bf16-tp1-pp1-throughput-lora, tensorrt_llm-h100-fp8-tp1-pp1-throughput-lora, and on all non-LoRA vLLM profiles (vllm-a100-bf16-2, vllm-a10g-bf16-2, vllm-b200-bf16-2, vllm-h100_nvl-bf16-2, vllm-h100-bf16-2, vllm-h200-bf16-2, vllm-l40s-bf16-2, vllm-gh200_480gb-bf16-1, and vllm-rtx4090-bf16-1). Set NIM_DISABLE_CUDA_GRAPHS to check for improved performance.

    • If you provide an invalid value for chat_template in a chat API call, the server returns a 200 status code rather than a 400 status code.

  • Llama 3.2 3B Instruct

    • Parallel tool calling is not supported.

    • Performance degradation observed for profiles tensorrt_llm-h100-fp8-1-throughput and vllm-gh200_480gb-bf16-1.

    • Currently, TRT-LLM profiles with LoRA enabled show performance degradation compared to vLLM-LoRA profiles at low concurrencies (1 and 5).

    • When making requests that consume the maximum sequence length generation (such as using ignore_eos: True), generation time might be significantly longer and can exhaust the available KV cache, causing future requests to stall. In this scenario, we recommend that you reduce concurrency.

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

    • This NIM doesn’t include support for TRT-LLM buildable profiles.

    • Deploying a fine-tuned model fails for some TRT-LLM profiles when TP is greater than 1.

  • Llama 3.1 Nemotron Nano 4B V1.1

    • Accuracy degradation observed for profile tensorrt_llm-trtllm_buildable-bf16-tp2-pp1-lora-A100.

    • LoRA is not supported for vLLM profiles.

    • Performance degradation observed for the following vLLM profiles: vllm-h200-bf16-2, vllm-gh200_480gb-bf16-1, vllm-a10g-bf16-2, and vllm-l40s-bf16-2.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

  • Llama 3.1 Nemotron Nano 8B V1

    • Accuracy degradation observed for profile vllm-a10g-bf16-4.

    • Performance degradation observed for the following profiles: vllm-l40s-bf16-4, vllm-bf16-tp2-pp1-lora and tensorrt_llm-trtllm_buildable-bf16-tp2-pp1-lora.

  • Llama 3.1 Nemotron Ultra 253B V1

    • Accuracy degradation observed on the B200 GPU.

    • Accuracy degradation observed for the following prebuilt profiles: tensorrt_llm-h100-fp8-8-throughput, tensorrt_llm-h200-fp8-8-throughput, and tensorrt_llm-h100_nvl-fp8-8-throughput.

    • Accuracy degradation observed for the following buildable profile: tensorrt_llm-h200-bf16-8.

    • Performance degradation observed (compared to OS vLLM) for the following profiles when ISL>OSL and concurrency is >= 50: tensorrt_llm-b200-fp8-8-throughput and tensorrt_llm-b200-bf16-8.

    • TRT-LLM BF16 TP8 buildable profile cannot be deployed on A100 or H100.

    • Fine-tuned models with input vLLM checkpoints cannot be deployed on H100 GPUs due to out-of-memory (OOM) issues.

    • Tool calling is not supported if you set the nvext extension in the request.

    • Logs for this model can contain spurious warnings. You can safely ignore them.

    • The suffix parameter isn’t supported in API calls.

  • Llama 3.1 Swallow 8B Instruct v0.1

    • LoRA is not supported.

  • Llama 3.1 Typhoon 2 8B Instruct

    • Performance degradation observed on TRT-LLM profiles when ISL>OSL and concurrency is 100 or 250 for the following GPUs: H200, A100, and L40S.

    • The /v1/health and /v1/metrics API endpoints return incorrect response values and empty response schemas instead of the expected health status and metrics data.

  • Llama 3.1 405B Instruct

    • TRT-LLM BF16 TP16 buildable profile cannot be deployed on A100.

    • LoRA is not supported.

    • Throughput optimized profiles are not supported on A100 FP16 and H100 FP16.

    • vLLM profiles are not supported.

  • Llama 3.1 70B Instruct

    • Performance degradation observed for vLLM profiles on the following GPUs:

      • B200

      • H200

      • H200 NVL

      • H100

      • H100 NVL

      • A100

      • A100 40GB

      • L20

    • Performance degradation observed for TRT-LLM profiles on the following GPUs:

      • H200

      • H200 NVL

      • H100

    • Accuracy degradation observed for the following profiles:

      • H200 TRT-LLM

      • B200 FP8, TP2, LoRA

    • Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.

    • The suffix parameter isn’t supported in API calls.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

    • LoRA with A10G TP8 is not supported for both vLLM and TRT-LLM due to insufficient memory.

    • The performance of vLLM LoRA on L40S TP8 is significantly suboptimal.

    • Deploying with KServe fails. As a workaround, try increasing the CPU memory to at least 77GB in the runtime YAML file.

    • Buildable TRT-LLM BF16 TP4 LoRA profiles on A100 and H100 can fail due to insufficient host memory. You can work around this problem by setting NIM_LOW_MEMORY_MODE=1.

  • Llama 3.1 8B Base

    • vLLM profiles are not supported.

  • Llama 3.1 8B Instruct RTX

    • Creating a chat completion with a non-existent model returns a 500 error when it should return a 404.

  • Llama 3.1 8B Instruct

    • Performance degradation observed for the following profiles: vllm-b200-bf16-1 and vllm-b200-bf16-2.

    • LoRA is not supported on L40S with TRT-LLM.

    • H100 and L40s LoRA profiles can hang with high (>2000) ISL values.

    • For LoRA-enabled profiles, TTFT can be worse with the pre-built engines compared to the vLLM fallback, while throughput is better. If TTFT is critical, consider using the vLLM fallback.

    • For requests that consume the maximum sequence length generation (for example, requests that use ignore_eos: True), generation time can be very long and the request can consume the available KV cache causing future requests to stall. You should reduce concurrency under these conditions.

  • Llama 3.1 models

    • vLLM profiles fail with ValueError: Unknown RoPE scaling type extended.

  • Llama 3.1 FP8

    • Requires NVIDIA driver version >= 550.

  • Meta Llama 3 70B Instruct

    • LoRA isn’t supported in an 8x GPU configuration.

  • (Meta) Llama 2 70B Chat

    • The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other GPUs might encounter a “CUDA out of memory” issue.

  • Mistral NeMo Minitron 8B 8K Instruct

    • Tool calling is not supported.

    • LoRA is not supported.

    • vLLM TP4 or TP8 profiles are not available.

  • Mistral Small 24b Instruct 2501

    • Tool calling is not supported.

    • The suffix parameter is not supported in API calls.

    • This model requires at least 48GB of VRAM but cannot be launched on a single 48GB GPU such as L40S. Single-GPU deployment is only supported on GPUs with 80GB or more of VRAM (for example, A100 80GB or H100 80GB).

    • Setting NIM_TOKENIZER_MODE=slow is not supported.

  • Mistral 7B Instruct V0.3

    • Optimized TRT-LLM profiles have lower performance compared to open-source vLLM.

  • Mixtral 8x7B Instruct V0.1

    • Does not support function calling and structured generation on vLLM profiles. See vLLM #9433 for details.

    • LoRA is not supported with the TRT-LLM backend for MoE models.

    • vLLM LoRA profiles return an internal server error/500. Set NIM_MAX_LORA_RANK=256 to use LoRA with vLLM.

    • If you enable NIM_ENABLE_KV_CACHE_REUSE with the L40S FP8 TP4 Throughput profile, deployment fails.

    • Performance degradation observed for the following vLLM profiles: vllm-b200-bf16-2, vllm-a10g-bf16-8, and vllm-l40s-bf16-4.

  • Nemotron4 models

    • Require the use of ‘slow’ tokenizers; ‘fast’ tokenizers cause accuracy degradation.

  • Phi 3 Mini 4K Instruct

    • LoRA is not supported

    • Tool calling is not supported

  • Phind Codellama 34B V2 Instruct

    • LoRA is not supported

    • Tool calling is not supported

  • Qwen2.5 Coder 32B Instruct

  • Qwen2.5 72B Instruct

    • The alternative option to use vLLM is not supported due to poor performance. For GPUs that have no optimized version, use the trtllm_buildable feature to build the TRT-LLM engine on the fly.

    • For all pre-built engines, gather_context_logits is not enabled. If you require logits output, specify it in your own TRT-LLM configuration when you use the trtllm_buildable feature.

    • The tool_choice parameter is not supported.

    • Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang. Use WARNING, DEBUG or INFO instead.

  • Qwen2.5 7B Instruct

    • A pre-built TRT-LLM engine for L20 is available, but it is not fully optimized for different use cases.

    • LoRA is not supported.

    • The tool_choice parameter is not supported.

    • Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang.

    • Performance issues can occur in specific use cases when using the vLLM backend on L20.

  • Sarvam - M

    • Tool calling is not supported.

    • The suffix parameter is not supported in API calls.

    • The stream_options parameter is not supported in API calls.

    • The logprobs parameter is not supported when stream=true in API calls.

    • This model requires at least 48GB of VRAM but cannot be launched on a single 48GB GPU such as L40S. Single-GPU deployment is only supported on GPUs with 80GB or more of VRAM (for example, A100 80GB or H100 80GB).

  • SILMA 9B Instruct v1.0

    • This model is optimized for Arabic language contexts. While the model does process input in other languages, you may experience inconsistencies or reduced accuracy in content generated for non-Arabic languages.

    • The suffix parameter isn’t supported in API calls.

  • StarCoderBase 15.5B

    • Does not support the chat endpoint.

  • StarCoder2 7B

    • Deployment fails on H100 with vLLM (TP1, PP1) at 250 concurrent requests.

    • Deployment fails for vLLM profiles when NIM_ENABLE_KV_CACHE_REUSE=1.

    • Using FP32 checkpoints for the NIM_FT_MODEL variable or local build isn’t supported.

  • Gemma models