
[Fix] Use pre-tokenized prompts in VLLMwithChatTemplate to avoid modifying model input #2434

Open

suhmily10 wants to merge 1 commit into open-compass:main from suhmily10:fix/vllm-avoid-modifying-input-sequence

Conversation

@suhmily10

Summary

VLLMwithChatTemplate.generate() currently calls apply_chat_template(tokenize=False) to produce text, then manually strips the BOS token as a workaround for vLLM re-adding it during tokenization (add_special_tokens=True). This approach silently modifies the model's intended input sequence and can cause incorrect evaluation results for models whose chat templates deliberately include BOS.

This PR fixes the issue by:

  • Using apply_chat_template(tokenize=True) to obtain token IDs directly
  • Passing them as pre-tokenized prompts ({"prompt_token_ids": ...}) to vLLM

This preserves the exact token sequence the chat template produces, without any manual modification, and avoids the double-BOS problem entirely.
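
Conceptually, the change looks like the sketch below. This is an illustration rather than the exact OpenCompass code; `messages`, `tokenizer`, `llm`, and `sampling_params` stand in for values the class already has on hand.

```python
# Illustrative sketch (not the exact OpenCompass code).

# Before: render text, then strip BOS so vLLM's tokenizer does not add a second one.
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
if prompt_text.startswith(tokenizer.bos_token):
    prompt_text = prompt_text[len(tokenizer.bos_token):]  # silently alters the intended input
outputs = llm.generate(prompt_text, sampling_params)

# After: tokenize with the chat template and hand vLLM the token IDs directly,
# bypassing its internal tokenization (and the double-BOS problem).
prompt_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True)
outputs = llm.generate({"prompt_token_ids": prompt_ids}, sampling_params)
```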

Motivation

The previous workaround (lines 128-134) had several issues:

  1. Modifies model input — stripping BOS changes the token sequence the model was designed to receive
  2. Fragile — only handles text-level BOS prefix; fails if the tokenizer represents BOS differently
  3. Unnecessary — vLLM natively supports prompt_token_ids, which bypasses its internal tokenization entirely (a self-contained example follows this list)
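
For reference, a self-contained example of vLLM's pre-tokenized prompt interface. The model name and prompt are placeholders; only `apply_chat_template(tokenize=True)` and the `{"prompt_token_ids": ...}` prompt form are the pieces this PR relies on.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder model; any chat model whose template emits BOS behaves the same way.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

messages = [{"role": "user", "content": "Hello!"}]
# The chat template decides whether BOS is present; the result is not touched afterwards.
prompt_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True)

# Passing prompt_token_ids skips vLLM's own tokenization (and add_special_tokens) entirely.
outputs = llm.generate({"prompt_token_ids": prompt_ids},
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```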

Test plan

  • Verified the fix preserves the same token IDs that apply_chat_template produces (no extra/missing BOS); a minimal check is sketched after this list
  • Run evaluation with a model that has BOS in its chat template (e.g., LLaMA-based) and confirm results match expectations
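
A minimal version of that check (the model name is a placeholder for any LLaMA-family model whose chat template includes BOS):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]

# Token IDs the fixed code now passes to vLLM verbatim.
ids_direct = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True)

# What the old path effectively produced: rendered text re-tokenized by vLLM
# with add_special_tokens=True.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
ids_retokenized = tokenizer(text, add_special_tokens=True).input_ids

print("direct:      ", ids_direct[:3])
print("re-tokenized:", ids_retokenized[:3])  # expect a duplicated BOS at the front here
```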

Made with Cursor

[Fix] Use pre-tokenized prompts in VLLMwithChatTemplate to avoid modifying model input

The previous code called apply_chat_template(tokenize=False) to get text,
then stripped the BOS token as a workaround for vLLM re-adding it during
tokenization. This approach modifies the model's intended input sequence.

Instead, use apply_chat_template(tokenize=True) to obtain token IDs
directly, and pass them as pre-tokenized prompts (prompt_token_ids) to
vLLM. This preserves the exact token sequence the chat template produces
without any manual modification.

Made-with: Cursor