Your current environment
Testing `vllm bench serve` from main as of the opening of this bug (Oct 24, 2025). The issue is agnostic of the hardware in use.
🐛 Describe the bug
When using `vllm bench serve` with the `openai-chat` backend against a vLLM server, the inter-token latency (ITL) is calculated incorrectly whenever the chunks streamed back by the server do not map 1:1 to generated tokens. This happens regularly when reasoning parsers, tool-call parsers, or harmony models are in use, since all of these involve special tokens and parsing logic that can cause output to be temporarily buffered and/or special tokens to be stripped from the final output.
The logic in `vllm.benchmarks.lib.endpoint_request_func.async_request_openai_chat_completions` assumes every chunk contains exactly one token and computes ITL as the simple timestamp difference between the previous chunk and the current one. This is misleading and leads to reporting higher ITL values than reality, because it actually measures the latency between streamed chunks rather than the latency between generated tokens.
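
To make the effect concrete, here is a minimal, self-contained sketch (not the actual vLLM benchmark code; the timestamps and token counts are hypothetical) comparing the current per-chunk ITL calculation with a token-aware alternative that spreads each chunk's latency over the number of tokens it delivered:

```python
# Illustrative sketch only -- not the actual vLLM benchmark code.
# Timestamps and token counts below are hypothetical.

def itl_per_chunk(chunk_timestamps):
    """Current behavior: ITL = delta between consecutive chunk arrivals,
    implicitly assuming each chunk carries exactly one token."""
    return [t2 - t1 for t1, t2 in zip(chunk_timestamps, chunk_timestamps[1:])]

def itl_token_aware(chunk_timestamps, tokens_per_chunk):
    """One possible fix: spread each chunk's latency over the number of
    tokens it actually delivered (requires counting tokens per chunk)."""
    itls = []
    deltas = zip(chunk_timestamps, chunk_timestamps[1:])
    for (t1, t2), n_tokens in zip(deltas, tokens_per_chunk[1:]):
        if n_tokens > 0:
            itls.extend([(t2 - t1) / n_tokens] * n_tokens)
    return itls

# Example: a reasoning parser buffers output, so the third chunk arrives
# 100 ms after the second but carries 5 tokens at once.
timestamps = [0.00, 0.02, 0.12]   # chunk arrival times in seconds
tokens = [1, 1, 5]                # tokens delivered by each chunk

print(itl_per_chunk(timestamps))            # ~[0.02, 0.10] -> a 100 ms "ITL" is reported
print(itl_token_aware(timestamps, tokens))  # six intervals of ~0.02 s each
```

With the per-chunk calculation, the buffered chunk shows up as a single 100 ms inter-token gap even though it delivered five tokens at roughly 20 ms apiece. A token-aware calculation would need to know how many tokens each chunk actually carried, e.g. by tokenizing the chunk text or using token counts reported by the server.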
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.