
[Bug]: vllm bench serve incorrectly calculates ITL for openai-chat with reasoning, tool calling, or harmony models #27485

@bbrowning

Your current environment

Tested with vllm bench serve built from main as of the day this bug was opened (Oct 24, 2025). The issue is independent of the hardware in use.

🐛 Describe the bug

When using vllm bench serve with the openai-chat backend against a vLLM server, the inter-token latency (ITL) is calculated incorrectly whenever the chunks streamed back by the server do not map 1:1 to generated tokens. This regularly happens when reasoning parsers, tool call parsers, or harmony models are in use, since all of these involve special tokens and parsing logic that can temporarily buffer output and/or strip special tokens from the final response.
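For illustration, here is a minimal, self-contained sketch (not vLLM code) of how a single streamed content delta can carry several generated tokens. The chunk strings and the tokenizer choice are hypothetical and only serve to make the broken "one chunk == one token" assumption concrete:

```python
from transformers import AutoTokenizer

# Any tokenizer works for this demonstration; "gpt2" is an arbitrary choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

streamed_chunks = [
    "Hello",                     # typical one-token content delta
    " there",                    # typical one-token content delta
    ' {"name": "get_weather"}',  # buffered tool-call fragment flushed as one chunk
]

for chunk in streamed_chunks:
    n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
    print(f"{n_tokens} token(s) in chunk {chunk!r}")

# The last chunk carries multiple tokens, so attributing the whole
# inter-chunk gap to a single token overstates per-token latency.
```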

The logic in vllm.benchmarks.lib.endpoint_request_func.async_request_openai_chat_completions assumes every chunk corresponds to exactly one token and records ITL as the simple timestamp difference between the previous chunk and the current one. This is misleading and inflates the reported ITL, because it actually measures the latency between streamed chunks rather than the latency between generated tokens.
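As a rough sketch of the consequence (again, not the actual implementation in endpoint_request_func), the following compares the per-chunk ITL described above with a token-aware alternative that spreads each inter-chunk gap over the number of tokens the chunk actually carried. The chunk timestamps and token counts are made up for the example:

```python
def per_chunk_itl(chunks: list[tuple[float, int]]) -> list[float]:
    """ITL as the gap between consecutive chunks (the behavior described above)."""
    return [chunks[i][0] - chunks[i - 1][0] for i in range(1, len(chunks))]


def token_aware_itl(chunks: list[tuple[float, int]]) -> list[float]:
    """Spread each inter-chunk gap over the tokens that chunk actually carried."""
    itls: list[float] = []
    for i in range(1, len(chunks)):
        gap = chunks[i][0] - chunks[i - 1][0]
        n_tokens = max(chunks[i][1], 1)
        itls.extend([gap / n_tokens] * n_tokens)
    return itls


# Hypothetical stream: (arrival_timestamp_s, num_tokens_in_chunk).
# The third chunk arrives 150 ms after the second but carries 5 buffered
# tokens, e.g. a flushed tool-call fragment.
chunks = [(0.000, 1), (0.030, 1), (0.180, 5)]
print(per_chunk_itl(chunks))    # ~[0.03, 0.15]: 150 ms attributed to "one token"
print(token_aware_itl(chunks))  # six values of ~0.03 s each
```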

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
