
Conversation

m0nk111 commented Nov 6, 2025

Summary

  • fold the CPU offload budget into the KV cache availability check so 12 GB GPUs with cpu_offload_gb set no longer fail spuriously (see the sketch after this list)
  • surface the GPU vs CPU contribution in the error message and reuse the combined capacity when estimating a max sequence length
  • pin the CUDA build of xformers to 0.0.33+5d4b92a5.d20251105 so local installs and Docker builds consume the same wheel
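
A minimal sketch of the accounting described in the first bullet, with illustrative names: only check_enough_kv_cache_memory, cpu_offload_gb, and max_memory_usage_bytes come from this PR, and the actual implementation may differ.

    # Illustrative sketch only; not the exact code in this PR.
    GiB = 1024**3

    def check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory):
        # Fold the CPU offload budget into the GPU budget before the check.
        cpu_offload_bytes = int(vllm_config.cache_config.cpu_offload_gb * GiB)
        effective_available = available_memory + cpu_offload_bytes

        needed = sum(
            spec.max_memory_usage_bytes(vllm_config) for spec in kv_cache_spec.values()
        )
        if needed > effective_available:
            raise ValueError(
                f"Need {needed / GiB:.2f} GiB for the KV cache, but only "
                f"{available_memory / GiB:.2f} GiB GPU + "
                f"{cpu_offload_bytes / GiB:.2f} GiB CPU offload is available. "
                "Try increasing `gpu_memory_utilization` or `cpu_offload_gb`."
            )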

Testing

  • PYTHONPATH=/home/flip/vllm-fork /home/flip/vllm-bleeding-env/bin/python -c "from types import SimpleNamespace as NS; from vllm.v1.core.kv_cache_utils import check_enough_kv_cache_memory; class DummySpec:\n    def __init__(self, memory):\n        self.memory = memory\n    def max_memory_usage_bytes(self, cfg):\n        return self.memory\ncheck_enough_kv_cache_memory(NS(cache_config=NS(cpu_offload_gb=2.0), model_config=NS(max_model_len=4096)), {'layer': DummySpec(int(2.5 * 1024**3))}, int(1.0 * 1024**3))" (expanded below for readability)
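
For readability, the same smoke test expanded into a standalone script (functionally equivalent to the one-liner above; DummySpec comes from that one-liner, while the GiB constant is introduced here for clarity):

    from types import SimpleNamespace as NS

    from vllm.v1.core.kv_cache_utils import check_enough_kv_cache_memory

    GiB = 1024**3

    class DummySpec:
        """Minimal stand-in for a KV cache spec with a fixed memory footprint."""

        def __init__(self, memory):
            self.memory = memory

        def max_memory_usage_bytes(self, cfg):
            return self.memory

    # 2.5 GiB of KV cache demand vs. 1.0 GiB of free GPU memory plus
    # cpu_offload_gb=2.0; expected to pass with the change in this PR.
    check_enough_kv_cache_memory(
        NS(cache_config=NS(cpu_offload_gb=2.0), model_config=NS(max_model_len=4096)),
        {"layer": DummySpec(int(2.5 * GiB))},
        int(1.0 * GiB),
    )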

Closes #27934.

Copilot AI review requested due to automatic review settings November 6, 2025 19:43
github-actions bot commented Nov 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix an issue with CPU offloading by factoring the CPU offload budget into the KV cache memory check, and also pins the xformers version. While the version pin is fine, the core logic change in kv_cache_utils.py is problematic. It incorrectly adds CPU memory designated for model weight offloading to the available GPU memory for KV cache. This is a critical issue as it can lead to runtime OOM errors by bypassing a safety check. I have left a detailed comment explaining the issue and recommending a revert of this logic.
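
For a concrete illustration of the concern, using the numbers from the smoke test in the PR description (a sketch, not output from this PR):

    GiB = 1024**3
    available_gpu = int(1.0 * GiB)    # free GPU memory left for the KV cache
    cpu_offload = int(2.0 * GiB)      # cpu_offload_gb=2.0, reserved for weight offload
    needed_kv_cache = int(2.5 * GiB)  # KV cache demand at max_model_len

    # The combined check passes ...
    assert needed_kv_cache <= available_gpu + cpu_offload
    # ... but the cache blocks still have to fit in GPU memory alone, so the
    # engine can hit an OOM at runtime.
    assert needed_kv_cache > available_gpu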

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enhances the KV cache memory checking logic to account for CPU offloading and updates the xformers dependency version.

  • Modifies check_enough_kv_cache_memory to include CPU offload memory in available memory calculations
  • Updates error messages to reflect CPU + GPU memory availability
  • Updates xformers version from 0.0.33+5d4b92a5.d20251029 to 0.0.33+5d4b92a5.d20251105

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • vllm/v1/core/kv_cache_utils.py: enhanced KV cache memory validation to account for CPU offload; updated error messages to display the GPU + CPU offload memory breakdown
  • requirements/cuda.txt: updated the xformers package version to a newer build date

Comments suppressed due to low confidence (1)

vllm/v1/core/kv_cache_utils.py:695

  • The error message should also suggest increasing cpu_offload_gb for consistency with the error message at line 725. Users may have CPU offload configured but still hit this error if effective memory is negative.
    if effective_available_memory <= 0:
        raise ValueError(
            "No available memory for the cache blocks. "
            "Try increasing `gpu_memory_utilization` when "
            "initializing the engine."
        )
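
A sketch of the wording this suggestion points toward (illustrative only, not the text in the PR):

    if effective_available_memory <= 0:
        raise ValueError(
            "No available memory for the cache blocks. "
            "Try increasing `gpu_memory_utilization` or `cpu_offload_gb` "
            "when initializing the engine."
        )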


m0nk111 (Author) commented Nov 6, 2025

Thanks for the careful review! The latest revision removes the CPU-memory boost from the KV cache check. Instead, when cpu_offload_gb > 0 we now clamp max_model_len down to the longest length that fits in the profiled GPU budget, leaving the safety guard intact. The error message still reports the GPU-only limit and we log when the auto-adjust kicks in. Let me know if anything else would make this clearer.
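
A rough sketch of the clamping behaviour described above; estimate_max_model_len and maybe_clamp_max_model_len are assumed names for illustration, not necessarily the functions in this PR.

    import logging

    logger = logging.getLogger(__name__)

    def maybe_clamp_max_model_len(vllm_config, available_gpu_memory,
                                  estimate_max_model_len):
        cache_config = vllm_config.cache_config
        model_config = vllm_config.model_config
        if cache_config.cpu_offload_gb <= 0:
            return  # leave the original safety guard untouched
        # Longest sequence length whose KV cache fits in the profiled GPU budget.
        fitting_len = estimate_max_model_len(vllm_config, available_gpu_memory)
        if fitting_len < model_config.max_model_len:
            logger.info(
                "cpu_offload_gb is set; clamping max_model_len from %d to %d "
                "so the KV cache fits in the profiled GPU budget.",
                model_config.max_model_len, fitting_len)
            model_config.max_model_len = fitting_len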

heheda12345 (Collaborator)

CC @huydhn for xformer version update and @ApostaC for CPU offloading

ApostaC (Collaborator) commented Nov 11, 2025

cpu_offload_gb is a V0 configuration and is not used anymore, IIUC. @heheda12345 @m0nk111

heheda12345 (Collaborator)

@ApostaC sorry for bothering you. cpu_offload_gb is for weight offloading, not KV cache offloading.

heheda12345 (Collaborator)

@m0nk111 can you split this PR into multiple PRs to address the problems you mentioned in #27934 one by one?

mergify bot commented Nov 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @m0nk111.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 14, 2025
m0nk111 (Author) commented Nov 16, 2025

Heads-up: I moved the max_model_len reset and the kv_offload gpu_memory_utilization=0.9 tweak into m0nk111#2, so this PR can focus solely on the CPU-offload accounting and the xformers pin discussion, per the earlier review feedback.

m0nk111 (Author) commented Nov 16, 2025

Quick note: the V1 max_model_len fix that used to live in my fork-only PR has been moved into upstream PR #28808. Please track that PR for review instead of the fork link I mentioned earlier.

The check_enough_kv_cache_memory() function was not accounting for
CPU offloading capacity when validating available memory. This caused
the V1 engine to fail with 'No available memory for cache blocks' error
even when --cpu-offload-gb was set.

This fix adds the CPU offload capacity to the effective available memory
before performing the check, allowing 7B-13B models to work correctly
with CPU offloading on 12GB GPUs.

Fixes vllm-project#27934
m0nk111 (Author) commented Nov 16, 2025

Superseded by smaller PRs:

Closing this umbrella PR so each fix can be reviewed independently.

@m0nk111 m0nk111 closed this Nov 16, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 16, 2025
Development

Successfully merging this pull request may close these issues:

  • V1 Engine: Memory allocation failures and crashes with 7B-13B models on RTX 3060 12GB
