
Conversation

m0nk111 commented Nov 6, 2025

Summary

  • fold the CPU offload budget into the KV cache availability check so 12 GB GPUs with cpu_offload_gb set no longer fail spuriously (see the sketch after this list)
  • surface the GPU vs CPU contribution in the error message and reuse the combined capacity when estimating a max sequence length
  • pin the CUDA build of xformers to 0.0.33+5d4b92a5.d20251105 so local installs and Docker builds consume the same wheel
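
A minimal sketch of the accounting described in the first bullet, with illustrative names: only check_enough_kv_cache_memory, cpu_offload_gb, and max_memory_usage_bytes come from this PR, and the actual implementation may differ.

    # Illustrative sketch only; not the exact code in this PR.
    GiB = 1024**3

    def check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory):
        # Fold the CPU offload budget into the GPU budget before the check.
        cpu_offload_bytes = int(vllm_config.cache_config.cpu_offload_gb * GiB)
        effective_available = available_memory + cpu_offload_bytes

        needed = sum(
            spec.max_memory_usage_bytes(vllm_config) for spec in kv_cache_spec.values()
        )
        if needed > effective_available:
            raise ValueError(
                f"Need {needed / GiB:.2f} GiB for the KV cache, but only "
                f"{available_memory / GiB:.2f} GiB GPU + "
                f"{cpu_offload_bytes / GiB:.2f} GiB CPU offload is available. "
                "Try increasing `gpu_memory_utilization` or `cpu_offload_gb`."
            )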

Testing

  • PYTHONPATH=/home/flip/vllm-fork /home/flip/vllm-bleeding-env/bin/python -c "from types import SimpleNamespace as NS; from vllm.v1.core.kv_cache_utils import check_enough_kv_cache_memory; class DummySpec:\n    def __init__(self, memory):\n        self.memory = memory\n    def max_memory_usage_bytes(self, cfg):\n        return self.memory\ncheck_enough_kv_cache_memory(NS(cache_config=NS(cpu_offload_gb=2.0), model_config=NS(max_model_len=4096)), {'layer': DummySpec(int(2.5 * 1024**3))}, int(1.0 * 1024**3))" (expanded below for readability)
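
For readability, the same smoke test expanded into a standalone script (functionally equivalent to the one-liner above; DummySpec comes from that one-liner, while the GiB constant is introduced here for clarity):

    from types import SimpleNamespace as NS

    from vllm.v1.core.kv_cache_utils import check_enough_kv_cache_memory

    GiB = 1024**3

    class DummySpec:
        """Minimal stand-in for a KV cache spec with a fixed memory footprint."""

        def __init__(self, memory):
            self.memory = memory

        def max_memory_usage_bytes(self, cfg):
            return self.memory

    # 2.5 GiB of KV cache demand vs. 1.0 GiB of free GPU memory plus
    # cpu_offload_gb=2.0; expected to pass with the change in this PR.
    check_enough_kv_cache_memory(
        NS(cache_config=NS(cpu_offload_gb=2.0), model_config=NS(max_model_len=4096)),
        {"layer": DummySpec(int(2.5 * GiB))},
        int(1.0 * GiB),
    )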

Closes #27934.

Copilot AI review requested due to automatic review settings November 6, 2025 19:43
github-actions bot commented Nov 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix an issue with CPU offloading by factoring the CPU offload budget into the KV cache memory check, and also pins the xformers version. While the version pin is fine, the core logic change in kv_cache_utils.py is problematic. It incorrectly adds CPU memory designated for model weight offloading to the available GPU memory for KV cache. This is a critical issue as it can lead to runtime OOM errors by bypassing a safety check. I have left a detailed comment explaining the issue and recommending a revert of this logic.
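
For a concrete illustration of the concern, using the numbers from the smoke test in the PR description (a sketch, not output from this PR):

    GiB = 1024**3
    available_gpu = int(1.0 * GiB)    # free GPU memory left for the KV cache
    cpu_offload = int(2.0 * GiB)      # cpu_offload_gb=2.0, reserved for weight offload
    needed_kv_cache = int(2.5 * GiB)  # KV cache demand at max_model_len

    # The combined check passes ...
    assert needed_kv_cache <= available_gpu + cpu_offload
    # ... but the cache blocks still have to fit in GPU memory alone, so the
    # engine can hit an OOM at runtime.
    assert needed_kv_cache > available_gpu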

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enhances the KV cache memory checking logic to account for CPU offloading and updates the xformers dependency version.

  • Modifies check_enough_kv_cache_memory to include CPU offload memory in available memory calculations
  • Updates error messages to reflect CPU + GPU memory availability
  • Updates xformers version from 0.0.33+5d4b92a5.d20251029 to 0.0.33+5d4b92a5.d20251105

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • vllm/v1/core/kv_cache_utils.py: enhanced KV cache memory validation to account for CPU offload; updated error messages to display the GPU + CPU offload memory breakdown
  • requirements/cuda.txt: updated the xformers package version to a newer build date

Comments suppressed due to low confidence (1)

vllm/v1/core/kv_cache_utils.py:695

  • The error message should also suggest increasing cpu_offload_gb for consistency with the error message at line 725. Users may have CPU offload configured but still hit this error if effective memory is negative.
    if effective_available_memory <= 0:
        raise ValueError(
            "No available memory for the cache blocks. "
            "Try increasing `gpu_memory_utilization` when "
            "initializing the engine."
        )
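
A sketch of the wording this suggestion points toward (illustrative only, not the text in the PR):

    if effective_available_memory <= 0:
        raise ValueError(
            "No available memory for the cache blocks. "
            "Try increasing `gpu_memory_utilization` or `cpu_offload_gb` "
            "when initializing the engine."
        )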


m0nk111 (Author) commented Nov 6, 2025

Thanks for the careful review! The latest revision removes the CPU-memory boost from the KV cache check. Instead, when cpu_offload_gb > 0 we now clamp max_model_len down to the longest length that fits in the profiled GPU budget, leaving the safety guard intact. The error message still reports the GPU-only limit and we log when the auto-adjust kicks in. Let me know if anything else would make this clearer.
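
A rough sketch of the clamping behaviour described above; estimate_max_model_len and maybe_clamp_max_model_len are assumed names for illustration, not necessarily the functions in this PR.

    import logging

    logger = logging.getLogger(__name__)

    def maybe_clamp_max_model_len(vllm_config, available_gpu_memory,
                                  estimate_max_model_len):
        cache_config = vllm_config.cache_config
        model_config = vllm_config.model_config
        if cache_config.cpu_offload_gb <= 0:
            return  # leave the original safety guard untouched
        # Longest sequence length whose KV cache fits in the profiled GPU budget.
        fitting_len = estimate_max_model_len(vllm_config, available_gpu_memory)
        if fitting_len < model_config.max_model_len:
            logger.info(
                "cpu_offload_gb is set; clamping max_model_len from %d to %d "
                "so the KV cache fits in the profiled GPU budget.",
                model_config.max_model_len, fitting_len)
            model_config.max_model_len = fitting_len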

heheda12345 (Collaborator)

CC @huydhn for xformer version update and @ApostaC for CPU offloading

ApostaC (Collaborator) commented Nov 11, 2025

cpu_offload_gb is a V0 configuration and is not used anymore, IIUC. @heheda12345 @m0nk111

heheda12345 (Collaborator)

@ApostaC sorry for bothering you. cpu_offload_gb is for weight offloading, not KV cache offloading.

heheda12345 (Collaborator)

@m0nk111 can you split this PR into multiple PRs to address the problems you mentioned in #27934 one by one?

mergify bot commented Nov 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @m0nk111.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 14, 2025
m0nk111 (Author) commented Nov 16, 2025

Heads-up: I moved the max_model_len reset and the kv_offload gpu_memory_utilization=0.9 tweak into m0nk111#2, so this PR can focus solely on the CPU-offload accounting and the xformers pin discussion, per the earlier review feedback.

m0nk111 (Author) commented Nov 16, 2025

Quick note: the V1 max_model_len fix that used to live in my fork-only PR has been moved into upstream PR #28808. Please track that PR for review instead of the fork link I mentioned earlier.

The check_enough_kv_cache_memory() function was not accounting for
CPU offloading capacity when validating available memory. This caused
the V1 engine to fail with 'No available memory for cache blocks' error
even when --cpu-offload-gb was set.

This fix adds the CPU offload capacity to the effective available memory
before performing the check, allowing 7B-13B models to work correctly
with CPU offloading on 12GB GPUs.

Fixes vllm-project#27934
m0nk111 (Author) commented Nov 16, 2025

Superseded by smaller PRs:

Closing this umbrella PR so each fix can be reviewed independently.

@m0nk111 m0nk111 closed this Nov 16, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 16, 2025
Development

Successfully merging this pull request may close these issues:

  • V1 Engine: Memory allocation failures and crashes with 7B-13B models on RTX 3060 12GB
