Fix CPU offload KV cache accounting and align CUDA xformers pin #28241
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI. You can ask your reviewers to trigger select CI tests on top of `fastcheck` CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add a `ready` label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request aims to fix an issue with CPU offloading by factoring the CPU offload budget into the KV cache memory check, and also pins the xformers version. While the version pin is fine, the core logic change in kv_cache_utils.py is problematic. It incorrectly adds CPU memory designated for model weight offloading to the available GPU memory for KV cache. This is a critical issue as it can lead to runtime OOM errors by bypassing a safety check. I have left a detailed comment explaining the issue and recommending a revert of this logic.
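To make the concern concrete, here is a minimal sketch of the accounting pattern being flagged (function and parameter names are paraphrased from the PR description, not the exact vLLM code):

```python
def check_enough_kv_cache_memory(available_gpu_memory: int,
                                 cpu_offload_bytes: int,
                                 needed_memory: int) -> None:
    """Validate that KV cache blocks fit in the memory budget."""
    # The flagged pattern: CPU memory reserved for offloaded *weights* is
    # folded into the budget for KV cache blocks, which must be GPU-resident.
    effective_available_memory = available_gpu_memory + cpu_offload_bytes
    if effective_available_memory < needed_memory:
        raise ValueError("Not enough memory for KV cache blocks.")
    # With the CPU term included, this check can pass even when the GPU alone
    # cannot hold the cache, deferring the failure to a runtime OOM.
```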
Pull Request Overview
This PR enhances the KV cache memory checking logic to account for CPU offloading and updates the xformers dependency version.
- Modifies `check_enough_kv_cache_memory` to include CPU offload memory in available memory calculations
- Updates error messages to reflect CPU + GPU memory availability
- Updates the xformers version from `0.0.33+5d4b92a5.d20251029` to `0.0.33+5d4b92a5.d20251105`
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vllm/v1/core/kv_cache_utils.py | Enhanced KV cache memory validation to account for CPU offload, updated error messages to display GPU + CPU offload memory breakdown |
| requirements/cuda.txt | Updated xformers package version to a newer build date |
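For reference, the pin change summarized above would look roughly like this in requirements/cuda.txt (reconstructed from the versions quoted in the review; any environment markers on the real line are omitted):

```diff
-xformers==0.0.33+5d4b92a5.d20251029
+xformers==0.0.33+5d4b92a5.d20251105
```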
Comments suppressed due to low confidence (1)
vllm/v1/core/kv_cache_utils.py:695
- The error message should also suggest increasing `cpu_offload_gb` for consistency with the error message at line 725. Users may have CPU offload configured but still hit this error if effective memory is negative.
if effective_available_memory <= 0:
    raise ValueError(
        "No available memory for the cache blocks. "
        "Try increasing `gpu_memory_utilization` when "
        "initializing the engine."
    )
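For illustration, the suppressed suggestion would amount to something like this (hypothetical wording; the actual message at line 725 is not shown in this thread):

```python
if effective_available_memory <= 0:
    raise ValueError(
        "No available memory for the cache blocks. "
        "Try increasing `gpu_memory_utilization` or "
        "`cpu_offload_gb` when initializing the engine."
    )
```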
Thanks for the careful review! The latest revision removes the CPU-memory boost from the KV cache check. Instead, when …
@ApostaC sorry for bothering you.
This pull request has merge conflicts that must be resolved before it can be merged.
Heads-up: I spun the changes out into smaller follow-up PRs.
The `check_enough_kv_cache_memory()` function was not accounting for CPU offloading capacity when validating available memory. This caused the V1 engine to fail with a 'No available memory for cache blocks' error even when `--cpu-offload-gb` was set. This fix adds the CPU offload capacity to the effective available memory before performing the check, allowing 7B-13B models to work correctly with CPU offloading on 12GB GPUs. Fixes vllm-project#27934
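Based on that description, the change amounts to something like the following sketch (a paraphrase of the commit, with a `GiB` constant assumed; not the exact diff):

```python
GiB = 1024**3

def check_enough_kv_cache_memory(available_memory: int,
                                 needed_memory: int,
                                 cpu_offload_gb: float) -> None:
    # Count the configured CPU offload capacity toward the effective budget
    # before validating that the KV cache blocks fit.
    effective_available_memory = available_memory + int(cpu_offload_gb * GiB)
    if effective_available_memory < needed_memory:
        raise ValueError("No available memory for the cache blocks. ...")
```

Note that this is precisely the accounting the Gemini review above flags as unsafe for GPU-resident KV cache blocks, which is why the later revision removed the CPU-memory boost.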
Force-pushed from 762b4f2 to 7f3422b.
Superseded by smaller PRs:
Closing this umbrella PR so each fix can be reviewed independently.
Summary
- Runs configured with `cpu_offload_gb` set no longer fail spuriously

Testing
- See the minimal invocation sketched below

Closes #27934.
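For context, a minimal invocation that exercises the affected path (model name and offload size are illustrative, matching the 7B-on-12GB scenario from the issue):

```python
from vllm import LLM

# A 7B model on a ~12 GB GPU, with part of the weights offloaded to CPU.
# Before the fix, this failed with "No available memory for cache blocks".
llm = LLM(model="meta-llama/Llama-2-7b-hf", cpu_offload_gb=8)
print(llm.generate("Hello")[0].outputs[0].text)
```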