Environment
- vLLM version: 0.11.1rc6.dev35+g29de3cdee (nightly), 0.6.4.post1 (stable)
- GPU: NVIDIA RTX 3060 12GB (Compute 8.6, Ampere)
- CUDA: 13.0 (Driver 580.95.05)
- OS: Ubuntu 24.04
- Python: 3.12
- Container: vllm/vllm-openai:nightly and vllm/vllm-openai:v0.6.4.post1
Description
The V1 engine consistently fails to initialize 7B-13B models on an RTX 3060 12GB, even though the GPU appears to have sufficient free memory. Multiple memory-related errors occur across different configurations.
Models Tested
All models fail with similar errors:
- Qwen/Qwen2.5-Coder-7B-Instruct
- Gryphe/MythoMax-L2-13b
- NousResearch/Nous-Hermes-2-SOLAR-10.7B
- cognitivecomputations/dolphin-2.6-mistral-7b
Error Patterns
Error 1: Insufficient Memory for Cache Blocks (with CPU offload)
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Configuration: --cpu-offload-gb 8 --gpu-memory-utilization 0.5
GPU Memory Available: 11.63 GiB total, ~6-7 GiB free after model load
Error 2: Memory Utilization Check Failure (without CPU offload)
ValueError: Free memory on device (6.44/11.63 GiB) on startup is less than desired GPU memory utilization (0.9, 10.47 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
Configuration: --gpu-memory-utilization 0.9 (no CPU offload)
Issue: Check happens BEFORE model loads, counts memory from previous crashed processes
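Roughly the same free/total figures that appear in this error can be read from the host with a plain nvidia-smi query (a sketch; exact numbers vary per run):
# Free vs. total GPU memory as seen right before launch
nvidia-smi --query-gpu=memory.free,memory.total --format=csv
# Here it reported roughly 6.4 GiB free of 11.6 GiB total, matching the error above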
Error 3: OOM During Model Loading
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 11.63 GiB of which 194.81 MiB is free. Process 145105 has 11.43 GiB memory in use.
Issue: Previous vLLM process keeps GPU memory allocated even after container stops
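One way to confirm which process is still holding the memory (e.g. PID 145105 from the error above):
# List processes currently holding GPU memory, with PID and usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv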
Reproduction Steps
Nightly Image (V1 Engine)
docker run -d --name vllm-test --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:nightly \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--cpu-offload-gb 8 \
--gpu-memory-utilization 0.5
Result: Crashes with "No available memory for cache blocks"
Stable v0.6.4 (V0 Engine)
docker run -d --name vllm-test --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:v0.6.4.post1 \
--model cognitivecomputations/dolphin-2.6-mistral-7b \
--dtype auto --max-model-len 8192
Result: Crashes with OOM from leftover memory allocations
Expected Behavior
- 7B models should work on 12GB GPU without CPU offload
- 13B models should work with CPU offload enabled
- Memory should be properly released when container stops
- Memory utilization check should account for model loading requirements
Actual Behavior
- V1 engine fails to calculate KV cache requirements correctly
- GPU memory from crashed processes persists across container restarts
- Memory utilization pre-check is too aggressive (doesn't account for actual requirements)
- CPU offload doesn't effectively reduce GPU memory pressure
Workarounds Attempted
- ✅ Kill GPU processes manually: sudo pkill -9 -f vllm (temporary fix; see the cleanup sketch after this list)
- ❌ Lower GPU utilization: values from 0.5 to 0.9 all fail, with different errors
- ❌ Enable CPU offload: causes the "no cache blocks" error
- ❌ Use V0 engine: VLLM_USE_V1=0 still crashes
- ❌ Smaller models: even 7B models fail
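A consolidated version of the manual-kill workaround looks roughly like this (a sketch; vllm-test is the container name from the repro commands above):
docker rm -f vllm-test    # stop and remove the test container
sudo pkill -9 -f vllm     # kill any orphaned vLLM workers still holding VRAM
# Verify the GPU is actually free again before relaunching (should list no processes)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv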
Related Issues
- [Bug]: Can't deploy Llama4 Scout on H200 with cpu offloading #16169 - CPU offload AssertionError (marked as fixed, but still broken in nightly)
- Similar to memory fragmentation issues in V1 engine
Logs
Full error log (nightly with CPU offload)
(EngineCore_DP0 pid=135) ERROR 11-01 18:47:58 [core.py:843] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
(EngineCore_DP0 pid=135) Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
kv_cache_configs = get_kv_cache_configs(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1277, in get_kv_cache_configs
check_enough_kv_cache_memory(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 686, in check_enough_kv_cache_memory
raise ValueError(
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
System Impact
- Makes vLLM unusable for common 7B-13B models on consumer 12GB GPUs
- Forces users back to Ollama/llama.cpp despite wanting vLLM's performance
- Affects all Ampere 12GB cards (RTX 3060, RTX 3080 Mobile, etc.)
Possible Root Causes
- V1 KV cache calculation bug: incorrectly calculates available memory for cache blocks (see the rough estimate after this list)
- Memory leak: GPU processes not properly cleaned up on crash
- Pre-check too conservative: Memory check doesn't account for actual model requirements
- CPU offload broken: Despite [Bug]: Can't deploy Llama4 Scout on H200 with cpu offloading #16169 being marked fixed, still fails in nightly
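As a rough sanity check on the first point: assuming the published Qwen2.5-Coder-7B config (28 layers, 4 KV heads via GQA, head_dim 128) and 2-byte fp16 KV entries, the KV cache for one full 8192-token sequence is only a few hundred MiB, far below the ~6-7 GiB reported free after model load:
# bytes per token = 2 (K and V) * 28 layers * 4 kv_heads * 128 head_dim * 2 bytes = 57344
echo $(( 2 * 28 * 4 * 128 * 2 * 8192 / 1024 / 1024 )) MiB   # ~448 MiB for an 8192-token context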
Request
Please investigate V1 engine memory management for 12GB consumer GPUs. These are common configurations and should be supported for 7B-13B models.