V1 Engine: Memory allocation failures and crashes with 7B-13B models on RTX 3060 12GB #27934

@m0nk111

Environment

  • vLLM version: 0.11.1rc6.dev35+g29de3cdee (nightly), 0.6.4.post1 (stable)
  • GPU: NVIDIA RTX 3060 12GB (Compute 8.6, Ampere)
  • CUDA: 13.0 (Driver 580.95.05)
  • OS: Ubuntu 24.04
  • Python: 3.12
  • Container: vllm/vllm-openai:nightly and vllm/vllm-openai:v0.6.4.post1

Description

The V1 engine consistently fails to initialize 7B-13B models on an RTX 3060 12GB, even though the GPU should have enough memory for them. Multiple memory-related errors occur across different configurations.

Models Tested

All models fail with similar errors:

  • Qwen/Qwen2.5-Coder-7B-Instruct
  • Gryphe/MythoMax-L2-13b
  • NousResearch/Nous-Hermes-2-SOLAR-10.7B
  • cognitivecomputations/dolphin-2.6-mistral-7b

Error Patterns

Error 1: Insufficient Memory for Cache Blocks (with CPU offload)

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

Configuration: --cpu-offload-gb 8 --gpu-memory-utilization 0.5
GPU Memory Available: 11.63 GiB total, ~6-7 GiB free after model load
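
For context, a rough sketch of the budget this configuration implies. The accounting below is my own reading of the error, not vLLM's actual code, and the figures are approximate:

# Rough budget arithmetic for Error 1 (approximate; the accounting is an assumption).
total_gib = 11.63            # total GPU memory reported above
gpu_mem_util = 0.5           # --gpu-memory-utilization 0.5

budget_gib = gpu_mem_util * total_gib   # ceiling vLLM will use: ~5.82 GiB
# The on-GPU share of the weights (whatever remains after --cpu-offload-gb 8)
# plus the activation workspace measured during profiling must fit under this
# ceiling; if those alone reach ~5.82 GiB, zero bytes remain for cache blocks
# and the "No available memory for the cache blocks" error is raised.
print(f"GPU budget at utilization {gpu_mem_util}: {budget_gib:.2f} GiB")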

Error 2: Memory Utilization Check Failure (without CPU offload)

ValueError: Free memory on device (6.44/11.63 GiB) on startup is less than desired GPU memory utilization (0.9, 10.47 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

Configuration: --gpu-memory-utilization 0.9 (no CPU offload)
Issue: the check runs before the model loads and counts memory still held by previously crashed processes
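
A minimal sketch of this check as I understand it (not vLLM's actual code), using torch.cuda.mem_get_info(); running it just before launch also shows how much memory leftover processes are still holding:

import torch

gpu_memory_utilization = 0.9              # value passed on the command line
free, total = torch.cuda.mem_get_info()   # bytes free / total on device 0
desired = gpu_memory_utilization * total

print(f"free={free / 2**30:.2f} GiB, desired={desired / 2**30:.2f} GiB")
if free < desired:
    # Mirrors the ValueError above: only memory that is free *right now* counts,
    # so memory still pinned by a crashed process makes the check fail.
    print("startup check would fail")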

Error 3: OOM During Model Loading

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 11.63 GiB of which 194.81 MiB is free. Process 145105 has 11.43 GiB memory in use.

Issue: the previous vLLM process keeps GPU memory allocated even after its container stops
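
A sketch for confirming Error 3: listing which PIDs still hold GPU memory after the container stops (the same information nvidia-smi shows; assumes the nvidia-ml-py / pynvml package is installed):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Any PID listed here after `docker stop vllm-test` is a leftover allocation,
# like the 11.43 GiB held by process 145105 in the error above.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used_gib = (proc.usedGpuMemory or 0) / 2**30
    print(f"pid={proc.pid} holds {used_gib:.2f} GiB")

pynvml.nvmlShutdown()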

Reproduction Steps

Nightly Image (V1 Engine)

docker run -d --name vllm-test --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:nightly \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --cpu-offload-gb 8 \
  --gpu-memory-utilization 0.5

Result: Crashes with "No available memory for cache blocks"
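
For completeness, a roughly equivalent reproduction through the offline Python API (a sketch that should hit the same initialization path; the docker command above is the actual reproduction):

from vllm import LLM

# Mirrors the nightly docker invocation above; expected to fail during engine
# initialization with "No available memory for the cache blocks".
llm = LLM(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    cpu_offload_gb=8,
    gpu_memory_utilization=0.5,
)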

Stable v0.6.4 (V0 Engine)

docker run -d --name vllm-test --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:v0.6.4.post1 \
  --model cognitivecomputations/dolphin-2.6-mistral-7b \
  --dtype auto --max-model-len 8192

Result: Crashes with OOM from leftover memory allocations

Expected Behavior

  • 7B models should work on 12GB GPU without CPU offload (a rough KV-cache estimate follows this list)
  • 13B models should work with CPU offload enabled
  • Memory should be properly released when container stops
  • Memory utilization check should account for model loading requirements
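
For reference, a rough estimate of the KV-cache footprint alone at 8192 tokens, assuming the standard Mistral-7B attention layout (32 layers, 8 KV heads, head dim 128, FP16 cache); weights and activations are not included:

# Approximate KV-cache size for one full-length sequence (assumed architecture).
num_layers    = 32      # Mistral-7B
num_kv_heads  = 8       # grouped-query attention
head_dim      = 128
dtype_bytes   = 2       # FP16/BF16
max_model_len = 8192    # --max-model-len used in the stable repro

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
total_gib = bytes_per_token * max_model_len / 2**30

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"≈{total_gib:.2f} GiB for {max_model_len} tokens")
# ≈128 KiB per token, ≈1.00 GiB for 8192 tokens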

Actual Behavior

  1. V1 engine fails to calculate KV cache requirements correctly
  2. GPU memory from crashed processes persists across container restarts
  3. Memory utilization pre-check is too aggressive (doesn't account for actual requirements)
  4. CPU offload doesn't effectively reduce GPU memory pressure

Workarounds Attempted

  • Kill GPU processes manually: sudo pkill -9 -f vllm (temporary fix)
  • Lower GPU utilization: 0.5-0.9 all fail with different errors
  • Enable CPU offload: Causes "no cache blocks" error
  • Use V0 engine: VLLM_USE_V1=0 - still crashes
  • Smaller models: Even 7B models fail

Logs

Full error log (nightly with CPU offload)
(EngineCore_DP0 pid=135) ERROR 11-01 18:47:58 [core.py:843] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
(EngineCore_DP0 pid=135) Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
    kv_cache_configs = get_kv_cache_configs(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1277, in get_kv_cache_configs
    check_enough_kv_cache_memory(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 686, in check_enough_kv_cache_memory
    raise ValueError(
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

System Impact

  • Makes vLLM unusable for common 7B-13B models on consumer 12GB GPUs
  • Forces users back to Ollama/llama.cpp despite wanting vLLM's performance
  • Affects other 12GB Ampere cards as well (e.g., RTX 3060 12GB, RTX 3080 12GB)

Possible Root Causes

  1. V1 KV cache calculation bug: Incorrectly calculates available memory for cache blocks
  2. Memory leak: GPU processes not properly cleaned up on crash
  3. Pre-check too conservative: Memory check doesn't account for actual model requirements
  4. CPU offload broken: despite #16169 ([Bug]: Can't deploy Llama4 Scout on H200 with cpu offloading) being marked fixed, CPU offload still fails in nightly

Request

Please investigate V1 engine memory management for 12GB consumer GPUs. These are common configurations and should be supported for 7B-13B models.
