Environment
- vLLM version: 0.11.1rc6.dev35+g29de3cdee (nightly), 0.6.4.post1 (stable)
- GPU: NVIDIA RTX 3060 12GB (Compute 8.6, Ampere)
- CUDA: 13.0 (Driver 580.95.05)
- OS: Ubuntu 24.04
- Python: 3.12
- Container: vllm/vllm-openai:nightly and vllm/vllm-openai:v0.6.4.post1
Description
The V1 engine consistently fails to initialize 7B-13B models on an RTX 3060 12GB, even though the GPU appears to have sufficient free memory. Multiple memory-related errors occur across different configurations.
Models Tested
All models fail with similar errors:
- Qwen/Qwen2.5-Coder-7B-Instruct
- Gryphe/MythoMax-L2-13b
- NousResearch/Nous-Hermes-2-SOLAR-10.7B
- cognitivecomputations/dolphin-2.6-mistral-7b
Error Patterns
Error 1: Insufficient Memory for Cache Blocks (with CPU offload)
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Configuration: --cpu-offload-gb 8 --gpu-memory-utilization 0.5
GPU Memory Available: 11.63 GiB total, ~6-7 GiB free after model load
Error 2: Memory Utilization Check Failure (without CPU offload)
ValueError: Free memory on device (6.44/11.63 GiB) on startup is less than desired GPU memory utilization (0.9, 10.47 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
Configuration: --gpu-memory-utilization 0.9 (no CPU offload)
Issue: Check happens BEFORE model loads, counts memory from previous crashed processes
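Roughly the same free/total figures that appear in this error can be read from the host with a plain nvidia-smi query (a sketch; exact numbers vary per run):
# Free vs. total GPU memory as seen right before launch
nvidia-smi --query-gpu=memory.free,memory.total --format=csv
# Here it reported roughly 6.4 GiB free of 11.6 GiB total, matching the error above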
Error 3: OOM During Model Loading
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 11.63 GiB of which 194.81 MiB is free. Process 145105 has 11.43 GiB memory in use.
Issue: Previous vLLM process keeps GPU memory allocated even after container stops
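One way to confirm which process is still holding the memory (e.g. PID 145105 from the error above):
# List processes currently holding GPU memory, with PID and usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv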
Reproduction Steps
Nightly Image (V1 Engine)
docker run -d --name vllm-test --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:nightly \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--cpu-offload-gb 8 \
--gpu-memory-utilization 0.5
Result: Crashes with "No available memory for cache blocks"
Stable v0.6.4 (V0 Engine)
docker run -d --name vllm-test --gpus all -p 8000:8000 --ipc=host \
vllm/vllm-openai:v0.6.4.post1 \
--model cognitivecomputations/dolphin-2.6-mistral-7b \
--dtype auto --max-model-len 8192
Result: Crashes with OOM from leftover memory allocations
Expected Behavior
- 7B models should work on 12GB GPU without CPU offload
- 13B models should work with CPU offload enabled
- Memory should be properly released when container stops
- Memory utilization check should account for model loading requirements
Actual Behavior
- V1 engine fails to calculate KV cache requirements correctly
- GPU memory from crashed processes persists across container restarts
- Memory utilization pre-check is too aggressive (doesn't account for actual requirements)
- CPU offload doesn't effectively reduce GPU memory pressure
Workarounds Attempted
- ✅ Kill GPU processes manually: sudo pkill -9 -f vllm (temporary fix; see the cleanup sketch after this list)
- ❌ Lower GPU utilization: values from 0.5 to 0.9 all fail, with different errors
- ❌ Enable CPU offload: causes the "no cache blocks" error
- ❌ Use V0 engine: VLLM_USE_V1=0 still crashes
- ❌ Smaller models: even 7B models fail
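A consolidated version of the manual-kill workaround looks roughly like this (a sketch; vllm-test is the container name from the repro commands above):
docker rm -f vllm-test    # stop and remove the test container
sudo pkill -9 -f vllm     # kill any orphaned vLLM workers still holding VRAM
# Verify the GPU is actually free again before relaunching (should list no processes)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv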
Related Issues
- [Bug]: Can't deploy Llama4 Scout on H200 with cpu offloading #16169 - CPU offload AssertionError (marked as fixed, but still broken in nightly)
- Similar to memory fragmentation issues in V1 engine
Logs
Full error log (nightly with CPU offload)
(EngineCore_DP0 pid=135) ERROR 11-01 18:47:58 [core.py:843] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
(EngineCore_DP0 pid=135) Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
kv_cache_configs = get_kv_cache_configs(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1277, in get_kv_cache_configs
check_enough_kv_cache_memory(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 686, in check_enough_kv_cache_memory
raise ValueError(
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
System Impact
- Makes vLLM unusable for common 7B-13B models on consumer 12GB GPUs
- Forces users back to Ollama/llama.cpp despite wanting vLLM's performance
- Affects all Ampere 12GB cards (RTX 3060, RTX 3080 Mobile, etc.)
Possible Root Causes
- V1 KV cache calculation bug: incorrectly calculates available memory for cache blocks (see the rough estimate after this list)
- Memory leak: GPU processes not properly cleaned up on crash
- Pre-check too conservative: Memory check doesn't account for actual model requirements
- CPU offload broken: Despite [Bug]: Can't deploy Llama4 Scout on H200 with cpu offloading #16169 being marked fixed, still fails in nightly
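As a rough sanity check on the first point: assuming the published Qwen2.5-Coder-7B config (28 layers, 4 KV heads via GQA, head_dim 128) and 2-byte fp16 KV entries, the KV cache for one full 8192-token sequence is only a few hundred MiB, far below the ~6-7 GiB reported free after model load:
# bytes per token = 2 (K and V) * 28 layers * 4 kv_heads * 128 head_dim * 2 bytes = 57344
echo $(( 2 * 28 * 4 * 128 * 2 * 8192 / 1024 / 1024 )) MiB   # ~448 MiB for an 8192-token context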
Request
Please investigate V1 engine memory management for 12GB consumer GPUs. These are common configurations and should be supported for 7B-13B models.