If you set `--gpu-memory-utilization` to the extreme minimum of 0.1, vLLM caps its total footprint on the GPU at 10% of VRAM. That single budget must cover the model weights, the activation workspace, and the KV cache; the remaining 90% is left untouched. The consequences are:
- It is the easiest setting to start on a busy or shared GPU: vLLM claims only 10% of VRAM, so it rarely fails with "insufficient free memory" at launch. The caveat is that the model weights and activations must themselves fit inside that 10% slice; otherwise loading fails outright.
- The KV-cache pool left over after loading the weights is tiny, which sharply limits the supported seq_len × batch_size. High concurrency or long input sequences will immediately exhaust the KV cache, triggering frequent preemptions and a sharp drop in throughput.
- With so little cache capacity, vLLM repeatedly preempts older sequences and recomputes their KV state, lowering the tokens/s rate; in the worst case, requests fail or time out outright.
- Utilization of GPU compute and memory bandwidth is also low; the GPU sits mostly idle, so overall cost-effectiveness is poor.
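The back-of-the-envelope arithmetic behind the bullets above can be sketched in a few lines of Python. The model dimensions and weight size below are illustrative assumptions (a Llama-2-7B-style model: 32 layers, 32 KV heads, head dim 128, fp16 everywhere), and activation overhead is ignored for simplicity; the point is only how the KV-cache budget shrinks once weights come out of the same slice.

```python
GiB = 1024 ** 3

def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(vram_gib, gpu_mem_util, weight_gib):
    # vLLM's budget = total VRAM * utilization; the weights come out of that
    # same budget, and whatever remains holds the KV cache (activations ignored)
    budget_gib = vram_gib * gpu_mem_util - weight_gib
    if budget_gib <= 0:
        return 0  # weights alone exceed the budget: vLLM cannot even start
    return int(budget_gib * GiB / kv_bytes_per_token())

# 80 GiB GPU, ~14 GiB of fp16 weights for a 7B model (assumed figures)
print(max_cached_tokens(80, 0.1, 14))  # → 0: weights don't fit in the 10% slice
print(max_cached_tokens(80, 0.9, 14))  # → 118784 cacheable tokens
```

At 0.5 MiB of KV state per token, the difference between a 10% and a 90% budget is the difference between failing to start and caching well over a hundred thousand tokens.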
Therefore, 0.1 is suitable only for smoke-testing whether the model can run at all. For production or normal inference workloads, start from 0.6 and tune downward, generally staying above 0.4; going lower does more harm than good.
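Concretely, the two regimes look like this on the command line. This is a hedged sketch assuming a recent vLLM with the `vllm serve` entry point; the model name is just an example placeholder.

```shell
# Smoke test: will the model even load and answer a request on a shared GPU?
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.1

# Normal serving: give vLLM a realistic share of VRAM and tune from there
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.6
```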