If you set `--gpu-memory-utilization` to the extreme minimum of 0.1, vLLM caps its total footprint on the GPU at 10% of VRAM. That single budget must cover the model weights, the activation workspace, and the KV cache; the remaining 90% is left untouched. The consequences are:
- It is the easiest setting to start on a busy or shared GPU: vLLM claims only 10% of VRAM, so it rarely fails with "insufficient free memory" at launch. The caveat is that the model weights and activations must themselves fit inside that 10% slice; otherwise loading fails outright.
- The KV-cache pool left over after loading the weights is tiny, which sharply limits the supported seq_len × batch_size. High concurrency or long input sequences will immediately exhaust the KV cache, triggering frequent preemptions and a sharp drop in throughput.
- With so little cache capacity, vLLM repeatedly preempts older sequences and recomputes their KV state, lowering the tokens/s rate; in the worst case, requests fail or time out outright.
- Utilization of GPU compute and memory bandwidth is also low; the GPU sits mostly idle, so overall cost-effectiveness is poor.
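The back-of-the-envelope arithmetic behind the bullets above can be sketched in a few lines of Python. The model dimensions and weight size below are illustrative assumptions (a Llama-2-7B-style model: 32 layers, 32 KV heads, head dim 128, fp16 everywhere), and activation overhead is ignored for simplicity; the point is only how the KV-cache budget shrinks once weights come out of the same slice.

```python
GiB = 1024 ** 3

def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(vram_gib, gpu_mem_util, weight_gib):
    # vLLM's budget = total VRAM * utilization; the weights come out of that
    # same budget, and whatever remains holds the KV cache (activations ignored)
    budget_gib = vram_gib * gpu_mem_util - weight_gib
    if budget_gib <= 0:
        return 0  # weights alone exceed the budget: vLLM cannot even start
    return int(budget_gib * GiB / kv_bytes_per_token())

# 80 GiB GPU, ~14 GiB of fp16 weights for a 7B model (assumed figures)
print(max_cached_tokens(80, 0.1, 14))  # → 0: weights don't fit in the 10% slice
print(max_cached_tokens(80, 0.9, 14))  # → 118784 cacheable tokens
```

At 0.5 MiB of KV state per token, the difference between a 10% and a 90% budget is the difference between failing to start and caching well over a hundred thousand tokens.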
Therefore, 0.1 is suitable only for smoke-testing whether the model can run at all. For production or normal inference workloads, start from 0.6 and tune downward, generally staying above 0.4; going lower does more harm than good.
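Concretely, the two regimes look like this on the command line. This is a hedged sketch assuming a recent vLLM with the `vllm serve` entry point; the model name is just an example placeholder.

```shell
# Smoke test: will the model even load and answer a request on a shared GPU?
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.1

# Normal serving: give vLLM a realistic share of VRAM and tune from there
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.6
```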