Passing --tensor-parallel-size 2 splits the model parameters and computation graph across the two GPUs, so each GPU holds half of the weights;
however, vLLM still reserves a proportionally large amount of VRAM on each GPU for its share of the weights plus the corresponding KV cache and intermediate activations.
In other words:
- Each GPU must individually meet the --gpu-memory-utilization threshold;
- The threshold = utilization × total memory per GPU (44 GB × 0.25 ≈ 11 GB), not a fraction of the combined memory of both GPUs;
- Therefore, if any single GPU has less than 11 GB of free memory, you’ll get the error “Free memory … < desired”.
Thus, tensor parallelism does not halve the memory threshold: it only lets the model's weights be split across devices, while each GPU must still independently satisfy the memory requirement.