How to configure the tensor-parallel-size parameter in vLLM

--tensor-parallel-size 2 shards the model's weight matrices (and the corresponding computation) across two GPUs, so each GPU holds roughly half of the weights;
however, vLLM still reserves a proportionally large amount of VRAM on each GPU for its share of the weights plus the corresponding KV cache and intermediate activations.
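The weight-sharding part can be sketched with back-of-the-envelope arithmetic (the 14 GB model size below is a hypothetical example, not a value from vLLM):

```python
def per_gpu_weight_memory(model_size_gb: float, tp_size: int) -> float:
    """Approximate weight memory each GPU holds under tensor parallelism.

    Tensor parallelism divides the weights across GPUs, but it does NOT
    divide the per-GPU memory threshold that vLLM enforces.
    """
    return model_size_gb / tp_size

# Hypothetical 14 GB of weights sharded across 2 GPUs:
print(per_gpu_weight_memory(14.0, 2))  # each GPU holds about 7 GB of weights
```

On top of this weight share, each GPU also needs room for its slice of the KV cache and activations, which is what the utilization check below is guarding.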

In other words:

  • Each GPU must individually meet the --gpu-memory-utilization threshold;

  • The threshold = utilization × total memory of that single GPU (44 GB × 0.25 = 11 GB); it is not based on the combined memory of both GPUs;

  • Therefore, if any single GPU has less than 11 GB of free memory, you’ll get the error “Free memory … < desired”.
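The per-GPU check described by the bullets above can be modeled in a few lines. This is a simplified sketch of the logic, not vLLM's actual implementation; the function names are mine:

```python
def per_gpu_threshold_gb(total_gb: float, utilization: float) -> float:
    """Memory vLLM wants to reserve on EACH GPU: utilization x that GPU's total."""
    return total_gb * utilization

def check_gpu(free_gb: float, total_gb: float, utilization: float) -> None:
    """Raise if a single GPU cannot satisfy the reservation, mirroring the
    'Free memory ... < desired' failure mode described above."""
    desired = per_gpu_threshold_gb(total_gb, utilization)
    if free_gb < desired:
        raise ValueError(f"Free memory {free_gb} GB < desired {desired} GB")

# A 44 GB card with --gpu-memory-utilization 0.25 must have >= 11 GB free:
print(per_gpu_threshold_gb(44.0, 0.25))  # 11.0
check_gpu(free_gb=12.0, total_gb=44.0, utilization=0.25)  # passes
```

Note that `check_gpu` runs independently for every GPU in the tensor-parallel group: a single GPU with only 10 GB free fails even if the other GPU has plenty of headroom.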

Thus, tensor parallelism does not halve the per-GPU memory threshold; it only lets the model's weights fit across devices, and each GPU must still independently satisfy the memory requirement.
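For reference, a typical invocation combining the two flags discussed above looks like this (`--tensor-parallel-size` and `--gpu-memory-utilization` are real vLLM CLI options; the model name is a placeholder):

```shell
# Shard the model across 2 GPUs; each GPU must still have at least
# utilization x its own total memory free, or startup fails.
vllm serve <model-name> \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.25
```

If startup fails with "Free memory … < desired", either free VRAM on the offending GPU or lower `--gpu-memory-utilization`.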