Passing --tensor-parallel-size 2 splits the model parameters and computation graph across the two GPUs, so each GPU holds half of the weights;
however, vLLM still reserves a proportionally large amount of VRAM on each GPU for its share of the weights plus the corresponding KV cache and intermediate activations.
In other words:
- Each GPU must individually meet the --gpu-memory-utilization threshold;
- The threshold = utilization × total memory per GPU (44 GB × 0.25 ≈ 11 GB), not a fraction of the combined memory of both GPUs;
- Therefore, if any single GPU has less than 11 GB of free memory, you’ll get the error “Free memory … < desired”.
Thus, tensor parallelism does not halve the memory threshold: it only lets the model's weights be split across devices, while each GPU must still independently satisfy the memory requirement.