How to configure vLLM's max-num-seqs parameter

--max-num-seqs caps the number of in-flight requests (both those currently generating tokens and those waiting to be scheduled for decoding) that vLLM will process together in a single scheduling iteration.

  • Larger value → more parallelism and better throughput, but also a higher peak GPU memory footprint from the KV cache;
  • Smaller value → lower peak memory usage, which lets the system keep running even in scenarios like yours where only a few hundred MB of free GPU memory remain;
  • The default is 256; with only ~100 MB of GPU memory left, that will OOM. Reducing it from 256 down to 8–16 avoids the memory spike during the warm-up phase.
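For example, when launching the OpenAI-compatible server, the flag is passed on the command line. This is a sketch, not your exact setup: the model name is illustrative, and `--gpu-memory-utilization` is shown only as a companion knob you may also want to lower.

```shell
# Illustrative launch: swap in your own model.
# Lowering --max-num-seqs from the default 256 to 16 caps how many
# sequences are scheduled concurrently, shrinking the peak KV-cache
# and warm-up memory demand.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.85
```

The same parameter is available as `max_num_seqs=16` when constructing `vllm.LLM` for offline inference.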

In short: it is the number of "concurrent slots". Fewer slots mean a smaller KV cache and immediate relief from GPU memory pressure.
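To see why each slot matters, here is a back-of-envelope estimate of KV-cache size per sequence. The layer/head numbers are illustrative (roughly a 7B-class model with grouped-query attention), not taken from your setup:

```python
def kv_cache_bytes_per_seq(num_layers: int, num_kv_heads: int,
                           head_dim: int, seq_len: int,
                           dtype_bytes: int = 2) -> int:
    # Each token stores one K and one V vector (factor 2) per layer,
    # each of size num_kv_heads * head_dim, at dtype_bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative 7B-class config: 32 layers, 8 KV heads, head_dim 128, fp16.
per_seq = kv_cache_bytes_per_seq(32, 8, 128, seq_len=4096)
print(f"{per_seq / 2**20:.0f} MiB per full-length sequence")     # 512 MiB
print(f"256 slots: {256 * per_seq / 2**30:.0f} GiB worst case")  # 128 GiB
print(f" 16 slots: {16 * per_seq / 2**30:.0f} GiB worst case")   # 8 GiB
```

This is a worst case: vLLM's paged KV cache only allocates blocks as sequences actually grow, and the total pool is bounded by `gpu_memory_utilization`. Still, the linear scaling with the slot count is why dropping from 256 to 16 slots relieves pressure so quickly.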