Efficient LLM Inference Systems

This chapter measures how prompt length and batch size affect latency and throughput in decoder-only LLM inference.

The experiment uses Qwen/Qwen2.5-3B-Instruct. A 3B model is used instead of the original 7B model so the run fits cleanly in a 16GB VRAM environment without CPU offload.

cd /cache/Workspace/ziwon/ai-data-center-network/efficient-llm-inference-systems
HF_HOME=/data/LLM/models/hugging-face uv run python chat01/labs-01-sweep.py \
--prompt-lengths 16,256,1024,4096 \
--batch-sizes 1,2,4,8,16,32,64,128 \
--batch-prompt-len 256 \
--max-new-tokens 512 \
--out-dir results/labs-01-sweep-512

The script runs two sweeps.

  • Prompt length sweep: vary prompt length across [16, 256, 1024, 4096] tokens
  • Batch size sweep: fix prompt length at 256 tokens and vary batch size across [1, 2, 4, 8, 16, 32, 64, 128]

The generation loop requests only the final-position logits by passing logits_to_keep=1, in the prefill pass as well as in each decode step. This avoids allocating logits for every prompt position during prefill, which would otherwise dominate memory use at large batch sizes.

Output files:

Increasing prompt length makes the prefill phase longer. As a result, TTFT, or Time To First Token, grows with the number of prompt tokens.

TPOT, or Time Per Output Token, measures per-token latency during the decode phase. Each decode step reads the full set of model weights, which is the dominant cost in this setup. The KV cache grows with prompt length, but reading it adds comparatively little: at 4096 prompt tokens and 36 KiB per token it is about 144 MiB, against roughly 6 GiB of weights. TPOT therefore changes very little as prompt length increases.
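
To make the two metrics concrete, here is a minimal sketch of how TTFT and mean TPOT could be derived from per-token timestamps. The function name and the timestamp layout are illustrative assumptions, not the sweep script's actual code.

```python
from statistics import mean

def ttft_and_tpot_ms(request_start, token_times):
    """Return (TTFT, mean TPOT) in milliseconds.

    request_start: time the request was submitted, in seconds.
    token_times:   time each generated token became available, in seconds.
    (Hypothetical helper; the real script may measure this differently.)
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    # TTFT covers prefill plus the first sampled token.
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # TPOT covers decode only: gaps between consecutive tokens after the first.
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    tpot_ms = mean(gaps_ms) if gaps_ms else float("nan")
    return ttft_ms, tpot_ms

# Synthetic timestamps: 28 ms until the first token, then one token every 12 ms.
times = [0.028 + 0.012 * i for i in range(5)]
ttft, tpot = ttft_and_tpot_ms(0.0, times)
print(f"TTFT {ttft:.1f} ms, TPOT {tpot:.1f} ms")
```

Keeping the first token out of the TPOT average is what lets the two numbers separate prefill cost from decode cost.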

Prompt length sweep

| Prompt tokens | TTFT (ms) | TPOT mean (ms) | Aggregate throughput (tok/s) |
|---|---|---|---|
| 16 | 13.9 | 12.0 | 83.5 |
| 256 | 28.0 | 12.0 | 83.3 |
| 1024 | 73.2 | 12.2 | 81.8 |
| 4096 | 297.4 | 12.5 | 79.7 |

Observations:

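  • TTFT grows roughly linearly at longer prompts: going from 1024 to 4096 tokens (4x) raises TTFT by about 4.1x. At short prompts a fixed overhead dominates, so 16 to 256 tokens (16x) only doubles TTFT.
  • TPOT stays nearly flat, moving only from 12.0 ms to 12.5 ms (about 4%) while prompt length grows 256x.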
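  • Longer prompts slightly reduce aggregate throughput, from 83.5 to 79.7 tok/s (about 4.6%). That is close to the relative TPOT increase over the same range (about 4.2%), and the reported values sit near 1000 / TPOT, so in this table the drop appears to track the small TPOT rise rather than TTFT.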
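
The linearity claim can be checked directly from the table. The sketch below computes the marginal TTFT cost per extra prompt token between consecutive sweep points. The helper name is illustrative and the data are the four measured rows above.

```python
# (prompt tokens, TTFT ms) from the prompt length sweep table.
points = [(16, 13.9), (256, 28.0), (1024, 73.2), (4096, 297.4)]

def marginal_ms_per_token(points):
    """Slope of TTFT between consecutive sweep points, in ms per prompt token."""
    return [
        (t1 - t0) / (n1 - n0)
        for (n0, t0), (n1, t1) in zip(points, points[1:])
    ]

slopes = marginal_ms_per_token(points)
for (n0, _), (n1, _), s in zip(points, points[1:], slopes):
    print(f"{n0:>4} -> {n1:>4} tokens: {s * 1000:.1f} us per extra prompt token")
```

The slope comes out near 0.06 ms per token for the first two intervals and near 0.07 ms for the last one. That is consistent with mostly linear prefill cost plus a modest extra contribution at long prompts.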

With a fixed prompt length of 256 tokens, increasing batch size lets each decode step compute the next token for multiple sequences at once. Per-token latency stays near a plateau for a while, while aggregate throughput across the batch increases substantially.

GPU utilization is sampled with nvidia-smi dmon -s pucvmet during each batch run.

Batch size sweep

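| Batch size | TPOT mean (ms) | Aggregate throughput (tok/s) | GPU util (%) |
|---|---|---|---|
| 1 | 12.1 | 82.7 | 89.1 |
| 2 | 13.8 | 145.4 | 78.3 |
| 4 | 13.3 | 300.7 | 77.6 |
| 8 | 13.5 | 592.2 | 79.6 |
| 16 | 14.5 | 1107.0 | 78.3 |
| 32 | 17.0 | 1881.1 | 91.7 |
| 64 | 19.4 | 3296.7 | 88.8 |
| 128 | 25.6 | 5007.5 | 86.2 |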

Observations:

  • Aggregate throughput increases strongly with batch size, but the gain becomes sublinear at large batches.
  • TPOT stays close to a plateau through batch size 16 (12.1 ms to 14.5 ms), then rises at batch sizes 32, 64, and 128 (17.0 ms, 19.4 ms, and 25.6 ms).
  • The transition region for this setup starts around batch size 64 to 128: throughput still increases, but doubling batch size from 64 to 128 improves aggregate throughput by only about 1.5x while TPOT rises from 19.4 ms to 25.6 ms.
  • nvidia-smi dmon samples once per second, so short runs can be noisy, but the logs still show that the GPU remains highly utilized during batch runs.
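
The sublinear scaling can be quantified from the table. This sketch computes the throughput gain for each batch-size doubling and the decode rate seen by a single sequence. The helper names are illustrative and the rows are copied from the batch size sweep.

```python
# (batch size, TPOT mean ms, aggregate throughput tok/s) from the batch size sweep table.
rows = [
    (1, 12.1, 82.7), (2, 13.8, 145.4), (4, 13.3, 300.7), (8, 13.5, 592.2),
    (16, 14.5, 1107.0), (32, 17.0, 1881.1), (64, 19.4, 3296.7), (128, 25.6, 5007.5),
]

def doubling_gains(rows):
    """Throughput ratio for each batch-size doubling (2.0 would be perfect scaling)."""
    return {b1: tp1 / tp0 for (_, _, tp0), (b1, _, tp1) in zip(rows, rows[1:])}

def per_sequence_tok_s(rows):
    """Decode rate seen by one sequence in the batch, derived from TPOT alone."""
    return {b: 1000.0 / tpot for b, tpot, _ in rows}

gains = doubling_gains(rows)
per_seq = per_sequence_tok_s(rows)
for b, _, tp in rows[1:]:
    print(f"batch {b:>3}: x{gains[b]:.2f} vs previous, {per_seq[b]:.1f} tok/s per sequence")
```

The last doubling yields about 1.52x, and the per-sequence rate falls from roughly 83 tok/s at batch size 1 to roughly 39 tok/s at batch size 128. That is the latency price paid for the higher aggregate throughput.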

For Qwen/Qwen2.5-3B-Instruct, the relevant model config is:

| Field | Value |
|---|---|
| Layers | 36 |
| Query heads | 16 |
| KV heads | 2 |
| Head dim | 128 |
| dtype | BF16 |

Because the model uses GQA, KV cache size is determined by the number of KV heads, not the number of query heads.

KV bytes per token
= 2(K,V) * 36 layers * 2 KV heads * 128 head_dim * 2 bytes
= 36,864 bytes
= 36 KiB/token
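
The same arithmetic as a small function makes the effect of GQA easy to see. The 16-KV-head case is a hypothetical comparison (as if every query head had its own KV head), not a real configuration of this model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache added per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Values from the model config table; BF16 uses 2 bytes per element.
gqa = kv_bytes_per_token(layers=36, kv_heads=2, head_dim=128, dtype_bytes=2)

# Hypothetical: the same shape with 16 KV heads instead of 2.
mha = kv_bytes_per_token(layers=36, kv_heads=16, head_dim=128, dtype_bytes=2)

print(gqa, "bytes =", gqa // 1024, "KiB per token")   # 36864 bytes = 36 KiB per token
print("hypothetical 16-KV-head cache is", mha // gqa, "times larger")  # 8 times larger
```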

For the batch sweep, prompt length is 256 and generation length is 512, so the final sequence length is about 768 tokens.

batch 128, seq 768:
128 * 768 * 36 KiB = 3.375 GiB

The rough memory budget at batch size 128 is therefore:

weights: ~6 GiB
KV cache: ~3.4 GiB
other costs: activations + CUDA context + PyTorch reserved memory + temporary tensors
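
As a rough sketch, the budget can be tabulated for every batch size in the sweep. Treating the 16GB device as exactly 16 GiB is an assumption made here for illustration. The weight and per-token KV figures are the rounded values used above.

```python
KIB = 1024
GIB = 1024 ** 3

KV_BYTES_PER_TOKEN = 36 * KIB   # per-token KV cache size derived above
WEIGHT_BYTES = 6 * GIB          # rough BF16 weight footprint
VRAM_BYTES = 16 * GIB           # assumption: 16GB device treated as 16 GiB

def kv_cache_bytes(batch_size, seq_len):
    """KV cache size once every sequence in the batch has reached seq_len tokens."""
    return batch_size * seq_len * KV_BYTES_PER_TOKEN

for batch in (1, 2, 4, 8, 16, 32, 64, 128):
    kv = kv_cache_bytes(batch, seq_len=256 + 512)
    left = VRAM_BYTES - WEIGHT_BYTES - kv
    print(f"batch {batch:>3}: KV {kv / GIB:6.3f} GiB, "
          f"left for everything else {left / GIB:5.2f} GiB")
```

Under these assumptions, batch size 128 leaves about 6.6 GiB for activations, the CUDA context, allocator overhead, and temporary tensors.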

One important pitfall is logits allocation during prefill. If the model returns logits for every prompt position, the tensor can be very large:

batch 128 * seq 256 * vocab 151,936 * 2 bytes
= 9.27 GiB
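
The figure above can be reproduced and compared with keeping only the final position. The helper below is an illustrative calculation, not part of the sweep script.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def logits_bytes(batch_size, positions, vocab_size=151_936, dtype_bytes=2):
    """Size of a [batch, positions, vocab] logits tensor in bytes (BF16 by default)."""
    return batch_size * positions * vocab_size * dtype_bytes

full = logits_bytes(batch_size=128, positions=256)     # logits for every prompt position
last_only = logits_bytes(batch_size=128, positions=1)  # final position only

print(f"all positions: {full / GIB:.2f} GiB")        # all positions: 9.27 GiB
print(f"last position: {last_only / MIB:.1f} MiB")   # last position: 37.1 MiB
```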

The sweep script uses logits_to_keep=1 because greedy decoding only needs the final-position logits. Without this, batch size 128 can OOM before KV cache becomes the true limiting factor.

The measured throughput knee around batch size 64 to 128 should therefore be interpreted as the practical saturation region for this setup, not as a pure KV-cache capacity boundary. At that point, throughput still increases, but scaling becomes sublinear and TPOT rises sharply.

LLM inference can be split into prefill and decode.

  • Prefill: processes the full prompt in parallel. As prompt length grows, TTFT increases.
  • Decode: generates one token at a time autoregressively. Each token step reads model weights, so decode is often memory-bandwidth bound.
  • Batching: groups multiple requests into the same decode step. This lets the system process more tokens per weight read, increasing aggregate throughput.
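
The amortization argument can be illustrated with a deliberately simplified traffic model. It assumes each decode step reads all weights once plus every sequence's KV cache, and it ignores activations and kernel overheads. The constants are the rounded figures from this chapter, and the function name is illustrative.

```python
KIB, GIB = 1024, 1024 ** 3

WEIGHT_BYTES = 6 * GIB          # rough BF16 weights for the 3B model
KV_BYTES_PER_TOKEN = 36 * KIB   # per-token KV cache size for this model

def bytes_read_per_generated_token(batch_size, seq_len):
    """Approximate bytes read in one decode step, divided by tokens produced.

    Simplified model: one pass over the weights plus one pass over each
    sequence's KV cache. One decode step produces batch_size tokens.
    """
    step_bytes = WEIGHT_BYTES + batch_size * seq_len * KV_BYTES_PER_TOKEN
    return step_bytes / batch_size

for batch in (1, 8, 128):
    per_tok = bytes_read_per_generated_token(batch, seq_len=768)
    print(f"batch {batch:>3}: {per_tok / GIB:.3f} GiB read per generated token")
```

In this model the weight read is shared across the batch while the KV read is not. At large batch sizes the KV term therefore stops being negligible, which is one plausible contributor to the rising TPOT.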

The expected pattern is:

  • Increasing prompt length: TTFT increases, TPOT stays nearly flat
  • Increasing batch size: aggregate throughput increases, TPOT stays flat for a range and then rises once the setup approaches saturation