LLM Inference
Inference Pipeline
Section titled “Inference Pipeline”
This diagram summarizes the end-to-end flow of decoder-only LLM inference:
- Convert user input into token IDs with the tokenizer.
- Add token embeddings and positional information.
- Process the sequence through repeated Transformer blocks.
- Project the final hidden state through the LM head to produce logits.
- Convert logits into probabilities and select the next token using a decoding strategy.
- Detokenize generated tokens into streamed output.
- Repeat autoregressively until an end token or stopping condition is reached.
It also connects the high-level pipeline to the Transformer block, single-head self-attention, decoding strategies, prefill/decode behavior with KV cache, and common serving optimizations.