LLM Inference

Inference Pipeline

[Figure: LLM Inference Pipeline]

This diagram summarizes the end-to-end flow of decoder-only LLM inference:

1. Convert user input into token IDs with the tokenizer.
2. Look up the token embeddings and add positional information.
3. Process the sequence through repeated Transformer blocks.
4. Project the final hidden state through the LM head to produce logits.
5. Convert logits into probabilities and select the next token using a decoding strategy, such as greedy selection or top-k/top-p sampling.
6. Detokenize generated tokens into streamed output.
7. Append the new token to the sequence and repeat autoregressively until an end-of-sequence token or another stopping condition is reached.
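
The loop described by these steps can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the vocabulary, the whitespace tokenizer, and the function toy_logits (a stand-in for the embeddings, Transformer blocks, and LM head) are all hypothetical names invented for this example. Only the control flow is meant to carry over: tokenize, score, convert to probabilities, pick a token, stop on an end token or a length limit.

```python
import math
import random

# Hypothetical toy vocabulary; real tokenizers use learned subword units.
VOCAB = ["<eos>", "hello", "world", "foo", "bar"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
EOS_ID = TOKEN_TO_ID["<eos>"]


def tokenize(text):
    """Map whitespace-separated words to token IDs."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def detokenize(ids):
    """Map token IDs back to text."""
    return " ".join(VOCAB[i] for i in ids)


def toy_logits(ids):
    """Stand-in for embeddings + Transformer blocks + LM head (assumption).

    Returns one score per vocabulary entry and simply favours the token
    whose ID follows the last token's ID, wrapping around to <eos>.
    """
    target = (ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


def softmax(logits, temperature=1.0):
    """Convert logits into probabilities (numerically stable form)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=8, greedy=True, seed=0):
    """Autoregressive loop: score, select, append, repeat."""
    rng = random.Random(seed)
    ids = tokenize(prompt)
    new_ids = []
    for _ in range(max_new_tokens):          # stopping condition: length limit
        probs = softmax(toy_logits(ids + new_ids))
        if greedy:
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            next_id = rng.choices(range(len(probs)), weights=probs)[0]
        if next_id == EOS_ID:                # stopping condition: end token
            break
        new_ids.append(next_id)
    return detokenize(new_ids)


print(generate("hello"))  # prints "world foo bar"
```

A serving system would typically emit each token as soon as it is selected rather than returning the full string at the end; the sketch returns once only to stay short.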

It also connects the high-level pipeline to the Transformer block, single-head self-attention, decoding strategies, prefill/decode behavior with KV cache, and common serving optimizations.
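
To make the prefill/decode distinction concrete, here is a minimal sketch of single-head self-attention with a KV cache, again in plain Python. The class name SingleHeadAttentionWithCache, the helper functions, and the tiny weight matrices are assumptions made up for this illustration. One simplification is worth flagging: real systems run prefill over all prompt tokens in parallel, whereas this sketch feeds them through one at a time, which gives the same cached keys and values but not the same performance profile.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out


class SingleHeadAttentionWithCache:
    """Hypothetical single-head attention layer that keeps a KV cache."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.k_cache = []
        self.v_cache = []

    def step(self, x):
        """Decode step: project only the new token, reuse cached K/V.

        The new token attends to itself and every earlier position, so
        causality holds without an explicit mask.
        """
        self.k_cache.append(matvec(self.Wk, x))
        self.v_cache.append(matvec(self.Wv, x))
        return attend(matvec(self.Wq, x), self.k_cache, self.v_cache)

    def prefill(self, xs):
        """Fill the cache from the prompt (sequential here for simplicity)."""
        return [self.step(x) for x in xs]


def full_recompute(Wq, Wk, Wv, xs):
    """Reference without a cache: recompute K/V for every position."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


# Arbitrary example weights and embeddings (not from any real model).
Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.4, 0.0], [-0.3, 0.2]]
Wv = [[1.0, 0.5], [0.0, -1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

layer = SingleHeadAttentionWithCache(Wq, Wk, Wv)
layer.prefill(xs[:2])          # prompt tokens populate the cache
cached = layer.step(xs[2])     # one decode step projects only the new token
reference = full_recompute(Wq, Wk, Wv, xs)
```

Here cached and reference should agree up to floating-point rounding, which is the point of the cache: each decode step projects only the newest token instead of recomputing keys and values for the whole sequence.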