Efficient LLM Inference Systems
- Week 1: Understanding Performance Metrics
- Week 2: Hardware Foundations for Inference
- Week 3: Transformer Inference and the KV Cache
- Week 4: Quantization
Appendix
Resources
- Efficient LLM Inference Systems, Algorithms & Production Engineering - Interview Pocket Notes (2026)
- Build a Large Language Model (From Scratch)
Papers
- Splitwise: Efficient generative LLM inference using phase splitting (2023.11)
- Efficiently Scaling Transformer Inference (2022.11)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022.08)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022.10)
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022.11)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (2023.06)
- Scaling Laws for Neural Language Models (2020.01)