Chapter 1: Introduction and AI System Overview
Table of Contents
Section titled “Table of Contents”- Goal
- Core Message
- AI Systems Performance Engineer
- Why Goodput Matters
- Benchmarking and Profiling
- Mechanical Sympathy
- Hardware-Software-Algorithm Codesign
- DeepSeek Case Study
- Performance Bottleneck Lens
- Practical Metrics and Tools
- AI Performance Engineering Workflow
- Design Decision Matrix
- Operational Validation Checklist
- Chapter Summary
- Key Terms
- Questions
- Answers
- References
This chapter introduces the mental model of AI Systems Performance Engineering.
The core idea is:
AI performance engineering is not about making GPUs look busy. It is about maximizing useful work — goodput — across hardware, software, runtime, network, storage, scheduler, and application layers.
Chapter 1 sets the foundation for the whole book:
- AI systems are full-stack systems.
- Performance bottlenecks can appear at any layer.
- Raw GPU utilization is not enough.
- Goodput is the real target.
- Optimization must be driven by profiling, not intuition.
- Hardware, software, and algorithms must be codesigned.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
classDef hw fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef sw fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef alg fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef metric fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef goal fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
HW[Hardware<br/>GPU, CPU, HBM, NVLink, RDMA, Storage]:::hw
SW[Software Stack<br/>OS, Driver, CUDA, PyTorch, Runtime]:::sw
ALG[Algorithms<br/>Attention, MoE, Quantization, Batching]:::alg
PROF[Profiling<br/>Nsight, PyTorch Profiler, DCGM, NCCL Tests]:::metric
GP[Goodput<br/>Useful training/inference throughput]:::goal
HW --> GP
SW --> GP
ALG --> GP
PROF --> HW
PROF --> SW
PROF --> ALG
Core Message
Section titled “Core Message”AI systems performance engineering is the discipline of answering three questions:
- Where is the bottleneck?
- How do we measure it?
- Which layer should we fix?
The important shift is from:
"GPU utilization is high, so the system is healthy."to:
"How much of the system's capacity is doing useful training or inference work?"This is why Chapter 1 introduces goodput, mechanical sympathy, and hardware-software-algorithm codesign early.
AI Systems Performance Engineer
Section titled “AI Systems Performance Engineer”An AI Systems Performance Engineer sits between several domains.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
classDef role fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef layer fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef team fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
PE[AI Systems<br/>Performance Engineer]:::role
GPU[GPU / CUDA<br/>Kernel, SM, HBM]:::layer
INFRA[Infra<br/>OS, Docker, Kubernetes]:::layer
NET[Network<br/>NVLink, NCCL, RDMA]:::layer
STORAGE[Storage<br/>Dataset, Checkpoint, GDS]:::layer
APP[Application<br/>Training loop, Serving path]:::layer
DS[Researchers<br/>Data Scientists]:::team
DEV[Application<br/>Developers]:::team
OPS[Infra / Platform<br/>Engineers]:::team
PE --> GPU
PE --> INFRA
PE --> NET
PE --> STORAGE
PE --> APP
PE --- DS
PE --- DEV
PE --- OPS
The role is not just “GPU administrator” or “ML engineer.”
It combines:
| Area | Responsibility |
|---|---|
| Benchmarking | Measure throughput, latency, memory usage, scaling efficiency |
| Profiling | Identify bottlenecks using system and GPU profilers |
| Debugging | Trace performance regressions to root cause |
| Optimization | Improve kernels, runtime, data pipeline, communication, scheduling |
| Scaling | Move from single GPU to multi-GPU, multinode, multirack systems |
| Resource efficiency | Improve performance per dollar and performance per watt |
| Reproducibility | Make benchmark results repeatable and comparable |
Why Goodput Matters
Section titled “Why Goodput Matters”Goodput means useful throughput.
Raw throughput asks:
How much work appears to be happening?Goodput asks:
How much useful model progress is actually happening?Examples of non-useful work:
- GPU waiting for dataloader
- GPU waiting for NCCL synchronization
- excessive CPU-GPU memory copy
- failed job restart
- suboptimal kernel launch overhead
- communication bubble in pipeline parallelism
- request queueing delay in inference serving
- KV cache eviction or recomputation
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
classDef useful fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef waste fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef total fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
TOTAL[Total GPU Cluster Time]:::total
USEFUL[Useful Work<br/>forward/backward<br/>tokens generated<br/>requests completed]:::useful
W1[Data loading wait]:::waste
W2[NCCL communication wait]:::waste
W3[Kernel launch overhead]:::waste
W4[OOM / restart / preemption]:::waste
W5[Storage checkpoint stall]:::waste
TOTAL --> USEFUL
TOTAL --> W1
TOTAL --> W2
TOTAL --> W3
TOTAL --> W4
TOTAL --> W5
A simplified goodput view:
Goodput = Useful completed work / End-to-end elapsed timeFor training:
Goodput = useful tokens or samples processed per secondFor inference:
Goodput = completed requests or generated tokens per second under SLOThe key point:
GPU utilization can be high while goodput is low.
Example:
| Situation | GPU Utilization | Goodput | Likely Bottleneck |
|---|---|---|---|
| GPU busy but waiting on all-reduce | High | Low | Network / NCCL |
| GPU periodically idle before each batch | Low or unstable | Low | CPU / dataloader / storage |
| GPU memory nearly full, KV cache eviction frequent | High | Low | Memory / serving scheduler |
| p99 latency high despite good average throughput | High | Low under SLO | Application / batching policy |
| training job restarts often | Variable | Low | Reliability / orchestration |
Benchmarking and Profiling
Section titled “Benchmarking and Profiling”Chapter 1 emphasizes that performance work should be profile-driven.
The workflow should be:
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
classDef step fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef output fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
A[Define workload]:::step
B[Run baseline benchmark]:::step
C[Collect profiling data]:::step
D{Find bottleneck}:::decision
E[Apply targeted optimization]:::step
F[Re-run benchmark]:::step
G[Compare before/after]:::output
H[Automate regression test]:::output
A --> B --> C --> D --> E --> F --> G --> H
G --> C
Bad optimization style:
"I changed this and it feels faster."Good optimization style:
"Before: 1,200 tokens/s, p99 900 ms.After: 1,580 tokens/s, p99 710 ms.Profiler shows NCCL wait reduced from 27% to 12%."Benchmarking targets
Section titled “Benchmarking targets”| Workload | Primary Metric | Secondary Metrics |
|---|---|---|
| Training | samples/sec, tokens/sec, step time | GPU util, NCCL time, dataloader wait, checkpoint time |
| Inference | tokens/sec, requests/sec | TTFT, TPOT, p95/p99 latency, queue time |
| Distributed training | scaling efficiency | all-reduce time, network bandwidth, straggler ratio |
| Storage-heavy training | data pipeline throughput | IOPS, read BW, dataloader latency |
| LLM serving | SLO-compliant throughput | KV cache usage, batch size, decode latency |
Mechanical Sympathy
Section titled “Mechanical Sympathy”Mechanical sympathy means:
Understand how the machine works, then design software and algorithms that cooperate with it.
For AI systems, the “machine” includes:
- GPU SMs
- Tensor Cores
- HBM
- L2 cache
- CPU NUMA topology
- PCIe / NVLink / NVSwitch
- RDMA NICs
- storage hierarchy
- CUDA runtime
- PyTorch execution model
- Kubernetes scheduler placement
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
classDef machine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef symptom fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef fix fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
HBM[HBM bandwidth limit]:::machine
ATT[Attention reads/writes too much memory]:::symptom
FA[FlashAttention / memory tiling]:::fix
NV[Limited interconnect bandwidth]:::machine
COMM[AllReduce or MoE expert traffic stalls]:::symptom
OVERLAP[Overlap communication and computation]:::fix
CPU[CPU NUMA / dataloader overhead]:::machine
STARVE[GPU starved for batches]:::symptom
PIN[CPU pinning / memory pinning / prefetch]:::fix
HBM --> ATT --> FA
NV --> COMM --> OVERLAP
CPU --> STARVE --> PIN
Examples:
| Hardware Reality | Performance Problem | Mechanically Sympathetic Fix |
|---|---|---|
| HBM is fast but limited | attention moves too much data | FlashAttention, MLA |
| NVLink is faster than IB | cross-node communication is expensive | keep traffic intra-node/intra-rack when possible |
| Tensor Cores prefer specific precision/shapes | low compute efficiency | FP8/FP4, padding, fused kernels |
| CPU-GPU transfers are costly | dataloader stalls GPU | pinned memory, async copy, prefetch |
| distributed collectives create bubbles | scaling efficiency drops | overlap communication and computation |
Hardware-Software-Algorithm Codesign
Section titled “Hardware-Software-Algorithm Codesign”Chapter 1 frames modern AI performance as a codesign problem.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
classDef hw fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef sw fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef alg fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef result fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
HW[Hardware<br/>GPU, HBM, NVLink, RDMA, Storage]:::hw
SW[Software<br/>CUDA, PyTorch, NCCL, Runtime, Scheduler]:::sw
ALG[Algorithm<br/>Attention, MoE, Quantization, Batching]:::alg
R[High Goodput<br/>Low Latency<br/>Lower Cost]:::result
HW <--> SW
SW <--> ALG
ALG <--> HW
HW --> R
SW --> R
ALG --> R
A performance issue can often be solved at different layers.
Example: inference latency is too high.
| Layer | Possible Fix | Trade-off |
|---|---|---|
| Hardware | Use B200/H200 instead of A100/H100 | expensive |
| Precision | FP8/FP4 quantization | possible accuracy loss |
| Kernel | optimized attention kernel | engineering complexity |
| Runtime | CUDA Graphs | shape/static constraints |
| Serving | continuous batching | latency-throughput trade-off |
| Application | prompt compression | possible quality loss |
| Scheduler | route long prompts separately | operational complexity |
The senior engineer’s question is:
Which layer gives the highest ROI fix for this bottleneck?DeepSeek Case Study
Section titled “DeepSeek Case Study”Chapter 1 uses DeepSeek as a motivating case.
The lesson is not simply that “DeepSeek used fewer GPUs.”
The deeper performance engineering lesson is:
When hardware is constrained, software and algorithmic optimization become strategic weapons.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
classDef constraint fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef technique fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef outcome fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
C1[Restricted GPU access<br/>H800 instead of top-tier GPUs]:::constraint
C2[Lower interconnect bandwidth]:::constraint
C3[Large MoE model scale]:::constraint
T1[Custom kernels]:::technique
T2[Communication/computation overlap]:::technique
T3[MoE sparse activation]:::technique
T4[Distillation / RL strategy]:::technique
O[High model capability<br/>Lower training cost<br/>Better ROI]:::outcome
C1 --> T1
C2 --> T2
C3 --> T3
T1 --> O
T2 --> O
T3 --> O
T4 --> O
Key lessons:
| Constraint | Engineering Response |
|---|---|
| limited GPU interconnect bandwidth | reduce and overlap communication |
| limited hardware availability | optimize kernels and runtime |
| large model size | use MoE sparse activation |
| high training cost | improve algorithmic efficiency |
| inference cost pressure | optimize attention and KV cache behavior |
For your DGX B200/H100 context, the same lesson applies:
Do not assume that more GPU is the first answer.First prove whether the bottleneck is compute, memory, network, storage, runtime, or scheduling.Performance Bottleneck Lens
Section titled “Performance Bottleneck Lens”Chapter 1 is an overview chapter, so the bottleneck lens should cover the full stack.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
classDef layer fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef metric fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef tool fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
APP[Application<br/>training loop / serving API]:::layer
RUNTIME[Runtime<br/>PyTorch / CUDA / vLLM]:::layer
GPU[GPU<br/>SM / Tensor Core / HBM]:::layer
CPU[CPU / OS<br/>NUMA / threads / memory]:::layer
NET[Network<br/>NVLink / NCCL / RDMA]:::layer
STORAGE[Storage<br/>dataset / checkpoint]:::layer
SCHED[Scheduler<br/>Kubernetes / SLURM placement]:::layer
M[Metrics<br/>tokens/s, step time, TTFT, TPOT, p99, HBM BW, NCCL BW]:::metric
T[Tools<br/>Nsight, PyTorch Profiler, DCGM, NCCL tests, iostat]:::tool
APP --> RUNTIME --> GPU
CPU --> GPU
NET --> GPU
STORAGE --> CPU
SCHED --> CPU
SCHED --> GPU
SCHED --> NET
APP --> M
RUNTIME --> M
GPU --> M
CPU --> M
NET --> M
STORAGE --> M
SCHED --> M
M --> T
Bottleneck table
Section titled “Bottleneck table”| Bottleneck Layer | Symptom | Metric | Tool | Example Fix |
|---|---|---|---|---|
| GPU Compute | low achieved FLOPS | SM occupancy, tensor core utilization | Nsight Compute | kernel fusion, mixed precision |
| GPU Memory | GPU busy but slow | HBM bandwidth, memory stall | Nsight Compute | FlashAttention, tiling, cache reuse |
| CPU / OS | GPU waits for batches | dataloader time, CPU util | PyTorch Profiler, perf | num_workers, CPU pinning, pinned memory |
| Network | multi-GPU scaling poor | NCCL time, RDMA BW | NCCL tests, Nsight Systems | topology-aware placement, overlap |
| Storage | slow epoch start/checkpoint | read BW, IOPS, latency | iostat, fio, gdsio | local cache, prefetch, GDS |
| Runtime | many tiny kernels | kernel launch overhead | Nsight Systems | CUDA Graphs, torch.compile |
| Scheduler | performance varies by placement | GPU/NIC locality | kubectl, DCGM, topology view | topology-aware scheduling |
| Application | p99 latency high | TTFT, TPOT, queue time | vLLM/SGLang metrics | continuous batching, prefix cache |
Practical Metrics and Tools
Section titled “Practical Metrics and Tools”Training metrics
Section titled “Training metrics”| Metric | Meaning |
|---|---|
| step time | end-to-end training iteration time |
| samples/sec | training throughput |
| tokens/sec | LLM training throughput |
| GPU utilization | whether GPU is active |
| SM occupancy | whether GPU execution resources are filled |
| HBM bandwidth | whether kernels are memory-bound |
| NCCL time | communication overhead |
| dataloader wait | CPU/storage pipeline bottleneck |
| checkpoint latency | storage write bottleneck |
| scaling efficiency | how well performance improves with more GPUs |
Inference metrics
Section titled “Inference metrics”| Metric | Meaning |
|---|---|
| TTFT | time to first token; mostly prefill-sensitive |
| TPOT | time per output token; mostly decode-sensitive |
| requests/sec | serving throughput |
| tokens/sec | generation throughput |
| p50 / p95 / p99 latency | user-facing latency distribution |
| queue time | scheduler/batching pressure |
| KV cache usage | memory pressure |
| batch size | throughput-latency balance |
| SLO-compliant throughput | useful inference goodput |
| Tool | Best For |
|---|---|
nvidia-smi | quick GPU utilization, memory, power |
| DCGM | cluster-level GPU telemetry |
| Nsight Systems | end-to-end timeline, CPU/GPU/NCCL overlap |
| Nsight Compute | kernel-level SM, memory, warp analysis |
| PyTorch Profiler | PyTorch operator-level bottlenecks |
| NVTX | custom profiling ranges |
| NCCL tests | communication bandwidth and latency |
iostat, fio | storage I/O bottlenecks |
perf | CPU-level hotspots |
nvidia-smi topo -m | GPU/NIC/CPU topology |
| Kubernetes metrics | placement, throttling, resource pressure |
AI Performance Engineering Workflow
Section titled “AI Performance Engineering Workflow”A practical workflow for this chapter:
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
classDef q fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef m fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef t fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
classDef f fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
Q1[1. What is slow?<br/>throughput, latency, cost, reliability]:::q
Q2[2. Where is the bottleneck?<br/>GPU, CPU, network, storage, runtime, scheduler, app]:::q
Q3[3. Which metric proves it?]:::m
Q4[4. Which profiler/tool shows it?]:::t
Q5[5. What is the lowest-risk fix?]:::f
Q6[6. Did goodput improve?]:::m
Q7[7. Can we reproduce it?]:::f
Q1 --> Q2 --> Q3 --> Q4 --> Q5 --> Q6 --> Q7
The key habit:
Never optimize without a baseline.Never claim improvement without before/after numbers.Never rely on GPU utilization alone.Design Decision Matrix
Section titled “Design Decision Matrix”When performance is poor, where should you look first?
Section titled “When performance is poor, where should you look first?”| Symptom | First Suspect | Confirm With | Likely Fix |
|---|---|---|---|
| GPU utilization low | CPU/dataloader/storage | PyTorch Profiler, iostat | prefetch, pin_memory, more workers |
| GPU utilization high but throughput low | GPU memory-bound kernel | Nsight Compute | memory tiling, fused kernel |
| scaling from 8 to 64 GPUs is poor | NCCL/network | NCCL tests, Nsight Systems | topology-aware placement, overlap |
| p99 latency high in serving | batching/scheduler/KV cache | serving metrics | separate long prompts, tune batching |
| training pauses every N steps | checkpoint I/O | iostat, storage metrics | async checkpoint, faster storage |
| high variance between runs | scheduler/topology/noisy neighbor | DCGM, placement logs | pin placement, isolate resources |
| OOM or frequent eviction | memory pressure | GPU memory, KV cache stats | quantization, offload, cache policy |
Operational Validation Checklist
Section titled “Operational Validation Checklist”Use this checklist after reading Chapter 1.
Baseline
Section titled “Baseline”- Define the workload: training, inference, fine-tuning, batch inference, online serving
- Record hardware: GPU type, GPU count, CPU, memory, NIC, storage
- Record software stack: driver, CUDA, PyTorch, NCCL, container image
- Measure baseline throughput
- Measure baseline latency if serving
- Measure GPU utilization and memory usage
- Save profiler trace
Goodput
Section titled “Goodput”- Identify useful work metric: samples/sec, tokens/sec, requests/sec
- Separate useful compute time from wait time
- Check dataloader wait
- Check communication wait
- Check checkpoint or storage stall
- Check failure/restart/preemption overhead
- Estimate goodput gap
Profiling
Section titled “Profiling”- Use PyTorch Profiler for framework-level bottleneck
- Use Nsight Systems for CPU/GPU/NCCL timeline
- Use Nsight Compute for kernel bottleneck
- Use NCCL tests for network baseline
- Use storage tools for dataset/checkpoint path
- Add NVTX ranges for important code regions
Optimization
Section titled “Optimization”- Optimize the largest proven bottleneck first
- Change one major variable at a time
- Re-run benchmark
- Compare before/after
- Record trade-offs
- Add regression test if possible
Chapter Summary
Section titled “Chapter Summary”Chapter 1 gives the operating philosophy of the whole book.
The chapter’s main message:
AI systems performance engineering is full-stack, empirical, and goodput-driven.
Important takeaways:
- GPU utilization alone is not enough.
- Goodput is the meaningful performance target.
- Bottlenecks can appear in GPU, CPU, memory, network, storage, runtime, scheduler, or application layers.
- Performance optimization must be profile-driven.
- DeepSeek shows that smart engineering can offset hardware constraints.
- Mechanical sympathy means designing software and algorithms around hardware realities.
- Hardware, software, and algorithms must be codesigned.
- Reproducibility matters because performance claims without repeatable benchmarks are weak.
- At AI scale, small efficiency improvements can translate into large cost savings.
- The job of an AI Systems Performance Engineer is to turn expensive raw compute into useful model progress.
Key Terms
Section titled “Key Terms”| Term | Meaning |
|---|---|
| Goodput | useful training/inference throughput after excluding overhead |
| Throughput | total work processed per unit time |
| GPU utilization | percentage of time GPU appears active |
| Mechanical Sympathy | hardware-aware software/algorithm design |
| Codesign | optimizing hardware, software, and algorithms together |
| Profiling | measuring where time/resources are spent |
| Benchmarking | reproducible measurement of performance |
| NCCL | NVIDIA collective communication library |
| NIXL | NVIDIA inference transfer library for distributed inference data movement |
| RDMA | direct memory transfer across network without CPU copy |
| FlashAttention | hardware-aware attention algorithm reducing memory traffic |
| MoE | mixture-of-experts model using sparse activation |
| TTFT | time to first token |
| TPOT | time per output token |
| Scaling Efficiency | realized speedup compared with ideal speedup |
Questions
Section titled “Questions”Concept Check
Section titled “Concept Check”- What is the difference between throughput and goodput?
- Why can GPU utilization be misleading?
- What does an AI Systems Performance Engineer optimize?
- What is mechanical sympathy?
- Why does Chapter 1 emphasize reproducible benchmarking?
Bottleneck Diagnosis
Section titled “Bottleneck Diagnosis”- GPU utilization is 95%, but training throughput is low. What are three possible causes?
- Multi-GPU training scales poorly from 8 GPUs to 64 GPUs. Which metrics would you check?
- Inference p99 latency is high while average latency is acceptable. What should you inspect?
- A training job pauses every few hundred steps. Which layer might be responsible?
- A model serving system has high TTFT but acceptable TPOT. Which phase is likely bottlenecked?
Practical Application
Section titled “Practical Application”- In a DGX B200/H100 cluster, why is topology-aware scheduling important?
- Which tools would you use to distinguish GPU compute bottleneck from network bottleneck?
- How would you prove that a dataloader optimization improved goodput?
- When should you consider algorithm-level optimization instead of buying more GPUs?
- What should be included in a performance regression test?
Answers
Section titled “Answers”-
Throughput is total processed work per time. Goodput is useful completed work per time, excluding waits, restarts, stalls, and overhead.
-
GPU utilization only says the GPU is active. It does not prove that the GPU is doing useful model progress. A GPU can be busy with inefficient kernels, memory movement, or synchronization overhead.
-
The role optimizes end-to-end AI workload performance across hardware, software, algorithms, runtime, network, storage, and scheduler layers.
-
Mechanical sympathy means understanding the hardware’s actual behavior and designing software/algorithms that exploit its strengths and avoid its weaknesses.
-
Because performance claims are meaningless unless they can be repeated, compared, and validated with the same workload, environment, and metrics.
-
Possible causes: NCCL wait, memory-bound kernels, small batch size, dataloader stalls, storage jitter, CPU NUMA issues, synchronization bubbles.
-
Check NCCL bandwidth, all-reduce time, step time breakdown, GPU/NIC topology, RDMA counters, NVLink/NVSwitch usage, and straggler behavior.
-
Inspect queue time, request length distribution, batch size, KV cache usage, prefill/decode split, TTFT, TPOT, and scheduler policy.
-
Checkpoint I/O, storage bandwidth, filesystem latency, or distributed synchronization may be responsible.
-
High TTFT usually points to the prefill phase, prompt processing, scheduling queue, or long input context bottleneck.
-
Because bad placement can put GPUs, NICs, and CPU threads across inefficient topology paths, increasing NCCL latency and reducing goodput.
-
Use Nsight Systems for timeline and NCCL overlap, Nsight Compute for kernel-level compute/memory analysis, and NCCL tests for network baseline.
-
Measure before/after samples/sec or tokens/sec, dataloader wait time, GPU idle time, CPU utilization, and repeat the benchmark under the same conditions.
-
When profiling shows the bottleneck is memory movement, communication, attention complexity, batching policy, or KV cache behavior rather than raw compute capacity.
-
Workload definition, fixed input shape/data, hardware/software versions, baseline metric, acceptable threshold, profiler artifact, and automated comparison.