
Chapter 1: Introduction and AI System Overview

This chapter introduces the mental model of AI Systems Performance Engineering.

The core idea is:

AI performance engineering is not about making GPUs look busy. It is about maximizing useful work (goodput) across hardware, software, runtime, network, storage, scheduler, and application layers.

Chapter 1 sets the foundation for the whole book:

  • AI systems are full-stack systems.
  • Performance bottlenecks can appear at any layer.
  • Raw GPU utilization is not enough.
  • Goodput is the real target.
  • Optimization must be driven by profiling, not intuition.
  • Hardware, software, and algorithms must be codesigned.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef hw fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef sw fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef alg fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef metric fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef goal fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    HW[Hardware<br/>GPU, CPU, HBM, NVLink, RDMA, Storage]:::hw
    SW[Software Stack<br/>OS, Driver, CUDA, PyTorch, Runtime]:::sw
    ALG[Algorithms<br/>Attention, MoE, Quantization, Batching]:::alg
    PROF[Profiling<br/>Nsight, PyTorch Profiler, DCGM, NCCL Tests]:::metric
    GP[Goodput<br/>Useful training/inference throughput]:::goal

    HW --> GP
    SW --> GP
    ALG --> GP
    PROF --> HW
    PROF --> SW
    PROF --> ALG

AI systems performance engineering is the discipline of answering three questions:

  1. Where is the bottleneck?
  2. How do we measure it?
  3. Which layer should we fix?

The important shift is from:

"GPU utilization is high, so the system is healthy."

to:

"How much of the system's capacity is doing useful training or inference work?"

This is why Chapter 1 introduces goodput, mechanical sympathy, and hardware-software-algorithm codesign early.

An AI Systems Performance Engineer sits between several domains.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef role fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef layer fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef team fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    PE[AI Systems<br/>Performance Engineer]:::role

    GPU[GPU / CUDA<br/>Kernel, SM, HBM]:::layer
    INFRA[Infra<br/>OS, Docker, Kubernetes]:::layer
    NET[Network<br/>NVLink, NCCL, RDMA]:::layer
    STORAGE[Storage<br/>Dataset, Checkpoint, GDS]:::layer
    APP[Application<br/>Training loop, Serving path]:::layer

    DS[Researchers<br/>Data Scientists]:::team
    DEV[Application<br/>Developers]:::team
    OPS[Infra / Platform<br/>Engineers]:::team

    PE --> GPU
    PE --> INFRA
    PE --> NET
    PE --> STORAGE
    PE --> APP

    PE --- DS
    PE --- DEV
    PE --- OPS

The role is not just “GPU administrator” or “ML engineer.”

It combines:

| Area | Responsibility |
| --- | --- |
| Benchmarking | Measure throughput, latency, memory usage, scaling efficiency |
| Profiling | Identify bottlenecks using system and GPU profilers |
| Debugging | Trace performance regressions to root cause |
| Optimization | Improve kernels, runtime, data pipeline, communication, scheduling |
| Scaling | Move from single GPU to multi-GPU, multinode, multirack systems |
| Resource efficiency | Improve performance per dollar and performance per watt |
| Reproducibility | Make benchmark results repeatable and comparable |

Goodput means useful throughput.

Raw throughput asks:

How much work appears to be happening?

Goodput asks:

How much useful model progress is actually happening?

Examples of non-useful work:

  • GPU waiting on the dataloader
  • GPU waiting on NCCL synchronization
  • excessive CPU-GPU memory copies
  • restarts after job failures
  • excessive kernel launch overhead
  • bubbles in pipeline parallelism
  • request queueing delay in inference serving
  • KV cache eviction or recomputation
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef useful fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef waste fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef total fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    TOTAL[Total GPU Cluster Time]:::total

    USEFUL[Useful Work<br/>forward/backward<br/>tokens generated<br/>requests completed]:::useful

    W1[Data loading wait]:::waste
    W2[NCCL communication wait]:::waste
    W3[Kernel launch overhead]:::waste
    W4[OOM / restart / preemption]:::waste
    W5[Storage checkpoint stall]:::waste

    TOTAL --> USEFUL
    TOTAL --> W1
    TOTAL --> W2
    TOTAL --> W3
    TOTAL --> W4
    TOTAL --> W5

A simplified goodput view:

Goodput = Useful completed work / End-to-end elapsed time

For training:

Goodput = useful tokens or samples processed per second

For inference:

Goodput = completed requests or generated tokens per second under SLO
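As a rough illustration of the training definition above, the sketch below computes goodput from a wall-clock accounting of one job. The `JobAccounting` schema and all sample numbers are hypothetical, chosen only for this example; a real job would fill them from profiler traces and scheduler logs.

```python
from dataclasses import dataclass


@dataclass
class JobAccounting:
    """Wall-clock accounting for one training job (hypothetical schema, seconds)."""
    useful_tokens: int          # tokens that count toward model progress
    compute_s: float            # forward/backward/optimizer time
    dataloader_wait_s: float    # GPU idle while waiting for batches
    comm_wait_s: float          # exposed (non-overlapped) collective time
    checkpoint_stall_s: float   # training blocked on checkpoint I/O
    restart_lost_s: float       # time lost to failures, restarts, redone steps

    @property
    def elapsed_s(self) -> float:
        """End-to-end elapsed time: useful compute plus every kind of waiting."""
        return (self.compute_s + self.dataloader_wait_s + self.comm_wait_s
                + self.checkpoint_stall_s + self.restart_lost_s)


def goodput_tokens_per_s(job: JobAccounting) -> float:
    """Goodput = useful completed work / end-to-end elapsed time."""
    return job.useful_tokens / job.elapsed_s


def useful_time_fraction(job: JobAccounting) -> float:
    """Share of elapsed time spent on useful compute."""
    return job.compute_s / job.elapsed_s


# Made-up numbers for a one-hour window.
job = JobAccounting(
    useful_tokens=3_600_000,
    compute_s=2400.0,
    dataloader_wait_s=300.0,
    comm_wait_s=600.0,
    checkpoint_stall_s=120.0,
    restart_lost_s=180.0,
)

print(f"goodput: {goodput_tokens_per_s(job):.0f} tokens/s")
print(f"useful time: {useful_time_fraction(job):.0%}")
```

With these made-up numbers the job would report 1,500 tokens/s if only compute time were counted, but its goodput over the full hour is 1,000 tokens/s.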

The key point:

GPU utilization can be high while goodput is low.

Example:

| Situation | GPU Utilization | Goodput | Likely Bottleneck |
| --- | --- | --- | --- |
| GPU busy but waiting on all-reduce | High | Low | Network / NCCL |
| GPU periodically idle before each batch | Low or unstable | Low | CPU / dataloader / storage |
| GPU memory nearly full, KV cache eviction frequent | High | Low | Memory / serving scheduler |
| p99 latency high despite good average throughput | High | Low under SLO | Application / batching policy |
| training job restarts often | Variable | Low | Reliability / orchestration |
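The "Low under SLO" case can be made concrete with a small sketch that counts only tokens from requests that met their latency targets. The `Request` record, the SLO thresholds, and the sample requests are all hypothetical; the percentile helper uses the simple nearest-rank method rather than any particular serving framework's definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One completed serving request (hypothetical record)."""
    ttft_ms: float            # time to first token
    total_latency_ms: float   # end-to-end latency
    output_tokens: int


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_goodput(requests, window_s, ttft_slo_ms, latency_slo_ms):
    """Compare raw token throughput with throughput from SLO-compliant requests."""
    ok = [r for r in requests
          if r.ttft_ms <= ttft_slo_ms and r.total_latency_ms <= latency_slo_ms]
    return {
        "throughput_tok_s": sum(r.output_tokens for r in requests) / window_s,
        "goodput_tok_s": sum(r.output_tokens for r in ok) / window_s,
        "slo_attainment": len(ok) / len(requests),
    }


# Made-up sample: one long request misses both SLOs.
requests = [
    Request(ttft_ms=120, total_latency_ms=900, output_tokens=200),
    Request(ttft_ms=150, total_latency_ms=1100, output_tokens=200),
    Request(ttft_ms=900, total_latency_ms=4000, output_tokens=400),
    Request(ttft_ms=130, total_latency_ms=950, output_tokens=200),
]

result = slo_goodput(requests, window_s=10.0, ttft_slo_ms=500, latency_slo_ms=2000)
p99 = percentile([r.total_latency_ms for r in requests], 99)
print(result, "p99 latency (ms):", p99)
```

In this sample, raw throughput is 100 tokens/s while SLO-compliant goodput is 60 tokens/s, because the single slow request produced 40% of the tokens but missed its targets.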

Chapter 1 emphasizes that performance work should be profile-driven.

The workflow should be:

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef step fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef output fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    A[Define workload]:::step
    B[Run baseline benchmark]:::step
    C[Collect profiling data]:::step
    D{Find bottleneck}:::decision
    E[Apply targeted optimization]:::step
    F[Re-run benchmark]:::step
    G[Compare before/after]:::output
    H[Automate regression test]:::output

    A --> B --> C --> D --> E --> F --> G --> H
    G --> C

Bad optimization style:

"I changed this and it feels faster."

Good optimization style:

"Before: 1,200 tokens/s, p99 900 ms.
After: 1,580 tokens/s, p99 710 ms.
Profiler shows NCCL wait reduced from 27% to 12%."

| Workload | Primary Metric | Secondary Metrics |
| --- | --- | --- |
| Training | samples/sec, tokens/sec, step time | GPU util, NCCL time, dataloader wait, checkpoint time |
| Inference | tokens/sec, requests/sec | TTFT, TPOT, p95/p99 latency, queue time |
| Distributed training | scaling efficiency | all-reduce time, network bandwidth, straggler ratio |
| Storage-heavy training | data pipeline throughput | IOPS, read BW, dataloader latency |
| LLM serving | SLO-compliant throughput | KV cache usage, batch size, decode latency |
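Two of the distributed-training metrics listed above reduce to simple arithmetic. The sketch below treats scaling efficiency as realized speedup over ideal linear speedup, and uses one possible definition of straggler ratio (slowest rank's step time over the median rank's step time); that definition and all sample numbers are assumptions for illustration.

```python
import statistics


def scaling_efficiency(base_gpus, base_throughput, scaled_gpus, scaled_throughput):
    """Realized throughput divided by ideal (linear) throughput at the larger scale."""
    ideal = base_throughput * (scaled_gpus / base_gpus)
    return scaled_throughput / ideal


def straggler_ratio(per_rank_step_s):
    """Slowest rank's step time relative to the median rank (assumed definition)."""
    return max(per_rank_step_s) / statistics.median(per_rank_step_s)


# Made-up numbers: 8 GPUs reach 10,000 tokens/s, 64 GPUs reach 60,000 tokens/s.
eff = scaling_efficiency(8, 10_000, 64, 60_000)
ratio = straggler_ratio([1.0, 1.0, 1.0, 1.5])

print(f"scaling efficiency: {eff:.0%}")
print(f"straggler ratio: {ratio:.2f}")
```

Here the ideal 64-GPU throughput would be 80,000 tokens/s, so the realized 60,000 tokens/s corresponds to 75% scaling efficiency.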

Mechanical sympathy means:

Understand how the machine works, then design software and algorithms that cooperate with it.

For AI systems, the “machine” includes:

  • GPU SMs
  • Tensor Cores
  • HBM
  • L2 cache
  • CPU NUMA topology
  • PCIe / NVLink / NVSwitch
  • RDMA NICs
  • storage hierarchy
  • CUDA runtime
  • PyTorch execution model
  • Kubernetes scheduler placement
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef machine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef symptom fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fix fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    HBM[HBM bandwidth limit]:::machine
    ATT[Attention reads/writes too much memory]:::symptom
    FA[FlashAttention / memory tiling]:::fix

    NV[Limited interconnect bandwidth]:::machine
    COMM[AllReduce or MoE expert traffic stalls]:::symptom
    OVERLAP[Overlap communication and computation]:::fix

    CPU[CPU NUMA / dataloader overhead]:::machine
    STARVE[GPU starved for batches]:::symptom
    PIN[CPU pinning / memory pinning / prefetch]:::fix

    HBM --> ATT --> FA
    NV --> COMM --> OVERLAP
    CPU --> STARVE --> PIN

Examples:

| Hardware Reality | Performance Problem | Mechanically Sympathetic Fix |
| --- | --- | --- |
| HBM is fast but limited | attention moves too much data | FlashAttention, MLA |
| NVLink is faster than IB | cross-node communication is expensive | keep traffic intra-node/intra-rack when possible |
| Tensor Cores prefer specific precision/shapes | low compute efficiency | FP8/FP4, padding, fused kernels |
| CPU-GPU transfers are costly | dataloader stalls GPU | pinned memory, async copy, prefetch |
| distributed collectives create bubbles | scaling efficiency drops | overlap communication and computation |
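One way to reason about the "HBM is fast but limited" row is a roofline-style estimate: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the device's ratio of peak compute to peak memory bandwidth. The sketch below is a simplified model, not a profiler replacement. The peak numbers are hypothetical round values rather than the specification of any real GPU, and the byte counts assume each tensor is read or written exactly once.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as memory- or compute-bound under a simple roofline model."""
    intensity = flops / bytes_moved      # FLOPs per byte
    ridge = peak_flops / peak_bw         # intensity where the two limits meet
    attainable = min(peak_flops, intensity * peak_bw)
    label = "memory-bound" if intensity < ridge else "compute-bound"
    return label, intensity, attainable


# Hypothetical device: 1e15 FLOP/s peak compute, 4e12 B/s memory bandwidth.
PEAK_FLOPS = 1e15
PEAK_BW = 4e12

n = 8192
fp16_bytes = 2

# Elementwise add of two fp16 vectors: 1 FLOP per element, 3 tensors touched.
add_label, add_ai, add_peak = roofline(n, 3 * n * fp16_bytes, PEAK_FLOPS, PEAK_BW)

# Square fp16 matmul: 2*n^3 FLOPs, three n x n matrices touched once each.
mm_label, mm_ai, mm_peak = roofline(2 * n**3, 3 * n * n * fp16_bytes, PEAK_FLOPS, PEAK_BW)

print(f"add:    {add_label}, {add_ai:.2f} FLOP/B")
print(f"matmul: {mm_label}, {mm_ai:.0f} FLOP/B")
```

Under these assumptions the elementwise kernel can never approach peak compute no matter how well it is tuned, which is why fixes for such kernels focus on moving less data (fusion, tiling) rather than on faster math.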

Chapter 1 frames modern AI performance as a codesign problem.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef hw fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef sw fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef alg fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef result fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    HW[Hardware<br/>GPU, HBM, NVLink, RDMA, Storage]:::hw
    SW[Software<br/>CUDA, PyTorch, NCCL, Runtime, Scheduler]:::sw
    ALG[Algorithm<br/>Attention, MoE, Quantization, Batching]:::alg

    R[High Goodput<br/>Low Latency<br/>Lower Cost]:::result

    HW <--> SW
    SW <--> ALG
    ALG <--> HW

    HW --> R
    SW --> R
    ALG --> R

A performance issue can often be solved at different layers.

Example: inference latency is too high.

| Layer | Possible Fix | Trade-off |
| --- | --- | --- |
| Hardware | Use B200/H200 instead of A100/H100 | expensive |
| Precision | FP8/FP4 quantization | possible accuracy loss |
| Kernel | optimized attention kernel | engineering complexity |
| Runtime | CUDA Graphs | shape/static constraints |
| Serving | continuous batching | latency-throughput trade-off |
| Application | prompt compression | possible quality loss |
| Scheduler | route long prompts separately | operational complexity |

The senior engineer’s question is:

Which layer gives the highest ROI fix for this bottleneck?

Chapter 1 uses DeepSeek as a motivating case.

The lesson is not simply that “DeepSeek used fewer GPUs.”

The deeper performance engineering lesson is:

When hardware is constrained, software and algorithmic optimization become strategic weapons.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef constraint fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef technique fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef outcome fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    C1[Restricted GPU access<br/>H800 instead of top-tier GPUs]:::constraint
    C2[Lower interconnect bandwidth]:::constraint
    C3[Large MoE model scale]:::constraint

    T1[Custom kernels]:::technique
    T2[Communication/computation overlap]:::technique
    T3[MoE sparse activation]:::technique
    T4[Distillation / RL strategy]:::technique

    O[High model capability<br/>Lower training cost<br/>Better ROI]:::outcome

    C1 --> T1
    C2 --> T2
    C3 --> T3
    T1 --> O
    T2 --> O
    T3 --> O
    T4 --> O

Key lessons:

| Constraint | Engineering Response |
| --- | --- |
| limited GPU interconnect bandwidth | reduce and overlap communication |
| limited hardware availability | optimize kernels and runtime |
| large model size | use MoE sparse activation |
| high training cost | improve algorithmic efficiency |
| inference cost pressure | optimize attention and KV cache behavior |

In a DGX B200/H100 context, the same lesson applies:

Do not assume that more GPU is the first answer.
First prove whether the bottleneck is compute, memory, network, storage, runtime, or scheduling.

Chapter 1 is an overview chapter, so the bottleneck lens should cover the full stack.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef layer fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef metric fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef tool fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    APP[Application<br/>training loop / serving API]:::layer
    RUNTIME[Runtime<br/>PyTorch / CUDA / vLLM]:::layer
    GPU[GPU<br/>SM / Tensor Core / HBM]:::layer
    CPU[CPU / OS<br/>NUMA / threads / memory]:::layer
    NET[Network<br/>NVLink / NCCL / RDMA]:::layer
    STORAGE[Storage<br/>dataset / checkpoint]:::layer
    SCHED[Scheduler<br/>Kubernetes / SLURM placement]:::layer

    M[Metrics<br/>tokens/s, step time, TTFT, TPOT, p99, HBM BW, NCCL BW]:::metric
    T[Tools<br/>Nsight, PyTorch Profiler, DCGM, NCCL tests, iostat]:::tool

    APP --> RUNTIME --> GPU
    CPU --> GPU
    NET --> GPU
    STORAGE --> CPU
    SCHED --> CPU
    SCHED --> GPU
    SCHED --> NET

    APP --> M
    RUNTIME --> M
    GPU --> M
    CPU --> M
    NET --> M
    STORAGE --> M
    SCHED --> M
    M --> T

| Bottleneck Layer | Symptom | Metric | Tool | Example Fix |
| --- | --- | --- | --- | --- |
| GPU Compute | low achieved FLOPS | SM occupancy, tensor core utilization | Nsight Compute | kernel fusion, mixed precision |
| GPU Memory | GPU busy but slow | HBM bandwidth, memory stall | Nsight Compute | FlashAttention, tiling, cache reuse |
| CPU / OS | GPU waits for batches | dataloader time, CPU util | PyTorch Profiler, perf | num_workers, CPU pinning, pinned memory |
| Network | multi-GPU scaling poor | NCCL time, RDMA BW | NCCL tests, Nsight Systems | topology-aware placement, overlap |
| Storage | slow epoch start/checkpoint | read BW, IOPS, latency | iostat, fio, gdsio | local cache, prefetch, GDS |
| Runtime | many tiny kernels | kernel launch overhead | Nsight Systems | CUDA Graphs, torch.compile |
| Scheduler | performance varies by placement | GPU/NIC locality | kubectl, DCGM, topology view | topology-aware scheduling |
| Application | p99 latency high | TTFT, TPOT, queue time | vLLM/SGLang metrics | continuous batching, prefix cache |

Training metrics:

| Metric | Meaning |
| --- | --- |
| step time | end-to-end training iteration time |
| samples/sec | training throughput |
| tokens/sec | LLM training throughput |
| GPU utilization | whether GPU is active |
| SM occupancy | whether GPU execution resources are filled |
| HBM bandwidth | whether kernels are memory-bound |
| NCCL time | communication overhead |
| dataloader wait | CPU/storage pipeline bottleneck |
| checkpoint latency | storage write bottleneck |
| scaling efficiency | how well performance improves with more GPUs |

Inference metrics:

| Metric | Meaning |
| --- | --- |
| TTFT | time to first token; mostly prefill-sensitive |
| TPOT | time per output token; mostly decode-sensitive |
| requests/sec | serving throughput |
| tokens/sec | generation throughput |
| p50 / p95 / p99 latency | user-facing latency distribution |
| queue time | scheduler/batching pressure |
| KV cache usage | memory pressure |
| batch size | throughput-latency balance |
| SLO-compliant throughput | useful inference goodput |

| Tool | Best For |
| --- | --- |
| nvidia-smi | quick GPU utilization, memory, power |
| DCGM | cluster-level GPU telemetry |
| Nsight Systems | end-to-end timeline, CPU/GPU/NCCL overlap |
| Nsight Compute | kernel-level SM, memory, warp analysis |
| PyTorch Profiler | PyTorch operator-level bottlenecks |
| NVTX | custom profiling ranges |
| NCCL tests | communication bandwidth and latency |
| iostat, fio | storage I/O bottlenecks |
| perf | CPU-level hotspots |
| nvidia-smi topo -m | GPU/NIC/CPU topology |
| Kubernetes metrics | placement, throttling, resource pressure |

A practical workflow for this chapter:

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef q fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef m fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef t fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef f fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    Q1[1. What is slow?<br/>throughput, latency, cost, reliability]:::q
    Q2[2. Where is the bottleneck?<br/>GPU, CPU, network, storage, runtime, scheduler, app]:::q
    Q3[3. Which metric proves it?]:::m
    Q4[4. Which profiler/tool shows it?]:::t
    Q5[5. What is the lowest-risk fix?]:::f
    Q6[6. Did goodput improve?]:::m
    Q7[7. Can we reproduce it?]:::f

    Q1 --> Q2 --> Q3 --> Q4 --> Q5 --> Q6 --> Q7

The key habit:

Never optimize without a baseline.
Never claim improvement without before/after numbers.
Never rely on GPU utilization alone.
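A before/after claim is easier to defend when it comes from repeated runs rather than a single measurement. The following sketch compares medians of repeated benchmark samples. The helper name `compare` and the sample values are hypothetical (the medians are chosen to match the 1,200 to 1,580 tokens/s and 900 to 710 ms figures quoted earlier), and a real comparison would also check run-to-run variance.

```python
import statistics


def compare(before, after, higher_is_better=True):
    """Compare medians of repeated runs and report the relative change."""
    b = statistics.median(before)
    a = statistics.median(after)
    change = (a - b) / b
    improved = change > 0 if higher_is_better else change < 0
    return {
        "before": b,
        "after": a,
        "change_pct": round(100 * change, 1),
        "improved": improved,
    }


# Made-up samples from three repeated runs each.
throughput = compare([1190, 1200, 1210], [1570, 1580, 1590], higher_is_better=True)
p99_ms = compare([890, 900, 910], [700, 710, 720], higher_is_better=False)

print("tokens/s:", throughput)
print("p99 ms:  ", p99_ms)
```

Using the median keeps one noisy run from dominating the result; declaring `higher_is_better` explicitly avoids reading a latency increase as an improvement.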

When performance is poor, where should you look first?

| Symptom | First Suspect | Confirm With | Likely Fix |
| --- | --- | --- | --- |
| GPU utilization low | CPU/dataloader/storage | PyTorch Profiler, iostat | prefetch, pin_memory, more workers |
| GPU utilization high but throughput low | GPU memory-bound kernel | Nsight Compute | memory tiling, fused kernel |
| scaling from 8 to 64 GPUs is poor | NCCL/network | NCCL tests, Nsight Systems | topology-aware placement, overlap |
| p99 latency high in serving | batching/scheduler/KV cache | serving metrics | separate long prompts, tune batching |
| training pauses every N steps | checkpoint I/O | iostat, storage metrics | async checkpoint, faster storage |
| high variance between runs | scheduler/topology/noisy neighbor | DCGM, placement logs | pin placement, isolate resources |
| OOM or frequent eviction | memory pressure | GPU memory, KV cache stats | quantization, offload, cache policy |
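The triage table above can also be kept as data, for example in a runbook script, so that an observed symptom maps directly to a first suspect and a confirming tool. This is only a sketch: the symptom keys and the `triage` helper are hypothetical names invented for the example, and only three rows are encoded.

```python
from typing import NamedTuple


class TriageRule(NamedTuple):
    first_suspect: str
    confirm_with: str
    likely_fix: str


# Hypothetical symptom keys; the values mirror rows of the triage table.
TRIAGE = {
    "gpu_util_low": TriageRule(
        "CPU/dataloader/storage",
        "PyTorch Profiler, iostat",
        "prefetch, pin_memory, more workers",
    ),
    "gpu_util_high_throughput_low": TriageRule(
        "GPU memory-bound kernel",
        "Nsight Compute",
        "memory tiling, fused kernel",
    ),
    "training_pauses_every_n_steps": TriageRule(
        "checkpoint I/O",
        "iostat, storage metrics",
        "async checkpoint, faster storage",
    ),
}


def triage(symptoms):
    """Return (symptom, rule) pairs in input order; reject unknown symptoms."""
    unknown = [s for s in symptoms if s not in TRIAGE]
    if unknown:
        raise KeyError(f"no triage rule for: {unknown}")
    return [(s, TRIAGE[s]) for s in symptoms]


for symptom, rule in triage(["gpu_util_low", "training_pauses_every_n_steps"]):
    print(f"{symptom}: suspect {rule.first_suspect}; confirm with {rule.confirm_with}")
```

The lookup only suggests where to look first; the "Confirm With" column still has to be run before any fix is applied.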

Use this checklist after reading Chapter 1.

  • Define the workload: training, inference, fine-tuning, batch inference, online serving
  • Record hardware: GPU type, GPU count, CPU, memory, NIC, storage
  • Record software stack: driver, CUDA, PyTorch, NCCL, container image
  • Measure baseline throughput
  • Measure baseline latency if serving
  • Measure GPU utilization and memory usage
  • Save profiler trace
  • Identify useful work metric: samples/sec, tokens/sec, requests/sec
  • Separate useful compute time from wait time
  • Check dataloader wait
  • Check communication wait
  • Check checkpoint or storage stall
  • Check failure/restart/preemption overhead
  • Estimate goodput gap
  • Use PyTorch Profiler for framework-level bottleneck
  • Use Nsight Systems for CPU/GPU/NCCL timeline
  • Use Nsight Compute for kernel bottleneck
  • Use NCCL tests for network baseline
  • Use storage tools for dataset/checkpoint path
  • Add NVTX ranges for important code regions
  • Optimize the largest proven bottleneck first
  • Change one major variable at a time
  • Re-run benchmark
  • Compare before/after
  • Record trade-offs
  • Add regression test if possible
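The "record hardware" and "record software stack" items in the checklist above can be partly automated by writing a small manifest next to every benchmark result. The sketch below uses only the Python standard library; the manifest layout and the function names are assumptions for illustration. It calls `nvidia-smi` only when the binary is on the PATH and reads the PyTorch version only when `torch` can be imported.

```python
import json
import platform
import shutil
import subprocess
import sys
from datetime import datetime, timezone


def query_gpus():
    """Return one 'name, driver, memory' line per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


def torch_versions():
    """Return PyTorch and CUDA versions, or None when torch is missing or broken."""
    try:
        import torch
    except Exception:
        return None
    return {"torch": torch.__version__, "cuda": torch.version.cuda}


def build_manifest(workload, metrics):
    """Bundle environment details with baseline metrics (hypothetical layout)."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "workload": workload,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "gpus": query_gpus(),
        "torch": torch_versions(),
        "metrics": metrics,
    }


manifest = build_manifest("demo-training", {"tokens_per_s": 1200, "p99_ms": 900})
print(json.dumps(manifest, indent=2))
```

Storing this JSON alongside the profiler trace makes a later before/after comparison checkable: two results are only comparable when their manifests match on everything except the intended change.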

Chapter 1 gives the operating philosophy of the whole book.

The chapter’s main message:

AI systems performance engineering is full-stack, empirical, and goodput-driven.

Important takeaways:

  1. GPU utilization alone is not enough.
  2. Goodput is the meaningful performance target.
  3. Bottlenecks can appear in GPU, CPU, memory, network, storage, runtime, scheduler, or application layers.
  4. Performance optimization must be profile-driven.
  5. DeepSeek shows that smart engineering can offset hardware constraints.
  6. Mechanical sympathy means designing software and algorithms around hardware realities.
  7. Hardware, software, and algorithms must be codesigned.
  8. Reproducibility matters because performance claims without repeatable benchmarks are weak.
  9. At AI scale, small efficiency improvements can translate into large cost savings.
  10. The job of an AI Systems Performance Engineer is to turn expensive raw compute into useful model progress.

| Term | Meaning |
| --- | --- |
| Goodput | useful training/inference throughput after excluding overhead |
| Throughput | total work processed per unit time |
| GPU utilization | percentage of time GPU appears active |
| Mechanical Sympathy | hardware-aware software/algorithm design |
| Codesign | optimizing hardware, software, and algorithms together |
| Profiling | measuring where time/resources are spent |
| Benchmarking | reproducible measurement of performance |
| NCCL | NVIDIA collective communication library |
| NIXL | NVIDIA inference transfer library for distributed inference data movement |
| RDMA | direct memory transfer across network without CPU copy |
| FlashAttention | hardware-aware attention algorithm reducing memory traffic |
| MoE | mixture-of-experts model using sparse activation |
| TTFT | time to first token |
| TPOT | time per output token |
| Scaling Efficiency | realized speedup compared with ideal speedup |
  1. What is the difference between throughput and goodput?
  2. Why can GPU utilization be misleading?
  3. What does an AI Systems Performance Engineer optimize?
  4. What is mechanical sympathy?
  5. Why does Chapter 1 emphasize reproducible benchmarking?
  6. GPU utilization is 95%, but training throughput is low. What are three possible causes?
  7. Multi-GPU training scales poorly from 8 GPUs to 64 GPUs. Which metrics would you check?
  8. Inference p99 latency is high while average latency is acceptable. What should you inspect?
  9. A training job pauses every few hundred steps. Which layer might be responsible?
  10. A model serving system has high TTFT but acceptable TPOT. Which phase is likely bottlenecked?
  11. In a DGX B200/H100 cluster, why is topology-aware scheduling important?
  12. Which tools would you use to distinguish GPU compute bottleneck from network bottleneck?
  13. How would you prove that a dataloader optimization improved goodput?
  14. When should you consider algorithm-level optimization instead of buying more GPUs?
  15. What should be included in a performance regression test?
  1. Throughput is total processed work per time. Goodput is useful completed work per time, excluding waits, restarts, stalls, and overhead.

  2. GPU utilization only says the GPU is active. It does not prove that the GPU is doing useful model progress. A GPU can be busy with inefficient kernels, memory movement, or synchronization overhead.

  3. The role optimizes end-to-end AI workload performance across hardware, software, algorithms, runtime, network, storage, and scheduler layers.

  4. Mechanical sympathy means understanding the hardware’s actual behavior and designing software/algorithms that exploit its strengths and avoid its weaknesses.

  5. Because performance claims are meaningless unless they can be repeated, compared, and validated with the same workload, environment, and metrics.

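  6. Possible causes: NCCL wait (communication kernels still count as GPU activity), memory-bound kernels, small batch size, and synchronization bubbles. Dataloader stalls, storage jitter, and CPU NUMA issues more often show up as low or unstable utilization, but they can still contribute.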

  7. Check NCCL bandwidth, all-reduce time, step time breakdown, GPU/NIC topology, RDMA counters, NVLink/NVSwitch usage, and straggler behavior.

  8. Inspect queue time, request length distribution, batch size, KV cache usage, prefill/decode split, TTFT, TPOT, and scheduler policy.

  9. Checkpoint I/O, storage bandwidth, filesystem latency, or distributed synchronization may be responsible.

  10. High TTFT usually points to the prefill phase, prompt processing, scheduling queue, or long input context bottleneck.

  11. Because bad placement can put GPUs, NICs, and CPU threads across inefficient topology paths, increasing NCCL latency and reducing goodput.

  12. Use Nsight Systems for timeline and NCCL overlap, Nsight Compute for kernel-level compute/memory analysis, and NCCL tests for network baseline.

  13. Measure before/after samples/sec or tokens/sec, dataloader wait time, GPU idle time, CPU utilization, and repeat the benchmark under the same conditions.

  14. When profiling shows the bottleneck is memory movement, communication, attention complexity, batching policy, or KV cache behavior rather than raw compute capacity.

  15. Workload definition, fixed input shape/data, hardware/software versions, baseline metric, acceptable threshold, profiler artifact, and automated comparison.