Lab: Communication and Computation Overlap
Show why Chapter 4 focuses on exposed communication time rather than total communication volume. Both paths may move the same volume, but the optimized path starts each bucket transfer as soon as that gradient bucket is ready, so less of the transfer time sits on the critical path.
This lab is a portable analogue of no-overlap and overlap-enabled DDP training paths.
Baseline
The baseline runs all backward-layer compute first, then synchronizes every gradient bucket. Communication is fully exposed at the end of the step.
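The baseline schedule can be sketched as a simple timeline model. This is an illustration only, not the code in compare.py: the function name `baseline_step` and the per-layer timings are hypothetical, and it assumes bucket transfers run back to back on a single link.

```python
def baseline_step(compute_ms, comm_ms):
    """Model a step that finishes all backward compute before any transfer.

    compute_ms: assumed backward time per layer, in milliseconds.
    comm_ms:    assumed transfer time per gradient bucket, in milliseconds.
    """
    compute_end = sum(compute_ms)   # every layer's backward pass runs first
    comm_total = sum(comm_ms)       # then the buckets are synchronized one by one
    step_ms = compute_end + comm_total
    exposed_ms = comm_total         # nothing overlaps, so all of it is exposed
    return step_ms, exposed_ms


# Hypothetical numbers: 4 layers at 4 ms each, 4 buckets at 3 ms each.
step_ms, exposed_ms = baseline_step([4, 4, 4, 4], [3, 3, 3, 3])
print(step_ms, exposed_ms)  # prints: 28 12
```

With these assumed timings the step takes 28 ms, and the full 12 ms of communication is added on top of the 16 ms of compute.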
Optimized
The optimized path submits a bucket transfer as soon as the corresponding layer's backward compute finishes. Backward compute for later layers overlaps with communication for earlier buckets, so only the final communication tail remains exposed.
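A minimal timeline sketch of the overlapped schedule follows. It is not the code in compare.py: the function name `overlapped_step` and the timings are hypothetical, and it assumes one bucket per layer and a single link that serializes transfers.

```python
def overlapped_step(compute_ms, comm_ms):
    """Model a step that starts each bucket transfer once its layer is done.

    Assumes one bucket per layer and one link that handles transfers in order.
    """
    clock = 0       # time at which the current layer's backward pass ends
    link_free = 0   # time at which the link finishes its previous transfer
    for compute, comm in zip(compute_ms, comm_ms):
        clock += compute                # the bucket becomes ready here
        start = max(clock, link_free)   # wait if the link is still busy
        link_free = start + comm        # the transfer runs while compute continues
    step_ms = max(clock, link_free)     # the step ends when both are finished
    exposed_ms = step_ms - clock        # communication left after compute ends
    return step_ms, exposed_ms


# Hypothetical numbers: 4 layers at 4 ms each, 4 buckets at 3 ms each.
print(*overlapped_step([4, 4, 4, 4], [3, 3, 3, 3]))  # prints: 19 3
```

With these assumed timings, 12 ms of communication is still performed, but only the last 3 ms transfer falls after compute has finished.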
python compare.py
Expected Observation
Both paths produce the same checksum. The optimized path should report lower median step time and a higher overlap ratio.
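The same comparison can be reproduced in a few lines with sleeps standing in for real work. This is a hedged sketch, not the contents of compare.py: every name, the sleep durations, and the checksum definition are assumptions made for illustration, and a single-worker thread pool stands in for the communication link.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

LAYERS = 4
COMPUTE_S = 0.01   # assumed backward time per layer
COMM_S = 0.01      # assumed transfer time per bucket


def backward(layer):
    time.sleep(COMPUTE_S)      # stand-in for backward compute
    return float(layer + 1)    # stand-in gradient bucket


def transfer(bucket):
    time.sleep(COMM_S)         # stand-in for a gradient all-reduce
    return bucket * 2.0        # stand-in reduced value


def run_step(overlap):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as link:  # one serialized link
        if overlap:
            # Submit each bucket as soon as its layer's backward pass returns.
            futures = [link.submit(transfer, backward(i)) for i in range(LAYERS)]
        else:
            # Finish all backward compute first, then submit every bucket.
            buckets = [backward(i) for i in range(LAYERS)]
            futures = [link.submit(transfer, b) for b in buckets]
        checksum = sum(f.result() for f in futures)
    return checksum, time.perf_counter() - start


def median_step(overlap, repeats=5):
    results = [run_step(overlap) for _ in range(repeats)]
    return results[0][0], statistics.median(t for _, t in results)


base_checksum, base_median = median_step(overlap=False)
over_checksum, over_median = median_step(overlap=True)
print(f"baseline:  checksum={base_checksum} median={base_median * 1e3:.1f} ms")
print(f"optimized: checksum={over_checksum} median={over_median * 1e3:.1f} ms")
```

The checksums match because both paths transfer the same buckets. Only the order of submission relative to compute differs, and that is what changes the measured step time.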
What This Proves
Overlap does not make communication disappear. It moves communication under useful compute so the training step pays only the part that cannot be hidden.
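One way to make "the part that cannot be hidden" precise is the recurrence below. The notation is an assumption introduced here, not taken from the lab: $c_i$ is the backward compute time of layer $i$, $m_i$ is the transfer time of its bucket, transfers share one link in order, and the overlap ratio $\rho$ is defined as the hidden fraction of communication time, which may differ from the ratio compare.py reports.

```latex
\begin{align*}
r_i &= \sum_{j=1}^{i} c_j
    && \text{(time at which bucket $i$ is ready)} \\
f_0 &= 0, \qquad f_i = \max(r_i,\, f_{i-1}) + m_i
    && \text{(time at which transfer $i$ finishes)} \\
T_{\text{step}} &= f_n, \qquad
T_{\text{exposed}} = f_n - r_n \;\ge\; m_n \\
\rho &= 1 - \frac{T_{\text{exposed}}}{\sum_{i=1}^{n} m_i}
\end{align*}
```

For example, with $n = 4$, $c_i = 2$ ms, and $m_i = 3$ ms, the ready times are $2, 4, 6, 8$ and the finish times are $5, 8, 11, 14$, so $T_{\text{exposed}} = 14 - 8 = 6$ ms and $\rho = 1 - 6/12 = 0.5$. Without overlap the same step would take $8 + 12 = 20$ ms with $\rho = 0$. Because each transfer here is slower than the compute that could hide it, the exposed tail is longer than the last bucket alone.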