Lab: Communication and Computation Overlap

This lab shows why Chapter 4 focuses on exposed communication time rather than total communication volume. Both paths may move the same amount of data, but the optimized path starts each bucket transfer as soon as that gradient bucket is ready.

This lab is a portable analogue of no-overlap and overlap-enabled DDP training paths.

The baseline runs all backward-layer compute first, then synchronizes every gradient bucket. Communication is fully exposed at the end of the step.

The optimized path submits a bucket transfer immediately after each layer finishes. Later backward compute overlaps with earlier bucket communication, so only the final communication tail remains exposed.
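The difference between the two schedules can be worked out on paper before running anything. The following is a minimal timeline sketch, not the lab's own code: it assumes a single communication link that handles one bucket at a time, and the function names and the per-layer and per-bucket durations are hypothetical values chosen for illustration.

```python
from itertools import accumulate


def baseline_step_time(compute, comm):
    """All backward compute first, then every bucket transfer back to back."""
    return sum(compute) + sum(comm)


def overlapped_step_time(compute, comm):
    """Submit bucket i as soon as layer i finishes.

    Assumption: transfers share one link and run one at a time, in order.
    """
    ready_times = accumulate(compute)  # time at which each bucket is ready
    link_free = 0.0                    # time at which the link is next idle
    for ready, duration in zip(ready_times, comm):
        start = max(ready, link_free)  # wait for the bucket and for the link
        link_free = start + duration
    # The step ends when both compute and the last transfer are done.
    return max(sum(compute), link_free)


# Hypothetical durations in milliseconds, in backward order.
compute_ms = [4.0, 4.0, 4.0, 4.0]
comm_ms = [3.0, 3.0, 3.0, 3.0]

baseline = baseline_step_time(compute_ms, comm_ms)
overlapped = overlapped_step_time(compute_ms, comm_ms)
exposed = overlapped - sum(compute_ms)  # communication the step still pays for
print(baseline, overlapped, exposed)
```

With these made-up inputs the model gives 28.0 ms for the baseline and 19.0 ms for the overlapped path. The remaining 3.0 ms of exposed time is the last bucket's transfer, which has no later compute to hide under.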

Run the comparison:
python compare.py

Both paths produce the same checksum, which confirms that overlap changes only the schedule and not the result. The optimized path should report a lower median step time and a higher overlap ratio.
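As a rough illustration of how such a report could be produced, the sketch below times both schedules using `time.sleep` as a stand-in for compute and communication, and a single-worker `ThreadPoolExecutor` as a stand-in for the communication channel. Every name and duration here is hypothetical, and the overlap ratio formula (time saved divided by total communication time) is an assumed definition; `compare.py` may measure it differently.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

LAYERS = 4
COMPUTE_S = 0.010  # hypothetical per-layer backward time
COMM_S = 0.008     # hypothetical per-bucket transfer time


def backward_layer(layer):
    time.sleep(COMPUTE_S)               # stand-in for backward compute
    return [layer + 0.5, layer + 1.5]   # fake gradient bucket


def sync_bucket(bucket):
    time.sleep(COMM_S)                  # stand-in for the bucket transfer
    return sum(bucket)


def baseline_step():
    # Finish all backward compute, then synchronize every bucket.
    buckets = [backward_layer(i) for i in reversed(range(LAYERS))]
    return sum(sync_bucket(b) for b in buckets)


def overlapped_step(pool):
    # Submit each bucket as soon as its layer finishes; the next layer's
    # compute runs on this thread while the worker handles the transfer.
    futures = [pool.submit(sync_bucket, backward_layer(i))
               for i in reversed(range(LAYERS))]
    return sum(f.result() for f in futures)


def median_time(fn, *args, repeats=5):
    times, checksum = [], None
    for _ in range(repeats):
        start = time.perf_counter()
        checksum = fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.median(times), checksum


with ThreadPoolExecutor(max_workers=1) as pool:
    base_t, base_sum = median_time(baseline_step)
    over_t, over_sum = median_time(overlapped_step, pool)

# Assumed definition: share of communication time hidden under compute.
total_comm = LAYERS * COMM_S
ratio = min(1.0, max(0.0, (base_t - over_t) / total_comm))

print(f"baseline   median {base_t * 1e3:.1f} ms  checksum {base_sum}")
print(f"overlapped median {over_t * 1e3:.1f} ms  checksum {over_sum}")
print(f"overlap ratio {ratio:.2f}")
```

The sleeps release the interpreter lock, so the worker thread and the main thread genuinely run concurrently in this sketch. Exact timings vary by machine, but the checksums match because both paths reduce the same buckets in the same order.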

Overlap does not make communication disappear. It moves communication under useful compute so the training step pays only for the part that cannot be hidden.