Lab: DataParallel vs DDP
Show why Chapter 4 treats PyTorch DataParallel as an anti-pattern for serious multi-GPU training.
Baseline
Section titled “Baseline”The baseline models one Python process that scatters input, runs replicas, gathers output, and reduces gradients through a primary device.
Optimized
Section titled “Optimized”The optimized path models one process per GPU with local forward/backward work and one collective gradient synchronization.
python compare.pyExpected Observation
Section titled “Expected Observation”The DDP-shaped path should spend less time in host orchestration and primary-device fan-in.