Lab: GPU Communication Reference
Provide real GPU/NCCL entrypoints for the Chapter 4 communication labs.
The other labs in this directory are CPU-portable concept models. This lab is intentionally not portable: it should be run only on a host with CUDA, NCCL, PyTorch distributed support, and at least two visible NVIDIA GPUs.
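
The requirements above can be checked before launching any script. The following is a minimal preflight sketch, not part of the lab itself: `missing_requirements` and `probe_host` are hypothetical helper names, and the only PyTorch calls assumed are `torch.cuda.is_available()`, `torch.cuda.device_count()`, `torch.distributed.is_available()`, and `torch.distributed.is_nccl_available()`.

```python
import importlib.util


def missing_requirements(facts, min_gpus=2):
    """Return human-readable reasons why the host cannot run this lab."""
    if not facts.get("torch"):
        return ["PyTorch is not installed"]
    problems = []
    if not facts.get("cuda"):
        problems.append("CUDA is not available to PyTorch")
    if not facts.get("distributed"):
        problems.append("torch.distributed is not available")
    if not facts.get("nccl"):
        problems.append("the NCCL backend is not available")
    gpus = facts.get("gpus", 0)
    if gpus < min_gpus:
        problems.append(f"need at least {min_gpus} visible GPUs, found {gpus}")
    return problems


def probe_host():
    """Collect the facts from the local PyTorch build (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": False}
    import torch
    import torch.distributed as dist

    distributed = dist.is_available()
    return {
        "torch": True,
        "cuda": torch.cuda.is_available(),
        "distributed": distributed,
        # is_nccl_available() is only queried when torch.distributed is built in.
        "nccl": distributed and dist.is_nccl_available(),
        "gpus": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for line in missing_requirements(probe_host()) or ["host looks usable for this lab"]:
        print(line)
```

An empty list only means the software stack and GPU count look right; it says nothing about interconnect health.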
This code adapts the chapter’s important GPU-facing communication ideas:
- `ddp_no_overlap.py`
- `ddp_overlap.py`
- gradient bucket and NCCL all-reduce benchmark patterns
Scripts
| Script | Purpose |
|---|---|
| `no_overlap.py` | Runs forward/backward, then manually all-reduces each parameter gradient after backward. |
| `ddp_overlap.py` | Uses PyTorch DistributedDataParallel so gradient buckets can be reduced during backward. |
| `allreduce_bucket_sweep.py` | Measures NCCL all-reduce latency and bandwidth across bucket sizes and dtypes. |
| `topology_sweep.py` | Measures peer-copy bandwidth for visible GPU pairs. |
| `dataparallel_vs_ddp.py` | Runs DataParallel in one process or DDP under torchrun. |
| `communicator_reuse.py` | Compares process-group recreation with process-group reuse. |
| `pipeline_1f1b.py` | CUDA stream sketch for fill-drain vs overlapped pipeline work. |
| `symmetric_memory_probe.py` | Checks whether the local PyTorch build exposes symmetric-memory support. |
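
To make the bucket sweep concrete, here is a small sketch of how a sweep might step through message sizes and turn a measured latency into bandwidth. It is not the code in `allreduce_bucket_sweep.py`: `sweep_sizes` and `allreduce_bandwidth` are hypothetical names, the doubling schedule and the KiB/MiB interpretation of the size bounds are assumptions, and the bus-bandwidth factor `2 * (n - 1) / n` follows the convention used by the nccl-tests benchmarks for all-reduce.

```python
def sweep_sizes(min_kb=16, max_mb=512):
    """Yield message sizes in bytes, doubling from min_kb KiB up to max_mb MiB.

    Assumption: sizes double at each step; the real script may step differently.
    """
    size = min_kb * 1024
    limit = max_mb * 1024 * 1024
    while size <= limit:
        yield size
        size *= 2


def allreduce_bandwidth(nbytes, seconds, world_size):
    """Return (algbw, busbw) in GB/s for one all-reduce of nbytes bytes.

    algbw is payload bytes divided by elapsed time. busbw rescales it by
    2 * (n - 1) / n so that results are comparable across world sizes.
    """
    algbw = nbytes / seconds / 1e9
    busbw = algbw * 2 * (world_size - 1) / world_size
    return algbw, busbw


if __name__ == "__main__":
    # The latency here is a made-up placeholder, not a measurement.
    placeholder_seconds = 1e-3
    for nbytes in sweep_sizes(min_kb=16, max_mb=1):
        algbw, busbw = allreduce_bandwidth(nbytes, placeholder_seconds, world_size=2)
        print(f"{nbytes:>9d} B  algbw={algbw:.4f} GB/s  busbw={busbw:.4f} GB/s")
```

In a real measurement the latency would come from timing the collective on the GPU after warmup iterations, not from a constant.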

    torchrun --nproc_per_node=2 no_overlap.py
    torchrun --nproc_per_node=2 ddp_overlap.py
    torchrun --nproc_per_node=2 allreduce_bucket_sweep.py
    python topology_sweep.py
    python dataparallel_vs_ddp.py
    torchrun --nproc_per_node=2 dataparallel_vs_ddp.py
    torchrun --nproc_per_node=2 communicator_reuse.py
    python pipeline_1f1b.py
    python symmetric_memory_probe.py

Optional knobs:

    torchrun --nproc_per_node=4 ddp_overlap.py --iterations 50 --warmup 10 --bucket-cap-mb 25
    torchrun --nproc_per_node=4 allreduce_bucket_sweep.py --min-kb 16 --max-mb 512 --dtype fp16

Expected Observation
- `no_overlap.py` exposes all gradient communication after backward.
- `ddp_overlap.py` should reduce step time when communication can overlap with backward compute.
- `allreduce_bucket_sweep.py` should show poor efficiency for very small buckets and better bandwidth as message size grows.
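
These expectations can be reasoned about with a simple cost model. The sketch below is an illustration under stated assumptions, not a description of how the scripts measure anything: it assumes an all-reduce costs a fixed latency plus bytes divided by peak bandwidth, and that some fraction of communication can be hidden behind backward compute. All function names and numbers are hypothetical.

```python
def allreduce_seconds(nbytes, latency_s, peak_bytes_per_s):
    """Latency-plus-bandwidth cost model for one all-reduce (an assumption)."""
    return latency_s + nbytes / peak_bytes_per_s


def effective_bandwidth(nbytes, latency_s, peak_bytes_per_s):
    """Bytes per second actually achieved once fixed latency is included."""
    return nbytes / allreduce_seconds(nbytes, latency_s, peak_bytes_per_s)


def step_seconds(forward_s, backward_s, comm_s, overlap_fraction):
    """Step time when overlap_fraction of communication hides behind backward.

    overlap_fraction=0.0 models all communication exposed after backward;
    larger values model bucketed reduction running during backward.
    Hidden time can never exceed the backward time itself.
    """
    hidden = min(comm_s * overlap_fraction, backward_s)
    return forward_s + backward_s + comm_s - hidden


if __name__ == "__main__":
    latency_s, peak = 20e-6, 100e9  # placeholder values, not measurements
    for nbytes in (16 * 1024, 1024 * 1024, 64 * 1024 * 1024):
        share = effective_bandwidth(nbytes, latency_s, peak) / peak
        print(f"{nbytes:>9d} B reaches {share:.1%} of peak in this model")
    print("exposed:", step_seconds(1.0, 2.0, 1.0, 0.0))
    print("overlapped:", step_seconds(1.0, 2.0, 1.0, 1.0))
```

In this model small messages are dominated by the fixed latency term, which is why efficiency rises with bucket size, and overlap helps only up to the amount of backward compute available to hide behind.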
Validation Notes
Use these with Chapter 4 operational checks:

    NCCL_DEBUG=INFO torchrun --nproc_per_node=2 ddp_overlap.py
    nvidia-smi topo -m

For real profiling, capture an Nsight Systems trace and check whether NCCL kernels overlap with backward kernels. A successful run alone does not prove the fabric path is healthy.
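
Once kernel start and end times have been pulled out of a trace, the overlap check reduces to interval arithmetic. The sketch below assumes the intervals have already been extracted by hand or by a separate tool into `(start, end)` pairs in seconds; it does not read any Nsight Systems file format, `overlap_seconds` is a hypothetical helper, and the sample intervals are invented for illustration.

```python
def overlap_seconds(comm_intervals, compute_intervals):
    """Total time during which a communication kernel and a compute kernel
    are both active.

    Assumption: intervals within each list do not overlap one another, so
    pairwise intersections can simply be summed.
    """
    total = 0.0
    for comm_start, comm_end in comm_intervals:
        for comp_start, comp_end in compute_intervals:
            shared = min(comm_end, comp_end) - max(comm_start, comp_start)
            total += max(0.0, shared)
    return total


if __name__ == "__main__":
    # Invented sample data: one NCCL kernel and two backward kernels.
    nccl = [(2.0, 5.0)]
    backward = [(0.0, 3.0), (4.0, 6.0)]
    hidden = overlap_seconds(nccl, backward)
    comm_total = sum(end - start for start, end in nccl)
    print(f"hidden communication: {hidden} s of {comm_total} s")
```

A hidden share near zero suggests communication is fully exposed, as expected for `no_overlap.py`; a larger share is consistent with bucketed reduction running during backward.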