
Lab: Gradient Bucketing, Fusion, and Compression

Show why Chapter 4 treats gradient synchronization as both a bandwidth problem and a launch-latency problem.

This lab adapts the core ideas behind gradient fusion and gradient compression benchmarks: send fewer, larger messages, and send fewer bytes per gradient value.

The baseline sends many small FP32 gradient buckets. This preserves precision, but every bucket pays a fixed communication launch cost.
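The fixed per-bucket cost can be made concrete with a simple latency-plus-bandwidth model. The sketch below is only an illustration: the function name `modelled_comm_time`, the bucket count, the bucket size, the launch latency, and the link bandwidth are all assumed values, not numbers taken from `compare.py`.

```python
def modelled_comm_time(num_launches, total_bytes, launch_latency_s, bandwidth_bytes_per_s):
    """Modelled time = fixed cost per launch + time to move the bytes."""
    return num_launches * launch_latency_s + total_bytes / bandwidth_bytes_per_s


# Hypothetical workload and link parameters (assumptions for illustration).
NUM_BUCKETS = 64
ELEMS_PER_BUCKET = 4096
FP32_BYTES = 4
LAUNCH_LATENCY_S = 20e-6      # assumed fixed cost per communication launch
BANDWIDTH_BYTES_PER_S = 10e9  # assumed link bandwidth

# Baseline: one launch per bucket, every element sent as FP32.
baseline_bytes = NUM_BUCKETS * ELEMS_PER_BUCKET * FP32_BYTES
baseline_time = modelled_comm_time(
    NUM_BUCKETS, baseline_bytes, LAUNCH_LATENCY_S, BANDWIDTH_BYTES_PER_S
)

# Fraction of the modelled time spent on launch overhead rather than on bytes.
latency_share = NUM_BUCKETS * LAUNCH_LATENCY_S / baseline_time

print(f"baseline bytes:  {baseline_bytes}")
print(f"baseline time:   {baseline_time * 1e3:.3f} ms")
print(f"latency share:   {latency_share:.1%}")
```

With these assumed numbers the launch term dominates the modelled time, which is the situation where reducing the number of launches matters more than reducing bytes.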

The optimized path fuses the buckets into one contiguous buffer and transfers it as FP16. This reduces both the launch count and the number of bytes communicated. Because FP16 rounding changes the values slightly, the optimized checksum is compared with the baseline checksum within a tolerance instead of for exact equality.
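A minimal sketch of the two paths, using only the Python standard library, might look like the following. It does not reproduce `compare.py`: the helper names, the synthetic gradient data, and the tolerance rule are assumptions chosen for illustration. The "transfer" is simulated by packing values to bytes with `struct` (`f` for FP32, `e` for FP16) and unpacking them again.

```python
import math
import struct


def make_buckets(num_buckets, elems_per_bucket):
    """Build deterministic synthetic gradients in [-1, 1] (illustrative data only)."""
    return [
        [math.sin(0.37 * (b * elems_per_bucket + i)) for i in range(elems_per_bucket)]
        for b in range(num_buckets)
    ]


def send_baseline_fp32(buckets):
    """One simulated launch per bucket; each bucket is serialized as FP32."""
    payloads = [struct.pack(f"<{len(b)}f", *b) for b in buckets]
    received = []
    for p in payloads:
        received.extend(struct.unpack(f"<{len(p) // 4}f", p))
    return received, len(payloads), sum(len(p) for p in payloads)


def send_fused_fp16(buckets):
    """Fuse all buckets into one buffer and serialize it once as FP16."""
    fused = [v for b in buckets for v in b]
    payload = struct.pack(f"<{len(fused)}e", *fused)
    received = list(struct.unpack(f"<{len(fused)}e", payload))
    return received, 1, len(payload)


buckets = make_buckets(num_buckets=8, elems_per_bucket=256)
base_vals, base_launches, base_bytes = send_baseline_fp32(buckets)
fused_vals, fused_launches, fused_bytes = send_fused_fp16(buckets)

base_sum = math.fsum(base_vals)
fused_sum = math.fsum(fused_vals)

# Loose worst-case bound (an assumption, not the lab's rule): for |v| <= 1 each
# FP16 rounding error is at most 2**-12, so the sums differ by less than n * 2**-11.
tolerance = len(base_vals) * 2.0 ** -11

print(f"launches: {base_launches} -> {fused_launches}")
print(f"bytes:    {base_bytes} -> {fused_bytes}")
print(f"checksums close: {math.isclose(base_sum, fused_sum, abs_tol=tolerance)}")
```

The tolerance here is deliberately conservative. A real benchmark would more likely pick a tolerance from what the training run can absorb rather than from a worst-case rounding bound.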

Run the comparison from a terminal:
python compare.py

The optimized path should have fewer launches, fewer transferred bytes, lower modelled communication time, and a close output checksum.

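Gradient fusion and compression are useful when launch latency or network bandwidth is exposed, that is, when communication time is not hidden behind computation. They are not free: the precision error introduced by FP16 must be measured against the tolerance the training run can accept.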
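One way to measure that precision error is to round-trip each gradient value through FP16 and record the worst relative error. The sketch below is an assumption-laden illustration: `fp16_roundtrip`, `max_relative_error`, the sample values, and the `TRAINING_TOLERANCE` threshold are all hypothetical. It also shows a failure case: a gradient far below the FP16 representable range rounds to zero, so its relative error is 100%.

```python
import struct


def fp16_roundtrip(x):
    """Round a Python float to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(values):
    """Worst relative error introduced by FP16 rounding, ignoring exact zeros."""
    worst = 0.0
    for v in values:
        if v == 0.0:
            continue
        worst = max(worst, abs(fp16_roundtrip(v) - v) / abs(v))
    return worst


TRAINING_TOLERANCE = 1e-3  # hypothetical acceptable relative error

# Gradients inside the FP16 normal range: error is bounded by 2**-11.
typical = [0.75, -0.031, 0.0042, 1.5e-3]
# The same gradients plus one tiny value that underflows to zero in FP16.
with_tiny = typical + [1e-8]

err_typical = max_relative_error(typical)
err_with_tiny = max_relative_error(with_tiny)

print(f"typical:   {err_typical:.2e}  ok={err_typical <= TRAINING_TOLERANCE}")
print(f"with tiny: {err_with_tiny:.2e}  ok={err_with_tiny <= TRAINING_TOLERANCE}")
```

A check like this catches the underflow case that an aggregate checksum can hide, since a lost tiny gradient barely moves the sum.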