Appendix: BGP-based Underlay and GLB NNHN in AI Data Center Fabrics
Supplement to Chapter 6’s GLB section, with additional BGP underlay context. This appendix covers (1) why eBGP is the de facto underlay routing protocol for modern AI/ML data center fabrics, and (2) how Juniper’s Global Load Balancing (GLB) extends BGP with the Next-to-Next-Hop-Node (NNHN) capability to mitigate elephant-flow congestion in 400G/800G Clos fabrics.
1. Background: Why BGP/eBGP for Data Center Underlays
1.1 Departure from IGP-based fabrics
Traditional enterprise data centers used IGPs (OSPF, IS-IS) as the underlay routing protocol. As hyperscale operators scaled out, this model broke down on several axes: flooding domain size, SPF recomputation cost, policy expressiveness, and operational complexity.
RFC 7938 (Use of BGP for Routing in Large-Scale Data Centers) codified the alternative: eBGP as a single-protocol underlay over a Clos (leaf-spine) topology.
1.2 Design pattern
| Aspect | Pattern |
|---|---|
| Topology | 3-stage or 5-stage Clos (leaf-spine, leaf-spine-superspine) |
| ASN allocation | Private 4-byte ASNs; one ASN per leaf, one per spine (or per spine tier) |
| Peering | eBGP point-to-point on routed /31 links; no L2 below the leaf |
| Loop prevention | Native via AS_PATH; no STP, no MLAG, no large L2 domains |
| Multipathing | `multipath multiple-as` enabled; all leaf→spine uplinks active concurrently |
| Failure domain | Single link/node failure → one or two BGP UPDATE messages; no SPF cascade |
| Policy | Communities, AS_PATH manipulation, prefix-lists — significantly more expressive than IGP |
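
The allocation rules in the table can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `build_underlay`, the device names, the base ASN 4200000000 (the start of the RFC 6996 private 4-byte range), and the 10.0.0.0/24 link block are assumptions chosen for the example, not values from any real deployment.

```python
import ipaddress
from itertools import product


def build_underlay(num_spines, num_leaves,
                   base_asn=4_200_000_000, link_block="10.0.0.0/24"):
    """Plan a 3-stage Clos underlay: one private 4-byte ASN per device,
    one routed /31 and one eBGP session per leaf-spine link."""
    spines = [f"spine{i}" for i in range(1, num_spines + 1)]
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    # Hypothetical numbering: consecutive ASNs starting at base_asn.
    asns = {name: base_asn + i for i, name in enumerate(spines + leaves)}

    # Carve the link block into /31s; next() raises StopIteration
    # if the block is too small for the requested topology.
    subnets = ipaddress.ip_network(link_block).subnets(new_prefix=31)
    sessions = []
    for leaf, spine in product(leaves, spines):
        net = next(subnets)
        sessions.append({
            "subnet": str(net),
            "spine": spine, "spine_asn": asns[spine], "spine_ip": str(net[0]),
            "leaf": leaf, "leaf_asn": asns[leaf], "leaf_ip": str(net[1]),
        })
    return asns, sessions


asns, sessions = build_underlay(num_spines=2, num_leaves=4)
for s in sessions[:2]:
    print(s["subnet"], s["leaf"], s["leaf_asn"], "<->", s["spine"], s["spine_asn"])
```

Every leaf-spine pair gets its own /31 and its own eBGP session between distinct ASNs, which is what lets AS_PATH provide loop prevention without any additional protocol.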
1.3 Relevance to AI/ML fabrics
AI training fabrics depend on this pattern for one principal reason: ECMP across many parallel paths is mandatory, not optional. A GPU cluster running collective operations (AllReduce, AllGather) generates flows that must be spread across every available leaf→spine uplink to achieve target Job Completion Time (JCT) and tail latency. A routing protocol that cannot cleanly express N-way ECMP at scale is disqualified.
eBGP provides this natively, and its operational characteristics (small failure domain, declarative policy, no flooding) suit the high-radix, high-bandwidth (400G/800G) topology of modern AI fabrics.
2. The Limitation: Elephant Flows and Downstream Congestion
2.1 Workload characteristics
AI/ML training traffic differs from conventional data center workloads in two critical ways:
- Elephant flows. A single fat flow may live for the entire duration of a training step, continuously synchronizing tensors between GPU-enabled servers in `n-to-1` or `1-to-n` communication patterns.
- Low flow entropy. With a small number of GPUs and a small number of long-lived flows, the 5-tuple hash space is sparse. Standard ECMP hashing degenerates into pathological load distributions: some uplinks saturated, others idle.
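
The low-entropy effect can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: SHA-256 stands in for the ASIC hash function, the addresses and ports are invented, and the flows are modeled as RoCEv2 (UDP destination port 4791). Under uniform, independent hashing, the expected number of uplinks that receive no flow at all is `n * (1 - 1/n) ** k` for `k` flows over `n` uplinks.

```python
import hashlib
from collections import Counter


def ecmp_pick(flow, num_uplinks):
    """Map a 5-tuple to an uplink index with a stable hash.
    SHA-256 is only a stand-in for the hardware hash function."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def uplink_load(flows, num_uplinks):
    """Count how many flows land on each uplink."""
    counts = Counter(ecmp_pick(f, num_uplinks) for f in flows)
    return [counts.get(i, 0) for i in range(num_uplinks)]


def expected_idle_uplinks(num_flows, num_uplinks):
    """Expected number of uplinks with zero flows under uniform hashing."""
    return num_uplinks * (1 - 1 / num_uplinks) ** num_flows


# Eight long-lived flows (src, dst, protocol, sport, dport); values are invented.
flows = [(f"10.1.0.{i}", f"10.2.0.{i}", 17, 49152 + i, 4791) for i in range(8)]
load = uplink_load(flows, num_uplinks=8)
print("flows per uplink:", load)
print("idle uplinks in this sample:", load.count(0))
print("expected idle uplinks:", round(expected_idle_uplinks(8, 8), 2))
```

With eight flows over eight uplinks the expectation is about 2.75 idle uplinks, so a perfectly even spread is the exception rather than the rule. Because each flow is long-lived and heavy, every collision translates directly into a saturated link.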
2.2 Why classical DLB is insufficient
Dynamic Load Balancing (DLB) addresses local link conditions: the ingress leaf chooses an uplink based on queue depth and bandwidth utilization of its own outgoing interfaces.
The failure mode this misses is in-cast congestion at the spine:
- Traffic from multiple ingress leaves converges on a spine.
- The spine’s egress link toward the destination leaf saturates.
- ECN marking (as used by DCQCN) and DSCP-based PFC eventually signal the congestion back, but by then the damage to JCT and tail latency is done.
- Critically, the ingress leaf had no way to know in advance that this particular spine’s downstream link was congested.
The ingress leaf needs visibility into the end-to-end path quality, not just its local link quality.
3. GLB Architecture: BGP NNHN + ASIC Heartbeats
Juniper’s Global Load Balancing (GLB), introduced in Junos OS Evolved 23.4R2 on Tomahawk 5-based QFX5240 platforms, extends DLB to consider downstream path quality. GLB is structured as two cooperating planes.
3.1 Control plane: BGP NNHN capability
GLB introduces a new BGP capability, the Next-to-Next-Hop-Node (NNHN) capability, carried inside the Next-Hop Dependent Capabilities attribute of a BGP UPDATE. It is specified in the Internet-Draft draft-wang-idr-next-next-hop-nodes.
The NNHN TLV carries:
| Field | Description |
|---|---|
| Next-hop BGP ID | 32-bit BGP identifier of the next-hop-node attaching this capability |
| Next-next-hop BGP IDs | One or more 32-bit identifiers, each representing an NNH used by the NH for ECMP forwarding for the advertised NLRI |
In plain terms: when a spine advertises a prefix to a leaf, it also signals which downstream leaves it would ECMP that prefix to. The ingress leaf thus learns the two-hop topology from BGP control plane signaling alone.
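
As a rough illustration of how a leaf could turn these two fields into a two-hop view, consider the Python sketch below. It models only the information content of the capability, not the wire encoding defined in the draft; the class `NnhnInfo`, the helper names, the prefix, and the BGP identifiers are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


def bgp_id(dotted):
    """Convert a dotted-quad BGP identifier to its 32-bit integer form."""
    return int(IPv4Address(dotted))


@dataclass(frozen=True)
class NnhnInfo:
    """Information content of one NNHN capability (not its wire format)."""
    next_hop_bgp_id: int          # 32-bit ID of the advertising next hop
    next_next_hop_bgp_ids: tuple  # 32-bit IDs of the NNHs it uses for ECMP


def two_hop_view(received):
    """Build {prefix: {next-hop ID: set of NNH IDs}} from (prefix, NnhnInfo) pairs."""
    view = {}
    for prefix, info in received:
        nnhs = view.setdefault(prefix, {}).setdefault(info.next_hop_bgp_id, set())
        nnhs.update(info.next_next_hop_bgp_ids)
    return view


# Two spines advertise the same prefix; both reach it through the same egress leaf.
received = [
    ("10.2.0.0/24", NnhnInfo(bgp_id("192.0.2.1"), (bgp_id("192.0.2.14"),))),
    ("10.2.0.0/24", NnhnInfo(bgp_id("192.0.2.2"), (bgp_id("192.0.2.14"),))),
]
view = two_hop_view(received)
for nh, nnhs in view["10.2.0.0/24"].items():
    print(IPv4Address(nh), "->", sorted(str(IPv4Address(n)) for n in nnhs))
```

The resulting mapping is the piece of state that lets the leaf associate each uplink with the downstream node whose link quality matters for a given prefix.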
3.2 Data plane: GLB heartbeats
The second component is a per-hop liveness and quality signal generated by the Packet Forwarding Engine (PFE):
| Property | Value |
|---|---|
| Encapsulation | UDP |
| Destination | Multicast group 224.0.0.149 (default; configurable) |
| Scope | Sent only on interfaces with active BGP underlay peering between leaf and spine |
| Payload | Link quality metric, carried in the data portion of the packet |
Restricting heartbeats to BGP-peered fabric links is deliberate: GLB is enabled only on the leaf-spine topology and ignores quality information from links that are not part of the fabric (e.g., management or storage networks).
3.3 Correlation at the PFE
The ingress leaf’s PFE correlates the two streams:
- BGP NNHN advertisements tell it which NNH a given uplink leads to.
- Heartbeats from the spine PFE tell it the current quality of the spine→NNH link.
The forwarding decision for an elephant flow then becomes:

    score(uplink_i) = local_link_quality(uplink_i) + downstream_quality(NNH_reached_via_uplink_i)

This allows the ingress leaf to proactively avoid a congested spine-to-egress-leaf link by selecting a different spine, before any ECN or PFC feedback is generated.
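
The difference between a local-only decision and the combined score can be shown with a minimal Python sketch. The quality scale (higher is better, 0 to 7), the device names, and both function names are assumptions for illustration; the actual metric encoding and selection logic in the PFE are not described here.

```python
def dlb_choice(local_quality):
    """Local-only selection: pick the uplink with the best local quality."""
    return max(local_quality, key=local_quality.get)


def glb_choice(local_quality, nnh_via, downstream_quality):
    """Combined selection: local quality plus quality of the spine->NNH link."""
    def score(uplink):
        return local_quality[uplink] + downstream_quality[(uplink, nnh_via[uplink])]
    return max(local_quality, key=score)


# Hypothetical state on an ingress leaf for traffic destined behind leaf4.
local_quality = {"spine1": 7, "spine2": 6}          # leaf -> spine links
nnh_via = {"spine1": "leaf4", "spine2": "leaf4"}    # learned from BGP NNHN
downstream_quality = {                              # learned from heartbeats
    ("spine1", "leaf4"): 1,   # spine1 -> leaf4 is congested
    ("spine2", "leaf4"): 7,
}

print(dlb_choice(local_quality))                               # spine1
print(glb_choice(local_quality, nnh_via, downstream_quality))  # spine2
```

The local-only choice prefers `spine1` because its uplink looks marginally better, while the combined score (8 versus 13) steers the flow to `spine2` and away from the congested downstream link.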
4. Operational Roles in a 3-stage Clos
Juniper exposes GLB as two configuration modes, one per tier:
4.1 Spine — helper-only mode

    set protocols bgp global-load-balancing helper-only
    set forwarding-options enhanced-hash-key ecmp-dlb <flowlet | per-packet>

In this mode:
- BGP emits the NNHN capability on routes it advertises.
- The GLB application monitors quality of all local links with active eBGP sessions.
- Quality information is sent to directly connected neighbors via heartbeats.
4.2 Leaf — load-balancer-only mode

    set protocols bgp global-load-balancing load-balancer-only
    set forwarding-options enhanced-hash-key ecmp-dlb <flowlet | per-packet>

In this mode:
- BGP does not emit the NNHN capability.
- The leaf receives link quality from neighboring spines.
- The combined NH + NNH quality drives ECMP forwarding decisions.
4.3 Selective disable
GLB can be selectively disabled per interface or per peer group, allowing mixed-mode deployments during phased rollouts.
5. Verification
5.1 Confirming the GLB profile on a leaf
A `show route` lookup on a leaf for a prefix originated by a remote leaf should display, in addition to the standard next-hop attribute, the NNHN capability with the list of next-next-hop-node BGP IDs. This confirms the control-plane half of GLB is operational.
5.2 Capturing a heartbeat
Heartbeats can be observed via port-mirroring on a leaf-facing interface. The expected capture shows:
- Outer UDP, destination IP `224.0.0.149` (default multicast group).
- Inner payload containing the per-link quality metric.
- Source: the spine’s PFE on a link with active BGP underlay peering.
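
When sifting through a mirrored capture, a simple filter on the outer header is often enough to isolate candidate heartbeats. The Python sketch below checks only the two properties listed above (UDP, default destination group) on a raw IPv4 packet. The function names are hypothetical, no UDP port is checked, and the payload is not decoded because its layout is not covered here.

```python
import socket
import struct

GLB_HEARTBEAT_GROUP = "224.0.0.149"  # default group; configurable on the device
IPPROTO_UDP = 17


def looks_like_glb_heartbeat(ipv4_packet, group=GLB_HEARTBEAT_GROUP):
    """Return True if a raw IPv4 packet is UDP and addressed to the heartbeat group."""
    if len(ipv4_packet) < 20 or ipv4_packet[0] >> 4 != 4:
        return False  # too short for an IPv4 header, or not IPv4
    protocol = ipv4_packet[9]                    # IPv4 protocol field
    dst = socket.inet_ntoa(ipv4_packet[16:20])   # IPv4 destination address
    return protocol == IPPROTO_UDP and dst == group


def fake_ipv4_header(src, dst, protocol):
    """Build a minimal 20-byte IPv4 header for testing (checksum left at zero)."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 1, protocol, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))


print(looks_like_glb_heartbeat(fake_ipv4_header("192.0.2.1", "224.0.0.149", 17)))  # True
print(looks_like_glb_heartbeat(fake_ipv4_header("192.0.2.1", "224.0.0.5", 17)))    # False
```

If the multicast group has been changed from its default, pass the configured address as `group`.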
6. Comparison with Vendor-equivalent Mechanisms
The general pattern (BGP signals fabric topology; the ASIC measures and acts on path quality) is not unique to Juniper. Equivalent or adjacent technologies exist across vendors:
| Vendor | Technology | Control plane | Data plane |
|---|---|---|---|
| Juniper | GLB | BGP NNHN | Heartbeat (UDP multicast) |
| Broadcom (silicon) | DLB / CLB | — (vendor-overlay) | Local + telemetry-driven |
| NVIDIA | Spectrum-X adaptive routing | — | Per-packet adaptive routing, with out-of-order packets reordered at the receiving SuperNIC |
| Cisco | Silicon-level load balancing (SiLB / DLB) | — | Local congestion-aware |
GLB is distinguished by its use of a BGP extension published as an IETF Internet-Draft for the control plane, which makes its topology-discovery mechanism interoperable in principle across vendors that adopt the NNHN draft.
7. Summary
- eBGP as DC underlay is the baseline pattern (RFC 7938) and is a hard prerequisite for AI/ML fabrics because of its native, scalable, declarative support for ECMP.
- Elephant flows + downstream congestion expose a structural limitation of DLB: the ingress leaf cannot react to congestion that occurs beyond its own uplinks.
- GLB closes this gap by combining a new BGP signaling capability (NNHN) with a per-hop quality heartbeat at the ASIC level, allowing forwarding decisions to incorporate end-to-end path quality.
- Role separation (`helper-only` on spines, `load-balancer-only` on leaves) keeps the protocol semantics clean and the deployment incremental.
References
- RFC 7938, Use of BGP for Routing in Large-Scale Data Centers
- draft-wang-idr-next-next-hop-nodes-00, BGP Next-to-Next-Hop-Nodes Capability
- Juniper Networks, Global Load Balancing (GLB), Junos OS Evolved documentation
- M. Styszynski, Avoiding AI/ML traffic congestion with global load balancing, Juniper Blogs, Oct 2024