Appendix: BGP-based Underlay and GLB NNHN in AI Data Center Fabrics

Supplement to Chapter 6’s GLB section, with additional BGP underlay context. This appendix covers (1) why eBGP is the de-facto underlay routing protocol for modern AI/ML data center fabrics, and (2) how Juniper’s Global Load Balancing (GLB) extends BGP with the Next-to-Next-Hop-Node (NNHN) capability to mitigate elephant-flow congestion in 400G/800G Clos fabrics.

1. Background: Why BGP/eBGP for Data Center Underlays

1.1 Departure from IGP-based fabrics

Traditional enterprise data centers used IGPs (OSPF, IS-IS) as the underlay routing protocol. As hyperscale operators scaled out, this model broke down on several axes — flooding domain size, SPF recomputation cost, policy expressiveness, and operational complexity.

RFC 7938 (Use of BGP for Routing in Large-Scale Data Centers) codified the alternative: eBGP as a single-protocol underlay over a Clos (leaf-spine) topology.

1.2 Design pattern

Aspect	Pattern
Topology	3-stage or 5-stage Clos (leaf-spine, leaf-spine-superspine)
ASN allocation	Private 4-byte ASNs; one ASN per leaf, one per spine (or per spine tier)
Peering	eBGP point-to-point on routed `/31` links; no L2 below the leaf
Loop prevention	Native via AS_PATH; no STP, no MLAG, no large L2 domains
Multipathing	`multipath multiple-as` enabled; all leaf→spine uplinks active concurrently
Failure domain	Single link/node failure → one or two BGP UPDATE messages; no SPF cascade
Policy	Communities, AS_PATH manipulation, prefix-lists — significantly more expressive than IGP

1.3 Relevance to AI/ML fabrics

AI training fabrics depend on this pattern for one principal reason: ECMP across many parallel paths is mandatory, not optional. A GPU cluster running collective operations (AllReduce, AllGather) generates flows that must be spread across every available leaf→spine uplink to achieve target Job Completion Time (JCT) and tail latency. A routing protocol that cannot cleanly express N-way ECMP at scale is disqualified.

eBGP gives you this for free, with operational characteristics — small failure domain, declarative policy, no flooding — that align with the high-radix, high-bandwidth (400G/800G) topology of modern AI fabrics.

2. The Limitation: Elephant Flows and Downstream Congestion

2.1 Workload characteristics

AI/ML training traffic differs from conventional data center workloads in two critical ways:

Elephant flows. A single fat flow may live for the entire duration of a training step, continuously synchronizing tensors between GPU-enabled servers in n-to-1 or 1-to-n communication patterns.
Low flow entropy. With a small number of GPUs and a small number of long-lived flows, the 5-tuple hash space is sparse. Standard ECMP hashing degenerates into pathological load distributions — some uplinks saturated, others idle.

2.2 Why classical DLB is insufficient

Dynamic Load Balancing (DLB) addresses local link conditions: the ingress leaf chooses an uplink based on queue depth and bandwidth utilization of its own outgoing interfaces.

The failure mode this misses is in-cast congestion at the spine:

Traffic from multiple ingress leaves converges on a spine.
The spine’s egress link toward the destination leaf saturates.
DCQCN ECN and PFC-DSCP eventually signal the congestion back, but by then the damage to JCT and tail latency is done.
Critically, the ingress leaf had no way to know in advance that this particular spine’s downstream link was congested.

The ingress leaf needs visibility into the end-to-end path quality, not just its local link quality.

3. GLB Architecture: BGP NNHN + ASIC Heartbeats

Juniper’s Global Load Balancing (GLB), introduced in Junos OS Evolved 23.4R2 on Tomahawk 5-based QFX5240 platforms, extends DLB to consider downstream path quality. GLB is structured as two cooperating planes.

3.1 Control plane: BGP NNHN capability

GLB introduces a new BGP attribute, Next-to-Next-Hop-Node (NNHN) capability, carried inside the Next-Hop Dependent Capabilities attribute of a BGP UPDATE. It is standardized via draft-wang-idr-next-next-hop-nodes.

The NNHN TLV carries:

Field	Description
Next-hop BGP ID	32-bit BGP identifier of the next-hop-node attaching this capability
Next-next-hop BGP IDs	One or more 32-bit identifiers, each representing an NNH used by the NH for ECMP forwarding for the advertised NLRI

In plain terms: when a spine advertises a prefix to a leaf, it also signals which downstream leaves it would ECMP that prefix to. The ingress leaf thus learns the two-hop topology from BGP control plane signaling alone.

3.2 Data plane: GLB heartbeats

The second component is a per-hop liveness and quality signal generated by the Packet Forwarding Engine (PFE):

Property	Value
Encapsulation	UDP
Destination	Multicast group `224.0.0.149` (default; configurable)
Scope	Sent only on interfaces with active BGP underlay peering between leaf and spine
Payload	Link quality metric, carried in the data portion of the packet

Restricting heartbeats to BGP-peered fabric links is deliberate: GLB is enabled only on the leaf-spine topology and ignores quality information from links that are not part of the fabric (e.g., management or storage networks).

3.3 Correlation at the PFE

The ingress leaf’s PFE correlates the two streams:

BGP NNHN advertisements tell it which NNH a given uplink leads to.
Heartbeats from the spine PFE tell it the current quality of the spine→NNH link.

The forwarding decision for an elephant flow then becomes:

score(uplink_i) = local_link_quality(uplink_i)
                + downstream_quality(NNH_reached_via_uplink_i)

This allows the ingress leaf to proactively avoid a congested spine-to-egress-leaf link by selecting a different spine, before any ECN or PFC feedback is generated.

4. Operational Roles in a 3-stage Clos

Juniper exposes GLB as two configuration modes, one per tier:

4.1 Spine — `helper-only` mode

set protocols bgp global-load-balancing helper-only
set forwarding-options enhanced-hash-key ecmp-dlb <flowlet | per-packet>

In this mode:

BGP emits the NNHN capability on routes it advertises.
The GLB application monitors quality of all local links with active eBGP sessions.
Quality information is flooded to direct neighbors via heartbeats.

4.2 Leaf — `load-balancer-only` mode

set protocols bgp global-load-balancing load-balancer-only
set forwarding-options enhanced-hash-key ecmp-dlb <flowlet | per-packet>

In this mode:

BGP does not emit the NNHN capability.
The leaf receives link quality from neighboring spines.
The combined NH + NNH quality drives ECMP forwarding decisions.

4.3 Selective disable

GLB can be selectively disabled per interface or per peer group, allowing mixed-mode deployments during phased rollouts.

5. Verification

5.1 Confirming the GLB profile on a leaf

A show route lookup on a leaf for a prefix originated by a remote leaf should display, in addition to the standard next-hop attribute, the NNHN capability with the list of next-next-hop-node BGP IDs. This confirms the control-plane half of GLB is operational.

5.2 Capturing a heartbeat

Heartbeats can be observed via port-mirroring on a leaf-facing interface. The expected capture shows:

Outer UDP, destination IP 224.0.0.149 (default multicast group).
Inner payload containing the per-link quality metric.
Source: the spine’s PFE on a link with active BGP underlay peering.

6. Comparison with Vendor-equivalent Mechanisms

The general pattern — BGP signals fabric topology; ASIC measures and acts on path quality — is not unique to Juniper. Equivalent or adjacent technologies exist across vendors:

Vendor	Technology	Control plane	Data plane
Juniper	GLB	BGP NNHN	Heartbeat (UDP multicast)
Broadcom (silicon)	DLB / CLB	— (vendor-overlay)	Local + telemetry-driven
NVIDIA	Spectrum-X adaptive routing	—	Per-packet adaptive on SHARP
Cisco	Silicon-level load balancing (SiLB / DLB)	—	Local congestion-aware

GLB is distinguished by the use of a standards-track BGP extension for the control plane, which makes its topology-discovery mechanism interoperable in principle across vendors that adopt the NNHN draft.

7. Summary

eBGP as DC underlay is the baseline pattern (RFC 7938) and is a hard prerequisite for AI/ML fabrics because of its native, scalable, declarative support for ECMP.
Elephant flows + downstream congestion expose a structural limitation of DLB: the ingress leaf cannot react to congestion that occurs beyond its own uplinks.
GLB closes this gap by combining a new BGP signaling capability (NNHN) with a per-hop quality heartbeat at the ASIC level, allowing forwarding decisions to incorporate end-to-end path quality.
Role separation (helper-only on spines, load-balancer-only on leaves) keeps the protocol semantics clean and the deployment incremental.

References

RFC 7938, Use of BGP for Routing in Large-Scale Data Centers
draft-wang-idr-next-next-hop-nodes-00, BGP Next-to-Next-Hop-Nodes Capability
Juniper Networks, Global Load Balancing (GLB), Junos OS Evolved documentation
M. Styszynski, Avoiding AI/ML traffic congestion with global load balancing, Juniper Blogs, Oct 2024