Skip to content

Appendix: BGP-based Underlay and GLB NNHN in AI Data Center Fabrics

Supplement to Chapter 6’s GLB section, with additional BGP underlay context. This appendix covers (1) why eBGP is the de-facto underlay routing protocol for modern AI/ML data center fabrics, and (2) how Juniper’s Global Load Balancing (GLB) extends BGP with the Next-to-Next-Hop-Node (NNHN) capability to mitigate elephant-flow congestion in 400G/800G Clos fabrics.


1. Background: Why BGP/eBGP for Data Center Underlays

Section titled “1. Background: Why BGP/eBGP for Data Center Underlays”

Traditional enterprise data centers used IGPs (OSPF, IS-IS) as the underlay routing protocol. As hyperscale operators scaled out, this model broke down on several axes — flooding domain size, SPF recomputation cost, policy expressiveness, and operational complexity.

RFC 7938 (Use of BGP for Routing in Large-Scale Data Centers) codified the alternative: eBGP as a single-protocol underlay over a Clos (leaf-spine) topology.

AspectPattern
Topology3-stage or 5-stage Clos (leaf-spine, leaf-spine-superspine)
ASN allocationPrivate 4-byte ASNs; one ASN per leaf, one per spine (or per spine tier)
PeeringeBGP point-to-point on routed /31 links; no L2 below the leaf
Loop preventionNative via AS_PATH; no STP, no MLAG, no large L2 domains
Multipathingmultipath multiple-as enabled; all leaf→spine uplinks active concurrently
Failure domainSingle link/node failure → one or two BGP UPDATE messages; no SPF cascade
PolicyCommunities, AS_PATH manipulation, prefix-lists — significantly more expressive than IGP

AI training fabrics depend on this pattern for one principal reason: ECMP across many parallel paths is mandatory, not optional. A GPU cluster running collective operations (AllReduce, AllGather) generates flows that must be spread across every available leaf→spine uplink to achieve target Job Completion Time (JCT) and tail latency. A routing protocol that cannot cleanly express N-way ECMP at scale is disqualified.

eBGP gives you this for free, with operational characteristics — small failure domain, declarative policy, no flooding — that align with the high-radix, high-bandwidth (400G/800G) topology of modern AI fabrics.


2. The Limitation: Elephant Flows and Downstream Congestion

Section titled “2. The Limitation: Elephant Flows and Downstream Congestion”

AI/ML training traffic differs from conventional data center workloads in two critical ways:

  1. Elephant flows. A single fat flow may live for the entire duration of a training step, continuously synchronizing tensors between GPU-enabled servers in n-to-1 or 1-to-n communication patterns.
  2. Low flow entropy. With a small number of GPUs and a small number of long-lived flows, the 5-tuple hash space is sparse. Standard ECMP hashing degenerates into pathological load distributions — some uplinks saturated, others idle.

Dynamic Load Balancing (DLB) addresses local link conditions: the ingress leaf chooses an uplink based on queue depth and bandwidth utilization of its own outgoing interfaces.

The failure mode this misses is in-cast congestion at the spine:

  • Traffic from multiple ingress leaves converges on a spine.
  • The spine’s egress link toward the destination leaf saturates.
  • DCQCN ECN and PFC-DSCP eventually signal the congestion back, but by then the damage to JCT and tail latency is done.
  • Critically, the ingress leaf had no way to know in advance that this particular spine’s downstream link was congested.

The ingress leaf needs visibility into the end-to-end path quality, not just its local link quality.


3. GLB Architecture: BGP NNHN + ASIC Heartbeats

Section titled “3. GLB Architecture: BGP NNHN + ASIC Heartbeats”

Juniper’s Global Load Balancing (GLB), introduced in Junos OS Evolved 23.4R2 on Tomahawk 5-based QFX5240 platforms, extends DLB to consider downstream path quality. GLB is structured as two cooperating planes.

GLB introduces a new BGP attribute, Next-to-Next-Hop-Node (NNHN) capability, carried inside the Next-Hop Dependent Capabilities attribute of a BGP UPDATE. It is standardized via draft-wang-idr-next-next-hop-nodes.

The NNHN TLV carries:

FieldDescription
Next-hop BGP ID32-bit BGP identifier of the next-hop-node attaching this capability
Next-next-hop BGP IDsOne or more 32-bit identifiers, each representing an NNH used by the NH for ECMP forwarding for the advertised NLRI

In plain terms: when a spine advertises a prefix to a leaf, it also signals which downstream leaves it would ECMP that prefix to. The ingress leaf thus learns the two-hop topology from BGP control plane signaling alone.

The second component is a per-hop liveness and quality signal generated by the Packet Forwarding Engine (PFE):

PropertyValue
EncapsulationUDP
DestinationMulticast group 224.0.0.149 (default; configurable)
ScopeSent only on interfaces with active BGP underlay peering between leaf and spine
PayloadLink quality metric, carried in the data portion of the packet

Restricting heartbeats to BGP-peered fabric links is deliberate: GLB is enabled only on the leaf-spine topology and ignores quality information from links that are not part of the fabric (e.g., management or storage networks).

The ingress leaf’s PFE correlates the two streams:

  • BGP NNHN advertisements tell it which NNH a given uplink leads to.
  • Heartbeats from the spine PFE tell it the current quality of the spine→NNH link.

The forwarding decision for an elephant flow then becomes:

score(uplink_i) = local_link_quality(uplink_i)
+ downstream_quality(NNH_reached_via_uplink_i)

This allows the ingress leaf to proactively avoid a congested spine-to-egress-leaf link by selecting a different spine, before any ECN or PFC feedback is generated.


Juniper exposes GLB as two configuration modes, one per tier:

set protocols bgp global-load-balancing helper-only
set forwarding-options enhanced-hash-key ecmp-dlb <flowlet | per-packet>

In this mode:

  • BGP emits the NNHN capability on routes it advertises.
  • The GLB application monitors quality of all local links with active eBGP sessions.
  • Quality information is flooded to direct neighbors via heartbeats.
set protocols bgp global-load-balancing load-balancer-only
set forwarding-options enhanced-hash-key ecmp-dlb <flowlet | per-packet>

In this mode:

  • BGP does not emit the NNHN capability.
  • The leaf receives link quality from neighboring spines.
  • The combined NH + NNH quality drives ECMP forwarding decisions.

GLB can be selectively disabled per interface or per peer group, allowing mixed-mode deployments during phased rollouts.


A show route lookup on a leaf for a prefix originated by a remote leaf should display, in addition to the standard next-hop attribute, the NNHN capability with the list of next-next-hop-node BGP IDs. This confirms the control-plane half of GLB is operational.

Heartbeats can be observed via port-mirroring on a leaf-facing interface. The expected capture shows:

  • Outer UDP, destination IP 224.0.0.149 (default multicast group).
  • Inner payload containing the per-link quality metric.
  • Source: the spine’s PFE on a link with active BGP underlay peering.

6. Comparison with Vendor-equivalent Mechanisms

Section titled “6. Comparison with Vendor-equivalent Mechanisms”

The general pattern — BGP signals fabric topology; ASIC measures and acts on path quality — is not unique to Juniper. Equivalent or adjacent technologies exist across vendors:

VendorTechnologyControl planeData plane
JuniperGLBBGP NNHNHeartbeat (UDP multicast)
Broadcom (silicon)DLB / CLB— (vendor-overlay)Local + telemetry-driven
NVIDIASpectrum-X adaptive routingPer-packet adaptive on SHARP
CiscoSilicon-level load balancing (SiLB / DLB)Local congestion-aware

GLB is distinguished by the use of a standards-track BGP extension for the control plane, which makes its topology-discovery mechanism interoperable in principle across vendors that adopt the NNHN draft.


  • eBGP as DC underlay is the baseline pattern (RFC 7938) and is a hard prerequisite for AI/ML fabrics because of its native, scalable, declarative support for ECMP.
  • Elephant flows + downstream congestion expose a structural limitation of DLB: the ingress leaf cannot react to congestion that occurs beyond its own uplinks.
  • GLB closes this gap by combining a new BGP signaling capability (NNHN) with a per-hop quality heartbeat at the ASIC level, allowing forwarding decisions to incorporate end-to-end path quality.
  • Role separation (helper-only on spines, load-balancer-only on leaves) keeps the protocol semantics clean and the deployment incremental.

  • RFC 7938, Use of BGP for Routing in Large-Scale Data Centers
  • draft-wang-idr-next-next-hop-nodes-00, BGP Next-to-Next-Hop-Nodes Capability
  • Juniper Networks, Global Load Balancing (GLB), Junos OS Evolved documentation
  • M. Styszynski, Avoiding AI/ML traffic congestion with global load balancing, Juniper Blogs, Oct 2024