Chapter 6: Efficient Load Balancing

This chapter explains how load balancing works in AI/ML data center fabrics and why conventional ECMP is often not enough for large RoCEv2 training clusters.

The core idea is:

AI fabrics need load balancing that understands low entropy, elephant flows, local and remote congestion, packet ordering, and workload policy.

The chapter focuses on these topics:

  • ECMP control-plane and data-plane behavior
  • Why RoCEv2 traffic often has low flow entropy
  • Static Load Balancing, SLB
  • Dynamic Load Balancing, DLB
  • Flowlet-based balancing and reactive rebalancing
  • Global Load Balancing, GLB
  • BGP Next-Next-Hop Nodes, NNHN, and GLB heartbeats
  • Traffic Engineering-Based Load Balancing, TELB
  • Per-packet load balancing and selective packet spraying
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef workload fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fabric fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef method fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef advanced fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef risk fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    W[AI training workload<br/>few applications, huge GPU flows]:::workload
    R[RoCEv2 traffic<br/>UDP/IP + BTH + QP]:::workload
    E[Low entropy<br/>similar source/destination pairs]:::risk
    C[Clos fabric<br/>many equal-cost spine paths]:::fabric

    SLB[SLB<br/>packet header hash]:::method
    DLB[DLB<br/>local link and queue quality]:::method
    GLB[GLB<br/>remote link quality and topology]:::advanced
    TELB[TELB<br/>tenant/job/path policy]:::advanced
    PPS[Per-packet spraying<br/>packet-level path spread]:::advanced

    W --> R
    R --> E
    E --> C
    C --> SLB
    C --> DLB
    C --> GLB
    C --> TELB
    C --> PPS

AI/ML fabrics are not typical enterprise data center fabrics. A cluster may run a small number of large distributed jobs, and most of its traffic may be RDMA over Converged Ethernet version 2, RoCEv2.

Important properties:

  • Traffic is dominated by east-west GPU-to-GPU communication.
  • Distributed training creates synchronized bursts.
  • The number of large flows may be small.
  • Many flows share similar source/destination IP and UDP fields.
  • RoCEv2 normally uses UDP destination port 4791.
  • A few elephant flows can consume whole leaf-spine or spine-leaf links.
  • Packet reordering can hurt RDMA unless the NIC, DPU, or transport can handle it.

Entropy means the amount of useful variation in packet header fields that a switch can hash on.

Traditional ECMP often hashes on a 5-tuple:

| Field | Meaning |
| --- | --- |
| Source IP | Sender address |
| Destination IP | Receiver address |
| Source port | Transport source port |
| Destination port | Transport destination port |
| Protocol | TCP, UDP, and so on |

In AI RoCEv2 fabrics, this can be weak because the same GPU pairs may communicate repeatedly, UDP destination port 4791 is common, and the number of flows is often much smaller than in web or enterprise traffic.

To improve entropy, some implementations can include RoCEv2 Base Transport Header, BTH, fields such as the destination Queue Pair, QP, number in the hash calculation.

An elephant flow is a large, bandwidth-heavy flow. A mouse flow is a short, small flow.

In AI training, elephant flows are common:

  • Gradient synchronization
  • Activation transfer
  • Parameter exchange
  • AllReduce, AllGather, ReduceScatter, and AlltoAll communication
  • Checkpoint-heavy storage or synchronization traffic

When multiple elephant flows hash to the same ECMP member, that link can become congested while other equal-cost links remain underused.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef gpu fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef hot fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef idle fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    G1[GPU flow 1<br/>elephant]:::gpu
    G2[GPU flow 2<br/>elephant]:::gpu
    LA[Leaf A]:::leaf
    SB[Spine B<br/>hot link]:::hot
    SA[Spine A<br/>unused capacity]:::idle
    SC[Spine C<br/>unused capacity]:::idle
    LD[Leaf D]:::leaf
    DST[Destination GPUs]:::gpu

    G1 --> LA
    G2 --> LA
    LA ==>|both flows hash here| SB
    LA -.-> SA
    LA -.-> SC
    SB ==> LD
    SA -.-> LD
    SC -.-> LD
    LD --> DST

The result is poor fabric utilization, longer tail latency, more ECN/CNP activity, and possibly more PFC pressure.
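How likely such a collision is can be estimated with a birthday-style calculation. The sketch below assumes an idealized hash that places each flow on a link uniformly and independently, which real ASIC hashes only approximate; the function name is illustrative.

```python
from math import prod


def collision_probability(num_flows: int, num_links: int) -> float:
    """Chance that at least two flows share a link under uniform hashing."""
    if num_flows > num_links:
        return 1.0  # pigeonhole: some link must carry two flows
    # Probability that every flow lands on a link no earlier flow used.
    no_collision = prod((num_links - i) / num_links for i in range(num_flows))
    return 1.0 - no_collision


if __name__ == "__main__":
    for flows, links in [(2, 3), (4, 8), (8, 16)]:
        p = collision_probability(flows, links)
        print(f"{flows} flows over {links} links: {p:.2f}")
```

With two elephant flows and three spine uplinks, as in the diagram, the chance that both land on the same link is one in three under this assumption.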


Equal-Cost Multipathing, ECMP, lets a leaf switch use several equal-cost paths through the spine layer.

Example:

  • GPU1 is attached behind Leaf A.
  • GPU2 is attached behind Leaf B.
  • Leaf A can reach Leaf B through Spine A, Spine B, or Spine C.
  • BGP or an IGP installs equal-cost next hops.
  • The ASIC chooses one next hop for each flow.

The routing control plane tells the switch which next hops are available.

In a BGP-based IP fabric:

  1. Leaf B advertises the GPU2 prefix to the spines.
  2. Spine A, Spine B, and Spine C advertise the same prefix to Leaf A.
  3. Leaf A sees multiple routes with equal attributes.
  4. Leaf A programs an ECMP next-hop group in the ASIC.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef route fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    LA[Leaf A<br/>ingress]:::leaf
    SA[Spine A<br/>next hop]:::spine
    SB[Spine B<br/>next hop]:::spine
    SC[Spine C<br/>next hop]:::spine
    LB[Leaf B<br/>GPU2 prefix]:::leaf
    NH[ASIC ECMP group<br/>nh1, nh2, nh3]:::route

    LB -->|BGP advertises GPU2 prefix| SA
    LB -->|BGP advertises GPU2 prefix| SB
    LB -->|BGP advertises GPU2 prefix| SC
    SA -->|next-hop self| LA
    SB -->|next-hop self| LA
    SC -->|next-hop self| LA
    LA -.-> NH

For a vendor reference topology, see Juniper’s EBGP underlay overview in its IP fabric underlay design guide.

The data plane makes the forwarding decision at packet arrival time. The control plane is not consulted for every packet.

The switch ASIC:

  1. Parses packet headers.
  2. Looks up the destination in the forwarding table.
  3. Finds an ECMP next-hop group.
  4. Computes a hash over selected packet fields.
  5. Maps the hash result to an ECMP bucket.
  6. Sends the packet toward the chosen spine.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef packet fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef stage fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef table fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef out fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    P[Packet from GPU]:::packet
    H[Header parse<br/>IP, UDP, BTH, VLAN/VNI]:::stage
    L[Route lookup<br/>destination prefix]:::table
    E[ECMP next-hop group<br/>Spine A, B, C]:::table
    X[Hash calculation<br/>5-tuple or extended fields]:::stage
    B[ECMP bucket<br/>member selection]:::table
    O[Chosen spine link]:::out

    P --> H --> L --> E
    H --> X --> B
    E --> B --> O

For classic ECMP, the ASIC does not need to remember every flow. The hash is deterministic, so packets with the same hash inputs select the same ECMP member while different flows are spread across the available spine links.
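The stateless selection described above can be sketched in a few lines. This is only an illustration: `zlib.crc32` stands in for a vendor-specific hardware hash, and the addresses, port numbers, and spine names are made up.

```python
import zlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # 17 = UDP


def select_member(flow: FiveTuple, members: list[str]) -> str:
    """Pick an ECMP member from a deterministic hash of the flow key."""
    key = "|".join(str(field) for field in flow).encode()
    # No per-flow state: the same key always maps to the same index.
    return members[zlib.crc32(key) % len(members)]


if __name__ == "__main__":
    spines = ["spine-a", "spine-b", "spine-c"]
    flow = FiveTuple("10.0.1.10", "10.0.2.20", 49152, 4791, 17)
    # Every packet of this flow selects the same spine.
    print(select_member(flow, spines), select_member(flow, spines))
```

Note that a plain modulo over the member count remaps many flows when a member is added or removed, which is the problem resilient hashing addresses.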

Possible hash inputs include:

| Field | Usefulness |
| --- | --- |
| Source IP | Basic flow separation |
| Destination IP | Basic flow separation |
| Source UDP/TCP port | Helps when varied |
| Destination UDP/TCP port | Weak for RoCEv2 if many packets use UDP 4791 |
| Protocol | Basic 5-tuple field |
| VLAN/VNI | Useful in overlays or segmentation |
| RoCEv2 BTH QP | Improves entropy for RDMA traffic |

The chapter’s practical point is that default packet-header hashing may be too coarse for AI training. The switch may need RoCEv2-aware hash fields or a more adaptive load-balancing mechanism.
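A quick way to see the effect of extra hash fields is to count how many distinct hash inputs a set of packets produces. The sketch below assumes eight hypothetical queue pairs between the same two NICs that all use one UDP source port; the field names and values are illustrative, and some NICs do vary the source port in practice.

```python
def distinct_keys(packets: list[dict], fields: tuple[str, ...]) -> int:
    """Count how many different hash inputs the selected fields produce."""
    return len({tuple(pkt[f] for f in fields) for pkt in packets})


# Eight hypothetical QPs between one GPU pair, all with the same 5-tuple.
PACKETS = [
    {"src_ip": "10.0.1.10", "dst_ip": "10.0.2.20", "src_port": 49152,
     "dst_port": 4791, "protocol": 17, "dest_qp": 0x100 + i}
    for i in range(8)
]
FIVE_TUPLE = ("src_ip", "dst_ip", "src_port", "dst_port", "protocol")

if __name__ == "__main__":
    print("5-tuple only:", distinct_keys(PACKETS, FIVE_TUPLE))
    print("5-tuple + QP:", distinct_keys(PACKETS, FIVE_TUPLE + ("dest_qp",)))
```

Under these assumptions the 5-tuple yields a single hash input, so all eight QPs follow one path, while adding the QP yields eight inputs that the hash can spread.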


The chapter compares several mechanisms:

Proactive load balancing and reactive congestion management

Juniper’s elephant-flow discussion separates the problem into proactive and reactive controls. Proactive mechanisms try to avoid congestion before it forms by choosing better paths for flows, flowlets, or selected packets. Reactive mechanisms respond after queues begin to build by marking, pausing, or slowing traffic.

In this model, SLB, DLB, DLB v2, GLB, TELB, and selective packet spraying are mainly proactive load-balancing tools. ECN, PFC, and DCQCN are reactive congestion-management tools. A RoCEv2 fabric usually needs both: proactive path selection to reduce hot spots, and reactive congestion control to protect loss-sensitive RDMA traffic when queues still build.

AI fabric load-balancing mechanism pyramid

This pyramid is a simplified, original diagram based on the load-balancing categories discussed in Juniper’s AI data center elephant-flow blog.

| Mechanism | Decision Input | Main Benefit | Main Risk |
| --- | --- | --- | --- |
| SLB | Packet or flow header hash | Simple and widely deployed | Blind to real-time congestion |
| DLB | Header hash plus local link and queue quality | Better local link utilization | Does not see remote spine-to-leaf congestion |
| GLB | Local and remote link quality plus topology | Better end-to-end path choice | Requires topology awareness and heartbeat scale |
| TELB | Tenant, GPU, QP, port, path policy | Predictable path control | Operational and policy complexity |
| Per-packet | Packet-level path choice | Very high link utilization | Packet reordering |

Static Load Balancing is the traditional Ethernet ECMP model. It uses packet or flow header fields to assign a flow to an outgoing ECMP member.

SLB works well when:

  • There are many flows.
  • Flow sizes are diverse.
  • Header entropy is high.
  • Link capacities are symmetric.
  • Workloads are not tightly synchronized.

It is weaker for AI fabrics because it does not consider:

  • Current link utilization
  • Queue depth
  • Flow size
  • Remote congestion
  • Whether elephant flows collided on the same path

Example:

| Flow | Assigned Link | Rate |
| --- | --- | --- |
| Flow 1 | Leaf A to Spine A | 50 Gbps |
| Flow 3 | Leaf A to Spine A | 50 Gbps |
| Flow 2 | Leaf A to Spine B | 100 Gbps |
| Flow 4 | Leaf A to Spine B | 100 Gbps |

If each leaf-spine link is 200 Gbps:

| Link | Load | Utilization |
| --- | --- | --- |
| Leaf A to Spine A | 100 Gbps | 50% |
| Leaf A to Spine B | 200 Gbps | 100% |

The number of flows is balanced, but bandwidth is not.
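The arithmetic behind this example can be reproduced directly. The sketch below sums the per-flow rates on each link and divides by the 200 Gbps capacity used in the example; the helper name is illustrative.

```python
from collections import defaultdict

LINK_CAPACITY_GBPS = 200

# (flow, assigned link, rate in Gbps), copied from the example.
ASSIGNMENTS = [
    ("Flow 1", "Leaf A to Spine A", 50),
    ("Flow 3", "Leaf A to Spine A", 50),
    ("Flow 2", "Leaf A to Spine B", 100),
    ("Flow 4", "Leaf A to Spine B", 100),
]


def link_utilization(assignments, capacity_gbps):
    """Return {link: (flow_count, load_gbps, utilization_fraction)}."""
    load = defaultdict(float)
    count = defaultdict(int)
    for _flow, link, rate in assignments:
        load[link] += rate
        count[link] += 1
    return {
        link: (count[link], load[link], load[link] / capacity_gbps)
        for link in load
    }


if __name__ == "__main__":
    stats = link_utilization(ASSIGNMENTS, LINK_CAPACITY_GBPS)
    for link, (flows, gbps, util) in stats.items():
        print(f"{link}: {flows} flows, {gbps:.0f} Gbps, {util:.0%}")
```

Both links carry two flows, yet one is at 50% and the other at 100%, which is the imbalance a pure header hash cannot see.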

Resilient, Symmetric, and Weighted Hashing

SLB can be enhanced, but it is still fundamentally hash-based.

| Enhancement | Purpose |
| --- | --- |
| Resilient hashing | Reduce flow churn when ECMP members fail or recover |
| Symmetric hashing | Keep forward and reverse directions on the same path |
| Weighted ECMP | Send more hash buckets to higher-capacity or preferred paths |

Resilient hashing commonly uses a fixed bucket table, such as 512 or 1024 buckets. When a link fails, only buckets pointing to that link are remapped, instead of remapping many flows because the ECMP member count changed.
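A minimal sketch of the fixed bucket table follows. It assumes a 512-bucket table filled round-robin and uses `zlib.crc32` as a stand-in for the hardware hash; the class and method names are illustrative, not any vendor's API.

```python
import zlib


class ResilientGroup:
    """Fixed-size bucket table; only a failed member's buckets are remapped."""

    def __init__(self, members, num_buckets=512):
        self.members = list(members)
        self.buckets = [
            self.members[i % len(self.members)] for i in range(num_buckets)
        ]

    def select(self, flow_key: bytes) -> str:
        # The bucket index depends on the table size, not the member count.
        return self.buckets[zlib.crc32(flow_key) % len(self.buckets)]

    def fail(self, member: str) -> int:
        """Repoint only the failed member's buckets; return how many changed."""
        if len(self.members) <= 1:
            raise RuntimeError("cannot remove the last ECMP member")
        self.members.remove(member)
        changed = 0
        for i, owner in enumerate(self.buckets):
            if owner == member:
                self.buckets[i] = self.members[changed % len(self.members)]
                changed += 1
        return changed


if __name__ == "__main__":
    group = ResilientGroup(["spine-a", "spine-b", "spine-c"])
    print("buckets remapped:", group.fail("spine-b"))
```

Flows whose buckets pointed at the surviving spines keep their path, because their bucket index and bucket contents are both unchanged.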

Weighted ECMP changes bucket distribution:

| Next Hop | Capacity Example | Weight | Bucket Share |
| --- | --- | --- | --- |
| Spine A | 400G | 2 | 40% |
| Spine B | 400G | 2 | 40% |
| Spine C | 200G | 1 | 20% |
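The bucket shares follow from dividing each weight by the sum of weights. The sketch below also splits a bucket table by weight using a largest-remainder rule, which is one reasonable assumption for handling table sizes that do not divide evenly; real implementations may round differently.

```python
def bucket_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of hash buckets each next hop should receive."""
    total = sum(weights.values())
    return {nh: w / total for nh, w in weights.items()}


def allocate_buckets(weights: dict[str, int], num_buckets: int) -> dict[str, int]:
    """Largest-remainder split of a bucket table by weight."""
    total = sum(weights.values())
    exact = {nh: num_buckets * w / total for nh, w in weights.items()}
    counts = {nh: int(x) for nh, x in exact.items()}
    leftover = num_buckets - sum(counts.values())
    # Hand leftover buckets to the next hops with the largest fractional part.
    by_remainder = sorted(exact, key=lambda nh: exact[nh] - counts[nh], reverse=True)
    for nh in by_remainder[:leftover]:
        counts[nh] += 1
    return counts


WEIGHTS = {"Spine A": 2, "Spine B": 2, "Spine C": 1}

if __name__ == "__main__":
    print(bucket_shares(WEIGHTS))
    print(allocate_buckets(WEIGHTS, 512))
```

With weights 2, 2, and 1 the shares are 40%, 40%, and 20%, matching the table; a 512-bucket table splits into 205, 205, and 102 buckets under this rounding rule.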

These tools help normal fabrics, but they do not solve the core AI problem when there are only a few low-entropy elephant flows.

Dynamic Load Balancing improves on SLB by adding local link quality information.

DLB considers signals such as:

  • Link utilization
  • Queue depth
  • Buffer utilization
  • Local congestion
  • Recent usage

The ASIC or packet forwarding engine keeps a link quality table and chooses better local ECMP members for new flows or flowlets.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef signal fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef asic fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef table fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    U[Interface utilization]:::signal
    Q[Queue depth / buffer use]:::signal
    C[Local congestion signal]:::signal
    A[ASIC microkernel<br/>quality algorithm]:::asic
    T[Link quality table<br/>per ECMP member]:::table
    D[Forwarding decision<br/>choose better local link]:::decision

    U --> A
    Q --> A
    C --> A
    A --> T
    T --> D

With the earlier example, DLB can avoid putting both 100 Gbps flows on the same link. Instead, it can drive both links to roughly 75% utilization.
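A simplified model of this decision places each arriving flow on the link with the lowest current load. The sketch assumes the switch already sees earlier flows at full rate when the next flow arrives and that flows arrive in the order Flow 1, Flow 3, Flow 2, Flow 4; both are assumptions made for illustration.

```python
LINK_CAPACITY_GBPS = 200


def assign_least_loaded(flows, links):
    """Place each arriving flow on the link with the lowest current load."""
    load = {link: 0.0 for link in links}
    placement = {}
    for name, rate_gbps in flows:
        best = min(links, key=lambda link: load[link])  # ties go to the first link
        placement[name] = best
        load[best] += rate_gbps
    return placement, load


FLOWS = [("Flow 1", 50), ("Flow 3", 50), ("Flow 2", 100), ("Flow 4", 100)]
LINKS = ["Spine A", "Spine B"]

if __name__ == "__main__":
    placement, load = assign_least_loaded(FLOWS, LINKS)
    for link, gbps in load.items():
        print(f"{link}: {gbps:.0f} Gbps, {gbps / LINK_CAPACITY_GBPS:.0%}")
```

Here each link ends up with one 50 Gbps and one 100 Gbps flow, or 150 Gbps (75%). The result depends on arrival order: with the order 50, 100, 50, 100 this greedy rule would still end at 200 Gbps and 100 Gbps, which is one reason later reassignment of flowlets or long-lived flows is useful.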

Assigned-flow mode pins an active flow to an interface for the flow’s lifetime.

Benefits:

  • Preserves packet ordering
  • Works well for short-lived or high-entropy traffic
  • Simple behavior after initial assignment

Limitations:

  • A long-lived elephant flow can remain on a path even after conditions change.
  • It is less useful when a small number of flows dominate the link.
  • It can still leave persistent imbalance in AI workloads.

Flowlet mode splits one flow into bursts separated by idle gaps.

The switch monitors inactivity:

  • If the idle gap is shorter than the inactivity timer, the flow stays on the same path.
  • If the idle gap is longer than the inactivity timer, the next burst can be assigned to a different path.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant GPU1 as GPU 1
    participant Leaf as Ingress Leaf
    participant A as Spine A
    participant B as Spine B
    participant GPU2 as GPU 2

    GPU1->>Leaf: Flowlet 1
    Leaf->>A: Use current best path
    A->>GPU2: Deliver packets in order
    Note over GPU1,Leaf: idle gap > inactivity timer
    GPU1->>Leaf: Flowlet 2
    Leaf->>B: Reassign to better path
    B->>GPU2: Deliver next burst

Why it fits AI workloads:

  • Distributed training often has bursty communication phases.
  • A collective may create bursts separated by compute or synchronization gaps.
  • Flowlet reassignment improves load distribution while reducing packet reordering risk.
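The inactivity-timer rule for flowlets can be sketched as a small per-flow state table. The class name, the microsecond units, and the `pick_best_path` callback are assumptions for illustration; a gap exactly equal to the timer is treated here as the same flowlet, which the text does not specify.

```python
class FlowletTable:
    """Track the last-seen time and current path of each flow."""

    def __init__(self, inactivity_timer_us: float):
        self.inactivity_timer_us = inactivity_timer_us
        self.state = {}  # flow_id -> (last_seen_us, path)

    def forward(self, flow_id, now_us, pick_best_path):
        entry = self.state.get(flow_id)
        if entry is not None and now_us - entry[0] <= self.inactivity_timer_us:
            path = entry[1]          # same flowlet: keep the path and the ordering
        else:
            path = pick_best_path()  # new flow or new flowlet: free to move
        self.state[flow_id] = (now_us, path)
        return path


if __name__ == "__main__":
    table = FlowletTable(inactivity_timer_us=100)
    best = iter(["spine-a", "spine-b"])  # what the quality table would return
    for t in (0, 40, 90, 400):
        print(t, table.forward("gpu1->gpu2", t, lambda: next(best)))
```

The packets at 0, 40, and 90 microseconds stay on the first path because each gap is within the timer; the packet at 400 microseconds starts a new flowlet and may take a different path.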

Reactive path rebalancing monitors the quality of the path assigned to a long-lived flow.

If a link degrades and another ECMP member has better quality, the switch may move the flow or a later burst to the better path.

Trade-off:

  • Better response to changing congestion
  • Possible short-term packet reordering
  • Requires NIC/RDMA stack tolerance if packets from the same logical flow can arrive out of order

DLB per-packet mode sprays packets from the same flow across ECMP members based on the link quality table.

This can improve link utilization, but it directly creates packet reordering risk. It should be used only when the receiver side can handle the reordering for the relevant RDMA operation or when the traffic class is safe for spraying.


DLB only sees local link quality. GLB extends the idea by using remote link quality and topology information to select a better end-to-end path.

Consider two ingress leaves sending traffic to the same egress leaf. Each ingress leaf may independently choose Spine A because its local link to Spine A looks good. However, both flows then share the single downlink from Spine A toward the egress leaf, and that downlink can become congested.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef hot fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef ok fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    LA[Leaf A<br/>ingress]:::leaf
    LB[Leaf B<br/>ingress]:::leaf
    SA[Spine A]:::spine
    SB[Spine B]:::ok
    LD[Leaf D<br/>egress]:::leaf

    LA -->|local DLB chooses A| SA
    LB -->|local DLB chooses A| SA
    SA ==>|shared congested downlink| LD
    LA -.-> SB
    LB -.-> SB
    SB -.-> LD
    linkStyle 2 stroke:#e11d48,stroke-width:3px

DLB improved the ingress leaf decision, but it did not see the downstream congestion from Spine A to Leaf D.

Remote Link Quality and End-to-End Path Quality

GLB lets a leaf consider:

  • Local leaf-to-spine quality
  • Remote spine-to-egress-leaf quality
  • Destination location
  • Next-hop and next-next-hop topology
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef local fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef remote fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef topo fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    LQ[Local link quality<br/>Leaf to Spine]:::local
    RQ[Remote link quality<br/>Spine to egress Leaf]:::remote
    TO[Topology mapping<br/>destination behind Leaf D]:::topo
    PQ[End-to-end path quality]:::decision
    BEST[Choose best path<br/>for new flow or flowlet]:::decision

    LQ --> PQ
    RQ --> PQ
    TO --> PQ
    PQ --> BEST

GLB can reduce the chance that many elephant flows converge on the same congested spine-to-leaf link. It can also reduce DCQCN triggers and PFC pressure because new traffic can be steered away from degraded paths earlier.
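One way to picture the end-to-end decision is to score each path by its weakest hop. The sketch below assumes hypothetical quality scores where higher means less congested and combines local and remote quality with `min`; actual GLB implementations may combine them differently.

```python
def best_next_hop(local_quality, remote_quality, egress_leaf):
    """Choose the spine whose weakest hop toward egress_leaf is best."""
    def path_quality(spine):
        # A path is only as good as its worst link.
        return min(local_quality[spine], remote_quality[(spine, egress_leaf)])
    return max(local_quality, key=path_quality)


# Hypothetical scores seen by Leaf A (higher is better).
LOCAL = {"Spine A": 9, "Spine B": 7}
REMOTE = {
    ("Spine A", "Leaf D"): 2,  # congested spine-to-leaf downlink
    ("Spine B", "Leaf D"): 8,
}

if __name__ == "__main__":
    print("local view only:", max(LOCAL, key=LOCAL.get))
    print("end-to-end view:", best_next_hop(LOCAL, REMOTE, "Leaf D"))
```

With only local scores Leaf A prefers Spine A; once the remote score is included, Spine B becomes the better choice for traffic to Leaf D.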

GLB needs two kinds of information:

| Component | Role |
| --- | --- |
| Control-plane topology | Tell a leaf which egress node is behind which next-next-hop |
| ASIC-level heartbeat | Carry fast path quality information between neighboring switches |

BGP Next-Next-Hop Nodes, NNHN, is one proposed way to tell a switch which nodes sit behind a next hop for ECMP forwarding. In a leaf-spine fabric, the next hop may be a spine and the next-next-hop may be the egress leaf.

Example:

| View from Leaf A | Meaning |
| --- | --- |
| Destination GPU2 prefix | The prefix Leaf A wants to reach |
| Next hop | Spine A, Spine B, or Spine C |
| Next-next-hop | Leaf B, the egress leaf behind those spines |
| NNHN topology signal | Which egress leaves are reachable behind each spine next hop |

Without NNHN-like topology information, Leaf A knows that several spines can reach the destination prefix, but it does not have a clean control-plane mapping from each spine next hop to the egress leaf behind it. With NNHN, the switch can associate a next hop with the downstream node that matters for path quality.

GLB heartbeats carry link quality information at the forwarding level. The chapter describes heartbeats as change-based, with an example interval of 20 ms.

The division of labor is important:

| Signal | Question Answered | Example |
| --- | --- | --- |
| BGP NNHN | Where can this next hop take me? | Spine A can reach Leaf B |
| GLB heartbeat | How healthy is that path right now? | Spine A to Leaf B is congested |
| GLB decision | Which path should this flow or flowlet use? | Prefer Spine C for traffic to Leaf B |

For more detail on the BGP underlay assumptions, NNHN signaling, and GLB operational roles, see Appendix: BGP-based Underlay and GLB NNHN.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef control fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef data fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef table fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    BGP[BGP NNHN<br/>next-hop and next-next-hop topology]:::control
    HB[GLB heartbeat<br/>link quality signal]:::data
    PFE[PFE / ASIC<br/>path monitor]:::data
    PQ[Path quality profile<br/>simple or compound]:::table
    LB[GLB forwarding decision]:::table

    BGP --> PFE
    HB --> PFE
    PFE --> PQ
    PQ --> LB

| Fabric Area | GLB Application |
| --- | --- |
| Three-stage Clos | GLB can run across all leaf and spine devices |
| Within a pod | In five-stage Clos, each pod can be treated like its own three-stage GLB domain |
| Spine to super-spine | Useful when upper layers are oversubscribed or congested |
| All layers | Possible, but table scale and heartbeat overhead must be evaluated |

Important constraints:

  • GLB is newer than SLB and DLB.
  • Vendor implementation matters.
  • Heartbeat frequency and scale must be tuned.
  • Microburst behavior must be tested.
  • Multi-vendor interoperability may still be difficult.


Traffic Engineering-Based Load Balancing, TELB

Traffic Engineering-Based Load Balancing uses policy to steer AI traffic over specific logical paths.

SLB, DLB, and GLB all leave the path choice to hashing or measured link quality, so the path a given flow takes is not known in advance. TELB is useful when an operator wants more deterministic behavior.

TELB can be useful for:

  • Multi-tenant AI fabrics
  • Predictable job performance
  • Tenant isolation
  • Steering a tenant, GPU, QP range, or UDP port range to a path color
  • Keeping packet order by limiting path diversity for selected traffic
  • Using backup path IDs after failures

The chapter notes that traditional service-provider traffic engineering technologies such as MPLS-TE or SR-MPLS are mature, but may be too heavy or expensive for AI data center fabrics. AI fabrics often need a lightweight pure-IP form of traffic engineering.

TELB can match traffic characteristics and assign a path color.

Possible match inputs:

| Match Input | Example Use |
| --- | --- |
| Tenant ID | Put one training job on a dedicated spine set |
| GPU ID | Map GPU traffic to a logical fabric color |
| QP range | Pin RoCEv2 QP ranges to path colors |
| Source UDP port range | Use port allocation to represent a job or tenant |
| Ingress interface | Tie a server rail or NIC to a logical path group |
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef match fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef policy fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef path fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef backup fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    T[Tenant / job ID]:::match
    G[GPU ID]:::match
    Q[QP or UDP port range]:::match
    I[Ingress interface]:::match
    P[Policy lookup<br/>path color]:::policy
    B[Backup path color]:::backup
    S[Selected spine set<br/>logical fabric color]:::path

    T --> P
    G --> P
    Q --> P
    I --> P
    P --> S
    P --> B

BGP Deterministic Path Forwarding, BGP-DPF, is described as one way to deliver TELB. The idea is to use BGP policy to pin traffic characteristics to colored paths.

Example policy model:

| GPU ID | QP Range | Tenant Port Range | Primary Path Color | Backup Path Color |
| --- | --- | --- | --- | --- |
| GPU 0 | 1000-1999 | Tenant 1 ports | Blue | Green |
| GPU 1 | 2000-2999 | Tenant 2 ports | Green | Blue |
| Any | | Storage sync range | Red | Blue |

The ASIC must be able to apply the policy quickly and switch to a backup path when a link or node fails.
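The lookup-then-fallback behavior can be sketched as a first-match policy table. This is not BGP-DPF itself: the `PathPolicy` type, the color names, and the health set are hypothetical, and only the two GPU rows of the example are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathPolicy:
    gpu_id: Optional[int]  # None matches any GPU
    qp_range: range
    primary: str
    backup: str


POLICIES = [
    PathPolicy(gpu_id=0, qp_range=range(1000, 2000), primary="blue", backup="green"),
    PathPolicy(gpu_id=1, qp_range=range(2000, 3000), primary="green", backup="blue"),
]


def select_color(gpu_id, qp, healthy_colors, policies=POLICIES):
    """First matching policy wins; use the backup color if the primary is down."""
    for policy in policies:
        if policy.gpu_id in (None, gpu_id) and qp in policy.qp_range:
            if policy.primary in healthy_colors:
                return policy.primary
            if policy.backup in healthy_colors:
                return policy.backup
            return None  # both colors unavailable
    return None  # no policy matched: fall back to default load balancing


if __name__ == "__main__":
    print(select_color(0, 1500, {"blue", "green"}))  # primary path
    print(select_color(0, 1500, {"green"}))          # blue failed, use backup
```

In hardware this decision would be a table lookup rather than a loop; the sketch only shows the matching order and the failover rule.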

A centralized controller can combine fabric telemetry with scheduler intent.

Inputs:

  • Tenant and job identity
  • GPU allocation
  • Leaf/spine utilization
  • Queue depth
  • ECN, PFC, and DCQCN signals
  • Link or node failure events
  • Policy objectives

Outputs:

  • Path color assignment
  • BGP-DPF or policy updates
  • Monitoring and alerting
  • Automated remediation
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef sched fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef telemetry fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef controller fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fabric fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    S[AI scheduler<br/>tenant, job, GPU placement]:::sched
    T[Fabric telemetry<br/>utilization, queues, ECN/PFC/DCQCN]:::telemetry
    C[Controller<br/>path policy and remediation]:::controller
    F[AI fabric<br/>BGP-DPF / ACL / path colors]:::fabric

    S --> C
    T --> C
    C --> F
    F --> T

Per-packet load balancing treats packets independently rather than pinning a whole flow to one path.

Benefits:

  • Uses ECMP members more evenly.
  • Can break a single elephant flow across multiple paths.
  • Can improve bandwidth utilization when flow entropy is low.

Main risk:

  • Packets can arrive out of order.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef packet fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef dest fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    SRC[One elephant flow]:::packet
    L[Ingress leaf]:::leaf
    A[Spine A<br/>packet A]:::spine
    B[Spine B<br/>packet B]:::spine
    C[Spine C<br/>packet C]:::spine
    D[Destination NIC<br/>reorder needed]:::dest

    SRC --> L
    L --> A --> D
    L --> B --> D
    L --> C --> D

Random spray sends packets across ECMP members randomly or round-robin.

It does not check link quality before sending each packet. It is simple, but it can still send packets to poor-quality links.

Selective packet spraying applies per-packet mode only to traffic that can tolerate or handle reordering.

The switch can match packet characteristics such as:

  • RoCEv2 opcode
  • BTH fields
  • QP
  • ACL match
  • Tenant or traffic class

The chapter notes that some modern 400G NICs and DPUs can handle reordering for selected RDMA operations, especially certain write operations. That makes selective spraying more practical than spraying everything.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef flow fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef match fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef normal fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spray fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    F1[Flow 1]:::flow
    F2[Flow 2]:::flow
    ACL{Packet header<br/>matches selective spraying rule?}:::match
    N[Default load balancing<br/>SLB / DLB / flowlet]:::normal
    P[Per-packet spraying<br/>only for safe traffic]:::spray

    F1 --> ACL
    F2 --> ACL
    ACL -->|No| N
    ACL -->|Yes| P

Selective spraying requires:

  • ASIC support for parsing the required fields
  • ACL or TCAM resources
  • Clear knowledge of NIC reordering capability
  • Per-traffic-class policy
  • Validation with real RDMA workloads
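The match-then-spray decision can be sketched as a classifier in front of two forwarding modes. The operation names, traffic classes, and packet dictionary layout below are hypothetical; which operations a given NIC can safely reorder has to come from the NIC vendor and from testing.

```python
import zlib
from itertools import count

SPRAY_SAFE_OPERATIONS = {"rdma_write"}  # assumption: NIC can reorder these
SPRAY_TRAFFIC_CLASSES = {"training"}    # assumption: per-class policy


def choose_mode(packet: dict) -> str:
    """Return 'per-packet' only when every spraying condition matches."""
    if (packet.get("operation") in SPRAY_SAFE_OPERATIONS
            and packet.get("traffic_class") in SPRAY_TRAFFIC_CLASSES):
        return "per-packet"
    return "default"


def make_forwarder(members):
    """Build a forwarder that sprays matching packets and hashes the rest."""
    round_robin = count()

    def forward(packet: dict) -> str:
        if choose_mode(packet) == "per-packet":
            return members[next(round_robin) % len(members)]
        # Default path: flow-level hash keeps the flow on one member.
        return members[zlib.crc32(packet["flow_id"].encode()) % len(members)]

    return forward


if __name__ == "__main__":
    forward = make_forwarder(["spine-a", "spine-b", "spine-c"])
    write = {"flow_id": "f1", "operation": "rdma_write", "traffic_class": "training"}
    send = {"flow_id": "f2", "operation": "send", "traffic_class": "training"}
    print([forward(write) for _ in range(3)])
    print([forward(send) for _ in range(3)])
```

The default branch here is a plain flow hash; in a real fabric it would be whatever SLB, DLB, or flowlet mode is configured for non-matching traffic.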

When packets take different spines, they may experience different queueing and processing delays. Packet B can arrive before packet A even if A was sent first.

The receiver then needs to:

  1. Buffer out-of-order packets.
  2. Identify missing sequence positions.
  3. Reorder packets.
  4. Forward in-order data to the application or RDMA operation.

Small amounts of reordering may be manageable. Under congestion, the number of out-of-order packets can grow and can hurt latency, throughput, buffer usage, and CPU/NIC/DPU work.
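The receiver steps listed above can be sketched as a small reorder buffer. It is a software illustration only: sequence-number wraparound, buffer limits, and retransmission requests are left out, and the class name is made up.

```python
class ReorderBuffer:
    """Hold out-of-order packets and release data strictly in sequence."""

    def __init__(self, first_seq: int = 0):
        self.next_seq = first_seq
        self.pending = {}  # seq -> payload waiting for earlier packets

    def receive(self, seq: int, payload):
        """Accept one packet; return the payloads now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate, already delivered or already buffered
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def missing(self):
        """Sequence positions still absent below the highest buffered packet."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


if __name__ == "__main__":
    buf = ReorderBuffer()
    for seq, data in [(1, "B"), (2, "C"), (0, "A")]:
        print(seq, buf.receive(seq, data), "missing:", buf.missing())
```

The `pending` dictionary is the buffer-usage cost mentioned above: the longer a gap stays open, the more packets accumulate behind it.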

The practical rule:

Per-packet load balancing is powerful, but it should be enabled only where the transport, NIC, and workload semantics can handle the reordering.

OpenAI’s MRC, Multipath Reliable Connection, is a recent example of packet spraying becoming part of the transport design rather than only a switch load-balancing mode.

MRC packet spraying compared with traditional single-path forwarding

This is an original diagram based on OpenAI’s MRC article and the MRC/SRv6 paper.

Traditional ECMP usually keeps one flow or RDMA transfer on one path:

One flow or transfer -> one path

This preserves ordering, but it also means a single elephant transfer can be trapped behind one path’s bandwidth and congestion state.

MRC changes the model:

One RDMA transfer -> many packets -> many paths

MRC extends RoCE and sprays packets from a single transfer across many paths in a multi-plane network. Packets may arrive out of order, but MRC packets carry enough placement information for the destination to deliver data to the correct memory location. In other words, MRC treats out-of-order delivery as an expected transport behavior, not as an accidental side effect.
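The idea of placement information can be illustrated with packets that carry their own destination offset, so arrival order stops mattering. This is a conceptual sketch only: the `(offset, payload)` tuple and the function names are assumptions, not the MRC packet format.

```python
import random


def place(buffer: bytearray, offset: int, payload: bytes) -> None:
    """Write a payload at the position named by the packet itself."""
    buffer[offset:offset + len(payload)] = payload


def spray_and_place(message: bytes, chunk_size: int, seed: int = 0) -> bytes:
    """Split a transfer, shuffle arrival order, and place each chunk by offset."""
    packets = [
        (offset, message[offset:offset + chunk_size])
        for offset in range(0, len(message), chunk_size)
    ]
    random.Random(seed).shuffle(packets)  # stand-in for multipath reordering
    destination = bytearray(len(message))
    for offset, payload in packets:
        place(destination, offset, payload)  # no reorder queue needed
    return bytes(destination)


if __name__ == "__main__":
    transfer = bytes(range(256)) * 16
    print(spray_and_place(transfer, chunk_size=512) == transfer)
```

Because each chunk lands at its own offset, the receiver does not have to hold later packets while waiting for earlier ones; it only needs to know when every chunk has arrived.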

The packet load-balancing view is:

| Mechanism | Load-Balancing Unit | Main Benefit | Main Burden |
| --- | --- | --- | --- |
| Per-flow ECMP | Flow | Simple ordering | Elephant flow can be stuck on one path |
| Flowlet DLB | Burst or flowlet | Better distribution with lower reordering risk | Inactivity timer must match the workload |
| Per-packet spraying | Packet | High link utilization | Receiver must handle reordering |
| MRC | Packets within one reliable transfer | One transfer can use many paths safely | Transport must handle placement, reliability, and path health |

MRC also adapts away from congested or failed paths. If a path looks unhealthy, MRC can stop using it, retransmit affected data, and probe for recovery. It also uses packet trimming: when congestion would otherwise cause a drop, a switch can trim the payload and forward header information so the destination can request retransmission more explicitly.

The practical takeaway is that MRC is not just “turning on per-packet mode.” It combines packet spraying, out-of-order-safe delivery, multipath reliability, path failure handling, and congestion signaling into one transport-level design.


| Feature | SLB | DLB | GLB | TELB |
| --- | --- | --- | --- | --- |
| Packet or flow header hash | Yes | Yes | Yes | Policy-dependent |
| Local link bandwidth awareness | No | Yes | Yes | Optional |
| Queue size awareness | No | Yes | Yes | Optional |
| Remote link quality | No | No | Yes | Optional |
| RoCEv2 BTH fields | Possible | Possible | Possible | Useful |
| Fabric telemetry | No | Local | Remote heartbeat | Controller or policy |
| Path determinism | Low | Medium | Medium | High |
| Reordering risk | Low | Low to medium | Low to medium | Usually controlled |
| Operational complexity | Low | Medium | High | High |
| Industry adoption | High | Medium | Lower | Lower |

Recommended mental model:

| Workload or Fabric Condition | Likely Mechanism |
| --- | --- |
| Many small diverse flows | SLB may be enough |
| AI training with local link imbalance | DLB, especially flowlet mode |
| Congestion hidden behind spines | GLB |
| Multi-tenant fabric needing predictable path policy | TELB or BGP-DPF |
| One elephant flow must use many links | Selective per-packet spraying |

Before relying on a load-balancing design, validate it under AI-like traffic.

Checklist:

  • Confirm which fields are used in the hash: 5-tuple, VLAN/VNI, RoCEv2 BTH, QP.
  • Measure per-link utilization across leaf-spine and spine-leaf links.
  • Check whether elephant flows collide on the same ECMP member.
  • Test DLB quality table behavior under mixed 50G, 100G, 200G, 400G, or 800G flows.
  • Tune flowlet inactivity timers against actual collective traffic gaps.
  • Validate packet reordering counters on NICs and DPUs.
  • Verify ECN, CNP, PFC, and DCQCN behavior during congestion.
  • Test link and node failure convergence.
  • Confirm whether GLB heartbeat scale and frequency are acceptable.
  • For TELB, validate tenant/job policy, backup path behavior, and controller failure behavior.
  • Run workload-level tests such as NCCL collectives, all-to-all patterns, storage synchronization, and checkpoint bursts.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef observe fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef test fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef signal fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fix fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    H[Inspect hash fields<br/>and ASIC support]:::observe
    W[Generate AI-like workload<br/>NCCL, RDMA, checkpoint]:::test
    L[Observe link and queue balance]:::signal
    R[Observe reordering<br/>ECN, CNP, PFC, DCQCN]:::signal
    D{Does fabric behavior<br/>match workload target?}:::decision
    A[Accept operating envelope]:::observe
    F[Retune hash, DLB timer,<br/>GLB, TELB, or spraying policy]:::fix

    H --> W --> L --> R --> D
    D -->|Yes| A
    D -->|No| F
    F --> W

Efficient load balancing is central to Ethernet AI fabric performance.

The main takeaways:

  • AI training traffic often has low entropy and large elephant flows.
  • ECMP provides multiple paths, but default per-flow hashing can create hot spots.
  • SLB is simple and common, but it does not understand real-time link utilization or queue depth.
  • Resilient, symmetric, and weighted hashing improve SLB operations but do not fully solve AI low-entropy traffic.
  • DLB uses local link quality and queue state to improve local path choice.
  • Flowlet mode is often a practical compromise because it improves path distribution while limiting packet reordering.
  • GLB adds remote link quality and topology awareness, helping avoid congestion beyond the first hop.
  • BGP NNHN and GLB heartbeats are mechanisms for distributing topology and path quality information.
  • TELB provides deterministic path control for tenants, jobs, GPUs, QPs, or port ranges.
  • Per-packet load balancing can maximize link utilization but requires careful handling of packet reordering.
  • Selective packet spraying is more practical than spraying all traffic because it can be limited to RDMA operations and NICs that support reordering.
  • MRC shows a modern transport-level approach where packet spraying, out-of-order-safe delivery, reliability, and path health are designed together.

| Term | Meaning |
| --- | --- |
| ECMP | Equal-Cost Multipathing; forwarding across multiple equal-cost next hops |
| SLB | Static Load Balancing; hash-based path selection without real-time congestion awareness |
| DLB | Dynamic Load Balancing; path choice based on local link and queue quality |
| GLB | Global Load Balancing; path choice using local and remote link quality plus topology |
| TELB | Traffic Engineering-Based Load Balancing; policy-driven path control |
| BGP-DPF | BGP Deterministic Path Forwarding; BGP-based mechanism for deterministic path selection |
| NNHN | Next-Next-Hop Nodes; BGP capability for signaling nodes behind a next hop |
| Flow entropy | Header variation available for hashing |
| Elephant flow | Large bandwidth-heavy flow |
| Mouse flow | Small short-lived flow |
| Flowlet | A burst within a flow separated by an idle gap |
| Inactivity timer | Timer used to decide whether the next burst can be reassigned |
| Packet spraying | Sending packets from the same flow across multiple paths |
| MRC | Multipath Reliable Connection; OpenAI’s RoCE-based transport that sprays packets across many paths and handles placement, reliability, and path health |
| RoCEv2 BTH | RoCEv2 Base Transport Header |
| QP | RDMA Queue Pair |
| TCAM | Ternary Content-Addressable Memory used for fast match rules |
| DCQCN | Data Center Quantized Congestion Notification |
| PFC | Priority Flow Control |
| CNP | Congestion Notification Packet |

1. What is the role of ECMP in AI/ML data center fabrics?

ECMP gives the fabric multiple equal-cost paths between leaf switches, usually through several spine switches. The control plane, such as BGP or an IGP, installs equal-cost next hops, and the switch ASIC maps packets or flows to one of those next hops using selected packet fields.

ECMP should be treated as the baseline multipathing mechanism, not the complete AI load-balancing solution. It gives the network path diversity, but it does not automatically guarantee bandwidth balance. If several large RoCEv2 flows hash to the same spine, the fabric can have one hot link and several idle links even though the topology looks non-blocking on paper.

The important point is that ECMP balances hash buckets, not bytes. That distinction matters in AI fabrics because a few elephant flows can dominate total bandwidth. ECMP is therefore necessary, but AI clusters usually need better entropy, DLB, GLB, TELB, or selective spraying on top of basic ECMP behavior.
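The difference between balancing buckets and balancing bytes can be shown with a small sketch. The flow list and the `flow_id % num_links` mapping are assumptions chosen for illustration; a real ASIC hash behaves differently, but the effect is the same whenever elephants share a member.

```python
from collections import defaultdict


def per_link_load(flows, num_links):
    """Return (flow counts, total Gbps) per link for a static flow-to-link mapping.

    flows: list of (flow_id, gbps). The link is flow_id % num_links,
    a stand-in for a real ASIC hash.
    """
    counts = defaultdict(int)
    gbps = defaultdict(float)
    for flow_id, rate in flows:
        link = flow_id % num_links
        counts[link] += 1
        gbps[link] += rate
    return ([counts[l] for l in range(num_links)],
            [gbps[l] for l in range(num_links)])


# Two flows per link, but both elephants land on link 0.
flows = [(0, 200.0), (4, 200.0), (1, 1.0), (5, 1.0),
         (2, 1.0), (6, 1.0), (3, 1.0), (7, 1.0)]
counts, gbps = per_link_load(flows, num_links=4)
print(counts)  # [2, 2, 2, 2]
print(gbps)    # [400.0, 2.0, 2.0, 2.0]
```

The flow count per link is perfectly even, while one link carries almost all of the traffic.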

2. Why is static load balancing often insufficient for AI workloads?

SLB assigns flows based on header hashes. It does not consider flow size, link utilization, queue depth, or remote congestion.

That is acceptable when there are many small flows with diverse headers. It becomes weak when the workload has a small number of synchronized, high-bandwidth RoCEv2 flows. In that case, the number of flows may look balanced while the number of bits per second is badly skewed.

The key factors are entropy and elephant flows. RoCEv2 traffic often shares common fields, such as UDP destination port 4791, and may not expose enough useful entropy to a default 5-tuple hash. If two or three elephant flows collide on the same ECMP member, the impact is not a minor statistical imbalance. It can create queue buildup, ECN/CNP activity, PFC pressure, and longer collective completion time.
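To see why a handful of elephant flows collide so easily, the sketch below computes the probability that at least two flows share an ECMP member. It assumes an idealized hash that places each flow uniformly and independently; `collision_probability` is a hypothetical helper, and real low-entropy traffic can behave worse than this model.

```python
from fractions import Fraction


def collision_probability(num_flows: int, num_members: int) -> Fraction:
    """P(at least two flows share a member) under uniform independent hashing."""
    if num_flows > num_members:
        return Fraction(1)
    no_collision = Fraction(1)
    for i in range(num_flows):
        no_collision *= Fraction(num_members - i, num_members)
    return 1 - no_collision


for flows in (2, 3, 4):
    print(flows, float(collision_probability(flows, num_members=8)))
# 2 0.125
# 3 0.34375
# 4 0.58984375
```

With only four elephant flows over eight equal-cost members, a collision is more likely than not, even with a perfect hash.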

So the problem with SLB is not that hashing is wrong. The problem is that static hashing is blind to traffic volume and fabric state.

3. How does DLB improve on SLB?

DLB adds local link quality information. Instead of relying only on the hash, the switch can consider local interface utilization, queue depth, buffer state, and recent usage.

This allows the switch to steer new flows, flowlets, or in some modes packets toward less congested local ECMP members. The practical improvement is that the ingress leaf is no longer completely blind. If one uplink has a deep queue and another uplink is clean, DLB can prefer the cleaner member.
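A minimal sketch of a local quality decision, assuming quality is a weighted mix of utilization and queue depth and that the switch only overrides the hash when another member is clearly better. The `Uplink` fields, the weights, and the threshold are illustrative assumptions, not vendor defaults.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    utilization: float   # 0.0 to 1.0, recent link load
    queue_depth: float   # 0.0 to 1.0, fraction of the queue in use


def quality(link: Uplink, w_util: float = 0.5, w_queue: float = 0.5) -> float:
    """Higher is better. The weights are illustrative only."""
    return 1.0 - (w_util * link.utilization + w_queue * link.queue_depth)


def pick_uplink(links: list[Uplink], hashed_index: int, threshold: float = 0.1) -> Uplink:
    """Keep the hashed member unless another member is clearly better."""
    hashed = links[hashed_index % len(links)]
    best = max(links, key=quality)
    if quality(best) - quality(hashed) > threshold:
        return best
    return hashed


links = [Uplink("to-spine-a", 0.9, 0.8), Uplink("to-spine-b", 0.2, 0.1)]
print(pick_uplink(links, hashed_index=0).name)  # to-spine-b
```

The threshold is a design choice in this sketch: it avoids moving traffic for marginal differences, which would add churn without much benefit.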

The limitation is equally important. DLB usually sees the local leaf-to-spine condition well, but it may not see the downstream condition behind the spine. A path can look healthy from the ingress leaf to the spine while the spine-to-egress-leaf link is congested. That is why DLB improves local utilization but does not fully solve end-to-end path quality.

In practice, I would describe DLB as the first move from static hashing toward state-aware forwarding. It is useful, but it still has a local view.

4. Why is flowlet mode useful for AI training traffic?

Flowlet mode uses natural pauses between bursts as reassignment points. If the idle gap exceeds the inactivity timer, the next burst can move to a better path.

This is useful because AI training often alternates compute and communication phases. Collective operations can create bursts separated by short idle gaps. Flowlet mode uses those gaps as safer boundaries for moving traffic to a different path.

The key trade-off is between load distribution and packet ordering. Per-packet spraying gives the most aggressive link utilization, but it can reorder packets. Pure per-flow hashing preserves ordering, but it can create hot spots. Flowlet mode sits between those extremes. It can move a later burst to a better path while reducing the chance that packets from the same burst arrive out of order.

Flowlet mode is a practical compromise. It is not magic; the inactivity timer must match the workload. If the timer is too short, the fabric may reorder packets. If it is too long, the mechanism behaves too much like static per-flow hashing.
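The inactivity-timer logic can be sketched as follows. `FlowletBalancer` is a hypothetical class, the timer value is arbitrary, and "least-loaded path" stands in for whatever quality metric the switch actually uses.

```python
class FlowletBalancer:
    def __init__(self, num_paths: int, inactivity_timer_us: float):
        self.num_paths = num_paths
        self.timer = inactivity_timer_us
        self.state = {}  # flow_id -> (path, last_seen_us)

    def route(self, flow_id, now_us: float, path_load: list[float]) -> int:
        entry = self.state.get(flow_id)
        if entry is not None and now_us - entry[1] <= self.timer:
            path = entry[0]   # same flowlet: stay on the current path
        else:
            # New flow, or the idle gap exceeded the timer: pick the least-loaded path.
            path = min(range(self.num_paths), key=lambda p: path_load[p])
        self.state[flow_id] = (path, now_us)
        return path


lb = FlowletBalancer(num_paths=2, inactivity_timer_us=50.0)
print(lb.route("qp1", 0.0, [0.1, 0.9]))    # 0: first packet, path 0 is lighter
print(lb.route("qp1", 10.0, [0.9, 0.1]))   # 0: gap of 10 us, flowlet stays put
print(lb.route("qp1", 100.0, [0.9, 0.1]))  # 1: gap of 90 us, new flowlet moves
```

The second call shows the cost of the ordering guarantee: path 0 is already the worse choice, but the flowlet stays there until an idle gap makes a move safe.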

5. What problem does GLB solve?

GLB solves the problem where local path choice looks good but the downstream path is congested. For example, several leaves may choose the same spine, and that spine may have a congested downlink to the egress leaf.

The classic example is two ingress leaves sending traffic to the same egress leaf. Each ingress leaf may independently choose Spine A because its local uplink to Spine A looks good. But Spine A may have a congested downlink to the egress leaf. Local DLB cannot see that full picture, so it keeps choosing a path that is locally good and globally bad.

GLB adds remote link quality and topology information. Instead of asking only “which local uplink is good?”, the ingress leaf can ask “which end-to-end path toward the egress leaf is good?” That is the real value of GLB.
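The contrast between the two questions can be sketched in a few lines. Scoring a path by its weakest segment is a simplifying assumption made here for illustration, and the quality numbers are invented; actual GLB implementations may combine local and remote quality differently.

```python
def dlb_choice(local_quality: dict[str, float]) -> str:
    """Pick the spine with the best local uplink quality (higher is better)."""
    return max(local_quality, key=local_quality.get)


def glb_choice(local_quality: dict[str, float],
               remote_quality: dict[str, float]) -> str:
    """Score each path by its weakest segment, then pick the best path."""
    return max(local_quality,
               key=lambda s: min(local_quality[s], remote_quality[s]))


local = {"spine-a": 0.9, "spine-b": 0.7}    # leaf-to-spine uplinks
remote = {"spine-a": 0.2, "spine-b": 0.8}   # spine-to-egress-leaf downlinks
print(dlb_choice(local))           # spine-a
print(glb_choice(local, remote))   # spine-b
```

The local view prefers Spine A, while the end-to-end view avoids it because its downlink toward the egress leaf is the bottleneck.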

The key point is that GLB attacks hidden downstream congestion. It is especially relevant in Clos fabrics where many ingress leaves can converge on the same spine-to-leaf segment. If GLB works well, it can reduce queue buildup before ECN, CNP, or PFC become the main line of defense.

6. What are BGP NNHN and GLB heartbeats used for?

BGP NNHN tells a switch which next-next-hop nodes are behind a next hop. In a Clos fabric, this helps a leaf understand which egress leaf is behind a spine.

GLB heartbeats carry fast path quality information at the forwarding level. Together, topology information and heartbeat quality let the ASIC build path quality profiles and choose better paths.

The clean way to explain this is:

| Signal | Question Answered |
| --- | --- |
| BGP NNHN | Where can this next hop take me? |
| GLB heartbeat | How healthy is that path right now? |
| GLB decision | Which path should this flow or flowlet use? |

For example, Leaf A may know that Spine A, Spine B, and Spine C can all reach a GPU prefix. NNHN-like topology signaling tells Leaf A which egress leaf sits behind those spines. Heartbeats then tell Leaf A whether the relevant downstream path is healthy or degraded.

It is important to separate the two. NNHN is not the congestion signal. It is topology context. Heartbeats are not the routing protocol. They are fast path-quality signals. GLB needs both.
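A small sketch of how the two inputs might be combined into a path quality profile. The dictionaries below are hypothetical data structures, not a representation of the BGP capability encoding or of any heartbeat packet format, and the quality values are invented.

```python
# Topology context (NNHN-like): which egress leaves sit behind each spine.
nnhn = {
    "spine-a": {"leaf-3", "leaf-4"},
    "spine-b": {"leaf-3", "leaf-4"},
    "spine-c": {"leaf-4"},
}

# Heartbeat state: quality of each (spine, egress leaf) downlink, higher is better.
heartbeat = {
    ("spine-a", "leaf-3"): 0.3,
    ("spine-b", "leaf-3"): 0.9,
    ("spine-a", "leaf-4"): 0.8,
    ("spine-b", "leaf-4"): 0.6,
    ("spine-c", "leaf-4"): 0.7,
}


def path_profile(egress_leaf: str) -> dict[str, float]:
    """Combine topology and heartbeats into per-spine quality toward one leaf."""
    return {
        spine: heartbeat[(spine, egress_leaf)]
        for spine, leaves in nnhn.items()
        if egress_leaf in leaves and (spine, egress_leaf) in heartbeat
    }


def best_spine(egress_leaf: str) -> str:
    profile = path_profile(egress_leaf)
    return max(profile, key=profile.get)


print(best_spine("leaf-3"))  # spine-b
print(best_spine("leaf-4"))  # spine-a
```

Topology decides which spines are candidates at all; heartbeats only rank the candidates. Neither table is useful to the decision without the other.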

7. When is TELB useful?

TELB is useful when the operator needs deterministic path behavior. It can pin tenant, job, GPU, QP, or port-range traffic to specific path colors or spine sets.

This is especially useful in multi-tenant AI fabrics where predictable job performance and isolation matter. For example, one tenant or training job can be assigned to one set of spine paths, while another tenant uses a different path color. A storage or checkpoint traffic class may also be steered differently from GPU collective traffic.

The difference from DLB or GLB is intent. DLB and GLB are dynamic mechanisms that react to quality signals. TELB is more policy-driven. It lets the operator express “this class of traffic should use this logical path set” rather than leaving every choice to hashing and local quality.

The trade-off is operational complexity. TELB needs clean policy, match criteria, backup path behavior, and failure handling. If the policy is wrong, the network can become less balanced, not more balanced. I would use TELB when isolation, predictability, or tenant control is more important than maximum automatic path freedom.
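A minimal sketch of policy-driven path selection with backup behavior. `Rule`, `select_spines`, and the first-match semantics are assumptions for illustration; real TELB implementations define their own match criteria and failure handling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    tenant: Optional[str]       # None matches any tenant
    qp_range: Optional[range]   # None matches any QP
    primary: tuple[str, ...]    # preferred spine set ("path color")
    backup: tuple[str, ...]     # used when no primary spine is up


def select_spines(rules, tenant, qp, up_spines, default):
    """First matching rule wins; fall back to backup, then to the default set."""
    for r in rules:
        if r.tenant not in (None, tenant):
            continue
        if r.qp_range is not None and qp not in r.qp_range:
            continue
        for candidate in (r.primary, r.backup):
            alive = tuple(s for s in candidate if s in up_spines)
            if alive:
                return alive
        break   # the matched rule has no live path; use the default set
    return tuple(s for s in default if s in up_spines)


rules = [
    Rule("tenant-a", None, ("spine-1", "spine-2"), ("spine-3",)),
    Rule("tenant-b", range(100, 200), ("spine-3", "spine-4"), ("spine-1",)),
]
all_spines = ("spine-1", "spine-2", "spine-3", "spine-4")
print(select_spines(rules, "tenant-a", 5, set(all_spines), all_spines))
# ('spine-1', 'spine-2')
print(select_spines(rules, "tenant-a", 5, {"spine-3", "spine-4"}, all_spines))
# ('spine-3',)
```

The second call shows why backup behavior needs explicit validation: when the primary set fails, tenant-a lands on a spine that tenant-b also uses, so isolation silently weakens.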

8. What is the main trade-off of per-packet load balancing?

Per-packet load balancing can spread a single elephant flow across many ECMP members and achieve excellent link utilization.

The trade-off is packet reordering. Packets sent across different spines can arrive in a different order because the paths may have different queueing, serialization, or congestion conditions. The receiver NIC, DPU, RDMA stack, or application must be able to buffer and reorder safely.
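The reordering effect can be reproduced with a toy model. It assumes round-robin spraying and a fixed one-way delay per path, which is a simplification: real paths have time-varying queueing. `spray_and_count_reordering` is a hypothetical helper.

```python
def spray_and_count_reordering(num_packets, path_delay_us, send_gap_us=1.0):
    """Round-robin packets over paths; count packets that arrive behind a later one."""
    arrivals = []
    for seq in range(num_packets):
        path = seq % len(path_delay_us)
        arrivals.append((seq * send_gap_us + path_delay_us[path], seq))
    arrivals.sort()   # order seen by the receiver

    out_of_order = 0
    highest_seen = -1
    for _, seq in arrivals:
        if seq < highest_seen:
            out_of_order += 1   # a later packet already arrived
        highest_seen = max(highest_seen, seq)
    return out_of_order


print(spray_and_count_reordering(8, [5.0, 5.0]))  # 0: equal delays keep order
print(spray_and_count_reordering(8, [5.0, 9.0]))  # 3: a 4 us skew reorders packets
```

Even a few microseconds of delay skew between two paths is enough to reorder packets sent 1 us apart, which is why the receiver's reordering capability matters more than the spraying algorithm itself.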

In AI fabrics, this is a serious design question, not a minor implementation detail. Some traffic and operations may tolerate reordering well if the NIC or transport has the right support. Other traffic may suffer from retransmission, buffering pressure, latency spikes, or reduced throughput.

Per-packet mode is powerful, but it moves complexity to the receiver and validation process. It should not be enabled broadly just because it improves link utilization in a synthetic test. Reordering counters, NIC behavior, tail latency, and JCT should be validated under real NCCL or RDMA workloads.

9. Why is selective packet spraying safer than spraying all traffic?

Selective packet spraying applies per-packet mode only to traffic that matches safe criteria, such as specific RoCEv2 opcodes, QP ranges, or traffic classes.

This lets the operator use packet spraying where the NIC can handle reordering, while leaving other traffic on SLB, DLB, or flowlet mode. It is safer because it acknowledges that not every packet has the same ordering tolerance.

For RoCEv2, the useful idea is to look beyond the basic IP and UDP header. A RoCE-aware device may use BTH fields, QP information, opcode information, or traffic class policy to decide whether a flow is eligible for more aggressive balancing.

Selective spraying is a form of risk containment. The network gets some of the bandwidth benefit of per-packet distribution without turning the entire fabric into a reordering experiment. The hard requirement is evidence: the selected traffic class must be tested on the actual NICs, firmware, drivers, and application stack.
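A minimal sketch of an eligibility check for selective spraying. The opcode names are symbolic, and the opcode set, QP range, and traffic class are placeholder policy values chosen for illustration, not recommendations; any real policy has to come from testing on the actual NICs and firmware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RocePacket:
    opcode: str          # symbolic BTH opcode name, e.g. "RDMA_WRITE_MIDDLE"
    dest_qp: int
    traffic_class: int


# Hypothetical policy: the values below are placeholders.
SPRAY_OPCODES = {"RDMA_WRITE_FIRST", "RDMA_WRITE_MIDDLE",
                 "RDMA_WRITE_LAST", "RDMA_WRITE_ONLY"}
SPRAY_QPS = range(0x1000, 0x2000)
SPRAY_TRAFFIC_CLASSES = {3}


def balancing_mode(pkt: RocePacket) -> str:
    """Spray only traffic that matches every criterion; default to flowlet mode."""
    if (pkt.opcode in SPRAY_OPCODES
            and pkt.dest_qp in SPRAY_QPS
            and pkt.traffic_class in SPRAY_TRAFFIC_CLASSES):
        return "per-packet"
    return "flowlet"


print(balancing_mode(RocePacket("RDMA_WRITE_MIDDLE", 0x1234, 3)))  # per-packet
print(balancing_mode(RocePacket("SEND_ONLY", 0x1234, 3)))          # flowlet
```

Requiring all criteria to match, with a conservative default, keeps unvalidated traffic out of per-packet mode.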

10. How should load-balancing mechanisms be chosen?

Start from workload behavior and hardware capability.

If traffic has high entropy and many small flows, SLB may be enough. If AI training traffic creates local imbalance, DLB and flowlet mode are usually the next mechanisms to evaluate. If congestion appears beyond the first hop, GLB becomes relevant because the ingress leaf needs remote path-quality information. If the fabric is multi-tenant or needs predictable path control, TELB can add policy-based steering. If one elephant flow must use many links, selective packet spraying may be useful, but only after validating reordering support.

I would not choose the mechanism from a feature checklist. I would choose it from observed failure modes:

| Symptom | Likely Direction |
| --- | --- |
| Hash collisions among elephant flows | Better hash inputs, DLB, or flowlet mode |
| Local uplink imbalance | DLB |
| Downstream spine-to-leaf congestion | GLB |
| Tenant or job interference | TELB or path coloring |
| Single flow cannot fill enough paths | Selective packet spraying |

The practical answer is to start simple, measure, and then add sophistication where the data proves it is needed. Every mechanism adds its own operational cost: timers, telemetry, policy, heartbeat scale, reordering validation, or vendor-specific behavior.