Chapter 8: IP Routing for AI/ML Fabrics

This chapter explains routing choices for AI/ML data center fabrics.

The core idea is:

AI fabric routing is no longer just reachability. It must support scale, fast convergence, path diversity, traffic engineering, tenant isolation, and workload-aware forwarding.

The chapter focuses on these topics:

  • eBGP underlay for three-stage and five-stage Clos fabrics
  • BGP unnumbered using IPv6 link-local next hops
  • BGP ASN allocation and AS-PATH behavior
  • BGP ADD-PATH and BGP Link Bandwidth Extended Community
  • BGP Deterministic Path Forwarding, DPF
  • RIFT for fat-tree and Dragonfly-style fabrics
  • IS-IS flood optimization and FlexAlgo
  • EVPN-VXLAN and server/GPU-level multi-tenancy
  • Extending routing to GPU servers
  • Controller-driven traffic engineering
  • Segment Routing, SRv6, and uSID

Routing map for AI fabrics

Earlier chapters covered topology, load balancing, and congestion control. Routing ties those pieces together.

In AI fabrics, the routing protocol must answer more than “Can this prefix be reached?”

It also affects:

  • How quickly the fabric converges after a link or node failure
  • Whether ECMP has enough usable next hops
  • Whether unequal link speeds can be represented
  • Whether workload or tenant traffic can be pinned to selected paths
  • Whether multi-tenant overlays can be signaled
  • Whether fabric topology can be exposed to controllers or adaptive routing
  • Whether GPU servers can participate in routing directly

AI data centers often separate backend and frontend concerns.

| Domain | Main Traffic | Common Routing Style |
| --- | --- | --- |
| Backend training fabric | GPU/NIC east-west RDMA, RoCEv2 | Native IP eBGP, BGP-DPF, GLB, BGP link bandwidth, RIFT, IS-IS FlexAlgo |
| Frontend / inference fabric | User traffic, API serving, storage, tenant services | eBGP underlay plus EVPN-VXLAN overlay |
| Storage domain | Checkpoints, data loading, object or block storage | RoCEv2, NVMe/TCP, iSCSI, or other IP storage designs |

The same routing protocol can be used in both domains, but with different features. For example, backend eBGP may use weighted ECMP and BGP Link Bandwidth Extended Community, while frontend eBGP may carry EVPN-VXLAN services for tenant segmentation.

The chapter compares traditional and emerging routing choices.

| Protocol | Basic Type | Strength | Concern |
| --- | --- | --- | --- |
| eBGP | Path-vector | Scale, policy, multi-vendor adoption, loop prevention | Convergence and topology awareness need tuning |
| OSPF | Link-state IGP | Familiar, hierarchical, fast enough in many networks | Flooding, complexity, and less data center traction |
| IS-IS | Link-state IGP | Scalable, extensible TLVs, FlexAlgo, link-local behavior | Less common in enterprise/data center operations |
| RIFT | Fat-tree optimized IGP | Designed for Clos/fat-tree, fast convergence, disaggregation | Newer and less mature than BGP |
| Segment Routing | Source-routing architecture | Explicit path control and controller-driven TE | Header overhead, hardware support, operational complexity |

There is no universal best routing protocol. The right choice depends on topology, scale, operations, vendor support, convergence target, and traffic engineering needs.


eBGP is widely used for large data center fabrics, including AI fabrics, because it scales well and has strong policy controls.

Typical AI fabric use:

  • Three-stage Clos: leaf - spine - leaf
  • Five-stage Clos: leaf - spine - super-spine - spine - leaf
  • Backend native IP fabric for GPU RDMA traffic
  • Frontend underlay for EVPN-VXLAN overlays

Benefits:

  • Proven at cloud scale
  • Clear loop prevention through AS-PATH
  • Strong route policy controls
  • Works across vendors
  • Supports Clos, Dragonfly-like, and full-mesh variations
  • Can carry additional service families such as EVPN
  • Can be automated with per-rack or per-switch ASN plans

Limitations:

  • BGP was not originally designed as a link-state fabric protocol.
  • Native BGP has limited link/queue awareness.
  • Convergence can be slower than some IGPs.
  • Large topologies require careful ASN and policy design.
  • Advanced AI features often require extensions or route policies.

BGP unnumbered simplifies leaf-spine peering by avoiding per-link IPv4 addressing.

The idea:

  • Interfaces use IPv6 link-local addresses.
  • IPv6 Neighbor Discovery discovers neighbors.
  • BGP establishes TCP sessions over link-local addresses.
  • IPv4 prefixes can be advertised with IPv6 link-local next hops.
  • Extended Next Hop Encoding, RFC 5549 (since obsoleted by RFC 8950), allows this behavior.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Leaf as Leaf
    participant Spine as Spine

    Leaf->>Spine: IPv6 Neighbor Solicitation
    Spine-->>Leaf: IPv6 Neighbor Advertisement
    Spine-->>Leaf: Router Advertisement / link-local info
    Leaf->>Spine: TCP SYN to BGP port 179
    Spine-->>Leaf: TCP SYN/ACK
    Leaf->>Spine: BGP OPEN with capabilities
    Spine-->>Leaf: BGP OPEN / KEEPALIVE
    Leaf->>Spine: BGP UPDATE: IPv4 NLRI, IPv6 link-local next hop
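
As a rough illustration of the last step in the exchange, the sketch below models an UPDATE that carries an IPv4 prefix with an IPv6 link-local next hop. The `Update` class, the `is_valid_unnumbered` check, and the interface name `swp1` are hypothetical and exist only for this example; it uses nothing beyond Python's standard `ipaddress` module.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Toy model of a BGP UPDATE learned on an unnumbered session."""
    nlri: ipaddress.IPv4Network       # IPv4 prefix being advertised
    next_hop: ipaddress.IPv6Address   # peer's IPv6 link-local address
    interface: str                    # link-local next hops only make sense per interface


def is_valid_unnumbered(update: Update) -> bool:
    # The next hop must fall in fe80::/10 and be tied to a local interface.
    return update.next_hop.is_link_local and bool(update.interface)


update = Update(
    nlri=ipaddress.ip_network("10.1.1.0/24"),
    next_hop=ipaddress.ip_address("fe80::1"),
    interface="swp1",  # hypothetical interface name
)
print(is_valid_unnumbered(update))
```

The interface field matters because the same link-local address can appear on many links, so the forwarding entry has to record which link the next hop lives on.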

Operational benefit:

  • Fewer point-to-point addresses to allocate
  • Easier automation
  • Less per-link configuration
  • Good fit for fabrics with many leaf-spine links

In eBGP, each hop normally rewrites the next hop to itself and prepends its own ASN to the AS-PATH.

Example:

  1. Leaf 1 originates a GPU server prefix.
  2. Spine receives it and advertises it to Leaf 2.
  3. Spine rewrites the next hop to itself.
  4. The spine prepends its own ASN to the AS-PATH, so Leaf 2 sees the spine's ASN followed by Leaf 1's ASN.

This behavior helps loop prevention and policy, but it must be understood when designing deterministic forwarding, maintenance policies, and failure handling.
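
The four steps can be sketched as a small function. This is a simplified model, not a BGP implementation: the `Route` class, the `advertise_ebgp` helper, the ASN values, and the node names used as next hops are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    as_path: tuple[int, ...]


def advertise_ebgp(route: Route, local_asn: int, local_addr: str) -> Route:
    """Return the route as an eBGP speaker would send it to an external peer."""
    # Next hop is rewritten to the sender; the sender's ASN is prepended.
    return replace(route, next_hop=local_addr, as_path=(local_asn,) + route.as_path)


# Hypothetical ASNs: Leaf 1 = 65001, spine = 65100.
# Leaf 1 originated the prefix, so its ASN is already in the path the spine holds.
at_spine = Route("10.1.1.0/24", next_hop="leaf1", as_path=(65001,))
at_leaf2 = advertise_ebgp(at_spine, local_asn=65100, local_addr="spine1")
print(at_leaf2.next_hop, at_leaf2.as_path)
```

Leaf 2 ends up with the spine as next hop and an AS-PATH that reads spine first, then Leaf 1.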

ASN design is important in AI fabrics.

Common patterns:

| Pattern | Description | Benefit | Risk |
| --- | --- | --- | --- |
| Same ASN for all spines, unique ASN per leaf | Common eBGP Clos design | Simple loop prevention and origin identification | Requires planning for AS-PATH expectations |
| Unique ASN per spine and leaf | More granular identity | More explicit topology identity | Can create suboptimal forwarding under failures |
| iBGP with route reflectors | Same ASN across fabric | Avoids eBGP next-hop rewrite | Less common for cloud-style data center fabrics |

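The chapter emphasizes that using one shared ASN for the spines and a unique ASN for each ToR/leaf switch helps avoid suboptimal forwarding after failures. BGP loop prevention makes a router reject any route whose AS-PATH already contains its own ASN, so a spine never accepts a path that has already crossed another spine and been re-advertised upward by a leaf.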
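
A minimal sketch of that check follows, with made-up ASN values. It compares a spine that shares the common spine ASN against a spine that has its own ASN when both receive a path that already went leaf, spine, leaf.

```python
def accepts(receiver_asn: int, as_path: tuple[int, ...]) -> bool:
    """eBGP loop prevention: drop routes whose AS-PATH contains our own ASN."""
    return receiver_asn not in as_path


SPINE_ASN = 65100            # shared by every spine (hypothetical value)
LEAF1, LEAF2 = 65001, 65002  # unique per leaf (hypothetical values)

# AS-PATH as Spine 2 would see it if Leaf 2 re-advertised a route learned
# from Spine 1: Leaf 2, then the spine ASN, then the originating Leaf 1.
via_other_spine = (LEAF2, SPINE_ASN, LEAF1)

print(accepts(SPINE_ASN, via_other_spine))  # Spine 2 using the shared spine ASN
print(accepts(65102, via_other_spine))      # Spine 2 using its own unique ASN
```

With the shared ASN the detour path is rejected; with a unique ASN per spine it is accepted and can become a long backup path after a failure.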

BGP normally advertises only the best path for a prefix. BGP ADD-PATH allows more than one path to be advertised.

Why this matters:

  • More path diversity
  • Better redundancy
  • Better multipath visibility
  • Useful when route reflectors or policy otherwise hide alternate paths

In AI fabrics, ADD-PATH can help preserve usable paths for GPU traffic where link diversity and rapid recovery matter.
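
The difference between best-path-only and ADD-PATH advertisement can be sketched as follows. The best-path ranking here is deliberately reduced to AS-PATH length with the next-hop name as a tie-breaker, which is an assumption for the example and much simpler than the real BGP decision process. The `advertise` function and its `add_path_limit` parameter are hypothetical names.

```python
def advertise(paths: list[dict], add_path_limit: int = 1) -> list[dict]:
    """Pick the paths to advertise for one prefix.

    Simplified ranking: shortest AS-PATH first, next-hop name breaks ties.
    add_path_limit=1 models classic BGP, which sends only the best path.
    """
    ranked = sorted(paths, key=lambda p: (len(p["as_path"]), p["next_hop"]))
    return ranked[:add_path_limit]


paths = [
    {"next_hop": "spine1", "as_path": [65100, 65001]},
    {"next_hop": "spine2", "as_path": [65100, 65001]},
    {"next_hop": "spine3", "as_path": [65100, 65001]},
]

print(len(advertise(paths)))                    # classic BGP
print(len(advertise(paths, add_path_limit=3)))  # ADD-PATH with three paths
```

In the first call the receiver learns a single next hop; in the second it sees all three and can use them for multipath or as pre-installed backups.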

AS-PATH strip and replace normalizes AS-PATH when private ASNs are used inside the data center and routes must be exchanged with a larger core network.

Uses:

  • Hide internal private ASN details from the core
  • Normalize path length
  • Avoid exposing tenant or fabric-internal ASN design
  • Support multi-tenant or multi-domain routing boundaries

This is mainly relevant at fabric borders or when connecting to a core IP network.
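
One possible reading of strip and replace is sketched below: remove private ASNs from the path, then prepend the border's own ASN. Actual vendor behavior and knob names differ, so treat `strip_and_replace` as a hypothetical helper. The private ASN ranges are those reserved by RFC 6996, and 64500 is a documentation ASN standing in for the border's public ASN.

```python
# Private-use ASN ranges (16-bit and 32-bit) reserved by RFC 6996.
PRIVATE_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private(asn: int) -> bool:
    return any(low <= asn <= high for low, high in PRIVATE_RANGES)


def strip_and_replace(as_path: list[int], border_asn: int) -> list[int]:
    """Drop private ASNs, then prepend the border ASN before sending to the core."""
    public = [asn for asn in as_path if not is_private(asn)]
    return [border_asn] + public


# Fabric-internal path (spine 65100, leaf 65001) leaving through a border
# that presents itself to the core as AS 64500.
print(strip_and_replace([65100, 65001], border_asn=64500))
```

The core sees a one-hop path regardless of how many private ASNs the prefix crossed inside the fabric.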

BGP Link Bandwidth Extended Community carries bandwidth information that can be used for weighted ECMP.

Example:

| Path | Link Bandwidth | Weighting Goal |
| --- | --- | --- |
| Path A | 400G | Lower share |
| Path B | 800G | Higher share |

If one next hop has twice the capacity, weighted ECMP can send more traffic to it. This is useful in mixed-speed fabrics or during transitions from 400G to 800G links.
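
The arithmetic behind weighted ECMP is a simple proportion. The sketch below assumes traffic is split strictly in proportion to advertised bandwidth; real hardware quantizes weights into a limited number of ECMP group members, so observed shares are approximate. The function name `ucmp_shares` is illustrative.

```python
from fractions import Fraction


def ucmp_shares(link_bw_gbps: dict[str, int]) -> dict[str, Fraction]:
    """Traffic share per next hop, proportional to advertised link bandwidth."""
    total = sum(link_bw_gbps.values())
    return {next_hop: Fraction(bw, total) for next_hop, bw in link_bw_gbps.items()}


shares = ucmp_shares({"path_a": 400, "path_b": 800})
for next_hop, share in shares.items():
    print(next_hop, share)
```

For the 400G and 800G example this gives one third of the traffic to Path A and two thirds to Path B.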

eBGP underlay path diversity in an AI fabric

Some designs require a minimum number of active BGP peers for a prefix to be considered usable.

Reason:

  • Avoid advertising or using a destination when too few paths remain.
  • Protect workload performance when ECMP width has collapsed.
  • Keep GPU jobs away from partially degraded fabric areas.

This is a routing-level guardrail for performance, not just reachability.
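
The guardrail reduces to a threshold check on how many next hops for a prefix are still backed by an established session. The sketch below is a hypothetical model; the function name, the peer names, and the threshold of two are assumptions.

```python
def usable(prefix_next_hops: set[str], established_peers: set[str], min_peers: int) -> bool:
    """Treat a prefix as usable only if enough of its next hops are still up."""
    active = prefix_next_hops & established_peers
    return len(active) >= min_peers


next_hops = {"spine1", "spine2", "spine3", "spine4"}

print(usable(next_hops, {"spine1", "spine2", "spine3"}, min_peers=2))  # one spine lost
print(usable(next_hops, {"spine1"}, min_peers=2))                      # ECMP width collapsed
```

A prefix that fails the check could be withdrawn or deprioritized so the scheduler and the fabric steer jobs elsewhere.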


BGP Deterministic Path Forwarding, BGP-DPF

BGP Deterministic Path Forwarding, BGP-DPF, provides deterministic path selection by associating traffic with logical fabric colors.

The goal is similar to traffic engineering:

  • Divide one physical fabric into logical fabrics.
  • Pin selected traffic to a fabric color.
  • Use GPU ID, QP, tenant, or SLA to choose a path.
  • Keep elephant flows predictable.
  • Improve isolation between tenants or workloads.

Deterministic routing planes for AI fabrics

DPF can be implemented through:

  • BGP communities
  • Colored route advertisement
  • Session coloring
  • Route policy
  • ASIC forwarding behavior that maps flow characteristics to a color

BGP-DPF can be applied in both ROD and RUD designs.

| Design | DPF Use |
| --- | --- |
| Rail-Optimized Design, ROD | Each rail can carry a specific logical fabric or tenant path |
| Rail-Unified Design, RUD | GPU ID, QP, or tenant can choose a logical fabric even when rails share leaf groups |

In a RUD design, DPF becomes especially useful because multiple GPU/NIC positions may share the same physical leaf. The fabric needs a logical way to keep selected flows separated.

DPF can color the BGP session or the routes carried over the session.

Session coloring:

  • A BGP peer/session belongs to a color such as black or gray.
  • Routes learned over that session are associated with that color.
  • Forwarding can select color-specific next hops.

Colored route advertisement:

  • Server or leaf advertises a prefix with color metadata.
  • Fabric uses that color to select the logical path.
  • A tenant or GPU workload can be mapped to a deterministic fabric.
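
Both coloring styles end in the same forwarding decision: map the flow to a color, then choose only among next hops of that color. The sketch below is a conceptual model, not the BGP-DPF specification. The tenant names, color tables, and the use of the QP number modulo the candidate count are assumptions chosen to make the selection deterministic.

```python
# Hypothetical policy tables: which color a tenant uses, and which next hops
# were learned over sessions (or with routes) of each color.
COLOR_BY_TENANT = {"tenant-black": "black", "tenant-gray": "gray"}
NEXT_HOPS_BY_COLOR = {
    "black": ["spine1", "spine2"],
    "gray": ["spine3", "spine4"],
}


def select_next_hop(tenant: str, qp: int) -> str:
    """Pick a next hop deterministically from the tenant's colored fabric."""
    color = COLOR_BY_TENANT[tenant]
    candidates = NEXT_HOPS_BY_COLOR[color]
    # Same tenant and QP always map to the same next hop (no per-packet hashing).
    return candidates[qp % len(candidates)]


print(select_next_hop("tenant-black", qp=7))
print(select_next_hop("tenant-gray", qp=4))
```

Traffic from the black tenant can never land on a gray next hop, which is the path-isolation property the coloring is meant to provide.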

BGP-DPF can correlate overlay tenants with underlay colors.

Example:

  • Tenant Black uses MAC-VRF Black.
  • Tenant Gray uses MAC-VRF Gray.
  • Tenant Black overlay is mapped to Underlay Fabric Black.
  • Tenant Gray overlay is mapped to Underlay Fabric Gray.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef tenant fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef overlay fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef underlay fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    TB[Tenant Black]:::tenant
    TG[Tenant Gray]:::tenant
    MB[MAC-VRF Black<br/>EVPN-VXLAN]:::overlay
    MG[MAC-VRF Gray<br/>EVPN-VXLAN]:::overlay
    UB[Underlay Color Black]:::underlay
    UG[Underlay Color Gray]:::underlay

    TB --> MB --> UB
    TG --> MG --> UG

This gives both tenant isolation and path isolation.


RIFT, Routing in Fat Trees, is an IGP designed for fat-tree and Clos-like data center topologies.

RIFT combines two propagation styles:

| Direction | Behavior |
| --- | --- |
| Northbound | Link-state flooding toward higher levels |
| Southbound | Distance-vector style routing toward lower levels |

Key benefits:

  • Designed for fat-tree topology
  • Fast convergence
  • Automatic disaggregation on failure
  • Wide ECMP and UCMP support
  • Topology awareness
  • Metadata advertisements
  • Better fit for large Clos than general-purpose IGP flooding
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef super fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    L1[Leaf 1]:::leaf
    L2[Leaf 2]:::leaf
    S1[Spine 1]:::spine
    S2[Spine 2]:::spine
    SS[Super-spine]:::super

    L1 -->|northbound link-state| S1
    L2 -->|northbound link-state| S2
    S1 -->|northbound| SS
    S2 -->|northbound| SS
    SS -->|southbound distance-vector| S1
    SS -->|southbound distance-vector| S2
    S1 -->|southbound| L1
    S2 -->|southbound| L2

The chapter discusses Dragonfly as a topology where groups connect to other groups through global links.

Benefits:

  • Reduced network diameter
  • Lower hop count
  • Lower latency potential
  • High path diversity
  • Useful for HPC-like topologies

Challenges:

  • Cabling complexity
  • Group-to-group link planning
  • Workload placement sensitivity
  • Need for topology-aware routing
  • Less familiar operational model than Clos

Dragonfly Sparse reduces some complexity by using leaf-spine Clos inside a group and connecting selected top-of-fabric nodes between groups.

RIFT is relevant here because it can encode topology information that traditional Clos-only assumptions may not support.


IS-IS is a link-state IGP. The chapter presents it as an alternative to BGP and RIFT for AI backend fabrics, especially where fast convergence, dense topology optimization, and logical path computation matter.

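Traditional link-state flooding can become expensive in dense topologies, because each update is re-flooded over many redundant adjacencies. IS-IS Optimal Distributed Flooding reduces this redundancy by propagating updates through a selected subset of neighbors rather than through all of them.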

Benefits:

  • Less control-plane flooding
  • Faster convergence in dense leaf-spine designs
  • Reduced LSP fragment processing
  • More useful as spine count grows to 32, 64, or more
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef chosen fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    L[Leaf announces new prefix]:::leaf
    S1[Spine 1]:::spine
    S2[Spine 2<br/>selected flood neighbor]:::chosen
    S3[Spine 3]:::spine

    L -.-> S1
    L ==>|flood update here| S2
    L -.-> S3

FlexAlgo lets IS-IS compute multiple logical topologies over the same physical fabric.

Each FlexAlgo is defined by a Flexible Algorithm Definition, FAD:

  • Calculation type
  • Metric type
  • Constraints

Example:

| FlexAlgo | Constraint Goal | Possible Workload |
| --- | --- | --- |
| Default | Normal shortest path | General traffic |
| 128 | Low latency | Latency-sensitive collectives |
| 129 | High bandwidth | Elephant RDMA flows |
| 130 | Low congestion | Congestion-sensitive or premium workloads |
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef physical fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef low fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef bw fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef cong fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    P[Same physical leaf-spine fabric]:::physical
    A128[FlexAlgo 128<br/>low latency plane]:::low
    A129[FlexAlgo 129<br/>high bandwidth plane]:::bw
    A130[FlexAlgo 130<br/>low congestion plane]:::cong

    P --> A128
    P --> A129
    P --> A130

FlexAlgo can steer different AI workloads into different logical path planes without requiring SR-MPLS or SRv6 encapsulation for the basic pure-IP use case.
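
To make the idea of multiple logical topologies concrete, the sketch below runs the same shortest-path computation over one small graph twice, once with the IGP metric and once with a delay metric. It only illustrates the metric-type part of a FAD; constraints such as link exclusion are left out. The link values and node names are invented for the example.

```python
import heapq

# (node_a, node_b, igp_metric, delay_us): hypothetical link attributes.
LINKS = [
    ("leaf1", "spine1", 10, 5),
    ("leaf1", "spine2", 10, 1),
    ("spine1", "leaf2", 10, 5),
    ("spine2", "leaf2", 20, 1),
]


def shortest_path(src: str, dst: str, metric_index: int):
    """Plain Dijkstra over LINKS using one metric column (2 = IGP, 3 = delay)."""
    graph: dict[str, list[tuple[str, int]]] = {}
    for link in LINKS:
        a, b, cost = link[0], link[1], link[metric_index]
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    seen: set[str] = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, step in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return None


print(shortest_path("leaf1", "leaf2", metric_index=2))  # default algorithm, IGP metric
print(shortest_path("leaf1", "leaf2", metric_index=3))  # low-latency algorithm, delay metric
```

The IGP metric prefers the path through spine1, while the delay metric prefers spine2, so the two algorithms yield different forwarding planes on the same physical links.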

| Aspect | BGP-DPF | IS-IS FlexAlgo |
| --- | --- | --- |
| Control input | BGP communities, route/session colors, policy | IS-IS TLVs and FAD constraints |
| Main abstraction | Logical fabric color | Algorithm-specific topology |
| Topology awareness | Policy-driven, BGP is not native link-state | Native link-state |
| Workload mapping | Tenant, GPU ID, QP, SLA | Metric/constraint-based plane |
| Best fit | BGP-based fabrics needing deterministic colors | IGP fabrics needing path diversity and fast convergence |

Multi-tenancy appears when GPUs are shared between users, teams, customers, or services.

Reasons:

  • Security isolation
  • Performance isolation
  • Capacity planning
  • GPU-as-a-Service, GPUaaS
  • Public or private cloud AI offerings
  • Separate training and inference tenants

Tenant isolation, server routing, telemetry, and SRv6 traffic engineering

The simplest model is dedicated physical resources:

  • A tenant gets a full GPU server.
  • Each GPU/NIC port is connected to a rail leaf.
  • Switch ports are dedicated to that tenant.

This is simple and strong from an isolation perspective, but it can waste resources if tenants do not need full servers or full rails.

EVPN-VXLAN can segment tenant traffic using:

| Object | Role |
| --- | --- |
| VLAN | Local Layer 2 tenant mapping |
| VNI | VXLAN network identifier |
| MAC-VRF | Tenant Layer 2 forwarding instance |
| IP-VRF | Tenant Layer 3 routing instance |
| EVPN RT-2 | MAC/IP host route advertisement |
| EVPN RT-5 | IP prefix route advertisement |
| Routing VNI | Tenant L3 VXLAN tunnel identifier |
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef tenant fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef evpn fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fabric fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    TA[Tenant A GPU]:::tenant
    TB[Tenant B GPU]:::tenant
    VA[VLAN A / VNI 10010<br/>MAC-VRF A / IP-VRF A]:::evpn
    VB[VLAN B / VNI 10020<br/>MAC-VRF B / IP-VRF B]:::evpn
    F[Shared leaf-spine fabric]:::fabric

    TA --> VA --> F
    TB --> VB --> F

The chapter describes using RT-5 EVPN IP instances between rails and using route servers at the spine layer. Tenant routes are defined at the ToR/leaf, while spines provide EVPN route reflection or route-server behavior.

Server vendors may provide GPU-level partitioning. The chapter uses NVIDIA MIG, Multi-Instance GPU, as the example.

MIG can partition one physical GPU into multiple GPU instances. Each instance can be assigned to a different tenant or workload.

Example:

| Physical GPU | Tenant Mapping |
| --- | --- |
| MIG instance 1 | Tenant 1 |
| MIG instance 2 | Tenant 2 |
| MIG instance 3 | Tenant 3 |
| MIG instance 7 | Tenant 7 |

vGPU or MIG-level tenancy is enforced inside the server. Workloads that span servers still communicate through collective communication libraries such as NCCL or RCCL.

Combining Server and Network Multi-Tenancy

Server-level and network-level tenancy can be combined.

Example:

  • A server has seven MIG instances.
  • Each MIG instance maps to a VLAN.
  • The ToR leaf has seven MAC-VRFs and seven EVPN IP-level instances.
  • Each tenant gets separate L2 tags, VNI, VRF, and routing policy.

This creates end-to-end tenant isolation from GPU instance to fabric path.
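
A mapping like this is usually generated rather than typed by hand. The sketch below builds one record per MIG instance. The numbering (VLAN 101 and VNI 100101 for the first instance) follows the values shown in this section's diagram, while the base offsets, the naming scheme, and the `tenant_plan` function itself are assumptions.

```python
def tenant_plan(mig_instances: int, vlan_base: int = 100, vni_base: int = 100100) -> list[dict]:
    """Build a per-MIG-instance tenant mapping for one server and its ToR leaf."""
    plan = []
    for index in range(1, mig_instances + 1):
        plan.append({
            "mig_instance": index,
            "tenant": f"tenant-{index}",
            "vlan": vlan_base + index,       # local L2 tag on the server-facing port
            "vni": vni_base + index,         # VXLAN identifier for the tenant segment
            "mac_vrf": f"mac-vrf-tenant-{index}",
            "ip_vrf": f"ip-vrf-tenant-{index}",
        })
    return plan


for row in tenant_plan(7):
    print(row["mig_instance"], row["vlan"], row["vni"], row["mac_vrf"], row["ip_vrf"])
```

Deriving every identifier from the instance index keeps VLANs, VNIs, and VRFs aligned, which makes isolation easier to audit.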

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef gpu fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef vlan fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef vrf fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    M1[MIG 1<br/>Tenant 1]:::gpu
    M2[MIG 2<br/>Tenant 2]:::gpu
    V1[VLAN 101<br/>VNI 100101]:::vlan
    V2[VLAN 102<br/>VNI 100102]:::vlan
    R1[MAC-VRF / IP-VRF<br/>Tenant 1]:::vrf
    R2[MAC-VRF / IP-VRF<br/>Tenant 2]:::vrf

    M1 --> V1 --> R1
    M2 --> V2 --> R2

Another approach is dynamic ACLs on the ToR data plane, driven by a tenant profile from a system such as RADIUS.

Benefits:

  • Centralized policy
  • Potentially simpler than full EVPN-VXLAN in smaller designs
  • Can apply tenant rules directly at the ToR

Limitations:

  • TCAM scale can become a blocker.
  • ACL complexity grows with tenant count.
  • EVPN-VXLAN is usually more scalable for distributed tenant isolation.

The chapter also discusses extending routing to GPU servers.

Instead of treating servers as passive hosts, a GPU server can run a BGP stack and advertise prefixes to the fabric.

Use cases:

  • Advertise service IPs from servers
  • Advertise host routes or anycast services
  • Improve failover through BGP withdraw
  • Keep end-to-end routing protocol behavior consistent
  • Avoid static host route models
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef server fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef leaf fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef spine fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    S1[GPU server 1<br/>BGP ASN]:::server
    S2[GPU server 2<br/>BGP ASN]:::server
    L1[ToR / Leaf 1]:::leaf
    L2[ToR / Leaf 2]:::leaf
    SP[Spine]:::spine

    S1 <-->|eBGP| L1
    S2 <-->|eBGP| L2
    L1 <-->|fabric eBGP| SP
    L2 <-->|fabric eBGP| SP

If a server-side service fails, the server can withdraw the prefix, and the fabric can stop forwarding traffic to it.
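
The withdraw logic on the server can be thought of as reconciling health state against what is currently advertised. The sketch below is a hypothetical model that does not talk to any real BGP daemon; the `reconcile` function and the documentation-range prefixes are assumptions.

```python
def reconcile(advertised: set[str], healthy: dict[str, bool]) -> tuple[set[str], set[str]]:
    """Return (to_announce, to_withdraw) for a server-side BGP speaker."""
    desired = {prefix for prefix, ok in healthy.items() if ok}
    return desired - advertised, advertised - desired


advertised = {"192.0.2.10/32"}
health = {
    "192.0.2.10/32": False,  # service behind this prefix just failed
    "192.0.2.20/32": True,   # new healthy service, not yet advertised
}

to_announce, to_withdraw = reconcile(advertised, health)
print(sorted(to_announce), sorted(to_withdraw))
```

A health-check loop would feed the two sets to whatever BGP stack runs on the server, so the failed prefix disappears from the fabric without operator action.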

At larger scale, having every server peer directly with every relevant fabric node may be too heavy.

A virtual appliance BGP route reflector can:

  • Reduce peering count
  • Centralize server route reflection
  • Advertise server service prefixes into spines
  • Provide a cleaner boundary between server routing and fabric routing

This is especially useful for service prefixes such as anycast IPs.


AI fabric traffic engineering can be distributed or controller-driven.

Distributed examples:

  • ECMP
  • DLB
  • GLB
  • BGP-DPF
  • IS-IS FlexAlgo

For the GLB-specific use of BGP NNHN and forwarding-plane heartbeats, see Appendix: BGP-based Underlay and GLB NNHN.

Controller-driven examples:

  • Controller receives fabric topology
  • Controller receives telemetry
  • Controller receives job scheduler intent
  • Controller computes preferred GPU-to-GPU paths
  • Controller pushes routing or policy updates

Telemetry inputs:

| Signal | Use |
| --- | --- |
| Link utilization | Avoid hot links |
| Queue occupancy | Detect congestion pressure |
| PFC/ECN counters | Detect lossless fabric stress |
| sFlow or sampled flow data | Identify traffic patterns |
| Egress BGP statistics | Understand route and next-hop use |
| Active probes | Measure end-to-end path quality |
| Job scheduler metadata | Map workload to fabric policy |
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef scheduler fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef telemetry fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef controller fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fabric fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    J[GPU job scheduler<br/>placement and SLA]:::scheduler
    T[Fabric telemetry<br/>links, queues, ECN/PFC, probes]:::telemetry
    C[Fabric controller<br/>path computation]:::controller
    F[AI fabric<br/>BGP, DPF, SR, policy]:::fabric

    J --> C
    T --> C
    C --> F
    F --> T
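
A very small version of the controller's path computation is shown below. It assumes the only telemetry input is per-link utilization and that the policy is to pick the path whose busiest link is least loaded. Real controllers combine many more signals; the link names, values, and the `pick_path` function are illustrative.

```python
def pick_path(paths: dict[str, list[str]], utilization: dict[str, float]) -> str:
    """Choose the path whose most-loaded link is the least loaded (min-max)."""
    return min(paths, key=lambda name: max(utilization[link] for link in paths[name]))


paths = {
    "via_spine1": ["leaf1-spine1", "spine1-leaf2"],
    "via_spine2": ["leaf1-spine2", "spine2-leaf2"],
}
utilization = {
    "leaf1-spine1": 0.9,  # hot link reported by telemetry
    "spine1-leaf2": 0.2,
    "leaf1-spine2": 0.4,
    "spine2-leaf2": 0.5,
}

print(pick_path(paths, utilization))
```

Here the path through spine2 wins because its worst link sits at 50 percent, while the spine1 path includes a link at 90 percent.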

Segment Routing, SR, is a source-routing architecture. The ingress node encodes path instructions in the packet, and transit nodes forward based on those instructions.

Important terms:

| Term | Meaning |
| --- | --- |
| SR domain | Network where Segment Routing is enabled |
| Segment | A node, link, adjacency, or instruction |
| SID | Segment Identifier |
| Node SID | Identifier for a node |
| Adjacency SID | Identifier for a specific link or adjacency |
| Anycast SID | Shared identifier for a group of nodes |
| EPE SID | Egress Peer Engineering SID |
| SR path | Ordered list of segments from ingress to egress |
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef node fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef sid fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    I[Ingress leaf]:::node
    S[Spine<br/>Node SID 5]:::sid
    E[Egress leaf<br/>Node SID 1]:::sid
    D[Destination server]:::node

    I -->|packet carries SID 5, SID 1| S
    S -->|pop SID 5, follow SID 1| E
    E -->|pop SID 1| D

SR lets a controller compute an explicit path and install instructions at ingress rather than relying only on local ECMP choices at each hop.
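
The hop-by-hop behavior in the diagram above can be traced with a few lines of code. This is a conceptual model only: the SID values 5 and 1 come from the diagram, the node names are placeholders, and shortest-path forwarding toward each segment endpoint is abstracted into a single step.

```python
# SID values taken from the diagram above; node names are placeholders.
NODE_BY_SID = {5: "spine", 1: "egress-leaf"}


def trace(ingress: str, sid_list: list[int]) -> list[str]:
    """Follow a SID list hop by hop; each segment endpoint pops its own SID."""
    hops = [ingress]
    remaining = list(sid_list)
    while remaining:
        active = remaining.pop(0)         # active segment is the first in the list
        hops.append(NODE_BY_SID[active])  # packet is delivered to that segment's node
    return hops


print(trace("ingress-leaf", [5, 1]))
```

The ingress leaf decides the whole path by choosing the SID list, and the spine and egress leaf only act on the segment that is currently active.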

| Data Plane | How It Encodes SIDs | Notes |
| --- | --- | --- |
| SR-MPLS | MPLS label stack | Common in service provider networks |
| SRv6 | IPv6 Segment Routing Header, SRH | Uses IPv6 extension header with segment list |

Control-plane information for Segment Routing can come from:

  • IGP
  • BGP-LS
  • Controller topology database
  • Path Computation Element Communication Protocol, PCEP

SRv6 uses an IPv6 extension header called the Segment Routing Header, SRH.

The SRH sits between IPv6 and upper-layer payload such as UDP/RoCEv2.

Conceptual packet:

IPv6 header
SRH: segment list
UDP / RoCEv2 header
Payload

SRv6 SID structure can include:

| Part | Meaning |
| --- | --- |
| Locator | Identifies the node or location |
| Function | Encodes node, adjacency, or service behavior |
| Argument | Optional service or application metadata |

The chapter notes that SRv6 network programming lets the network encode a program of forwarding instructions into IPv6 packet headers.

Long SRv6 segment lists can increase packet overhead.

Compressed SID and micro-segment SRv6, uSID, reduce that overhead.

Key ideas:

  • Normal SRv6 SID is 128 bits.
  • uSID can encode micro-segments in smaller chunks, such as 16-bit micro-SIDs.
  • uSID keeps the SRv6 programming model while reducing header size.
  • This helps multi-domain or long-path deployments.
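
The overhead difference is easy to estimate. The sketch below assumes the SRH layout from RFC 8754 (an 8-byte fixed part plus 16 bytes per SID) and a uSID container with a 32-bit block followed by 16-bit micro-SIDs. It ignores end-of-container handling and any optional TLVs, so treat the numbers as a rough estimate. The function names are illustrative.

```python
import math

SRH_FIXED_BYTES = 8  # fixed part of the SRH, per RFC 8754
SID_BYTES = 16       # one full 128-bit SRv6 SID


def srh_bytes(num_sids: int) -> int:
    """Size of an SRH carrying num_sids full 128-bit SIDs."""
    return SRH_FIXED_BYTES + SID_BYTES * num_sids


def usid_containers(num_usids: int, block_bits: int = 32, usid_bits: int = 16) -> int:
    """Number of 128-bit containers needed to carry num_usids micro-SIDs."""
    per_container = (128 - block_bits) // usid_bits
    return math.ceil(num_usids / per_container)


print(srh_bytes(6))        # six full SIDs in an SRH
print(usid_containers(6))  # six micro-SIDs under the assumed layout
```

Under these assumptions, six full SIDs need a 104-byte SRH, while six 16-bit micro-SIDs fit in a single 128-bit container.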

For AI fabrics, SRv6/uSID is interesting because it can support deterministic path placement, but it must still coexist with the mechanisms RoCEv2 depends on, such as PFC for lossless delivery and DCQCN for congestion control.


| Characteristic | BGP | RIFT | IS-IS |
| --- | --- | --- | --- |
| IP Clos ECMP support | Yes | Yes | Yes |
| Dragonfly topology support | Limited | Yes | Limited |
| Multi-tenancy options | Strong with EVPN-VXLAN | Limited | Limited |
| Convergence speed | Medium | Fast | Fast |
| Link awareness | Low by default | High | High |
| Full topology awareness | Medium | High | Medium |
| Automatic disaggregation on failure | No | Yes | No |
| Fabric configuration metadata | Limited | Yes | Limited |
| Wide ECMP / UCMP | Yes with extensions | Yes | Limited |
| Policy control | High | Lower | Medium |
| Operational maturity in DC fabrics | High | Emerging | Medium |

Decision guidance:

| Requirement | Likely Fit |
| --- | --- |
| Cloud-style Clos with strong policy and EVPN | eBGP |
| Fat-tree optimized IGP with fast convergence | RIFT |
| Pure IP fabric with logical path algorithms | IS-IS FlexAlgo |
| Deterministic tenant/GPU/QP path coloring in BGP fabric | BGP-DPF |
| Explicit controller-computed paths | Segment Routing / SRv6 |
| Server anycast or service prefix advertisement | Server-to-ToR BGP |

Routing design should be validated as a workload-facing system, not only as a reachability graph.

Checklist:

  • Confirm backend and frontend routing domains are intentionally separated or shared.
  • Validate eBGP ASN allocation and AS-PATH loop prevention.
  • Confirm BGP unnumbered peers exchange IPv4 NLRI with IPv6 link-local next hops.
  • Test link failure convergence for leaf-spine and spine-super-spine paths.
  • Verify ECMP width for GPU prefixes.
  • Test BGP ADD-PATH behavior if route reflectors or path hiding are present.
  • Validate BGP Link Bandwidth Extended Community and weighted ECMP behavior.
  • Confirm minimum active peer policy for critical prefixes.
  • Validate BGP-DPF color assignment for tenant, GPU, QP, or SLA traffic.
  • Verify EVPN-VXLAN tenant route isolation with RT-2 and RT-5 routes.
  • Test RIFT or IS-IS convergence if using IGP alternatives.
  • Validate IS-IS FlexAlgo path constraints and fallback behavior.
  • Confirm server-to-ToR BGP withdraw behavior for service failure.
  • Validate controller-driven path updates against real telemetry.
  • Test SRv6/uSID MTU and hardware forwarding support.
  • Measure workload effects: NCCL latency, p99 iteration time, GPU utilization, and JCT.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef model fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef test fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef signal fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fix fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    R[Routing model<br/>BGP, RIFT, IS-IS, SR]:::model
    P[Policy model<br/>tenant, color, VRF, path]:::model
    F[Failure tests<br/>link, spine, ToR, server]:::test
    T[Telemetry<br/>ECMP width, queues, route churn]:::signal
    W[Workload validation<br/>NCCL, RoCEv2, storage]:::test
    D{Does routing behavior<br/>protect workload target?}:::decision
    A[Accept design<br/>document operating envelope]:::model
    X[Revise ASN, policy,<br/>protocol, path, or topology]:::fix

    R --> P --> F --> T --> W --> D
    D -->|Yes| A
    D -->|No| X
    X --> R

AI data center routing must support reachability, convergence, path control, and tenant isolation at the same time.

The main takeaways:

  • eBGP remains the most common underlay choice for three-stage and five-stage Clos AI fabrics.
  • BGP unnumbered reduces per-link addressing and simplifies automation.
  • ASN allocation matters because AS-PATH behavior can prevent loops or create suboptimal paths.
  • BGP ADD-PATH and BGP Link Bandwidth Extended Community improve path diversity and weighted ECMP.
  • BGP-DPF introduces deterministic logical fabric colors for tenant, GPU, QP, or SLA traffic.
  • RIFT is designed for fat-tree fabrics and can also support Dragonfly-style topology awareness.
  • IS-IS Optimal Distributed Flooding improves convergence in dense fabrics.
  • IS-IS FlexAlgo can create logical routing planes using bandwidth, latency, or congestion constraints.
  • EVPN-VXLAN, VRFs, MIG, and dynamic ACLs are different tools for AI multi-tenancy.
  • Extending BGP to servers enables anycast and service prefix advertisement directly from GPU hosts.
  • Segment Routing and SRv6 provide explicit path programming, especially with controller-driven traffic engineering.
  • No routing protocol is universally best; the right answer depends on topology, scale, vendor support, operations, and workload goals.

| Term | Meaning |
| --- | --- |
| eBGP | External Border Gateway Protocol |
| BGP unnumbered | BGP peering using IPv6 link-local addresses instead of numbered IPv4 point-to-point links |
| RFC 5549 | Advertising IPv4 NLRI with IPv6 next hop |
| ASN | Autonomous System Number |
| AS-PATH | BGP path attribute listing ASNs traversed |
| ADD-PATH | BGP capability to advertise multiple paths for a prefix |
| BGP Link Bandwidth Extended Community | BGP community carrying bandwidth for weighted ECMP |
| DPF | Deterministic Path Forwarding |
| Fabric color | Logical fabric or path group used for deterministic forwarding |
| RIFT | Routing in Fat Trees |
| UCMP | Unequal-Cost Multipathing |
| IS-IS | Intermediate System to Intermediate System |
| FlexAlgo | IS-IS flexible algorithm for constraint-based logical topologies |
| FAD | Flexible Algorithm Definition |
| EVPN | Ethernet VPN control plane |
| VXLAN | Overlay encapsulation using VNI identifiers |
| VNI | VXLAN Network Identifier |
| MAC-VRF | Layer 2 tenant forwarding instance |
| IP-VRF | Layer 3 tenant routing instance |
| RT-2 | EVPN MAC/IP route type |
| RT-5 | EVPN IP prefix route type |
| MIG | Multi-Instance GPU |
| SR | Segment Routing |
| SID | Segment Identifier |
| SRH | Segment Routing Header |
| SRv6 | Segment Routing over IPv6 |
| uSID | Micro-segment SRv6 |

1. Why is BGP widely adopted for routing in large-scale AI data center fabrics?

BGP is widely adopted because it scales well, has strong policy control, works across vendors, and has built-in loop prevention through AS-PATH.

In AI fabrics, eBGP is commonly used with unique ASNs per leaf or rack and shared ASNs at spine layers. BGP unnumbered simplifies leaf-spine peering by using IPv6 link-local addresses, while route policies control prefix advertisement, filtering, maintenance, and tenant behavior.

2. What problem does BGP unnumbered solve?

BGP unnumbered removes the need to assign IPv4 point-to-point addresses to every leaf-spine link.

Each interface learns its neighbor's IPv6 link-local address, typically from IPv6 router advertisements, and uses it to establish the BGP session. IPv4 server prefixes can still be advertised, but the next hop is an IPv6 link-local address, as described in RFC 5549 (since obsoleted by RFC 8950). This is useful in AI fabrics because the number of physical links can be very large.
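To give a sense of the addressing that unnumbered peering avoids, here is a rough arithmetic sketch. It assumes every leaf-spine link would otherwise get its own IPv4 /31; the fabric size (64 leaves, 16 spines) and the function name are hypothetical.

```python
def p2p_ipv4_addresses(leaves: int, spines: int, links_per_pair: int = 1) -> int:
    """IPv4 addresses needed if every leaf-spine link gets its own /31."""
    links = leaves * spines * links_per_pair
    return links * 2  # a /31 holds exactly two addresses, one per link end


# Hypothetical fabric: 64 leaves, each connected once to each of 16 spines.
leaf_count, spine_count = 64, 16
link_count = leaf_count * spine_count
print(link_count, "links,", p2p_ipv4_addresses(leaf_count, spine_count),
      "IPv4 addresses avoided")
```

With unnumbered peering none of these per-link addresses need to be allocated, tracked, or templated into device configuration.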

3. Why does ASN allocation matter in eBGP Clos fabrics?

ASN allocation determines how AS-PATH loop prevention behaves. An eBGP speaker rejects any route whose AS-PATH already contains its own ASN. If the spines share an ASN and leaves use unique ASNs, a route that has already crossed one spine is rejected by every other spine, so traffic cannot take a leaf-spine-leaf-spine detour after a failure.

Poor ASN design can make troubleshooting harder or allow such detours to become backup paths with extra hops and poor performance.
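The loop-prevention behavior can be sketched in a few lines. This is a simplified model, not a BGP implementation: the ASN values are hypothetical private ASNs, and only the AS-PATH check and prepend step are modeled.

```python
def accepts(local_asn, as_path):
    """eBGP loop prevention: reject a route whose AS-PATH already holds our ASN."""
    return local_asn not in as_path


def advertise(sender_asn, as_path):
    """An eBGP speaker prepends its own ASN when advertising to a neighbor."""
    return [sender_asn] + as_path


SPINE_ASN = 65000                      # shared by spine1 and spine2 (assumed)
LEAF1_ASN, LEAF2_ASN = 65101, 65102    # unique per leaf (assumed)

path = advertise(LEAF1_ASN, [])        # leaf1 originates the prefix
spine1_ok = accepts(SPINE_ASN, path)   # spine1 has not seen its ASN yet

path = advertise(SPINE_ASN, path)      # spine1 -> leaf2
leaf2_ok = accepts(LEAF2_ASN, path)

path = advertise(LEAF2_ASN, path)      # leaf2 -> spine2 (would be a detour)
spine2_ok = accepts(SPINE_ASN, path)   # spine ASN already in the path

print(path, spine1_ok, leaf2_ok, spine2_ok)
```

In this model spine2 drops the route re-advertised by leaf2, which is why a shared spine ASN keeps leaf-spine-leaf-spine paths out of the routing table.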

4. What does BGP-DPF provide?

BGP-DPF divides one physical fabric into logical colored fabrics. Traffic can be mapped to a color based on tenant, GPU ID, QP range, or SLA.

This gives deterministic path selection. It can isolate elephant flows, keep tenant traffic predictable, and align overlay tenants with underlay path colors.
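The mapping step can be pictured as a small first-match policy. The sketch below only illustrates the idea of deterministic color selection; the tenant names, QP range, color names, and function names are all hypothetical, and it does not represent how BGP-DPF signals or installs colors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    tenant: str
    qp: int  # RDMA queue pair number


# Hypothetical policy tables; the first matching rule wins.
TENANT_COLOR = {"tenant-a": "red", "tenant-b": "blue"}
QP_RANGE_COLOR = [(range(0, 1024), "green")]
DEFAULT_COLOR = "default"


def color_for(flow: Flow) -> str:
    """Return the fabric color for a flow: tenant rule first, then QP range."""
    if flow.tenant in TENANT_COLOR:
        return TENANT_COLOR[flow.tenant]
    for qp_range, color in QP_RANGE_COLOR:
        if flow.qp in qp_range:
            return color
    return DEFAULT_COLOR


print(color_for(Flow("tenant-a", 5000)))   # tenant rule
print(color_for(Flow("tenant-x", 100)))    # QP-range rule
print(color_for(Flow("tenant-x", 5000)))   # no rule matches
```

Because the result depends only on the flow's attributes and not on a hash of its headers, the same flow always lands on the same color.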

5. What are RIFT’s advantages in AI fabrics?

RIFT is designed for fat-tree and Clos-style topologies. It provides fast convergence, topology awareness, wide ECMP, UCMP, and automatic disaggregation on failures.

It can be attractive when BGP policy is less important than topology-native convergence and fabric awareness. Its main trade-off is maturity and operational familiarity compared with eBGP.

6. How does IS-IS FlexAlgo support workload isolation?

IS-IS FlexAlgo computes multiple logical topologies over the same physical fabric. Each algorithm can use different constraints such as latency, bandwidth, or congestion.

This lets one workload use a low-latency plane, another use a high-bandwidth plane, and another use a low-congestion plane. It is similar in spirit to BGP-DPF, but implemented through link-state TLVs and FADs.
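A toy model of the idea: the same link database is run through a shortest-path computation twice, once per algorithm, with a different metric and a different exclusion constraint each time. The topology, metric values, and group tags are hypothetical, and a real FlexAlgo implementation works from the FAD advertised in IS-IS rather than from function arguments.

```python
import heapq

# Hypothetical 2-leaf, 2-spine fabric. Each link carries an IGP metric, a
# measured delay in microseconds, and an admin-group tag.
LINKS = [
    ("leaf1", "spine1", {"igp": 10, "delay_us": 5, "group": "gold"}),
    ("leaf1", "spine2", {"igp": 10, "delay_us": 2, "group": "bronze"}),
    ("spine1", "leaf2", {"igp": 10, "delay_us": 5, "group": "gold"}),
    ("spine2", "leaf2", {"igp": 10, "delay_us": 2, "group": "bronze"}),
]


def shortest_path(src, dst, metric, exclude_group=None):
    """Dijkstra over the links that survive the constraint, using one metric."""
    graph = {}
    for a, b, attrs in LINKS:
        if attrs["group"] == exclude_group:
            continue  # link is pruned from this algorithm's topology
        graph.setdefault(a, []).append((b, attrs[metric]))
        graph.setdefault(b, []).append((a, attrs[metric]))
    heap = [(0, src, [src])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, weight in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(heap, (cost + weight, nbr, path + [nbr]))
    return None  # destination unreachable in this logical topology


# Two "algorithms" over one physical fabric.
low_latency = shortest_path("leaf1", "leaf2", metric="delay_us")
gold_only = shortest_path("leaf1", "leaf2", metric="igp", exclude_group="bronze")
print(low_latency)
print(gold_only)
```

The low-latency plane picks the spine2 path, while the plane that excludes bronze links is forced through spine1, even though both run over the same four links.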

7. How does multi-tenancy affect AI routing?

Multi-tenancy requires segmentation and policy at multiple layers. A tenant may need isolated GPU instances, isolated VLAN/VNI mappings, separate MAC-VRF/IP-VRF instances, and separate EVPN routes.

BGP-EVPN is commonly used because it can signal tenant MAC/IP and prefix routes at scale. Server-level features such as MIG can be combined with network-level EVPN-VXLAN to provide end-to-end GPU tenant isolation.

8. What does extending BGP to GPU servers enable?

Server-to-ToR BGP lets GPU servers advertise service prefixes or anycast addresses directly into the fabric. If a service fails, the server can withdraw the prefix.

This makes routing more dynamic and can simplify anycast service design, but it adds operational responsibility to the server side. At scale, the number of server sessions may also require route reflectors if iBGP is used toward the servers.
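The advertise-and-withdraw behavior amounts to a small state machine driven by a health check. The sketch below models only that decision logic; the prefix, function names, and health samples are hypothetical, and it does not talk to any real BGP daemon.

```python
from typing import List, Optional


def next_action(healthy: bool, advertised: bool) -> Optional[str]:
    """Decide what the host should tell its BGP speaker, if anything."""
    if healthy and not advertised:
        return "announce"
    if not healthy and advertised:
        return "withdraw"
    return None  # state already matches health, nothing to send


def run(health_samples: List[bool], prefix: str = "192.0.2.10/32") -> List[str]:
    """Replay a series of health-check results and record the BGP actions."""
    advertised = False
    log = []
    for healthy in health_samples:
        action = next_action(healthy, advertised)
        if action:
            advertised = action == "announce"
            log.append(f"{action} {prefix}")
    return log


# Service comes up, stays up, fails twice, then recovers.
actions = run([True, True, False, False, True])
print(actions)
```

Only state changes generate an update, so a flapping health check is the main thing to guard against in a real deployment, for example with a hold-down timer.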

9. What role does telemetry play in routing and traffic engineering?

Telemetry gives routing controllers and operators the data needed to make path decisions. Useful signals include link utilization, queue occupancy, ECN/PFC counters, active probes, sFlow, and job scheduler placement.

Without telemetry, traffic engineering becomes static policy. With telemetry, the fabric can adapt paths to real workload and congestion state.
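One simple way a controller might use such data is to pick the candidate path whose busiest link is least utilized. This is a minimal sketch under assumed inputs: the utilization snapshot, the candidate paths, and the selection rule are all hypothetical, and a real controller would also weigh queue depth, ECN/PFC counters, and path stability.

```python
# Hypothetical telemetry snapshot: directed link -> utilization (0.0 to 1.0).
UTIL = {
    ("leaf1", "spine1"): 0.85,
    ("spine1", "leaf2"): 0.40,
    ("leaf1", "spine2"): 0.30,
    ("spine2", "leaf2"): 0.55,
}

PATHS = [
    ["leaf1", "spine1", "leaf2"],
    ["leaf1", "spine2", "leaf2"],
]


def bottleneck(path, util):
    """Utilization of the busiest link along the path."""
    return max(util[(a, b)] for a, b in zip(path, path[1:]))


def pick_path(paths, util):
    """Choose the path whose bottleneck link is least utilized."""
    return min(paths, key=lambda p: bottleneck(p, util))


best = pick_path(PATHS, UTIL)
print(best, bottleneck(best, UTIL))
```

A static policy would keep using whichever path was configured first; with the snapshot above, the telemetry-aware choice moves traffic away from the 85% utilized leaf1-spine1 link.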

10. How does Segment Routing or SRv6 help AI fabrics?

Segment Routing lets an ingress node encode a path into the packet. With a controller, GPU-to-GPU paths can be computed to avoid congestion or satisfy SLA goals.

SRv6 carries path instructions in an IPv6 Segment Routing Header. uSID reduces header overhead by using compact micro-segments. The trade-off is that SRv6 requires hardware support, MTU planning, and careful integration with the mechanisms RoCEv2 relies on, such as PFC for lossless delivery and DCQCN for congestion control.
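The MTU planning concern can be made concrete with a rough overhead estimate. The sketch assumes a 40-byte outer IPv6 header, an SRH with an 8-byte fixed part plus 16 bytes per listed SID, and no reduced encapsulation. For uSID it assumes a 32-bit block with 16-bit micro-segments, so six fit in one 128-bit container, and that a single container is carried in the IPv6 destination address without an SRH. The function names are hypothetical.

```python
import math

IPV6_HEADER = 40   # fixed IPv6 header, bytes
SRH_FIXED = 8      # fixed part of the Segment Routing Header, bytes
SID_SIZE = 16      # one 128-bit SID, bytes


def srv6_overhead(num_segments: int) -> int:
    """Outer IPv6 header plus an SRH that lists every segment."""
    return IPV6_HEADER + SRH_FIXED + SID_SIZE * num_segments


def usid_overhead(num_segments: int, usids_per_container: int = 6) -> int:
    """Overhead when segments are packed as uSIDs (assumed 32-bit block, 16-bit IDs)."""
    containers = math.ceil(num_segments / usids_per_container)
    if containers <= 1:
        return IPV6_HEADER  # one container fits in the destination address
    return IPV6_HEADER + SRH_FIXED + SID_SIZE * containers


for hops in (2, 4, 8):
    print(hops, "segments:", srv6_overhead(hops), "bytes SRv6,",
          usid_overhead(hops), "bytes uSID")
```

Under these assumptions a four-segment path costs 112 bytes with full SIDs and 40 bytes with uSID, which is the kind of difference that has to be budgeted against the fabric MTU and the RoCEv2 payload size.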