Chapter 7: RoCEv2 Transport and Congestion Management

This chapter explains how RoCEv2 traffic is transported across Ethernet AI fabrics and how congestion is detected, signaled, and mitigated.

The core idea is:

RoCEv2 gives AI clusters high-throughput, low-latency RDMA over Ethernet, but Ethernet must be engineered carefully so congestion does not turn into packet loss, PFC storms, or long GPU stalls.

When a training job reports NCCL or RDMA timeouts, also correlate GPU Xid, ECC, NVLink, PCIe, and scheduler events. The GPU Cluster Failure Analysis appendix covers this cross-layer workflow.

The chapter focuses on these topics:

  • Why RoCEv2 uses UDP/IP for RDMA traffic
  • Where congestion appears in leaf-spine and multi-stage Clos fabrics
  • ECN and CNP for end-to-end congestion signaling
  • PFC for hop-by-hop lossless flow control
  • PFC watchdog for PFC storm detection and mitigation
  • DCQCN as a RoCEv2 congestion control loop
  • SFC as a flow-based source control mechanism
  • CSIG as an emerging in-band congestion telemetry mechanism

RoCEv2 congestion management map

RDMA over Converged Ethernet version 2, RoCEv2, is widely used in AI/ML clusters to synchronize data between application buffers on distributed GPU servers.

RoCEv2 carries RDMA traffic over UDP/IP. This allows it to run across routed Ethernet fabrics while avoiding TCP’s connection state and CPU-heavy transport behavior.

RoCEv2 is attractive for AI fabrics because it provides:

  • Low-latency data movement
  • High throughput
  • Zero-copy RDMA semantics
  • Kernel bypass
  • Lower CPU involvement
  • Better parallel session scalability than TCP-based transport
  • Fit for GPU synchronization and storage traffic such as NVMe-oF over RoCE

The trade-off is that RoCEv2 is built on UDP/IP. UDP does not provide TCP-style reliability, retransmission, or congestion window behavior. Loss recovery falls to the RDMA transport in the NIC, and it is expensive when it triggers. Therefore, the Ethernet fabric must provide strong congestion management.

| Transport | Reliability Model | Congestion Behavior | AI Fabric Implication |
| --- | --- | --- | --- |
| TCP | ACKs, retransmission, windowing | Built into transport | More CPU/state overhead; not ideal for GPU RDMA fast path |
| UDP | Best effort | No built-in reliability or congestion control | Fast and simple, but loss-sensitive applications need fabric help |
| RoCEv2 | RDMA over UDP/IP | Depends on ECN, PFC, DCQCN, and NIC behavior | High performance, but needs tuned lossless or low-loss fabric |

Ethernet is normally lossy. RoCEv2-based RDMA expects lossless or near-lossless behavior because packet drops can cause severe performance degradation.

Congestion can appear even in a fabric designed with no intentional oversubscription:

  • Load balancing can hash several elephant flows onto one link.
  • Incast can concentrate traffic toward one egress.
  • Synchronized collectives can create sudden bursts.
  • Storage read/write bursts can overrun server-facing links.
  • Multi-stage Clos fabrics can concentrate traffic at spine or super-spine layers.

The chapter describes several locations where congestion can happen.

| Congestion Point | Where It Happens | Typical Cause |
| --- | --- | --- |
| Local leaf link | Inside or below one leaf | Multiple local servers send to one local target |
| Leaf-to-spine | Ingress leaf uplink | ECMP or load balancing sends many flows to one spine |
| Spine-to-leaf | Transit spine downlink | Several ingress leaves converge toward one egress leaf |
| Leaf-to-server | Egress leaf downlink | Fabric sends more than the destination NIC can receive |
| Spine-to-super-spine | Five-stage Clos uplink | Traffic converges from spine to one super-spine |
| Super-spine-to-spine | Five-stage Clos downlink | Super-spine sends too much traffic toward one spine or block |

Congestion points in a RoCEv2 Clos fabric

Local leaf congestion occurs when multiple devices connected to the same leaf send line-rate traffic to a local target, such as a storage server or GPU server.

Example:

  • Server 1 sends 100G.
  • Server 2 sends 100G.
  • Both target a local storage server attached by a single 100G link.
  • The leaf receives 200G of offered load for a 100G output.

This can trigger queue growth, ECN marking, and eventually PFC.
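
The arithmetic behind this example can be sketched in a few lines. This is a minimal illustration, not a model of any real switch: the 16 MiB buffer size is a hypothetical figure chosen only to make the numbers concrete, and the function name is invented for this sketch.

```python
def seconds_to_fill(buffer_bytes: int, offered_gbps: float, drain_gbps: float) -> float:
    """Time for a queue to fill when offered load exceeds the drain rate."""
    excess_bps = (offered_gbps - drain_gbps) * 1e9
    if excess_bps <= 0:
        return float("inf")  # the queue does not grow
    return buffer_bytes * 8 / excess_bps


# Two 100G senders into one 100G port, with a hypothetical 16 MiB of usable buffer.
fill_s = seconds_to_fill(16 * 1024 * 1024, offered_gbps=200, drain_gbps=100)
print(f"queue fills in about {fill_s * 1e6:.0f} microseconds")
```

With these assumed numbers the queue fills in roughly 1.3 milliseconds, which is why marking and pause thresholds have to act well before the buffer is exhausted.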

Leaf-to-spine congestion happens when multiple flows are load-balanced onto the same leaf uplink.

This is closely related to Chapter 6:

  • Low-entropy RoCEv2 traffic may hash poorly.
  • A small number of elephant flows may collide on one ECMP member.
  • Other spine uplinks may remain underused.
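
To see why a small number of elephant flows collide so easily, the sketch below computes the chance that every flow lands on a distinct uplink. It assumes each flow is hashed uniformly and independently, which is an idealization; real ECMP hashing with low-entropy headers is usually worse. The function name is hypothetical.

```python
from math import perm


def p_no_collision(flows: int, links: int) -> float:
    """Probability that uniformly hashed flows all land on distinct links."""
    if flows > links:
        return 0.0  # pigeonhole: at least two flows must share a link
    return perm(links, flows) / links ** flows


print(p_no_collision(4, 4))  # 0.09375
print(p_no_collision(4, 8))  # 0.41015625
```

Even with ideal hashing, four flows across four uplinks avoid any collision less than 10% of the time under this assumption.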

Spine-to-leaf congestion happens when several ingress leaves send traffic through the same spine toward the same egress leaf.

This is a classic incast pattern:

  • Leaf A sends traffic to Leaf D.
  • Leaf B sends traffic to Leaf D.
  • Both flows land on Spine A.
  • Spine A has one downlink toward Leaf D.

The ingress side may look balanced locally, but the transit spine downlink becomes congested.

Leaf-to-server congestion happens at the final hop when the fabric sends more traffic than the destination NIC can receive.

This can happen when:

  • Multiple source GPUs send to one destination GPU.
  • Multiple storage readers or writers target one server.
  • Inference aggregation sends many responses to one endpoint.
  • A destination NIC is 100G while the aggregate fabric input is higher.

In a five-stage Clos, traffic may move from leaf to spine to super-spine. If several leaf domains converge on the same spine-to-super-spine link, congestion can happen there.

This is the same incast pattern lifted one stage higher in the topology.

Super-spine-to-spine congestion happens on the downlink from a super-spine toward a destination spine or block.

This is especially important in multi-stage fabrics because congestion may occur far away from the original ingress leaf. Local link quality alone may not reveal the end-to-end bottleneck.


RoCEv2 fabrics use multiple mechanisms together.

| Mechanism | Layer / Scope | Main Function | Main Trade-Off |
| --- | --- | --- | --- |
| ECN | IP / end-to-end | Mark congestion and trigger CNP rate reduction | Slower than hop-by-hop pause |
| CNP | RoCEv2 notification | Tell sender which flow is congested | Arrives after marked packet reaches receiver |
| PFC | Ethernet / hop-by-hop | Pause a priority class to prevent drops | HOL blocking and PFC storms |
| PFC watchdog | Switch protection | Detect and mitigate persistent PFC storms | May drop or forward traffic during mitigation |
| DCQCN | RoCEv2 congestion control | Combine ECN/CNP and PFC into a rate-control loop | Requires careful threshold and NIC tuning |
| SFC | Source flow control | Send direct flow-level signal toward source | Newer, requires support across devices |
| CSIG | In-band telemetry | Carry bottleneck metadata in packets | Emerging, needs protocol and silicon support |

The practical model is layered:

ECN, PFC, and DCQCN control loop


Explicit Congestion Notification, ECN, is an end-to-end congestion signaling mechanism.

ECN requires:

  • Sender ECN support
  • Receiver ECN support
  • ECN-enabled transit switches
  • ECN thresholds on switch queues
  • CNP behavior in the RoCEv2 endpoint

If a transit device in the path does not support ECN, end-to-end ECN behavior is broken.

The IP header carries DSCP and ECN in a single 8-bit field: Type of Service in IPv4, Traffic Class in IPv6. DSCP uses the upper 6 bits for QoS marking. ECN uses the lower 2 bits.

| ECN Bits | Meaning | Behavior |
| --- | --- | --- |
| 00 | Not ECN capable | Packet may be dropped under congestion |
| 01 | ECN-capable transport, ECT(1) | Used as ECT value; also appears in RoCEv2 CNP |
| 10 | ECN-capable transport, ECT(0) | Treated similarly to 01 from a network perspective |
| 11 | Congestion Experienced, CE | Switch marks packet instead of dropping it |

When queue usage exceeds the ECN threshold, the switch marks packets with CE, 11, and forwards them. The receiver then sends a CNP back to the sender.
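
The bit layout can be made concrete with a short sketch that splits the 8-bit field into DSCP and ECN and rewrites the ECN bits to CE. The function names are hypothetical, and the choice of DSCP 26 in the example is arbitrary.

```python
ECN_NOT_ECT, ECN_ECT1, ECN_ECT0, ECN_CE = 0b00, 0b01, 0b10, 0b11


def split_tos(tos: int) -> tuple[int, int]:
    """Split the 8-bit ToS / Traffic Class value into (DSCP, ECN)."""
    return tos >> 2, tos & 0b11


def mark_ce(tos: int) -> int:
    """Rewrite the ECN bits to CE; packets that are not ECN capable are left unchanged."""
    dscp, ecn = split_tos(tos)
    if ecn == ECN_NOT_ECT:
        return tos  # cannot be marked; such a packet may be dropped instead
    return (dscp << 2) | ECN_CE


tos = (26 << 2) | ECN_ECT0      # DSCP 26 with ECT(0) set by the sender
print(split_tos(tos))           # (26, 2)
print(split_tos(mark_ce(tos)))  # (26, 3)
```

Marking changes only the two ECN bits, so the DSCP value and the queue mapping derived from it stay the same.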

Congestion Notification Packet, CNP, is generated by the receiver or destination server.

Important properties:

  • It is a RoCEv2 frame.
  • It is sent back to the source when ECN-marked traffic is received.
  • The chapter identifies the CNP IB BTH opcode as 129.
  • It carries destination Queue Pair information so the sender can identify the congested flow.
  • The sender reduces the traffic rate for that flow.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Src as Sender GPU/NIC
    participant LeafA as Leaf A
    participant Spine as Spine A
    participant LeafD as Leaf D
    participant Dst as Receiver GPU/NIC

    Src->>LeafA: RoCEv2 packet, ECN capable
    LeafA->>Spine: Forward
    Note over Spine,LeafD: Queue crosses ECN threshold
    Spine->>LeafD: Mark ECN CE=11 and forward
    LeafD->>Dst: Deliver marked packet
    Dst-->>Src: CNP with congested QP information
    Src->>Src: Reduce sender rate for that flow
```

ECN marking is based on queue usage.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef safe fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef mark fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef drop fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    A[Low queue use<br/>no marking]:::safe
    B[Queue crosses ECN threshold<br/>mark CE=11]:::mark
    C[Queue keeps growing<br/>WRED or drops possible]:::drop

    A --> B --> C
```

ECN threshold tuning matters:

  • If the threshold is too low, the fabric may mark too aggressively and reduce throughput.
  • If the threshold is too high, congestion may turn into drops before senders slow down.
  • Multiple flows may share the same queue, which makes threshold tuning harder.
  • Operators often monitor ECN counters and queue occupancy to refine thresholds.
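
The text above describes a single ECN threshold. Many switch implementations instead expose a minimum threshold, a maximum threshold, and a maximum marking probability, with a linear ramp between the two thresholds (WRED-style marking). The sketch below assumes that model; the parameter names and values are illustrative and not taken from any particular platform.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float, kmax_kb: float, pmax: float) -> float:
    """Marking probability for a queue depth under an assumed min/max ramp model."""
    if queue_kb <= kmin_kb:
        return 0.0  # below the minimum threshold: never mark
    if queue_kb >= kmax_kb:
        return 1.0  # above the maximum threshold: mark every ECN-capable packet
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


for depth_kb in (100, 500, 900):
    print(depth_kb, ecn_mark_probability(depth_kb, kmin_kb=200, kmax_kb=800, pmax=0.1))
```

Lowering `kmin_kb` makes the fabric mark earlier and more often, and raising it delays the signal, which is the trade-off listed in the bullets above.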

ECN is useful, but it is not instantaneous.

The delay path is:

  1. Congested switch marks the packet.
  2. Marked packet reaches the receiver.
  3. Receiver generates CNP.
  4. CNP reaches the sender.
  5. Sender reduces rate.

During this delay, a burst can continue filling the queue. If the queue grows faster than ECN/CNP can react, packet drops can still occur. The chapter therefore describes ECN-enabled queues as lossy queues.
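
A rough way to size the problem is to multiply the excess arrival rate by the feedback delay. The sketch below does only that; the 100 Gb/s excess and 10 microsecond delay are hypothetical inputs, and the function name is invented for illustration.

```python
def backlog_during_feedback(excess_gbps: float, feedback_delay_us: float) -> float:
    """Bytes that pile up in the queue before the sender's rate cut takes effect."""
    bytes_per_second = excess_gbps * 1e9 / 8
    return bytes_per_second * feedback_delay_us * 1e-6


extra_bytes = backlog_during_feedback(excess_gbps=100, feedback_delay_us=10)
print(f"{extra_bytes / 1000:.0f} kB of extra queue")
```

If the buffer space left above the ECN threshold is smaller than this figure, drops can occur before the CNP has any effect.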


Priority Flow Control, PFC, is an Ethernet link-layer mechanism that pauses traffic for a priority class.

PFC is different from ECN:

  • ECN marks packets end-to-end.
  • PFC sends hop-by-hop pause frames.
  • ECN targets a flow through CNP and rate reduction.
  • PFC targets a whole priority or traffic class.

A queue with PFC configured is often treated as a lossless queue.

PFC uses Ethernet MAC control frames.

Important fields:

| Field | Meaning |
| --- | --- |
| Destination MAC | Reserved multicast or control destination (01:80:C2:00:00:01) |
| Source MAC | Switch that generated the PFC frame |
| EtherType | 0x8808 for MAC control |
| Opcode | PFC control opcode (0x0101) |
| Priority Control Vector | Which traffic classes are paused |
| Time fields | Pause duration per class |

The frame supports traffic classes 0 through 7. For each enabled class, a time value indicates how long traffic should be paused. A time value of 0 indicates unpause.
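
The field layout can be sketched by packing a PFC frame by hand. This is an illustrative encoder, not a tool for injecting frames: it omits the FCS, and the source MAC in the example is a made-up locally administered address. Timer values are expressed in pause quanta of 512 bit times each, as defined by IEEE 802.1Qbb.

```python
import struct

PFC_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(src_mac: bytes, pause_quanta: dict[int, int]) -> bytes:
    """Build a PFC frame without FCS. pause_quanta maps priority 0-7 to a 16-bit timer."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7 or not 0 <= quanta <= 0xFFFF:
            raise ValueError("priority must be 0-7 and quanta must fit in 16 bits")
        enable_vector |= 1 << priority  # mark this class as addressed by the frame
        timers[priority] = quanta       # 0 means unpause for an enabled class
    body = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, enable_vector, *timers)
    return (PFC_DST_MAC + src_mac + body).ljust(60, b"\x00")  # pad to minimum frame size


def quanta_to_seconds(quanta: int, link_bps: float) -> float:
    """Convert a pause timer to seconds: one quantum is 512 bit times."""
    return quanta * 512 / link_bps


xoff = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
xon = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0})
print(xoff.hex())
print(f"{quanta_to_seconds(0xFFFF, 100e9) * 1e6:.1f} microseconds at 100 Gb/s")
```

Both frames set the enable bit for priority 3; only the timer value differs between pause and unpause.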

PFC uses two threshold concepts:

| Signal | Meaning | Trigger |
| --- | --- | --- |
| XOFF | Transmit off | Queue crosses high threshold |
| XON | Transmit on | Queue drains below low threshold |

When buffer usage crosses the XOFF threshold, the switch sends a PFC pause frame upstream. When the queue drains below the XON threshold, the switch sends an unpause signal.

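The chapter’s example uses a pause timer value of 65535 for an XOFF frame and 0 for an XON frame. The timer is a 16-bit field counted in pause quanta of 512 bit times, so 65535 is the maximum value rather than a number of microseconds, and the actual pause duration depends on link speed.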

PFC thresholds must leave enough headroom for traffic already in flight.
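
A first-order headroom estimate can be sketched as the bytes in flight over one cable round trip plus two maximum-size frames: one the upstream device may have just started sending, and one the local port may be transmitting when it wants to send the pause. This is a deliberately simplified model. Vendors publish their own formulas, and the 5 ns per meter propagation figure, the MTU, and the default response delay are assumptions.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu_bytes: int,
                       response_delay_ns: float = 0.0, ns_per_m: float = 5.0) -> float:
    """Simplified estimate of buffer needed above XOFF to absorb in-flight traffic."""
    round_trip_ns = 2 * cable_m * ns_per_m + response_delay_ns
    in_flight_bytes = link_gbps * round_trip_ns / 8  # Gb/s times ns gives bits
    return in_flight_bytes + 2 * mtu_bytes


headroom = pfc_headroom_bytes(link_gbps=100, cable_m=100, mtu_bytes=4200)
print(f"{headroom:.0f} bytes of headroom per lossless priority")
```

The estimate grows with link speed and cable length, which is why headroom tuned for short 100G links may be too small for longer or faster links.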

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef low fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef ecn fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef pfc fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    XON[XON threshold<br/>resume traffic]:::low
    ECN[ECN threshold<br/>mark packets]:::ecn
    XOFF[XOFF threshold<br/>send PFC pause]:::pfc

    XON --> ECN --> XOFF
```

In most designs, the ECN threshold is lower than the PFC XOFF threshold. ECN should reduce the sender rate before PFC is needed. PFC acts as a stronger loss-prevention mechanism when queues continue to grow.
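
A threshold plan can be sanity-checked before it is pushed to devices. The sketch below encodes only the ordering rules stated in this chapter plus a buffer-fit check; the function name, units, and example values are hypothetical.

```python
def check_thresholds(xon_kb: float, ecn_kb: float, xoff_kb: float,
                     headroom_kb: float, buffer_kb: float) -> list[str]:
    """Return a list of problems found in a per-queue threshold plan."""
    problems = []
    if not xon_kb < xoff_kb:
        problems.append("XON must be below XOFF")
    if not ecn_kb < xoff_kb:
        problems.append("ECN threshold should be below XOFF")
    if xoff_kb + headroom_kb > buffer_kb:
        problems.append("XOFF plus headroom exceeds the buffer")
    return problems


print(check_thresholds(xon_kb=100, ecn_kb=300, xoff_kb=600, headroom_kb=200, buffer_kb=1000))
print(check_thresholds(xon_kb=100, ecn_kb=700, xoff_kb=600, headroom_kb=500, buffer_kb=1000))
```

An empty list means the plan respects the ordering; it says nothing about whether the values suit the workload, which still has to be measured.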

PFC pauses a priority class, not a single flow.

If Flow A causes congestion in priority class 3, PFC can pause class 3. That also pauses Flow B and Flow C in the same priority class, even if they did not cause congestion.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef flow fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef queue fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef paused fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    A[Flow A<br/>congesting]:::flow
    B[Flow B<br/>innocent]:::flow
    C[Flow C<br/>innocent]:::flow
    Q[Priority 3 queue]:::queue
    P[PFC pause<br/>entire class]:::paused

    A --> Q
    B --> Q
    C --> Q
    Q --> P
    P -.-> A
    P -.-> B
    P -.-> C
```

This is head-of-line, HOL, blocking. It can increase training iteration time because unrelated GPU flows wait behind the paused class.

A PFC storm happens when pause frames propagate upstream and trigger wider backpressure through the fabric.

Typical sequence:

  1. A downstream switch sends excessive PFC pause frames.
  2. An upstream switch pauses the class.
  3. The upstream switch’s own queues build.
  4. It sends more PFC pause frames further upstream.
  5. The pause behavior spreads and can degrade many flows.

PFC storms are dangerous in AI fabrics because a synchronized training job can be gated by the slowest path.

PFC watchdog detects and mitigates persistent abnormal PFC backpressure.

It has three functions:

| Function | Role |
| --- | --- |
| Detection | Monitor pause frames per port or queue and compare to thresholds |
| Mitigation | Drop or forward traffic to break storm propagation |
| Restoration | Resume normal lossless behavior when storm signals fall below threshold |

Mitigation options:

  • Drop packets already in the output queue.
  • Drop new packets for the affected queue or priority group.
  • Drop or suppress additional PFC frames to limit propagation.
  • Forward despite PFC, depending on platform policy.

Dropping is commonly used because it prevents the PFC storm from spreading, but it means the fabric is no longer strictly lossless during mitigation.
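
The detection, mitigation, and restoration cycle can be sketched as a small state machine driven by periodic polls. This is a conceptual model only: the class name, the interval counts, and the boolean input are hypothetical, and real platforms define their own timers and per-queue signals.

```python
from enum import Enum


class WdState(Enum):
    NORMAL = "normal"
    MITIGATING = "mitigating"


class PfcWatchdog:
    """Per-queue watchdog: detect a persistent pause, mitigate, then restore."""

    def __init__(self, detect_intervals: int = 3, restore_intervals: int = 2):
        self.detect_intervals = detect_intervals
        self.restore_intervals = restore_intervals
        self.state = WdState.NORMAL
        self._paused_streak = 0
        self._clear_streak = 0

    def poll(self, queue_stuck_paused: bool) -> WdState:
        """Call once per polling interval with whether the queue was held paused."""
        if self.state is WdState.NORMAL:
            self._paused_streak = self._paused_streak + 1 if queue_stuck_paused else 0
            if self._paused_streak >= self.detect_intervals:
                self.state = WdState.MITIGATING  # platform policy: drop or forward
                self._clear_streak = 0
        else:
            self._clear_streak = 0 if queue_stuck_paused else self._clear_streak + 1
            if self._clear_streak >= self.restore_intervals:
                self.state = WdState.NORMAL      # resume lossless behavior
                self._paused_streak = 0
        return self.state


wd = PfcWatchdog(detect_intervals=3, restore_intervals=2)
history = [wd.poll(p).value for p in (True, True, True, False, True, False, False)]
print(history)
```

Requiring several consecutive clear intervals before restoration keeps the watchdog from flapping when a storm briefly subsides.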


Data Center Quantized Congestion Notification, DCQCN

Data Center Quantized Congestion Notification, DCQCN, is a RoCEv2 congestion control mechanism that combines ECN/CNP with PFC behavior.

The idea:

  • ECN provides early end-to-end congestion signaling.
  • CNP tells the sender which flow or QP should slow down.
  • The sender performs rate control.
  • PFC provides stronger hop-by-hop backpressure if queues keep growing.
  • PFC watchdog protects against persistent PFC storm behavior.

DCQCN normally places the ECN threshold below the PFC XOFF threshold.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef normal fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef ecn fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef pfc fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef restore fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    N[Normal queue use]:::normal
    E[ECN threshold crossed<br/>mark CE, receiver sends CNP]:::ecn
    P[PFC XOFF crossed<br/>hop-by-hop pause]:::pfc
    R[XON crossed after drain<br/>resume traffic]:::restore

    N --> E --> P --> R
```

Why this ordering matters:

  • ECN should react first and preserve other flows in the queue.
  • PFC should be a later protection against drops.
  • If PFC triggers too often, it can reduce bandwidth and create HOL blocking.
  • If ECN is too high or too slow, packet drops can occur before rate control reacts.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Src as Sender NIC
    participant LeafA as Leaf A
    participant Spine as Spine A
    participant LeafD as Leaf D
    participant Dst as Destination NIC

    Src->>LeafA: RoCEv2 traffic
    LeafA->>Spine: Forward
    Note over Spine,LeafD: Queue crosses ECN threshold
    Spine->>LeafD: ECN mark CE=11
    LeafD->>Dst: Deliver ECN-marked packet
    Dst-->>Src: CNP
    Src->>Src: DCQCN rate reduction
    Note over Spine,LeafD: If queue keeps growing
    Spine-->>LeafA: PFC XOFF for priority class
    LeafA-->>Src: Hop-by-hop pause may propagate
```

DCQCN is therefore a combined control loop:

  • ECN/CNP attempts to slow the sender before loss.
  • PFC prevents packet loss when buffer pressure becomes urgent.
  • PFC watchdog limits damage if pause behavior becomes pathological.
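
The sender side of the ECN/CNP loop can be sketched as follows. This chapter does not give the rate formulas, so the update rules here are an assumption: they loosely follow the published DCQCN design (multiplicative decrease scaled by a congestion estimate `alpha`, then recovery toward a remembered target rate). The class name, the gain `g`, and the timer hooks are illustrative, and real NIC firmware differs in detail.

```python
class DcqcnSender:
    """Simplified per-flow DCQCN reaction point (assumed update rules, not NIC firmware)."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.line_rate = line_rate_gbps
        self.rc = line_rate_gbps  # current sending rate
        self.rt = line_rate_gbps  # target rate remembered from before the last cut
        self.alpha = 1.0          # running estimate of how congested the path is
        self.g = g                # gain used when updating alpha

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut the current rate, raise alpha."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP arrived during the last alpha-update period: decay alpha."""
        self.alpha *= 1 - self.g

    def on_fast_recovery(self) -> None:
        """Recovery step: move halfway back toward the target rate."""
        self.rc = min(self.line_rate, (self.rt + self.rc) / 2)


sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()
print(sender.rc)            # 50.0
sender.on_fast_recovery()
print(sender.rc)            # 75.0
```

Because `alpha` decays while no CNPs arrive, a flow that has been quiet for a while takes a smaller cut on its next CNP than a flow that is marked continuously.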

Operationally, DCQCN depends on telemetry.

| Counter / Signal | Meaning |
| --- | --- |
| ECN marked packet count | How often switches mark congestion |
| CNP RX/TX count | How often receivers/senders participate in rate reduction |
| PFC RX/TX count | How often pause frames are sent or received |
| PFC storm detection count | Whether watchdog mitigation is happening |
| Queue buffer occupancy | Whether thresholds are reasonable |
| WRED/drop count | Whether ECN/PFC failed to prevent loss |
| RDMA retransmission or error counters | Whether RoCEv2 traffic is suffering loss or timeout |
| Per-priority queue usage | Whether one traffic class is dominating |
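
One way to use these counters is to snapshot them before and after a test run and look at the deltas. The sketch below assumes the counters have already been collected into dictionaries; the key names are hypothetical and do not correspond to any vendor's counter names, and the rules are simple heuristics rather than a diagnosis.

```python
def counter_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Difference between two counter snapshots taken around a test run."""
    return {name: after[name] - before.get(name, 0) for name in after}


def triage(delta: dict[str, int]) -> list[str]:
    """Turn counter deltas into a short list of things to investigate."""
    findings = []
    if delta.get("drops", 0) > 0:
        findings.append("drops seen: ECN/PFC did not prevent loss")
    if delta.get("pfc_watchdog_events", 0) > 0:
        findings.append("PFC watchdog fired: check for pause storms")
    if delta.get("pfc_tx", 0) > 0 and delta.get("ecn_marked", 0) == 0:
        findings.append("PFC without ECN marks: check ECN enablement and threshold order")
    if delta.get("ecn_marked", 0) > 0 and delta.get("cnp_rx", 0) == 0:
        findings.append("ECN marks without CNPs: check endpoint CNP behavior")
    return findings


before = {"ecn_marked": 10, "cnp_rx": 4, "pfc_tx": 0, "drops": 0, "pfc_watchdog_events": 0}
after = {"ecn_marked": 10, "cnp_rx": 4, "pfc_tx": 25, "drops": 3, "pfc_watchdog_events": 0}
for finding in triage(counter_deltas(before, after)):
    print(finding)
```

Switch counters and NIC counters come from different devices, so in practice the snapshots need to be aligned in time before the deltas are compared.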

Source Flow Control, SFC, is described as a newer flow-based mechanism. It is also referred to as Source PFC in the chapter.

The main idea:

Instead of pausing a whole class hop-by-hop, the congested device sends a direct signal toward the source flow.

SFC attempts to combine some strengths of ECN and PFC:

  • Faster than ECN because the congested switch sends a signal directly toward the source.
  • More precise than PFC because it targets a flow rather than a whole priority class.
  • Avoids some PFC side effects such as HOL blocking and storm propagation.
  • Can be implemented at edge or ToR level depending on device support.

SFC and CSIG congestion signaling

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Src as Source NIC
    participant LeafA as Leaf A
    participant Spine as Spine A
    participant LeafD as Leaf D
    participant Dst as Destination

    Src->>LeafA: RoCEv2 flow
    LeafA->>Spine: Forward
    Note over Spine,LeafD: Queue crosses SFC threshold
    Spine-->>LeafA: SFC signal for congested flow
    alt NIC supports SFC
        LeafA-->>Src: Forward SFC signal
        Src->>Src: Pause or rate control that flow
    else NIC does not support SFC
        LeafA-->>Src: Translate to PFC or local action
    end
```

The chapter describes SFC signal creation this way:

  1. A switch detects congestion on a link based on queue usage and an SFC threshold.
  2. The switch creates an SFC signal for the congested flow.
  3. The source and destination IP addresses are reversed.
  4. The packet payload may be trimmed.
  5. The signal is sent back toward the source.

This lets the source learn about congestion without waiting for the original packet to reach the destination and trigger a CNP.
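
The signal-creation steps above can be sketched as a function that turns a data packet into a signal packet once the queue crosses the SFC threshold. The packet model, field names, and function name are hypothetical and much simpler than a real header stack; the sketch only illustrates the address reversal and payload trimming.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    flow_id: int   # hypothetical flow identifier, for example derived from the QP
    payload: bytes


def maybe_make_sfc_signal(pkt: Packet, queue_bytes: int, sfc_threshold_bytes: int,
                          keep_bytes: int = 0) -> Optional[Packet]:
    """Return a signal addressed back to the source if the queue is over threshold."""
    if queue_bytes <= sfc_threshold_bytes:
        return None  # no congestion detected on this queue
    return Packet(
        src_ip=pkt.dst_ip,                 # addresses are reversed so the signal
        dst_ip=pkt.src_ip,                 # routes back toward the original sender
        flow_id=pkt.flow_id,               # lets the source identify the flow
        payload=pkt.payload[:keep_bytes],  # payload is trimmed
    )


data = Packet("10.0.1.10", "10.0.4.20", flow_id=42, payload=b"x" * 4096)
print(maybe_make_sfc_signal(data, queue_bytes=900_000, sfc_threshold_bytes=500_000))
```

The signal leaves the congested switch immediately, so the feedback delay covers only the path back to the source rather than the full trip through the receiver.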

| Mechanism | Signal Path | Granularity | Main Strength | Main Weakness |
| --- | --- | --- | --- | --- |
| ECN | Congested switch to receiver to sender | Flow/QP through CNP | End-to-end and flow-aware | Reaction delay |
| PFC | Hop-by-hop upstream | Priority class | Immediate lossless pause | HOL blocking and PFC storms |
| SFC | Congested switch toward source | Flow | Faster and more precise | Requires newer support |

SFC is useful because it attacks the two main drawbacks of existing mechanisms: ECN delay and PFC class-level blast radius.


Congestion Signaling, CSIG, is an emerging IETF draft mechanism for direct, real-time, in-band congestion signaling.

CSIG uses in-band network telemetry, INT, ideas:

  • Live data packets carry compact congestion metadata.
  • Switches update CSIG tags along the path.
  • The receiver reflects the information back to the sender.
  • The sender can use the bottleneck data for rate control or path selection.

CSIG adds a Layer 2 tag between the Ethernet header and Layer 3 header. The chapter notes that it is structurally similar to a VLAN tag and can coexist with a VLAN tag when the CSIG tag is the last tag in the Layer 2 header.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef field fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef tag fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    ETH[Ethernet header]:::field
    VLAN[Optional VLAN tag]:::field
    CSIG[CSIG L2 tag<br/>congestion metadata]:::tag
    IP[IP header]:::field
    L4[TCP / UDP / RoCEv2]:::field
    PAY[Payload]:::field

    ETH --> VLAN --> CSIG --> IP --> L4 --> PAY
```

End-to-end flow:

  1. Source sends a packet with CSIG support.
  2. Each hop can update the CSIG tag with local bottleneck information.
  3. Destination receives accumulated congestion metadata.
  4. Destination reflects the information to the source.
  5. Source adjusts rate or path behavior.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Src as Source GPU
    participant L1 as Leaf
    participant S as Spine
    participant L2 as Egress Leaf
    participant Dst as Destination GPU

    Src->>L1: Data packet with CSIG tag
    L1->>S: Update tag if needed
    S->>L2: Update tag with bottleneck signal
    L2->>Dst: Update tag if needed
    Dst-->>Src: Reflect CSIG metadata
    Src->>Src: Adjust rate or path selection
```

CSIG metadata can include:

| Metadata | Meaning |
| --- | --- |
| Bottleneck capacity | Capacity of the limiting link |
| Bottleneck stage | Leaf, spine, super-spine, or other stage |
| Device ID | Which device observed the bottleneck |
| Link identification | Uplink, downlink, or link ID |
| Quantized signal | Compact congestion value |
| Available bandwidth or queue signal | Summary of path pressure |

The point is not to expose unlimited telemetry. It is to carry a compact fixed-length summary that a sender or control loop can use quickly.
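
The per-hop update can be sketched as a fold along the path in which each device overwrites the tag only when it is a tighter bottleneck than anything seen so far. The tag fields, the min-style rule for available bandwidth, and the device names are assumptions for illustration; the draft defines the actual signals and their encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CsigTag:
    available_gbps: float = float("inf")  # smallest available bandwidth seen so far
    device_id: Optional[str] = None       # device that reported that value


def update_tag(tag: CsigTag, device_id: str, available_gbps: float) -> CsigTag:
    """Overwrite the tag only if this hop is more constrained than earlier hops."""
    if available_gbps < tag.available_gbps:
        return CsigTag(available_gbps, device_id)
    return tag


tag = CsigTag()
for device_id, available in [("leaf-1", 80.0), ("spine-2", 15.0), ("leaf-7", 60.0)]:
    tag = update_tag(tag, device_id, available)
print(tag)  # the receiver would reflect this summary back to the sender
```

The tag stays the same size no matter how many hops the packet crosses, which matches the fixed-length summary described above.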

CSIG may help AI fabrics because it can:

  • Identify where the path bottleneck is.
  • Provide in-band telemetry without a separate polling loop.
  • Support better path selection.
  • Inform sender-side rate control.
  • Work with encrypted payloads because the signal is in a Layer 2 tag.
  • Evolve beyond ECN and PFC’s limited signal models.

CSIG is still new. It requires endpoint, switch, and standards support before it becomes a normal production mechanism.


| Mechanism | Scope | Signal Direction | Granularity | Speed | Main Risk |
| --- | --- | --- | --- | --- | --- |
| ECN | End-to-end | Switch marks packet, receiver sends CNP | Flow/QP | Medium | Too slow for microbursts |
| PFC | Hop-by-hop | Downstream pauses upstream | Priority class | Fast | HOL blocking, PFC storm |
| PFC watchdog | Local switch protection | Detect and suppress abnormal pause behavior | Port/queue/class | Fast after threshold | Drops may occur during mitigation |
| DCQCN | End-to-end plus hop-by-hop | ECN/CNP rate control plus PFC fallback | Flow/QP plus class fallback | Medium to fast | Needs careful tuning |
| SFC | Congested switch to source | Direct source signal | Flow | Fast | Newer support required |
| CSIG | In-band path telemetry | Hops update tag, receiver reflects | Path/bottleneck metadata | Potentially fast | Emerging standard and support |

Recommended mental model:

| Condition | Preferred Response |
| --- | --- |
| Mild queue growth | ECN mark and CNP-driven rate reduction |
| Queue pressure continues | DCQCN sender rate control |
| Loss is imminent | PFC pause for the priority class |
| Pause behavior persists | PFC watchdog detection and mitigation |
| Need faster flow-specific control | SFC |
| Need path bottleneck telemetry | CSIG |

RoCEv2 congestion management must be validated under realistic AI traffic.

Checklist:

  • Confirm ECN is enabled on endpoints and every transit device in the path.
  • Confirm ECN thresholds are below PFC XOFF thresholds.
  • Confirm PFC is enabled only on intended priorities and links.
  • Validate XOFF/XON headroom against link speed, cable distance, MTU, and buffer size.
  • Monitor ECN marked packet counts during NCCL and storage bursts.
  • Monitor CNP RX/TX counters on NICs.
  • Monitor PFC RX/TX counters per port and priority.
  • Test PFC watchdog detection, mitigation, and restoration.
  • Validate that watchdog mitigation behavior is acceptable for the workload.
  • Check queue occupancy and WRED/drop counters.
  • Check RDMA retransmission, timeout, and error counters.
  • Validate leaf-to-spine, spine-to-leaf, and leaf-to-server incast cases.
  • Test five-stage congestion if the fabric includes super-spines.
  • Verify load balancing from Chapter 6 before increasing PFC dependence.
  • Record job-level impact: GPU utilization, collective latency, p99 step time, and JCT.

```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef config fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef test fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef signal fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef decision fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fix fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    C[Configure ECN, PFC,<br/>DCQCN, QoS, buffers]:::config
    W[Run AI-like workload<br/>NCCL, RDMA, storage bursts]:::test
    M[Measure ECN, CNP,<br/>PFC, queues, drops]:::signal
    J[Measure job impact<br/>GPU idle, p99 step, JCT]:::signal
    D{Congestion controlled<br/>without excessive pause or loss?}:::decision
    A[Accept operating envelope]:::config
    F[Retune thresholds,<br/>load balancing, QoS, watchdog]:::fix

    C --> W --> M --> J --> D
    D -->|Yes| A
    D -->|No| F
    F --> W
```

RoCEv2 is a high-performance transport for AI/ML fabrics, but it depends on carefully engineered congestion management.

The main takeaways:

  • RoCEv2 carries RDMA over UDP/IP and avoids TCP’s CPU and state overhead.
  • Ethernet is lossy by default, so RoCEv2 fabrics require ECN, PFC, DCQCN, and careful buffer tuning.
  • Congestion can appear at local leaf links, leaf-spine uplinks, spine-leaf downlinks, leaf-server links, and super-spine stages.
  • ECN marks congestion and relies on CNP to tell the sender to reduce rate.
  • ECN is flow-aware but can be too slow for fast bursts.
  • PFC prevents drops with hop-by-hop pause, but it pauses a whole priority class.
  • PFC can cause head-of-line blocking and PFC storms.
  • PFC watchdog detects, mitigates, and restores from abnormal pause behavior.
  • DCQCN combines ECN/CNP and PFC into a RoCEv2 congestion control loop.
  • ECN thresholds should normally be lower than PFC XOFF thresholds.
  • SFC aims to send direct flow-level congestion signals toward the source.
  • CSIG uses in-band telemetry tags to report path bottlenecks and can help future sender rate control and path selection.

| Term | Meaning |
| --- | --- |
| RoCEv2 | RDMA over Converged Ethernet version 2; RDMA over UDP/IP |
| RDMA | Remote Direct Memory Access |
| ECN | Explicit Congestion Notification |
| ECT | ECN-Capable Transport |
| CE | Congestion Experienced |
| CNP | Congestion Notification Packet |
| PFC | Priority Flow Control |
| XOFF | Pause signal; transmit off |
| XON | Resume signal; transmit on |
| HOL blocking | Head-of-line blocking; unrelated flows wait behind a paused class |
| PFC storm | Cascading or persistent PFC pause behavior across the fabric |
| PFC watchdog | Mechanism to detect, mitigate, and restore from PFC storms |
| DCQCN | Data Center Quantized Congestion Notification |
| DSCP | Differentiated Services Code Point |
| QoS | Quality of Service |
| CoS | Class of Service |
| WRED | Weighted Random Early Detection |
| SFC | Source Flow Control |
| CSIG | Congestion Signaling |
| INT | In-band Network Telemetry |
| QP | RDMA Queue Pair |
| CNP opcode | RoCEv2 CNP BTH opcode, identified in the chapter as 129 |

1. What is RoCEv2, and why is it important for AI/ML clusters?

RoCEv2 encapsulates RDMA traffic in UDP/IP packets so RDMA can run across routed Ethernet fabrics.

It is important for AI/ML clusters because distributed training requires fast synchronization of large data chunks between GPU servers. RoCEv2 provides low latency, high throughput, and CPU bypass. The trade-off is that the Ethernet fabric must be tuned to avoid drops and uncontrolled congestion.

2. Where can congestion happen in a RoCEv2 AI fabric?

Congestion can happen at many points: local leaf links, leaf-to-spine uplinks, spine-to-leaf downlinks, leaf-to-server links, spine-to-super-spine links, and super-spine-to-spine links.

The common pattern is incast or flow collision. Several line-rate sources converge on one output link or one destination NIC. This can happen even if the fabric has no intentional oversubscription.

3. How does ECN work in a RoCEv2 fabric?

When a switch queue exceeds the ECN threshold, the switch marks packets with ECN CE bits instead of dropping them. The receiver sees the marked packet and sends a CNP back to the sender. The sender then reduces the rate for the affected flow or QP.

ECN is useful because it is flow-aware and end-to-end. Its weakness is delay: the marked packet must reach the receiver before the sender receives feedback.

4. What is a CNP?

CNP means Congestion Notification Packet. It is a RoCEv2 notification sent by the receiver back to the sender after the receiver sees ECN-marked traffic.

The CNP identifies the congested flow or QP so the sender can reduce the rate of that specific traffic. The chapter identifies the CNP BTH opcode as 129.

5. How does PFC differ from ECN?

ECN is an end-to-end marking and rate-control signal. PFC is a hop-by-hop Ethernet pause mechanism.

PFC prevents packet drops by pausing a priority class on the upstream link. It is fast, but it is coarse. It pauses the whole class, not just the congesting flow. That can create HOL blocking and PFC storms.

6. What are XOFF and XON?

XOFF is the PFC pause signal. It is generated when queue usage crosses the high threshold. XON is the resume signal. It is generated when queue usage drains below the lower threshold.

The XOFF/XON gap prevents the fabric from rapidly toggling pause and resume when queue usage is near the threshold.

7. What is a PFC storm?

A PFC storm is a cascading pause condition. A downstream device sends excessive PFC pause frames. The upstream device pauses traffic, its own queues build, and it sends pause frames further upstream.

This can spread backpressure through the fabric and stall unrelated traffic in the same priority class.

8. What does PFC watchdog do?

PFC watchdog monitors pause behavior and detects abnormal or persistent PFC storms. Once a threshold is exceeded, it mitigates the storm by dropping or forwarding affected traffic and suppressing further propagation. It later restores normal behavior when pause signals fall below the configured threshold.

It is a safety mechanism. It protects the fabric, but mitigation may sacrifice lossless behavior temporarily.

9. How does DCQCN combine ECN and PFC?

DCQCN uses ECN and CNP for early sender rate reduction. If queue usage keeps rising and crosses the PFC XOFF threshold, PFC provides hop-by-hop pause to prevent drops.

In a well-tuned fabric, ECN should usually react before PFC is needed. PFC should be a fallback, not the primary congestion control path.

10. Why are SFC and CSIG interesting for future AI fabrics?

SFC is interesting because it sends a direct flow-level signal from the congested device toward the source. It avoids the receiver round trip of ECN/CNP and avoids the class-level blast radius of PFC.

CSIG is interesting because it embeds compact congestion metadata in live packets. Switches can update the tag along the path, and the receiver can reflect the path bottleneck information back to the sender. This can support better rate control and path selection.