Chapter 11: Monitoring and Telemetry

This chapter explains how monitoring and telemetry change in AI/ML data center fabrics.

The core idea is:

AI fabric monitoring is not just “is the switch up?” It must explain whether microbursts, queue buildup, PFC pushback, ECN marking, drops, per-hop latency, flow imbalance, storage stalls, or GPU-side symptoms are increasing Job Completion Time or inference tail latency.

The chapter focuses on these topics:

  • SNMP polling, SNMP traps, and syslog
  • Streaming telemetry with gNMI/gRPC-style push models
  • Periodic telemetry and on-change telemetry
  • sFlow, IPFIX, RPC queries, and active probes
  • Historical monitoring versus real-time monitoring
  • AI fabric signals such as queue occupancy, ECN, PFC, frame loss, and buffer utilization
  • Correlating server telemetry with switch telemetry
  • In-Band Flow Analyzer, IFA, for per-hop metadata
  • Mirroring on demand and mirroring on drop
  • Corrective actions and the path toward AI-assisted operations

[Figure: AI fabric monitoring map]

Monitoring a data center normally includes power, temperature, humidity, device health, interface state, and general availability. Those signals still matter. However, AI/ML fabrics introduce a tighter relationship between network behavior and application performance.

In AI training, a brief congestion event can become a job-level performance problem. A small number of slow ranks can make other GPUs wait. A storage stall can reduce GPU utilization. A PFC pause can propagate pressure through a lossless class. ECN marking can trigger congestion-control behavior. Packet drops can force retransmission or recovery.

The result is that AI fabric monitoring must answer questions such as:

  • Which hop or queue introduced latency?
  • Did PFC pause a priority class?
  • Did ECN marking increase before JCT moved?
  • Did a fat flow or low-entropy flow create path imbalance?
  • Did storage reads or checkpoint writes block GPU progress?
  • Is a symptom isolated to one interface, one switch, one rail, one rack, or many racks?
  • Is the corrective action traffic engineering, source throttling, job rescheduling, capacity expansion, or security mitigation?

Traditional Monitoring Is Necessary but Not Sufficient

Traditional monitoring tools give useful baseline visibility.

Method | What It Does Well | Where It Falls Short
SNMP polling | Periodic counters, interface status, CPU, memory, queue counters | Can miss microbursts and fast queue events
SNMP traps | Threshold-based event notification | Only reports configured threshold events
Syslog | Critical errors, link events, hardware faults, routing events | Usually event text, not high-frequency telemetry
Flow telemetry | Traffic characterization and flow visibility | Sampling may miss short-lived events
Active probes | Controlled latency/loss measurements | Probe packets may not match real workload packets

The problem is not that these tools are obsolete. The problem is that AI workloads need finer time resolution and better correlation across network, server, GPU, storage, and scheduler layers.

AI/ML fabrics have several properties that raise the monitoring bar.

AI Fabric Property | Monitoring Implication
RDMA/RoCEv2 traffic | Need ECN, CNP, PFC, drop, queue, and retransmission visibility
Collective communication | Slowest rank affects all ranks
Lossless or near-lossless design | Need to detect pause propagation and buffer pressure
Microburst-sensitive workloads | Polling intervals may be too slow
Large fan-in/fan-out traffic | Need incast, entropy, and elephant-flow analysis
Expensive GPU time | Small network stalls can waste high-value compute
Multi-layer bottlenecks | Need switch, server, storage, and scheduler correlation

AI data center monitoring usually combines several methods rather than relying on one.

SNMP polling uses a collector to query device MIB counters periodically. SNMPv2c and SNMPv3 are common, with SNMPv3 preferred when authentication and privacy are needed.

Typical SNMP polling targets:

Area | Example Data
Interface statistics | Packet count, byte count, error count
Queue statistics | Queue packets, queue drops, queue counters
Buffer utilization | Shared buffer, dedicated buffer, per-interface buffer
FIB/RIB/MAC tables | Forwarding, routing, and MAC table usage
TCAM utilization | ACL and policy table capacity
CPU and memory | Node, line card, or ASIC resource usage

SNMP polling is useful for trend visibility and baseline operations, but it has important limitations in AI fabrics:

  • Polling intervals may miss microbursts.
  • Rate calculations depend on interval length.
  • Counter changes may be detected after the workload impact has already happened.
  • Per-queue and per-buffer visibility may be limited by device support.
  • Polling large fabrics too frequently can create collector and device load.

SNMP traps are device-generated notifications sent when a condition or threshold is met.

Examples:

  • CPU utilization crosses a threshold.
  • Queue or buffer threshold is reached.
  • PFC pushback rate exceeds a configured value.
  • Interface error count increases.
  • TCAM utilization reaches a limit.

Syslog sends textual event messages to collectors.

Syslog Area | Example Event
Critical/fatal | ASIC error, process crash
Interface | Link down/up, optics fault
Security | ACL drop, control-plane protection event
Routing | BGP or OSPF adjacency change
Hardware | Fan, power, temperature

Traps and syslog are useful for events, but they do not replace high-resolution telemetry. They tell you that something happened; they often do not show the full per-hop timing or queue buildup path.

Streaming telemetry changes the model from collector pull to device push. Instead of periodically asking for counters, the device sends structured data to a telemetry collector.

Common properties:

  • Data is pushed from switch, router, firewall, or server agent.
  • gNMI/gRPC-style sessions are common.
  • Data can be periodic or on-change.
  • Multiple collectors can receive concurrent streams.
  • Data can feed a time-series database, dashboard, alerting system, or AI analyzer.
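
The difference between periodic and on-change export can be sketched in a few lines of Python. gNMI's Subscribe RPC distinguishes SAMPLE and ON_CHANGE stream modes; the sketch below only imitates that behavior on an in-memory list and does not use any gNMI client library. The function names, the `deadband` parameter, and the queue-depth values are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[int, int]  # (time_ms, value)


def periodic(samples: Iterable[Sample], every_ms: int) -> List[Sample]:
    """Emit the value only at fixed times, like a sampled subscription."""
    return [(t, v) for t, v in samples if t % every_ms == 0]


def on_change(samples: Iterable[Sample], deadband: int = 0) -> List[Sample]:
    """Emit only when the value moves more than `deadband` from the last emitted value."""
    out: List[Sample] = []
    last: Optional[int] = None
    for t, v in samples:
        if last is None or abs(v - last) > deadband:
            out.append((t, v))
            last = v
    return out


# Hypothetical queue depth observed every 1 ms: quiet, with one short burst.
depth = [(t, 0) for t in range(20)]
depth[7] = (7, 900)
depth[8] = (8, 400)

print(periodic(depth, 10))  # [(0, 0), (10, 0)]
print(on_change(depth))     # [(0, 0), (7, 900), (8, 400), (9, 0)]
```

In this example the 10 ms periodic export never sees the burst, while on-change export reports it with four updates. The trade-off is that a noisy counter can generate many on-change updates, which is one reason a deadband or a per-path choice of mode is usually worth considering.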

[Figure: SNMP and streaming telemetry comparison]

Streaming telemetry can export:

Level | Example Data
Node level | CPU, memory, FIB, RIB, TCAM, global buffer
Interface level | Packet rate, byte rate, errors, drops
Queue level | Queue occupancy, queue drop, ECN mark, PFC stats
Buffer level | Shared buffer, dedicated buffer, per-interface buffer

For AI fabrics, per-interface, per-queue, and per-buffer telemetry is especially important.

Other monitoring methods complement SNMP and streaming telemetry.

Method | Role
sFlow | Sampled packet/flow visibility for traffic characterization
IPFIX | Flow records and exported telemetry records
RPC query | Structured command/API query using XML, JSON, or similar formats
Active probes | Synthetic UDP/TCP/TWAMP-style probes for latency and loss
IFA | Data-plane metadata or cloned probes for per-hop visibility

Active probes can measure loss and latency, but a synthetic probe may not match the packet size, path, queue, or priority of the real workload. IFA is useful because it tags or clones the traffic of interest, so its measurements stay closer to the real data-plane path.

Item | SNMP Polling | Streaming Telemetry
Data model | Collector pulls data | Device pushes data
Timing | Seconds or minutes are common | Sub-second or event-driven is possible
Encoding | MIB/OID model | Structured data, often gNMI/gRPC
Microburst visibility | Weak | Better, depending on device support
On-change support | Limited | Native design option
Per-queue detail | Device dependent | Better fit for queue/buffer telemetry
Collector load | Polling scales with query count | Session and stream management required
AI fabric fit | Baseline and trend monitoring | Real-time fabric visibility

The practical design is usually hybrid:

  • Use SNMP for baseline counters and operational compatibility.
  • Use traps and syslog for important events.
  • Use streaming telemetry for high-frequency queue, buffer, ECN, PFC, and interface visibility.
  • Use flow telemetry for application flow shape and elephant-flow detection.
  • Use IFA or probes when per-hop latency and congestion localization are needed.

AI/ML data centers should monitor more than interface utilization.

[Figure: AI telemetry signal correlation]

Egress buffer utilization is one of the most important early congestion signals.

Useful views:

  • Shared buffer utilization
  • Dedicated buffer utilization
  • Per-interface buffer usage
  • Per-queue occupancy
  • Queue drop counters
  • Burst absorption behavior

If egress buffers remain high, operators may need to add capacity, adjust traffic placement, retune congestion thresholds, or move workloads. In RoCEv2 fabrics, buffer behavior should be correlated with PFC and ECN.

Latency should be measured both at the application layer and inside the fabric.

Layer | Example Signal
Application | Step time, p99 request latency, JCT
Server | GPU idle time, dataloader wait, RDMA counters
Fabric | Per-hop latency, queue residence time, ECN marks
Storage | Read latency, write latency, checkpoint pause

In-band telemetry and IFA-style probes help identify where along the path latency changes: at the ingress leaf, at a spine, or at the egress leaf.

Bandwidth utilization should be measured from leaf to spine, spine to leaf, and across rails or planes.

High utilization is not automatically bad. The problem is when utilization becomes uneven, concentrated, or correlated with tail latency and congestion.

Important checks:

  • Is one spine overutilized while others are underused?
  • Are elephant flows reducing ECMP balance?
  • Is flow entropy too low for hashing to distribute traffic?
  • Does a specific job, tenant, rack, or rail dominate a path?
  • Are link utilization changes correlated with ECN or PFC?

For RoCEv2 and lossless Ethernet, these signals should be watched together.

Signal | Meaning
PFC XOFF / pause | Priority class is being paused because receiver-side buffer pressure exists
ECN mark | Congestion signal for rate control
CNP count | Congestion notification behavior for DCQCN
Frame loss | Queue or link loss that may harm RDMA or TCP
RDMA retransmission | Recovery behavior after loss or timeout
Queue occupancy | Buffer pressure before drops or pauses

PFC and ECN counters should not be interpreted alone. A small number of marks may be normal in a tuned fabric. A fast rise across many ports or a correlation with JCT increase is more important.
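
A minimal way to express "a fast rise across many ports" is to compare two counter snapshots and count how many ports exceeded a rate threshold. The Python sketch below assumes monotonically increasing counters and treats a negative delta as a counter reset. The port names, the threshold of 100 events per second, and the 25% fraction are hypothetical values that would need tuning per fabric.

```python
from typing import Dict, List


def ports_with_fast_rise(prev: Dict[str, int], curr: Dict[str, int],
                         interval_s: float, rate_threshold: float) -> List[str]:
    """Ports whose counter grew faster than rate_threshold events per second."""
    hot = []
    for port, value in curr.items():
        delta = value - prev.get(port, value)
        if delta < 0:
            continue  # counter reset or device restart: skip this interval
        if delta / interval_s > rate_threshold:
            hot.append(port)
    return sorted(hot)


def is_widespread(hot_ports: List[str], total_ports: int, fraction: float = 0.25) -> bool:
    """True when at least `fraction` of all ports are rising fast."""
    return total_ports > 0 and len(hot_ports) / total_ports >= fraction


# Hypothetical ECN-mark counters from two snapshots taken 10 s apart.
prev = {"p1": 100, "p2": 50, "p3": 0, "p4": 10}
curr = {"p1": 5100, "p2": 55, "p3": 4000, "p4": 10}

hot = ports_with_fast_rise(prev, curr, interval_s=10.0, rate_threshold=100.0)
print(hot, is_widespread(hot, total_ports=len(curr)))
```

The same function works for PFC pause counters. The result is only a trigger for investigation; whether the rise matters still depends on correlation with JCT or step time.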

Historical monitoring and real-time monitoring serve different purposes.

Type | Purpose | Example Question
Historical monitoring | Trend, capacity planning, recurring pattern analysis | Did leaf-spine bandwidth grow 10% over several weeks?
Real-time monitoring | Detect and localize active incidents | Which queue is dropping now?

Historical monitoring helps with:

  • Capacity planning
  • Oversubscription review
  • Growth trend analysis
  • Optics and interface reliability history
  • Recurring congestion pattern detection
  • Correlating job placement with fabric symptoms

Real-time monitoring helps with:

  • Queue drop detection
  • PFC and ECN bursts
  • Flow tail latency
  • Fat-flow identification
  • Entropy scoring
  • ASIC memory/CPU pressure
  • Immediate traffic engineering decisions

AI fabrics need both. Historical data explains the trend; real-time data explains the live incident.

Switch telemetry alone is not enough in an AI/ML data center. Server telemetry must be correlated with switch telemetry.

Useful server-side signals:

  • GPU utilization
  • GPU memory pressure
  • GPU communication time
  • RDMA NIC counters
  • Storage read throughput
  • Dataloader wait time
  • Checkpoint write time
  • Scheduler placement and job ID
  • Application step time

Useful switch-side signals:

  • Interface and queue counters
  • Buffer occupancy
  • PFC pause and XOFF behavior
  • ECN marking
  • Drops
  • Per-hop latency
  • Flow records

The operational goal is to connect symptoms:

GPU utilization dropped because dataloaders waited; dataloaders waited because storage read latency rose; storage read latency rose because a leaf queue was congested; the queue was congested because a low-entropy elephant flow overloaded one spine path.

Without correlation, each team sees only one part of the incident.
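
A first step toward this kind of correlation is a time-window join: for each server-side symptom, list the switch events that occurred shortly before it. The Python sketch below assumes both sides report timestamps on a common, synchronized clock. The function name `correlate`, the 2 s lookback, and the sample timestamps are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def correlate(spike_times: List[float], event_times: List[float],
              lookback_s: float) -> List[Tuple[float, List[float]]]:
    """For each spike, return the switch events in the preceding lookback window."""
    events = sorted(event_times)
    result = []
    for spike in spike_times:
        lo = bisect_left(events, spike - lookback_s)
        hi = bisect_right(events, spike)
        result.append((spike, events[lo:hi]))
    return result


# Hypothetical timestamps in seconds on a shared clock.
step_time_spikes = [100.0, 250.0]          # from GPU or framework traces
queue_threshold_events = [98.5, 99.2, 180.0]  # from switch telemetry

for spike, events in correlate(step_time_spikes, queue_threshold_events, 2.0):
    print(spike, events)
```

Here the spike at 100.0 s has two queue events in its window and the spike at 250.0 s has none, which points the second investigation away from that queue. A match in time is evidence, not proof, so job ID, rack, rail, and path labels are still needed to confirm that the event was on the job's path.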

In-Band Flow Analyzer, IFA, adds telemetry metadata to data-plane packets or cloned probe packets. It helps measure per-hop latency, queue behavior, congestion, and path information using packets that follow the same fabric path as the traffic of interest.

[Figure: IFA hop metadata flow]

IFA usually has three roles.

Role | Function
Initiator | Selects traffic, adds IFA header or creates cloned IFA probe
Transit | Appends local metadata at each hop, such as timestamp, queue, congestion, or port data
Termination | Removes metadata or exports metadata stack to a collector, often through IPFIX

In a three-stage Clos, the initiator may be an ingress leaf, the transit node may be a spine, and the termination node may be an egress leaf. In a five-stage fabric, spine and super-spine hops can add metadata.

Conceptually, an IFA packet adds headers and metadata before the original payload.

L2 header
L3 header
IFA header
L4 header
IFA metadata header
IFA metadata stack
Original payload

The chapter notes that the IFA header sits after the IP header. Because IFA adds extra bytes, the fabric MTU must be large enough. AI data center fabrics commonly use jumbo MTU, such as 9000 bytes or more, but the full path must be checked.

MTU checks should include:

  • Server NIC MTU
  • Leaf and spine MTU
  • Overlay or VXLAN overhead
  • Collector path MTU
  • Probe packet size
  • IFA metadata stack growth across hops
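
The effect of metadata stack growth on MTU headroom is simple arithmetic, sketched below in Python. All byte sizes in the example are placeholder assumptions for illustration and must be replaced with values from the device and IFA version in use: a 4-byte IFA header, a 4-byte metadata header, 32 bytes of metadata per hop, 50 bytes of overlay encapsulation, and a 9216-byte path MTU. The class and function names are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IfaOverhead:
    """Byte overheads added on top of the original packet (all values assumed)."""
    ifa_header: int
    metadata_header: int
    per_hop_metadata: int
    overlay: int = 0

    def total(self, hops: int) -> int:
        """Total added bytes after `hops` nodes have appended metadata."""
        return (self.overlay + self.ifa_header + self.metadata_header
                + hops * self.per_hop_metadata)


def max_original_packet(path_mtu: int, overhead: IfaOverhead, hops: int) -> int:
    """Largest original packet that still fits after all hops add metadata."""
    return path_mtu - overhead.total(hops)


def fits(original_packet: int, path_mtu: int, overhead: IfaOverhead, hops: int) -> bool:
    """True when the packet plus all added bytes stays within the path MTU."""
    return original_packet + overhead.total(hops) <= path_mtu


# Placeholder sizes for illustration only; check the platform documentation.
assumed = IfaOverhead(ifa_header=4, metadata_header=4, per_hop_metadata=32, overlay=50)

print(max_original_packet(9216, assumed, hops=3))  # 9062
print(fits(9000, 9216, assumed, hops=3))           # True
print(fits(9000, 9216, assumed, hops=5))           # False
```

With these assumed sizes, a 9000-byte packet fits across three hops but not across five, which illustrates why the check has to be repeated when a fabric grows from three stages to five.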

Each hop can append metadata.

Metadata Field | Why It Matters
Residence time | Time a packet spends inside a hop
Per-hop latency | Link plus device latency between adjacent nodes
Ingress port | Where the packet entered the node
Egress port | Where the packet left the node
RX timestamp | Timestamp used for latency calculation
Queue ID | Which queue carried the packet
Congestion notification | Whether congestion was observed
Egress port speed | Helps interpret delay and path properties
Device ID | Identifies the hop

For AI fabrics, the highest-value fields are often congestion indication, RX timestamp, residence time, queue ID, and egress port.

IFA answers the question:

Which hop, port, or queue added the latency?

The basic workflow:

  1. An initiator selects or clones a packet from the traffic of interest.
  2. Each transit hop appends metadata.
  3. The termination hop exports the metadata stack.
  4. A collector reconstructs hop-by-hop path, queue, congestion, and latency.
  5. The operator correlates this with application symptoms such as JCT or p99 latency.

This is more precise than only knowing that a flow was slow end to end.
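
Step 4 of the workflow, reconstruction at the collector, can be sketched as follows. The `HopMetadata` record is a hypothetical, simplified stand-in for an exported metadata stack; real field names, units, and encodings vary by implementation. The link-latency calculation also assumes that hop clocks are synchronized closely enough for timestamps to be compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HopMetadata:
    """Simplified per-hop record (field names and units are assumptions)."""
    device_id: str
    queue_id: int
    rx_timestamp_ns: int
    residence_ns: int
    congested: bool


def worst_hop(stack: List[HopMetadata]) -> HopMetadata:
    """Hop where the packet spent the most time inside the device."""
    return max(stack, key=lambda hop: hop.residence_ns)


def link_latencies_ns(stack: List[HopMetadata]) -> List[int]:
    """Time between leaving one hop and arriving at the next (needs synced clocks)."""
    latencies = []
    for prev, nxt in zip(stack, stack[1:]):
        departed = prev.rx_timestamp_ns + prev.residence_ns
        latencies.append(nxt.rx_timestamp_ns - departed)
    return latencies


# Hypothetical metadata stack for one packet crossing leaf, spine, leaf.
stack = [
    HopMetadata("leaf1", 3, 1_000_000, 800, False),
    HopMetadata("spine2", 3, 1_001_300, 45_000, True),
    HopMetadata("leaf7", 3, 1_046_800, 900, False),
]

slow = worst_hop(stack)
print(slow.device_id, slow.queue_id, slow.residence_ns)
print(link_latencies_ns(stack))
```

In this made-up stack the links each contribute 500 ns while queue 3 on `spine2` holds the packet for 45 µs and reports congestion, so the spine queue is where the investigation should start.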

IFA is powerful, but it must be used carefully.

Practical constraints:

  • MTU overhead must be validated.
  • Device support and metadata fields vary by implementation.
  • Collector capacity must handle exported metadata.
  • Sampling policy must avoid excessive overhead.
  • Cloned probes should not distort workload behavior.
  • Timestamp accuracy depends on device capability.
  • Metadata handling for multicast traffic matters less in AI backend fabrics, but operators should still know how their platform behaves.

Mirroring is useful when counters prove that something happened but not why.

Method | Use
Mirroring on demand | Mirror selected traffic for manual or automated packet analysis
Mirroring on drop | Export packets that were dropped so the cause can be inspected

Mirroring on drop is useful because not all drops mean the same thing. A drop may be caused by congestion, malformed packets, security policy, control-plane protection, queue limits, or ASIC protection behavior.

In AI fabrics, mirroring should be scoped carefully. Full packet capture at scale is usually impractical. Use filters, sampling, drop reason, interface, queue, and job context when possible.

Monitoring is valuable only if it leads to decisions.

[Figure: Corrective action feedback loop]

Examples of corrective actions:

Symptom | Possible Action
Congestion source identified | Rate-limit, move, or isolate source traffic
Persistent PFC pressure | Retune thresholds, rebalance traffic, add capacity
ECN marks rise across a rail | Review DCQCN, link balance, job placement
Queue drops on a low-priority class | Validate QoS policy and traffic classification
One spine is overutilized | Adjust hashing, traffic engineering, or topology
Job sees high JCT from fabric | Reschedule job or move job to healthier fabric capacity
Security or churn source found | ACL, source block, or control-plane protection
Capacity design is wrong | Add links, change oversubscription, redesign topology

The chapter emphasizes that monitoring cannot compensate for a fundamentally undersized fabric. If oversubscription, load balancing, optics quality, or capacity planning is wrong, telemetry will reveal the problem but cannot magically remove it.

The chapter also discusses the path toward AI-assisted operations.

The idea:

  1. Collect telemetry, syslog, SNMP, IFA, and server data.
  2. Store it in a data lake or time-series database.
  3. Mine patterns and correlate incidents with corrective actions.
  4. Use an AI/LLM analyzer to propose or eventually apply actions.
  5. Move gradually toward autonomous or self-driving network behavior.

This is still an operational evolution. In many environments, humans remain involved before actions are applied. That is appropriate because the cost of a wrong action can be high: killing a training job, moving traffic, blocking a source, or changing congestion-control behavior can affect many GPUs.

Use this checklist when designing monitoring for an AI data center fabric.

  • Confirm which signals are collected by SNMP, syslog, traps, streaming telemetry, flow telemetry, and IFA.
  • Check polling intervals and decide which events require streaming telemetry instead of SNMP.
  • Collect per-interface, per-queue, and per-buffer telemetry.
  • Track ECN, CNP, PFC, drops, queue occupancy, and RDMA retransmission together.
  • Correlate switch telemetry with GPU utilization, GPU communication time, dataloader wait, checkpoint time, and JCT.
  • Include job ID, tenant, rack, rail, server, GPU, and NIC identity in correlation data where possible.
  • Validate time synchronization across collectors, switches, and servers.
  • Store historical telemetry for capacity planning and recurring pattern analysis.
  • Use real-time telemetry for queue drops, PFC bursts, ECN bursts, fat flows, and entropy changes.
  • Validate collector capacity before enabling high-volume telemetry streams.
  • Test IFA MTU overhead across server NICs, leaves, spines, overlays, and collector paths.
  • Define safe mirroring policies for mirroring on demand and mirroring on drop.
  • Define corrective-action playbooks before automating actions.
  • Validate that monitoring can localize whether a problem is one interface, one queue, one switch, one rail, one rack, or fabric-wide.
  • Do not treat monitoring as a substitute for correct topology, oversubscription, optics, and capacity design.

Build a Minimum Useful Telemetry Set First

A telemetry program can fail because it collects too much before it answers the first operational questions. Start with a small set that can explain GPU-visible symptoms.

Question | Minimum Signals
Are GPUs waiting on the network? | Step time, collective duration, GPU communication time, ECN/CNP, queue occupancy
Are GPUs waiting on storage? | Dataloader wait, storage read latency, metadata latency, storage NIC utilization
Did lossless behavior activate? | PFC XOFF, pause duration, queue occupancy, ECN marks, drops
Is congestion localized? | Interface, queue, switch, rack, rail, job ID, path or flow record
Did a corrective action help? | Before/after JCT, p99 step time, p99 latency, drops, retransmissions

Once those questions are answerable, add more detailed telemetry such as IFA, mirroring on drop, or per-flow entropy scoring.
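
The last question in the table, whether a corrective action helped, usually comes down to comparing a tail percentile before and after the change. The Python sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions. The step-time samples are hypothetical.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical step times in seconds, 100 samples on each side of the change.
before = [1.0] * 98 + [1.8, 2.5]
after = [1.0] * 99 + [1.2]

p99_before = percentile(before, 99)
p99_after = percentile(after, 99)
print(p99_before, p99_after)  # 1.8 1.0
```

A comparison like this is only meaningful when both windows cover the same job type and a similar load, so the before and after samples should carry the same job and placement labels.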

Time Synchronization Is Part of Monitoring

Cross-layer correlation depends on time. If switch telemetry, server logs, GPU traces, storage metrics, and scheduler events are not aligned, a correct root cause may look wrong.

Practical checks:

  • Confirm NTP/PTP source and clock drift policy.
  • Record collector ingest time and device event time separately.
  • Check time skew before comparing GPU step-time spikes with queue events.
  • Preserve job ID, host, NIC, rail, rack, and switch identifiers in the same time-series labels.

Sample Carefully Before Turning on Everything

High-cardinality telemetry can overload collectors and make dashboards slow. IFA, flow records, and mirroring are especially easy to overuse.

Use a staged approach:

Stage | Scope
Baseline | Interface, queue, buffer, ECN, PFC, drops, CPU, memory
Job-aware | Add job/rack/rail labels and scheduler metadata
Flow-aware | Add sFlow/IPFIX or selected flow records
Packet-aware | Add IFA, mirroring on demand, mirroring on drop
Automated action | Add scoped playbooks with rollback and human review

The main takeaways:

  • AI fabric monitoring must connect infrastructure signals to workload outcomes such as JCT, GPU idle time, and inference tail latency.
  • SNMP polling remains useful for baseline visibility, but it can miss microbursts and fast queue events.
  • SNMP traps and syslog are useful event mechanisms, but they do not replace high-resolution telemetry.
  • Streaming telemetry provides a better model for high-frequency interface, queue, buffer, ECN, and PFC visibility.
  • Historical monitoring supports trend analysis and capacity planning.
  • Real-time monitoring supports active incident detection and localization.
  • Server telemetry must be correlated with switch telemetry to explain AI workload symptoms.
  • IFA adds or exports per-hop metadata so operators can locate latency, queue, and congestion points.
  • Mirroring on demand and mirroring on drop help inspect packets when counters are not enough.
  • Corrective actions can include traffic engineering, source isolation, job rescheduling, threshold changes, and capacity expansion.
  • AI-assisted operations can analyze telemetry patterns, but automation should be introduced carefully because wrong actions can be expensive.

Term | Meaning
SNMP | Simple Network Management Protocol
MIB | Management Information Base
SNMP trap | Device-generated notification triggered by an event or threshold
Syslog | Event logging protocol and message stream
Streaming telemetry | Device-pushed structured telemetry stream
gNMI | gRPC Network Management Interface
gRPC | Remote procedure call framework often used for telemetry transport
sFlow | Sampled flow monitoring technology
IPFIX | IP Flow Information Export
RPC query | Structured remote query to gather device state
IFA | In-Band Flow Analyzer
Initiator | IFA node that selects or clones traffic and adds IFA information
Transit | IFA node that appends local metadata
Termination | IFA node that removes or exports metadata
Residence time | Time a packet spends inside a hop
Queue ID | Identifier of the queue carrying the packet
PFC | Priority Flow Control
ECN | Explicit Congestion Notification
CNP | Congestion Notification Packet
Fat flow | Large flow that can dominate a path
Flow entropy | Diversity of flow hash inputs and path distribution potential
Mirroring on drop | Exporting dropped packets or copies for analysis
Corrective action | Operational response such as rerouting, blocking, rescheduling, or capacity change

1. Why is ordinary interface monitoring not enough for AI fabrics?

Average interface utilization can hide microbursts, queue buildup, PFC pauses, ECN bursts, and tail-latency events. AI workloads care about whether those events make GPUs wait or increase JCT, so monitoring must include per-queue, per-buffer, per-hop, and server-side signals.

2. When is SNMP polling still useful?

SNMP polling is useful for baseline counters, trend monitoring, interface statistics, CPU, memory, FIB/RIB/MAC usage, and TCAM utilization. It is not enough for fast microburst or per-hop latency analysis.

3. What is the benefit of streaming telemetry?

Streaming telemetry lets devices push structured data to collectors, either periodically or on change. It is better suited for high-frequency queue, buffer, ECN, PFC, and interface visibility than traditional polling.

4. Why should server telemetry be correlated with switch telemetry?

Switch counters may show congestion, but server telemetry shows workload impact. Correlating both tells whether a network event increased GPU idle time, dataloader wait, checkpoint pause, or JCT.

5. What does IFA add beyond normal telemetry?

IFA adds or exports per-hop metadata from data-plane packets or cloned probes. It can identify which hop, port, or queue contributed latency or congestion.

6. What IFA fields matter most in AI fabrics?

Congestion indication, RX timestamp, residence time, queue ID, ingress port, egress port, and device ID are especially useful because they help localize queue and latency problems.

7. Why does MTU matter for IFA?

IFA adds headers and metadata. If the fabric MTU is too small, probe or tagged packets may be fragmented or dropped. The server NIC, leaf, spine, overlay, and collector path MTU must all be checked.

8. What is mirroring on drop used for?

Mirroring on drop exports copies of dropped packets so operators can determine whether drops came from congestion, malformed traffic, security policy, control-plane protection, queue limits, or other causes.

9. What is the difference between historical and real-time monitoring?

Historical monitoring finds trends and supports capacity planning. Real-time monitoring detects active problems such as queue drops, PFC bursts, ECN bursts, fat flows, and tail-latency spikes.

10. Why should corrective actions be introduced carefully?

Corrective actions can affect running jobs and many GPUs. Blocking a source, changing traffic paths, rescheduling a job, or adjusting thresholds can solve one problem while creating another. Automation should start with strong evidence, scoped actions, and human review where risk is high.