Chapter 11: Monitoring and Telemetry
Table of Contents
- Goal
- Why AI Fabrics Need Better Monitoring
- Monitoring Methods
- SNMP and Streaming Telemetry Comparison
- AI Fabric Telemetry Signals
- Historical and Real-Time Monitoring
- Server and Switch Telemetry Correlation
- In-Band Flow Analyzer, IFA
- Mirroring on Demand and Mirroring on Drop
- Corrective Actions
- Operational Validation Checklist
- Practical Tips and Notes
- Chapter Summary
- Key Terms
- Q&A
- References
Goal
This chapter explains how monitoring and telemetry change in AI/ML data center fabrics.
The core idea is:
AI fabric monitoring is not just “is the switch up?” It must explain whether microbursts, queue buildup, PFC pushback, ECN marking, drops, per-hop latency, flow imbalance, storage stalls, or GPU-side symptoms are increasing Job Completion Time or inference tail latency.
The chapter focuses on these topics:
- SNMP polling, SNMP traps, and syslog
- Streaming telemetry with gNMI/gRPC-style push models
- Periodic telemetry and on-change telemetry
- sFlow, IPFIX, RPC queries, and active probes
- Historical monitoring versus real-time monitoring
- AI fabric signals such as queue occupancy, ECN, PFC, frame loss, and buffer utilization
- Correlating server telemetry with switch telemetry
- In-Band Flow Analyzer, IFA, for per-hop metadata
- Mirroring on demand and mirroring on drop
- Corrective actions and the path toward AI-assisted operations
Why AI Fabrics Need Better Monitoring
Monitoring a data center normally includes power, temperature, humidity, device health, interface state, and general availability. Those signals still matter. However, AI/ML fabrics introduce a tighter relationship between network behavior and application performance.
In AI training, a brief congestion event can become a job-level performance problem. A small number of slow ranks can make other GPUs wait. A storage stall can reduce GPU utilization. A PFC pause can propagate pressure through a lossless class. ECN marking can trigger congestion-control behavior. Packet drops can force retransmission or recovery.
The result is that AI fabric monitoring must answer questions such as:
- Which hop or queue introduced latency?
- Did PFC pause a priority class?
- Did ECN marking increase before JCT moved?
- Did a fat flow or low-entropy flow create path imbalance?
- Did storage reads or checkpoint writes block GPU progress?
- Is a symptom isolated to one interface, one switch, one rail, one rack, or many racks?
- Is the corrective action traffic engineering, source throttling, job rescheduling, capacity expansion, or security mitigation?
Traditional Monitoring Is Necessary but Not Sufficient
Traditional monitoring tools give useful baseline visibility.
| Method | What It Does Well | Where It Falls Short |
|---|---|---|
| SNMP polling | Periodic counters, interface status, CPU, memory, queue counters | Can miss microbursts and fast queue events |
| SNMP traps | Threshold-based event notification | Only reports configured threshold events |
| Syslog | Critical errors, link events, hardware faults, routing events | Usually event text, not high-frequency telemetry |
| Flow telemetry | Traffic characterization and flow visibility | Sampling may miss short-lived events |
| Active probes | Controlled latency/loss measurements | Probe packets may not match real workload packets |
The problem is not that these tools are obsolete. The problem is that AI workloads need finer time resolution and better correlation across network, server, GPU, storage, and scheduler layers.
What Changes in AI/ML Data Centers
AI/ML fabrics have several properties that raise the monitoring bar.
| AI Fabric Property | Monitoring Implication |
|---|---|
| RDMA/RoCEv2 traffic | Need ECN, CNP, PFC, drop, queue, and retransmission visibility |
| Collective communication | Slowest rank affects all ranks |
| Lossless or near-lossless design | Need to detect pause propagation and buffer pressure |
| Microburst-sensitive workloads | Polling intervals may be too slow |
| Large fan-in/fan-out traffic | Need incast, entropy, and elephant-flow analysis |
| Expensive GPU time | Small network stalls can waste high-value compute |
| Multi-layer bottlenecks | Need switch, server, storage, and scheduler correlation |
Monitoring Methods
AI data center monitoring usually combines several methods rather than relying on one.
SNMP Polling
SNMP polling uses a collector to query device MIB counters periodically. SNMPv2c and SNMPv3 are common, with SNMPv3 preferred when authentication and privacy are needed.
Typical SNMP polling targets:
| Area | Example Data |
|---|---|
| Interface statistics | Packet count, byte count, error count |
| Queue statistics | Queue packets, queue drops, queue counters |
| Buffer utilization | Shared buffer, dedicated buffer, per-interface buffer |
| FIB/RIB/MAC tables | Forwarding, routing, and MAC table usage |
| TCAM utilization | ACL and policy table capacity |
| CPU and memory | Node, line card, or ASIC resource usage |
SNMP polling is useful for trend visibility and baseline operations, but it has important limitations in AI fabrics:
- Polling intervals may miss microbursts.
- Rate calculations depend on interval length.
- Counter changes may be detected after the workload impact has already happened.
- Per-queue and per-buffer visibility may be limited by device support.
- Polling large fabrics too frequently can create collector and device load.
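The interval dependence can be made concrete with a small calculation. The sketch below is illustrative only: the link speed, burst length, and polling intervals are made-up values, and `rate_bps` is a hypothetical helper, not part of any SNMP library. It shows how one line-rate burst nearly disappears in a long polling interval.

```python
COUNTER64_MAX = 2**64  # size of a 64-bit octet counter


def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             counter_max: int = COUNTER64_MAX) -> float:
    """Average bit rate between two polls of an increasing octet counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modulo arithmetic tolerates a single counter wrap between polls.
    delta = (curr_octets - prev_octets) % counter_max
    return delta * 8 / interval_s


# Assumed scenario: a 400 Gb/s link that is idle except for one 10 ms burst at line rate.
LINK_BPS = 400_000_000_000
BURST_MS = 10
burst_octets = LINK_BPS * BURST_MS // 1000 // 8

avg_30s = rate_bps(0, burst_octets, 30.0)  # what a 30 s poll reports
avg_1s = rate_bps(0, burst_octets, 1.0)    # what a 1 s poll reports

print(f"30 s poll: {avg_30s / LINK_BPS:.4%} average utilization")
print(f"1 s poll:  {avg_1s / LINK_BPS:.4%} average utilization")
```

In both cases the link was saturated for 10 ms, yet the averaged counter suggests an almost idle interface. Shorter polling narrows the gap but does not close it, which is why queue and buffer signals are usually collected by other means.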
SNMP Traps and Syslog
SNMP traps are device-generated notifications sent when a condition or threshold is met.
Examples:
- CPU utilization crosses a threshold.
- Queue or buffer threshold is reached.
- PFC pushback rate exceeds a configured value.
- Interface error count increases.
- TCAM utilization reaches a limit.
Syslog sends textual event messages to collectors.
| Syslog Area | Example Event |
|---|---|
| Critical/fatal | ASIC error, process crash |
| Interface | Link down/up, optics fault |
| Security | ACL drop, control-plane protection event |
| Routing | BGP or OSPF adjacency change |
| Hardware | Fan, power, temperature |
Traps and syslog are useful for events, but they do not replace high-resolution telemetry. They tell you that something happened; they often do not show the full per-hop timing or queue buildup path.
Streaming Telemetry
Streaming telemetry changes the model from collector pull to device push. Instead of periodically asking for counters, the device sends structured data to a telemetry collector.
Common properties:
- Data is pushed from switch, router, firewall, or server agent.
- gNMI/gRPC-style sessions are common.
- Data can be periodic or on-change.
- Multiple collectors can receive concurrent streams.
- Data can feed a time-series database, dashboard, alerting system, or AI analyzer.
Streaming telemetry can export:
| Level | Example Data |
|---|---|
| Node level | CPU, memory, FIB, RIB, TCAM, global buffer |
| Interface level | Packet rate, byte rate, errors, drops |
| Queue level | Queue occupancy, queue drop, ECN mark, PFC stats |
| Buffer level | Shared buffer, dedicated buffer, per-interface buffer |
For AI fabrics, per-interface, per-queue, and per-buffer telemetry is especially important.
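The difference between periodic and on-change export can be sketched without any real telemetry stack. The example below is a simplified model, not gNMI client code: `periodic` and `on_change` are hypothetical helpers, and the observation series (a pause state sampled every 100 ms that flips briefly) is invented for illustration.

```python
from typing import Iterable, Iterator, Tuple

Sample = Tuple[int, int]  # (timestamp in milliseconds, observed value)


def periodic(samples: Iterable[Sample], interval_ms: int) -> Iterator[Sample]:
    """Emit the first observation, then one at or after each interval boundary."""
    next_due = None
    for ts, value in samples:
        if next_due is None or ts >= next_due:
            yield ts, value
            next_due = ts + interval_ms


def on_change(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Emit an update only when the value differs from the last emitted one."""
    last = object()  # sentinel that never equals a real value
    for ts, value in samples:
        if value != last:
            yield ts, value
            last = value


# Assumed data: state observed every 100 ms for 3 s, set to 1 only at 1200 ms and 1300 ms.
observations = [(i * 100, 1 if i in (12, 13) else 0) for i in range(30)]

print("periodic (1 s):", list(periodic(observations, 1000)))
print("on-change:     ", list(on_change(observations)))
```

With a one-second periodic export the brief state change is never reported, while the on-change export reports it with only three updates in total. Periodic export still suits counters that change continuously, where on-change would emit on every sample.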
sFlow, IPFIX, RPC, and Active Probes
Other monitoring methods complement SNMP and streaming telemetry.
| Method | Role |
|---|---|
| sFlow | Sampled packet/flow visibility for traffic characterization |
| IPFIX | Flow records and exported telemetry records |
| RPC query | Structured command/API query using XML, JSON, or similar formats |
| Active probes | Synthetic UDP/TCP/TWAMP-style probes for latency and loss |
| IFA | Data-plane metadata or cloned probes for per-hop visibility |
Active probes can measure loss and latency, but synthetic probes may not match the original workload packet size, path, queue, or priority. IFA helps here because it tags or clones the traffic of interest, so its measurements follow the same data-plane path as the workload.
SNMP and Streaming Telemetry Comparison
| Item | SNMP Polling | Streaming Telemetry |
|---|---|---|
| Collection model | Collector pulls data | Device pushes data |
| Timing | Seconds or minutes are common | Sub-second or event-driven is possible |
| Encoding | MIB/OID model | Structured data, often gNMI/gRPC |
| Microburst visibility | Weak | Better, depending on device support |
| On-change support | Limited | Native design option |
| Per-queue detail | Device dependent | Better fit for queue/buffer telemetry |
| Collector load | Polling scales with query count | Session and stream management required |
| AI fabric fit | Baseline and trend monitoring | Real-time fabric visibility |
The practical design is usually hybrid:
- Use SNMP for baseline counters and operational compatibility.
- Use traps and syslog for important events.
- Use streaming telemetry for high-frequency queue, buffer, ECN, PFC, and interface visibility.
- Use flow telemetry for application flow shape and elephant-flow detection.
- Use IFA or probes when per-hop latency and congestion localization are needed.
AI Fabric Telemetry Signals
AI/ML data centers should monitor more than interface utilization.
Egress Buffer Utilization
Egress buffer utilization is one of the most important early congestion signals.
Useful views:
- Shared buffer utilization
- Dedicated buffer utilization
- Per-interface buffer usage
- Per-queue occupancy
- Queue drop counters
- Burst absorption behavior
If egress buffers remain high, operators may need to add capacity, adjust traffic placement, retune congestion thresholds, or move workloads. In RoCEv2 fabrics, buffer behavior should be correlated with PFC and ECN.
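One way to separate a buffer that "remains high" from a short absorbed burst is to require several consecutive samples above a threshold. The sketch below assumes occupancy is already available as a series of percentages; `sustained_high`, the 80% threshold, and the sample values are all hypothetical.

```python
from typing import Iterable


def sustained_high(samples: Iterable[float], threshold: float,
                   min_consecutive: int) -> bool:
    """True if at least min_consecutive samples in a row are at or above threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Assumed egress buffer occupancy samples, in percent of the available buffer.
occupancy_pct = [40, 85, 90, 60, 88, 91, 93, 95]

print(sustained_high(occupancy_pct, threshold=80, min_consecutive=3))
print(sustained_high(occupancy_pct, threshold=80, min_consecutive=5))
```

The right threshold and run length depend on the sampling interval and on how the device reports buffer usage, so they should be tuned against known-good and known-bad periods rather than taken from this example.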
Latency Changes
Latency should be measured both at the application layer and inside the fabric.
| Layer | Example Signal |
|---|---|
| Application | Step time, p99 request latency, JCT |
| Server | GPU idle time, dataloader wait, RDMA counters |
| Fabric | Per-hop latency, queue residence time, ECN marks |
| Storage | Read latency, write latency, checkpoint pause |
In-band telemetry and IFA-style probes help identify where latency changes occur along the path from ingress leaf through the spines to the egress leaf.
Bandwidth Utilization and Flow Entropy
Bandwidth utilization should be measured from leaf to spine, spine to leaf, and across rails or planes.
High utilization is not automatically bad. The problem is when utilization becomes uneven, concentrated, or correlated with tail latency and congestion.
Important checks:
- Is one spine overutilized while others are underused?
- Are elephant flows reducing ECMP balance?
- Is flow entropy too low for hashing to distribute traffic?
- Does a specific job, tenant, rack, or rail dominate a path?
- Are link utilization changes correlated with ECN or PFC?
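Two of these checks reduce to simple numbers. The sketch below is one possible scoring approach, not a method described in the chapter: `imbalance_ratio` compares the busiest uplink with the mean, and `normalized_entropy` applies Shannon entropy to the byte share per path. The link names and values are invented.

```python
import math
from typing import Dict


def imbalance_ratio(util_by_link: Dict[str, float]) -> float:
    """Busiest link utilization divided by the mean (1.0 means perfectly even)."""
    values = list(util_by_link.values())
    mean = sum(values) / len(values)
    return max(values) / mean if mean > 0 else 1.0


def normalized_entropy(bytes_by_path: Dict[str, int]) -> float:
    """Shannon entropy of the byte share per path, scaled to the range 0..1."""
    total = sum(bytes_by_path.values())
    n_paths = len(bytes_by_path)
    if total == 0 or n_paths < 2:
        return 0.0
    shares = [b / total for b in bytes_by_path.values() if b > 0]
    entropy = sum(p * math.log2(1 / p) for p in shares)
    return entropy / math.log2(n_paths)


# Assumed uplink utilization (fraction of capacity) from one leaf to four spines.
uplinks = {"spine1": 0.9, "spine2": 0.3, "spine3": 0.3, "spine4": 0.3}
# Assumed byte counts per ECMP path for one job.
even = {"p1": 100, "p2": 100, "p3": 100, "p4": 100}
skewed = {"p1": 400, "p2": 0, "p3": 0, "p4": 0}

print(round(imbalance_ratio(uplinks), 2))
print(normalized_entropy(even), normalized_entropy(skewed))
```

A ratio well above 1.0 together with a low entropy score suggests that a few large flows are hashing onto the same path. Neither number says whether this is hurting a job, so both should be read alongside ECN, PFC, and step-time data.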
PFC, ECN, Drops, and RDMA Symptoms
For RoCEv2 and lossless Ethernet, these signals should be watched together.
| Signal | Meaning |
|---|---|
| PFC XOFF / pause | Priority class is being paused because receiver-side buffer pressure exists |
| ECN mark | Congestion signal for rate control |
| CNP count | Congestion Notification Packets returned to the sender after ECN-marked packets arrive; drives DCQCN rate reduction |
| Frame loss | Queue or link loss that may harm RDMA or TCP |
| RDMA retransmission | Recovery behavior after loss or timeout |
| Queue occupancy | Buffer pressure before drops or pauses |
PFC and ECN counters should not be interpreted alone. A small number of marks may be normal in a tuned fabric. A fast rise across many ports or a correlation with JCT increase is more important.
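The distinction between "a few marks" and "a fast rise across many ports" can be expressed as a rate plus a breadth check. The sketch below is a simplified illustration: the function names, the 100 events per second rate threshold, and the 25% breadth threshold are arbitrary assumptions that would need tuning per fabric.

```python
from typing import Dict, List


def rising_ports(prev: Dict[str, int], curr: Dict[str, int],
                 interval_s: float, min_rate: float) -> List[str]:
    """Ports whose counter grew faster than min_rate events per second."""
    return sorted(
        port for port, value in curr.items()
        if port in prev and (value - prev[port]) / interval_s > min_rate
    )


def classify(prev: Dict[str, int], curr: Dict[str, int], interval_s: float,
             min_rate: float = 100.0, wide_fraction: float = 0.25) -> str:
    """Label a counter snapshot pair as normal, localized, or widespread."""
    hot = rising_ports(prev, curr, interval_s, min_rate)
    if not hot:
        return "normal"
    return "widespread" if len(hot) / len(curr) >= wide_fraction else "localized"


# Assumed ECN mark counters for eight ports, sampled 10 seconds apart.
ports = [f"eth{i}" for i in range(8)]
baseline = {p: 0 for p in ports}
quiet = dict(baseline, eth3=50)
one_hot = dict(baseline, eth3=5000)
many_hot = dict(baseline, eth1=5000, eth2=7000, eth3=9000)

for name, snapshot in [("quiet", quiet), ("one_hot", one_hot), ("many_hot", many_hot)]:
    print(name, classify(baseline, snapshot, interval_s=10.0))
```

A "widespread" result is a prompt to check JCT and step time for the affected jobs, not a conclusion on its own.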
Historical and Real-Time Monitoring
Historical monitoring and real-time monitoring serve different purposes.
| Type | Purpose | Example Question |
|---|---|---|
| Historical monitoring | Trend, capacity planning, recurring pattern analysis | Did leaf-spine bandwidth grow 10% over several weeks? |
| Real-time monitoring | Detect and localize active incidents | Which queue is dropping now? |
Historical monitoring helps with:
- Capacity planning
- Oversubscription review
- Growth trend analysis
- Optics and interface reliability history
- Recurring congestion pattern detection
- Correlating job placement with fabric symptoms
Real-time monitoring helps with:
- Queue drop detection
- PFC and ECN bursts
- Flow tail latency
- Fat-flow identification
- Entropy scoring
- ASIC memory/CPU pressure
- Immediate traffic engineering decisions
AI fabrics need both. Historical data explains the trend; real-time data explains the live incident.
Server and Switch Telemetry Correlation
Switch telemetry alone is not enough in an AI/ML data center. Server telemetry must be correlated with switch telemetry.
Useful server-side signals:
- GPU utilization
- GPU memory pressure
- GPU communication time
- RDMA NIC counters
- Storage read throughput
- Dataloader wait time
- Checkpoint write time
- Scheduler placement and job ID
- Application step time
Useful switch-side signals:
- Interface and queue counters
- Buffer occupancy
- PFC pause and XOFF behavior
- ECN marking
- Drops
- Per-hop latency
- Flow records
The operational goal is to connect symptoms:
GPU utilization dropped because dataloaders waited; dataloaders waited because storage read latency rose; storage read latency rose because a leaf queue was congested; the queue was congested because a low-entropy elephant flow overloaded one spine path.
Without correlation, each team sees only one part of the incident.
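A first step toward this kind of correlation is a time-window join: for each server-side symptom, list the switch events that occurred shortly before it. The sketch below assumes both sides share synchronized clocks and that events are already sorted by time; `events_before`, the event labels, and the two-second window are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

SwitchEvent = Tuple[float, str]  # (timestamp in seconds, description)


def events_before(spike_ts: float, events: List[SwitchEvent],
                  window_s: float) -> List[str]:
    """Descriptions of switch events within window_s before spike_ts (inclusive).

    Assumes events is sorted by timestamp and clocks are synchronized.
    """
    times = [ts for ts, _ in events]
    lo = bisect_left(times, spike_ts - window_s)
    hi = bisect_right(times, spike_ts)
    return [label for _, label in events[lo:hi]]


# Assumed switch-side events and GPU step-time spike timestamps.
switch_events = [
    (100.0, "leaf3 queue3: ECN burst"),
    (100.4, "leaf3 queue3: PFC XOFF"),
    (250.0, "spine1 eth1/4: link flap"),
]
step_time_spikes = [100.9, 180.0]

for spike in step_time_spikes:
    print(spike, events_before(spike, switch_events, window_s=2.0))
```

Proximity in time is only a candidate explanation. Adding job ID, rail, and path labels to both sides narrows the join to events that could actually have affected the job in question.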
In-Band Flow Analyzer, IFA
In-Band Flow Analyzer, IFA, adds telemetry metadata to data-plane packets or cloned probe packets. It helps measure per-hop latency, queue behavior, congestion, and path information using packets that follow the same fabric path as the traffic of interest.
IFA Roles
IFA usually has three roles.
| Role | Function |
|---|---|
| Initiator | Selects traffic, adds IFA header or creates cloned IFA probe |
| Transit | Appends local metadata at each hop, such as timestamp, queue, congestion, or port data |
| Termination | Removes metadata or exports metadata stack to a collector, often through IPFIX |
In a three-stage Clos, the initiator may be an ingress leaf, the transit node may be a spine, and the termination node may be an egress leaf. In a five-stage fabric, both spine and super-spine hops can act as transit nodes and add metadata.
IFA Packet Format
Conceptually, an IFA packet adds headers and metadata before the original payload.
- L2 header
- L3 header
- IFA header
- L4 header
- IFA metadata header
- IFA metadata stack
- Original payload
The chapter notes that the IFA header sits after the IP header. Because IFA adds extra bytes, the fabric MTU must be large enough. AI data center fabrics commonly use jumbo MTU, such as 9000 bytes or more, but the full path must be checked.
MTU checks should include:
- Server NIC MTU
- Leaf and spine MTU
- Overlay or VXLAN overhead
- Collector path MTU
- Probe packet size
- IFA metadata stack growth across hops
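The stack-growth check is simple arithmetic once the header sizes are known. In the sketch below the byte sizes (4, 4, and 32) are placeholder assumptions chosen for illustration, not values taken from the chapter or from any specific device; the real sizes should come from the platform documentation. Both functions are hypothetical helpers.

```python
def ifa_packet_bytes(base_packet_bytes: int, hops: int,
                     ifa_header_bytes: int = 4,       # placeholder size
                     md_header_bytes: int = 4,        # placeholder size
                     md_per_hop_bytes: int = 32) -> int:  # placeholder size
    """Packet size after the IFA header, metadata header, and per-hop metadata are added."""
    return (base_packet_bytes + ifa_header_bytes + md_header_bytes
            + hops * md_per_hop_bytes)


def max_hops_within_mtu(mtu_bytes: int, base_packet_bytes: int,
                        ifa_header_bytes: int = 4,
                        md_header_bytes: int = 4,
                        md_per_hop_bytes: int = 32) -> int:
    """Largest number of metadata-appending hops that still fits within the MTU."""
    spare = mtu_bytes - base_packet_bytes - ifa_header_bytes - md_header_bytes
    return max(0, spare // md_per_hop_bytes)


# Assumed workload packet of 8960 bytes crossing 5 metadata-appending hops.
size = ifa_packet_bytes(8960, hops=5)
print(size, "bytes; fits in 9000:", size <= 9000, "; fits in 9216:", size <= 9216)
print("hops that fit at MTU 9000:", max_hops_within_mtu(9000, 8960))
print("hops that fit at MTU 9216:", max_hops_within_mtu(9216, 8960))
```

Under these assumed sizes, a near-full packet fits only one hop of metadata at MTU 9000. The calculation would also need any overlay overhead added to the base packet size.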
IFA Metadata Stack
Each hop can append metadata.
| Metadata Field | Why It Matters |
|---|---|
| Residence time | Time a packet spends inside a hop |
| Per-hop latency | Link plus device latency between adjacent nodes |
| Ingress port | Where the packet entered the node |
| Egress port | Where the packet left the node |
| RX timestamp | Timestamp used for latency calculation |
| Queue ID | Which queue carried the packet |
| Congestion notification | Whether congestion was observed |
| Egress port speed | Helps interpret delay and path properties |
| Device ID | Identifies the hop |
For AI fabrics, the highest-value fields are often congestion indication, RX timestamp, residence time, queue ID, and egress port.
Per-Hop Latency Analysis
IFA answers the question:
Which hop, port, or queue added the latency?
The basic workflow:
- An initiator selects or clones a packet from the traffic of interest.
- Each transit hop appends metadata.
- The termination hop exports the metadata stack.
- A collector reconstructs hop-by-hop path, queue, congestion, and latency.
- The operator correlates this with application symptoms such as JCT or p99 latency.
This is more precise than only knowing that a flow was slow end to end.
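The collector-side reconstruction step can be sketched as follows. The `HopMetadata` record is a hypothetical, simplified view of the metadata fields listed earlier; a real export has its own encoding. The hop-to-hop calculation assumes device clocks are synchronized closely enough for RX timestamps to be compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopMetadata:
    """Simplified, hypothetical view of one entry in an exported metadata stack."""
    device_id: str
    egress_port: str
    queue_id: int
    rx_timestamp_ns: int
    residence_time_ns: int
    congested: bool


def slowest_hop(stack: List[HopMetadata]) -> HopMetadata:
    """Hop where the packet spent the most time inside the device."""
    return max(stack, key=lambda hop: hop.residence_time_ns)


def hop_to_hop_ns(stack: List[HopMetadata]) -> List[int]:
    """RX-to-RX delay between adjacent hops (device plus link), in path order."""
    return [nxt.rx_timestamp_ns - cur.rx_timestamp_ns
            for cur, nxt in zip(stack, stack[1:])]


# Assumed metadata stack for one packet crossing leaf1, spine2, and leaf7.
stack = [
    HopMetadata("leaf1", "eth1/49", 3, 1_000, 800, False),
    HopMetadata("spine2", "eth1/7", 3, 2_300, 45_000, True),
    HopMetadata("leaf7", "eth1/12", 3, 48_100, 900, False),
]

worst = slowest_hop(stack)
print(worst.device_id, worst.egress_port, worst.queue_id, worst.residence_time_ns)
print(hop_to_hop_ns(stack))
```

Residence time needs only one device's clock, so it is usually the more robust field. RX-to-RX differences depend on the quality of time synchronization across devices.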
IFA Practical Constraints
IFA is powerful, but it must be used carefully.
Practical constraints:
- MTU overhead must be validated.
- Device support and metadata fields vary by implementation.
- Collector capacity must handle exported metadata.
- Sampling policy must avoid excessive overhead.
- Cloned probes should not distort workload behavior.
- Timestamp accuracy depends on device capability.
- Multicast metadata behavior may be less important for AI backends but still needs awareness.
Mirroring on Demand and Mirroring on Drop
Mirroring is useful when counters prove that something happened but not why.
| Method | Use |
|---|---|
| Mirroring on demand | Mirror selected traffic for manual or automated packet analysis |
| Mirroring on drop | Export packets that were dropped so the cause can be inspected |
Mirroring on drop is useful because not all drops mean the same thing. A drop may be caused by congestion, malformed packets, security policy, control-plane protection, queue limits, or ASIC protection behavior.
In AI fabrics, mirroring should be scoped carefully. Full packet capture at scale is usually impractical. Use filters, sampling, drop reason, interface, queue, and job context when possible.
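Once mirrored drop records reach a collector, a first pass is often just a tally by reason and by location. The sketch below assumes each record has already been parsed into a dictionary; the field names and reason strings are hypothetical, since real drop-reason codes are platform specific.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_drops(records: Iterable[Dict]) -> Tuple[Counter, Counter]:
    """Count drops per reason and per (switch, interface, queue) location."""
    by_reason: Counter = Counter()
    by_location: Counter = Counter()
    for rec in records:
        by_reason[rec["reason"]] += 1
        by_location[(rec["switch"], rec["interface"], rec["queue"])] += 1
    return by_reason, by_location


# Assumed, already-parsed mirror-on-drop records.
drop_records = [
    {"switch": "leaf3", "interface": "eth1/7", "queue": 3, "reason": "queue_limit"},
    {"switch": "leaf3", "interface": "eth1/7", "queue": 3, "reason": "queue_limit"},
    {"switch": "leaf3", "interface": "eth1/7", "queue": 3, "reason": "queue_limit"},
    {"switch": "leaf5", "interface": "eth1/2", "queue": 0, "reason": "acl_deny"},
]

reasons, locations = summarize_drops(drop_records)
print(reasons.most_common(1))
print(locations.most_common(1))
```

Grouping by job or tenant label, where available, follows the same pattern and helps decide whether the response is a QoS fix, a security action, or a capacity change.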
Corrective Actions
Monitoring is valuable only if it leads to decisions.
Human-Guided Actions
Examples of corrective actions:
| Symptom | Possible Action |
|---|---|
| Congestion source identified | Rate-limit, move, or isolate source traffic |
| Persistent PFC pressure | Retune thresholds, rebalance traffic, add capacity |
| ECN marks rise across a rail | Review DCQCN, link balance, job placement |
| Queue drops on a low-priority class | Validate QoS policy and traffic classification |
| One spine is overutilized | Adjust hashing, traffic engineering, or topology |
| Job sees high JCT from fabric | Reschedule job or move job to healthier fabric capacity |
| Security or churn source found | ACL, source block, or control-plane protection |
| Capacity design is wrong | Add links, change oversubscription, redesign topology |
The chapter emphasizes that monitoring cannot compensate for a fundamentally undersized fabric. If oversubscription, load balancing, optics quality, or capacity planning is wrong, telemetry will reveal the problem but cannot fix it.
AI-Assisted and Autonomous Operations
The chapter also discusses the path toward AI-assisted operations.
The idea:
- Collect telemetry, syslog, SNMP, IFA, and server data.
- Store it in a data lake or time-series database.
- Mine patterns and correlate incidents with corrective actions.
- Use an AI/LLM analyzer to propose or eventually apply actions.
- Move gradually toward autonomous or self-driving network behavior.
This is still an operational evolution. In many environments, humans remain involved before actions are applied. That is appropriate because the cost of a wrong action can be high: killing a training job, moving traffic, blocking a source, or changing congestion-control behavior can affect many GPUs.
Operational Validation Checklist
Use this checklist when designing monitoring for an AI data center fabric.
- Confirm which signals are collected by SNMP, syslog, traps, streaming telemetry, flow telemetry, and IFA.
- Check polling intervals and decide which events require streaming telemetry instead of SNMP.
- Collect per-interface, per-queue, and per-buffer telemetry.
- Track ECN, CNP, PFC, drops, queue occupancy, and RDMA retransmission together.
- Correlate switch telemetry with GPU utilization, GPU communication time, dataloader wait, checkpoint time, and JCT.
- Include job ID, tenant, rack, rail, server, GPU, and NIC identity in correlation data where possible.
- Validate time synchronization across collectors, switches, and servers.
- Store historical telemetry for capacity planning and recurring pattern analysis.
- Use real-time telemetry for queue drops, PFC bursts, ECN bursts, fat flows, and entropy changes.
- Validate collector capacity before enabling high-volume telemetry streams.
- Test IFA MTU overhead across server NICs, leafs, spines, overlays, and collector paths.
- Define safe mirroring policies for mirroring on demand and mirroring on drop.
- Define corrective-action playbooks before automating actions.
- Validate that monitoring can localize whether a problem is one interface, one queue, one switch, one rail, one rack, or fabric-wide.
- Do not treat monitoring as a substitute for correct topology, oversubscription, optics, and capacity design.
Practical Tips and Notes
Build a Minimum Useful Telemetry Set First
A telemetry program can fail because it collects too much before it answers the first operational questions. Start with a small set that can explain GPU-visible symptoms.
| Question | Minimum Signals |
|---|---|
| Are GPUs waiting on the network? | Step time, collective duration, GPU communication time, ECN/CNP, queue occupancy |
| Are GPUs waiting on storage? | Dataloader wait, storage read latency, metadata latency, storage NIC utilization |
| Did lossless behavior activate? | PFC XOFF, pause duration, queue occupancy, ECN marks, drops |
| Is congestion localized? | Interface, queue, switch, rack, rail, job ID, path or flow record |
| Did a corrective action help? | Before/after JCT, p99 step time, p99 latency, drops, retransmissions |
Once those questions are answerable, add more detailed telemetry such as IFA, mirroring on drop, or per-flow entropy scoring.
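The "did a corrective action help?" question can be answered with a before and after comparison of a tail percentile. The sketch below uses the nearest-rank percentile method as one reasonable choice; the helper names and the step-time values are invented, and the percentile is expected to be a whole number such as 99.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct between 1 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def relative_change(before: Sequence[float], after: Sequence[float],
                    pct: int = 99) -> float:
    """Relative change of the chosen percentile (negative means it improved)."""
    p_before = percentile(before, pct)
    p_after = percentile(after, pct)
    return (p_after - p_before) / p_before


# Assumed step times in milliseconds, sampled before and after a corrective action.
steps_before = list(range(1, 101))      # 1..100 ms
steps_after = list(range(1, 51)) * 2    # 1..50 ms, each value twice

print(percentile(steps_before, 99), percentile(steps_after, 99))
print(f"{relative_change(steps_before, steps_after):+.1%}")
```

Comparing windows of similar length and similar job mix matters more than the exact percentile method, since a change in workload can move p99 as much as a fabric change does.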
Time Synchronization Is Part of Monitoring
Cross-layer correlation depends on time. If switch telemetry, server logs, GPU traces, storage metrics, and scheduler events are not aligned, a correct root cause may look wrong.
Practical checks:
- Confirm NTP/PTP source and clock drift policy.
- Record collector ingest time and device event time separately.
- Check time skew before comparing GPU step-time spikes with queue events.
- Preserve job ID, host, NIC, rail, rack, and switch identifiers in the same time-series labels.
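Recording both timestamps makes a basic sanity check possible: the difference between collector ingest time and device event time should be small and never negative. The sketch below is a rough heuristic under that assumption; `suspect_sources`, the source names, and the five-second bound are hypothetical. A large or negative lag can come from clock error or from transport delay, so it flags sources for investigation rather than measuring skew directly.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, float, float]  # (source, device_event_ts, collector_ingest_ts)


def suspect_sources(records: Iterable[Record], max_lag_s: float) -> Dict[str, float]:
    """Worst ingest lag per source, for sources whose lag is negative or above max_lag_s."""
    worst: Dict[str, float] = {}
    for source, event_ts, ingest_ts in records:
        lag = ingest_ts - event_ts
        if lag < 0 or lag > max_lag_s:
            if source not in worst or abs(lag) > abs(worst[source]):
                worst[source] = lag
    return worst


# Assumed records: one healthy switch, one host whose clock runs ahead, one slow path.
records = [
    ("leaf3", 1000.0, 1000.25),
    ("gpu-host12", 1000.0, 997.5),
    ("spine1", 1000.0, 1006.0),
]

print(suspect_sources(records, max_lag_s=5.0))
```

Sources flagged this way should be excluded from, or corrected in, any comparison of GPU step-time spikes with queue events until their clocks are verified.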
Sample Carefully Before Turning on Everything
High-cardinality telemetry can overload collectors and make dashboards slow. IFA, flow records, and mirroring are especially easy to overuse.
Use a staged approach:
| Stage | Scope |
|---|---|
| Baseline | Interface, queue, buffer, ECN, PFC, drops, CPU, memory |
| Job-aware | Add job/rack/rail labels and scheduler metadata |
| Flow-aware | Add sFlow/IPFIX or selected flow records |
| Packet-aware | Add IFA, mirroring on demand, mirroring on drop |
| Automated action | Add scoped playbooks with rollback and human review |
Chapter Summary
The main takeaways:
- AI fabric monitoring must connect infrastructure signals to workload outcomes such as JCT, GPU idle time, and inference tail latency.
- SNMP polling remains useful for baseline visibility, but it can miss microbursts and fast queue events.
- SNMP traps and syslog are useful event mechanisms, but they do not replace high-resolution telemetry.
- Streaming telemetry provides a better model for high-frequency interface, queue, buffer, ECN, and PFC visibility.
- Historical monitoring supports trend analysis and capacity planning.
- Real-time monitoring supports active incident detection and localization.
- Server telemetry must be correlated with switch telemetry to explain AI workload symptoms.
- IFA adds or exports per-hop metadata so operators can locate latency, queue, and congestion points.
- Mirroring on demand and mirroring on drop help inspect packets when counters are not enough.
- Corrective actions can include traffic engineering, source isolation, job rescheduling, threshold changes, and capacity expansion.
- AI-assisted operations can analyze telemetry patterns, but automation should be introduced carefully because wrong actions can be expensive.
Key Terms
| Term | Meaning |
|---|---|
| SNMP | Simple Network Management Protocol |
| MIB | Management Information Base |
| SNMP trap | Device-generated notification triggered by an event or threshold |
| Syslog | Event logging protocol and message stream |
| Streaming telemetry | Device-pushed structured telemetry stream |
| gNMI | gRPC Network Management Interface |
| gRPC | Remote procedure call framework often used for telemetry transport |
| sFlow | Sampled flow monitoring technology |
| IPFIX | IP Flow Information Export |
| RPC query | Structured remote query to gather device state |
| IFA | In-Band Flow Analyzer |
| Initiator | IFA node that selects or clones traffic and adds IFA information |
| Transit | IFA node that appends local metadata |
| Termination | IFA node that removes or exports metadata |
| Residence time | Time a packet spends inside a hop |
| Queue ID | Identifier of the queue carrying the packet |
| PFC | Priority Flow Control |
| ECN | Explicit Congestion Notification |
| CNP | Congestion Notification Packet |
| Fat flow | Large flow that can dominate a path |
| Flow entropy | Diversity of flow hash inputs and path distribution potential |
| Mirroring on drop | Exporting dropped packets or copies for analysis |
| Corrective action | Operational response such as rerouting, blocking, rescheduling, or capacity change |
Q&A
1. Why is ordinary interface monitoring not enough for AI fabrics?
Average interface utilization can hide microbursts, queue buildup, PFC pauses, ECN bursts, and tail-latency events. AI workloads care about whether those events make GPUs wait or increase JCT, so monitoring must include per-queue, per-buffer, per-hop, and server-side signals.
2. When is SNMP polling still useful?
SNMP polling is useful for baseline counters, trend monitoring, interface statistics, CPU, memory, FIB/RIB/MAC usage, and TCAM utilization. It is not enough for fast microburst or per-hop latency analysis.
3. What is the benefit of streaming telemetry?
Streaming telemetry lets devices push structured data to collectors, either periodically or on change. It is better suited for high-frequency queue, buffer, ECN, PFC, and interface visibility than traditional polling.
4. Why should server telemetry be correlated with switch telemetry?
Switch counters may show congestion, but server telemetry shows workload impact. Correlating both tells whether a network event increased GPU idle time, dataloader wait, checkpoint pause, or JCT.
5. What does IFA add beyond normal telemetry?
IFA adds or exports per-hop metadata from data-plane packets or cloned probes. It can identify which hop, port, or queue contributed latency or congestion.
6. What IFA fields matter most in AI fabrics?
Congestion indication, RX timestamp, residence time, queue ID, ingress port, egress port, and device ID are especially useful because they help localize queue and latency problems.
7. Why does IFA require MTU validation?
IFA adds headers and metadata. If the fabric MTU is too small, probe or tagged packets may be fragmented or dropped. The server NIC, leaf, spine, overlay, and collector path MTU must all be checked.
8. What is mirroring on drop used for?
Mirroring on drop exports copies of dropped packets so operators can determine whether drops came from congestion, malformed traffic, security policy, control-plane protection, queue limits, or other causes.
9. What is the difference between historical and real-time monitoring?
Historical monitoring finds trends and supports capacity planning. Real-time monitoring detects active problems such as queue drops, PFC bursts, ECN bursts, fat flows, and tail-latency spikes.
10. Why should corrective actions be introduced carefully?
Corrective actions can affect running jobs and many GPUs. Blocking a source, changing traffic paths, rescheduling a job, or adjusting thresholds can solve one problem while creating another. Automation should start with strong evidence, scoped actions, and human review where risk is high.