InfiniBand Packet Analysis: A Practical RDMA Transport Primer
This report analyzes the packet captures in the ib-packets directory using tshark. The goal is to connect the captured packets back to the Chapter 1 RDMA and InfiniBand notes: data path vs control path, InfiniBand packet structure, management traffic, IP over InfiniBand, Queue Pairs, and Reliable Connection behavior.

Table of Contents
- Executive Summary
- Scope: Observed vs Reference Model
- Capture Set
- How the Captures Were Analyzed
- Inferred Capture Method
- InfiniBand Protocol Stack
- InfiniBand Layers Visible in the Captures
- ERF Capture Anatomy
- InfiniBand Packet Structure
- RDMA Read/Write Packet Analysis Model
- Control Path vs Data Path
- Per-Capture Findings
- Key Packet Examples
- What These Captures Do Not Show
- Useful tshark Commands
- Takeaways for Chapter 1
- References
Executive Summary
The captures can be analyzed with tshark without superuser privileges because they are offline .pcap files. Root or special capture permissions are usually needed for live packet capture, not for reading existing capture files.
The packet set shows several important InfiniBand behaviors:
- InfiniBand management traffic is visible through MAD packets.
- Subnet Management traffic appears as `SubnGet`, `SubnGetResp`, `SubnSet`, and Subnet Administration records.
- Performance Management traffic appears as `PortCounters`, `PortCountersExtended`, and `ClassPortInfo`.
- IP over InfiniBand (IPoIB) is visible as normal IP, TCP, SSH, ARP, and ICMP traffic carried inside InfiniBand frames.
- One capture shows Reliable Connection (RC) behavior, including `ConnectRequest`, `ConnectReply`, `ReadyToUse`, `RC SEND Only`, and `RC Acknowledge`.
The captures are especially useful for understanding the distinction between:
- Control path: setup, discovery, management, path lookup, connection establishment, and performance queries.
- Data path: payload movement after the required resources and paths are ready.
Most captures are management or IPoIB examples. They do not show a complete RDMA Read or RDMA Write payload exchange with RETH fields, remote virtual addresses, or rkeys. The closest data-path example is infiniband.pcap, which shows RC SEND and AETH ACK behavior.
Scope: Observed vs Reference Model
This report intentionally separates packet evidence from explanatory reference material.
| Topic | Status in this report | Evidence or purpose |
|---|---|---|
| Subnet Management | Observed in captures | SubnGet, SubnGetResp, SubnSet, QP0 traffic |
| Subnet Administration | Observed in captures | Path records, multicast membership, QP1 traffic |
| Performance Management | Observed in captures | PortCounters, PortCountersExtended, ClassPortInfo |
| IPoIB | Observed in captures | ICMP, TCP, SSH, and ARP-like behavior over InfiniBand |
| Connection Management | Observed in captures | ConnectRequest, ConnectReply, ReadyToUse |
| Reliable Connection SEND/ACK | Observed in captures | RC SEND Only, RC Acknowledge, AETH |
| RDMA READ | Reference model only | Added to explain BTH + RETH request and response packet behavior for future captures |
| RDMA WRITE | Reference model only | Added to explain BTH + RETH + payload request behavior for future captures |
| NCCL collective traffic | Not present | Use the official NCCL collective operations, NCCL networking troubleshooting, and NVIDIA/nccl-tests references instead of expanding it here |
| Bit-level packet format reference | Companion document | packet-format-reference.md — LRH/GRH/BTH/extended headers/MAD/SMP DR/IPoIB bit layouts and the full BTH opcode master table |
When reading the report, treat the observed sections as analysis of the provided pcap files. Treat the RDMA READ/WRITE section as a packet-analysis guide for future captures that include one-sided RDMA operations. For byte- and bit-level field layouts that the report references but does not exhaustively tabulate, see the companion packet format reference.
Capture Set
| File | Packets | Duration | Main Observation |
|---|---|---|---|
| `ib_initial_sniffer.pcap` | 108 | 10.90 s | Initial subnet discovery, SMP, SA, multicast membership, and performance queries |
| `ib_ibping_sniffer.pcap` | 65 | 10.18 s | Vendor MAD request/response behavior plus performance counters |
| `ib_ibtracert_sminfo_sniffer.pcap` | 84 | 30.46 s | Tracing and SMInfo-related control path traffic |
| `ib_sniffer.pcap` | 24 | 6.00 s | Performance Management traffic only |
| `ib_ipping_sniffer.pcap` | 34 | 12.00 s | ICMP ping over IPoIB plus a small amount of ARP and performance traffic |
| `ib_IPoIB.pcap` | 5,848 | 4.28 s | SSH over TCP over IPoIB |
| `infiniband.pcap` | 43 | 250.57 s | SMInfo, IPoIB, CM connection setup, RC SEND, and RC ACK behavior |
All files are pcap files with Extensible Record Format encapsulation. tshark decodes the ERF outer record and then the InfiniBand payload.
How the Captures Were Analyzed
The analysis used Wireshark/TShark 4.2.2:

    tshark -v

Basic capture metadata:

    capinfos ../ib-packets/*.pcap

Protocol hierarchy:

    tshark -r ../ib-packets/ib_IPoIB.pcap -q -z io,phs

InfiniBand field extraction:

    tshark -r ../ib-packets/ib_initial_sniffer.pcap \
      -Y infiniband \
      -T fields \
      -e frame.number \
      -e frame.time_relative \
      -e infiniband.lrh.dlid \
      -e infiniband.lrh.slid \
      -e infiniband.bth.opcode \
      -e infiniband.bth.destqp \
      -e infiniband.mad.method \
      -e infiniband.mad.attributeid \
      -E header=y

Useful fields:
| Field | Meaning |
|---|---|
| `infiniband.lrh.dlid` | Destination Local ID from the Local Route Header |
| `infiniband.lrh.slid` | Source Local ID from the Local Route Header |
| `infiniband.bth.opcode` | Base Transport Header opcode |
| `infiniband.bth.destqp` | Destination Queue Pair |
| `infiniband.mad.mgmtclass` | MAD management class |
| `infiniband.mad.method` | MAD method, such as Get or GetResp |
| `infiniband.mad.attributeid` | MAD attribute ID |
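The field-extraction command above emits tab-separated rows with a header line, which is easy to post-process. The following is a minimal sketch, not part of the original analysis: `parse_tshark_fields` is a hypothetical helper name, and the sample rows are invented for illustration rather than copied from the captures.

```python
import csv
import io

def parse_tshark_fields(text):
    """Parse `tshark -T fields -E header=y` output (tab-separated by default)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]

# Illustrative sample only; these values are NOT taken from the pcap files.
SAMPLE = (
    "frame.number\tinfiniband.lrh.dlid\tinfiniband.lrh.slid\tinfiniband.bth.opcode\n"
    "1\t0xffff\t0x0001\t100\n"
    "2\t0x0001\t0x0004\t4\n"
)

rows = parse_tshark_fields(SAMPLE)
for row in rows:
    print(row["frame.number"], row["infiniband.lrh.slid"], row["infiniband.bth.opcode"])
```

In practice the text would come from saving the tshark output to a file and reading it back; every value stays a string, so numeric comparisons need an explicit conversion.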
Inferred Capture Method
The exact capture commands cannot be proven from the pcap files alone. The following is an inference from file names, encapsulation type, protocol hierarchy, and decoded packet contents.
The captures are likely the result of running InfiniBand diagnostic or IPoIB workloads while a native InfiniBand sniffer was recording traffic. The files use Extensible Record Format encapsulation and expose InfiniBand LRH/BTH/MAD fields, which is more consistent with a native InfiniBand capture path than with a simple Ethernet-style tcpdump on an IP interface.
| File | Likely workload during capture | Evidence |
|---|---|---|
| `ib_initial_sniffer.pcap` | Fabric initialization or subnet discovery | SubnGet(NodeInfo), NodeDescription, PortInfo, SMInfo, QP0 traffic |
| `ib_ibping_sniffer.pcap` | ibping between two InfiniBand nodes | Repeated vendor MAD request/response traffic between LID 5 and LID 8 |
| `ib_ibtracert_sminfo_sniffer.pcap` | ibtracert, sminfo, and possibly counter queries | SMInfo, LinearForwardingTable, PortCounters, PortCountersExtended |
| `ib_sniffer.pcap` | Performance counter polling | Mostly PERF (PortCounters) and PortCountersExtended |
| `ib_ipping_sniffer.pcap` | IP ping over IPoIB | ICMP echo request/reply plus ARP over InfiniBand |
| `ib_IPoIB.pcap` | SSH/TCP session over IPoIB | TCP conversation 10.10.10.12:34826 <-> 10.10.10.11:22, SSH payload |
| `infiniband.pcap` | Mixed InfiniBand sample workload | SMInfo, PathRecord, ConnectRequest, ConnectReply, ReadyToUse, RC SEND, and RC ACK |
A plausible collection workflow would have looked like this:

    Terminal 1: Start a native InfiniBand sniffer and write to a pcap file.
    Terminal 2: Run one diagnostic or workload command, such as ibping, ibtracert, sminfo, perfquery, ping over IPoIB, or SSH over an IPoIB address.
    Result:     The sniffer records LRH/BTH/MAD/IPoIB traffic into a pcap file.

For example, the ib_ibping_sniffer.pcap name and decoded packets suggest this type of scenario:

    Start capture:    native IB sniffer -> ib_ibping_sniffer.pcap
    Run workload:     ibping between two IB endpoints
    Observed packets: VENDOR MAD request/response traffic between LIDs

The IPoIB captures likely came from ordinary IP tools running over an ib0-style interface:

    Start capture:    native IB or IPoIB-aware capture -> ib_ipping_sniffer.pcap
    Run workload:     ping <remote IPoIB address>
    Observed packets: ARP, ICMP Echo request, ICMP Echo reply over InfiniBand

and:

    Start capture:    native IB or IPoIB-aware capture -> ib_IPoIB.pcap
    Run workload:     ssh <remote IPoIB address>
    Observed packets: TCP handshake and SSH payload over IPoIB

If reproducing a similar analysis from existing files, no superuser privileges are required:

    tshark -r ../ib-packets/ib_ibping_sniffer.pcap -c 10
    tshark -r ../ib-packets/ib_ipping_sniffer.pcap -c 10
    tshark -r ../ib-packets/ib_IPoIB.pcap -q -z conv,tcp

If reproducing the capture itself, permissions depend on the capture method. Live capture from a privileged interface or a vendor sniffer may require extra capabilities, group membership, or root privileges. Offline analysis of the resulting pcap does not.
InfiniBand Protocol Stack
The InfiniBand protocol stack can be viewed at two complementary levels:
- Protocol stack view: how applications, upper-layer protocols, transport services, network routing, link behavior, and physical signaling fit together.
- Packet structure view: how an individual packet is encoded on the wire, including routing headers, transport headers, optional extended headers, payload, and integrity checks.
The following third-party diagrams are useful orientation material. They are included here as educational figures, while the packet-level interpretation in this report is based on the fields visible through tshark and the official NVIDIA, IBTA, and Wireshark references listed below.

Conceptually, this stack explains why the captures include both control path protocols, such as Subnet Management and Connection Management, and data path traffic, such as IPoIB and Reliable Connection packets.

The encapsulation figure aligns with the next two sections: tshark exposes packet fields such as LRH, BTH, DETH, MAD, AETH, and IP payloads, depending on the packet type.
Source: What is InfiniBand? (A Complete Guide)
InfiniBand Layers Visible in the Captures
The packet structure visible in tshark maps well to the Chapter 1 InfiniBand Communication Stack.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
ERF[ERF capture record]
LRH[InfiniBand LRH<br/>LID routing inside the fabric]
BTH[InfiniBand BTH<br/>transport opcode, destination QP, PSN]
EXT[Extended transport headers<br/>DETH / AETH / CM / MAD fields]
PAYLOAD[Payload<br/>MAD, IPoIB, ICMP, TCP, SSH, or data]
ERF --> LRH --> BTH --> EXT --> PAYLOAD
Important visible headers:
| Header | Role | Example from captures |
|---|---|---|
| LRH | Local routing inside the InfiniBand fabric | slid, dlid, packet length |
| BTH | Transport behavior and QP selection | opcode 100 for UD SEND Only, opcode 4 for RC SEND Only, opcode 17 for RC ACK |
| DETH | Datagram transport fields for UD traffic | QP0/QP1 management traffic |
| MAD | Management datagram | SubnGet, SubnGetResp, PortCounters |
| AETH | ACK Extended Transport Header | RC Acknowledge packets in infiniband.pcap |
| IP payload | IP over InfiniBand | TCP/SSH and ICMP over IPoIB |
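The decimal opcodes in the table are easier to read once split into their two parts: in the InfiniBand BTH, the upper three bits of the 8-bit opcode select the transport service and the lower five bits select the operation. The sketch below illustrates that split for the three values seen in the captures; the helper name `split_opcode` is hypothetical and only the four classic service types are mapped.

```python
# Upper three opcode bits -> transport service (classic IBA service types only).
TRANSPORTS = {0b000: "RC", 0b001: "UC", 0b010: "RD", 0b011: "UD"}

def split_opcode(opcode):
    """Split an 8-bit BTH opcode into (transport service, 5-bit operation code)."""
    transport = TRANSPORTS.get(opcode >> 5, "other")
    return transport, opcode & 0x1F

for value in (100, 4, 17):
    transport, operation = split_opcode(value)
    print(f"opcode {value} = 0x{value:02x}: transport={transport}, operation=0x{operation:02x}")
```

Operation code 0x04 is SEND Only and 0x11 is Acknowledge, which is why 100 (0x64) decodes as UD SEND Only while 4 and 17 decode as RC SEND Only and RC Acknowledge.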
ERF Capture Anatomy
The captures use the Endace Extensible Record Format (ERF) as the outer wrapper. Each on-wire InfiniBand frame is encapsulated by an ERF record emitted by the capture device, and tshark dissects this outer record before handing the inner bytes to the InfiniBand dissector. Understanding what ERF preserves vs hides is what separates “I see a packet” from “I know exactly what the sniffer recorded.”
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
PHY["Physical layer<br/>8b/10b or 64b/66b symbols<br/>training, idle, recovery"]
LFC["Link-level flow control<br/>FCCL / FCTBS credits"]
ERF["ERF outer record<br/>ts, type, flags, rlen, wlen"]
IB["InfiniBand frame<br/>LRH → (GRH) → BTH → ext → payload → ICRC/VCRC"]
PHY -.discarded.-> ERF
LFC -.discarded.-> ERF
ERF --> IB
ERF Outer Record
The ERF record is small but carries every piece of metadata the sniffer hardware can supply. All seven captures use ERF type 0x15 (INFINIBAND), which is set by the capture device firmware and is the strongest single piece of evidence that the recording came from an IB-aware sniffer rather than a host-side tcpdump.
| ERF field | Filter name | Example (infiniband.pcap frame 10) | Meaning |
|---|---|---|---|
| Timestamp | erf.ts | 0x482b41f8ae3041c0 | 64-bit fractional-seconds hardware timestamp |
| Record type | erf.types | 0x15 (Type 21: INFINIBAND) | Identifies inner payload as native IB |
| Extension header present | erf.types.ext_header | 0 | No ERF extension headers in this dataset |
| Capture interface | erf.flags.cap | 1 (Port B) | Which sniffer port observed this frame |
| Varying record length | erf.flags.vlen | 1 | Record length varies per frame |
| Truncated | erf.flags.trunc | 0 | Frame was captured at full wire length |
| RX error | erf.flags.rxe | 0 | Capture device flagged no receive error |
| DS error | erf.flags.dse | 0 | No data-stream error |
| Record length | erf.rlen | 136 | ERF record bytes including padding |
| Loss counter | erf.lctr | 0 | Frames dropped between this and the previous record |
| Wire length | erf.wlen | 114 | Original on-wire byte count |
The pair rlen (136) vs wlen (114) shows ERF’s per-record padding to alignment boundaries. For timing analysis, erf.ts is the authoritative clock — frame.time_relative derives from it but rounds to microseconds in some output modes.
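The padding and timestamp arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions rather than a description of the capture device: it assumes a 16-byte ERF header with no extension headers, padding to 8-byte boundaries, and the usual ERF timestamp layout of whole seconds in the upper 32 bits and a binary fraction in the lower 32 bits. The function names are hypothetical.

```python
ERF_HEADER_BYTES = 16  # assumption: base ERF header, no extension headers
ERF_ALIGNMENT = 8      # assumption: records are padded to 64-bit boundaries

def erf_record_length(wlen):
    """Estimate erf.rlen from erf.wlen by adding the header and rounding up."""
    unpadded = ERF_HEADER_BYTES + wlen
    return -(-unpadded // ERF_ALIGNMENT) * ERF_ALIGNMENT

def erf_ts_to_seconds(ts):
    """Convert a 64-bit ERF timestamp to seconds since the Unix epoch."""
    whole_seconds = ts >> 32
    fraction = (ts & 0xFFFFFFFF) / 2**32
    return whole_seconds + fraction

print(erf_record_length(114))
print(erf_ts_to_seconds(0x482B41F8AE3041C0))
```

Under these assumptions a 114-byte wire frame yields the 136-byte record length shown in the table. Converting to a float loses sub-microsecond precision, so interval measurements are better done on the raw 64-bit integers.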
Capture-Interface Topology
ERF’s flags.cap field tells you which sniffer port saw each frame, which is essential for interpreting bidirectional flows.
| File | Capture interfaces used | Implication |
|---|---|---|
| `infiniband.pcap` | 0 and 1 | Bidirectional tap; both link directions captured |
| All other pcaps | 0 only | Single-direction tap |
Concrete evidence from infiniband.pcap:

    Frame 10 (RC SEND Only,   DLID=1, SLID=4) → Capture interface 1 (Port B), ts=0x482b41f8ae3041c0
    Frame 11 (RC Acknowledge, DLID=4, SLID=1) → Capture interface 0 (Port A), ts=0x482b41f8ae30ede0

The SEND and its ACK arrive on different sniffer ports because they travel in opposite directions on the link. In single-interface captures this asymmetry is invisible: you may only see one half of an exchange depending on which port was tapped.
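The two timestamps also give the SEND-to-ACK interval as seen by the sniffer. A minimal sketch, assuming the lower 32 bits of `erf.ts` are a binary fraction of a second (units of 2^-32 s); the helper name is hypothetical:

```python
FRACTION_UNITS_PER_SECOND = 2**32  # assumption: ERF 32.32 fixed-point timestamps

def erf_delta_seconds(ts_later, ts_earlier):
    """Difference between two raw 64-bit ERF timestamps, in seconds."""
    return (ts_later - ts_earlier) / FRACTION_UNITS_PER_SECOND

send_ts = 0x482B41F8AE3041C0  # frame 10, RC SEND Only
ack_ts = 0x482B41F8AE30EDE0   # frame 11, RC Acknowledge

delta = erf_delta_seconds(ack_ts, send_ts)
print(f"SEND -> ACK interval at the tap: {delta * 1e6:.2f} microseconds")
```

The result is roughly 10 microseconds. It measures the gap between the two frames passing the tap, not end-to-end application latency, and it is only meaningful if both sniffer ports share one clock.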
What ERF Preserves vs Hides
The ERF wrapper is thin, but the IB dissector behind it is comprehensive. The “simplification” you might perceive comes from two places: (a) hardware events that occur below the packet boundary and never become packets, and (b) the IB dissector’s choice of which fields to expose as filterable names vs tree-only fields.
| Layer / signal | Visible in tshark? | Notes |
|---|---|---|
| Physical 8b/10b or 64b/66b symbols | No | Decoded by HCA SerDes; never reach the capture host |
| Link training, recovery, idle symbols | No | Sub-packet events, discarded by the link layer |
| Link-level flow-control credits (FCCL, FCTBS) | No | Carried in dedicated link-level subheaders, not delivered as IB packets |
| Inter-packet gaps and bandwidth headroom | No | Reconstruct from erf.ts deltas instead |
| Frames dropped or rejected by sniffer hardware | Partial | Visible only as a non-zero erf.lctr jump |
| RX-error frames | Conditional | Forwarded with erf.flags.rxe = 1 if the device is configured to keep them |
| LRH | Yes | infiniband.lrh.* (slid, dlid, lnh, vl, sl, packet length) |
| GRH | Only when LRH.LNH = 0x3 | All packets in this set carry LNH = 0x2, so GRH is correctly absent |
| BTH and extended headers | Yes | DETH, AETH, MAD, RETH (when present) all decoded |
| Payload (MAD, IPoIB IP/TCP/ICMP) | Yes | Standard upper-layer dissection |
| Invariant CRC | Yes | infiniband.invariant.crc, e.g. 0x0acca5df in frame 10 |
| Variant CRC | Yes | infiniband.variant.crc, e.g. 0x24a8 in frame 10 |
A common misconception is that ERF strips ICRC/VCRC. In this dataset both are present in the IB tree and are filterable as infiniband.invariant.crc and infiniband.variant.crc. The Wireshark IB dissector does not auto-validate them, however; integrity is asserted by the capture device’s RX-error flag (erf.flags.rxe), not by the dissector.
Worked Example: ERF + IB Frame Layout
The following anonymized layout is infiniband.pcap frame 10 (the RC SEND Only carrying an IPoIB ICMP echo request). It demonstrates how the ERF outer record, the InfiniBand headers, the EtherType-encapsulated IPoIB payload, and the trailing CRCs all coexist in a single 114-byte wire frame.

    Frame 10: 114 bytes wire / 136-byte ERF record / capture interface 1 (Port B)

    ERF outer record
      Timestamp:    0x482b41f8ae3041c0
      Type:         0x15 (INFINIBAND)
      Ext header:   0
      Flags:        cap=1, vlen=1, trunc=0, rxe=0, dse=0
      Record len:   136
      Loss counter: 0
      Wire length:  114

    InfiniBand
      LRH (Local Route Header)
        VL = 0
        Service Level = 0
        LNH = 0x2 (BTH only, no GRH)
        DLID = 1, SLID = 4
        Packet length = 28 (4-byte words)
      BTH (Base Transport Header)
        Opcode = 4 (RC SEND Only)
        Solicited Event = False
        MigReq = True
        Pad Count = 0
        P_Key = 0xffff
        Destination QP = <masked>
        Acknowledge Request = True
        PSN = <masked>
      IBA Payload - EtherType-encapsulated for IPoIB
        Ethertype = 0x0800 (IPv4)
      Invariant CRC: 0x0acca5df
      Variant CRC:   0x24a8

    IPv4 → ICMP Echo request
      Src 10.0.1.34 → Dst 10.0.0.58

A few details worth noticing:
- `LRH.LNH = 0x2` confirms local-subnet routing, which is why no `GRH` appears between `LRH` and `BTH`.
- The `IBA Payload - EtherType-encapsulated` line is the IPoIB shim: a 4-byte header with an EtherType selecting IPv4 or ARP, sitting between the BTH and the IP packet. This is the layer that lets ordinary IP applications run over IB.
- Both `Invariant CRC` and `Variant CRC` are present in the dissection tree. ICRC covers everything except mutable fields; VCRC covers the entire packet on the link.
- `MigReq = True` reports the sender's path-migration state: a QP in the Migrated state, which is the normal state when automatic path migration is not armed, sets this bit. It is a per-QP attribute and is unrelated to the data being carried.
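The byte budget of frame 10 can be reconciled from the LRH packet length alone. The sketch below assumes the standard IBA header sizes (LRH 8 bytes, BTH 12 bytes, ICRC 4 bytes, VCRC 2 bytes), the 4-byte IPoIB encapsulation header described above, and a packet with no GRH and no extended transport header, which matches an RC SEND Only on the local subnet. The function name is hypothetical.

```python
LRH_BYTES = 8
BTH_BYTES = 12
ICRC_BYTES = 4
VCRC_BYTES = 2
IPOIB_ENCAP_BYTES = 4  # EtherType shim between BTH and the IP packet

def frame_budget(lrh_packet_length_words):
    """Break down a local-subnet RC SEND Only frame that carries IPoIB."""
    # LRH packet length counts 4-byte words from LRH through ICRC (VCRC excluded).
    lrh_through_icrc = lrh_packet_length_words * 4
    wire_length = lrh_through_icrc + VCRC_BYTES
    ip_packet_bytes = (
        lrh_through_icrc - LRH_BYTES - BTH_BYTES - IPOIB_ENCAP_BYTES - ICRC_BYTES
    )
    return {
        "lrh_through_icrc": lrh_through_icrc,
        "wire_length": wire_length,
        "ip_packet_bytes": ip_packet_bytes,
    }

print(frame_budget(28))
```

With a packet length of 28 words the wire length comes out to 114 bytes, matching `erf.wlen`, and 84 bytes remain for the IPv4 packet, which is consistent with a default-sized ICMP echo request.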
Key Takeaways
- ERF is a thin metadata wrapper; nearly every IB header field, including ICRC/VCRC, survives into the dissection tree. The “simplification” is real only at sub-packet hardware-event level.
- Use `erf.ts` for nanosecond-resolution timing analysis (e.g., the one-second `ibping` cadence in `ib_ibping_sniffer.pcap` is precisely measurable from this field).
- Use `erf.flags.cap` to distinguish link directions in `infiniband.pcap`, and to recognize that single-interface captures may show only one half of a bidirectional exchange.
- Use `erf.lctr` to detect sniffer drops; a non-zero value means there is a gap in the recording that no amount of IB-layer analysis can recover.
- Conclusions about link credit exhaustion, link training, or symbol-error rates require switch counters and HCA hardware diagnostics; they are intrinsically not in the pcap, regardless of which sniffer was used.
InfiniBand Packet Structure
Building on the encapsulation diagram above, an InfiniBand packet can be read from left to right as fabric routing, transport selection, operation-specific metadata, and payload. The exact extended header depends on the transport and opcode.
For bit-level field layouts of every header listed here (LRH, GRH, BTH, DETH, RETH, AETH, AtomicETH, ImmDt, IETH, RDETH, XRCETH, MAD, SMP DR, IPoIB encap), the AETH syndrome encoding, and the full BTH opcode master table, see the companion `packet-format-reference.md`. This section gives the high-level packet shape; the reference document drills down to byte and bit boundaries.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
LRH["LRH<br/>Local Route Header<br/>DLID, SLID, VL, packet length"]
GRH["GRH<br/>Global Route Header<br/>optional, GID-based routing"]
BTH["BTH<br/>Base Transport Header<br/>opcode, P_Key, destination QP, PSN"]
EXT["Extended Header<br/>DETH, RETH, AETH, Atomic, Immediate, or none"]
PAYLOAD["Payload<br/>MAD, IPoIB packet, SEND data, RDMA data"]
CRC["ICRC / VCRC<br/>integrity checks on the wire"]
LRH --> GRH --> BTH --> EXT --> PAYLOAD --> CRC
GRH is optional, so many local-subnet packets are effectively LRH -> BTH -> .... In this dataset every packet carries LRH.LNH = 0x2, which is why no GRH is decoded. Whether ICRC/VCRC are exposed depends on the capture path; this dataset preserves both as filterable fields (infiniband.invariant.crc, infiniband.variant.crc) — see ERF Capture Anatomy for evidence and the full preservation matrix.
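The role of `LRH.LNH` can be written down as a small lookup. This sketch follows the two-bit Link Next Header encoding in the InfiniBand architecture; the dictionary and function names are hypothetical.

```python
# LRH Link Next Header (2 bits): which header follows the LRH.
LNH_NEXT_HEADER = {
    0b00: "Raw (non-IBA transport)",
    0b01: "Raw IPv6 (non-IBA transport)",
    0b10: "BTH (IBA local, no GRH)",
    0b11: "GRH (IBA global, GRH then BTH)",
}

def next_header(lnh):
    """Describe the header that follows an LRH with the given LNH value."""
    return LNH_NEXT_HEADER[lnh & 0b11]

def has_grh(lnh):
    """True only for IBA global packets, the one case where a GRH is present."""
    return (lnh & 0b11) == 0b11

for lnh in (0x2, 0x3):
    print(hex(lnh), next_header(lnh), "GRH present:", has_grh(lnh))
```

Every packet in this dataset has LNH = 0x2, so `has_grh` is false throughout and the BTH directly follows the LRH.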
Common packet shapes:
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
MGMT["Management traffic<br/>QP0/QP1 control path"]
MGMT_SEQ["LRH -> BTH -> DETH -> MAD"]
IPOIB["IP over InfiniBand<br/>normal IP payload over IB"]
IPOIB_SEQ["LRH -> BTH -> DETH -> IP -> TCP/ICMP/SSH"]
RC_SEND["Reliable Connection SEND<br/>message-style data path"]
RC_SEND_SEQ["LRH -> BTH -> SEND payload"]
RC_ACK["Reliable Connection ACK<br/>transport acknowledgement"]
RC_ACK_SEQ["LRH -> BTH -> AETH"]
RDMA_WRITE["RDMA WRITE<br/>one-sided push, not fully shown here"]
RDMA_WRITE_SEQ["LRH -> BTH -> RETH -> data payload"]
RDMA_READ["RDMA READ<br/>one-sided pull, not fully shown here"]
RDMA_READ_SEQ["Request: LRH -> BTH -> RETH<br/>Response: LRH -> BTH -> AETH -> data payload"]
MGMT --> MGMT_SEQ
IPOIB --> IPOIB_SEQ
RC_SEND --> RC_SEND_SEQ
RC_ACK --> RC_ACK_SEQ
RDMA_WRITE --> RDMA_WRITE_SEQ
RDMA_READ --> RDMA_READ_SEQ
How this maps to the current captures:
| Packet family | Typical structure | Visible in this packet set? | Notes |
|---|---|---|---|
| Subnet Management | LRH -> BTH -> DETH -> MAD | Yes | Seen in ib_initial_sniffer.pcap, ib_ibtracert_sminfo_sniffer.pcap, and infiniband.pcap |
| Performance Management | LRH -> BTH -> DETH -> MAD | Yes | Seen as PortCounters, PortCountersExtended, and ClassPortInfo |
| IPoIB | LRH -> BTH -> DETH -> IP payload | Yes | Carries ICMP, TCP, SSH, and ARP-like behavior over InfiniBand |
| RC SEND | LRH -> BTH -> payload | Yes | infiniband.pcap shows RC SEND Only |
| RC ACK | LRH -> BTH -> AETH | Yes | infiniband.pcap shows RC Acknowledge |
| RDMA WRITE | LRH -> BTH -> RETH -> payload | No | This would show remote virtual address and rkey in RETH |
| RDMA READ | request with RETH, response with data | No | This would show the pull model described in Chapter 1 |
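The packet shapes in the table follow mechanically from the BTH opcode. The sketch below encodes a small subset of the IBA opcode table (not the full master table) and rebuilds the `LRH -> BTH -> ...` string for each; it assumes local-subnet packets without a GRH, and the names `HEADERS_AFTER_BTH` and `packet_shape` are hypothetical.

```python
# opcode -> (name, extended headers after BTH, carries a data payload?)
HEADERS_AFTER_BTH = {
    0x04: ("RC SEND Only", [], True),
    0x06: ("RC RDMA WRITE First", ["RETH"], True),
    0x0A: ("RC RDMA WRITE Only", ["RETH"], True),
    0x0C: ("RC RDMA READ Request", ["RETH"], False),
    0x0D: ("RC RDMA READ Response First", ["AETH"], True),
    0x0E: ("RC RDMA READ Response Middle", [], True),
    0x0F: ("RC RDMA READ Response Last", ["AETH"], True),
    0x10: ("RC RDMA READ Response Only", ["AETH"], True),
    0x11: ("RC Acknowledge", ["AETH"], False),
    0x64: ("UD SEND Only", ["DETH"], True),
}

def packet_shape(opcode):
    """Return (opcode name, header sequence) for a local-subnet packet."""
    name, ext_headers, has_payload = HEADERS_AFTER_BTH[opcode]
    parts = ["LRH", "BTH", *ext_headers]
    if has_payload:
        parts.append("payload")
    return name, " -> ".join(parts)

for opcode in (0x64, 0x04, 0x11, 0x0A, 0x0C):
    print(f"0x{opcode:02x}", *packet_shape(opcode))
```

For UD management and IPoIB traffic the "payload" is the MAD or the encapsulated IP packet; for the RC rows it is SEND or RDMA data.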
RDMA Read/Write Packet Analysis Model
The current pcap set does not contain a complete RDMA READ or RDMA WRITE exchange. This section is therefore a reference model for how such packets should be interpreted if future captures include one-sided RDMA operations. It is based on the InfiniBand transport-layer behavior described in the official references and the Tencent Cloud article listed in the references.
The key header for one-sided RDMA operations is RETH, the RDMA Extended Transport Header.
| Header | Important fields | Why it matters |
|---|---|---|
| `BTH` | opcode, destination QP, PSN, ACK request | Identifies the operation type and packet ordering |
| `RETH` | virtual address, rkey, DMA length | Authorizes and describes the remote memory range |
| `AETH` | ACK/NAK syndrome, MSN | Confirms reliable transport progress or reports an error |
| Payload | read response data or write data | Carries user data depending on operation direction |
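To make the AETH row concrete, the sketch below decodes the 4-byte header into its 8-bit syndrome and 24-bit MSN and classifies the syndrome by its two type bits. It follows the IBA encoding at a coarse level only (the companion reference covers the full syndrome table); `decode_aeth` is a hypothetical helper and the input bytes are made-up examples.

```python
import struct

SYNDROME_KINDS = {0b00: "ACK", 0b01: "RNR NAK", 0b10: "reserved", 0b11: "NAK"}

def decode_aeth(raw):
    """Decode a 4-byte AETH: 8-bit syndrome followed by a 24-bit MSN."""
    (word,) = struct.unpack(">I", raw[:4])
    syndrome = word >> 24
    return {
        "kind": SYNDROME_KINDS[(syndrome >> 5) & 0b11],
        # Low 5 bits: credit count (ACK), timer value (RNR NAK), or NAK code (NAK).
        "low5": syndrome & 0x1F,
        "msn": word & 0xFFFFFF,
    }

# Made-up example bytes, not taken from the captures.
print(decode_aeth(bytes.fromhex("1f000001")))  # an ACK
print(decode_aeth(bytes.fromhex("60000002")))  # a NAK with code 0 (PSN sequence error)
```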
Operation Support by Transport Service
InfiniBand transport services do not support all verbs-style operations equally. The practical takeaway is that one-sided operations that need a response, strict ordering, or read-modify-write semantics require a reliable transport context.
| Operation | RC | UC | UD | RD |
|---|---|---|---|---|
| SEND/RECV | ✓ | ✓ | ✓ | ✓ |
| RDMA WRITE | ✓ | ✓ | ✗ | ✓ |
| RDMA READ | ✓ | ✗ | ✗ | ✓ |
| Atomic | ✓ | ✗ | ✗ | ✓ |
In modern RDMA software, RC is the common practical transport for RDMA READ and Atomic operations. RD also supports them in the InfiniBand architecture, but it is rarely the default choice in mainstream application stacks.
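The support matrix is small enough to encode directly, which makes it easy to ask which transports can serve a given mix of operations. This is only a restatement of the table above as data; the names `SUPPORT`, `supported`, and `transports_for` are hypothetical.

```python
# Operation -> transport services that support it (mirrors the table above).
SUPPORT = {
    "SEND": {"RC", "UC", "UD", "RD"},
    "RDMA WRITE": {"RC", "UC", "RD"},
    "RDMA READ": {"RC", "RD"},
    "ATOMIC": {"RC", "RD"},
}

def supported(operation, transport):
    """True if the transport service supports the operation."""
    return transport in SUPPORT[operation]

def transports_for(operations):
    """Transports that support every operation in the given list."""
    result = {"RC", "UC", "UD", "RD"}
    for operation in operations:
        result &= SUPPORT[operation]
    return sorted(result)

print(supported("RDMA READ", "UC"))
print(transports_for(["SEND", "RDMA WRITE"]))
print(transports_for(["RDMA WRITE", "RDMA READ", "ATOMIC"]))
```

A workload that needs READ or Atomic operations is left with RC and RD only, which is the practical reason RC dominates.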
Why RDMA READ does not fit UC/UD:
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
participant Req as Requester
participant Resp as Responder
Req->>Resp: READ Request<br/>"Read N bytes from remote VA with rkey"
Resp->>Resp: Validate rkey, VA, length, ordering, responder resources
Resp-->>Req: READ Response<br/>Data returns to requester
RDMA READ is not just a one-way packet. It creates responder-side work: the responder RNIC must validate the request, fetch remote memory, generate one or more response packets, preserve ordering, and handle retry/error behavior. UC has no reliable response/ACK machinery, and UD is message-oriented datagram transport without the connected responder state needed for remote memory reads.
Why Atomic does not fit UC/UD:
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
participant Req as Requester
participant Resp as Responder
participant MR as Remote MR
Req->>Resp: Atomic request<br/>Compare-and-swap or fetch-and-add
Resp->>MR: Read old value, compute, write new value atomically
Resp-->>Req: Atomic response<br/>Return original value or completion state
Atomic operations require a single indivisible read-modify-write at the remote memory location. The requester also needs a reliable response to know the returned value and whether the operation completed. That requires connected state, ordering, and retry/error semantics, which is why practical deployments use RC-style reliable transport for atomics.
Practical note for NCCL, UCX, MPI, and DC transport:
NCCL collectives such as AllReduce move large chunks from GPU memory to other GPU memory. Some phases can be implemented as push-style transfers, but pull-based peer access patterns benefit from RDMA READ semantics. Ring and tree algorithms also require predictable ordering and completion behavior, so reliable transport is important.
UCX is a general-purpose communication layer. Small messages may use SEND/RECV or inline paths, while large messages can use RDMA. UCX also exposes tag matching and RMA-style operations, including atomics on capable transports. That naturally favors reliable connection-oriented transports for the paths that need READ, Atomic, ordering, or retry semantics.
MPI implementations often map one-sided primitives such as `MPI_Put`, `MPI_Get`, and `MPI_Accumulate` onto RDMA WRITE, RDMA READ, and Atomic operations when the transport supports it. Since MPI semantics assume reliable communication, the underlying network path usually needs reliable completion and ordering behavior.
At large cluster scale, pure RC can become expensive because a dense all-to-all peer mesh may require a large number of QPs and associated HCA memory. `DC` transport, or Dynamically Connected transport, addresses this by keeping reliable semantics while dynamically reusing connection resources. This is why DC-style transports are important in large InfiniBand deployments. NVIDIA SHARP and NCCL-RDMA-SHARP paths can also appear in modern collective stacks, but the exact use of DC, UCX, verbs, or SHARP depends on hardware, plugin availability, topology, and runtime environment settings.
Transport-Layer Details Worth Checking
The Tencent Cloud article is useful because it frames RDMA READ/WRITE as InfiniBand transport-layer operations, not just verbs API calls. The following details are worth carrying into packet analysis:
| Detail | Packet-analysis implication |
|---|---|
| Transport service type | The upper three bits of the BTH opcode identify whether the packet belongs to RC, UC, RD, UD, or XRC style transport behavior. This matters because ACK/NAK behavior and packet validation differ by service. |
| `BTH` is the operation decoder | BTH opcode determines how the bytes after BTH should be interpreted: RETH, AETH, DETH, immediate data, payload, or no extended header. |
| `PSN` is not just a counter | Packet Sequence Number is used by the responder/requester to detect missing, duplicate, or out-of-order packets. In reliable services, this drives ACK/NAK and retry behavior. |
| `P_Key` and destination QP are validation inputs | A packet can be silently dropped if its destination QP, QP state, transport type, or partition key does not match the responder context. |
| `RETH` is a protection boundary | RETH is not only an address descriptor. The responder must validate rkey, access permissions, virtual address range, and DMA length before touching remote memory. |
| `AETH` carries ACK/NAK state | In reliable transports, AETH tells the requester whether progress was acknowledged or whether a retry/error condition exists. |
| `ICRC/VCRC` are on-wire integrity checks | Capture tools may expose only part of this, but packets with invalid CRCs are normally discarded before useful transport-layer interpretation. |
P_Key deserves special attention. It is a partition membership value carried in BTH, similar in spirit to a fabric-level tenant or isolation tag. The high bit indicates full vs limited membership, and the lower bits identify the partition. If a packet’s P_Key does not match the destination port’s partition membership or the QP context, the packet is not accepted as valid traffic for that partition. This is why P_Key should be read together with destination QP, transport type, and QP state when debugging packet drops.
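The P_Key layout described above can be sketched in a few lines. The matching rule used here is the IBA partition rule that two P_Keys match when their lower 15 bits are equal and at least one side is a full member; the function names are hypothetical.

```python
MEMBERSHIP_BIT = 0x8000   # high bit: 1 = full member, 0 = limited member
PARTITION_MASK = 0x7FFF   # lower 15 bits: partition identifier

def decode_pkey(pkey):
    """Split a 16-bit P_Key into membership type and partition number."""
    return {
        "full_member": bool(pkey & MEMBERSHIP_BIT),
        "partition": pkey & PARTITION_MASK,
    }

def pkeys_match(packet_pkey, local_pkey):
    """Same partition, and at least one of the two is a full member."""
    same_partition = (packet_pkey & PARTITION_MASK) == (local_pkey & PARTITION_MASK)
    at_least_one_full = bool((packet_pkey | local_pkey) & MEMBERSHIP_BIT)
    return same_partition and at_least_one_full

print(decode_pkey(0xFFFF))
print(pkeys_match(0xFFFF, 0x7FFF))  # full member talking to limited member
print(pkeys_match(0x7FFF, 0x7FFF))  # two limited members
```

Two limited members of the same partition do not match, which is a common surprise when debugging silent drops on fabrics with custom partitions.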
Two subtle points are easy to miss:
- Multi-packet SEND and RDMA WRITE messages are not interleaved with other operations on the same send queue until the final packet of that message has been generated.
- RDMA READ behaves differently: after issuing a READ request, the requester may issue later requests without waiting for the READ response, but the maximum number of outstanding READ and ATOMIC operations is negotiated during connection setup.
RDMA READ
RDMA READ is a one-sided pull. The requester asks the responder RNIC to read from remote memory and return the data.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
participant Req as Requester RNIC
participant LocalMR as Local MR
participant Fabric as InfiniBand Fabric
participant Resp as Responder RNIC
participant RemoteMR as Remote MR
Req->>Fabric: RDMA READ request<br/>BTH + RETH, no data payload
Fabric->>Resp: Deliver READ request
Resp->>Resp: Validate QP, PSN, VA, rkey, length, access rights
Resp->>RemoteMR: DMA read requested bytes
Resp-->>Fabric: RDMA READ response packet(s)<br/>BTH + AETH on first/last/only + payload
Fabric-->>Req: Return read payload
Req->>LocalMR: DMA write payload into requester local buffer
For a small READ whose response fits within the path MTU:

    Request:  LRH -> BTH(RDMA READ Request) -> RETH(VA, rkey, length)
    Response: LRH -> BTH(RDMA READ Response Only) -> AETH -> payload

For a multi-packet READ response:

    Request: LRH -> BTH(RDMA READ Request) -> RETH
    First:   LRH -> BTH(RDMA READ Response First) -> AETH -> PMTU-sized payload
    Middle:  LRH -> BTH(RDMA READ Response Middle) -> PMTU-sized payload
    Last:    LRH -> BTH(RDMA READ Response Last) -> AETH -> remaining payload

Important analysis points:
- The READ request packet is small because it describes what to read; it does not carry the requested data.
- A single READ request can produce multiple READ response packets when the requested length exceeds the path MTU.
- AETH is present in RDMA READ Response First, RDMA READ Response Last, and RDMA READ Response Only.
- RDMA READ Response Middle carries payload but does not carry AETH.
- PSN is used to detect missing or out-of-order response packets.
- The responder validates the request (including a retried request), rkey, remote virtual address, and access permissions.
- The requester may have more than one outstanding READ, depending on the negotiated connection limits.
- RDMA READ does not carry immediate data.
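The segmentation and PSN points above can be turned into a small calculation. This sketch assumes that every response packet except the last carries a full PMTU of payload, that a zero-length READ still produces one response packet, and that the requester advances its next PSN by the number of expected response packets, wrapping at 24 bits. The function names are hypothetical.

```shell
# Number of response packets expected for one RDMA READ request.
read_response_packets() {   # args: dma_length_bytes pmtu_bytes
    len=$(( $1 ))
    pmtu=$(( $2 ))
    if [ "$len" -eq 0 ]; then
        echo 1                                # assumed: one empty "Only" response
    else
        echo $(( (len + pmtu - 1) / pmtu ))   # ceiling division
    fi
}

# PSN the requester would use for the request that follows this READ.
next_request_psn() {        # args: request_psn dma_length_bytes pmtu_bytes
    n=$(read_response_packets "$2" "$3")
    echo $(( ($1 + n) & 0xffffff ))           # PSN is a 24-bit counter
}

read_response_packets 65536 1024        # 64: First, 62 Middle, Last
next_request_psn 0xfffff0 65536 1024    # 48, after wrapping past 0xffffff
```

When reading a capture, a gap between the observed response count and this expected count is a quick hint that response packets were lost or that the capture window cut the message short.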
Example Wireshark decode, with sensitive values anonymized:

    RDMA READ Request
      BTH:
        Opcode: Reliable Connection (RC) - RDMA READ Request
        Partition Key: 0xffff
        Destination QP: 0x00xxxx
        Acknowledge Request: True
        Packet Sequence Number: <request_psn>
      RETH:
        Virtual Address: 0x0000xxxxxxxxxxxx
        Remote Key: 0x00xxxxxx
        DMA Length: 65536 bytes
      ICRC: Present

    RDMA READ Response Middle
      BTH:
        Opcode: Reliable Connection (RC) - RDMA READ Response Middle
        Partition Key: 0xffff
        Destination QP: 0x00xxxx
        Acknowledge Request: False
        Packet Sequence Number: <response_psn>
      Payload:
        Data: 1024 bytes
      ICRC: Present

The request decode is the key evidence for a one-sided READ: it has BTH + RETH, and RETH carries the remote virtual address, rkey, and requested DMA length. BTH also carries the Partition Key (P_Key), which identifies the InfiniBand partition membership used by the packet. A commonly seen value such as 0xffff represents full membership in the default partition, but production fabrics may use different partition keys for isolation.
The response-middle decode shows the reverse data movement: it carries data payload but no RETH and no AETH. This matches the multi-packet READ model where AETH appears on the first, last, or only response packet, while middle response packets are pure data-bearing segments.
For public documentation, avoid publishing raw screenshots unless the following fields are masked:
- remote virtual address
- rkey
- destination QP
- packet sequence number
- any payload bytes that may contain application data
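Masking can be applied to text decodes before they are published. The filter below is a sketch under stated assumptions: it expects text on stdin whose field labels match the anonymized decode shown in this section (Virtual Address, Remote Key, Destination QP, Packet Sequence Number), and the function name `mask_decode` is hypothetical. Labels printed by a local Wireshark/TShark build may differ and should be checked first. Payload bytes are not handled here and still need separate review.

```shell
# Replace the value of each sensitive field with a placeholder.
# Lines that do not contain one of the labels pass through unchanged.
mask_decode() {
    sed -E \
        -e 's/(Virtual Address: ).*/\1<masked>/' \
        -e 's/(Remote Key: ).*/\1<masked>/' \
        -e 's/(Destination QP: ).*/\1<masked>/' \
        -e 's/(Packet Sequence Number: ).*/\1<masked>/'
}

# Illustrative input only; the key value is made up for the example.
printf '%s\n' '    Remote Key: 0x00123456' '    DMA Length: 65536 bytes' | mask_decode
```

Masking text output this way does not sanitize the underlying pcap file, so the capture itself should not be shared on the strength of a masked decode.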
Example AETH decode, with sensitive values anonymized:

    RC Acknowledge
      BTH:
        Opcode: Reliable Connection (RC) - Acknowledge
        Partition Key: 0xffff
        Destination QP: 0x00xxxx
        Acknowledge Request: False
        Packet Sequence Number: <ack_psn>
      AETH:
        Syndrome: 0, Ack
        OpCode: Ack
        Credit Count: <credit_count>
        Message Sequence Number: <msn>
      ICRC: Present

AETH is the key ACK/NAK carrier for reliable transport. A normal ACK indicates that the responder accepted progress for the relevant reliable operation. If the syndrome indicates NAK or an error condition, the requester may need to retry or fail the Work Request depending on the transport state and retry counters. In packet analysis, BTH tells us this is an RC acknowledge packet and which partition/QP context it belongs to, while AETH tells us whether it is a successful acknowledgement or an error/flow-control signal.
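The syndrome byte is where the ACK/NAK distinction described above is encoded. A minimal decoding sketch follows, assuming the usual layout in which bits 6..5 select ACK, RNR NAK, or NAK and bits 4..0 carry an encoded credit field, an RNR timer field, or a NAK code. The function name `aeth_decode` and its output strings are hypothetical; only the bit layout is meant to reflect the transport format.

```shell
# Classify an 8-bit AETH syndrome value.
aeth_decode() {   # arg: syndrome, decimal or 0x-prefixed
    syn=$(( $1 & 0xff ))
    kind=$(( (syn >> 5) & 0x3 ))    # bits 6..5: response type
    low=$(( syn & 0x1f ))           # bits 4..0: meaning depends on type
    case $kind in
        0) echo "ACK credit_field=$low" ;;
        1) echo "RNR_NAK timer_field=$low" ;;
        3) case $low in
               0) echo "NAK PSN sequence error" ;;
               1) echo "NAK invalid request" ;;
               2) echo "NAK remote access error" ;;
               3) echo "NAK remote operational error" ;;
               4) echo "NAK invalid RD request" ;;
               *) echo "NAK reserved_code=$low" ;;
           esac ;;
        *) echo "reserved" ;;
    esac
}

aeth_decode 0x00    # ACK credit_field=0
aeth_decode 0x62    # NAK remote access error
```

The credit and timer fields are encoded values rather than literal counts or durations, so the sketch prints them as raw field values instead of interpreting them.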
RDMA WRITE
RDMA WRITE is a one-sided push. The requester sends data to a remote memory range that the responder has already registered and shared through metadata exchange.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
participant Req as Requester RNIC
participant Fabric as InfiniBand Fabric
participant Resp as Responder RNIC
participant MR as Remote MR
Req->>Fabric: RDMA WRITE request<br/>BTH + RETH + payload
Fabric->>Resp: Deliver WRITE packets
Resp->>Resp: Validate QP, PSN, VA, rkey, length, access rights
Resp->>MR: DMA write payload into remote memory
Resp-->>Req: ACK / NAK<br/>BTH + AETH
For a small WRITE whose payload fits within the path MTU, the packet shape is:

    LRH -> BTH(RDMA WRITE Only) -> RETH(VA, rkey, length) -> payload -> ICRC/VCRC

For a multi-packet WRITE, the message is segmented:

    First packet:  LRH -> BTH(RDMA WRITE First) -> RETH -> PMTU-sized payload
    Middle packet: LRH -> BTH(RDMA WRITE Middle) -> PMTU-sized payload
    Last packet:   LRH -> BTH(RDMA WRITE Last) -> remaining payload
    ACK:           LRH -> BTH(Acknowledge) -> AETH

Important analysis points:
- RETH appears in the first packet or the only packet of an RDMA WRITE message.
- RETH carries the remote virtual address, rkey, and DMA length.
- Middle and last WRITE packets carry payload but do not repeat the full remote memory metadata.
- The responder checks the rkey, access permissions, address range, and packet sequence.
- Multi-packet WRITE messages are ordered as one message and are not interleaved with other operations on the same send queue before the final WRITE packet.
- In reliable transports such as RC, the responder returns an ACK or NAK using AETH.
- A normal RDMA WRITE updates remote memory but does not automatically notify the remote application. Notification requires a higher-level protocol, RDMA_WRITE_WITH_IMM, SEND/RECV, or polling.
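The WRITE segmentation rules above can be sketched as a function that lists the packets expected for a given message length. It assumes full PMTU-sized payloads in every packet except the last and ignores the immediate-data variants. The function name `write_packets` and the output wording are hypothetical.

```shell
# Print one line per expected RDMA WRITE packet, noting where RETH appears.
write_packets() {   # args: message_length_bytes pmtu_bytes
    len=$(( $1 ))
    pmtu=$(( $2 ))
    if [ "$len" -le "$pmtu" ]; then
        echo "RDMA WRITE Only: RETH + $len bytes"
        return
    fi
    n=$(( (len + pmtu - 1) / pmtu ))          # total packets, ceiling division
    last=$(( len - (n - 1) * pmtu ))          # bytes left for the final packet
    echo "RDMA WRITE First: RETH + $pmtu bytes"
    i=2
    while [ "$i" -lt "$n" ]; do               # packets 2..n-1 are Middle
        echo "RDMA WRITE Middle: $pmtu bytes"
        i=$(( i + 1 ))
    done
    echo "RDMA WRITE Last: $last bytes"
}

write_packets 10000 4096
# RDMA WRITE First: RETH + 4096 bytes
# RDMA WRITE Middle: 4096 bytes
# RDMA WRITE Last: 1808 bytes
```

Only the First or Only line mentions RETH, which mirrors the point that the remote address and rkey are stated once per message rather than once per packet.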
What Future Captures Should Show
A future pcap that truly contains one-sided RDMA traffic should show at least some of the following:
| Expected evidence | RDMA READ | RDMA WRITE |
|---|---|---|
| BTH opcode | RDMA READ Request, RDMA READ Response First/Middle/Last/Only | RDMA WRITE First/Middle/Last/Only |
| RETH | Request packet | First or only request packet |
| Remote virtual address | In request RETH | In RETH |
| rkey | In request RETH | In RETH |
| Payload direction | Responder to requester | Requester to responder |
| AETH | First, last, or only read response | ACK/NAK response |
| Target CPU involvement | Not in data path | Not in data path |
This explains why the current packet set is useful for RDMA/IB fundamentals but still cannot be treated as a full one-sided RDMA data-path capture.
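The table above can also be expressed as a lookup from RC opcode to the headers expected after BTH, which is convenient when scanning the output of an opcode count. This is a sketch limited to the RDMA READ, RDMA WRITE, and Acknowledge opcodes without immediate data; the function name `rc_expected_headers` and the output strings are hypothetical.

```shell
# Map an RC BTH opcode to the header layout a capture should show.
rc_expected_headers() {   # arg: opcode, decimal or 0x-prefixed
    case $(( $1 & 0xff )) in
        6|10)     echo "RDMA WRITE: BTH + RETH + payload" ;;          # First, Only
        7|8)      echo "RDMA WRITE: BTH + payload" ;;                 # Middle, Last
        12)       echo "RDMA READ request: BTH + RETH" ;;
        13|15|16) echo "RDMA READ response: BTH + AETH + payload" ;;  # First, Last, Only
        14)       echo "RDMA READ response: BTH + payload" ;;         # Middle
        17)       echo "Acknowledge: BTH + AETH" ;;
        *)        echo "not covered by this sketch" ;;
    esac
}

rc_expected_headers 0x0c    # RDMA READ request: BTH + RETH
rc_expected_headers 0x0e    # RDMA READ response: BTH + payload
```

A capture whose opcode histogram never reaches any of these values other than 17 contains acknowledgements but no one-sided operations, which is the situation in the current packet set.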
Control Path vs Data Path
The captures make the Chapter 1 distinction between control path and data path concrete.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
subgraph Control[Control path]
SM[Subnet Management<br/>NodeInfo, PortInfo, SMInfo]
SA[Subnet Administration<br/>PathRecord, MCMemberRecord]
PM[Performance Management<br/>PortCounters]
CM[Connection Management<br/>ConnectRequest, ConnectReply, ReadyToUse]
end
subgraph Data[Data path]
IPoIB[IP over InfiniBand<br/>ICMP, TCP, SSH]
RC[Reliable Connection traffic<br/>RC SEND and ACK]
end
Control --> Ready[QP/path/resources ready]
Ready --> Data
Control Path
Control path traffic is strongly represented in these captures. It includes:
- Subnet discovery through QP0.
- Subnet Administration through QP1.
- Path discovery and multicast membership.
- Performance counter queries.
- Connection Management messages.
This corresponds to Chapter 1’s explanation that RDMA does not remove the kernel or control software from the system. The CPU, kernel driver, subnet manager, RDMA runtime, and NIC firmware still configure resources and paths.
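When triaging a capture, the control path categories listed above can be assigned from the MAD management class value, for example the value extracted through the infiniband.mad.mgmtclass field. The sketch below covers only the classes discussed in this report and assumes the conventional split in which Subnet Management uses QP0 and the other management classes use QP1. The function name `mad_class_category` is hypothetical.

```shell
# Map a MAD management class to its QP and control path category.
mad_class_category() {   # arg: management class, decimal or 0x-prefixed
    c=$(( $1 & 0xff ))
    case $c in
        1)   echo "QP0 Subnet Management (LID routed)" ;;
        129) echo "QP0 Subnet Management (directed route)" ;;   # 0x81
        3)   echo "QP1 Subnet Administration" ;;
        4)   echo "QP1 Performance Management" ;;
        7)   echo "QP1 Connection Management" ;;
        *)
            # 0x30..0x4f is a vendor-specific class range.
            if [ "$c" -ge 48 ] && [ "$c" -le 79 ]; then
                echo "QP1 Vendor"
            else
                echo "unclassified"
            fi ;;
    esac
}

mad_class_category 0x04    # QP1 Performance Management
mad_class_category 0x32    # QP1 Vendor
```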
Data Path
Data path traffic appears in two forms:
- IPoIB traffic, where normal IP applications run over InfiniBand.
- RC SEND/ACK traffic, where InfiniBand transport behavior is visible below an IP payload.
The packet set does not show full RDMA Write or RDMA Read operations with RETH. Therefore, it is better to describe it as an InfiniBand and IPoIB packet set, not as a complete RDMA Read/Write capture set.
Per-Capture Findings
ib_initial_sniffer.pcap
This capture is the best example of InfiniBand control path initialization.
Protocol hierarchy:

    erf
      infiniband
        arp

Representative packets:

    UD Send Only QP=0x000000 SubnGet(NodeInfo)
    UD Send Only QP=0x000000 SubnGetResp(NodeInfo)
    UD Send Only QP=0x000000 SubnGet(NodeDescription)
    UD Send Only QP=0x000000 SubnGetResp(NodeDescription)
    UD Send Only QP=0x000000 SubnGet(PortInfo)
    UD Send Only QP=0x000000 SubnGetResp(PortInfo)

Interpretation:
- QP0 is used for Subnet Management Packets.
- The node is being discovered and configured.
- NodeInfo, NodeDescription, PortInfo, P_KeyTable, and SMInfo are part of fabric discovery and setup.
- Subnet Administration records such as MCMemberRecord and InformInfo also appear.
- This is control path traffic, not user payload movement.
Why it matters for Chapter 1:
- It shows the setup work that must happen before the fast path can be used.
- It supports the point that “kernel bypass” does not mean “no setup or control path.”
- It maps to PD/QP/MR/path readiness concepts in the RDMA Process section.
ib_ibping_sniffer.pcap
This capture is centered on ibping behavior and vendor MAD messages.
Protocol hierarchy:

    erf
      infiniband

Representative packets:

    LID: 5 -> LID: 8 InfiniBand 290 VENDOR (Unknown Attribute)
    LID: 8 -> LID: 5 InfiniBand 290 VENDOR (Unknown Attribute)

Top decoded items:

    VENDOR (Unknown Attribute)
    PERF (PortCounters)
    PERF (ClassPortInfo)
    PERF (PortCountersExtended)

LID-pair distribution (65 packets total):
| Source LID | Dest LID | MAD class | Method | Packets | Likely workload |
|---|---|---|---|---|---|
| 5 | 8 | 0x32 Vendor OUI | Get | 11 | ibping request |
| 8 | 5 | 0x32 Vendor OUI | GetResp | 11 | ibping reply |
| 5 | 1 | 0x04 PerfMgt | Get | 27 | PortCounters poll |
| 5 | 4 | 0x04 PerfMgt | Get | 12 | PortCounters poll |
| 5 | 2 | 0x04 PerfMgt | Get | 2 | PortCounters poll |
| 2 | 5 | 0x04 PerfMgt | GetResp | 2 | PortCounters reply |
The vendor MAD class 0x32 is what ibping uses to carry its own request/response, separate from the standard PerfMgt class 0x04. The ibping exchange is exclusively LID 5 ↔ LID 8 (22 of 65 packets, perfectly symmetric request/response). The remaining 43 packets are background perfquery-style polling: 41 requests sent by LID 5 and 2 replies, with only LID 2 replying within the capture window.
Reproducing this breakdown:

    tshark -r ../ib-packets/ib_ibping_sniffer.pcap \
      -T fields \
      -e infiniband.lrh.slid \
      -e infiniband.lrh.dlid \
      -e infiniband.mad.mgmtclass \
      -e infiniband.mad.method \
      | sort | uniq -c

Interpretation:
- ibping uses InfiniBand management-style traffic rather than IP ping.
- The ibping request/response pair is between LID 5 and LID 8, carried over the vendor MAD class 0x32.
- The periodic pattern is visible: requests and responses are roughly one second apart.
- Performance Management traffic from a separate tool (likely perfquery) is mixed into the same capture window, which is why LIDs 1, 2, and 4 also appear.
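The packet counts in the LID-pair table can be cross-checked with a short awk pass. The input below is a hand-transcribed form of the table rows (source LID, destination LID, MAD class, packet count), not raw tshark output, and the function name `summarize_lid_pairs` is hypothetical.

```shell
# Sum packet counts overall and per MAD class.
# Input columns: src_lid dst_lid mad_class packets
summarize_lid_pairs() {
    awk '
        { total += $4 }
        $3 == "0x32" { ping += $4 }     # vendor class used by ibping
        $3 == "0x04" { perf += $4 }     # Performance Management
        END { printf "total=%d ibping=%d perfmgt=%d\n", total, ping, perf }
    '
}

summarize_lid_pairs <<'EOF'
5 8 0x32 11
8 5 0x32 11
5 1 0x04 27
5 4 0x04 12
5 2 0x04 2
2 5 0x04 2
EOF
# total=65 ibping=22 perfmgt=43
```

The totals agree with the 65-packet capture size and the 22/43 split between ibping and Performance Management traffic.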
Why it matters for Chapter 1:
- It demonstrates that InfiniBand has its own management and diagnostic traffic independent of TCP/IP.
- LIDs are used directly at the fabric level.
ib_ibtracert_sminfo_sniffer.pcap
This capture combines trace-related behavior, SMInfo, and performance management.
Protocol hierarchy:

    erf
      infiniband

Representative packets:

    PERF (PortCounters)
    PERF (PortCountersExtended)
    UD Send Only QP=0x000000 SubnGet(SMInfo)
    UD Send Only QP=0x000000 SubnGetResp(SMInfo)
    UD Send Only QP=0x000000 SubnGet(LinearForwardingTable)
    UD Send Only QP=0x000000 SubnGetResp(LinearForwardingTable)

Interpretation:
- ibtracert needs fabric topology and forwarding information.
- SMInfo identifies subnet manager information.
- LinearForwardingTable points to switch forwarding behavior.
- Performance counters show management-plane visibility into port state.
Why it matters for Chapter 1:
- It shows the control plane objects behind the switched fabric model.
- It supports the Packet Relay / Fabric section: switches forward packets, while management traffic discovers and programs fabric behavior.
ib_sniffer.pcap
This is a small Performance Management capture.
Protocol hierarchy:

    erf
      infiniband

Top decoded items:

    PERF (PortCounters)
    PERF (PortCountersExtended)

Interpretation:
- The capture is dominated by performance counter queries.
- It is useful for observing monitoring traffic, but it does not show application payloads or RDMA Read/Write operations.
Why it matters for Chapter 1:
- It connects to the monitoring side of the control path.
- Fabric health is not inferred only from data packets. It is also queried through management traffic.
ib_ipping_sniffer.pcap
This capture shows ICMP over IPoIB.
Protocol hierarchy:

    erf
      infiniband
        ip
          icmp
        arp

Representative packets:

    203.0.113.17 -> 203.0.113.18 ICMP Echo request
    203.0.113.18 -> 203.0.113.17 ICMP Echo reply

Interpretation:
- This is not ibping; it is IP ping carried over InfiniBand.
- It shows normal IP packets mapped onto InfiniBand.
- ARP appears because IPoIB still needs address resolution for IP communication.
Why it matters for Chapter 1:
- It demonstrates the difference between native InfiniBand management tools and IP over InfiniBand.
- It shows how normal IP applications can run above InfiniBand transport.
ib_IPoIB.pcap
This is the largest capture and shows SSH over TCP over IPoIB.
Protocol hierarchy:

    erf
      infiniband
        ip
          tcp
            ssh

TCP conversation:

    10.10.10.12:34826 <-> 10.10.10.11:22
    Total frames: 5848
    Total bytes: 10 MB
    Duration: 4.2846 s

Representative packets:

    10.10.10.12 -> 10.10.10.11 TCP 34826 -> 22 [SYN]
    10.10.10.11 -> 10.10.10.12 TCP 22 -> 34826 [SYN, ACK]
    10.10.10.12 -> 10.10.10.11 SSH Client: Protocol

Interpretation:
- This is an IP workload carried over InfiniBand.
- The application is SSH, not RDMA Read/Write.
- tshark decodes the upper layers just like ordinary IP traffic once it gets past the InfiniBand encapsulation.
- The high frame count and 10 MB size make this the best sample for IPoIB throughput-style traffic.
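The conversation totals can be turned into rough rates with a few lines of awk. This is back-of-the-envelope arithmetic only: the reported size is rounded to 10 MB, which is assumed here to mean 10,000,000 bytes, so the results are approximate. The function name `capture_rates` is hypothetical.

```shell
# Derive average frame size, frame rate, and bit rate from capture totals.
capture_rates() {   # args: total_bytes total_frames duration_seconds
    awk -v bytes="$1" -v frames="$2" -v secs="$3" 'BEGIN {
        printf "avg_frame_bytes=%.0f frames_per_s=%.0f mbit_per_s=%.2f\n",
               bytes / frames, frames / secs, bytes * 8 / secs / 1000000
    }'
}

capture_rates 10000000 5848 4.2846
# avg_frame_bytes=1710 frames_per_s=1365 mbit_per_s=18.67
```

The resulting rate describes this SSH session, not the capacity of the fabric, so it should not be read as an InfiniBand throughput measurement.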
Why it matters for Chapter 1:
- It shows that InfiniBand can carry ordinary IP workloads.
- It should not be confused with RDMA semantics. IPoIB is not the same as one-sided RDMA.
infiniband.pcap
This is the most useful capture for seeing multiple InfiniBand concepts in one place.
Protocol hierarchy:

    erf
      infiniband
        ip
          udp
          icmp
        arp
        ipv6
          icmpv6

Important decoded items:

    UD Send Only QP=0x000000 SubnGet(SMInfo)
    UD Send Only QP=0x000000 SubnGetResp(SMInfo)
    CM: ConnectRequest
    CM: ConnectReply
    CM: ReadyToUse
    RC Acknowledge
    ICMP Echo request
    ICMPv6 Neighbor Solicitation / Advertisement

Condensed flow:

    1-2    SMInfo query and response
    3-6    IPoIB UDP/ARP traffic
    7-9    Connection Management: request, reply, ready
    10-23  RC SEND Only carrying ICMP and RC ACK packets
    24-25  More UDP over IPoIB
    26-31  IPv6 neighbor discovery and RC ACK
    32-33  Subnet Administration PathRecord lookup
    34-40  Second CM setup plus ICMPv6 traffic and ACKs
    41-42  SMInfo query and response
    43     ICMPv6 echo request

The most important pair is frame 10 and frame 11:

    Frame 10:
      Opcode: Reliable Connection (RC) - SEND Only (4)
      Destination QP: 0xfc0407
      Acknowledge Request: True
      Payload: IPv4 ICMP Echo request

    Frame 11:
      Opcode: Reliable Connection (RC) - Acknowledge (17)
      Destination QP: 0x870408
      AETH: Ack

Interpretation:
- Connection Management prepares communication.
- RC SEND carries an IP payload.
- RC ACK confirms reliable delivery.
- AETH is visible in the ACK packet.
- This is close to the Chapter 1 discussion of two-sided reliable transport, even though it is not a full RDMA Write or Read example.
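The numeric opcodes shown for frames 10 and 11 follow the BTH convention in which the upper three bits select the transport service and the lower five bits select the operation. The sketch below splits an opcode that way and names only the two operations seen in this capture; the function name `bth_opcode_decode` is hypothetical, and the sketch does not check whether a given combination is valid for the transport.

```shell
# Split an 8-bit BTH opcode into transport service and operation.
bth_opcode_decode() {   # arg: opcode, decimal or 0x-prefixed
    op=$(( $1 & 0xff ))
    case $(( op >> 5 )) in              # upper 3 bits: transport service
        0) transport=RC ;;
        1) transport=UC ;;
        2) transport=RD ;;
        3) transport=UD ;;
        *) transport=other ;;
    esac
    case $(( op & 0x1f )) in            # lower 5 bits: operation
        4)  operation="SEND Only" ;;
        17) operation="Acknowledge" ;;
        *)  operation="operation $(( op & 0x1f ))" ;;
    esac
    echo "$transport $operation"
}

bth_opcode_decode 4       # frame 10: RC SEND Only
bth_opcode_decode 17      # frame 11: RC Acknowledge
bth_opcode_decode 0x64    # UD SEND Only, as in the QP0 management packets
```

The same operation code 4 appears in both the RC data packets and the UD management packets; only the transport bits differ, which is why the opcode has to be read as a whole.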
Why it matters for Chapter 1:
- It visibly connects QP, BTH opcode, PSN, ACK, and AETH.
- It shows how a ready connection can carry payload and receive transport-level acknowledgments.
- It helps explain why InfiniBand reliability is handled by the HCA/RNIC rather than by the CPU for each packet.
Key Packet Examples
Subnet Management over QP0
From ib_initial_sniffer.pcap:

    UD Send Only QP=0x000000 SubnGet(NodeInfo)
    UD Send Only QP=0x000000 SubnGetResp(NodeInfo)

Meaning:
- QP0 is used for Subnet Management.
- The traffic is part of fabric discovery and configuration.
- This is control path behavior.
Performance Management
From ib_sniffer.pcap and related captures:

    PERF (PortCounters)
    PERF (PortCountersExtended)

Meaning:
- Fabric components expose counters through management traffic.
- These counters support monitoring and troubleshooting.
- This is not application data movement.
From ib_ipping_sniffer.pcap:

    203.0.113.17 -> 203.0.113.18 ICMP Echo request
    203.0.113.18 -> 203.0.113.17 ICMP Echo reply

From ib_IPoIB.pcap:

    10.10.10.12:34826 <-> 10.10.10.11:22 TCP/SSH

Meaning:
- IP packets can be transported over InfiniBand.
- Upper-layer tools may look familiar, but the L2/L3 underlay is InfiniBand rather than Ethernet.
Reliable Connection SEND and ACK
From infiniband.pcap:

    RC SEND Only    QP=0xfc0407
    RC Acknowledge  QP=0x870408

Meaning:
- The BTH opcode distinguishes the transport operation.
- The destination QP identifies the queue pair endpoint.
- The ACK includes AETH.
- This shows reliable InfiniBand transport behavior.
What These Captures Do Not Show
These captures are not full RDMA Read/Write examples.
Missing from the packet set:
- RDMA Write packets with RETH.
- RDMA Read request and response flows.
- Remote virtual address and rkey fields in RETH.
- A complete one-sided RDMA data movement example.
- NCCL collective traffic.
Therefore, the correct interpretation is:
These captures demonstrate InfiniBand fabric management, IPoIB, and some reliable connection transport behavior. They support the RDMA/IB concepts in Chapter 1, but they do not fully demonstrate RDMA Read or RDMA Write data movement.
The RDMA Read/Write Packet Analysis Model section above describes what should appear in future captures that include one-sided operations.
Useful tshark Commands
List basic packet summaries:

    tshark -r ../ib-packets/infiniband.pcap -c 20

Show protocol hierarchy:

    tshark -r ../ib-packets/ib_IPoIB.pcap -q -z io,phs

Extract LRH, BTH, and MAD fields:

    tshark -r ../ib-packets/ib_initial_sniffer.pcap \
      -Y infiniband \
      -T fields \
      -e frame.number \
      -e frame.time_relative \
      -e infiniband.lrh.slid \
      -e infiniband.lrh.dlid \
      -e infiniband.bth.opcode \
      -e infiniband.bth.destqp \
      -e infiniband.mad.mgmtclass \
      -e infiniband.mad.method \
      -e infiniband.mad.attributeid \
      -E header=y

Count BTH opcodes:

    tshark -r ../ib-packets/infiniband.pcap \
      -Y "infiniband.bth.opcode" \
      -T fields \
      -e infiniband.bth.opcode | sort | uniq -c

Show a detailed packet decode:

    tshark -r ../ib-packets/infiniband.pcap \
      -Y "frame.number==10 || frame.number==11" \
      -V

Find IPoIB TCP conversations:

    tshark -r ../ib-packets/ib_IPoIB.pcap -q -z conv,tcp

Discover the exact InfiniBand field names supported by the local Wireshark/TShark build:

    tshark -G fields | rg -i "infiniband.*(bth|reth|aeth|opcode|psn|rkey)"

If a future capture contains RDMA READ or WRITE packets, start with BTH opcodes and then expand into RETH/AETH details:

    tshark -r ../ib-packets/<rdma-read-write-capture>.pcap \
      -Y "infiniband.bth.opcode" \
      -T fields \
      -e frame.number \
      -e infiniband.bth.opcode \
      -e infiniband.bth.destqp \
      -e infiniband.bth.psn

Takeaways for Chapter 1
- InfiniBand has a visible control path. The captures show Subnet Management, Subnet Administration, Performance Management, and Connection Management traffic. This reinforces the Chapter 1 point that RDMA kernel bypass mainly applies to the data path.
- LIDs and QPs are real packet-level identifiers. LRH fields show source and destination LIDs. BTH fields show destination QPs and opcodes. These are not just abstract API concepts.
- QP0 and QP1 matter for management. QP0 appears in Subnet Management traffic. QP1 appears in Subnet Administration and Connection Management traffic.
- IPoIB is different from one-sided RDMA. IPoIB carries normal IP traffic over InfiniBand. It can show TCP, SSH, ARP, and ICMP, but that does not mean it is showing RDMA Read or RDMA Write.
- Reliable Connection behavior is visible. infiniband.pcap shows RC SEND and RC ACK packets. The ACK includes AETH, which matches the Chapter 1 discussion of ACK Extended Transport Header behavior.
- The data path/control path distinction should stay explicit. The management captures are mostly control path. IPoIB and RC SEND/ACK are closer to data path. Full RDMA Read/Write would require additional captures that include RETH and one-sided operation fields.
References
Companion documents
- packet-format-reference.md: bit-level layouts for every IB header used in this dataset, plus the full BTH opcode master table and operation→extended-header mapping.