
InfiniBand Packet Analysis: A Practical RDMA Transport Primer


NVIDIA RDMA TShark InfiniBand IPoIB

This report analyzes the packet captures in the ib-packets directory using tshark. The goal is to connect the captured packets back to the Chapter 1 RDMA and InfiniBand notes: data path vs control path, InfiniBand packet structure, management traffic, IP over InfiniBand, Queue Pairs, and Reliable Connection behavior.

ib packet analysis

The captures can be analyzed with tshark without superuser privileges because they are offline .pcap files. Root or special capture permissions are usually needed for live packet capture, not for reading existing capture files.

The packet set shows several important InfiniBand behaviors:

  • InfiniBand management traffic is visible through MAD packets.
  • Subnet Management traffic appears as SubnGet, SubnGetResp, SubnSet, and Subnet Administration records.
  • Performance Management traffic appears as PortCounters, PortCountersExtended, and ClassPortInfo.
  • IP over InfiniBand (IPoIB) is visible as normal IP, TCP, SSH, ARP, and ICMP traffic carried inside InfiniBand frames.
  • One capture shows Reliable Connection (RC) behavior, including ConnectRequest, ConnectReply, ReadyToUse, RC SEND Only, and RC Acknowledge.

The captures are especially useful for understanding the distinction between:

  • Control path: setup, discovery, management, path lookup, connection establishment, and performance queries.
  • Data path: payload movement after the required resources and paths are ready.

Most captures are management or IPoIB examples. They do not show a complete RDMA Read or RDMA Write payload exchange with RETH fields, remote virtual addresses, or rkeys. The closest data-path example is infiniband.pcap, which shows RC SEND and AETH ACK behavior.

This report intentionally separates packet evidence from explanatory reference material.

| Topic | Status in this report | Evidence or purpose |
| --- | --- | --- |
| Subnet Management | Observed in captures | SubnGet, SubnGetResp, SubnSet, QP0 traffic |
| Subnet Administration | Observed in captures | Path records, multicast membership, QP1 traffic |
| Performance Management | Observed in captures | PortCounters, PortCountersExtended, ClassPortInfo |
| IPoIB | Observed in captures | ICMP, TCP, SSH, and ARP-like behavior over InfiniBand |
| Connection Management | Observed in captures | ConnectRequest, ConnectReply, ReadyToUse |
| Reliable Connection SEND/ACK | Observed in captures | RC SEND Only, RC Acknowledge, AETH |
| RDMA READ | Reference model only | Added to explain BTH + RETH request and response packet behavior for future captures |
| RDMA WRITE | Reference model only | Added to explain BTH + RETH + payload request behavior for future captures |
| NCCL collective traffic | Not present | Use the official NCCL collective operations, NCCL networking troubleshooting, and NVIDIA/nccl-tests references instead of expanding it here |
| Bit-level packet format reference | Companion document | packet-format-reference.md: LRH/GRH/BTH/extended headers/MAD/SMP DR/IPoIB bit layouts and the full BTH opcode master table |

When reading the report, treat the observed sections as analysis of the provided pcap files. Treat the RDMA READ/WRITE section as a packet-analysis guide for future captures that include one-sided RDMA operations. For byte- and bit-level field layouts that the report references but does not exhaustively tabulate, see the companion packet format reference.

| File | Packets | Duration | Main observation |
| --- | --- | --- | --- |
| ib_initial_sniffer.pcap | 108 | 10.90 s | Initial subnet discovery, SMP, SA, multicast membership, and performance queries |
| ib_ibping_sniffer.pcap | 65 | 10.18 s | Vendor MAD request/response behavior plus performance counters |
| ib_ibtracert_sminfo_sniffer.pcap | 84 | 30.46 s | Tracing and SMInfo-related control path traffic |
| ib_sniffer.pcap | 24 | 6.00 s | Performance Management traffic only |
| ib_ipping_sniffer.pcap | 34 | 12.00 s | ICMP ping over IPoIB plus a small amount of ARP and performance traffic |
| ib_IPoIB.pcap | 5,848 | 4.28 s | SSH over TCP over IPoIB |
| infiniband.pcap | 43 | 250.57 s | SMInfo, IPoIB, CM connection setup, RC SEND, and RC ACK behavior |

All files are pcap files with Extensible Record Format encapsulation. tshark decodes the ERF outer record and then the InfiniBand payload.

The analysis used Wireshark/TShark 4.2.2:

tshark -v

Basic capture metadata:

capinfos ../ib-packets/*.pcap

Protocol hierarchy:

tshark -r ../ib-packets/ib_IPoIB.pcap -q -z io,phs

InfiniBand field extraction:

tshark -r ../ib-packets/ib_initial_sniffer.pcap \
-Y infiniband \
-T fields \
-e frame.number \
-e frame.time_relative \
-e infiniband.lrh.dlid \
-e infiniband.lrh.slid \
-e infiniband.bth.opcode \
-e infiniband.bth.destqp \
-e infiniband.mad.method \
-e infiniband.mad.attributeid \
-E header=y

Useful fields:

| Field | Meaning |
| --- | --- |
| infiniband.lrh.dlid | Destination Local ID from the Local Route Header |
| infiniband.lrh.slid | Source Local ID from the Local Route Header |
| infiniband.bth.opcode | Base Transport Header opcode |
| infiniband.bth.destqp | Destination Queue Pair |
| infiniband.mad.mgmtclass | MAD management class |
| infiniband.mad.method | MAD method, such as Get or GetResp |
| infiniband.mad.attributeid | MAD attribute ID |
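The field extraction above produces tab-separated text, which is easy to post-process. The following is a minimal Python sketch, assuming the default tab separator of `tshark -T fields` and a header row from `-E header=y`. The helper name `parse_tshark_fields` and the sample rows are hypothetical and are not taken from the captures; the sketch also assumes the opcode column prints as a decimal number.

```python
import csv
import io


def parse_tshark_fields(text):
    """Parse tab-separated `tshark -T fields -E header=y` output into dicts."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


# Illustrative rows only (made up for this sketch, not real capture data).
sample = (
    "frame.number\tinfiniband.lrh.dlid\tinfiniband.lrh.slid\tinfiniband.bth.opcode\n"
    "1\t5\t8\t100\n"
    "2\t8\t5\t100\n"
)

rows = parse_tshark_fields(sample)

# Keep only rows whose BTH opcode is 100 (UD SEND Only).
ud_send_only = [r for r in rows if r["infiniband.bth.opcode"] == "100"]
print(len(ud_send_only), "UD SEND Only rows")
```

In practice the `sample` string would be replaced by the captured standard output of the tshark command, for example via `subprocess.run(..., capture_output=True, text=True)`.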

The exact capture commands cannot be proven from the pcap files alone. The following is an inference from file names, encapsulation type, protocol hierarchy, and decoded packet contents.

The captures are likely the result of running InfiniBand diagnostic or IPoIB workloads while a native InfiniBand sniffer was recording traffic. The files use Extensible Record Format encapsulation and expose InfiniBand LRH/BTH/MAD fields, which is more consistent with a native InfiniBand capture path than with a simple Ethernet-style tcpdump on an IP interface.

| File | Likely workload during capture | Evidence |
| --- | --- | --- |
| ib_initial_sniffer.pcap | Fabric initialization or subnet discovery | SubnGet(NodeInfo), NodeDescription, PortInfo, SMInfo, QP0 traffic |
| ib_ibping_sniffer.pcap | ibping between two InfiniBand nodes | Repeated vendor MAD request/response traffic between LID 5 and LID 8 |
| ib_ibtracert_sminfo_sniffer.pcap | ibtracert, sminfo, and possibly counter queries | SMInfo, LinearForwardingTable, PortCounters, PortCountersExtended |
| ib_sniffer.pcap | Performance counter polling | Mostly PERF (PortCounters) and PortCountersExtended |
| ib_ipping_sniffer.pcap | IP ping over IPoIB | ICMP echo request/reply plus ARP over InfiniBand |
| ib_IPoIB.pcap | SSH/TCP session over IPoIB | TCP conversation 10.10.10.12:34826 <-> 10.10.10.11:22, SSH payload |
| infiniband.pcap | Mixed InfiniBand sample workload | SMInfo, PathRecord, ConnectRequest, ConnectReply, ReadyToUse, RC SEND, and RC ACK |

A plausible collection workflow would have looked like this:

Terminal 1:
  Start a native InfiniBand sniffer and write to a pcap file.
Terminal 2:
  Run one diagnostic or workload command, such as ibping, ibtracert,
  sminfo, perfquery, ping over IPoIB, or SSH over an IPoIB address.
Result:
  The sniffer records LRH/BTH/MAD/IPoIB traffic into a pcap file.

For example, the ib_ibping_sniffer.pcap name and decoded packets suggest this type of scenario:

Start capture:
  native IB sniffer -> ib_ibping_sniffer.pcap
Run workload:
  ibping between two IB endpoints
Observed packets:
  VENDOR MAD request/response traffic between LIDs

The IPoIB captures likely came from ordinary IP tools running over an ib0-style interface:

Start capture:
  native IB or IPoIB-aware capture -> ib_ipping_sniffer.pcap
Run workload:
  ping <remote IPoIB address>
Observed packets:
  ARP, ICMP Echo request, ICMP Echo reply over InfiniBand

and:

Start capture:
  native IB or IPoIB-aware capture -> ib_IPoIB.pcap
Run workload:
  ssh <remote IPoIB address>
Observed packets:
  TCP handshake and SSH payload over IPoIB

If reproducing a similar analysis from existing files, no superuser privileges are required:

tshark -r ../ib-packets/ib_ibping_sniffer.pcap -c 10
tshark -r ../ib-packets/ib_ipping_sniffer.pcap -c 10
tshark -r ../ib-packets/ib_IPoIB.pcap -q -z conv,tcp

If reproducing the capture itself, permissions depend on the capture method. Live capture from a privileged interface or a vendor sniffer may require extra capabilities, group membership, or root privileges. Offline analysis of the resulting pcap does not.

The InfiniBand protocol stack can be viewed at two complementary levels:

  • Protocol stack view: how applications, upper-layer protocols, transport services, network routing, link behavior, and physical signaling fit together.
  • Packet structure view: how an individual packet is encoded on the wire, including routing headers, transport headers, optional extended headers, payload, and integrity checks.

The following third-party diagrams are useful orientation material. They are included here as educational figures, while the packet-level interpretation in this report is based on the fields visible through tshark and the official NVIDIA, IBTA, and Wireshark references listed below.

InfiniBand Protocol Stack

Conceptually, this stack explains why the captures include both control path protocols, such as Subnet Management and Connection Management, and data path traffic, such as IPoIB and Reliable Connection packets.

InfiniBand Packet Encapsulation Format

The encapsulation figure aligns with the next two sections: tshark exposes packet fields such as LRH, BTH, DETH, MAD, AETH, and IP payloads, depending on the packet type.

Source: What is InfiniBand? (A Complete Guide)

The packet structure visible in tshark maps well to the Chapter 1 InfiniBand Communication Stack.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    ERF[ERF capture record]
    LRH[InfiniBand LRH<br/>LID routing inside the fabric]
    BTH[InfiniBand BTH<br/>transport opcode, destination QP, PSN]
    EXT[Extended transport headers<br/>DETH / AETH / CM / MAD fields]
    PAYLOAD[Payload<br/>MAD, IPoIB, ICMP, TCP, SSH, or data]

    ERF --> LRH --> BTH --> EXT --> PAYLOAD

Important visible headers:

| Header | Role | Example from captures |
| --- | --- | --- |
| LRH | Local routing inside the InfiniBand fabric | slid, dlid, packet length |
| BTH | Transport behavior and QP selection | opcode 100 for UD SEND Only, opcode 4 for RC SEND Only, opcode 17 for RC ACK |
| DETH | Datagram transport fields for UD traffic | QP0/QP1 management traffic |
| MAD | Management datagram | SubnGet, SubnGetResp, PortCounters |
| AETH | ACK Extended Transport Header | RC Acknowledge packets in infiniband.pcap |
| IP payload | IP over InfiniBand | TCP/SSH and ICMP over IPoIB |
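The decimal opcodes quoted for BTH (100, 4, 17) are easier to read once split into their two parts: the upper three bits select the transport service and the lower five bits select the operation. The sketch below illustrates that split in Python. The function name `decode_bth_opcode` is hypothetical, the operation table is limited to the base operations, and the sketch does not check whether a given operation is legal for a given transport.

```python
# Upper 3 bits of the BTH opcode: transport service.
TRANSPORTS = {0b000: "RC", 0b001: "UC", 0b010: "RD", 0b011: "UD"}

# Lower 5 bits of the BTH opcode: operation.
OPERATIONS = {
    0x00: "SEND First",
    0x01: "SEND Middle",
    0x02: "SEND Last",
    0x03: "SEND Last with Immediate",
    0x04: "SEND Only",
    0x05: "SEND Only with Immediate",
    0x06: "RDMA WRITE First",
    0x07: "RDMA WRITE Middle",
    0x08: "RDMA WRITE Last",
    0x09: "RDMA WRITE Last with Immediate",
    0x0A: "RDMA WRITE Only",
    0x0B: "RDMA WRITE Only with Immediate",
    0x0C: "RDMA READ Request",
    0x0D: "RDMA READ Response First",
    0x0E: "RDMA READ Response Middle",
    0x0F: "RDMA READ Response Last",
    0x10: "RDMA READ Response Only",
    0x11: "Acknowledge",
    0x12: "ATOMIC Acknowledge",
    0x13: "CmpSwap",
    0x14: "FetchAdd",
}


def decode_bth_opcode(opcode):
    """Split an 8-bit BTH opcode into (transport, operation) names."""
    transport = TRANSPORTS.get(opcode >> 5, "other")
    operation = OPERATIONS.get(opcode & 0x1F, "other")
    return transport, operation


for value in (100, 4, 17):
    print(value, decode_bth_opcode(value))
```

Applied to the three opcodes in the table, the split gives UD SEND Only for 100 (0x64), RC SEND Only for 4, and RC Acknowledge for 17 (0x11).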

The captures use the Endace Extensible Record Format (ERF) as the outer wrapper. Each on-wire InfiniBand frame is encapsulated by an ERF record emitted by the capture device, and tshark dissects this outer record before handing the inner bytes to the InfiniBand dissector. Understanding what ERF preserves vs hides is what separates “I see a packet” from “I know exactly what the sniffer recorded.”

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    PHY["Physical layer<br/>8b/10b or 64b/66b symbols<br/>training, idle, recovery"]
    LFC["Link-level flow control<br/>FCCL / FCTBS credits"]
    ERF["ERF outer record<br/>ts, type, flags, rlen, wlen"]
    IB["InfiniBand frame<br/>LRH → (GRH) → BTH → ext → payload → ICRC/VCRC"]

    PHY -.discarded.-> ERF
    LFC -.discarded.-> ERF
    ERF --> IB

The ERF record is small but carries every piece of metadata the sniffer hardware can supply. All seven captures use ERF type 0x15 (INFINIBAND), which is set by the capture device firmware and is the strongest single piece of evidence that the recording came from an IB-aware sniffer rather than a host-side tcpdump.

| ERF field | Filter name | Example (infiniband.pcap frame 10) | Meaning |
| --- | --- | --- | --- |
| Timestamp | erf.ts | 0x482b41f8ae3041c0 | 64-bit fractional-seconds hardware timestamp |
| Record type | erf.types | 0x15 (Type 21: INFINIBAND) | Identifies inner payload as native IB |
| Extension header present | erf.types.ext_header | 0 | No ERF extension headers in this dataset |
| Capture interface | erf.flags.cap | 1 (Port B) | Which sniffer port observed this frame |
| Varying record length | erf.flags.vlen | 1 | Record length varies per frame |
| Truncated | erf.flags.trunc | 0 | Frame was captured at full wire length |
| RX error | erf.flags.rxe | 0 | Capture device flagged no receive error |
| DS error | erf.flags.dse | 0 | No data-stream error |
| Record length | erf.rlen | 136 | ERF record bytes including padding |
| Loss counter | erf.lctr | 0 | Frames dropped between this and the previous record |
| Wire length | erf.wlen | 114 | Original on-wire byte count |

The pair rlen (136) vs wlen (114) shows ERF’s per-record padding to alignment boundaries. For timing analysis, erf.ts is the authoritative clock — frame.time_relative derives from it but rounds to microseconds in some output modes.
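The 136 versus 114 relationship can be checked with a little arithmetic. The sketch below assumes a 16-byte ERF record header with no extension headers and padding to an 8-byte boundary. Both values are assumptions that happen to reproduce the numbers seen in frame 10; they are not something the capture itself documents, and the function name `erf_record_length` is hypothetical.

```python
def erf_record_length(wlen, header_len=16, align=8):
    """Estimate ERF rlen: header plus wire bytes, rounded up to `align`."""
    unpadded = header_len + wlen
    # Ceiling division, then scale back up to the alignment boundary.
    return -(-unpadded // align) * align


wlen = 114
rlen = erf_record_length(wlen)
padding = rlen - (16 + wlen)
print(f"wlen={wlen} rlen={rlen} padding={padding}")
```

Under these assumptions the header and frame occupy 130 bytes, and 6 bytes of padding bring the record to 136.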

ERF’s flags.cap field tells you which sniffer port saw each frame, which is essential for interpreting bidirectional flows.

| File | Capture interfaces used | Implication |
| --- | --- | --- |
| infiniband.pcap | 0 and 1 | Bidirectional tap; both link directions captured |
| All other pcaps | 0 only | Single-direction tap |

Concrete evidence from infiniband.pcap:

Frame 10 (RC SEND Only, DLID=1, SLID=4) → Capture interface 1 (Port B), ts=0x482b41f8ae3041c0
Frame 11 (RC Acknowledge, DLID=4, SLID=1) → Capture interface 0 (Port A), ts=0x482b41f8ae30ede0

The SEND and its ACK arrive on different sniffer ports because they travel in opposite directions on the link. In single-interface captures this asymmetry is invisible — you may only see one half of an exchange depending on which port was tapped.
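The two timestamps above can be turned into a time difference. The sketch below assumes the usual ERF timestamp layout: a 64-bit fixed-point value whose upper 32 bits are whole seconds and whose lower 32 bits are a binary fraction of a second. It also assumes both sniffer ports share one clock. The function names are hypothetical.

```python
def split_erf_ts(ts):
    """Split a 64-bit ERF timestamp into (whole seconds, fraction of a second)."""
    seconds = ts >> 32
    fraction = (ts & 0xFFFFFFFF) / 2**32
    return seconds, fraction


def erf_delta_us(ts_a, ts_b):
    """Difference ts_b - ts_a in microseconds."""
    return (ts_b - ts_a) / 2**32 * 1e6


send_ts = 0x482B41F8AE3041C0  # frame 10, RC SEND Only
ack_ts = 0x482B41F8AE30EDE0   # frame 11, RC Acknowledge

seconds, fraction = split_erf_ts(send_ts)
delta_us = erf_delta_us(send_ts, ack_ts)
print(seconds, round(fraction, 6), round(delta_us, 2))
```

With these inputs the ACK is recorded roughly 10 microseconds after the SEND. That interval is measured at the sniffer's tap point, so it should not be read as an end-to-end application latency.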

The ERF wrapper is thin, but the IB dissector behind it is comprehensive. The “simplification” you might perceive comes from two places: (a) hardware events that occur below the packet boundary and never become packets, and (b) the IB dissector’s choice of which fields to expose as filterable names vs tree-only fields.

| Layer / signal | Visible in tshark? | Notes |
| --- | --- | --- |
| Physical 8b/10b or 64b/66b symbols | No | Decoded by HCA SerDes; never reach the capture host |
| Link training, recovery, idle symbols | No | Sub-packet events, discarded by the link layer |
| Link-level flow-control credits (FCCL, FCTBS) | No | Carried in dedicated link-level subheaders, not delivered as IB packets |
| Inter-packet gaps and bandwidth headroom | No | Reconstruct from erf.ts deltas instead |
| Frames dropped or rejected by sniffer hardware | Partial | Visible only as a non-zero erf.lctr jump |
| RX-error frames | Conditional | Forwarded with erf.flags.rxe = 1 if the device is configured to keep them |
| LRH | Yes | infiniband.lrh.* (slid, dlid, lnh, vl, sl, packet length) |
| GRH | Only when LRH.LNH = 0x3 | All packets in this set carry LNH = 0x2, so GRH is correctly absent |
| BTH and extended headers | Yes | DETH, AETH, MAD, RETH (when present) all decoded |
| Payload (MAD, IPoIB IP/TCP/ICMP) | Yes | Standard upper-layer dissection |
| Invariant CRC | Yes | infiniband.invariant.crc, e.g. 0x0acca5df in frame 10 |
| Variant CRC | Yes | infiniband.variant.crc, e.g. 0x24a8 in frame 10 |

A common misconception is that ERF strips ICRC/VCRC. In this dataset both are present in the IB tree and are filterable as infiniband.invariant.crc and infiniband.variant.crc. The Wireshark IB dissector does not auto-validate them, however; integrity is asserted by the capture device’s RX-error flag (erf.flags.rxe), not by the dissector.

The following anonymized layout is infiniband.pcap frame 10 (the RC SEND Only carrying an IPoIB ICMP echo request). It demonstrates how the ERF outer record, the InfiniBand headers, the EtherType-encapsulated IPoIB payload, and the trailing CRCs all coexist in a single 114-byte wire frame.

Frame 10 — 114 bytes wire / 136-byte ERF record / capture interface 1 (Port B)
  ERF outer record
    Timestamp: 0x482b41f8ae3041c0
    Type: 0x15 (INFINIBAND)
    Ext header: 0
    Flags: cap=1, vlen=1, trunc=0, rxe=0, dse=0
    Record len: 136
    Loss counter: 0
    Wire length: 114
  InfiniBand
    LRH (Local Route Header)
      VL = 0
      Service Level = 0
      LNH = 0x2 (BTH only — no GRH)
      DLID = 1, SLID = 4
      Packet length = 28 (4-byte words)
    BTH (Base Transport Header)
      Opcode = 4 (RC SEND Only)
      Solicited Event = False
      MigReq = True
      Pad Count = 0
      P_Key = 0xffff
      Destination QP = <masked>
      Acknowledge Request = True
      PSN = <masked>
    IBA Payload — EtherType-encapsulated for IPoIB
      Ethertype = 0x0800 (IPv4)
    Invariant CRC: 0x0acca5df
    Variant CRC: 0x24a8
  IPv4 → ICMP Echo request
    Src 10.0.1.34 → Dst 10.0.0.58

A few details worth noticing:

  • LRH.LNH = 0x2 confirms local-subnet routing, which is why no GRH appears between LRH and BTH.
  • The IBA Payload — EtherType-encapsulated line is the IPoIB shim: a 4-byte header with an EtherType selecting IPv4 or ARP, sitting between the BTH and the IP packet. This is the layer that lets ordinary IP applications run over IB.
  • Both Invariant CRC and Variant CRC are present in the dissection tree. ICRC covers everything except mutable fields; VCRC covers the entire packet on the link.
  • MigReq = True reports the sender's path-migration state, the BTH bit used by Automatic Path Migration. It reflects per-QP state, is commonly set on ordinary traffic, and is unrelated to the data being carried.
  • ERF is a thin metadata wrapper; nearly every IB header field, including ICRC/VCRC, survives into the dissection tree. The “simplification” is real only at sub-packet hardware-event level.
  • Use erf.ts for nanosecond-resolution timing analysis (e.g., the one-second ibping cadence in ib_ibping_sniffer.pcap is precisely measurable from this field).
  • Use erf.flags.cap to distinguish link directions in infiniband.pcap, and to recognize that single-interface captures may show only one half of a bidirectional exchange.
  • Use erf.lctr to detect sniffer drops; a non-zero value means there is a gap in the recording that no amount of IB-layer analysis can recover.
  • Conclusions about link credit exhaustion, link training, or symbol-error rates require switch counters and HCA hardware diagnostics — they are intrinsically not in the pcap, regardless of which sniffer was used.
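The byte counts in the frame 10 layout can be cross-checked against each other. The sketch below assumes that the LRH packet length counts 4-byte words from the first byte of LRH through the ICRC, with the 2-byte VCRC excluded. It also assumes the fixed header sizes listed in the constants and the shape seen in frame 10: no GRH, no extended transport header, pad count 0. The function names are hypothetical.

```python
LRH_BYTES = 8          # Local Route Header
BTH_BYTES = 12         # Base Transport Header
IPOIB_ENCAP_BYTES = 4  # IPoIB encapsulation header (EtherType + reserved)
ICRC_BYTES = 4         # Invariant CRC
VCRC_BYTES = 2         # Variant CRC, not counted by LRH packet length


def wire_length(pkt_len_words):
    """On-wire bytes implied by the LRH packet length field."""
    return pkt_len_words * 4 + VCRC_BYTES


def ipoib_ip_length(pkt_len_words):
    """Bytes left for the IP packet in an RC SEND Only IPoIB frame without GRH."""
    return (pkt_len_words * 4 - LRH_BYTES - BTH_BYTES
            - IPOIB_ENCAP_BYTES - ICRC_BYTES)


print(wire_length(28), ipoib_ip_length(28))
```

For a packet length of 28 words this gives 114 wire bytes, matching erf.wlen, and leaves 84 bytes for the IP packet. That size would be consistent with an IPv4 ICMP echo carrying 56 bytes of data, which is the default for many `ping` implementations.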

Building on the encapsulation diagram above, an InfiniBand packet can be read from left to right as fabric routing, transport selection, operation-specific metadata, and payload. The exact extended header depends on the transport and opcode.

For bit-level field layouts of every header listed here (LRH, GRH, BTH, DETH, RETH, AETH, AtomicETH, ImmDt, IETH, RDETH, XRCETH, MAD, SMP DR, IPoIB encap), the AETH syndrome encoding, and the full BTH opcode master table, see the companion packet-format-reference.md. This section gives the high-level packet shape; the reference document drills down to byte and bit boundaries.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    LRH["LRH<br/>Local Route Header<br/>DLID, SLID, VL, packet length"]
    GRH["GRH<br/>Global Route Header<br/>optional, GID-based routing"]
    BTH["BTH<br/>Base Transport Header<br/>opcode, P_Key, destination QP, PSN"]
    EXT["Extended Header<br/>DETH, RETH, AETH, Atomic, Immediate, or none"]
    PAYLOAD["Payload<br/>MAD, IPoIB packet, SEND data, RDMA data"]
    CRC["ICRC / VCRC<br/>integrity checks on the wire"]

    LRH --> GRH --> BTH --> EXT --> PAYLOAD --> CRC

GRH is optional, so many local-subnet packets are effectively LRH -> BTH -> .... In this dataset every packet carries LRH.LNH = 0x2, which is why no GRH is decoded. Whether ICRC/VCRC are exposed depends on the capture path; this dataset preserves both as filterable fields (infiniband.invariant.crc, infiniband.variant.crc) — see ERF Capture Anatomy for evidence and the full preservation matrix.

Common packet shapes:

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    MGMT["Management traffic<br/>QP0/QP1 control path"]
    MGMT_SEQ["LRH -> BTH -> DETH -> MAD"]
    IPOIB["IP over InfiniBand<br/>normal IP payload over IB"]
    IPOIB_SEQ["LRH -> BTH -> DETH -> IP -> TCP/ICMP/SSH"]
    RC_SEND["Reliable Connection SEND<br/>message-style data path"]
    RC_SEND_SEQ["LRH -> BTH -> SEND payload"]
    RC_ACK["Reliable Connection ACK<br/>transport acknowledgement"]
    RC_ACK_SEQ["LRH -> BTH -> AETH"]
    RDMA_WRITE["RDMA WRITE<br/>one-sided push, not fully shown here"]
    RDMA_WRITE_SEQ["LRH -> BTH -> RETH -> data payload"]
    RDMA_READ["RDMA READ<br/>one-sided pull, not fully shown here"]
    RDMA_READ_SEQ["Request: LRH -> BTH -> RETH<br/>Response: LRH -> BTH -> AETH -> data payload"]

    MGMT --> MGMT_SEQ
    IPOIB --> IPOIB_SEQ
    RC_SEND --> RC_SEND_SEQ
    RC_ACK --> RC_ACK_SEQ
    RDMA_WRITE --> RDMA_WRITE_SEQ
    RDMA_READ --> RDMA_READ_SEQ

How this maps to the current captures:

| Packet family | Typical structure | Visible in this packet set? | Notes |
| --- | --- | --- | --- |
| Subnet Management | LRH -> BTH -> DETH -> MAD | Yes | Seen in ib_initial_sniffer.pcap, ib_ibtracert_sminfo_sniffer.pcap, and infiniband.pcap |
| Performance Management | LRH -> BTH -> DETH -> MAD | Yes | Seen as PortCounters, PortCountersExtended, and ClassPortInfo |
| IPoIB | LRH -> BTH -> DETH -> IP payload | Yes | Carries ICMP, TCP, SSH, and ARP-like behavior over InfiniBand. A 4-byte IPoIB encapsulation header precedes the IP payload. infiniband.pcap frame 10 carries IPoIB inside an RC SEND Only, which has no DETH |
| RC SEND | LRH -> BTH -> payload | Yes | infiniband.pcap shows RC SEND Only |
| RC ACK | LRH -> BTH -> AETH | Yes | infiniband.pcap shows RC Acknowledge |
| RDMA WRITE | LRH -> BTH -> RETH -> payload | No | This would show remote virtual address and rkey in RETH |
| RDMA READ | request with RETH, response with data | No | This would show the pull model described in Chapter 1 |

The current pcap set does not contain a complete RDMA READ or RDMA WRITE exchange. This section is therefore a reference model for how such packets should be interpreted if future captures include one-sided RDMA operations. It is based on the InfiniBand transport-layer behavior described in the official references and the Tencent Cloud article listed in the references.

The key header for one-sided RDMA operations is RETH, the RDMA Extended Transport Header.

| Header | Important fields | Why it matters |
| --- | --- | --- |
| BTH | opcode, destination QP, PSN, ACK request | Identifies the operation type and packet ordering |
| RETH | virtual address, rkey, DMA length | Authorizes and describes the remote memory range |
| AETH | ACK/NAK syndrome, MSN | Confirms reliable transport progress or reports an error |
| Payload | read response data or write data | Carries user data depending on operation direction |

InfiniBand transport services do not support all verbs-style operations equally. The practical takeaway is that one-sided operations that need a response, strict ordering, or read-modify-write semantics require a reliable transport context.

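| Operation | RC | UC | UD | RD |
| --- | --- | --- | --- | --- |
| SEND/RECV | Yes | Yes | Yes | Yes |
| RDMA WRITE | Yes | Yes | No | Yes |
| RDMA READ | Yes | No | No | Yes |
| Atomic | Yes | No | No | Yes |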

In modern RDMA software, RC is the common practical transport for RDMA READ and Atomic operations. RD also supports them in the InfiniBand architecture, but it is rarely the default choice in mainstream application stacks.

Why RDMA READ does not fit UC/UD:

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Req as Requester
    participant Resp as Responder

    Req->>Resp: READ Request<br/>"Read N bytes from remote VA with rkey"
    Resp->>Resp: Validate rkey, VA, length, ordering, responder resources
    Resp-->>Req: READ Response<br/>Data returns to requester

RDMA READ is not just a one-way packet. It creates responder-side work: the responder RNIC must validate the request, fetch remote memory, generate one or more response packets, preserve ordering, and handle retry/error behavior. UC has no reliable response/ACK machinery, and UD is message-oriented datagram transport without the connected responder state needed for remote memory reads.

Why Atomic does not fit UC/UD:

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Req as Requester
    participant Resp as Responder
    participant MR as Remote MR

    Req->>Resp: Atomic request<br/>Compare-and-swap or fetch-and-add
    Resp->>MR: Read old value, compute, write new value atomically
    Resp-->>Req: Atomic response<br/>Return original value or completion state

Atomic operations require an indivisible read-modify-write at the remote memory location: no other access may change the value between the read and the write. The requester also needs a reliable response to know the returned value and whether the operation completed. That requires connected state, ordering, and retry/error semantics, which is why practical deployments use RC-style reliable transport for atomics.
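The responder-side semantics of the two atomic operations can be modeled in a few lines. The sketch below is a single-threaded Python model only: the `memory` dictionary stands in for a registered remote memory region, the function names are hypothetical, and nothing here reproduces the hardware's atomicity guarantees. It assumes 64-bit operands and shows that both operations return the original value to the requester.

```python
MASK_64 = (1 << 64) - 1  # atomics are modeled here on 64-bit operands


def compare_and_swap(memory, addr, compare, swap):
    """Write `swap` only if the current value equals `compare`; return the old value."""
    original = memory[addr]
    if original == compare:
        memory[addr] = swap & MASK_64
    return original


def fetch_and_add(memory, addr, add):
    """Add `add` to the current value (wrapping at 64 bits); return the old value."""
    original = memory[addr]
    memory[addr] = (original + add) & MASK_64
    return original


memory = {0x1000: 7}  # hypothetical address and starting value
old_faa = fetch_and_add(memory, 0x1000, 3)           # value becomes 10
old_cas_miss = compare_and_swap(memory, 0x1000, 7, 99)   # compare fails
old_cas_hit = compare_and_swap(memory, 0x1000, 10, 99)   # compare succeeds
print(old_faa, old_cas_miss, old_cas_hit, memory[0x1000])
```

The returned original value is what makes a lost response costly: the requester cannot tell whether the operation took effect, which is one reason the transport needs retry and ordering state.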

Practical note for NCCL, UCX, MPI, and DC transport:

NCCL collectives such as AllReduce move large chunks from GPU memory to other GPU memory. Some phases can be implemented as push-style transfers, but pull-based peer access patterns benefit from RDMA READ semantics. Ring and tree algorithms also require predictable ordering and completion behavior, so reliable transport is important.

UCX is a general-purpose communication layer. Small messages may use SEND/RECV or inline paths, while large messages can use RDMA. UCX also exposes tag matching and RMA-style operations, including atomics on capable transports. That naturally favors reliable connection-oriented transports for the paths that need READ, Atomic, ordering, or retry semantics.

MPI implementations often map one-sided primitives such as MPI_Put, MPI_Get, and MPI_Accumulate onto RDMA WRITE, RDMA READ, and Atomic operations when the transport supports it. Since MPI semantics assume reliable communication, the underlying network path usually needs reliable completion and ordering behavior.

At large cluster scale, pure RC can become expensive because a dense all-to-all peer mesh may require a large number of QPs and associated HCA memory. DC transport, or Dynamically Connected transport, addresses this by keeping reliable semantics while dynamically reusing connection resources. This is why DC-style transports are important in large InfiniBand deployments. NVIDIA SHARP and NCCL-RDMA-SHARP paths can also appear in modern collective stacks, but the exact use of DC, UCX, verbs, or SHARP depends on hardware, plugin availability, topology, and runtime environment settings.

The Tencent Cloud article is useful because it frames RDMA READ/WRITE as InfiniBand transport-layer operations, not just verbs API calls. The following details are worth carrying into packet analysis:

| Detail | Packet-analysis implication |
| --- | --- |
| Transport service type | BTH opcode bits identify whether the packet belongs to RC, UC, RD, UD, or XRC style transport behavior. This matters because ACK/NAK behavior and packet validation differ by service. |
| BTH is the operation decoder | BTH opcode determines how the bytes after BTH should be interpreted: RETH, AETH, DETH, immediate data, payload, or no extended header. |
| PSN is not just a counter | Packet Sequence Number is used by the responder/requester to detect missing, duplicate, or out-of-order packets. In reliable services, this drives ACK/NAK and retry behavior. |
| P_Key and destination QP are validation inputs | A packet can be silently dropped if its destination QP, QP state, transport type, or partition key does not match the responder context. |
| RETH is a protection boundary | RETH is not only an address descriptor. The responder must validate rkey, access permissions, virtual address range, and DMA length before touching remote memory. |
| AETH carries ACK/NAK state | In reliable transports, AETH tells the requester whether progress was acknowledged or whether a retry/error condition exists. |
| ICRC/VCRC are on-wire integrity checks | Capture tools may expose only part of this, but invalid CRCs are normally discarded before useful transport-layer interpretation. |
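Because the PSN is a 24-bit field, comparing two PSNs has to account for wraparound. The sketch below shows one way to classify a received PSN relative to the expected one when reading a capture. It is a simplified analysis aid with hypothetical function names, not a restatement of the exact responder rules in the specification.

```python
PSN_BITS = 24
PSN_MASK = (1 << PSN_BITS) - 1
HALF_RANGE = 1 << (PSN_BITS - 1)


def psn_delta(expected, received):
    """Signed distance from `expected` to `received`, modulo 2**24."""
    diff = (received - expected) & PSN_MASK
    return diff - (1 << PSN_BITS) if diff >= HALF_RANGE else diff


def classify_psn(expected, received):
    """Label a received PSN as in order, ahead (a gap), or behind (a duplicate)."""
    delta = psn_delta(expected, received)
    if delta == 0:
        return "in order"
    return "ahead (gap)" if delta > 0 else "behind (duplicate)"


print(classify_psn(0xFFFFFF, 0x000000))  # wraps from the top of the range
print(classify_psn(10, 8))
```

Treating the upper half of the modular range as "behind" keeps the comparison correct across the 0xFFFFFF to 0x000000 wrap.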

P_Key deserves special attention. It is a partition membership value carried in BTH, similar in spirit to a fabric-level tenant or isolation tag. The high bit indicates full vs limited membership, and the lower bits identify the partition. If a packet’s P_Key does not match the destination port’s partition membership or the QP context, the packet is not accepted as valid traffic for that partition. This is why P_Key should be read together with destination QP, transport type, and QP state when debugging packet drops.
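The membership bit and partition value can be separated with two masks. The sketch below also encodes the usual matching rule: two P_Keys match when their lower 15 bits are equal and at least one of them has the full-membership bit set. The function names are hypothetical, and the sketch ignores other validation inputs such as QP state.

```python
FULL_MEMBER_BIT = 0x8000
BASE_MASK = 0x7FFF


def parse_pkey(pkey):
    """Split a 16-bit P_Key into its membership flag and 15-bit partition value."""
    return {"full_member": bool(pkey & FULL_MEMBER_BIT), "base": pkey & BASE_MASK}


def pkeys_match(a, b):
    """True if the partition values are equal and at least one side is a full member."""
    same_partition = (a & BASE_MASK) == (b & BASE_MASK)
    one_full_member = bool((a | b) & FULL_MEMBER_BIT)
    return same_partition and one_full_member


print(parse_pkey(0xFFFF))
print(pkeys_match(0xFFFF, 0x7FFF), pkeys_match(0x7FFF, 0x7FFF))
```

Two limited members of the same partition do not match each other, which is how a partition can allow clients to reach a server without reaching one another.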

Two subtle points are easy to miss:

  • Multi-packet SEND and RDMA WRITE messages are not interleaved with other operations on the same send queue until the final packet of that message has been generated.
  • RDMA READ behaves differently: after issuing a READ request, the requester may issue later requests without waiting for the READ response, but the maximum number of outstanding READ and ATOMIC operations is negotiated during connection setup.

RDMA READ is a one-sided pull. The requester asks the responder RNIC to read from remote memory and return the data.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Req as Requester RNIC
    participant LocalMR as Local MR
    participant Fabric as InfiniBand Fabric
    participant Resp as Responder RNIC
    participant RemoteMR as Remote MR

    Req->>Fabric: RDMA READ request<br/>BTH + RETH, no data payload
    Fabric->>Resp: Deliver READ request
    Resp->>Resp: Validate QP, PSN, VA, rkey, length, access rights
    Resp->>RemoteMR: DMA read requested bytes
    Resp-->>Fabric: RDMA READ response packet(s)<br/>BTH + AETH on first/last/only + payload
    Fabric-->>Req: Return read payload
    Req->>LocalMR: DMA write payload into requester local buffer

For a small READ whose response fits within the path MTU:

Request: LRH -> BTH(RDMA READ Request) -> RETH(VA, rkey, length)
Response: LRH -> BTH(RDMA READ Response Only) -> AETH -> payload

For a multi-packet READ response:

Request: LRH -> BTH(RDMA READ Request) -> RETH
First: LRH -> BTH(RDMA READ Response First) -> AETH -> PMTU-sized payload
Middle: LRH -> BTH(RDMA READ Response Middle) -> PMTU-sized payload
Last: LRH -> BTH(RDMA READ Response Last) -> AETH -> remaining payload

Important analysis points:

  • The READ request packet is small because it describes what to read; it does not carry the requested data.
  • A single READ request can produce multiple READ response packets when the requested length exceeds the path MTU.
  • AETH is present in RDMA READ Response First, RDMA READ Response Last, and RDMA READ Response Only.
  • RDMA READ Response Middle carries payload but does not carry AETH.
  • PSN is used to detect missing or out-of-order response packets.
  • The responder validates the request, including the rkey, remote virtual address, length, and access permissions, before it touches memory.
  • The requester may have more than one outstanding READ, depending on the negotiated connection limits.
  • RDMA READ does not carry immediate data.

Example Wireshark decode, with sensitive values anonymized:

RDMA READ Request
  BTH:
    Opcode: Reliable Connection (RC) - RDMA READ Request
    Partition Key: 0xffff
    Destination QP: 0x00xxxx
    Acknowledge Request: True
    Packet Sequence Number: <request_psn>
  RETH:
    Virtual Address: 0x0000xxxxxxxxxxxx
    Remote Key: 0x00xxxxxx
    DMA Length: 65536 bytes
  ICRC:
    Present

RDMA READ Response Middle
  BTH:
    Opcode: Reliable Connection (RC) - RDMA READ Response Middle
    Partition Key: 0xffff
    Destination QP: 0x00xxxx
    Acknowledge Request: False
    Packet Sequence Number: <response_psn>
  Payload:
    Data: 1024 bytes
  ICRC:
    Present

The request decode is the key evidence for a one-sided READ: it has BTH + RETH, and RETH carries the remote virtual address, rkey, and requested DMA length. BTH also carries the Partition Key (P_Key), which identifies the InfiniBand partition membership used by the packet. A commonly seen value such as 0xffff represents full membership in the default partition, but production fabrics may use different partition keys for isolation. The response-middle decode shows the reverse data movement: it carries data payload but no RETH and no AETH. This matches the multi-packet READ model where AETH appears on the first, last, or only response packet, while middle response packets are pure data-bearing segments.
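The example decode implies a specific response sequence. The sketch below assumes a path MTU of 1024 bytes, inferred from the 1024-byte payload of the Response Middle packet, and lists the response opcodes a 65536-byte READ would produce. The function names are hypothetical.

```python
def read_response_opcodes(dma_length, pmtu):
    """List the RDMA READ response opcodes for a READ of `dma_length` bytes."""
    packets = max(1, -(-dma_length // pmtu))  # ceiling division, at least one packet
    if packets == 1:
        return ["RDMA READ Response Only"]
    return (["RDMA READ Response First"]
            + ["RDMA READ Response Middle"] * (packets - 2)
            + ["RDMA READ Response Last"])


def carries_aeth(opcode):
    """AETH is present on First, Last, and Only responses, not on Middle."""
    return not opcode.endswith("Middle")


sequence = read_response_opcodes(65536, 1024)
middle_count = sequence.count("RDMA READ Response Middle")
aeth_count = sum(carries_aeth(op) for op in sequence)
print(len(sequence), middle_count, aeth_count)
```

Under that assumption the single request fans out into 64 response packets, 62 of them Middle packets without AETH, so most response packets in such a capture would look like the second decode above.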

For public documentation, avoid publishing raw screenshots unless the following fields are masked:

  • Remote virtual address
  • rkey
  • destination QP
  • packet sequence number
  • any payload bytes that may contain application data
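
For text decodes (as opposed to screenshots), the masking can be automated. The following is a rough sketch that assumes the decode uses `Label: value` lines with the field labels shown in this report; the helper `mask_decode` and the sample values are hypothetical, and payload bytes are not handled here.

```python
import re

# Field labels taken from the decodes in this report; extend for other dissectors.
SENSITIVE_LABELS = (
    "Virtual Address",
    "Remote Key",
    "Destination QP",
    "Packet Sequence Number",
)

_PATTERN = re.compile(
    r"^([ \t]*(?:"
    + "|".join(re.escape(label) for label in SENSITIVE_LABELS)
    + r"):[ \t]*).*$",
    re.MULTILINE,
)


def mask_decode(text, placeholder="<masked>"):
    """Replace the value of every sensitive 'Label: value' line with a placeholder."""
    return _PATTERN.sub(lambda match: match.group(1) + placeholder, text)


# Hypothetical, made-up values purely for illustration.
sample = """RETH:
  Virtual Address: 0x00007f3a1c000000
  Remote Key: 0x0012ab34
  DMA Length: 65536 bytes"""
print(mask_decode(sample))
```

Non-sensitive lines such as the DMA length pass through unchanged, so the masked decode still supports the analysis.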

Example AETH decode, with sensitive values anonymized:

RC Acknowledge
  BTH:
    Opcode: Reliable Connection (RC) - Acknowledge
    Partition Key: 0xffff
    Destination QP: 0x00xxxx
    Acknowledge Request: False
    Packet Sequence Number: <ack_psn>
  AETH:
    Syndrome: 0, Ack
    OpCode: Ack
    Credit Count: <credit_count>
    Message Sequence Number: <msn>
  ICRC:
    Present

AETH is the key ACK/NAK carrier for reliable transport. A normal ACK indicates that the responder accepted progress for the relevant reliable operation. If the syndrome indicates NAK or an error condition, the requester may need to retry or fail the Work Request depending on the transport state and retry counters. In packet analysis, BTH tells us this is an RC acknowledge packet and which partition/QP context it belongs to, while AETH tells us whether it is a successful acknowledgement or an error/flow-control signal.
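
The ACK versus NAK distinction can be read directly from the syndrome byte. The sketch below reflects my understanding of the AETH syndrome layout in the InfiniBand Architecture Specification (two type bits followed by a five-bit credit, timer, or NAK code field); treat the encoding and the NAK code names as assumptions to verify against the specification, and the function name as illustrative.

```python
# NAK code names as I understand them from the IBA specification (assumption).
NAK_CODES = {
    0: "PSN Sequence Error",
    1: "Invalid Request",
    2: "Remote Access Error",
    3: "Remote Operational Error",
    4: "Invalid RD Request",
}


def classify_aeth_syndrome(syndrome):
    """Split an 8-bit AETH syndrome into (kind, detail)."""
    kind_bits = (syndrome >> 5) & 0b11  # bits 6:5 select ACK / RNR NAK / NAK
    low_bits = syndrome & 0b11111       # bits 4:0: credits, RNR timer, or NAK code
    if kind_bits == 0b00:
        return ("ACK", f"credit count field = {low_bits}")
    if kind_bits == 0b01:
        return ("RNR NAK", f"RNR timer field = {low_bits}")
    if kind_bits == 0b11:
        return ("NAK", NAK_CODES.get(low_bits, f"reserved code {low_bits}"))
    return ("reserved", f"raw value = {syndrome:#04x}")


for value in (0x00, 0x20, 0x60, 0x62):
    print(f"{value:#04x}", classify_aeth_syndrome(value))
```

A syndrome of 0, as in the decode above, falls in the ACK branch; the low bits are then the encoded credit count rather than an error code.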

RDMA WRITE is a one-sided push. The requester sends data to a remote memory range that the responder has already registered and shared through metadata exchange.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
sequenceDiagram
    participant Req as Requester RNIC
    participant Fabric as InfiniBand Fabric
    participant Resp as Responder RNIC
    participant MR as Remote MR

    Req->>Fabric: RDMA WRITE request<br/>BTH + RETH + payload
    Fabric->>Resp: Deliver WRITE packets
    Resp->>Resp: Validate QP, PSN, VA, rkey, length, access rights
    Resp->>MR: DMA write payload into remote memory
    Resp-->>Req: ACK / NAK<br/>BTH + AETH

For a small WRITE whose payload fits within the path MTU, the packet shape is:

LRH -> BTH(RDMA WRITE Only) -> RETH(VA, rkey, length) -> payload -> ICRC/VCRC

For a multi-packet WRITE, the message is segmented:

First packet: LRH -> BTH(RDMA WRITE First) -> RETH -> PMTU-sized payload
Middle packet: LRH -> BTH(RDMA WRITE Middle) -> PMTU-sized payload
Last packet: LRH -> BTH(RDMA WRITE Last) -> remaining payload
ACK: LRH -> BTH(Acknowledge) -> AETH

Important analysis points:

  • RETH appears in the first packet or the only packet of an RDMA WRITE message.
  • RETH carries the remote virtual address, rkey, and DMA length.
  • Middle and last WRITE packets carry payload but do not repeat the full remote memory metadata.
  • The responder checks the rkey, access permissions, address range, and packet sequence.
  • The packets of a multi-packet WRITE are sent in PSN order as one message; packets of other operations from the same send queue are not interleaved until the final WRITE packet has been sent.
  • In reliable transports such as RC, the responder returns an ACK or NAK using AETH.
  • A normal RDMA WRITE updates remote memory but does not automatically notify the remote application. Notification requires a higher-level protocol, RDMA_WRITE_WITH_IMM, SEND/RECV, or polling.
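
The segmentation rules above can be turned into a small packet planner. This sketch assumes local-subnet packets without a GRH, payload sizes that need no pad bytes, and header sizes as I understand them from the InfiniBand specification (LRH 8, BTH 12, RETH 16, ICRC 4, VCRC 2 bytes); the function name and the 4096-byte example are illustrative.

```python
import math

# Header sizes in bytes; assumptions to check against packet-format-reference.md.
LRH, BTH, RETH, ICRC, VCRC = 8, 12, 16, 4, 2


def write_packet_plan(length, pmtu):
    """Return (opcode, carries_reth, payload_bytes, wire_bytes) per request packet."""
    count = max(1, math.ceil(length / pmtu))
    plan = []
    remaining = length
    for index in range(count):
        payload = min(pmtu, remaining)
        remaining -= payload
        if count == 1:
            position = "Only"
        elif index == 0:
            position = "First"
        elif index == count - 1:
            position = "Last"
        else:
            position = "Middle"
        # RETH travels only on the First or Only packet of the message.
        carries_reth = position in ("First", "Only")
        wire = LRH + BTH + (RETH if carries_reth else 0) + payload + ICRC + VCRC
        plan.append((f"RDMA WRITE {position}", carries_reth, payload, wire))
    return plan


for packet in write_packet_plan(length=4096, pmtu=1024):
    print(packet)
```

Only the first tuple reports `carries_reth=True`, which is why a capture that starts mid-message cannot recover the remote address or rkey from the Middle and Last packets alone.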

A future pcap that truly contains one-sided RDMA traffic should show at least some of the following:

| Expected evidence | RDMA READ | RDMA WRITE |
| --- | --- | --- |
| BTH opcode | RDMA READ Request, RDMA READ Response First/Middle/Last/Only | RDMA WRITE First/Middle/Last/Only |
| RETH | Request packet | First or only request packet |
| Remote virtual address | In request RETH | In RETH |
| rkey | In request RETH | In RETH |
| Payload direction | Responder to requester | Requester to responder |
| AETH | First, last, or only read response | ACK/NAK response |
| Target CPU involvement | Not in data path | Not in data path |

This explains why the current packet set is useful for RDMA/IB fundamentals but still cannot be treated as a full one-sided RDMA data-path capture.
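
One quick way to test a capture against this checklist is to look for the RC one-sided opcode values in a BTH opcode histogram. The sketch below uses the RC opcode numbering as I understand it from the InfiniBand specification (the SEND Only = 4 and Acknowledge = 17 values match the decodes in this report); the counts passed in are hypothetical, and UC or RD opcodes are deliberately not covered.

```python
# RC opcodes for one-sided operations (assumption: per the IBA opcode table).
RC_ONE_SIDED = {
    0x06: "RDMA WRITE First",
    0x07: "RDMA WRITE Middle",
    0x08: "RDMA WRITE Last",
    0x09: "RDMA WRITE Last with Immediate",
    0x0A: "RDMA WRITE Only",
    0x0B: "RDMA WRITE Only with Immediate",
    0x0C: "RDMA READ Request",
    0x0D: "RDMA READ Response First",
    0x0E: "RDMA READ Response Middle",
    0x0F: "RDMA READ Response Last",
    0x10: "RDMA READ Response Only",
}


def one_sided_evidence(opcode_counts):
    """Keep only the histogram entries that indicate RC RDMA READ or WRITE."""
    return {
        RC_ONE_SIDED[opcode]: count
        for opcode, count in sorted(opcode_counts.items())
        if opcode in RC_ONE_SIDED
    }


# Hypothetical histograms: {BTH opcode: packet count}.
send_only_capture = {0x04: 7, 0x11: 9}             # RC SEND Only + Acknowledge
read_capture = {0x0C: 1, 0x0D: 1, 0x0E: 62, 0x0F: 1}

print(one_sided_evidence(send_only_capture))
print(one_sided_evidence(read_capture))
```

An empty result for a capture that contains only SEND Only and Acknowledge opcodes is the programmatic form of the conclusion above.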

The captures make the Chapter 1 distinction between control path and data path concrete.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    subgraph Control[Control path]
        SM[Subnet Management<br/>NodeInfo, PortInfo, SMInfo]
        SA[Subnet Administration<br/>PathRecord, MCMemberRecord]
        PM[Performance Management<br/>PortCounters]
        CM[Connection Management<br/>ConnectRequest, ConnectReply, ReadyToUse]
    end

    subgraph Data[Data path]
        IPoIB[IP over InfiniBand<br/>ICMP, TCP, SSH]
        RC[Reliable Connection traffic<br/>RC SEND and ACK]
    end

    Control --> Ready[QP/path/resources ready]
    Ready --> Data

Control path traffic is strongly represented in these captures. It includes:

  • Subnet discovery through QP0.
  • Subnet Administration through QP1.
  • Path discovery and multicast membership.
  • Performance counter queries.
  • Connection Management messages.

This corresponds to Chapter 1’s explanation that RDMA does not remove the kernel or control software from the system. The CPU, kernel driver, subnet manager, RDMA runtime, and NIC firmware still configure resources and paths.

Data path traffic appears in two forms:

  • IPoIB traffic, where normal IP applications run over InfiniBand.
  • RC SEND/ACK traffic, where the InfiniBand transport headers are visible underneath the IP payload they carry.

The packet set does not show full RDMA Write or RDMA Read operations with RETH. Therefore, it is better to describe it as an InfiniBand and IPoIB packet set, not as a complete RDMA Read/Write capture set.
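
The control path and data path split can be applied mechanically to packet summary lines. This is a deliberately naive sketch: the substring markers are taken from summaries quoted in this report, the grouping follows the flowchart above, and the helper name `classify_summary` is hypothetical. Anything it does not recognize is left unclassified rather than guessed.

```python
# (substring marker, path, group); the first matching marker wins.
MARKERS = [
    ("SubnGet", "control", "Subnet Management"),
    ("SubnSet", "control", "Subnet Management"),
    ("PathRecord", "control", "Subnet Administration"),
    ("MCMemberRecord", "control", "Subnet Administration"),
    ("PERF", "control", "Performance Management"),
    ("CM:", "control", "Connection Management"),
    ("RC SEND", "data", "Reliable Connection"),
    ("RC Acknowledge", "data", "Reliable Connection"),
    ("ICMP", "data", "IPoIB"),
    ("TCP", "data", "IPoIB"),
    ("SSH", "data", "IPoIB"),
]


def classify_summary(summary):
    """Map one packet summary line to (path, group)."""
    for marker, path, group in MARKERS:
        if marker in summary:
            return path, group
    return "unclassified", "unknown"


for line in (
    "UD Send Only QP=0x000000 SubnGetResp(NodeInfo)",
    "PERF (PortCounters)",
    "CM: ConnectRequest",
    "RC Acknowledge",
    "203.0.113.17 -> 203.0.113.18 ICMP Echo request",
    "VENDOR (Unknown Attribute)",
):
    print(classify_summary(line), "<-", line)
```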

This capture is the best example of InfiniBand control path initialization.

Protocol hierarchy:

erf
infiniband
arp

Representative packets:

UD Send Only QP=0x000000 SubnGet(NodeInfo)
UD Send Only QP=0x000000 SubnGetResp(NodeInfo)
UD Send Only QP=0x000000 SubnGet(NodeDescription)
UD Send Only QP=0x000000 SubnGetResp(NodeDescription)
UD Send Only QP=0x000000 SubnGet(PortInfo)
UD Send Only QP=0x000000 SubnGetResp(PortInfo)

Interpretation:

  • QP0 is used for Subnet Management Packets.
  • The node is being discovered and configured.
  • NodeInfo, NodeDescription, PortInfo, P_KeyTable, and SMInfo are part of fabric discovery and setup.
  • Subnet Administration records such as MCMemberRecord and InformInfo also appear.
  • This is control path traffic, not user payload movement.
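
The destination QP shown in each summary line is enough to separate management traffic from ordinary queue pairs. A minimal sketch, assuming the hex string format that appears in the summaries (`0x000000`); the function name is illustrative, and the QP1 description reflects this report's observation that Subnet Administration and Connection Management use it.

```python
def classify_dest_qp(dest_qp):
    """Classify a BTH destination QP given as an int or a hex string."""
    qp = int(dest_qp, 16) if isinstance(dest_qp, str) else dest_qp
    if qp == 0:
        return "QP0: Subnet Management"
    if qp == 1:
        return "QP1: general services (e.g. Subnet Administration, CM)"
    return "regular QP"


# QP values quoted in this report.
for value in ("0x000000", "0x000001", "0xfc0407", "0x870408"):
    print(value, "->", classify_dest_qp(value))
```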

Why it matters for Chapter 1:

  • It shows the setup work that must happen before the fast path can be used.
  • It supports the point that “kernel bypass” does not mean “no setup or control path.”
  • It maps to PD/QP/MR/path readiness concepts in the RDMA Process section.

This capture is centered on ibping behavior and vendor MAD messages.

Protocol hierarchy:

erf
infiniband

Representative packets:

LID: 5 -> LID: 8 InfiniBand 290 VENDOR (Unknown Attribute)
LID: 8 -> LID: 5 InfiniBand 290 VENDOR (Unknown Attribute)

Top decoded items:

VENDOR (Unknown Attribute)
PERF (PortCounters)
PERF (ClassPortInfo)
PERF (PortCountersExtended)

LID-pair distribution (65 packets total):

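| Source LID | Dest LID | MAD class | Method | Packets | Likely workload |
| --- | --- | --- | --- | --- | --- |
| 5 | 8 | 0x32 Vendor OUI | Get | 11 | ibping request |
| 8 | 5 | 0x32 Vendor OUI | GetResp | 11 | ibping reply |
| 5 | 1 | 0x04 PerfMgt | Get | 27 | PortCounters poll |
| 5 | 4 | 0x04 PerfMgt | Get | 12 | PortCounters poll |
| 5 | 2 | 0x04 PerfMgt | Get | 2 | PortCounters poll |
| 2 | 5 | 0x04 PerfMgt | GetResp | 2 | PortCounters reply |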

The vendor MAD class 0x32 is what ibping uses to carry its own request/response, separate from the standard PerfMgt class 0x04. The ibping exchange is exclusively LID 5 ↔ LID 8 (22 of 65 packets, perfectly symmetric request/response). The remaining 43 packets are background perfquery-style PerfMgt traffic: 41 Get requests originating from LID 5 (to LIDs 1, 4, and 2) and 2 GetResp replies, with only LID 2 replying within the capture window.

Reproducing this breakdown:

tshark -r ../ib-packets/ib_ibping_sniffer.pcap \
-T fields \
-e infiniband.lrh.slid \
-e infiniband.lrh.dlid \
-e infiniband.mad.mgmtclass \
-e infiniband.mad.method \
| sort | uniq -c
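
As a cross-check on the distribution table, the per-class totals and the unanswered PerfMgt polls can be recomputed from its rows. The rows below are copied from the table; the variable names are illustrative, and this does not parse tshark output, whose exact column formatting may vary by build.

```python
from collections import Counter

# (source LID, destination LID, MAD class, method, packets) from the table.
ROWS = [
    (5, 8, 0x32, "Get", 11),
    (8, 5, 0x32, "GetResp", 11),
    (5, 1, 0x04, "Get", 27),
    (5, 4, 0x04, "Get", 12),
    (5, 2, 0x04, "Get", 2),
    (2, 5, 0x04, "GetResp", 2),
]

by_class = Counter()
requests = Counter()  # PerfMgt Get packets, keyed by the polled LID
replies = Counter()   # PerfMgt GetResp packets, keyed by the replying LID
for slid, dlid, mgmt_class, method, packets in ROWS:
    by_class[mgmt_class] += packets
    if mgmt_class == 0x04:
        if method == "Get":
            requests[dlid] += packets
        else:
            replies[slid] += packets

# Polled LIDs that received more requests than they answered in the capture.
unanswered = {lid: n - replies[lid] for lid, n in requests.items() if n > replies[lid]}

print({hex(k): v for k, v in by_class.items()}, sum(by_class.values()))
print(unanswered)
```

The totals come out to 22 vendor-class and 43 PerfMgt packets, and the unanswered map shows that the polls to LIDs 1 and 4 received no replies within the capture window.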

Interpretation:

  • ibping uses InfiniBand management-style traffic rather than IP ping.
  • The ibping request/response pair is between LID 5 and LID 8, carried over the vendor MAD class 0x32.
  • The periodic pattern is visible: requests and responses are roughly one second apart.
  • Performance Management traffic from a separate tool (likely perfquery) is mixed into the same capture window, which is why LIDs 1, 2, and 4 also appear.

Why it matters for Chapter 1:

  • It demonstrates that InfiniBand has its own management and diagnostic traffic independent of TCP/IP.
  • LIDs are used directly at the fabric level.

This capture combines trace-related behavior, SMInfo, and performance management.

Protocol hierarchy:

erf
infiniband

Representative packets:

PERF (PortCounters)
PERF (PortCountersExtended)
UD Send Only QP=0x000000 SubnGet(SMInfo)
UD Send Only QP=0x000000 SubnGetResp(SMInfo)
UD Send Only QP=0x000000 SubnGet(LinearForwardingTable)
UD Send Only QP=0x000000 SubnGetResp(LinearForwardingTable)

Interpretation:

  • ibtracert needs fabric topology and forwarding information.
  • SMInfo identifies subnet manager information.
  • LinearForwardingTable points to switch forwarding behavior.
  • Performance counters show management-plane visibility into port state.

Why it matters for Chapter 1:

  • It shows the control plane objects behind the switched fabric model.
  • It supports the Packet Relay / Fabric section: switches forward packets, while management traffic discovers and programs fabric behavior.

This is a small Performance Management capture.

Protocol hierarchy:

erf
infiniband

Top decoded items:

PERF (PortCounters)
PERF (PortCountersExtended)

Interpretation:

  • The capture is dominated by performance counter queries.
  • It is useful for observing monitoring traffic, but it does not show application payloads or RDMA Read/Write operations.

Why it matters for Chapter 1:

  • It connects to the monitoring side of the control path.
  • Fabric health is not inferred only from data packets. It is also queried through management traffic.

This capture shows ICMP over IPoIB.

Protocol hierarchy:

erf
infiniband
ip
icmp
arp

Representative packets:

203.0.113.17 -> 203.0.113.18 ICMP Echo request
203.0.113.18 -> 203.0.113.17 ICMP Echo reply

Interpretation:

  • This is not ibping; it is IP ping carried over InfiniBand.
  • It shows normal IP packets mapped onto InfiniBand.
  • ARP appears because IPoIB still needs address resolution for IP communication.

Why it matters for Chapter 1:

  • It demonstrates the difference between native InfiniBand management tools and IP over InfiniBand.
  • It shows how normal IP applications can run above InfiniBand transport.

This is the largest capture and shows SSH over TCP over IPoIB.

Protocol hierarchy:

erf
infiniband
ip
tcp
ssh

TCP conversation:

10.10.10.12:34826 <-> 10.10.10.11:22
Total frames: 5848
Total bytes: 10 MB
Duration: 4.2846 s

Representative packets:

10.10.10.12 -> 10.10.10.11 TCP 34826 -> 22 [SYN]
10.10.10.11 -> 10.10.10.12 TCP 22 -> 34826 [SYN, ACK]
10.10.10.12 -> 10.10.10.11 SSH Client: Protocol

Interpretation:

  • This is an IP workload carried over InfiniBand.
  • The application is SSH, not RDMA Read/Write.
  • tshark decodes the upper layers just like ordinary IP traffic once it gets past the InfiniBand encapsulation.
  • The high frame count and 10 MB size make this the best sample for IPoIB throughput-style traffic.
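
The conversation statistics above are enough for a rough rate estimate. This back-of-the-envelope sketch assumes that the "10 MB" figure is a rounded decimal value (10,000,000 bytes), so the results are approximate; the variable names are illustrative.

```python
frames = 5848
total_bytes = 10 * 1000 * 1000  # assumption: "10 MB" is a rounded decimal figure
duration_s = 4.2846

avg_frame_bytes = total_bytes / frames
throughput_mbit_s = total_bytes * 8 / duration_s / 1e6
frames_per_second = frames / duration_s

print(f"average frame size: {avg_frame_bytes:.0f} bytes")
print(f"average throughput: {throughput_mbit_s:.1f} Mbit/s")
print(f"average frame rate: {frames_per_second:.0f} frames/s")
```

These are averages over the whole conversation, so they say nothing about bursts; for that, an interval view such as `tshark -q -z io,stat,1` on the same file would be the next step.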

Why it matters for Chapter 1:

  • It shows that InfiniBand can carry ordinary IP workloads.
  • It should not be confused with RDMA semantics. IPoIB is not the same as one-sided RDMA.

This is the most useful capture for seeing multiple InfiniBand concepts in one place.

Protocol hierarchy:

erf
infiniband
ip
udp
icmp
arp
ipv6
icmpv6

Important decoded items:

UD Send Only QP=0x000000 SubnGet(SMInfo)
UD Send Only QP=0x000000 SubnGetResp(SMInfo)
CM: ConnectRequest
CM: ConnectReply
CM: ReadyToUse
RC Acknowledge
ICMP Echo request
ICMPv6 Neighbor Solicitation / Advertisement

Condensed flow:

1-2 SMInfo query and response
3-6 IPoIB UDP/ARP traffic
7-9 Connection Management: request, reply, ready
10-23 RC SEND Only carrying ICMP and RC ACK packets
24-25 More UDP over IPoIB
26-31 IPv6 neighbor discovery and RC ACK
32-33 Subnet Administration PathRecord lookup
34-40 Second CM setup plus ICMPv6 traffic and ACKs
41-42 SMInfo query and response
43 ICMPv6 echo request

The most important pair is frame 10 and frame 11:

Frame 10:
  Opcode: Reliable Connection (RC) - SEND Only (4)
  Destination QP: 0xfc0407
  Acknowledge Request: True
  Payload: IPv4 ICMP Echo request

Frame 11:
  Opcode: Reliable Connection (RC) - Acknowledge (17)
  Destination QP: 0x870408
  AETH: Ack

Interpretation:

  • Connection Management prepares communication.
  • RC SEND carries an IP payload.
  • RC ACK confirms reliable delivery.
  • AETH is visible in the ACK packet.
  • This is close to the Chapter 1 discussion of two-sided reliable transport, even though it is not a full RDMA Write or Read example.

Why it matters for Chapter 1:

  • It visibly connects QP, BTH opcode, PSN, ACK, and AETH.
  • It shows how a ready connection can carry payload and receive transport-level acknowledgments.
  • It helps explain why InfiniBand reliability is handled by the HCA/RNIC rather than by the CPU for each packet.

From ib_initial_sniffer.pcap:

UD Send Only QP=0x000000 SubnGet(NodeInfo)
UD Send Only QP=0x000000 SubnGetResp(NodeInfo)

Meaning:

  • QP0 is used for Subnet Management.
  • The traffic is part of fabric discovery and configuration.
  • This is control path behavior.

From ib_sniffer.pcap and related captures:

PERF (PortCounters)
PERF (PortCountersExtended)

Meaning:

  • Fabric components expose counters through management traffic.
  • These counters support monitoring and troubleshooting.
  • This is not application data movement.

From ib_ipping_sniffer.pcap:

203.0.113.17 -> 203.0.113.18 ICMP Echo request
203.0.113.18 -> 203.0.113.17 ICMP Echo reply

From ib_IPoIB.pcap:

10.10.10.12:34826 <-> 10.10.10.11:22 TCP/SSH

Meaning:

  • IP packets can be transported over InfiniBand.
  • Upper-layer tools may look familiar, but the L2/L3 underlay is InfiniBand rather than Ethernet.

From infiniband.pcap:

RC SEND Only QP=0xfc0407
RC Acknowledge QP=0x870408

Meaning:

  • The BTH opcode distinguishes the transport operation.
  • The destination QP identifies the queue pair endpoint.
  • The ACK includes AETH.
  • This shows reliable InfiniBand transport behavior.

These captures are not full RDMA Read/Write examples.

Missing from the packet set:

  • RDMA Write packets with RETH.
  • RDMA Read request and response flows.
  • Remote virtual address and rkey fields in RETH.
  • A complete one-sided RDMA data movement example.
  • NCCL collective traffic.

Therefore, the correct interpretation is:

These captures demonstrate InfiniBand fabric management, IPoIB, and some reliable connection transport behavior. They support the RDMA/IB concepts in Chapter 1, but they do not fully demonstrate RDMA Read or RDMA Write data movement.

The RDMA Read/Write Packet Analysis Model section above describes what should appear in future captures that include one-sided operations.

List basic packet summaries:

tshark -r ../ib-packets/infiniband.pcap -c 20

Show protocol hierarchy:

tshark -r ../ib-packets/ib_IPoIB.pcap -q -z io,phs

Extract LRH, BTH, and MAD fields:

tshark -r ../ib-packets/ib_initial_sniffer.pcap \
-Y infiniband \
-T fields \
-e frame.number \
-e frame.time_relative \
-e infiniband.lrh.slid \
-e infiniband.lrh.dlid \
-e infiniband.bth.opcode \
-e infiniband.bth.destqp \
-e infiniband.mad.mgmtclass \
-e infiniband.mad.method \
-e infiniband.mad.attributeid \
-E header=y

Count BTH opcodes:

tshark -r ../ib-packets/infiniband.pcap \
-Y "infiniband.bth.opcode" \
-T fields \
-e infiniband.bth.opcode | sort | uniq -c

Show a detailed packet decode:

tshark -r ../ib-packets/infiniband.pcap \
-Y "frame.number==10 || frame.number==11" \
-V

Find IPoIB TCP conversations:

tshark -r ../ib-packets/ib_IPoIB.pcap -q -z conv,tcp

Discover the exact InfiniBand field names supported by the local Wireshark/TShark build:

tshark -G fields | rg -i "infiniband.*(bth|reth|aeth|opcode|psn|rkey)"

If a future capture contains RDMA READ or WRITE packets, start with BTH opcodes and then expand into RETH/AETH details:

tshark -r ../ib-packets/<rdma-read-write-capture>.pcap \
-Y "infiniband.bth.opcode" \
-T fields \
-e frame.number \
-e infiniband.bth.opcode \
-e infiniband.bth.destqp \
-e infiniband.bth.psn

  1. InfiniBand has a visible control path. The captures show Subnet Management, Subnet Administration, Performance Management, and Connection Management traffic. This reinforces the Chapter 1 point that RDMA kernel bypass mainly applies to the data path.

  2. LIDs and QPs are real packet-level identifiers. LRH fields show source and destination LIDs. BTH fields show destination QPs and opcodes. These are not just abstract API concepts.

  3. QP0 and QP1 matter for management. QP0 appears in Subnet Management traffic. QP1 appears in Subnet Administration and Connection Management traffic.

  4. IPoIB is different from one-sided RDMA. IPoIB carries normal IP traffic over InfiniBand. It can show TCP, SSH, ARP, and ICMP, but that does not mean it is showing RDMA Read or RDMA Write.

  5. Reliable Connection behavior is visible. infiniband.pcap shows RC SEND and RC ACK packets. The ACK includes AETH, which matches the Chapter 1 discussion of ACK Extended Transport Header behavior.

  6. The data path/control path distinction should stay explicit. The management captures are mostly control path. IPoIB and RC SEND/ACK are closer to data path. Full RDMA Read/Write would require additional captures that include RETH and one-sided operation fields.

  • packet-format-reference.md — bit-level layouts for every IB header used in this dataset, plus the full BTH opcode master table and operation→extended-header mapping.