Skip to content

Chapter 1: Wonders in the Workload

The machine learning lifecycle consists of data gathering, training, inference, retrieval augmentation, monitoring, and retraining.

Data gathering means data collection from multiple sources, followed by data cleansing, deduplication, noise removal, and tagging. The key point is data quality and relevance. A model trained on irrelevant or low-quality data may produce inaccurate results.

Training means model selection, parameter configuration, iterative learning, accuracy validation, and parameter optimization. The model is trained on curated datasets, and its parameters are adjusted repeatedly based on the training results until the desired accuracy is achieved.

There are three major types of machine learning:

  • Supervised Learning: Training with labeled data, including both inputs and expected outputs.
  • Unsupervised Learning: Training with unlabeled data to discover hidden patterns or structures.
  • Reinforcement Learning: Training through trial and error, using rewards and penalties to improve decision-making.

Inference means trained model deployment for real-world use. Applications such as ChatGPT and Gemini mainly use inference when generating answers for users.

RAG and MCP integration means external knowledge retrieval and contextual enhancement during inference. The model can access recent or domain-specific information and generate more accurate and relevant responses.

Monitoring means continuous model performance tracking, accuracy evaluation, and drift detection. When performance decreases or new data becomes available, the model goes through periodic retraining to maintain accuracy.

Training StageCore DescriptionPractical Implication
TrainingSplits training data into multiple batches and iteratively learns from themProcesses data in batches rather than all at once, balancing GPU memory usage and training efficiency
Forward PropagationInput data passes through the model’s layers to produce a predictionEach layer performs operations such as matrix multiplication and activation functions
Error Calculation / Loss CalculationCompares the model’s prediction against the ground truth to compute the errorThe loss function quantifies how far off the model currently is
Backward PropagationPropagates the computed error from the model’s later layers back toward the earlier onesUses differentiation and the chain rule to determine how much each parameter contributed to the error
Local Gradient CalculationComputes the gradient for each parameter and adjusts the weights accordinglyThe optimizer uses these gradients to update the model parameters
Collective CommunicationSynchronizes gradients, parameters, and activations across multiple GPUs or nodes during distributed trainingxCCL libraries such as NCCL, RCCL, Gloo, and UCC handle high-speed GPU-to-GPU communication
IterationRepeats the cycle of forward propagation, loss calculation, backward propagation, and parameter updateReduces loss and gradually improves model accuracy through repeated training
Collective PatternOperationMeaning in AI Training
AllReduceReduces values from all ranks using an operation such as sum, min, or max, then returns the same result to every rankThe most common pattern for averaging or summing gradients computed by each GPU in Distributed Data Parallel training
BroadcastCopies a buffer from one root rank to all ranksUsed to distribute initial weights, configuration values, or checkpoint metadata from rank 0 to other GPUs
ReduceSimilar to AllReduce, but only the root rank receives the reduced resultUsed to aggregate results from multiple GPUs into one rank
AllGatherCollects data from every rank and makes the complete data available to every rankUsed to reassemble distributed shards in Tensor Parallelism, ZeRO, and sharded parameter or activation layouts
ReduceScatterReduces data across ranks, then splits and distributes the reduced result by rankUsed as a more efficient decomposition of AllReduce, and important in ZeRO, FSDP, and Megatron-style training
AlltoAllSends different data from each rank to every other rank and receives different data from every other rankCommonly used in Mixture-of-Experts models when routing tokens to expert GPUs
GatherCollects data from all ranks into a single root rankUsed to collect evaluation results, logs, or sample outputs into one rank
ScatterSplits data from the root rank and distributes a shard to each rankUsed to distribute data batches or shards across multiple GPUs

A collective call must be invoked by every rank with the same count and datatype. If this condition is violated, undefined behavior such as hangs, crashes, or data corruption can occur. This is why NCCL hang debugging usually starts by checking whether every rank entered the collective and whether the count and datatype match.

NCCL hang can also be a secondary symptom of GPU-side faults such as ECC, Xid, PCIe, or NVLink errors. See GPU Cluster Failure Analysis for an operational troubleshooting timeline.

  • Between Three Workers:

Collective communication patterns between three workers

Source: researchgate.net

In distributed AI training, many training algorithms (DDP, FSDP, ZeRO, Megatron-LM, etc.) require all GPUs to:

  1. Share model parameters/gradients (AllReduce / ReduceScatter / AllGather)
  2. Propagate initial states (Broadcast)
  3. Aggregate results (Gather)

Doing these naively with point-to-point sends and receives would be extremely inefficient, especially for large models and large GPU counts. Collective communication libraries like NCCL optimize these patterns to use the network topology and GPU hardware effectively.

The core motivation is scalability. If you implement broadcast using only point-to-point communication (P2P), the root rank has to send the data sequentially to (p-1) other ranks. This makes the communication time scale linearly with the number of ranks, and the root’s outgoing link becomes a bottleneck. In contrast, when you use a collective API, NCCL automatically selects an algorithm (e.g., ring, tree, NVLS, CollNet, or PAT) based on the topology, message size, and operation type. With ring-based algorithms, all links can be activated simultaneously, making the time scale nearly independent of the number of ranks.

NCCL can be understood as a GPU-oriented implementation of MPI-style collective vocabulary such as AllReduce and AllGather. The key differences are:

  • Execution location: MPI is a host-side library, while NCCL runs as GPU kernels. Communication and reduction are performed in a single kernel.
  • Data path: MPI commonly follows a host memory to network path, while NCCL uses GPU memory to GPU memory paths through GPUDirect P2P or RDMA.

Every collective can be described as a combination of four basic patterns:

PatternDirectionData Form
Broadcastroot → allReplicates the same value
Scatterroot → allDistributes different chunks
Gatherall → rootconcat
Reduceall → rootApplies an element-wise operation such as sum or max
  • AllGather = Gather + Broadcast
  • AllReduce = Reduce + Broadcast, or ReduceScatter + AllGather. The latter is used by NCCL and MPICH and is directly leveraged by ZeRO-3 and FSDP.
  • ReduceScatter = Reduce + Scatter
  • AlltoAll = transposed Scatter × N
  • 8 collectives: AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, Gather, Scatter, and AlltoAll. The last three are internally implemented as grouped P2P operations, or bundled Send/Recv calls.
  • 2 P2P operations: ncclSend and ncclRecv (NCCL 2.7+).
  • 6 one-sided RMA operations: ncclPutSignal, ncclSignal, ncclWaitSignal, ncclCommWindowRegister, and related operations (NCCL 2.29.2+). These are important for KV cache transfer in disaggregated serving patterns such as separated prefill and decode.

Two-sided communication is synchronous in terms of endpoint coupling, while one-sided RMA is asynchronous.

  • Two-sided: Even if the host call is nonblocking, both sides must issue matching calls before data can be transferred, so rendezvous coupling remains.
  • One-sided RMA: Once the receiver registers a window, the sender can write directly through DMA and send only a signal. This naturally fits workloads that need to decouple producer and consumer timing, such as separated prefill and decode.
  • Intra-node: NVLink P2P → PCIe P2P → SHM through host RAM → NIC loopback
  • Inter-node: GPU kernel → GPU vidmem → CPU proxy thread → NIC. If GPUDirect RDMA is available, the NIC can directly read from and write to GPU memory. Otherwise, data must pass through host pinned memory as a staging buffer, adding two extra PCIe crossings. This uses the same memory type as PyTorch DataLoader’s pin_memory=True.

Ring AllReduce consists of ReduceScatter (p1p-1 steps) followed by AllGather (p1p-1 steps), completing in 2(p1)2(p-1) total steps. The per-GPU send volume converges to 2(p1)Kp2K\frac{2(p-1)K}{p} \to 2K, which exactly matches the information-theoretic lower bound for AllReduce. This makes it bandwidth-optimal.

Key NCCL implementation details:

  • Single kernel design: recv → reduce → send is handled inside one CUDA kernel and within the same thread’s register set. Data does not round-trip through HBM, and kernel launch overhead is paid only once.
  • Channels: A single collective call is split into multiple channels, or CUDA blocks, so multiple parallel rings can run at the same time. This helps saturate NVLink links and NIC FIFOs.
  • Slots (NCCL_STEPS=8): Each channel is further pipelined through FIFO slots. While slot 0 is being reduced, the next chunk may already be arriving in slot 1.
  • Protocols: Simple for large messages and memory fences, LL for small messages with 8-byte atomics and roughly 25-50% bandwidth, and LL128 for 128-byte atomics and about 95% NVLink bandwidth.

Even for the same AllReduce call, NCCL may use different schedules depending on message size, rank count, and topology. The semantics are fixed, but the execution strategy is a separate layer. For each call, NCCL builds a cost table with 7 algorithms × 3 protocols = 21 cells and selects the cell with the lowest estimated time.

For AllReduce, only 10 of the 21 cells are actually evaluated. PAT is excluded, and NVLS, NVLSTree, and CollNet allow only the Simple protocol.

  • α: Startup latency for one message, similar to RTT and usually measured in microseconds.
  • β: Per-byte transfer cost, equivalent to 1 / bandwidth.
  • γ: Per-byte reduction cost, usually small compared with β and often ignored.

One message costs α+nβ. With reduction included, the cost becomes α+nβ+nγ.

Two regimes:

  • Latency-bound (small messages): step count × α. Because α dominates, algorithms with logp steps are advantageous, such as Tree and PAT.
  • Bandwidth-bound (large messages): total transfer volume dominates. Algorithms that use all links in parallel are advantageous, such as Ring.

NCCL algorithm selection can be summarized as computing time = lat + bytes / bw across the 21 cells and selecting the argmin. The topology first eliminates ineligible cells, then the cost model chooses the algorithm and protocol pair.

AlgorithmStrengthWeaknessUsually Favorable Scenario
RingStrong bandwidth utilizationLatency steps increase as rank count growsLarge tensors, large gradients, bandwidth-sensitive AllReduce
TreeLow latency and fewer hopsMay use bandwidth less efficiently than Ring for large messagesSmall messages and latency-sensitive collectives
NVLS / NVLink SHARPNVSwitch-based reduction offloadDepends on NVSwitch/NVLink hardwareInside NVSwitch systems such as DGX/HGX
CollNet / SHARPUses network or switch offloadRequires NIC, switch, or plugin supportLarge-scale multi-node collectives
PATIncreases parallelism by running multiple trees in parallelDepends on newer NCCL versions and environment supportAlgorithm family that may be selected in large-scale environments

NCCL selects among Simple, LL, and LL128 at runtime based on user settings, the collective algorithm, and internal heuristics. It considers factors such as topology, GPU architecture, and message size when choosing the algorithm-protocol pair.

Benchmark results typically show that LL and LL128 are favorable for small messages, while Simple tends to be stronger for large distributed transfers.

ProtocolCharacteristicsSuitable Scenario
SimpleLarge chunk oriented, maximizes bandwidthLarge messages and large gradients
LLLow latency with small-granularity synchronizationSmall messages where latency matters
LL128Improves bandwidth efficiency compared with LLSmall to medium messages on high-speed interconnects
TransportMeaning
P2PDirect communication between GPUs in the same node, using NVLink or PCIe P2P
SHMUses host shared memory when P2P is unavailable or inefficient
NETInter-node NIC/RDMA communication through InfiniBand, RoCE, or TCP
CollNetUses collective network offload
NVLSUses NVSwitch-based reduction acceleration

NCCL can be understood as three abstraction layers:

  1. Semantics layer — The meaning of collective APIs such as AllReduce, AllGather, and AlltoAll. This matches the MPI standard.
  2. Algorithm layer — The schedule and protocol used to execute the same semantics, such as Ring, Tree, NVLS, Simple, or LL128. The cost model selects this automatically.
  3. Kernel layer — Multi-level channel × slot pipelining inside a single CUDA kernel, with recv → reduce → send fused inside the register set.
  • DGX B200/H100
DDP → AllReduce centered
FSDP → ReduceScatter + AllGather centered
TP → AllReduce / AllGather centered
MoE → AllToAll centered
Intra-node → NVLink / NVSwitch / NVLS
Inter-node → InfiniBand / RoCE / GPUDirect RDMA
NCCL Runtime → topology graph + cost model + channelization
  • For Debug
Terminal window
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,GRAPH,COLL,NET
  • For Benchmark Experiment
Terminal window
NCCL_ALGO=Ring
NCCL_PROTO=Simple
NCCL_ALGO=Tree
NCCL_PROTO=LL
NCCL_ALGO=Ring
NCCL_PROTO=LL128

3D Parallelism: Data, Tensor, and Pipeline

Section titled “3D Parallelism: Data, Tensor, and Pipeline”
TermMeaning
JCTThe total time from when a job starts until the entire job finishes
Tail latencyThe delay of the slowest subset of flows, tasks, or requests
StragglerA slow worker, task, GPU, or network flow that creates tail latency
ThroughputThe amount of work processed per unit of time
GPU utilizationThe percentage of time that a GPU is performing useful computation

RDMA is often described as kernel bypass, but this mainly applies to the data path, not to the entire system. The control path still requires the CPU, kernel driver, and RDMA runtime.

PathWhat HappensCPU / Kernel Role
Data pathActual payload movement between registered memory regionsMostly bypassed after setup
Control pathResource setup, memory registration, key creation, QP connection, completion handling, and teardownStill required

The rest of this section first compares TCP/IP and RDMA data paths, then later explains the control-path objects and steps such as PD, CQ, QP, MR, lkey, rkey, metadata exchange, and QP state transitions.

When the sender calls send(), data usually follows this path:

  1. Application memory in user space
  2. Kernel entry through a system call and context switch
  3. Copy into the kernel socket buffer
  4. TCP layer segmentation and header generation
  5. IP layer packet construction and routing
  6. NIC driver transfer to the NIC through DMA
  7. wire

The receiving side follows the reverse path.

The important point is that memory copies between kernel buffers and user buffers are required. Each packet also goes through a long chain of interrupt → ISR → soft IRQ → socket queue → process wakeup.

When the sender calls ibv_post_send() through the Verbs API:

  1. The application enqueues a work request that points to a pre-registered Memory Region (MR) directly into the NIC’s Queue Pair (QP).
  2. The NIC reads data directly from the registered memory through DMA and sends it onto the wire.
  3. The receiving NIC writes the received data directly into the peer’s registered memory through DMA.
  4. When the operation completes, the NIC adds an entry to the Completion Queue (CQ).

The OS kernel does not appear in the data path. This is called kernel bypass. RDMA also avoids extra memory copies (zero-copy) and does not require the receiving CPU to process data arrival (CPU offload).

CategoryTCP/IPRDMA
Data pathApplication → Kernel → TCP/IP stack → NICApplication registered memory ↔ RNIC/HCA DMA
Kernel involvementHighVery low in the data path
CPU usageRelatively highLow
Memory copySocket buffer copies are commonDesigned for zero-copy transfer
LatencyRelatively highLow
GeneralityVery highRequires RDMA NICs, drivers, and fabric support
Network requirementsWorks on standard Ethernet/IPRequires InfiniBand, RoCE, iWARP, or similar support
Programming modelSocket APIVerbs, queue pairs, completion queues, and memory regions
Failure and operations complexityRelatively simpleMore difficult to configure and debug
Typical use casesWeb, API, database, and service communicationAI training, HPC, distributed storage, NVMe-oF, and high-performance file services

First, kernel bypass: TCP/IP requires system calls, kernel buffers, and TCP/IP stack processing during send() and recv(). With RDMA, once a work request is posted to a queue pair, the RNIC/HCA performs the data transfer.

Second, zero-copy: TCP/IP usually copies data between the application buffer and the kernel socket buffer. RDMA moves data directly from one registered memory region to another. TechTarget also describes RDMA as zero-copy networking that bypasses the kernel networking stack on both systems.

Third, NIC offload: RDMA NICs handle much of packetization, ordering, reliability, DMA, and completion processing in hardware. The CPU focuses on posting work requests and checking completions rather than moving data directly.

Main RDMA Modes: InfiniBand, RoCE, and iWARP

Section titled “Main RDMA Modes: InfiniBand, RoCE, and iWARP”
ModeDescriptionCharacteristics
InfiniBand RDMARDMA over an InfiniBand fabricWidely used in HPC and AI training; uses credit-based flow control
RoCEv1RDMA over Ethernet layer 2Mostly limited to L2 domains and difficult to route
RoCEv2RDMA over UDP/IPSupports IP routing and is important in AI data center Ethernet fabrics
iWARPRDMA over TCPLess dependent on lossless Ethernet, but less common than RoCE or InfiniBand in AI/HPC environments

InfiniBand and RoCEv2 are the most common options in AI data centers. RoCEv2 provides RDMA semantics over UDP/IP, so it can run on Ethernet fabrics. However, loss and congestion management become critical, so PFC, ECN, and DCQCN are usually part of the deployment.

In distributed training, GPUs continuously exchange gradients, activations, and parameter shards after forward and backward computation. If TCP/IP is used, the CPU and kernel stack can become bottlenecks.

With RDMA, collective communication libraries such as NCCL can exchange data quickly between GPU nodes over InfiniBand or RoCE. With GPUDirect RDMA, the NIC can directly access GPU memory without staging through host memory. Red Hat describes GPUDirect RDMA as useful for distributing GPU-accelerated workloads across a cluster.

The general difference is:

TCP/IP-based GPU communication
GPU memory → Host memory → Kernel TCP/IP stack → NIC
NIC → Kernel TCP/IP stack → Host memory → GPU memory
RDMA + GPUDirect RDMA
GPU memory ↔ RNIC/HCA ↔ Network ↔ RNIC/HCA ↔ GPU memory

RDMA is not just about having a faster network. It reduces CPU involvement, memory copies, GPU idle time, collective communication latency, and ultimately JCT.

RDMA does not make all kernel responsibilities disappear. Instead, it moves much of the data-path work from the kernel to RNIC/HCA hardware, firmware, and user-space RDMA runtimes.

Traditional Kernel FunctionWho Handles It in RDMA?
Reliable delivery, including retransmission, ACKs, and sequencingNIC/HCA hardware
Flow control to avoid overflowing the receive bufferNIC hardware with credit-based flow control
Congestion controlNIC plus network fabric mechanisms such as PFC, ECN, and DCQCN
Packetization and segmentationNIC hardware
Header generation and validationNIC hardware
Memory protection against invalid memory accessKernel at registration time plus NIC MMU
Address translation from virtual to physical addressesKernel at registration time plus NIC IOMMU/MTT
Connection setupKernel plus user-space library during setup
RoutingNetwork fabric and switches

RDMA shifts work from “the CPU handles everything directly” to “the NIC/HCA hardware handles the fast path.”

The key change is a shift from “the kernel participates in every packet” to “the system is configured once, then hardware handles the data path.” Work that used to happen per packet becomes setup-time work. The kernel remains involved in the control plane, but it is removed from the data plane.

One of the heavy parts of TCP is ACK tracking and retransmission. In RDMA Reliable Connected (RC) mode, the NIC handles this directly:

  • Assigns a Packet Sequence Number (PSN) to each packet
  • Automatically sends ACK/NAK responses from the receiving NIC
  • Automatically retransmits when the sending NIC does not receive an ACK
  • Reorders out-of-order packets in the NIC

To support this, the NIC contains small state machines and buffers. In effect, part of a TCP-like reliability mechanism is implemented inside the NIC chip.

When the receive-side buffer fills up, the sender must stop. RDMA handles this at two layers:

  • Credit-based flow control at the NIC level: The receiving NIC tells the sending NIC how many packets it can currently accept. When credits run out, the sender stops.
  • Priority Flow Control (PFC) at the link level: A switch can signal that a priority queue is full and temporarily pause traffic. In RoCE deployments, this is often referred to as a lossless network.

Like TCP Reno or Cubic, RDMA also needs congestion control. The difference is where the control runs:

  • DCQCN (Data Center Quantized Congestion Notification): When ECN-marked packets arrive, the sending NIC automatically reduces its rate.
  • Timely: Adjusts the rate based on RTT.
  • These mechanisms are handled by NIC firmware, so the CPU is not directly involved.

In TCP, the kernel is the gatekeeper for memory protection. In RDMA:

  • When ibv_reg_mr() is called, the kernel pins the memory and registers the Memory Translation Table (MTT) with the NIC.
  • After that, the NIC validates the remote key (rkey) and memory region for every RDMA request.
  • Invalid keys or out-of-range accesses are blocked by NIC hardware.

Protection does not disappear. The validation responsibility moves from the kernel to the NIC. This is why RDMA NIC firmware bugs can become security vulnerabilities, and related CVEs appear from time to time.

The NIC also participates in translating virtual addresses to physical addresses. It has a small MMU-like mechanism and caches MTT entries, allowing it to perform work similar to the CPU MMU independently.

Benefits:

  • Much lower CPU usage
  • Single-digit microsecond latency
  • Throughput that can approach wire rate

Costs:

  • NICs become more expensive and complex: RDMA NICs require larger chips, more firmware, and more memory than ordinary Ethernet NICs.
  • Firmware bugs become more dangerous: Kernel bugs can be patched through the OS, but NIC firmware updates are heavier and often require rebooting.
  • Network fabric requirements become stricter: RoCEv2 often assumes a lossless network, so PFC and ECN must be tuned carefully. Bad configuration can lead to events such as PFC storms.
  • Kernel flexibility is reduced: TCP can add new algorithms such as BBR through OS updates, while RDMA congestion control is tied more closely to the NIC chipset.
  • Memory pinning has a cost: Pinning large memory regions limits the OS’s ability to manage memory, compress pages, or swap.
  • Setup is expensive: Creating QPs and registering MRs is costly, so RDMA can be inefficient for short one-off messages.

The kernel’s work is not eliminated. It is split into a model where the kernel validates and configures the system once, then the NIC handles each packet quickly afterward. Memory registration and QP setup are handled by the kernel and runtime; subsequent traffic is handled by the NIC.

GPUDirect RDMA extends this model to GPU memory, allowing the NIC to read and write GPU memory directly without passing through host RAM.

CategoryInfiniBandRoCEv1RoCEv2iWARP
Base networkInfiniBand fabricEthernet L2UDP/IP over EthernetTCP/IP over Ethernet
RoutingIB routingLimited to an L2 broadcast domainSupports IP routingSupports IP routing
RDMA transportNative IB transportIB transport semanticsIB transport over UDP/IPiWARP RDMA stack
Typical useHPC, AI training, DGX/HGX clustersRDMA within the same L2 domainAI Ethernet fabrics and leaf-spine RDMASome storage and HPC use cases
StrengthsLow latency, lossless fabric, RDMA-native designSimple Ethernet RDMARoutable and easier to integrate with Ethernet data centersTCP/IP-based and routable
WeaknessesRequires operating a dedicated fabricPoor L3 scalabilityOperational complexity around PFC, ECN, and DCQCNRelatively uncommon in AI GPU fabrics
  • InfiniBand requires IB-capable switches.
  • RoCEv1 introduces Ethernet framing and enables the use of commodity switches.
  • RoCEv2 adds UDP/IP transport and enables routing.

Source: indico.cern.ch

Key fields included in the BTH header:

  • Opcode: Used to specify the type of RDMA operation, such as RDMA_WRITE, RDMA_READ, …
  • Destination QP: Used to distinguish between different destination Queue Pairs for a packet.
  • Acknowledge Request: Indicates whether the receiving end needs to return an ACK for this packet.
  • Packet Sequence Number (PSN): used in packet sequence tracking for reliable delivery.

As explained earlier, the InfiniBand transport layer needs a way to notify the sender when reliable transmission requires acknowledgment or error reporting. This is done using the AETH extension, which contains the following fields:

  • Syndrome: A field that contains response codes indicating success (ACK), error conditions (NAK), or receiver not ready (RNR) status, along with flow control information
  • Message Sequence Number (MSN): Indicates the sequence number of the most recently completed message, implying that all messages with lower sequence numbers have been successfully received

Source: https://qsysarch.com/blog/what-is-rocev2/

IB follows a five-layer network model: Physical, Link, Network, Transport, and Application layers. Unlike traditional networks where L1 & L2 are hardware-based and L3 & L4 are software-based, IB implements L1-L4 in hardware. The IB transport layer API connects HCA NICs and CPUs, with Verbs API as the application network interface for IB, similar to the Socket API for traditional TCP/IP networks. MPI, a method library for parallelization, can be based on OFA Verbs API or TCP/IP Socket.

Source: https://www.naddod.com/blog/infiniband-system-fabric-an-overview

The following diagram shows that in InfiniBand/RDMA, the application does not directly send packets. Instead, it submits a WQE to a QP, the Channel Adapter sends packets through the transport, network, link, and physical layers, and completion is reported through a CQE.

Source: https://www.researchgate.net/figure/nfiniBand-Architecture-Communication-Stack-7_fig1_242258491

In InfiniBand, every transfer starts or ends at a Channel Adapter. On a normal server, this is usually called a Host Channel Adapter (HCA). On the I/O device side, it is called a Target Channel Adapter (TCA). InfiniBand uses a switched fabric topology rather than a shared medium.

Consumer / Application
↓ Submit WQE
Queue Pair(QP): Send Queue / Receive Queue
↓ Channel Adapter executes
Transport Layer
↓ Generate IBA packet
Network / Link / PHY Layer
↓ Fabric / Packet Relay / Switch
Remote Channel Adapter
↓ Generate CQE
Remote Consumer / Application

The consumer is the upper-layer software, such as an application, MPI, NCCL, or a storage stack.

The consumer does not create network packets directly. Through the Verbs API, it creates requests such as “send this memory there,” “read remote memory,” or “perform an RDMA Write.” Red Hat describes InfiniBand as both a physical/link-layer protocol and an upper programming API, the InfiniBand Verbs API, and explains that this Verbs API is an implementation of RDMA technology.

WQE is the command unit that tells the HCA what operation to perform. The HCA can keep multiple WQEs in a queue and process them.

WQE FieldMeaning
opcodeSEND, RDMA_READ, RDMA_WRITE, ATOMIC, and similar operations
local addressAddress of the local memory buffer
lkeyAccess key for local memory
remote addressAddress of the remote memory
rkeyAccess key for remote memory
lengthTransfer length
flagsOptions such as whether signaled completion is requested

Mellanox/NVIDIA InfiniBand training material also describes a Work Queue as a consumer/producer interface to the fabric. When the consumer creates a WQE, the Channel Adapter executes that work request.

A Queue Pair is a communication channel between two HCAs. For RDMA communication to occur, QPs must be created on both sides and connected to each other. It is a core unit of RDMA communication.

A QP usually consists of two queues.

QueueRole
Send QueueSubmits outgoing work requests such as SEND, RDMA_READ, RDMA_WRITE, and ATOMIC
Receive QueuePre-posts receive buffers for incoming SEND operations

Send and receive queues are separated because one-sided operations such as RDMA Read and Write do not necessarily use the remote Receive Queue, while SEND/RECV requires the receiver to pre-post receive buffers to the Receive Queue.

NVIDIA documentation also explains that a Completion Queue can service a send queue, a receive queue, or both, and that work queues from multiple QPs can be connected to one CQ.

CQ / CQE: Completion Queue / Completion Queue Element

Section titled “CQ / CQE: Completion Queue / Completion Queue Element”

When the Channel Adapter completes an operation, it writes a CQE into the CQ to tell the application whether the operation succeeded or failed.

Consumer posts WQE
Channel Adapter executes operation
Write CQE to Completion Queue
Consumer polls CQ
operation completed
CQE FieldMeaning
opcodeOperation type
statusSuccess or failure
wr_idID associated with the WQE
byte_countNumber of bytes transferred
vendor_specificVendor-specific extension information

On servers, the Channel Adapter is usually an HCA or RNIC. It performs the following roles:

RoleDescription
WQE fetchReads work requests from the Work Queue
DMAReads from or writes to host memory or GPU memory
packetizationGenerates IBA packets
transport processingHandles PSN, ACK, retry, ordering, and related functions
completion generationWrites CQEs into the CQ
link processingSends packets through the port into the fabric

The important point is that a Channel Adapter is not just a simple NIC. It is an RDMA transport offload engine.

Because hardware creates packets and handles transport processing instead of the CPU, tail latency and CPU overhead are reduced. NVIDIA Spectrum-X NICs are representative examples.

The Transport Layer is responsible for QPs, WQEs, packet sequencing, ACKs, retries, and message segmentation and reassembly.

In InfiniBand, a message can be split into multiple packets, and those packets together form one message. InfiniBand messages can include RDMA read/write, send/receive, atomic operations, and multicast.

In transports such as Reliable Connection (RC), the receiver acknowledges packets, and the sender updates the completion queue after receiving the ACK. NVIDIA/Mellanox InfiniBand introduction material describes the same flow: the receiver acknowledges packets, and the sender receives acknowledgments and updates the completion queue.

At these layers, IBA packets are generated and transmitted through the fabric. These layers are often grouped together as the Network Layer, but they can be separated as follows:

LayerRole
Transport LayerQP, operation, reliability, ordering, ACK, retry
Network LayerLID/GID-based routing and forwarding between subnets
Link Layerlocal link framing, flow control, packet forwarding
PHY LayerPhysical cables, optical/copper media, and symbol transmission

InfiniBand uses a switched fabric topology. Processors are connected through HCAs, and peripherals can be connected through TCAs. These hardware resources are accessed through OFED drivers.

The Packet Relay in the middle can be understood as an InfiniBand switch or a forwarding device inside the fabric.

It is important that the packet relay does not go up to the transport layer. Switches usually do not process end-host QPs or CQEs. They handle packet forwarding and routing. The semantic meaning of an RDMA operation, such as “this is an RDMA Write” or “write to this remote memory address,” is handled by the Channel Adapters on both ends.

When NCCL executes collectives such as AllReduce, ReduceScatter, and AllGather, the lower layers still follow this model: WQEs are posted, the HCA creates packets, and completions are checked through CQEs. Libibverbs can be viewed as the low-level API, UCX as middleware that makes it easier to use, and NCCL as a library for AI collectives.

NCCL / MPI / PyTorch Distributed
libibverbs / UCX / NCCL net plugin
QP / CQ / MR / WQE
HCA / RNIC
InfiniBand or RoCE fabric
Remote HCA / RNIC
Remote GPU or host memory

The RDMA process can be understood in three major stages:

A. Connection / Resource Setup
B. RDMA WRITE
C. RDMA READ

The full flow starts with connection and resource setup. This phase is not yet the high-speed data path. It is mainly handled by the CPU, kernel driver, and RDMA library.

First, the two servers create a control channel through a TCP socket or RDMA-CM. This channel is not meant for large data movement. It is used to exchange RDMA connection metadata.

Then each server creates the RDMA resources required for communication.

ObjectMeaning
PDProtection Domain. A protection boundary that groups QPs and MRs
CQCompletion Queue. Completed Work Request results appear here as CQEs
QPQueue Pair. A pair of a Send Queue and a Receive Queue
MRMemory Region. Registered memory that the RNIC can access through DMA
lkeyKey used by the local RNIC to access a local MR
rkeyKey used by a remote RNIC or remote peer to access an MR

Each server also exchanges connection metadata:

QPN = Queue Pair Number
PSN = Packet Sequence Number
LID/GID = InfiniBand/RoCE address information
MTU = Negotiated packet size
remote VA = Remote virtual address
rkey = Remote memory access key

This step is critical because RDMA Read and Write directly use the remote memory address + rkey. The remote side must register memory and provide an rkey before one-sided RDMA can access that memory.

The QP is not immediately ready for traffic. It usually moves through the following states:

RESET → INIT → RTR → RTS
StateMeaning
INITBasic QP attributes are configured
RTRReady To Receive. Remote QP information is applied
RTSReady To Send. Work Requests can now be executed

After this point, one-sided operations such as RDMA Write and RDMA Read do not require the remote CPU to participate directly in the data path.

RDMA Write is a push operation. Server 1 writes data into Server 2’s registered memory.

Server 1 local MR
Server 1 RNIC
↓ WRITE packets with payload
Fabric
Server 2 RNIC
Server 2 remote MR
  1. The application posts an RDMA_WRITE Work Request to the Send Queue using ibv_post_send().
opcode = RDMA_WRITE
local address
lkey
length
remote address
rkey
signaled flag

If the signaled flag is enabled, a CQE is generated when the operation completes. If it is disabled, a successful operation may not produce a CQE. High-performance code often signals only some Work Requests to reduce CQ pressure.

  1. After ibv_post_send(), the application or library rings a doorbell to notify the RNIC that a new WQE is available.
App posts WQE
Doorbell MMIO
RNIC fetches WQE

From this point forward, the CPU does not copy bytes. The RNIC/HCA fetches the WQE and executes it.

  1. The RNIC reads the local MR through DMA using the local address and lkey from the WQE.
local address + lkey
RNIC validates local access
DMA read from local MR

This is the zero-copy property of RDMA: the CPU does not copy data from an application buffer into a kernel socket buffer.

  1. The RNIC splits the data into packets and sends it through the fabric. For RoCEv2, the packet structure is roughly:
Ethernet
+ IP
+ UDP(dst port 4791)
+ BTH
+ RETH
+ payload
+ ICRC

The key point is that an RDMA Write carries the payload in the request direction:

WRITE request = header + payload
  1. The target RNIC validates the incoming packet:
QPN match?
PSN in order?
rkey valid?
remote address inside the registered MR range?
REMOTE_WRITE permission present?

After validation, the target RNIC writes directly into the remote MR through DMA without waking the remote CPU.

  1. In Reliable Connection (RC) mode, the target RNIC returns an ACK or NAK. When the operation completes, the initiator RNIC writes a CQE into the CQ.
Target RNIC writes remote MR
ACK returned
Initiator CQE generated
App polls CQ

A normal RDMA_WRITE does not automatically notify the target application. The target memory is updated, but the target CPU may not know that it happened. If target-side notification is needed, designs often use RDMA_WRITE_WITH_IMM, SEND/RECV, a separate doorbell variable, or a polling protocol.

RDMA Read is a pull operation. Server 1 reads data from Server 2’s registered memory.

Server 1 sends READ request
Server 2 RNIC reads remote MR
READ response carries payload back
Server 1 RNIC writes local MR
  1. The application posts an RDMA_READ Work Request to the Send Queue using ibv_post_send().
opcode = RDMA_READ
local address
lkey
length
remote address
rkey

The fields look similar to RDMA Write, but the meaning is different. In Write, the local buffer is the data source. In Read, the local buffer is the destination where returned data will be stored.

  1. The initiator RNIC sends a READ request packet.
READ request = BTH + RETH

The READ request usually has no payload. Instead, it describes what to read from remote memory:

remote virtual address
rkey
length

The request packet is small, and the actual data returns in the response direction.

  1. The target RNIC validates the request:
QPN match?
PSN valid?
rkey valid?
REMOTE_READ permission present?
remote address inside the registered MR range?

The remote MR must have been registered with IBV_ACCESS_REMOTE_READ for RDMA Read to succeed.

  1. After validation, the target RNIC reads the remote MR through DMA.
remote MR
↓ DMA read
target RNIC

The remote CPU is still not involved in the data path.

  1. The READ response carries the payload back to the initiator.
READ request = header only, no payload
READ response = header + payload

Large reads are split into multiple response packets according to MTU. The FIRST / MIDDLE / LAST / ONLY flow in the diagram represents this packetization and reassembly behavior.

  1. The initiator RNIC writes the READ response payload into the local MR through DMA.
READ response payload
initiator RNIC
local MR

After all response packets arrive and reassembly completes, the initiator RNIC generates a CQE in the local CQ. The application observes completion with ibv_poll_cq().

CategoryRDMA WRITERDMA READ
Data directioninitiator → targettarget → initiator
MeaningPush into remote memoryPull from remote memory
Payload locationPayload is in the request packetPayload is in the response packet
Remote CPUNot involvedNot involved
Required remote MR permissionREMOTE_WRITEREMOTE_READ
Completion CQEGenerated on the initiator sideGenerated on the initiator side
Target notificationA normal Write does not notify the targetA normal Read does not notify the target
Latency characteristicsRelatively simpleUsually more RTT-dependent because it requires request and response
Example use casesGradient or tensor push, KV-cache pushRemote parameter fetch, KV-cache pull, remote storage read

In one sentence: Write sends data in the request, while Read requests data and receives it in the response.

1. How does the data life cycle influence AI model performance in data centers?

In my understanding, the machine learning lifecycle starts with high-quality data gathering and preprocessing. Then, the model is trained through model selection, parameter tuning, and iterative validation. Once the model reaches the required accuracy, it is deployed for inference. In production, RAG or MCP can be integrated to provide the model with up-to-date or domain-specific context. After deployment, continuous monitoring is required to track accuracy, detect model drift, and trigger retraining when necessary.

2. Explain the iterative process of training an AI model, including the roles of forward and backward propagation.

Training an AI model is an iterative process. First, the training data is divided into batches. During forward propagation, the model processes each batch through multiple layers and generates predictions. Then, the loss is calculated by comparing the predictions with the expected outputs. Through backward propagation, the model computes gradients using calculus and updates its parameters to reduce the error. In distributed training, collective communication libraries such as NCCL, RCCL, Gloo, and UCC synchronize data and gradients across multiple GPUs or nodes. This process is repeated many times until the model reaches the desired level of accuracy.

3. What are the main types of parallelism in AI training, and how do they address scalability challenges?

The main types of parallelism in AI training are data parallelism, model parallelism, tensor parallelism, and pipeline parallelism. Data parallelism replicates the model across multiple GPUs and splits the training data, which improves throughput. Model parallelism splits the model itself across GPUs when the model is too large to fit into a single device. Tensor parallelism further divides large tensor operations inside each layer, such as matrix multiplications in Transformer blocks. Pipeline parallelism splits the model by layers or stages and processes micro-batches through the pipeline. In practice, large-scale LLM training often combines these methods to scale across many GPUs while managing memory, computation, and communication overhead.

4. Define job completion time (JCT) and discuss its significance in AI/ML clusters.

Job Completion Time is the total time from when a training job starts to when the final required work finishes. In distributed AI training, the job is split across many GPUs, workers, and network flows. Even if most workers finish early, the entire job cannot complete until the slowest worker or flow finishes. Therefore, throughput improves average progress, but tail latency and stragglers determine the actual job completion time. This is why AI training networks focus heavily on reducing synchronization delay, congestion, and tail latency.

5. How does tail latency affect distributed AI training, and what architectural features help mitigate it?

In distributed AI training, even if the average latency is low, a few slow flows or straggler GPUs can delay synchronization points such as AllReduce or checkpointing. Since the whole job cannot move forward until the slowest required work finishes, tail latency directly increases Job Completion Time. Common sources include network congestion, hardware failures, and inefficient collective communication patterns. Architectural improvements include non-blocking fabrics, congestion control algorithms, hardware offloading, and optimized collective communication patterns. These features aim to minimize synchronization delays and reduce job completion time.

6. Describe the function and benefits of RDMA in AI/ML data center networks.

RDMA plays a critical role in AI and ML data center networks by enabling direct memory access between servers, GPUs, and network adapters with minimal CPU and kernel involvement. Instead of moving data through the traditional TCP/IP stack and kernel socket buffers, RDMA allows an RDMA-capable NIC to transfer data directly between registered memory regions. In distributed AI training, this is important because GPUs frequently exchange gradients, parameters, and activation data through collective communication operations such as AllReduce and ReduceScatter. RDMA reduces latency, improves throughput, lowers CPU overhead, enables zero-copy communication, and helps keep GPUs busy. As a result, it improves accelerator utilization and reduces overall job completion time in large-scale AI training clusters.

7. Compare InfiniBand, RoCEv2, and iWARP as RDMA transport protocols for AI/ML workloads.

InfiniBand, RoCEv2, and iWARP are all RDMA transport options, but they fit AI and ML workloads differently. InfiniBand is a native RDMA fabric designed for very low latency, high throughput, and predictable performance. It is widely used in high-end HPC and AI training clusters where collective communication performance is critical. RoCEv2 carries RDMA over UDP/IP on Ethernet, so it provides RDMA performance while fitting into Ethernet-based data center designs. It is attractive for hyperscale AI fabrics, but it requires careful congestion control and lossless Ethernet tuning. iWARP provides RDMA over TCP/IP, which makes it routable and compatible with standard IP networks, but it is less common in modern GPU training because the AI ecosystem and collective communication optimizations are much stronger around InfiniBand and RoCEv2. In practice, I would choose InfiniBand for maximum performance and predictability, RoCEv2 for scalable Ethernet-based AI fabrics, and iWARP mainly for niche or storage-oriented RDMA-over-IP use cases.

8. What are the key requirements for AI data center fabrics to support efficient AI/ML workloads?

The key requirement for an AI data center fabric is to keep expensive accelerators busy. Unlike traditional data center networks, AI fabrics must support massive east-west GPU-to-GPU communication with high bandwidth, low tail latency, and predictable performance. They need RDMA support through InfiniBand or RoCEv2 to reduce CPU overhead and enable direct memory movement. They also need lossless or near-lossless behavior, strong congestion control, and efficient load balancing because AI workloads generate synchronized bursts, elephant flows, and incast traffic. In addition, the fabric must provide performance isolation, deep telemetry, and resiliency, since a single slow flow, congested link, or unstable route can delay collective operations and increase job completion time. In short, an AI fabric should be optimized not only for raw bandwidth, but for low tail latency, high GPU utilization, and predictable job completion time.