Chapter 4: Optics and Cable Management

This chapter explains how optics, transceivers, cables, connectors, and physical cable management affect AI/ML data center fabrics.

The core idea is:

AI fabric performance depends not only on topology and routing, but also on the physical optical and electrical links that make high-bandwidth GPU-to-GPU communication possible.

The chapter focuses on these topics:

  • 200 Gbps, 400 Gbps, 800 Gbps, and 1.6 Tbps server and switch links
  • GPU server connectivity using OSFP and high-density NICs
  • Packet flow through transceivers, mux/demux, DSPs, SerDes, and PFE ASICs
  • Modulation schemes such as NRZ, PAM4, PAM8, QAM, and DWDM
  • FEC, clock data recovery, and equalization
  • MMF, SMF, DAC, AEC, and AOC cabling
  • Reach types such as VR, SR, DR, FR, LR, ZR, and CR
  • LC, MPO, and MTP connectors
  • QSFP28, QSFP56, QSFP-DD, OSFP, and CFP form factors
  • Pluggable optics, LPO, LRO, and CPO
  • Cable management for rail-optimized AI fabrics
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef server fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef optics fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef signal fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef cable fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef risk fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    GPU[GPU server<br/>NVLink, NVSwitch, NICs]:::server
    NIC[400G/800G NIC ports<br/>OSFP or QSFP-DD]:::server
    OPT[Transceiver<br/>optical/electrical conversion]:::optics
    DSP[DSP<br/>modulation, FEC, equalization]:::signal
    ASIC[PFE ASIC<br/>SerDes lanes]:::signal
    CABLE[MMF, SMF, DAC, AEC, AOC<br/>reach and connector choice]:::cable
    FABRIC[Leaf-spine fabric<br/>training, storage, inference]:::server
    RISK[Power, cooling, signal integrity,<br/>cost, availability, maintenance]:::risk

    GPU --> NIC --> OPT
    OPT --> DSP --> ASIC
    OPT --> CABLE --> FABRIC
    OPT --> RISK
    CABLE --> RISK

Chapter 3 covered topology and high-level network design. This chapter moves down the stack to the physical links that make those designs possible.

AI/ML clusters create unusually aggressive physical-layer requirements:

  • GPU servers expose many high-speed NIC ports.
  • Each GPU may have a dedicated scale-out NIC.
  • Training traffic needs predictable bandwidth between many servers.
  • Leaf and spine switches need high radix and high per-port speed.
  • Power and cooling budgets are constrained at rack scale.
  • Cable count and cable length directly affect operations.

Optics and cable choices therefore become architecture decisions. A design that looks valid on a topology diagram can fail operationally if the optics are unavailable, too expensive, too power hungry, too hot, or too difficult to cable cleanly.

Modern AI servers include several connectivity domains:

| Domain | Example Technology | Purpose |
| --- | --- | --- |
| Intra-server GPU fabric | NVIDIA NVLink/NVSwitch, AMD Infinity Fabric, CXL, UCIe, PCIe | GPU-to-GPU communication inside one server |
| Scale-out network | ConnectX or similar NICs, Ethernet or InfiniBand | GPU-to-GPU communication across servers |
| Storage network | NVMe-oF, Ethernet, InfiniBand, or storage NICs | Dataset and checkpoint movement |
| Out-of-band management | BMC and management Ethernet | Operations and recovery |

The chapter uses NVIDIA DGX-class systems as examples. An 8-GPU server can have a dedicated NIC path per GPU, with OSFP ports that expose 400 Gbps or 800 Gbps connectivity toward the external fabric.

DGX H100 rear panel port view

The DGX H100 rear panel shows why AI server networking must be planned as a set of separate connectivity domains: high-speed ConnectX-7 OSFP network ports, storage or host Ethernet ports, BMC management, console access, boot storage, and redundant power supplies all share the same rear-service area.

DGX H100 rear panel port map

The colored port map highlights the operational difference between scale-out network ports, storage or host ports, and management ports. These physical distinctions should be reflected in rack cabling, rail labels, switch port allocation, and troubleshooting documentation.

DGX H100 rear panel modules

The rear module view separates the GPU tray, motherboard tray, and six 3.3 kW power supplies. This matters for optics and cabling because network service access, airflow, and power-cable routing all converge at the rear of the chassis.

DGX H100 storage and networking module layout

The storage and networking module layout shows the internal relationship between ConnectX-7 storage networking, the OSFP carrier board, DGX networking modules, and DensiLink cables. External fabric design should account for how these internal modules map to OSFP-facing links.

The important distinction is:

  • Intra-server communication uses the local GPU interconnect and internal switch.
  • Server-to-server communication uses external NICs, cables, optics, and fabric switches.

AMD has Infinity Fabric, while Intel has switch and interconnect technologies around CXL, UCIe, and PCIe. These technologies are not the external data center fabric itself, but they influence how compute, memory, accelerators, and NICs are connected inside or near the server boundary.

[Note]

  • CXL, Compute Express Link: A cache-coherent interconnect used to connect CPUs, memory expanders, accelerators, and devices.
  • UCIe, Universal Chiplet Interconnect Express: A chiplet-to-chiplet interconnect for connecting dies inside a package.
  • PCIe, Peripheral Component Interconnect Express: A general-purpose high-speed I/O interconnect used for NICs, GPUs, storage, and other devices.
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef gpu fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef local fill:#3b2f20,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef nic fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fabric fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    G1[GPU 1]:::gpu
    G2[GPU 2]:::gpu
    G8[GPU 8]:::gpu
    NVS[NVSwitch / local GPU fabric]:::local
    N1[NIC 1<br/>400G/800G]:::nic
    N2[NIC 2<br/>400G/800G]:::nic
    N8[NIC 8<br/>400G/800G]:::nic
    L1[Leaf / ToR 1]:::fabric
    L2[Leaf / ToR 2]:::fabric
    L8[Leaf / ToR 8]:::fabric

    G1 --- NVS
    G2 --- NVS
    G8 --- NVS
    G1 --- N1 --> L1
    G2 --- N2 --> L2
    G8 --- N8 --> L8

AI server-to-leaf links have moved from 200 Gbps and 400 Gbps toward 800 Gbps and 1.6 Tbps. Optics roadmaps continue toward even higher speeds such as 3.2 Tbps.

This growth is driven by:

  • Larger GPU clusters
  • Higher GPU memory bandwidth and compute throughput
  • More frequent distributed training collectives
  • Faster storage and checkpointing requirements
  • High-radix switches with many ports per rack unit
  • Need to reduce training job completion time

The optics layer must keep pace with the compute layer. If GPU compute grows faster than network bandwidth, expensive accelerators sit idle waiting for data or synchronization.

Higher-speed optics face several constraints.

| Challenge | Why It Matters |
| --- | --- |
| Signal quality | Higher speeds increase attenuation, dispersion, crosstalk, and noise sensitivity |
| Signal conditioning | Equalization and DSP become more important as impairments increase |
| Power | High-speed optics can consume significant rack power |
| Cooling | More power creates more heat near dense switch ports |
| Availability | New optics may lag switch and server roadmaps |
| Cost | High-speed optical modules can dominate fabric bill of materials |
| Standards | Multi-vendor deployments need interoperable form factors and link types |

For AI fabrics, these issues are not minor. A topology may require thousands of identical links. A small per-module power or cost increase multiplies quickly.
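
The multiplication effect can be made concrete with a little arithmetic. The sketch below is illustrative only: the link count, the per-module wattage, and the per-module cost are hypothetical placeholders, not figures from this chapter or from any datasheet.

```python
def fleet_totals(link_count: int, modules_per_link: int,
                 watts_per_module: float, cost_per_module: float) -> dict:
    """Scale per-module power and cost up to the whole fabric."""
    modules = link_count * modules_per_link
    return {
        "modules": modules,
        "power_kw": modules * watts_per_module / 1000,
        "cost": modules * cost_per_module,
    }

# Hypothetical fabric: 4,096 optical links with one module at each end.
base = fleet_totals(4096, 2, watts_per_module=14.0, cost_per_module=1000.0)
# Same fabric, but every module draws 2 W more (assumed delta).
plus = fleet_totals(4096, 2, watts_per_module=16.0, cost_per_module=1000.0)

delta_kw = plus["power_kw"] - base["power_kw"]
print(f"{base['modules']} modules: +2 W per module adds {delta_kw:.1f} kW")
```

Swapping in real module counts and datasheet values turns this into a quick sanity check on the rack-level power and cooling budget.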

When a network device receives data, the signal arrives as either an electrical signal or an optical signal. Before the packet reaches the packet forwarding engine, it passes through several physical-layer stages.

Packet flow from optics through demux, DSP, and PFE ASIC

The physical receive path starts at the optical or electrical signal, splits the signal through demux logic, conditions and recovers it through the DSP, and then hands lanes toward the PFE ASIC. The transmit direction reverses the process through mux logic before the signal leaves the transceiver or cable interface.

A demultiplexer splits a high-speed incoming signal into multiple lower-rate lanes.

For example:

  • A 400 Gbps optic can be split into 8 x 50 Gbps lanes.
  • A 400 Gbps optic can also be split into 4 x 100 Gbps lanes.
  • An 800 Gbps optic can be split into 8 x 100 Gbps lanes.

A multiplexer performs the reverse operation. After the ASIC processes the packet, multiple lower-rate lanes are combined back into the desired outgoing optical or electrical rate.

The PFE ASIC uses SerDes lanes to convert between serial and parallel data streams.

| ASIC SerDes Capability | Example Mapping |
| --- | --- |
| 50 Gbps SerDes | 400G as 8 x 50G |
| 100 Gbps SerDes | 400G as 4 x 100G |
| 100 Gbps SerDes | 800G as 8 x 100G |
| 200 Gbps SerDes | Future high-speed optics with fewer lanes or higher aggregate bandwidth |

SerDes rate matters because it determines how many electrical lanes must connect the ASIC to the transceiver. More lanes increase design complexity, power, pin count, and signal-integrity work.
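
The lane mappings in the table follow from one division: port rate over SerDes rate. A minimal sketch, where `lanes_required` is a hypothetical helper name and the 1.6T row is an assumed extension of the table rather than a mapping stated in the chapter:

```python
def lanes_required(port_gbps: int, serdes_gbps: int) -> int:
    """Number of electrical lanes between the ASIC and the transceiver."""
    if port_gbps % serdes_gbps != 0:
        raise ValueError(
            f"{port_gbps}G does not divide evenly into {serdes_gbps}G lanes"
        )
    return port_gbps // serdes_gbps

# First three rows mirror the table; the last one is an assumed 1.6T case.
for port, serdes in [(400, 50), (400, 100), (800, 100), (1600, 200)]:
    print(f"{port}G over {serdes}G SerDes -> {lanes_required(port, serdes)} lanes")
```

Halving the lane count for the same port speed is what reduces pin count and routing effort, at the price of harder signal integrity per lane.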

A DSP inside the transceiver or system handles several key functions:

  • Modulation and demodulation
  • Error detection and correction
  • Clock data recovery
  • Equalization
  • Signal conditioning

The DSP is a major part of the cost and power profile of a high-speed transceiver. In pluggable optics, the DSP can account for a significant share of module cost and power. This is why LPO, LRO, and CPO attempt to change where DSP work is performed.

Higher bandwidth does not come only from faster switching silicon. It also depends on encoding more information onto the electrical or optical signal and recovering that signal reliably.

Modulation converts data into electrical or optical signal states.

PAM levels and QAM constellation comparison

| Modulation | Basic Idea | Benefit | Trade-Off |
| --- | --- | --- | --- |
| NRZ / PAM2 | Two levels, one bit per symbol | Simple and mature | Not enough for modern high-speed optics |
| PAM4 | Four levels, two bits per symbol | Doubles bits per symbol compared with NRZ | Lower signal margin and more noise sensitivity |
| PAM8 | Eight levels | Higher data rate per symbol | Even more demanding signal recovery |
| QAM | Combines amplitude and phase states | High spectral efficiency | More complex DSP and signal quality requirements |

PAM4 is important because it carries more bits at a given symbol rate: two bits per symbol instead of one, which doubles the lane bit rate without raising the baud rate. The cost is that four levels must share the same signal swing, so each level is closer to the next, and noise and distortion become more difficult to tolerate.
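
The relationship between levels, bits per symbol, and symbol rate can be sketched in a few lines. The helper names are hypothetical, the rates are nominal (FEC and encoding overhead are ignored), and the level-spacing figure assumes the same peak-to-peak swing for every scheme.

```python
def bits_per_symbol(levels: int) -> int:
    """log2(levels) for a PAM-N signal; levels must be a power of two."""
    if levels < 2 or levels & (levels - 1) != 0:
        raise ValueError("levels must be a power of two, at least 2")
    return levels.bit_length() - 1

def symbol_rate_gbaud(lane_gbps: float, levels: int) -> float:
    """Nominal symbol rate needed to carry lane_gbps with PAM-N."""
    return lane_gbps / bits_per_symbol(levels)

def level_spacing(levels: int) -> float:
    """Gap between adjacent levels as a fraction of the full signal swing."""
    return 1 / (levels - 1)

for name, levels in [("NRZ", 2), ("PAM4", 4), ("PAM8", 8)]:
    print(name,
          f"{bits_per_symbol(levels)} bit(s)/symbol,",
          f"{symbol_rate_gbaud(100, levels):.1f} GBd for a 100G lane,",
          f"level spacing {level_spacing(levels):.2f} of swing")
```

The shrinking level spacing is the numeric form of the trade-off: each extra bit per symbol leaves less room for noise between adjacent levels.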

Dense Wavelength Division Multiplexing (DWDM) carries multiple channels over a single fiber by using different wavelengths of light.

In data centers and data center interconnects, DWDM is useful when:

  • Fiber count is constrained.
  • Long distance is required.
  • Multiple high-speed channels must share a fiber path.
  • IP-over-DWDM is desired.

400G ZR optics are an example of pluggable coherent DWDM optics used for data center interconnect distances, commonly up to about 80 km.
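
To make "different wavelengths" concrete, the sketch below computes channel center frequencies and wavelengths. It assumes the fixed frequency grid of ITU-T G.694.1 (anchored at 193.1 THz), which this chapter does not cite, and a 100 GHz spacing chosen only for illustration; the function names are hypothetical.

```python
C_M_PER_S = 299_792_458   # speed of light in vacuum
ANCHOR_THZ = 193.1        # grid anchor frequency (assumed: ITU-T G.694.1)

def channel_frequency_thz(n: int, spacing_ghz: float = 100.0) -> float:
    """Center frequency of grid channel n; n may be negative."""
    return ANCHOR_THZ + n * spacing_ghz / 1000.0

def wavelength_nm(freq_thz: float) -> float:
    """Vacuum wavelength for a given optical frequency."""
    return C_M_PER_S / (freq_thz * 1e12) * 1e9

for n in (-1, 0, 1):
    f = channel_frequency_thz(n)
    print(f"channel {n:+d}: {f:.1f} THz, {wavelength_nm(f):.2f} nm")
```

Neighboring channels differ by well under a nanometer, which is why DWDM needs stable lasers and precise mux/demux filters.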

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef ch fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef mux fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef fiber fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    C1[Channel lambda 1]:::ch
    C2[Channel lambda 2]:::ch
    C3[Channel lambda 3]:::ch
    M[Mux]:::mux
    F[Single fiber pair]:::fiber
    D[Demux]:::mux
    O1[Channel lambda 1]:::ch
    O2[Channel lambda 2]:::ch
    O3[Channel lambda 3]:::ch

    C1 --> M
    C2 --> M
    C3 --> M
    M --> F --> D
    D --> O1
    D --> O2
    D --> O3

At 400 Gbps and beyond, signal recovery features become normal rather than optional.

| Function | Role |
| --- | --- |
| FEC | Corrects bit errors introduced during transmission |
| LDPC / BCH | Common error-correction coding approaches |
| Clock Data Recovery, CDR | Recovers timing from the incoming signal |
| Equalization | Restores signal shape and improves signal-to-noise ratio |
| FFE / DFE | Feed-forward and decision-feedback equalization methods |

FEC improves reliability, but it adds latency. In AI/ML fabrics, this is a practical trade-off: reliable high-speed links are required, but excessive latency can affect tightly synchronized training jobs.
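
Besides latency, FEC also spends line rate on parity. The sketch below quantifies that for a Reed-Solomon code. The RS(544,514) parameters are an assumption brought in for illustration (this code is commonly associated with PAM4 Ethernet links); the chapter itself names only LDPC and BCH, and the helper names are hypothetical.

```python
from fractions import Fraction

def rs_overhead(n: int, k: int) -> Fraction:
    """Extra line rate spent on parity, relative to the payload rate."""
    return Fraction(n, k) - 1

def rs_correctable_symbols(n: int, k: int) -> int:
    """Maximum symbol errors a Reed-Solomon code corrects per codeword."""
    return (n - k) // 2

def line_rate_gbps(payload_gbps: Fraction, n: int, k: int) -> Fraction:
    """Rate on the wire once parity symbols are added to the payload."""
    return payload_gbps * Fraction(n, k)

N, K = 544, 514  # assumed code parameters, not taken from the chapter
print(f"overhead: {float(rs_overhead(N, K)):.2%}")
print(f"correctable symbols per codeword: {rs_correctable_symbols(N, K)}")
print(f"100G payload -> {float(line_rate_gbps(Fraction(100), N, K)):.2f} Gbps on the wire")
```

A stronger code corrects more errors per codeword but raises the overhead, and the decoder still has to buffer a codeword before it can correct it, which is where the added latency comes from.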

Older data center links often used copper cabling. Copper remains useful, but its reach becomes very limited at high speeds.

| Medium | Strength | Limitation |
| --- | --- | --- |
| Copper | Low cost, simple, good for short in-rack links | Short reach at high speeds, signal integrity limits |
| Fiber | Long reach, immune to electromagnetic interference, high bandwidth | Higher optics cost, connector cleanliness and handling requirements |

As links moved to 200G, 400G, and 800G, fiber became the dominant choice for many server-to-switch and switch-to-switch connections. Copper still appears as DAC and AEC for short in-rack links.

Multi-mode fiber carries multiple rays of light through a wider core, commonly 50 um or 62.5 um. It often uses lower-cost light sources such as LEDs or VCSELs and wavelengths such as 850 nm and 1300 nm.

MMF is typically used for:

  • In-rack or nearby-rack connectivity
  • Short to medium data center links
  • AOC cables
  • SR optics

MMF grades:

| Grade | Typical Capability |
| --- | --- |
| OM1 | 1G up to 300 m, 10G up to about 33 m |
| OM2 | 1G up to 550 m, 10G up to about 82 m |
| OM3 | 40G up to about 240 m, 100G/400G up to about 100 m |
| OM4 | 100G/400G up to about 150 m |
| OM5 | 100G/400G up to about 150 m, supports WDM use cases |

OM1 has a larger 62.5 um core, while newer MMF grades commonly use 50 um.

OM1 to OM5 multi-mode fiber comparison

Source: BlueOptics, “OM2, OM3, OM4, OM5: Which multimode fiber optic cable is the right choice?”

Single-mode fiber carries a single ray of light through a smaller core, commonly about 8 um to 10 um. It uses laser sources and wavelengths such as 1310 nm and 1550 nm.

SMF is typically used for:

  • Longer data center links
  • Inter-row or inter-building links
  • Data center interconnects
  • DR, FR, LR, and ZR optics

SMF links often cost more than MMF links over short distances, largely because single-mode optics are more expensive, but SMF supports longer reach and avoids the modal dispersion caused by multiple light paths.

AI server connectivity is determined by the server port form factor, NIC speed, switch port speed, rack layout, and whether breakout is required.

For servers with 800 Gbps OSFP ports, the switch side can be designed in two common ways:

  • Connect the server OSFP directly to an 800 Gbps switch port.
  • Break out one 800 Gbps server port into two 400 Gbps switch ports.

This creates a practical design dependency: the cable and transceiver plan must align with both server and switch port speeds.
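
The port-count consequence of the two options can be checked with a short calculation. This is a sketch under assumed numbers: the server and port counts are hypothetical, and `switch_ports_needed` is an illustrative helper, not part of any vendor tool.

```python
def switch_ports_needed(server_ports: int, server_port_gbps: int,
                        switch_port_gbps: int) -> int:
    """Switch ports consumed when each server port is split evenly."""
    if server_port_gbps % switch_port_gbps != 0:
        raise ValueError("server port speed must be a multiple of switch port speed")
    breakout = server_port_gbps // switch_port_gbps
    return server_ports * breakout

# Hypothetical pod: 32 servers with 8 x 800G OSFP ports each.
server_ports = 32 * 8
print("direct 800G:", switch_ports_needed(server_ports, 800, 800), "switch ports")
print("2 x 400G breakout:", switch_ports_needed(server_ports, 800, 400), "switch ports")
```

Breakout keeps the server-side bandwidth the same but doubles the number of switch ports and cable legs that have to be planned, labeled, and tested.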

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef server fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef switch fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef cable fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    S[GPU server<br/>800G OSFP port]:::server
    C1[Option A<br/>800G direct cable]:::cable
    L1[Leaf switch<br/>800G port]:::switch
    C2[Option B<br/>2 x 400G breakout]:::cable
    L2[Leaf switch<br/>400G port 1]:::switch
    L3[Leaf switch<br/>400G port 2]:::switch

    S --> C1 --> L1
    S --> C2
    C2 --> L2
    C2 --> L3

A100-class systems commonly expose multiple 200 Gbps ports. In a rail-optimized design, each server port can connect to a different leaf switch so that each GPU or GPU rail has a separate network path.

This spreads traffic across rails and helps avoid concentrating all GPU traffic through a single ToR.

Rail-Optimized Design, ROD, creates many parallel server-to-leaf connections. It improves scale-out bandwidth, but it also increases cabling complexity.
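
A small sketch shows how quickly the cable count grows and how a rail mapping can be generated mechanically. The naming scheme (`srvNN-nicR`, `leafR-pN`) is a hypothetical labeling convention, and the mapping assumes one leaf per rail with server number equal to leaf port number.

```python
def rail_cabling(servers: int, rails: int) -> list[tuple[str, str]]:
    """Return (server port, leaf port) pairs: rail r always lands on leaf r."""
    links = []
    for s in range(1, servers + 1):
        for r in range(1, rails + 1):
            links.append((f"srv{s:02d}-nic{r}", f"leaf{r}-p{s}"))
    return links

# Hypothetical rack group: 4 servers, 8 rails each.
links = rail_cabling(servers=4, rails=8)
print(len(links), "server-to-leaf cables")
for server_end, leaf_end in links[:3]:
    print(server_end, "->", leaf_end)
```

Generating labels from the same function that defines the mapping keeps cable tags, switch port allocation, and documentation consistent with each other.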

Rack placement affects cable type and length:

| Placement | Cabling Pattern | Operational Impact |
| --- | --- | --- |
| ToR | Short server-to-leaf cables inside the rack, varied leaf-to-spine lengths | Good for short server cables, many rack-local connections |
| MoR | More uniform server cable runs within a row | May simplify row-level cable bundles |
| EoR | Longer server-to-network runs, centralized network rack | Can simplify switch placement but increases cable length |

In ROD, leaf and spine switches may be close enough that AOC or AEC is practical. For server-to-leaf links, DR optics are commonly considered when SMF reach and predictable link behavior are needed.

Optics naming often includes a reach suffix. The suffix helps identify distance, medium, and sometimes lane count.

For example, 400G-SR8 means:

  • 400G: aggregate data rate
  • SR: short reach
  • 8: eight optical lanes
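
The naming breakdown above can be sketched as a small parser. It handles only the simple `<rate>G-<reach><lanes>` shape used in this example; treating a missing lane digit as one lane is an assumption, and `parse_optic_name` is a hypothetical helper.

```python
import re

# Matches names such as "400G-SR8" or "100G-LR" (simple shape only).
_OPTIC_NAME = re.compile(r"^(\d+)G-([A-Z]{2})(\d*)$")

def parse_optic_name(name: str) -> dict:
    """Split an optic name into rate, reach suffix, and lane count."""
    match = _OPTIC_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized optic name: {name!r}")
    rate, reach, lanes = match.groups()
    lane_count = int(lanes) if lanes else 1  # assumption: no digit means 1 lane
    return {
        "rate_gbps": int(rate),
        "reach": reach,
        "lanes": lane_count,
        "lane_gbps": int(rate) / lane_count,
    }

print(parse_optic_name("400G-SR8"))
```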

Typical reach categories:

| Suffix | Meaning | Typical Reach | Medium |
| --- | --- | --- | --- |
| VR | Very short reach | About 50 m | MMF |
| SR | Short reach | About 100 m | MMF |
| DR | Data center reach | About 500 m | SMF |
| FR | Far reach | About 2 km | SMF |
| LR | Long reach | About 10 km | SMF |
| ZR | Extended reach | Up to about 80 km | DWDM / coherent optics |
| CR | Copper reach | Up to about 7 m passive DAC, about 10 m active DAC | Direct attach copper |

The right reach type depends on rack layout, fiber plant, port speed, breakout design, cost, and power.
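
As a first-pass filter, the fiber rows of the reach table can be turned into a lookup that returns the shortest reach class covering a measured distance. This is a sketch: the reach values are the approximate figures from the table (ZR taken as about 80 km), the optional margin is a hypothetical planning parameter, and real selection still has to weigh cost, power, and breakout design.

```python
# (suffix, approximate reach in meters, medium), shortest first.
REACH_CLASSES = [
    ("VR", 50, "MMF"),
    ("SR", 100, "MMF"),
    ("DR", 500, "SMF"),
    ("FR", 2_000, "SMF"),
    ("LR", 10_000, "SMF"),
    ("ZR", 80_000, "SMF, coherent DWDM"),
]

def shortest_sufficient_reach(distance_m: float, margin: float = 0.0):
    """Return (suffix, medium) for the shortest class that covers the run.

    margin is a fractional allowance for slack and patching, e.g. 0.1 = 10%.
    Returns None when no listed class reaches far enough.
    """
    required = distance_m * (1 + margin)
    for suffix, reach_m, medium in REACH_CLASSES:
        if required <= reach_m:
            return suffix, medium
    return None

for run in (30, 95, 400, 1_500):
    print(run, "m ->", shortest_sufficient_reach(run, margin=0.1))
```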

Connectors join the cable to the transceiver. As bandwidth increases, connector density and cleanliness become more important.

DAC, ACC, and AEC cable form factors

The figure above compares common high-speed cable and pluggable form factors, including OSFP-XD, OSFP, OSFP-RHS, QDD, QSFP, SFP-DD, DSFP, and SFP. These form factors matter because AI cluster cabling must match the server port, switch port, cable type, reach, airflow, and service model.

LC is a common connector for duplex fiber links. It can be used with single-mode or multi-mode fiber, but in high-speed data center optics it is often associated with SMF use cases such as FR and LR.

Example:

  • QSFP-DD-FR4 with 2 km SMF and dual LC connectors.

MPO is a multi-fiber push-on connector. MTP is a branded MPO-style connector.

MPO/MTP is used when many fibers must terminate in a dense connector, especially for:

  • Breakout cables
  • Parallel optics
  • High-density patch panels
  • SR and DR variants with multiple transmit and receive fibers

Examples:

  • QSFP-DD-SR8 with 100 m MMF and MPO-16.
  • QSFP-DD-DR4 with 500 m SMF and MPO-12.

| Cable Type | Medium | Typical Use | Benefit | Trade-Off |
| --- | --- | --- | --- | --- |
| DAC | Copper | Very short in-rack links | Low cost, low power | Limited distance |
| Passive DAC | Copper | Short direct connections | Very low power, simple | Usually limited to about 7 m |
| Active DAC | Copper with electronics | Slightly longer short links | Signal boost | More power than passive DAC |
| AEC | Copper with active electronics | High-speed short in-rack links | Cost-effective, improves signal quality | Shorter reach than fiber |
| AOC | Fiber with integrated transceivers | ToR/MoR/EoR links up to about 100 m | Thin, flexible, lower bend radius | Less modular than separate optics plus fiber |

For AI clusters:

  • DAC is attractive for very short and low-cost links.
  • AEC is attractive inside racks at high speed.
  • AOC is attractive across nearby racks or rows.
  • Separate pluggable optics plus structured fiber is attractive for larger, more serviceable deployments.
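
The four guidelines above can be sketched as a simple decision function. The distance limits are parameters, not facts: the 7 m and 100 m defaults echo the approximate figures in the cable table, the 10 m AEC default is an assumption, and all of them shrink as lane speed rises, so real limits should come from the cable datasheet.

```python
def pick_cable(distance_m: float, in_rack: bool,
               dac_max_m: float = 7.0,    # approximate passive DAC limit
               aec_max_m: float = 10.0,   # assumed active copper limit
               aoc_max_m: float = 100.0   # approximate AOC limit
               ) -> str:
    """First-pass cable family for one link, cheapest option first."""
    if in_rack and distance_m <= dac_max_m:
        return "DAC"
    if in_rack and distance_m <= aec_max_m:
        return "AEC"
    if distance_m <= aoc_max_m:
        return "AOC"
    return "pluggable optics + structured fiber"

for distance, in_rack in [(2, True), (9, True), (30, False), (300, False)]:
    print(f"{distance} m, in_rack={in_rack}: {pick_cable(distance, in_rack)}")
```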

AI fabrics need standardized form factors to avoid vendor lock-in and to make high-speed ports deployable at scale.

SFP and QSFP form factor comparison

The figure above compares common SFP and QSFP-family form factors across generations. For AI fabrics, the important operational point is that higher link speeds usually require newer cages, more lanes, different cable assemblies, and stricter power and cooling planning.

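Relevant standards and specifications include:

  • IEEE 802.3bm (40G and 100G Ethernet over fiber)
  • IEEE 802.3bs (200G and 400G Ethernet)
  • OIF implementation agreements and the OSFP MSA module specification
  • ITU-T G.959.1 (optical transport network physical layer interfaces)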

QSFP evolved from 40 Gbps toward higher speeds.

| Form Factor | Typical Data Rate | Lane Count | Typical Lane Rate | Notes |
| --- | --- | --- | --- | --- |
| QSFP28 | 100 Gbps | 4 | 25 Gbps | Mature, compact, broadly deployed |
| QSFP56 | 200 Gbps | 4 | 50 Gbps | Uses PAM4 for higher lane rate |
| QSFP-DD | 200G/400G and future higher speeds | 8 | 25G/50G/100G depending on generation | Double-density form factor, backward compatibility |

QSFP-DD is important because it offers high density and backward compatibility with earlier QSFP modules. It is widely considered for 400G Ethernet migration.

QSFP-DD packaging variants include:

| Type | Description |
| --- | --- |
| Type 1 | Similar size to QSFP28 |
| Type 2 | Longer back cage for more design room |
| Type 2A | Heat sink packaged on the optics |
| Type 2B | Taller heat sink to allow room for internal connector and port separation |

OSFP is designed for higher speeds and higher power envelopes. It is common in NVIDIA AI server connectivity and is a major form factor for 400G and 800G AI deployments.

Compared with QSFP-DD, OSFP is physically larger. The extra size can help thermal design, but it may affect density and compatibility.

CFP, CFP2, and CFP4 were early 100G form factors. CFP variants support longer-distance use cases, but they are larger and less dense than QSFP-DD or OSFP.

The chapter notes that CFP can provide far fewer ports per rack unit than QSFP-DD or OSFP, so it is less attractive for dense AI fabric switch ports.

| Feature | QSFP-DD | OSFP |
| --- | --- | --- |
| Size | Compact | Larger |
| Lane configuration | Commonly 8 lanes, with evolution toward more | Commonly 8 lanes |
| Speed target | 200G, 400G, and higher variants | 200G, 400G, 800G and beyond |
| Power | Moderate relative to OSFP | Higher power envelope |
| Cable compatibility | Copper and fiber options | Primarily high-speed optical and cable options |
| Thermal behavior | Compact size requires careful thermal design | Larger size can support stronger thermal handling |
| Migration | Strong backward compatibility | Strong fit for next-generation AI servers |
| Cost | Often lower due to smaller standardized ecosystem | Can be higher due to larger package and thermal design |

The practical lesson is not that one form factor always wins. The best choice depends on server vendor, switch vendor, port speed, thermal budget, cable plan, optics availability, and operational preference.

The industry is pushing beyond traditional pluggable optics to reduce power, increase port density, and improve economics.

Traditional pluggable optics place the optical module and DSP inside the removable transceiver.

Benefits:

  • Maximum flexibility
  • Easy replacement
  • Broad operational familiarity
  • Ability to choose different optics for different reaches

Trade-offs:

  • Higher module power
  • Higher module cost
  • Heat concentrated at the front panel
  • DSP included in every pluggable module
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef asic fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef cage fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef cable fill:#5a3520,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    ASIC[PFE ASIC]:::asic
    CAGE[Front-panel cage]:::cage
    MOD[Pluggable module<br/>optics + DSP]:::cage
    CABLE[Cable / fiber]:::cable

    ASIC --> CAGE --> MOD --> CABLE

LPO moves DSP functions out of the optical module and relies more heavily on the switch ASIC.

Benefits:

  • Lower transceiver cost
  • Lower module power
  • Smaller pluggable module
  • Reduced cooling demand at the module

Trade-offs:

  • More dependency on the PFE ASIC
  • Interoperability risk
  • Deployment and operational risk
  • Potential robustness concerns

LPO is attractive because the DSP can represent a large part of optical module cost and power. However, moving that function changes the operating model and may complicate multi-vendor interoperability.

LRO removes DSP from the receive path while maintaining DSP in the transmit path.

The chapter positions LRO as a compromise:

  • Better standards compliance than fully DSP-less designs
  • Better interoperability than aggressive LPO designs
  • Improved power efficiency compared with fully traditional pluggables
  • A more favorable balance between deployment reliability and power savings

CPO integrates optics close to, or into, the switch package rather than keeping all optics as separate front-panel pluggables.

Benefits:

  • Higher port density
  • Lower power
  • Shorter electrical traces
  • Potentially lower per-bit cost at scale

Trade-offs:

  • Less field flexibility
  • Switch includes the optics cost up front
  • Failed optics may require switch-level service
  • Technology and operational model are still emerging
%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart LR
    classDef chip fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef optic fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef note fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    ASIC[Switch ASIC package]:::chip
    O1[Optical engine 1]:::optic
    O2[Optical engine 2]:::optic
    O3[Optical engine 3]:::optic
    F[Fiber exits package area]:::note

    ASIC --- O1
    ASIC --- O2
    ASIC --- O3
    O1 --> F
    O2 --> F
    O3 --> F

LPO and CPO can benefit switch and ASIC vendors by shifting integration closer to the system. LRO is often more aligned with optics vendors because it preserves more transceiver-side functionality.

| Design Question | Common Choice | Reason |
| --- | --- | --- |
| In-rack short link | DAC or AEC | Low cost and low power for short reach |
| Short row-level link | AOC or SR optics | Flexible fiber reach without long-distance optics cost |
| Server-to-leaf 400G/800G in ROD | OSFP, QSFP-DD, DR/SR/AOC/AEC depending on layout | Must match server port, switch port, and rack geometry |
| Leaf-to-spine within row | AOC, AEC, SR, or DR | Depends on distance and structured cabling preference |
| Inter-row or building link | SMF with DR/FR/LR | Longer reach and lower dispersion |
| Data center interconnect | ZR / coherent DWDM | Long reach and wavelength multiplexing |
| High-density NVIDIA server connectivity | OSFP commonly used | Server ecosystem and 800G readiness |
| 400G switch migration with backward compatibility | QSFP-DD | Density and QSFP ecosystem compatibility |
| Lowest module power trend | LPO/LRO/CPO evaluation | Reduces DSP or electrical path burden |

Use this checklist before finalizing an AI fabric optics and cable plan.

%%{init: {"theme": "base", "themeVariables": {"background": "#171717", "primaryColor": "#232323", "primaryTextColor": "#f5f5f5", "primaryBorderColor": "#d0d0d0", "lineColor": "#cfcfcf", "fontFamily": "Inter, Arial, sans-serif"}}}%%
flowchart TB
    classDef step fill:#232323,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef test fill:#173f32,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef risk fill:#62164d,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;
    classDef done fill:#52676b,stroke:#d0d0d0,color:#f5f5f5,stroke-width:2px;

    A[Confirm server port speed and form factor]:::step
    B[Map switch port speed and breakout plan]:::step
    C[Measure rack, row, and spine distances]:::step
    D[Select DAC, AEC, AOC, MMF, or SMF]:::step
    E[Validate reach suffix and connector type]:::test
    F[Check power, cooling, airflow, bend radius]:::test
    G[Test link BER, FEC counters, flaps, and DOM telemetry]:::test
    H{Operational target met?}:::risk
    I[Approve cable bill of materials]:::done
    J[Revise optics, cable route, or topology]:::risk

    A --> B --> C --> D --> E --> F --> G --> H
    H -->|Yes| I
    H -->|No| J --> C

Checklist:

  • Confirm server-side form factor: OSFP, QSFP-DD, or another type.
  • Confirm switch-side form factor and port speeds.
  • Confirm whether breakout is needed, such as 800G to 2 x 400G.
  • Validate every link distance against the selected optics reach.
  • Avoid using long-reach optics where short-reach cables are sufficient.
  • Ensure MMF/SMF selection matches the transceiver type.
  • Ensure LC, MPO/MTP, and fiber count match the optics.
  • Validate polarity and transmit/receive mapping for MPO trunks.
  • Account for bend radius and cable tray fill.
  • Account for airflow and front-panel service access.
  • Budget power for every optical module.
  • Confirm cooling capacity at high-density switch faceplates.
  • Monitor link BER, FEC correction, uncorrectable errors, and flaps.
  • Track DOM telemetry, including temperature, optical power, voltage, and current.
  • Label cables consistently by rail, rack, leaf, spine, and port.
  • Keep spare optics and cables for the exact deployed types.
  • Validate cleaning and inspection procedures for fiber connectors.
  • Confirm that the selected optics are available in production volume.
  • Test representative links before ordering at cluster scale.
  • Update network diagrams with real cable types, lengths, and connector types.
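
The monitoring items in the checklist (FEC counters, flaps, DOM telemetry) can be sketched as a per-link check. Everything here is illustrative: `LinkSample` is a hypothetical record, the readings would come from whatever telemetry the platform exposes, and the temperature and receive-power thresholds are placeholders rather than vendor limits.

```python
from dataclasses import dataclass

@dataclass
class LinkSample:
    name: str
    fec_corrected: int      # corrected codewords since the last poll
    fec_uncorrectable: int  # uncorrectable codewords since the last poll
    flaps: int              # link up/down transitions since the last poll
    temperature_c: float    # DOM module temperature
    rx_power_dbm: float     # DOM receive optical power

def link_findings(sample: LinkSample,
                  max_temp_c: float = 70.0,    # placeholder threshold
                  min_rx_dbm: float = -8.0     # placeholder threshold
                  ) -> list[str]:
    """Return human-readable findings; an empty list means no issue seen."""
    findings = []
    if sample.fec_uncorrectable > 0:
        findings.append("uncorrectable FEC errors")
    if sample.flaps > 0:
        findings.append("link flaps")
    if sample.temperature_c > max_temp_c:
        findings.append("module temperature high")
    if sample.rx_power_dbm < min_rx_dbm:
        findings.append("receive power low")
    return findings

sample = LinkSample("leaf1-p1", fec_corrected=1200, fec_uncorrectable=3,
                    flaps=0, temperature_c=72.5, rx_power_dbm=-4.1)
print(sample.name, link_findings(sample))
```

Corrected codewords alone are not flagged here because some correction is expected on PAM4 links; a rising trend in that counter is still worth tracking as an early warning.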

AI/ML data center optics are moving rapidly from 200G and 400G toward 800G, 1.6T, and beyond. This is driven by high-speed GPU servers, distributed training, and the need to move large amounts of data between accelerators.

GPU servers use local interconnects such as NVLink/NVSwitch for intra-server traffic and external NICs for scale-out traffic. The external path depends on OSFP or QSFP-DD ports, transceivers, cabling, connector choices, and switch port capabilities.

High-speed optics rely on mux/demux, SerDes, DSPs, modulation, FEC, clock recovery, and equalization. PAM4 and higher-order modulation increase bandwidth, but they reduce signal margin and require more sophisticated signal processing.

MMF is used for short links, while SMF supports longer reach. DAC and AEC are attractive for short in-rack links, while AOC and pluggable optics are common for short to medium data center runs. DR, FR, LR, and ZR optics extend reach for larger layouts and data center interconnects.

QSFP-DD and OSFP are central form factors for AI fabrics. QSFP-DD provides compact density and backward compatibility, while OSFP is popular for high-speed AI server connectivity and 800G-class designs.

Pluggable optics provide flexibility, but power and cost are major concerns. LPO, LRO, and CPO attempt to reduce power and increase density, but each changes the trade-off among interoperability, serviceability, reliability, and cost.

| Term | Meaning |
| --- | --- |
| OSFP | Octal Small Form-factor Pluggable, common for high-speed AI server links |
| QSFP-DD | Quad Small Form-factor Pluggable Double Density |
| QSFP28 | 100G QSFP form factor using 4 lanes |
| QSFP56 | 200G QSFP form factor using 4 x 50G lanes |
| CFP | Earlier 100G form-factor family |
| SerDes | Serializer/deserializer used between ASIC and transceiver |
| PFE ASIC | Packet forwarding engine ASIC |
| DSP | Digital signal processor used for modulation, FEC, equalization, and recovery |
| NRZ | Non-Return-to-Zero, two-level modulation |
| PAM4 | Four-level modulation carrying two bits per symbol |
| QAM | Quadrature amplitude modulation |
| FEC | Forward error correction |
| CDR | Clock data recovery |
| MMF | Multi-mode fiber |
| SMF | Single-mode fiber |
| DWDM | Dense Wavelength Division Multiplexing |
| DAC | Direct attach copper |
| AEC | Active electrical cable |
| AOC | Active optical cable |
| LC | Lucent Connector, common duplex fiber connector |
| MPO | Multi-fiber push-on connector |
| MTP | Branded MPO-style connector |
| VR | Very short reach |
| SR | Short reach |
| DR | Data center reach |
| FR | Far reach |
| LR | Long reach |
| ZR | Extended reach, often coherent DWDM |
| CR | Copper reach |
| LPO | Linear-drive pluggable optics |
| LRO | Linear receive optics |
| CPO | Co-packaged optics |
| ROD | Rail-Optimized Design |

1. Why are high-speed optics critical for AI/ML data centers?


In an interview, I would start from the workload. AI/ML clusters spend a lot of time moving data between GPUs, servers, and storage, especially during all-reduce, checkpointing, and data loading. If the optical links are too slow, the GPUs wait on the network. So high-speed optics are critical because they protect GPU utilization and reduce distributed training bottlenecks.
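To put rough numbers on "the GPUs wait on the network", here is a minimal sketch of the bandwidth term of a ring all-reduce. It assumes the standard ring algorithm, in which each GPU sends about 2(N-1)/N times the payload, and it ignores latency, protocol overhead, and compute overlap. The 10 GB payload and the function name are illustrative assumptions, not values from this chapter.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, link_gbps: float) -> float:
    """Bandwidth-only lower bound for one ring all-reduce on a single link per GPU."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # Each GPU sends roughly 2 * (N - 1) / N of the payload in a ring all-reduce.
    bytes_on_wire = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return bytes_on_wire * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    payload = 10e9  # hypothetical: 10 GB of gradients per step
    for speed in (200, 400, 800):
        t = ring_allreduce_seconds(payload, num_gpus=8, link_gbps=speed)
        print(f"{speed}G link: {t * 1000:.0f} ms per all-reduce (bandwidth term only)")
```

Doubling the link speed halves this term, which is the part of each training step the optics can directly shorten.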

2. What is the role of DSPs in optical transceivers?


I would describe a DSP as the signal-recovery engine inside the optical path. At 400G and 800G, the signal is not clean enough to simply pass through unchanged. The DSP handles modulation and demodulation, FEC, clock recovery, and equalization so the receiver can reconstruct the original data reliably.
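As a worked example of what FEC costs on the wire, the sketch below assumes the RS(544,514) "KP4" code with 256b/257b transcoding that is commonly used on Ethernet PAM4 lanes. This chapter does not name a specific FEC, so treat that choice, and the function name, as assumptions.

```python
import math
from fractions import Fraction

RS_N = 544  # symbols per Reed-Solomon codeword (assumed KP4 FEC)
RS_K = 514  # data symbols per codeword


def fec_summary(payload_gbps: float = 100.0, pam_levels: int = 4) -> dict:
    """Line rate and symbol rate for one lane after transcoding and FEC overhead."""
    code_rate = Fraction(RS_K, RS_N)
    transcoding = Fraction(257, 256)  # assumed 256b/257b transcoding
    line_gbps = payload_gbps * float(transcoding / code_rate)
    bits_per_symbol = int(math.log2(pam_levels))
    return {
        # A Reed-Solomon code corrects up to (n - k) / 2 symbol errors per codeword.
        "correctable_symbols": (RS_N - RS_K) // 2,
        "overhead_percent": float(1 / code_rate - 1) * 100,
        "line_gbps": line_gbps,
        "baud_gbd": line_gbps / bits_per_symbol,
    }


if __name__ == "__main__":
    for key, value in fec_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions a nominal 100G lane runs at 106.25 Gb/s on the wire, or 53.125 GBd with PAM4, and the DSP has to recover clock and equalize at that rate before FEC can correct the remaining errors.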

3. How does PAM4 increase bandwidth compared with NRZ?

PAM4 increases bandwidth by carrying more information per symbol. NRZ, or PAM2, has two levels and carries one bit per symbol. PAM4 has four levels, so it carries two bits per symbol at the same symbol rate. The trade-off is that the signal levels are closer together, which means lower noise margin and a greater need for DSP, equalization, and FEC.
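The relationship between levels, bits per symbol, and noise margin can be sketched in a few lines. The eye penalty below assumes ideal, equally spaced levels at the same peak-to-peak amplitude, so it is a textbook estimate rather than a measured figure. The function name is illustrative.

```python
import math


def pam_properties(levels: int, lane_gbps: float) -> dict:
    """Bits per symbol, required symbol rate, and ideal eye penalty for PAM-N."""
    if levels < 2 or levels & (levels - 1):
        raise ValueError("levels must be a power of two >= 2")
    bits = int(math.log2(levels))
    return {
        "bits_per_symbol": bits,
        # Symbol rate needed to carry lane_gbps, ignoring FEC overhead.
        "symbol_rate_gbd": lane_gbps / bits,
        # Each eye is 1 / (levels - 1) of the full swing, relative to NRZ.
        "eye_penalty_db": 20 * math.log10(levels - 1),
    }


if __name__ == "__main__":
    for name, levels in (("NRZ", 2), ("PAM4", 4)):
        print(name, pam_properties(levels, lane_gbps=100.0))
```

For the same 100 Gb/s lane, PAM4 halves the symbol rate compared with NRZ but gives up roughly 9.5 dB of eye height, which is the margin that DSP, equalization, and FEC have to win back.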

4. How do MMF and SMF compare?

I would compare them by reach and signal behavior. MMF has a wider core and carries multiple light paths, so it is cost-effective for short links such as in-rack or nearby-rack connectivity. SMF has a smaller core and carries a single light path, so it avoids modal dispersion and supports longer distances. In practice, MMF is common for short SR-style links, while SMF is used for DR, FR, LR, and longer links.
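A simple way to apply this comparison is to pick the shortest reach class that still covers the link distance. The nominal reaches in the table below are typical 400G-class figures and are assumptions for illustration. The actual limits depend on the specific module, fiber grade, and vendor datasheet.

```python
from typing import Optional, Tuple

# (reach class, fiber type, nominal reach in meters): illustrative assumptions.
NOMINAL_REACH_M = [
    ("SR", "MMF", 100),
    ("DR", "SMF", 500),
    ("FR", "SMF", 2_000),
    ("LR", "SMF", 10_000),
]


def pick_reach_class(distance_m: float) -> Optional[Tuple[str, str]]:
    """Return the shortest-reach (class, fiber) pair that covers distance_m."""
    for reach_class, fiber, max_m in NOMINAL_REACH_M:
        if distance_m <= max_m:
            return reach_class, fiber
    return None  # beyond LR: a ZR-style or DCI design question, not covered here


if __name__ == "__main__":
    for d in (30, 300, 1_500, 8_000, 40_000):
        print(f"{d} m -> {pick_reach_class(d)}")
```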

5. How do you choose among DAC, AEC, and AOC?

I would choose based on distance, power, and serviceability. DAC is the simplest and cheapest option for very short in-rack links. AEC is still copper, but it adds electronics to improve signal quality at higher speeds. AOC is better when the link needs more reach or thinner, more flexible cabling, for example across racks or rows.
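The distance part of that decision can be sketched as a filter over candidate cable types, ordered from simplest to most complex. The maximum lengths below are placeholder assumptions. Real limits vary with speed, cable gauge, and vendor, so substitute values from the datasheets you actually deploy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CableOption:
    name: str
    max_length_m: float  # placeholder limit, not a vendor specification
    active: bool         # True if the cable contains powered electronics


# Ordered from simplest/cheapest to most complex, following the text above.
OPTIONS = (
    CableOption("DAC", 2.0, active=False),
    CableOption("AEC", 5.0, active=True),
    CableOption("AOC", 30.0, active=True),
)


def feasible_cables(length_m: float) -> List[str]:
    """All cable types whose assumed limit covers the requested length."""
    return [o.name for o in OPTIONS if length_m <= o.max_length_m]


def preferred_cable(length_m: float) -> str:
    """Simplest feasible option, falling back to pluggable optics."""
    choices = feasible_cables(length_m)
    return choices[0] if choices else "pluggable optics"


if __name__ == "__main__":
    for length in (1.5, 4.0, 20.0, 80.0):
        print(f"{length} m -> {preferred_cable(length)}")
```

Power and serviceability would enter as extra fields on `CableOption` in a fuller model. This sketch only captures the reach ordering.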

6. What does a reach suffix such as SR8 or DR4 mean?


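The suffix gives the optical reach class and the number of optical lanes. For example, 400G-SR8 means 400G short reach using eight optical lanes of 50G each, and 400G-DR4 means 400G data center reach using four lanes of 100G each. SR is usually short-reach MMF, while DR is data center reach over SMF. The lanes may be parallel fibers, as in SR8 and DR4, or wavelengths multiplexed onto one fiber pair, as in FR4. This matters because the suffix must match the cable plant, connector type, and physical distance.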
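The naming rule can be made concrete with a small parser. This is a sketch for names shaped like `400G-SR8` or `400GBASE-DR4`. It assumes a missing lane digit means a single lane, and it does not handle terabit names or breakout variants. The class and function names are illustrative.

```python
import re
from typing import NamedTuple


class OpticSpec(NamedTuple):
    speed_gbps: int
    reach_class: str
    lanes: int
    lane_gbps: float


# Speed, optional "BASE", optional hyphen, reach class, optional lane count.
_PATTERN = re.compile(r"^(\d+)G(?:BASE)?-?(VR|SR|DR|FR|LR|ZR|CR)(\d*)$", re.IGNORECASE)


def parse_optic(name: str) -> OpticSpec:
    """Split an optic name into speed, reach class, lane count, and per-lane rate."""
    match = _PATTERN.match(name.strip())
    if not match:
        raise ValueError(f"unrecognized optic name: {name!r}")
    speed = int(match.group(1))
    reach = match.group(2).upper()
    lanes = int(match.group(3)) if match.group(3) else 1  # assumption: no digit = 1 lane
    return OpticSpec(speed, reach, lanes, speed / lanes)


if __name__ == "__main__":
    for name in ("400G-SR8", "400G-DR4", "800G-DR8", "400GBASE-FR4"):
        print(name, parse_optic(name))
```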

7. Why are QSFP-DD and OSFP important in AI fabrics?


I would say they are important because they are the practical packaging choices for high-speed AI links. QSFP-DD gives compact, high-density switch ports and backward compatibility with the QSFP ecosystem. OSFP provides a larger form factor with a higher power and thermal envelope, which is why it is common in NVIDIA-style 400G and 800G server connectivity.
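Port speed for these form factors is the electrical lane count times the per-lane rate. The sketch below uses the four-lane QSFP28 and QSFP56 figures from the glossary and the eight-lane designs of QSFP-DD and OSFP. The function name is illustrative.

```python
# Electrical lanes per form factor.
ELECTRICAL_LANES = {
    "QSFP28": 4,
    "QSFP56": 4,
    "QSFP-DD": 8,
    "OSFP": 8,
}


def port_speed_gbps(form_factor: str, lane_gbps: int) -> int:
    """Aggregate port speed as lane count times per-lane rate."""
    return ELECTRICAL_LANES[form_factor] * lane_gbps


if __name__ == "__main__":
    examples = [("QSFP28", 25), ("QSFP56", 50), ("QSFP-DD", 50), ("OSFP", 100)]
    for form_factor, lane in examples:
        print(f"{form_factor}: {port_speed_gbps(form_factor, lane)}G")
```

Moving from 50G to 100G lanes is what takes an eight-lane port from 400G to 800G, and the extra DSP work per lane is part of why the larger OSFP thermal envelope matters.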

8. What are the trade-offs of pluggable optics?


The main advantage is operational flexibility. With pluggable optics, I can choose SR, DR, FR, LR, or ZR modules depending on distance, and I can replace a failed optic without replacing the switch. The downsides are power, cost, and heat. At 400G and 800G, front-panel optics can become a major thermal and power design constraint.
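A back-of-the-envelope calculation shows why optics power becomes a design constraint. All numbers in the example are hypothetical placeholders: the port count, watts per module, and base system power should come from real datasheets.

```python
def optics_power_watts(port_count: int, watts_per_module: float) -> float:
    """Total front-panel optics power for a fully populated switch."""
    return port_count * watts_per_module


def optics_share(port_count: int, watts_per_module: float, base_system_watts: float) -> float:
    """Fraction of total switch power consumed by the optics."""
    optics = optics_power_watts(port_count, watts_per_module)
    return optics / (optics + base_system_watts)


if __name__ == "__main__":
    # Hypothetical: 64 ports, 15 W per module, 1040 W for the rest of the system.
    total = optics_power_watts(64, 15.0)
    share = optics_share(64, 15.0, 1040.0)
    print(f"optics: {total:.0f} W, {share:.0%} of switch power")
```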

9. What are LPO, LRO, and CPO?

I would explain them as three ways to reduce the cost and power of traditional pluggables. LPO moves DSP functions out of the module and relies more on the switch ASIC. LRO is a middle ground: it removes DSP from the receive path but keeps transmit-side DSP. CPO goes further by integrating the optics into or next to the switch ASIC package. These approaches can improve power and density, but they also change interoperability, serviceability, and failure-replacement models.

10. How does cable management affect AI data center reliability?


I would treat cable management as part of reliability, not just neatness. Bad bend radius, dirty connectors, unlabeled cables, blocked airflow, or overloaded trays can all lead to link errors, thermal problems, and slow troubleshooting. In AI fabrics, where one rack can have many high-speed GPU, storage, and management links, disciplined cabling directly affects uptime and repair time.
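Some of these checks can be automated against a cable inventory. The sketch below flags missing labels, uninspected connectors, and tight bends. The record fields are hypothetical, and the 10x-diameter minimum bend radius is a common rule of thumb used here as an assumption. The cable vendor's specification takes precedence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CableRecord:
    label: str                 # empty string means unlabeled
    diameter_mm: float
    bend_radius_mm: float      # tightest bend measured along the run
    connector_inspected: bool  # end faces inspected and cleaned before mating


def audit_cable(cable: CableRecord, min_bend_multiple: float = 10.0) -> List[str]:
    """Return a list of human-readable issues found for one cable."""
    issues = []
    if not cable.label.strip():
        issues.append("missing label")
    if not cable.connector_inspected:
        issues.append("connector not inspected")
    # Assumed rule of thumb: bend radius of at least N x cable diameter.
    if cable.bend_radius_mm < min_bend_multiple * cable.diameter_mm:
        issues.append("bend radius below minimum")
    return issues


if __name__ == "__main__":
    good = CableRecord("R12-GPU03-NIC1", 3.0, 40.0, True)
    bad = CableRecord("", 3.0, 20.0, False)
    print(audit_cable(good))
    print(audit_cable(bad))
```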