AI Data Center Network
Table of Contents
- Chapter 01: Wonders in the Workload
- Chapter 02: ‘The Common-Man View’ of AI Data Center Fabrics
- Chapter 03: Network Design Considerations
- Chapter 04: Optics and Cable Management
- Chapter 05: Thermal and Power Efficiency Considerations
- Chapter 06: Effective Load Balancing
- Chapter 07: RoCEv2 Transport and Congestion Management
- Chapter 08: IP Routing for AI/ML Fabrics
- Chapter 09: Storage Network Design and Technologies for AI Data Centers
- Chapter 10: AI Network Performance KPIs
- Chapter 11: Monitoring and Telemetry
- Chapter 12: Ultra Ethernet Consortium, UEC
Appendix
- InfiniBand Packet Analysis
- RDMA Read/Write examples
- GPU Cluster Failure Analysis: ECC, Xid, RDMA, and NCCL Hang
- Clos Fabric Lab Series
Resources
- AI Data Center Network Design and Technologies (2026.02)
- Deep Learning for Network Engineers: Understanding Traffic Patterns and Network Requirements in the AI Data Center (2026.05)
- InfiniBand Network Architecture (2022.10)
Articles
- InfiniBand vs RoCEv2 Measured Comparison: Choosing a Network for Large-Scale AI Training Clusters (2026.04)
- How to Build a Small Cluster Without Switches on a DGX B300 ConnectX-8 Based 800G Network (2026.04)
- A Practical Guide to RoCEv2 Lossless Networks for GPU Clusters (2026.04)
- InfiniBand Is Losing the Fabric War. Here’s What That Changes for Your Architecture. (2026.03)
- From Megawatts to Gigawatts: The 10 Largest AI Datacenters in the World (2026 Edition) (2026.01)
- AI Data Center Network with Juniper Apstra, AMD GPUs, Broadcom Thor2 NIC, AMD Pollara NIC, and Vast Storage—Juniper Validated Design (JVD) (2025.11)
- Cisco Data Center Networking Solutions: Addressing the Challenges of AI/ML Infrastructure (2025.10)
- InfiniBand vs RoCEv2: Choosing the Right Network for Large-Scale AI (2025.08)
- Data center design requirements for AI workloads. A Comprehensive guide
- RoCE networks for distributed AI training at scale (2024.08)
- Cisco Data Center Networking Blueprint for AI/ML Applications
- Network Best Practices for Artificial Intelligence Data Centre (2024)
- How to Choose Between InfiniBand and RoCEv2 (2024.07)
- Managing the Elephant in the Room for AI Data Centers (2024.03)