AI Data Center Systems
AI Systems Performance Engineering
Chapter 1: Introduction and AI System Overview
Chapter 2: AI System Hardware Overview
Chapter 3: OS, Docker, and Kubernetes Tuning for GPU-Based Environments
Chapter 4: Tuning Distributed Networking Communication
Resources
Books
AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch (2025.12)
Code
Articles
Making Deep Learning Go Brrrr From First Principles
Hardware Architectures for LLM Inference
Talks
The Engineering Behind Training a 2 Trillion Parameter LLM (2026.04)
GPU
H100 Tensor Core GPU Architecture
NVIDIA Blackwell Architecture Technical Brief
NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit (2025.08)
Using FP8 and FP4 with Transformer Engine
NCCL and Communication Collectives
NCCL Algorithms