Pavlos Maniotis, Nicolas Dupuis, et al.
OECC/PSC 2022
Modern Artificial Intelligence (AI) and High-Performance Computing (HPC) workloads impose diverse demands on data center networks in terms of both performance and reliability. Applications such as inferencing are usually compute-bound, while distributed training and HPC workloads are network-bound, requiring high bandwidth and low, predictable latency. In terms of reliability, some workloads require seamless failure recovery, while others can tolerate failures through checkpointing or application/transport-layer reliability. In this paper, we address the challenge of designing a high-performance network for private data centers at scales of 1-2K Graphics Processing Units (GPUs) or other accelerators, supporting a range of workloads over Remote Direct Memory Access (RDMA), RDMA over Converged Ethernet (RoCE), and Transmission Control Protocol (TCP) with off-the-shelf hardware and software. We first present a network control plane design that accommodates flexible reliability and performance requirements, and we present our approach across key system components to efficiently utilize multiple equal-cost paths in a two-level leaf-spine topology. We then use simulations to guide our design choices, including a 1 speed ratio between server-facing and spine-facing ports, along with switch partitioning at the spine layer to mitigate the impact of flow collisions. Leveraging Ansible automation for efficient configuration management, we integrate these findings into a 12-node cluster with 8 GPUs and 1.6 Tbps bandwidth per server. Using network micro-benchmarks that mimic the demands of intensive workloads, our results show that the network sustains near-line-rate throughput under traffic patterns where up to two-thirds of traffic traverses the spine, while also supporting a flexible reliability model.
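The flow-collision concern mentioned in the abstract can be illustrated with a minimal sketch. It is not the simulator used in the paper: it assumes, purely for illustration, that equal-cost multi-path hashing behaves like a uniform random choice of uplink per flow, and the function names `collision_fraction` and `expected_collision_fraction` are hypothetical.

```python
import random
from collections import Counter


def collision_fraction(num_flows, num_paths, trials=2000, seed=0):
    """Estimate the fraction of flows that share a path with another flow.

    Assumption: each flow is hashed to one of `num_paths` equal-cost paths
    uniformly at random, independently of the other flows.
    """
    rng = random.Random(seed)
    collided = 0
    for _ in range(trials):
        # One path choice per flow, standing in for the switch's hash.
        choices = [rng.randrange(num_paths) for _ in range(num_flows)]
        load = Counter(choices)
        # A flow "collides" if at least one other flow landed on its path.
        collided += sum(1 for path in choices if load[path] > 1)
    return collided / (trials * num_flows)


def expected_collision_fraction(num_flows, num_paths):
    """Closed form under the same uniform-hash assumption.

    A given flow is alone on its path with probability
    (1 - 1/num_paths) ** (num_flows - 1).
    """
    return 1.0 - (1.0 - 1.0 / num_paths) ** (num_flows - 1)


if __name__ == "__main__":
    for flows, paths in [(8, 16), (16, 16), (32, 16)]:
        sim = collision_fraction(flows, paths)
        exact = expected_collision_fraction(flows, paths)
        print(f"{flows} flows over {paths} paths: "
              f"simulated {sim:.3f}, closed form {exact:.3f}")
```

Under this simplified model, even when there are as many paths as flows, a substantial share of flows ends up sharing a link, which is the kind of effect that design measures such as spine partitioning aim to reduce.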