Pol G. Recasens, Ferran Agullo, et al.
CLOUD 2025
The exponential growth in Artificial Intelligence (AI) adoption presents unique challenges and opportunities for deploying AI workloads in modern Data Center (DC) networks, particularly in terms of performance, scalability, and reliability. AI workloads such as inference and distributed training impose different network demands: inference is primarily compute-bound and typically requires low network latency, while distributed training is network-bound and requires high bandwidth, placing significant strain on the network. This paper focuses on the network requirements of widely known AI communication patterns and studies their impact on modern DC architectures by analyzing the effects of different orchestration strategies, specifically packing and spreading, on throughput, response time, and network congestion. The results show that packing strategies generally deliver higher performance for most of the AI collectives covered. However, spreading strategies can be beneficial in certain scenarios, such as when larger workloads span a greater number of racks, since they help mitigate network congestion between the switches of leaf-spine network configurations. This paper offers valuable insights into optimizing the orchestration of popular AI collectives in data center networks, presenting informed strategies to improve performance in response to growing AI demands, with findings demonstrating completion time reductions of up to 30%.
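To make the packing/spreading distinction concrete, here is a minimal sketch that is not taken from the paper; all function names and parameters are hypothetical. It places workers on racks either by filling each rack before moving on (packing) or round-robin across racks (spreading), then counts how many links of a ring all-reduce connect workers in different racks and would therefore traverse the leaf-spine fabric.

```python
# Hypothetical sketch contrasting packing and spreading placements.
# Not the paper's method; rack sizes and counts are illustrative.

def pack(num_workers: int, slots_per_rack: int) -> list[int]:
    """Fill each rack completely before starting the next one."""
    return [w // slots_per_rack for w in range(num_workers)]

def spread(num_workers: int, num_racks: int) -> list[int]:
    """Distribute workers round-robin across all available racks."""
    return [w % num_racks for w in range(num_workers)]

def inter_rack_ring_hops(placement: list[int]) -> int:
    """Count ring all-reduce links whose endpoints sit in different
    racks; each such link crosses the leaf-spine network."""
    n = len(placement)
    return sum(placement[i] != placement[(i + 1) % n] for i in range(n))

if __name__ == "__main__":
    workers, slots, racks = 16, 8, 4
    packed = pack(workers, slots)        # racks: [0]*8 + [1]*8
    scattered = spread(workers, racks)   # racks: 0,1,2,3,0,1,2,3,...
    print("packed inter-rack hops:", inter_rack_ring_hops(packed))      # 2
    print("spread inter-rack hops:", inter_rack_ring_hops(scattered))   # 16
```

Under these toy assumptions, the packed placement sends only two ring links across rack boundaries while the spread placement sends all sixteen, which mirrors the intuition behind the paper's finding that packing generally performs better for ring-style collectives, whereas spreading distributes load across more leaf switches when a single rack's uplinks would otherwise congest.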
Pavlos Maniotis, Nicolas Dupuis, et al.
OECC/PSC 2022
Pavlos Maniotis, Laurent Schares, et al.
SPIE OPTO 2021
Parijat Dube, Tonghoon Suk, et al.
SBAC-PAD 2019