Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025
Vela is a cloud-native system designed for LLM training workloads, built using off-the-shelf hardware, Linux KVM-based virtualization, and a virtualized RDMA over Converged Ethernet (RoCE) network. Vela virtual machines (VMs) support peer-to-peer DMA between the GPUs and the SR-IOV-based network interfaces. In this paper, we share Vela's key architectural aspects, with details from an NVIDIA A100 GPU-based deployment in one of the IBM Cloud data centers. Throughout the paper, we share insights and experiences from designing, building, and operating the system over a ∼2.5-year timeframe to highlight the capabilities of readily available software and hardware technologies and the improvement opportunities for future AI systems, thereby making AI infrastructure more accessible to a broader community. Evaluating the system at ∼1500-GPU scale, we achieved ∼80% of the ideal throughput while training a 50-billion-parameter decoder model using model parallelism, and ∼70% of the per-GPU FLOPS of a single VM on the High-Performance Linpack benchmark.
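As a rough illustration of how a cluster-scale efficiency figure like the ∼80% above is typically estimated, here is a back-of-the-envelope sketch using the common 6·N FLOPs-per-token approximation for dense decoder training. This is not the paper's own methodology, and the tokens-per-second throughput below is a hypothetical placeholder, not a number from the paper.

```python
# Back-of-the-envelope utilization estimate for large-scale LLM training.
# Assumes the common ~6 * N FLOPs-per-token approximation for a dense
# decoder model (forward + backward pass combined).

N_PARAMS = 50e9               # 50B-parameter decoder model (from the abstract)
NUM_GPUS = 1500               # ~1500 A100 GPUs (from the abstract)
PEAK_FLOPS_PER_GPU = 312e12   # NVIDIA A100 BF16 dense peak, FLOP/s

TOKENS_PER_SEC = 1.2e6        # hypothetical cluster-wide training throughput

# FLOP/s actually spent on the model vs. aggregate hardware peak.
achieved_flops = 6 * N_PARAMS * TOKENS_PER_SEC
peak_flops = NUM_GPUS * PEAK_FLOPS_PER_GPU

print(f"Estimated utilization: {achieved_flops / peak_flops:.1%}")
# -> Estimated utilization: 76.9% (for this placeholder throughput)
```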
Ilias Iliadis
International Journal on Advances in Networks and Services
Alessandro Pomponio
KubeCon + CloudNativeCon NA 2025
Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024