Talk

How to Deploy a High-Performance Distributed AI Training Cluster with NVIDIA GPUs and KVM

Abstract

AI systems are often deployed as bare-metal servers in on-prem environments. While bare-metal deployment is performant, it is not flexible in a multi-tenant environment. Instead, public clouds and some private clouds use virtual machines (VMs) to securely partition the system resources and interconnect among multiple customers. Major public cloud providers provision AI VM systems using KVM-derived hypervisors (AWS Nitro, GCE KVM, etc.). However, their changes and configurations are not published, so others cannot reproduce the same performance using an open-source virtualization stack (KVM/QEMU) on emerging AI systems. In this technical talk, we will discuss the optimizations required to achieve near-bare-metal performance: enabling GPU passthrough, 100 GbE RoCE over SR-IOV virtual functions, and GPUDirect RDMA (GDR) inside VMs. These optimizations include hardware configuration changes to expose the system topology inside the VM, firmware changes, virtual machine configurations that faithfully represent the AI system's capabilities inside the VM, and AI training configurations.
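
To make the building blocks concrete, the sketch below (not taken from the talk; all PCI addresses, sizes, and paths are illustrative placeholders) shows one way such a VM might be launched with stock QEMU: the GPU and a NIC virtual function are handed to the guest via VFIO passthrough, guest memory is backed by hugepages, and the host CPU model is exposed so software inside the VM sees realistic capabilities.

    # Minimal Python sketch that assembles a KVM/QEMU command line for
    # GPU + NIC-VF passthrough. The PCI addresses, core counts, memory
    # size, and disk path are hypothetical and must match your host.

    gpu_bdf = "0000:3b:00.0"   # placeholder: NVIDIA GPU PCI address
    vf_bdf  = "0000:5e:00.2"   # placeholder: RoCE NIC virtual function

    qemu_args = [
        "qemu-system-x86_64",
        "-machine", "q35,accel=kvm",      # KVM acceleration, PCIe topology
        "-cpu", "host",                   # expose host CPU features to guest
        "-smp", "16,sockets=1,cores=16",  # size to the host NUMA node in use
        "-m", "128G",
        # Back guest RAM with hugepages to reduce TLB pressure under
        # RDMA-heavy training traffic.
        "-object", "memory-backend-file,id=ram0,size=128G,"
                   "mem-path=/dev/hugepages,share=on,prealloc=on",
        "-numa", "node,memdev=ram0",
        # VFIO passthrough of the GPU and the NIC virtual function.
        "-device", f"vfio-pci,host={gpu_bdf}",
        "-device", f"vfio-pci,host={vf_bdf}",
        "-drive", "file=/var/lib/images/guest.qcow2,if=virtio",
    ]

    print(" ".join(qemu_args))

Before launch, the host would bind both devices to the vfio-pci driver (for example through the sysfs driver-override interface) and create the VFs via the NIC's sriov_numvfs sysfs entry; the firmware and topology settings the abstract refers to are what then let GPUDirect RDMA reach near-bare-metal throughput on top of this baseline.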