A Multiscale Workflow for Thermal Analysis of 3DI Chip Stacks
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Large Language Model (LLM) training workloads share computational characteristics with high-performance computing applications: intensive parallel processing, complex matrix operations, and distributed computation with frequent synchronization. Delivering optimal performance for such workloads requires specialized hardware.
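As a rough sketch of why the interconnect matters, the loop below shows the gradient all-reduce that data-parallel training performs on every step. The model, backend, and sizes are placeholders; production systems use NCCL with gradient bucketing and communication/computation overlap rather than this naive per-parameter loop.

    import torch
    import torch.distributed as dist

    def main():
        # Launch with: torchrun --nproc_per_node=2 this_file.py
        dist.init_process_group(backend="gloo")  # NCCL on real GPU clusters
        model = torch.nn.Linear(1024, 1024)
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
        x = torch.randn(32, 1024)
        for _ in range(10):
            opt.zero_grad()
            model(x).pow(2).mean().backward()
            # Frequent synchronization: every rank exchanges every gradient
            # once per step before the optimizer may proceed.
            for p in model.parameters():
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= dist.get_world_size()
            opt.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

At realistic model sizes this exchange moves gigabytes per step on every rank, which is why fabric bandwidth and latency dominate scaling behavior.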
This talk presents insights from Vela, a cloud-native system architecture introduced in 2021 for LLM training on commercial hardware and open-source software. The Vela architecture combines off-the-shelf hardware, Linux KVM virtualization with PCIe passthrough, and virtualized RDMA over Converged Ethernet (RoCE) networking. The system employs software-defined networking with SR-IOV for GPUDirect RDMA, achieving near-bare-metal performance while retaining the benefits of virtualization.
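The SR-IOV mechanism can be illustrated through the standard Linux sysfs interface, sketched below. The physical-function name "ens1f0" is a hypothetical placeholder, and this is not Vela's provisioning code; it only shows how virtual functions are instantiated so that each can be bound to vfio-pci and passed through to a KVM guest.

    from pathlib import Path

    PF = "ens1f0"  # hypothetical physical-function netdev name

    def enable_vfs(pf: str, count: int) -> None:
        dev = Path(f"/sys/class/net/{pf}/device")
        total = int((dev / "sriov_totalvfs").read_text())
        if count > total:
            raise ValueError(f"{pf} supports at most {total} VFs")
        numvfs = dev / "sriov_numvfs"
        if int(numvfs.read_text()) != 0:
            numvfs.write_text("0")  # drivers require disabling existing VFs first
        # Writing N instantiates N virtual functions, each a distinct PCI
        # function that can be bound to vfio-pci for passthrough to a guest.
        numvfs.write_text(str(count))

    if __name__ == "__main__":
        enable_vfs(PF, 8)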
Based on multiple data-center deployments and iterations, we present two case studies examining what it takes for virtualization-based systems to deliver (a) bare-metal-like RoCE performance and (b) bare-metal-like InfiniBand performance for LLM training workloads. The discussion covers virtualization challenges, operational experiences, and the runtime optimizations required for optimal performance in cloud-native training infrastructure.
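For context on how "bare-metal-like" is typically judged, the sketch below measures all-reduce bus bandwidth across a sweep of message sizes; running it unchanged on virtualized and bare-metal nodes yields directly comparable curves. It is a generic stand-in under those assumptions, not the authors' benchmark harness.

    import time
    import torch
    import torch.distributed as dist

    def allreduce_busbw(nbytes: int, iters: int = 20) -> float:
        t = torch.ones(nbytes // 4)  # float32; move to .cuda() under NCCL
        dist.barrier()
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(t)
        dist.barrier()
        elapsed = (time.perf_counter() - start) / iters
        # A ring all-reduce moves ~2*(w-1)/w of the payload per rank.
        w = dist.get_world_size()
        return 2 * (w - 1) / w * nbytes / elapsed / 1e9  # GB/s

    def main():
        dist.init_process_group(backend="gloo")  # NCCL on GPU clusters
        for size in (1 << 20, 1 << 24, 1 << 28):  # 1 MiB .. 256 MiB
            bw = allreduce_busbw(size)
            if dist.get_rank() == 0:
                print(f"{size:>12} B  {bw:7.2f} GB/s")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()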