Multi-Dimensional ML-Pipeline Optimization in Cost-Effective Disaggregated Datacenter

Abstract

Machine learning (ML) pipelines deployed in datacenters are becoming increasingly complex and resource-intensive, requiring careful optimization to meet performance and latency requirements. Deployment on NUMA architectures with heterogeneous memory types such as CXL introduces a vast configuration space for memory and thread management. However, existing methods often struggle to navigate this space efficiently: they rely on significant manual tuning, lack adaptability across diverse hardware platforms, and demand extensive user-side code modification. This paper presents and experimentally evaluates an adaptive auto-tuning framework that optimizes memory configurations to maximize system throughput while adhering to Service Level Agreements (SLAs) on latency. Our optimization is carried out in two phases, addressing both performance and power efficiency. The proposed framework integrates an extended Berkeley Packet Filter (eBPF)-based kernel module for real-time performance monitoring with a user-space optimization core that leverages Bayesian Optimization and Pareto Optimality to explore high-dimensional configuration spaces. Extensive experimental evaluations on various workloads demonstrate that our framework achieves up to a 48% increase in throughput compared to default NUMA, TPP, Caption, and GPU baselines, while reducing search costs by as much as 77%. Furthermore, our two-phase optimization approach incorporates power efficiency, achieving up to 14.3% power savings without compromising throughput.
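
To make the optimization-core idea concrete, the Python sketch below illustrates one way an SLA-constrained Bayesian Optimization loop of the kind the abstract describes can be structured: two Gaussian-process surrogates model throughput and latency over a normalized configuration space, and candidate configurations whose predicted latency would breach the SLA are masked out before the next point is selected. This is an illustrative sketch only, not the paper's implementation: the measure_pipeline function, the knob count, the 50 ms SLA value, and the UCB acquisition rule are all assumptions introduced here for exposition.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def measure_pipeline(config):
        """Hypothetical stand-in for deploying `config` and reading
        throughput / p99 latency from the kernel-side monitor; here a
        synthetic toy response so the sketch runs end to end."""
        throughput = 100.0 - 40.0 * np.sum((config - 0.6) ** 2)
        latency = 30.0 + 60.0 * config[0]   # knob 0 drives latency up
        return throughput, latency

    SLA_LATENCY_MS = 50.0              # assumed latency ceiling from the SLA
    N_INIT, N_ITER = 5, 30             # random warm-up points, then BO steps
    BOUNDS = np.array([[0.0, 1.0]] * 4)  # 4 normalized knobs (e.g. memory ratio, threads)

    rng = np.random.default_rng(0)
    X = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(N_INIT, len(BOUNDS)))
    results = [measure_pipeline(x) for x in X]
    thr = np.array([r[0] for r in results])
    lat = np.array([r[1] for r in results])

    gp_thr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp_lat = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    for _ in range(N_ITER):
        gp_thr.fit(X, thr)
        gp_lat.fit(X, lat)
        # Score random candidates by an upper-confidence bound on
        # throughput, masking out predicted SLA violations.
        cand = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(256, len(BOUNDS)))
        mu_t, sd_t = gp_thr.predict(cand, return_std=True)
        mu_l, _ = gp_lat.predict(cand, return_std=True)
        ucb = np.where(mu_l <= SLA_LATENCY_MS, mu_t + 2.0 * sd_t, -np.inf)
        x_next = cand[np.argmax(ucb)]
        t, l = measure_pipeline(x_next)
        X = np.vstack([X, x_next])
        thr = np.append(thr, t)
        lat = np.append(lat, l)

    # Best observed configuration that actually met the SLA.
    best = X[np.argmax(np.where(lat <= SLA_LATENCY_MS, thr, -np.inf))]

In the framework described in the paper, the measurement step would be backed by the eBPF-based kernel monitor rather than a synthetic function, and the two-phase design would additionally weigh power efficiency against throughput via Pareto Optimality; that multi-objective step is omitted from this single-objective sketch.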