Data management in machine learning: Challenges, techniques, and systems
Abstract
Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data management-related challenges that arise in ML workloads, and has built several systems for advanced analytics. This tutorial provides a comprehensive review of such systems and analyzes key data management challenges and techniques. We focus on three complementary lines of work: (1) integrating ML algorithms and languages with existing data systems such as RDBMSs, (2) adapting data management-inspired techniques such as query optimization, partitioning, and compression to new systems that target ML workloads, and (3) combining data management and ML ideas to build systems that improve ML lifecycle-related tasks. Finally, we identify key open data management challenges for future research in this important area.