Publication
OpML 2020
Conference paper

An experimentation and analytics framework for large-scale AI operations platforms

Abstract

This paper presents a trace-driven experimentation and analytics framework that allows researchers and engineers to devise and evaluate operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive system and simulation model. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.

Date

Publication

OpML 2020

Authors

Topics

Share