Demo paper

AssetOpsBench-Live: Privacy-Aware Online Evaluation of Multi-Agent Performance in Industrial Operations

Abstract

Industrial automation increasingly relies on multi-agent AI, yet evaluation remains difficult due to task complexity and data confidentiality. We present AssetOpsBench-Live, a demonstration of a competition-ready platform for real-time, privacy-preserving evaluation of multi-agent AI in industrial contexts. The platform integrates AssetOpsBench, which measures six dimensions of multi-agent performance and performs automated failure-mode discovery, with Codabench, which supports reproducible, code-oriented competitions. Participants first validate their agents locally, then submit containerized code for execution on hidden industrial scenarios. Instead of raw trajectories, the system returns quantitative scores and clustered failure modes (e.g., reasoning-action mismatch, step repetition), enabling participants to identify failures, apply targeted improvements, and iteratively resubmit. By combining competition-based engagement with actionable diagnostics, AssetOpsBench-Live delivers reproducible, real-time insights that reflect real-world industrial constraints.