Christodoulos Constantinides, Dhaval Patel, et al.
NeurIPS 2025
This talk will focus on designing and evaluating agentic benchmarks, with an emphasis on in-domain evaluation and real-world task reliability. Drawing on the development of AssetOpsBench, we'll discuss practical considerations for measuring agent behavior, task-completion quality, and decision robustness. The session will highlight what works, what doesn't, and what matters most when building benchmarks for agent-based systems.
Lisa Hamada, Akihiro Kishimoto, et al.
NeurIPS 2025
Sarath Swaminathan, Nathaniel Park, et al.
NeurIPS 2025
Giovanni De Felice, Arianna Casanova Flores, et al.
NeurIPS 2025