The growing use of Large Language Models (LLMs) in the workplace has driven the need for robust evaluation methods that align model behavior with human values and preferences. LLM-as-a-judge approaches have emerged as a scalable solution, leveraging LLMs to evaluate generated outputs against flexible, user-defined criteria. However, users often struggle to articulate clear evaluation criteria; moreover, human preferences and criteria definitions evolve over time, and predefined templates fail to capture context-specific nuances. To address these challenges, we present MetricMate, an interactive tool that supports users in defining and calibrating evaluation criteria for LLM-as-a-judge systems. MetricMate introduces hierarchical criteria definitions and curated examples of success and failure to promote human-AI criteria negotiation and alignment. In addition, MetricMate learns from users' interactions with their data, enabling users to group data to identify patterns and provide context-specific criteria.
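The abstract describes the approach only at a high level. As a rough illustration of the underlying idea, the sketch below models hierarchical criteria with curated success/failure examples and renders them into an LLM-as-a-judge prompt. All names here (Example, Criterion, render_judge_prompt) are hypothetical and do not reflect MetricMate's actual interface or implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    """A curated example that passes or fails a criterion."""
    text: str
    passes: bool
    note: str = ""  # optional explanation of why it passes or fails


@dataclass
class Criterion:
    """A node in a hierarchical criteria tree, with optional sub-criteria."""
    name: str
    description: str
    examples: List[Example] = field(default_factory=list)
    children: List["Criterion"] = field(default_factory=list)


def render_judge_prompt(criterion: Criterion, output: str, depth: int = 0) -> str:
    """Flatten a criteria tree and its examples into a judge prompt."""
    indent = "  " * depth
    lines = [f"{indent}- {criterion.name}: {criterion.description}"]
    for ex in criterion.examples:
        label = "PASS" if ex.passes else "FAIL"
        suffix = f" ({ex.note})" if ex.note else ""
        lines.append(f"{indent}  [{label}] {ex.text}{suffix}")
    for child in criterion.children:
        lines.append(render_judge_prompt(child, output, depth + 1))
    if depth == 0:  # only the root appends the candidate output and the instruction
        lines.append("\nCandidate output:\n" + output)
        lines.append("For each criterion above, answer PASS or FAIL with a one-sentence rationale.")
    return "\n".join(lines)


# Hypothetical usage: one parent criterion with a sub-criterion and curated examples.
tone = Criterion(
    name="Professional tone",
    description="The reply is courteous and free of slang.",
    examples=[Example("Thanks for reaching out; here is the summary.", True)],
    children=[
        Criterion(
            name="No blame",
            description="The reply never blames the customer.",
            examples=[Example("This is your fault.", False, "accusatory")],
        )
    ],
)
print(render_judge_prompt(tone, "Hey, ur totally wrong about this."))
```

The tree structure mirrors the paper's hierarchical criteria definitions, and the per-criterion PASS/FAIL examples stand in for the curated success and failure cases used to calibrate the judge.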