Werner Geyer, Jessica He, et al.
CHIWORK 2025
The growing use of Large Language Models (LLMs) in the workplace has driven the need for robust evaluation methods that align model behavior with human values and preferences. LLM-as-a-judge approaches have emerged as a scalable solution, leveraging LLMs to evaluate generated outputs against flexible, user-defined criteria. However, users often struggle to articulate clear evaluation criteria. In addition, human preferences and criteria definitions evolve, and predefined templates fail to account for context-specific nuances. To address these challenges, we present MetricMate, an interactive tool that supports users in defining and calibrating evaluation criteria for LLM-as-a-judge systems. MetricMate introduces hierarchical criteria definitions and curated examples of success and failure to promote human-AI criteria negotiation and alignment. Additionally, MetricMate learns from users’ interactions with data, enabling them to group data, identify patterns, and provide context-specific criteria.
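As a rough illustration of the LLM-as-a-judge pattern the abstract refers to (not MetricMate's actual implementation), the sketch below shows one way hierarchical, user-defined criteria with curated success and failure examples could be assembled into a judging prompt. The `Criterion` structure, `call_llm` function, and all names are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Criterion:
    """A user-defined evaluation criterion; sub_criteria make the definition hierarchical."""
    name: str
    description: str
    success_examples: List[str] = field(default_factory=list)
    failure_examples: List[str] = field(default_factory=list)
    sub_criteria: List["Criterion"] = field(default_factory=list)

def render_criterion(c: Criterion, depth: int = 0) -> str:
    """Flatten a criterion tree into an indented, prompt-ready rubric."""
    indent = "  " * depth
    lines = [f"{indent}- {c.name}: {c.description}"]
    for ex in c.success_examples:
        lines.append(f"{indent}  GOOD example: {ex}")
    for ex in c.failure_examples:
        lines.append(f"{indent}  BAD example: {ex}")
    for sub in c.sub_criteria:
        lines.append(render_criterion(sub, depth + 1))
    return "\n".join(lines)

def judge(output: str, criteria: List[Criterion]) -> str:
    """Build a judging prompt from the rubric and delegate scoring to an LLM."""
    rubric = "\n".join(render_criterion(c) for c in criteria)
    prompt = (
        "Rate the following output from 1 to 5 against each criterion below, "
        "and briefly justify each score.\n\n"
        f"Criteria:\n{rubric}\n\nOutput to evaluate:\n{output}"
    )
    return call_llm(prompt)  # hypothetical call; swap in any LLM provider's API

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError("Connect to an LLM provider of your choice.")
```

In this framing, calibration tools such as the one described would operate on the rubric itself: refining criterion descriptions, adding or removing examples, and deriving context-specific criteria from groups of data, rather than on the judging call.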