Werner Geyer, Jessica He, et al.
CHIWORK 2025
The growing use of Large Language Models (LLMs) in the workplace has driven the need for robust evaluation methods that align model behavior with human values and preferences. LLM-as-a-judge approaches have emerged as a scalable solution, leveraging LLMs to evaluate generated outputs against flexible, user-defined criteria. However, users often struggle to articulate clear evaluation criteria. In addition, human preferences and criteria definitions evolve, and predefined templates fail to account for context-specific nuances. To address these challenges, we present MetricMate, an interactive tool that supports users in defining and calibrating evaluation criteria for LLM-as-a-judge systems. MetricMate introduces hierarchical criteria definitions and curated examples of success and failure to promote human-AI criteria negotiation and alignment. Additionally, MetricMate learns from users’ interactions with data, enabling them to group data, identify patterns, and provide context-specific criteria.
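As a rough illustration of the LLM-as-a-judge pattern the abstract refers to (not MetricMate's actual implementation), the sketch below shows one way hierarchical, user-defined criteria with curated success and failure examples could be assembled into a judging prompt. The `Criterion` structure, `call_llm` function, and all names are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Criterion:
    """A user-defined evaluation criterion; sub_criteria make the definition hierarchical."""
    name: str
    description: str
    success_examples: List[str] = field(default_factory=list)
    failure_examples: List[str] = field(default_factory=list)
    sub_criteria: List["Criterion"] = field(default_factory=list)

def render_criterion(c: Criterion, depth: int = 0) -> str:
    """Flatten a criterion tree into an indented, prompt-ready rubric."""
    indent = "  " * depth
    lines = [f"{indent}- {c.name}: {c.description}"]
    for ex in c.success_examples:
        lines.append(f"{indent}  GOOD example: {ex}")
    for ex in c.failure_examples:
        lines.append(f"{indent}  BAD example: {ex}")
    for sub in c.sub_criteria:
        lines.append(render_criterion(sub, depth + 1))
    return "\n".join(lines)

def judge(output: str, criteria: List[Criterion]) -> str:
    """Build a judging prompt from the rubric and delegate scoring to an LLM."""
    rubric = "\n".join(render_criterion(c) for c in criteria)
    prompt = (
        "Rate the following output from 1 to 5 against each criterion below, "
        "and briefly justify each score.\n\n"
        f"Criteria:\n{rubric}\n\nOutput to evaluate:\n{output}"
    )
    return call_llm(prompt)  # hypothetical call; swap in any LLM provider's API

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError("Connect to an LLM provider of your choice.")
```

In this framing, calibration tools such as the one described would operate on the rubric itself: refining criterion descriptions, adding or removing examples, and deriving context-specific criteria from groups of data, rather than on the judging call.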