Workshop paper

MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflows

Abstract

Large Language Models (LLMs) are increasingly employed to evaluate large, complex datasets automatically. By combining LLMs' reasoning capabilities with user-defined criteria, LLM-as-a-judge systems can evaluate thousands of observations against those criteria, offering a scalable and flexible solution. However, users often struggle to define and articulate clear evaluation criteria. Moreover, human preferences and criteria definitions evolve, and predefined templates fail to capture the context-specific nuances necessary for effective evaluation. To address these challenges, we present MetricMate, an interactive tool that supports users in defining and iterating on evaluation criteria for LLM-as-a-judge systems. MetricMate introduces hierarchical criteria definitions and curated examples of success and failure to promote human-AI criteria negotiation and alignment. Additionally, MetricMate provides several visualizations that help users iterate on their criteria and understand how those criteria affect the overall evaluation process. We aim to provide a tool for a wide range of users, from annotators to software developers.
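To make the LLM-as-a-judge setup concrete, the sketch below shows one possible way to represent hierarchical criteria with curated success and failure examples and render them into a judge prompt. This is an illustrative assumption rather than MetricMate's implementation: the `Criterion` data class, the `render_judge_prompt` helper, and the field names are hypothetical.

```python
# Illustrative sketch only -- not MetricMate's actual implementation.
# Assumes a hierarchical criterion structure (criteria with sub-criteria)
# plus curated success/failure examples, rendered into a judge prompt
# that a separate, unspecified LLM client would consume.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Criterion:
    name: str
    description: str
    success_examples: List[str] = field(default_factory=list)
    failure_examples: List[str] = field(default_factory=list)
    children: List["Criterion"] = field(default_factory=list)  # sub-criteria


def render_criterion(c: Criterion, depth: int = 0) -> str:
    """Flatten one criterion (and its sub-criteria) into prompt text."""
    indent = "  " * depth
    lines = [f"{indent}- {c.name}: {c.description}"]
    for ex in c.success_examples:
        lines.append(f"{indent}  Success example: {ex}")
    for ex in c.failure_examples:
        lines.append(f"{indent}  Failure example: {ex}")
    for child in c.children:
        lines.append(render_criterion(child, depth + 1))
    return "\n".join(lines)


def render_judge_prompt(criteria: List[Criterion], observation: str) -> str:
    """Assemble the evaluation prompt for a single observation."""
    criteria_text = "\n".join(render_criterion(c) for c in criteria)
    return (
        "Evaluate the response below against each criterion.\n"
        "For each criterion, give a verdict (pass/fail) and a short rationale.\n\n"
        f"Criteria:\n{criteria_text}\n\nResponse to evaluate:\n{observation}\n"
    )


if __name__ == "__main__":
    clarity = Criterion(
        name="Clarity",
        description="The answer is easy to follow.",
        success_examples=["Uses short, direct sentences."],
        failure_examples=["Buries the conclusion in jargon."],
        children=[
            Criterion(
                name="Structure",
                description="Key points appear in a logical order.",
            )
        ],
    )
    print(render_judge_prompt([clarity], "Example model output to be judged."))
```

Under these assumptions, judging each observation amounts to sending the rendered prompt to an LLM and parsing the per-criterion verdicts; MetricMate's actual prompt format and judging pipeline may differ.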