Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge’s quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010
Ben Fei, Jinbai Liu
IEEE Transactions on Neural Networks