Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
Jie Ren, Zhenwei Dai, et al.
NeurIPS 2025
Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., Spearman correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing, which can lead to invalid conclusions and mistrust. By analyzing over 40 prominent benchmarks, we show how some overlooked methodological choices can significantly influence BAT results. To address these inconsistencies, we propose a set of best practices and demonstrate their impact on robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench (links in the Appendix), a Python package and leaderboard for BAT.
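As a rough illustration of the BAT procedure the abstract describes, agreement between a new benchmark and an established one can be measured by rank-correlating the scores the two benchmarks assign to a shared set of models. The sketch below uses SciPy's spearmanr with made-up model names and scores; it is not the BenchBench API, just a minimal example of the underlying agreement metric.

```python
# Minimal sketch of Benchmark Agreement Testing (BAT): compare how two
# benchmarks rank the same models via Spearman rank correlation.
# Model names and scores are illustrative placeholders, not paper data.
from scipy.stats import spearmanr

# Hypothetical per-model scores on an established benchmark and a new candidate.
established = {"model_a": 71.2, "model_b": 65.4, "model_c": 80.1, "model_d": 58.9}
candidate   = {"model_a": 0.68, "model_b": 0.61, "model_c": 0.79, "model_d": 0.55}

models = sorted(established)            # fixed model order for both score vectors
x = [established[m] for m in models]
y = [candidate[m] for m in models]

rho, p_value = spearmanr(x, y)          # rank correlation between the two benchmarks
print(f"Spearman agreement: rho={rho:.3f} (p={p_value:.3f})")
```

A high rank correlation indicates the candidate benchmark orders models similarly to the established one; the paper's point is that choices such as which reference benchmark and which set of models enter this computation can substantially change the result.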