Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation. Yotam Perlitz, Ariel Gera, et al. 2025. NeurIPS 2025.
Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models. Anna A. Ivanova, Aalok Sathe, et al. 2025. Transactions of the Association for Computational Linguistics.
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community. Shachar Don-Yehiya, Leshem Choshen, et al. 2025. ACL 2025.
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation. Eliya Habba, Ofir Arviv, et al. 2025. ACL 2025.
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation. Shivalika Singh, Angelika Romanou, et al. 2025. ACL 2025.
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead. Rickard Gabrielsson, Jiacheng Zhu, et al. 2025. ICML 2025.
The future of open human feedback. Shachar Don-Yehiya, Ben Burtenshaw, et al. 2025. Nature Machine Intelligence.