Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation. Yotam Perlitz, Ariel Gera, et al. NeurIPS 2025.
Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models. Anna A. Ivanova, Aalok Sathe, et al. Transactions of the Association for Computational Linguistics, 2025.
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation. Eliya Habba, Ofir Arviv, et al. ACL 2025.
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community. Shachar Don-Yehiya, Leshem Choshen, et al. ACL 2025.
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead. Rickard Gabrielsson, Jiacheng Zhu, et al. ICML 2025.
The Future of Open Human Feedback. Shachar Don-Yehiya, Ben Burtenshaw, et al. Nature Machine Intelligence, 2025.
LiveXiv: A Multi-Modal Live Benchmark Based on Arxiv Papers Content. Nimrod Shabtay, Felipe Maia Polo, et al. ICLR 2025.