BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks. Anna Sokol, Elizabeth Daly, et al. 2025. NeurIPS 2025.
Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance. Aladin Djuhera, Swanand Ravindra Kadhe, et al. 2025. NeurIPS 2025.
PaTH Attention: Position Encoding via Accumulating Householder Transformations. Songlin Yang, Yikang Shen, et al. 2025. NeurIPS 2025.
Causally Reliable Concept Bottleneck Models. Giovanni De Felice, Arianna Casanova Flores, et al. 2025. NeurIPS 2025.
Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation. Cécile Rousseau, Tobia Boschi, et al. 2025. NeurIPS 2025.
Musings on AI Muses: Support for Human Creativity. John Richards, Jacquelyn Martino, et al. 2025. NeurIPS 2025.
Quantifying policy uncertainty in generative flow networks with uncertain rewards. Ramon Nartallo-kaluarachchi, Robert Manson Sawko, et al. 2025. NeurIPS 2025.
STRIDE: A Systematic Framework for Selecting AI Modalities—Agentic AI, AI Assistants, or LLM Calls. Shubhi Asthana, Ruchi Mahindru, et al. 2025. NeurIPS 2025.
Foundation Models Enabling Multi-Scale Battery Materials Discovery: From Molecules To Devices. Vidushi Sharma, Andy Tek, et al. 2025. NeurIPS 2025.
MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation. Basel Shbita, Farhan Ahmed, et al. 2025. NeurIPS 2025.