Agentic Process Observability: Discovering Behavioral Variability. Fabiana Fournier, Lior Limonad, et al. 2025. ECAI 2025.
Exposing AI Bias by Crowdsourcing: Democratizing Critique of Large Language Models. Hangzhi Guo, Pranav Venkit, et al. 2025. AIES 2025.
Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty? Giacomo Camposampiero, Michael Hersche, et al. 2025. NeSy 2025.
StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation. Satyananda Kashyap, Sola Shirai, et al. 2025. VLDB 2025.
Evaluating LLM-based Agents: Foundations, Best Practices and Open Challenges. Roy Bar-Haim, Arman Cohan, et al. 2025. IJCAI 2025.
Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models. George Kour, Itay Nakash, et al. 2025. ACL 2025.
Does Your Model Understand Genes? A Modality-Agnostic Benchmark of Gene Properties. Yoav Kan-Tor, Michael Morris Danziger, et al. 2025. ISMB 2025.