ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs. Irene Huang, Wei Lin, et al. NeurIPS 2024.
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens. Elad Ben Avraham, Roi Herzig, et al. NeurIPS 2022.