ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMsIrene HuangWei Linet al.2024NeurIPS 2024
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene GraphsRoi HerzigAlon Mendelsonet al.2023EMNLP 2023
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object TokensElad Ben AvrahamRoi Herziget al.2022NeurIPS 2022