Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions. Mathew Monfort, Souyoung Jin, et al. CVPR 2021.
We Have So Much in Common: Modeling Semantic Relational Set Abstractions in Videos. Alex Andonian, Camilo Fosco, et al. ECCV 2020.
Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding. Mathew Monfort, Bowen Pan, et al. IEEE TPAMI, 2019.