What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions. Brian Chen, Nina Shvetsova, et al. CVPR 2024.
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval. Nina Shvetsova, Brian Chen, et al. CVPR 2022.
AVLnet: Learning audio-visual language representations from instructional videos. Andrew Rouditchenko, Angie Boggust, et al. INTERSPEECH 2021.
Cascaded multilingual audio-visual learning from videos. Andrew Rouditchenko, Angie Boggust, et al. INTERSPEECH 2021.