Modern video understanding requires integrating multimodal signals, but current Multimodal Large Language Models (MLLMs) often process audio and visual streams separately, missing key cross-modal relationships and producing a fragmented, disjointed audio-visual representation. In this work, we propose UniAVLM, a large audio-visual language model for comprehensive video understanding, which first employs Whisper-style audio feature extraction to capture relevant auditory information. We then introduce spatiotemporal position encoding to enrich the video representation with temporal dynamics. Finally, we implement cross-modal attention mechanisms to explicitly fuse the audio and visual features, allowing the model to learn the intricate relationships between these modalities and produce a cohesive multimodal representation. We conduct extensive experiments on the Audio-Visual Scene-Aware Dialogue (AVSD) benchmark, comparing our model against seven representative multimodal baselines and demonstrating state-of-the-art performance, with our model achieving 48.91% accuracy and 89.93 BERTScore-F1. Specifically, our model outperforms the best vision-language model by 6.79% accuracy and surpasses the state-of-the-art full multimodal model by 4.07% accuracy, while using only parameter-efficient fine-tuning. Comprehensive ablation studies highlight the critical impact of lightweight integration strategies and thorough cross-modal fusion on video understanding.
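To make the cross-modal fusion step concrete, the following is a minimal PyTorch sketch of one way to fuse audio and visual token sequences with cross-attention, as described above. It is an illustrative assumption, not the paper's implementation: the module name CrossModalFusion, the tensor names video_feats and audio_feats, and the dimensions are all hypothetical.

```python
# Illustrative sketch only: names (CrossModalFusion, video_feats, audio_feats)
# and dimensions are assumptions, not UniAVLM's actual code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse audio and visual token sequences via cross-attention."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Visual tokens attend to audio tokens
        # (queries = video, keys/values = audio).
        self.video_to_audio = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video_feats: torch.Tensor,
                audio_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T_v, d_model); audio_feats: (B, T_a, d_model)
        attended, _ = self.video_to_audio(video_feats, audio_feats, audio_feats)
        # Residual connection keeps the original visual representation intact.
        return self.norm(video_feats + attended)

# Usage: fuse 32 video tokens with 100 Whisper-style audio tokens per clip.
fusion = CrossModalFusion()
video = torch.randn(2, 32, 768)
audio = torch.randn(2, 100, 768)
fused = fusion(video, audio)   # shape: (2, 32, 768)
```

A single cross-attention direction (video queries over audio keys/values) is shown here for brevity; a symmetric audio-to-video branch or stacked fusion layers would follow the same pattern.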