Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and a pre-trained video model (AV-HuBERT). To enable better multi-modal integration and improve multilingual performance in noise, we introduce decoder modality dropout, in which the model is trained both on paired audio-visual inputs and on separate audio-only/visual-only inputs. mWhisper-Flamingo achieves state-of-the-art word error rate (WER) on MuAViC, an AVSR dataset covering 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
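To make the idea of decoder modality dropout concrete, here is a minimal, hypothetical PyTorch-style sketch of the general scheme: on each training step, one modality's features are randomly masked so the model sees paired audio-visual inputs as well as each modality alone. The function name, the masking-by-zeroing choice, and the probabilities are illustrative assumptions, not the paper's implementation.

```python
import random
import torch

def modality_dropout(audio_feats: torch.Tensor,
                     video_feats: torch.Tensor,
                     p_audio_only: float = 0.25,
                     p_video_only: float = 0.25):
    """Randomly drop one modality during training (illustrative sketch).

    Probabilities and the zero-masking strategy are assumptions for
    illustration; the paper's actual settings may differ.
    """
    r = random.random()
    if r < p_audio_only:
        # Audio-only step: mask out the video stream.
        video_feats = torch.zeros_like(video_feats)
    elif r < p_audio_only + p_video_only:
        # Video-only step: mask out the audio stream.
        audio_feats = torch.zeros_like(audio_feats)
    # Otherwise keep both modalities (paired audio-visual step).
    return audio_feats, video_feats
```

Training over a mix of paired and single-modality steps like this is one way a decoder can learn not to over-rely on either stream, which is consistent with the abstract's motivation for robustness in noise.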