Joint Audio-Visual Speech Processing for Recognition and Enhancement
Abstract
Visual speech information, present in the speaker's mouth region, has long been viewed as a means of improving the robustness and naturalness of human-computer interfaces (HCI). Such information can be particularly crucial in realistic HCI environments, where the acoustic channel is corrupted and, as a result, the performance of traditional automatic speech recognition (ASR) systems falls below usability levels. In this paper, we review two general approaches that utilize visual speech to improve ASR in acoustically challenging environments: the first directly combines features extracted from the acoustic and visual channels, aiming at superior recognition performance of the resulting audio-visual ASR system; the second seeks to remove the noise present in the acoustic features through audio-visual enhancement, thereby improving speech recognition. We present a number of techniques recently introduced in the literature for bimodal ASR and enhancement, and we study their performance on a suitable audio-visual database. Among the methods considered, our recognition experiments demonstrate that decision-based combination of audio and visual features significantly outperforms simpler feature-based integration methods for audio-visual ASR. For audio feature enhancement, a non-linear technique proves more successful than a regression-based approach. As expected, both bimodal ASR and enhancement outperform their audio-only counterparts.
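To make the contrast between the two integration strategies concrete, the following is a minimal sketch, not the paper's actual recognizer: feature-based integration concatenates the audio and visual feature vectors and scores them with a single model, whereas decision-based integration scores each stream with its own model and combines the log-likelihoods with a stream weight. The diagonal-Gaussian class models, dimensions, and the weight `lam` are illustrative assumptions.

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# Feature-based integration: concatenate the streams, score jointly.
def feature_fusion_loglik(audio_feat, visual_feat, joint_mean, joint_var):
    joint = np.concatenate([audio_feat, visual_feat])
    return diag_gauss_loglik(joint, joint_mean, joint_var)

# Decision-based integration: score each stream separately, then combine
# the per-stream log-likelihoods with a stream weight (exponent) lam.
def decision_fusion_loglik(audio_feat, visual_feat,
                           audio_model, visual_model, lam=0.7):
    ll_a = diag_gauss_loglik(audio_feat, *audio_model)
    ll_v = diag_gauss_loglik(visual_feat, *visual_model)
    # lam can be lowered when the acoustic channel is noisy, shifting
    # reliance toward the (uncorrupted) visual stream.
    return lam * ll_a + (1.0 - lam) * ll_v

# Hypothetical usage: pick the best of two classes for one frame.
rng = np.random.default_rng(0)
n_audio, n_visual = 13, 6
classes = [((rng.normal(size=n_audio), np.ones(n_audio)),
            (rng.normal(size=n_visual), np.ones(n_visual)))
           for _ in range(2)]
frame_a, frame_v = rng.normal(size=n_audio), rng.normal(size=n_visual)
scores = [decision_fusion_loglik(frame_a, frame_v, am, vm)
          for (am, vm) in classes]
print("decision-fusion best class:", int(np.argmax(scores)))
```

The per-stream weight is what gives decision-based combination its robustness: it can discount the acoustic stream as noise increases, which a single joint model over concatenated features cannot do directly.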
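Similarly, the enhancement comparison can be illustrated with a small sketch on synthetic data, contrasting a linear-regression mapping from noisy bimodal features to clean audio features with a non-linear (here, MLP) mapping. The feature dimensions, the clean/noisy/visual relationship, and the network size are all illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, d_audio, d_visual = 2000, 13, 6

clean = rng.normal(size=(n, d_audio))                # clean audio features
noisy = clean + 0.8 * rng.normal(size=(n, d_audio))  # corrupted acoustic channel
W = rng.normal(size=(d_audio, d_visual))
visual = np.tanh(clean @ W) + 0.1 * rng.normal(size=(n, d_visual))

X = np.hstack([noisy, visual])        # bimodal input: noisy audio + visual
tr, te = slice(0, 1500), slice(1500, None)

lin = LinearRegression().fit(X[tr], clean[tr])
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000,
                   random_state=0).fit(X[tr], clean[tr])

print("linear regression MSE:", mean_squared_error(clean[te], lin.predict(X[te])))
print("non-linear (MLP) MSE: ", mean_squared_error(clean[te], mlp.predict(X[te])))
```

Because the visual stream here relates to the clean features through a non-linearity, the MLP estimator can exploit it more fully than the linear regressor, mirroring the abstract's finding that a non-linear enhancement technique outperforms the regression-based approach.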