Dusan Macho, Jaume Padrell, et al.
ICME 2005
We propose the use of a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. Linear discriminant analysis (LDA) followed by a maximum likelihood linear transform (MLLT) is first applied to MFCC-based audio-only features, as well as to visual-only features obtained by a discrete cosine transform of the video region of interest. Subsequently, a second stage of LDA and MLLT is applied to the concatenation of the resulting single-modality features. The obtained audio-visual features are used to train a traditional HMM-based speech recognizer. Experiments on the IBM ViaVoice™ audio-visual database demonstrate that the proposed feature fusion method improves speaker-independent, large-vocabulary, continuous speech recognition for both the clean and noisy audio conditions considered. A 24% relative word error rate reduction over an audio-only system is achieved in the latter case.
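The hierarchical fusion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, class labels, and synthetic data are placeholders, and the MLLT decorrelation step that follows each LDA in the paper is omitted for brevity. Only the two-stage structure (per-modality LDA, then LDA on the concatenated features) is shown.

```python
import numpy as np

def lda(X, y, n_components):
    """Fisher LDA: project X onto the top discriminant directions."""
    classes = np.unique(y)
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))  # within-class scatter
    Sb = np.zeros_like(Sw)                   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Generalized eigenproblem Sb v = lambda Sw v, via Sw^{-1} Sb
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return X @ evecs[:, order[:n_components]].real

rng = np.random.default_rng(0)
n, n_states = 300, 5
y = rng.integers(0, n_states, n)                      # hypothetical HMM-state labels
audio = rng.normal(size=(n, 39)) + y[:, None]         # stand-in for MFCC-based features
video = rng.normal(size=(n, 30)) + 0.5 * y[:, None]   # stand-in for DCT ROI features

# Stage 1: per-modality LDA (the MLLT step after each LDA is omitted here)
fa = lda(audio, y, 4)
fv = lda(video, y, 4)

# Stage 2: LDA on the concatenation of the single-modality features
av = lda(np.hstack([fa, fv]), y, 4)
print(av.shape)  # (300, 4)
```

The resulting `av` matrix plays the role of the fused audio-visual feature stream fed to the HMM recognizer; in the paper, each LDA is additionally followed by an MLLT so the features better match the diagonal-covariance Gaussian assumption of the acoustic model.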