In this paper, we consider the problem of resource-efficient architectures for audio-visual automatic speech recognition (AVSR). Specifically, we complement our earlier work, which introduced efficient convolutional neural networks (CNNs) for visual-only speech recognition, by focusing here on the sequence modeling component of the architecture, proposing a novel resource-efficient time-delay neural network (TDNN) that we extend to AVSR. In more detail, we introduce the sTDNN-F module, which combines the factored TDNN (TDNN-F) with grouped fully-connected layers and the shuffle operation. We then develop an AVSR system based on the sTDNN-F, incorporating the efficient CNNs of our earlier work and other standard visual processing and speech recognition modules. We evaluate our approach on the popular TCD-TIMIT corpus under two speaker-independent training/testing scenarios. Our best sTDNN-F based AVSR system turns out to be 74% more efficient than a traditional TDNN one and 35% more efficient than TDNN-F, while maintaining similar recognition accuracy and noise robustness, and also significantly outperforming its audio-only counterpart.
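To make the sTDNN-F idea concrete, the sketch below shows one plausible reading of such a module in PyTorch: a factored (low-rank bottleneck) temporal layer whose projections are split into channel groups, with a ShuffleNet-style channel shuffle in between so information still mixes across groups. The layer sizes, temporal context, group count, and the class/function names (ShuffledTDNNF, channel_shuffle) are illustrative assumptions, not the paper's exact configuration, and the semi-orthogonal constraint that Kaldi-style TDNN-F places on the first factor is omitted for brevity.

```python
# Minimal, illustrative sketch of an sTDNN-F-style block (assumed sizes).
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups (as in ShuffleNet) so that
    information mixes between the grouped projections."""
    n, c, t = x.shape                        # (batch, channels, time)
    x = x.view(n, groups, c // groups, t)    # split channels into groups
    x = x.transpose(1, 2).contiguous()       # swap group and per-group dims
    return x.view(n, c, t)                   # flatten back to (batch, c, t)

class ShuffledTDNNF(nn.Module):
    """Factored TDNN layer (TDNN-F) with grouped projections and a
    channel shuffle -- one plausible reading of the sTDNN-F module.
    NOTE: the semi-orthogonal constraint of TDNN-F is not enforced here."""
    def __init__(self, in_dim=512, bottleneck=128, out_dim=512,
                 context=1, groups=4):
        super().__init__()
        # First factor: temporal convolution into a low-rank bottleneck,
        # split into channel groups to cut parameters and FLOPs.
        self.factor1 = nn.Conv1d(in_dim, bottleneck,
                                 kernel_size=2 * context + 1,
                                 padding=context, groups=groups, bias=False)
        # Second factor: 1x1 expansion back to the output dim, also grouped.
        self.factor2 = nn.Conv1d(bottleneck, out_dim, kernel_size=1,
                                 groups=groups, bias=False)
        self.groups = groups
        self.norm = nn.BatchNorm1d(out_dim)
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (batch, in_dim, time)
        y = self.factor1(x)
        y = channel_shuffle(y, self.groups)  # mix information across groups
        y = self.factor2(y)
        return self.act(self.norm(y))

# Usage: a batch of 8 sequences, 512-dim features, 100 frames.
feats = torch.randn(8, 512, 100)
block = ShuffledTDNNF()
out = block(feats)                           # -> torch.Size([8, 512, 100])
```

The efficiency gain in this reading comes from two independent levers: the low-rank factorization (TDNN-F) shrinks each weight matrix to two thin factors, and grouping further divides each factor's parameter count by the number of groups, with the shuffle preventing the groups from becoming isolated sub-networks.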