Exploiting lower face symmetry in appearance-based automatic speechreading
Abstract
Appearance-based visual speech feature extraction is widely used in the automatic speechreading and audio-visual speech recognition literature. In its most common application, the discrete cosine transform (DCT) is used to compress the image of the speaker's mouth region-of-interest (ROI), and the highest-energy spatial frequency components are retained as visual features. Good generalization performance of the resulting system, however, requires robust ROI extraction and its consistent normalization, designed to compensate for speaker head pose and other data variations. In general, one expects that the ROI, if correctly normalized, will be nearly laterally symmetric, due to the approximate symmetry of human faces. We thus argue that forcing lateral ROI symmetry can benefit automatic speechreading, providing a mechanism to compensate for small face and mouth tracking errors that would otherwise result in incorrect ROI normalization. In this paper, we propose to achieve such ROI symmetry indirectly, by operating in the spatial frequency domain and exploiting the properties of the DCT. In particular, we propose to remove the odd-frequency DCT components from the selected visual feature vector. We experimentally demonstrate that, in general, this approach does not hurt speechreading performance, while it reduces computation, since it results in fewer DCT features. In addition, for the same number of features as in traditional DCT coefficient selection, the method yields significant speechreading improvements. For the connected-digit automatic speechreading experiments considered, and for low feature dimensionalities, these improvements can reach a 12% relative reduction in word error rate.
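
To make the symmetry argument concrete, the following is a minimal sketch in Python (assuming the scipy.fft API; the 64x64 ROI size and all variable names are illustrative, not taken from the paper) of why discarding the odd horizontal-frequency DCT-II coefficients is equivalent to projecting the ROI onto its laterally symmetric part:

    import numpy as np
    from scipy.fft import dctn, idctn

    # Stand-in for a normalized grayscale mouth-ROI image (hypothetical size).
    rng = np.random.default_rng(0)
    roi = rng.random((64, 64))

    # Separable 2D DCT-II of the ROI (orthonormal scaling).
    coeffs = dctn(roi, norm="ortho")

    # Zero every coefficient with an odd horizontal (column) frequency index.
    # These basis functions are antisymmetric about the vertical midline, so
    # for a perfectly laterally symmetric image they are exactly zero.
    coeffs[:, 1::2] = 0.0

    # Inverting the truncated transform returns the laterally symmetric part
    # of the original image: (roi + fliplr(roi)) / 2.
    symmetric_roi = idctn(coeffs, norm="ortho")
    assert np.allclose(symmetric_roi, 0.5 * (roi + np.fliplr(roi)))

Because the odd horizontal-frequency basis functions carry exactly the asymmetric part of the image, dropping them both enforces the expected ROI symmetry and halves the pool of candidate coefficients, which is the source of the computational saving noted above.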