Towards View-Independent Viseme Recognition Based on CNNs and Synthetic Data
Abstract
Visual Speech Recognition is the task of interpreting spoken text using video information only. To address this task automatically, recent works have employed Deep Learning and obtained high accuracy in recognizing words and sentences uttered in controlled environments with limited head-pose variation. However, accuracy drops for multi-view datasets, and when it comes to interpreting isolated mouth shapes, such as visemes, the reported values are considerably lower, since shorter segments of speech lack temporal and contextual information. In this work, we evaluate the applicability of synthetic datasets for assisting viseme recognition in real-world data acquired under controlled and uncontrolled environments, using the GRID and AVICAR datasets, respectively. We create two large-scale synthetic 2D datasets based on realistic 3D facial models, one with near-frontal and one with multi-view mouth images. Our experiments indicate that, in both scenarios, a transfer learning approach using synthetic data achieves higher accuracy than training from scratch on real data alone.