Cross-modality automatic face model training from large video databases
Abstract
Face recognition is an important problem in video indexing and retrieval applications. Typically, supervised learning is used to build face models for specific named individuals, but a traditional supervised framework requires a large amount of manual labeling. In this paper, we propose an automatic cross-modality training scheme that requires no supervision: it uses automatic speech recognition transcripts of videos to build visual face models. Building on Multiple-Instance Learning algorithms, we introduce the novel concepts of "Quasi-Positive bags" and "Extended Diverse Density" and use them to develop the automatic training scheme. We also propose using the "Relative Sparsity" of a cluster to detect the anchorperson in news videos. Experiments show that our algorithm learns correct models for the persons of interest. The automatically learned models are evaluated for face recognition on large news video databases and compared against a supervised learning algorithm, showing promising results.