Multi-Speaker Data Augmentation for Improved End-to-End Automatic Speech Recognition
Abstract
Publicly available datasets traditionally used to train end-to-end (E2E) ASR models for conversational telephone speech recognition are based on clean, short-duration, single-speaker utterances collected on separate channels. While E2E ASR models achieve state-of-the-art performance on recognition tasks that match such training data well, they are observed to fail on test recordings that contain multiple speakers, exhibit significant channel or background noise, or span longer durations than the training utterances. To mitigate these issues, we propose an on-the-fly data augmentation strategy that transforms single-speaker training data into multi-speaker data by appending together multiple single-speaker utterances. The proposed technique encourages the E2E model to become robust to speaker changes and to process longer utterances effectively. During training, the model is also guided by a teacher model trained on single-speaker utterances to map its multi-speaker encoder embeddings to better-performing single-speaker representations. With the proposed technique, we obtain a 7-14% relative improvement on various single-speaker and multi-speaker test sets. We also show that this technique improves recognition performance by up to 14% by capturing useful information from preceding spoken utterances used as dialog history.
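To make the augmentation strategy concrete, the sketch below illustrates one plausible on-the-fly realization of the idea described above: single-speaker utterances in a batch are randomly appended in time and their transcripts joined, yielding longer, multi-speaker training examples. The function name, parameters (`max_utts`, `p_concat`), and data layout are illustrative assumptions, not the authors' implementation.

```python
import random
import numpy as np

def concat_augment(batch, max_utts=3, p_concat=0.5):
    """Minimal sketch of multi-speaker concatenation augmentation.

    `batch` is assumed to be a list of (waveform, transcript) pairs drawn from
    different single-speaker utterances. With probability `p_concat`, two or
    more utterances are appended in time and their transcripts joined, so the
    model is exposed to speaker changes and longer inputs during training.
    """
    augmented = []
    i = 0
    while i < len(batch):
        if random.random() < p_concat and i + 1 < len(batch):
            # Choose how many consecutive utterances to merge (at least 2).
            n = random.randint(2, min(max_utts, len(batch) - i))
            waves, texts = zip(*batch[i:i + n])
            augmented.append((np.concatenate(waves), " ".join(texts)))
            i += n
        else:
            # Keep the utterance unchanged.
            augmented.append(batch[i])
            i += 1
    return augmented

# Example: three single-speaker utterances, some possibly merged into one.
batch = [(np.random.randn(16000), "hello there"),
         (np.random.randn(24000), "how are you"),
         (np.random.randn(8000), "fine thanks")]
print([t for _, t in concat_augment(batch)])
```

Because the merging is done per batch at training time, no additional data needs to be stored on disk, and the single-speaker/multi-speaker mix can be controlled through the concatenation probability.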