Publication
INTERSPEECH 2017
Conference paper

Ensembles of multi-scale VGG acoustic models

Abstract

We present our work on constructing multi-scale deep convolutional neural networks for automatic speech recognition. Several VGG nets have been trained that differ solely in the kernel size of their convolutional layers. The general idea is that receptive fields of varying sizes match structures of different scales, thus supporting more robust recognition when combined appropriately. We construct a large multi-scale system by means of system combination. We use ROVER and the fusion of posterior predictions as examples of late combination, and knowledge distillation using soft labels from a model ensemble as a form of early combination. In this work, distillation is approached from the perspective of knowledge transfer pretraining, followed by fine-tuning on the original hard labels. Our results show that the individual recognition strengths of the VGGs can be bundled into a much simpler CNN architecture that performs on par with the best late combination.
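
As an illustration of the combination ideas summarized above, the sketch below shows posterior fusion (late combination) and distillation-style pretraining on ensemble soft labels followed by fine-tuning on hard labels (early combination) in PyTorch. It is a minimal sketch under assumed shapes, with simple linear stand-ins for the multi-scale VGG teachers and the student CNN; the model sizes, loss setup, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

# Minimal sketch (not the paper's code): fuse ensemble posteriors by averaging,
# pretrain a student on the fused soft labels, then fine-tune on hard labels.
# All architectures, shapes, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_posteriors(logits_list):
    """Late combination: average the per-model posterior distributions."""
    posteriors = [F.softmax(l, dim=-1) for l in logits_list]
    return torch.stack(posteriors, dim=0).mean(dim=0)

def distillation_pretrain_step(student, features, soft_targets, optimizer):
    """Knowledge transfer pretraining: fit the student to the ensemble's soft labels."""
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(features), dim=-1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(student, features, hard_labels, optimizer):
    """Second stage: fine-tune the student on the original hard labels."""
    optimizer.zero_grad()
    loss = F.cross_entropy(student(features), hard_labels)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    num_classes, batch, feat_dim = 100, 8, 440   # hypothetical sizes
    # Stand-ins for the multi-scale VGG ensemble (teachers) and the simpler student CNN.
    teachers = [nn.Linear(feat_dim, num_classes) for _ in range(3)]
    student = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

    x = torch.randn(batch, feat_dim)
    y = torch.randint(0, num_classes, (batch,))

    with torch.no_grad():
        soft = fuse_posteriors([t(x) for t in teachers])
    print("distillation loss:", distillation_pretrain_step(student, x, soft, optimizer))
    print("fine-tuning loss:", finetune_step(student, x, y, optimizer))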
