Publication
ICMI 2024
Conference paper

DoubleDistillation: Enhancing LLMs for Informal Text Analysis using Multistage Knowledge Distillation from Speech and Text


Abstract

Traditional large language models (LLMs) leverage extensive text corpora but lack access to the acoustic and paralinguistic cues present in speech. There is growing interest in enhancing text-based models with audio information. However, current models often require an aligned audio-text dataset, which is frequently much smaller than typical language model training corpora. Moreover, these models often require both text and audio streams during inference/testing. In this study, we introduce a novel two-stage knowledge distillation (KD) approach that enables language models to (a) incorporate rich acoustic and paralinguistic information from speech, (b) utilize text corpora comparable in size to typical language model training data, and (c) support text-only analysis without requiring an audio stream at inference/test time. Specifically, in the first stage we employ a pre-trained speech embedding teacher model (OpenAI Whisper) to train a Teacher Assistant (TA) model on an aligned audio-text dataset. In the second stage, the TA's knowledge is transferred to a student language model trained on a conventional text dataset. Thus, our two-stage KD method leverages both the acoustic and paralinguistic cues in the aligned audio-text data and the nuanced linguistic knowledge in a large text-only dataset. In our evaluation, this DoubleDistillation system consistently outperforms traditional LLMs on 15 informal text understanding tasks.
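To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the general idea: stage 1 trains a text-side Teacher Assistant to match speech-teacher embeddings on aligned audio-text pairs, and stage 2 distills the TA into a text-only student on a larger text corpus. All module architectures, sizes, loss choices, and names (SpeechTeacher stand-in, TAModel, StudentLM, distill_step) are illustrative assumptions, not the paper's actual implementation; random tensors stand in for Whisper embeddings and tokenized data.

```python
# Hypothetical sketch of two-stage knowledge distillation (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 256  # assumed shared embedding size

class TAModel(nn.Module):
    """Teacher Assistant: a text encoder trained to mimic speech-teacher embeddings."""
    def __init__(self, vocab=1000, emb=EMB):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, emb, batch_first=True)
        self.proj = nn.Linear(emb, emb)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, h = self.encoder(x)
        return self.proj(h[-1])  # one embedding per utterance

class StudentLM(nn.Module):
    """Student: a text-only model distilled from the TA on a larger text corpus."""
    def __init__(self, vocab=1000, emb=EMB):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, emb, batch_first=True)
        self.proj = nn.Linear(emb, emb)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, h = self.encoder(x)
        return self.proj(h[-1])

def distill_step(student_emb, teacher_emb):
    # Simple embedding-matching loss; the actual distillation objective may differ.
    return F.mse_loss(student_emb, teacher_emb.detach())

# --- Stage 1: align TA text embeddings with frozen speech-teacher embeddings ---
ta = TAModel()
opt_ta = torch.optim.Adam(ta.parameters(), lr=1e-3)
transcripts = torch.randint(0, 1000, (8, 20))   # stand-in for aligned transcripts
speech_emb = torch.randn(8, EMB)                # stand-in for Whisper-style embeddings
loss1 = distill_step(ta(transcripts), speech_emb)
opt_ta.zero_grad(); loss1.backward(); opt_ta.step()

# --- Stage 2: distill the TA into a student on a (larger) text-only corpus ---
student = StudentLM()
opt_st = torch.optim.Adam(student.parameters(), lr=1e-3)
text_only = torch.randint(0, 1000, (8, 20))     # stand-in for text-only corpus
with torch.no_grad():
    ta_emb = ta(text_only)
loss2 = distill_step(student(text_only), ta_emb)
opt_st.zero_grad(); loss2.backward(); opt_st.step()

# At inference time, only `student` is needed; no audio stream is required.
```

The key property this sketch illustrates is that the audio-dependent components (the speech teacher and the aligned data) are used only during training, so the deployed student remains a purely text-based model.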