SALSA: Speedy ASR-LLM Synchronous Aggregation

Ashish Mittal; Darshan Prabhu; Sunita Sarawagi; Preethi Jyothi

INTERSPEECH 2024

Conference paper

01 Sep 2024


Abstract

Automatic speech recognition (ASR) systems still lag in performance on low-resource languages. The rise of multilingual large language models (LLMs) offers the potential for effective integration with ASR systems to improve their performance on these languages. One major challenge in achieving this goal is that the LLM and the ASR system use different tokenizations. In this work, we propose SALSA, a synchronous, lightweight solution for merging pretrained ASR and LLM systems with differing token vocabularies. The LLM’s predictions are tokenized using the ASR system’s tokenizer to unroll the ASR decoder; the last ASR decoder state is then mapped using a learnable projection and added as a residual connection to the LLM’s representations. SALSA is parameter-efficient, using learned projection layers only for a select set of layers in the ASR and LLM decoders. We evaluate SALSA on more than 10 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 36%.
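
The coupling step described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: the function names (`unroll_asr_decoder`, `project`, `fuse`), the toy recurrence standing in for the ASR decoder, and the hand-picked weight matrix are all assumptions made for illustration. In the actual system the states would come from pretrained ASR and LLM decoders and the projection would be learned.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def unroll_asr_decoder(
    asr_token_ids: List[int],
    step_fn: Callable[[Vector, int], Vector],
    initial_state: Vector,
) -> Vector:
    """Feed the ASR-tokenized LLM prediction through a (hypothetical)
    decoder step function and return the last decoder state."""
    state = initial_state
    for token_id in asr_token_ids:
        state = step_fn(state, token_id)
    return state


def project(asr_state: Vector, weight: Matrix) -> Vector:
    """Linear map from the ASR state size to the LLM hidden size.
    `weight` has one row per LLM hidden dimension (assumed shape)."""
    return [sum(w * s for w, s in zip(row, asr_state)) for row in weight]


def fuse(llm_hidden: Vector, asr_state: Vector, weight: Matrix) -> Vector:
    """Add the projected ASR state to the LLM hidden state as a residual."""
    projected = project(asr_state, weight)
    return [h + p for h, p in zip(llm_hidden, projected)]


def toy_step(state: Vector, token_id: int) -> Vector:
    """Stand-in for a real ASR decoder step (assumption): shift every
    state component by the token id."""
    return [s + float(token_id) for s in state]


# Hypothetical ASR token ids obtained by re-tokenizing the LLM's prediction.
asr_ids = [1, 2]
last_asr_state = unroll_asr_decoder(asr_ids, toy_step, [0.0, 0.0])

# Toy 3x2 projection; in practice this would be a learned parameter.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse([1.0, 2.0, 3.0], last_asr_state, W)
```

Because only the projection carries new parameters in this sketch, the number of added weights scales with the product of the two hidden sizes per coupled layer, which is consistent with the parameter-efficiency claim above.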
