
Exploring the Limits of Conformer CTC-Encoder for Speech Emotion Recognition using Large Language Models

Abstract

Conformer CTC-Encoders have consistently delivered state-of-the-art results in Automatic Speech Recognition (ASR); however, their merits for tasks that demand richer semantic and paralinguistic information, such as Automatic Speech Understanding (ASU), Speech Emotion Recognition (SER) and Speech Translation (ST), remain underexplored. In this paper, we introduce a Speech Large Language Model (SLLM) system that couples a Conformer CTC-Encoder with the Granite Large Language Model, and we use it to perform a series of experiments on ASR, SER and ST tasks. These experiments not only confirm the strength of Conformer CTC-Encoders for ASR, but also show that the outputs of the intermediate Conformer blocks of the CTC-Encoder carry information that is valuable for SER, and that the encoder can be efficiently fine-tuned for this task.
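To make the central idea concrete, the following is a minimal, self-contained sketch (not the paper's implementation; the module names, dimensions, pooling strategy, and choice of tapped block are all illustrative assumptions) of how the output of an intermediate Conformer block in a CTC-trained encoder can be pooled and fed to an emotion classifier while the top block continues to serve a CTC head for ASR.

```python
# Illustrative sketch only: tap an intermediate Conformer block for SER
# while the final block feeds a CTC (ASR) projection.
import torch
import torch.nn as nn


class TinyConformerBlock(nn.Module):
    """Simplified stand-in for a Conformer block (convolution module omitted for brevity)."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))


class EncoderWithEmotionHead(nn.Module):
    """Encoder stack whose intermediate output feeds an SER classifier (hypothetical setup)."""

    def __init__(self, dim=144, num_blocks=8, tap_block=4, num_emotions=4, vocab=64):
        super().__init__()
        self.blocks = nn.ModuleList(TinyConformerBlock(dim) for _ in range(num_blocks))
        self.tap_block = tap_block                        # which intermediate block to tap (assumed)
        self.ctc_head = nn.Linear(dim, vocab)             # per-frame CTC projection on the top block
        self.emotion_head = nn.Linear(dim, num_emotions)  # SER head on the tapped block

    def forward(self, feats):                             # feats: (batch, time, dim)
        tapped = None
        x = feats
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if i == self.tap_block:
                tapped = x                                # keep the intermediate representation
        ctc_logits = self.ctc_head(x)                     # ASR branch: frame-level logits
        emotion_logits = self.emotion_head(tapped.mean(dim=1))  # SER branch: mean-pool over time
        return ctc_logits, emotion_logits


model = EncoderWithEmotionHead()
ctc_logits, emotion_logits = model(torch.randn(2, 100, 144))
print(ctc_logits.shape, emotion_logits.shape)  # (2, 100, 64) and (2, 4)
```

Under this assumed setup, fine-tuning for SER can be restricted to the blocks up to the tapped layer plus the emotion head, which is one way to read the abstract's claim that the encoder can be adapted efficiently for SER.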