We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question-answering dataset with the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU and MMAR benchmarks. On MMAU, Omni-R1 achieves the highest accuracies in the sounds, music, speech, and overall-average categories on both the Test-mini and Test-full splits. To understand the improvement, we evaluated models both with and without audio and found that much of the gain from GRPO is attributable to better text-based reasoning. Surprisingly, we also found that fine-tuning on a text-only dataset, with no audio at all, was effective at improving audio-based performance.
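The abstract names the training recipe only at a high level; the sketch below illustrates GRPO fine-tuning using Hugging Face TRL's GRPOTrainer. It is a minimal, hypothetical example, not the paper's setup: the checkpoint (a small text-only Qwen model standing in for Qwen2.5-Omni), the toy QA data, the exact-match reward, and the output path are all illustrative assumptions.

    # Minimal GRPO fine-tuning sketch with TRL (assumptions noted above).
    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer

    # Toy question-answering data. GRPOTrainer expects a "prompt" column and
    # forwards extra columns (here "answer") to the reward function.
    train_dataset = Dataset.from_list([
        {"prompt": "Q: What instrument has 88 keys? A:", "answer": "piano"},
        {"prompt": "Q: How many strings does a violin have? A:", "answer": "four"},
    ])

    def exact_match_reward(completions, answer, **kwargs):
        # Reward 1.0 if the reference answer appears in the sampled completion.
        return [1.0 if a.lower() in c.lower() else 0.0
                for c, a in zip(completions, answer)]

    args = GRPOConfig(
        output_dir="omni-r1-sketch",   # hypothetical output path
        num_generations=4,             # completions sampled per prompt (the GRPO "group")
        max_completion_length=64,
    )

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # text-only stand-in for Qwen2.5-Omni
        reward_funcs=exact_match_reward,
        args=args,
        train_dataset=train_dataset,
    )
    trainer.train()

A text-only pass of this kind echoes the paper's finding that GRPO fine-tuning without audio can still improve audio-based benchmark performance.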
Kaizhi Qian, Yang Zhang, et al.
ICML 2020
Vishal Sunder, Samuel Thomas, et al.
ICASSP 2022
Xiaodong Cui, George Saon, et al.
INTERSPEECH 2023
Prateeti Mohapatra, Gargi Dasgupta
IAAI 2024