IBM Granite model tops Hugging Face speech recognition leaderboard
We’re in the middle of an AI revolution, with powerful new tools springing up every day. And the better they get, the more we want to interact with them in ways that feel natural to us. Models that can understand what we say and carry out instructions based on what we ask make it simpler and quicker for people to get things done, both in their daily lives and at work. And the ones that make the fewest errors in understanding us will be the most useful for voice chatbots, call centers, audio summarization, and the myriad other tasks businesses carry out each day.
IBM recently launched the Granite Speech 3.3 8B model, a speech model built by modality-aligning and LoRA fine-tuning the Granite 3.3 8B Instruct model. It was designed to excel at enterprise tasks based on automatic speech recognition (ASR). Its strengths are turning English speech into text, as well as translating from English into French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin. The model was open-sourced and made available for anyone to download on Hugging Face.
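The model card on Hugging Face has the authoritative usage example; the sketch below is a minimal, hedged version of that flow using the transformers Auto classes. The exact processor call, prompt format, and the `example.wav` file name are assumptions here, not confirmed API details.

```python
# Hedged sketch: loading the open-sourced checkpoint from Hugging Face and
# transcribing one file. Consult the model card at
# huggingface.co/ibm-granite/granite-speech-3.3-8b for the authoritative example;
# the processor/prompt details below are assumptions.
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-speech-3.3-8b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Assumes a mono 16 kHz waveform as input to the acoustic encoder.
wav, sr = torchaudio.load("example.wav")  # hypothetical file
wav = torchaudio.functional.resample(wav, sr, 16_000)

# An ASR instruction wrapped around an audio placeholder token (assumed format).
chat = [{"role": "user", "content": "<|audio|>can you transcribe the speech into a written format?"}]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, wav, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=200)
# The decoded text may echo the prompt; slice it off if needed.
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```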
And now, at the time of publishing, IBM Granite Speech 3.3 8B has topped Hugging Face’s Open ASR leaderboard, outperforming all other open speech models on the site. The leaderboard ranks speech recognition models on two metrics: the average word error rate (WER, where fewer errors is better) and how many seconds of audio a model can process per second of compute, referred to as RTFx (where higher is faster).
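For intuition, here is a minimal sketch of the two metrics (illustrative only, not the leaderboard’s own evaluation code): WER counts word-level substitutions, deletions, and insertions against a reference transcript, and RTFx divides audio duration by processing time.

```python
# Illustrative sketch of the two leaderboard metrics.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])        # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)    # deletion, insertion
    return dp[len(ref)][len(hyp)] / len(ref)

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    # Higher is faster: 60 s of audio transcribed in 2 s of compute gives an RTFx of 30.
    return audio_seconds / processing_seconds

print(word_error_rate("the quick brown fox", "the quack brown fox jumps"))  # 0.5, i.e. 50% WER
print(rtfx(60.0, 2.0))  # 30.0
```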
When tested on several different bodies of audio data, the Granite model had the lowest average word error rate (that is, the highest accuracy), beating out several proprietary models as well. It also topped the chart with a relatively small model, beating out entries from other major competitors, such as OpenAI's Whisper and several models from Facebook.
| Model | Average WER (%) | RTFx | License |
|---|---|---|---|
| ibm-granite/granite-speech-3.3-8b | 5.85 | 31.33 | Open |
| nvidia/parakeet-tdt-0.6b-v2 | 6.05 | 3386.02 | Open |
| microsoft/Phi-4-multimodal-instruct | 6.14 | 62.12 | Open |
| nvidia/canary-1b-flash | 6.35 | 1045.75 | Open |
| nvidia/canary-1b | 6.50 | 235.34 | Open |
| nyrahealth/CrisperWhisper | 6.67 | 84.05 | Open |
| ibm-granite/granite-speech-3.3-2b | 6.86 | 52.47 | Open |
| elevenlabs/scribe_v1 | 6.88 | N/A | Proprietary |
| speechmatics/enhanced | 6.91 | N/A | Proprietary |
| nvidia/parakeet-tdt-1.1b | 7.01 | 2390.61 | Open |
| assemblyai/assembly_best | 7.03 | N/A | Proprietary |
| revai/fusion | 7.12 | N/A | Proprietary |
Granite Speech 3.3 8B was trained on a wide variety of public datasets, with the goal of capturing as many different English dialects and ways of speaking as possible. The way someone communicates in a voicemail, for example, is very different from an audiobook or an earnings report. The team also added noise and randomly cut out parts of the audio signal during training to improve how well the model performs in real-life situations, where conversations are not always crystal clear or completely linear.
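The exact augmentation recipe is described in the team’s paper; the sketch below only illustrates the general idea of additive noise and random time masking on a waveform, with made-up parameter values rather than the settings used to train Granite Speech.

```python
# Hedged sketch of the two augmentations described above: additive noise and
# randomly zeroing out (masking) short spans of the waveform.
import torch

def augment(wav: torch.Tensor, snr_db: float = 15.0, n_masks: int = 2, max_mask: int = 1600) -> torch.Tensor:
    # Additive white noise scaled to a target signal-to-noise ratio.
    noise = torch.randn_like(wav)
    signal_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    out = wav + scale * noise

    # Randomly zero out a few short spans (here up to 0.1 s at 16 kHz each),
    # mimicking dropouts and the imperfect audio of real conversations.
    for _ in range(n_masks):
        length = int(torch.randint(1, max_mask, (1,)))
        start = int(torch.randint(0, max(1, out.shape[-1] - length), (1,)))
        out[..., start:start + length] = 0.0
    return out
```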
The team outlined how they achieved such strong results in a paper they recently published on arXiv. They built on top of a long history of speech recognition work at IBM, including the speech encoder used in Watson speech-to-text services. Being able to train each individual part of the model, including the acoustic encoder, the speech modality adapter, and the LoRA adapters for the LLM, also helped improve quality. The team used a convolution-augmented transformer (also called a conformer) for the encoder and a window query transformer for the modality adapter, both of which are state-of-the-art technologies.
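The paper has the full architecture; as a rough, hedged illustration of the window query transformer idea, the sketch below uses a small set of learned queries that cross-attend to each fixed-size window of acoustic-encoder frames, compressing many audio frames into a few embeddings the LLM can consume. The dimensions, window size, and query count are illustrative, not IBM’s actual settings.

```python
# Hedged sketch of a window-query modality adapter (not IBM's implementation).
import torch
import torch.nn as nn

class WindowQueryAdapter(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, window=15, n_queries=3, n_heads=8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)  # project into the LLM's embedding space

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, enc_dim); pad time so it splits into whole windows.
        b, t, d = frames.shape
        pad = (-t) % self.window
        frames = nn.functional.pad(frames, (0, 0, 0, pad))
        windows = frames.reshape(b * (t + pad) // self.window, self.window, d)
        # Each window of acoustic frames is summarized by n_queries vectors.
        q = self.queries.unsqueeze(0).expand(windows.shape[0], -1, -1)
        pooled, _ = self.attn(q, windows, windows)   # (batch * n_windows, n_queries, enc_dim)
        pooled = pooled.reshape(b, -1, d)            # (batch, compressed_time, enc_dim)
        return self.proj(pooled)
```

With a window of 15 frames and 3 queries per window, the adapter hands the LLM five times fewer embeddings than the encoder produces, which is the general point of this kind of compression layer.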
The model's success can be attributed to a balanced sampling of the training data, according to George Saon, a distinguished research scientist at IBM who worked on the model. This sampling ensures performance remains high across multiple different audio types. Saon also noted that improvements in the acoustic encoder, such as conditioning on intermediate predictions and block self-attention in the conformer layers, were key drivers in helping create such a performant model.
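The exact sampling scheme isn’t spelled out in this post; the toy sketch below shows one common way to flatten a skewed mix of audio domains so that no single corpus dominates training batches. The domain names, sizes, and mixing exponent are entirely made up for illustration.

```python
# Hedged illustration of balanced sampling across audio domains: instead of drawing
# examples in proportion to corpus size, domains are sampled with flattened weights.
import random

corpora = {"voicemail": 5_000, "audiobooks": 400_000, "earnings_calls": 50_000, "meetings": 20_000}

def domain_weights(sizes: dict, alpha: float = 0.5) -> dict:
    # alpha=1.0 reproduces size-proportional sampling; alpha=0.0 samples domains uniformly.
    scaled = {k: v ** alpha for k, v in sizes.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

weights = domain_weights(corpora)
batch_domains = random.choices(list(weights), weights=list(weights.values()), k=8)
print(weights)
print(batch_domains)
```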
Speech recognition is a complicated problem to tackle. Even the strongest models today can’t perform at the level of understanding of the average human. The noise in real-life conversations, the variety of accents and dialects people have, and conversations that overlap with others nearby are all generally trivial for humans to deal with, but very difficult for AI to parse. But the team at IBM Research believes that in the next five to 10 years, we will have speech recognition systems that are on par with our own abilities. This leaderboard result is just the latest step on that path.