IBM Granite model tops Hugging Face speech recognition leaderboard
We’re in the middle of an AI revolution, with powerful new tools springing up every day. And the better they get, the more we want to interact with them in ways that feel natural to us. Models that can understand what we say and carry out instructions based on what we ask make it simpler and quicker for people to get things done, both in their daily lives and at work. And the ones that make the fewest errors in understanding us will be the most useful for voice chatbots, call centers, audio summarization, and the myriad other tasks businesses carry out each day.
IBM recently launched the Granite Speech 3.3 8B model, a speech model built by modality-aligning and LoRA fine-tuning the Granite 3.3 8B Instruct model. It was designed to excel at enterprise tasks based on automatic speech recognition (ASR). Its strengths are turning English speech into text, as well as translating from English into French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin. The model was open-sourced and made available for anyone to download on Hugging Face.
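The model card on Hugging Face has the authoritative usage example; the sketch below is a minimal, hedged version of that flow using the transformers Auto classes. The exact processor call, prompt format, and the `example.wav` file name are assumptions here, not confirmed API details.

```python
# Hedged sketch: loading the open-sourced checkpoint from Hugging Face and
# transcribing one file. Consult the model card at
# huggingface.co/ibm-granite/granite-speech-3.3-8b for the authoritative example;
# the processor/prompt details below are assumptions.
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-speech-3.3-8b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Assumes a mono 16 kHz waveform as input to the acoustic encoder.
wav, sr = torchaudio.load("example.wav")  # hypothetical file
wav = torchaudio.functional.resample(wav, sr, 16_000)

# An ASR instruction wrapped around an audio placeholder token (assumed format).
chat = [{"role": "user", "content": "<|audio|>can you transcribe the speech into a written format?"}]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, wav, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=200)
# The decoded text may echo the prompt; slice it off if needed.
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```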
And now, at the time of publishing, IBM Granite Speech 3.3 8B has topped Hugging Face’s Open ASR leaderboard, outperforming all other open speech models on the site. The leaderboard ranks speech recognition models on two metrics: the average word error rate (WER, where fewer errors is better) and how many seconds of audio a model can process per second of compute, referred to as RTFx (where higher is faster).
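For intuition, here is a minimal sketch of the two metrics (illustrative only, not the leaderboard’s own evaluation code): WER counts word-level substitutions, deletions, and insertions against a reference transcript, and RTFx divides audio duration by processing time.

```python
# Illustrative sketch of the two leaderboard metrics.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])        # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)    # deletion, insertion
    return dp[len(ref)][len(hyp)] / len(ref)

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    # Higher is faster: 60 s of audio transcribed in 2 s of compute gives an RTFx of 30.
    return audio_seconds / processing_seconds

print(word_error_rate("the quick brown fox", "the quack brown fox jumps"))  # 0.5, i.e. 50% WER
print(rtfx(60.0, 2.0))  # 30.0
```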
When tested on several different bodies of audio data, the Granite model had the lowest average word error rate (that is, the highest accuracy), beating out several proprietary models as well. It also topped the chart with a relatively small model, beating out entries from other major competitors, such as OpenAI's Whisper and several models from Facebook.
| Model | Average WER (%) | RTFx | License |
|---|---|---|---|
| ibm-granite/granite-speech-3.3-8b | 5.85 | 31.33 | Open |
| nvidia/parakeet-tdt-0.6b-v2 | 6.05 | 3386.02 | Open |
| microsoft/Phi-4-multimodal-instruct | 6.14 | 62.12 | Open |
| nvidia/canary-1b-flash | 6.35 | 1045.75 | Open |
| nvidia/canary-1b | 6.50 | 235.34 | Open |
| nyrahealth/CrisperWhisper | 6.67 | 84.05 | Open |
| ibm-granite/granite-speech-3.3-2b | 6.86 | 52.47 | Open |
| elevenlabs/scribe_v1 | 6.88 | N/A | Proprietary |
| speechmatics/enhanced | 6.91 | N/A | Proprietary |
| nvidia/parakeet-tdt-1.1b | 7.01 | 2390.61 | Open |
| assemblyai/assembly_best | 7.03 | N/A | Proprietary |
| revai/fusion | 7.12 | N/A | Proprietary |
Granite Speech 3.3 8B was trained on a wide variety of public datasets, with the goal of capturing as many different English dialects and ways of speaking as possible. The way someone communicates in a voicemail, for example, is very different from an audiobook or an earnings report. The team also added noise and randomly cut out parts of the audio signal during training to improve how well the model performs in real-life situations, where conversations are not always crystal clear or completely linear.
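The exact augmentation recipe is described in the team’s paper; the sketch below only illustrates the general idea of additive noise and random time masking on a waveform, with made-up parameter values rather than the settings used to train Granite Speech.

```python
# Hedged sketch of the two augmentations described above: additive noise and
# randomly zeroing out (masking) short spans of the waveform.
import torch

def augment(wav: torch.Tensor, snr_db: float = 15.0, n_masks: int = 2, max_mask: int = 1600) -> torch.Tensor:
    # Additive white noise scaled to a target signal-to-noise ratio.
    noise = torch.randn_like(wav)
    signal_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    out = wav + scale * noise

    # Randomly zero out a few short spans (here up to 0.1 s at 16 kHz each),
    # mimicking dropouts and the imperfect audio of real conversations.
    for _ in range(n_masks):
        length = int(torch.randint(1, max_mask, (1,)))
        start = int(torch.randint(0, max(1, out.shape[-1] - length), (1,)))
        out[..., start:start + length] = 0.0
    return out
```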
The team outlined how they achieved such strong results in a paper they recently published on arXiv. They built on top of a long history of speech recognition work at IBM, including the speech encoder used in Watson speech-to-text services. Being able to train each individual part of the model, including the acoustic encoder, the speech modality adapter, and the LoRA adapters for the LLM, also helped improve quality. The team used a convolution-augmented transformer (also called a conformer) for the encoder and a window query transformer for the modality adapter, both of which are state-of-the-art technologies.
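The paper has the full architecture; as a rough, hedged illustration of the window query transformer idea, the sketch below uses a small set of learned queries that cross-attend to each fixed-size window of acoustic-encoder frames, compressing many audio frames into a few embeddings the LLM can consume. The dimensions, window size, and query count are illustrative, not IBM’s actual settings.

```python
# Hedged sketch of a window-query modality adapter (not IBM's implementation).
import torch
import torch.nn as nn

class WindowQueryAdapter(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, window=15, n_queries=3, n_heads=8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)  # project into the LLM's embedding space

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, enc_dim); pad time so it splits into whole windows.
        b, t, d = frames.shape
        pad = (-t) % self.window
        frames = nn.functional.pad(frames, (0, 0, 0, pad))
        windows = frames.reshape(b * (t + pad) // self.window, self.window, d)
        # Each window of acoustic frames is summarized by n_queries vectors.
        q = self.queries.unsqueeze(0).expand(windows.shape[0], -1, -1)
        pooled, _ = self.attn(q, windows, windows)   # (batch * n_windows, n_queries, enc_dim)
        pooled = pooled.reshape(b, -1, d)            # (batch, compressed_time, enc_dim)
        return self.proj(pooled)
```

With a window of 15 frames and 3 queries per window, the adapter hands the LLM five times fewer embeddings than the encoder produces, which is the general point of this kind of compression layer.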
The model's success can be attributed to a balanced sampling of the training data, according to George Saon, a distinguished research scientist at IBM who worked on the model. This sampling ensures performance remains high across multiple different audio types. Saon also noted that improvements in the acoustic encoder, such as conditioning on intermediate predictions and block self-attention in the conformer layers, were key drivers in helping create such a performant model.
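The exact sampling scheme isn’t spelled out in this post; the toy sketch below shows one common way to flatten a skewed mix of audio domains so that no single corpus dominates training batches. The domain names, sizes, and mixing exponent are entirely made up for illustration.

```python
# Hedged illustration of balanced sampling across audio domains: instead of drawing
# examples in proportion to corpus size, domains are sampled with flattened weights.
import random

corpora = {"voicemail": 5_000, "audiobooks": 400_000, "earnings_calls": 50_000, "meetings": 20_000}

def domain_weights(sizes: dict, alpha: float = 0.5) -> dict:
    # alpha=1.0 reproduces size-proportional sampling; alpha=0.0 samples domains uniformly.
    scaled = {k: v ** alpha for k, v in sizes.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

weights = domain_weights(corpora)
batch_domains = random.choices(list(weights), weights=list(weights.values()), k=8)
print(weights)
print(batch_domains)
```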
Speech recognition is a complicated problem to tackle. Even the strongest models today can’t perform at the level of understanding of the average human. The noise in real-life conversations, the variety of accents and dialects people have, and conversations that overlap with others nearby are all generally trivial for humans to deal with, but very difficult for AI to parse. But the team at IBM Research believes that in the next five to 10 years, we will have speech recognition systems that are on par with our own abilities. This leaderboard result is just the latest step on that path.