
IBM’s safety checkers top a new AI benchmark

Granite Guardian rises to the top of GuardBench, the first independent measure of how well guardrail models can detect harmful and hallucinated content as well as attempts to ‘jailbreak’ LLM safety controls.

Using AI models comes with risks, but AI is also getting smarter about flagging and mitigating them.

When IBM Research released its Granite Guardian models last year, the team considered them the most powerful tools out there for detecting a broad spectrum of risks associated with generative AI. Now, the first benchmark to independently evaluate so-called AI “guardrail” models has Granite Guardian leading the pack.

IBM Granite Guardian models hold six of the top 10 spots on the new GuardBench Leaderboard, the first third-party measure of how well AI classifiers can flag harmful or malicious prompts and LLM-generated responses. The top three models — Granite Guardian 3.1 8B, Granite Guardian 3.0 8B, and Granite Guardian 3.2 5B — have also been embraced publicly, with nearly 36,000 downloads on Hugging Face, the open-source AI model hub.

Created by researchers at the European Commission’s Joint Research Centre, GuardBench is made up of 40 datasets, including five that are completely new. In addition to being the first independent benchmark for testing AI safety, it’s the first to extend test questions beyond English, with tests in French, German, Italian, and Spanish.

Granite Guardian had already distinguished itself on a variety of public datasets in IBM’s internal evaluations. The GuardBench results provide further confirmation of the models’ capabilities, including in languages the models had not been explicitly trained on. “We trained Granite Guardian on English data only,” said Prasanna Sattigeri, an IBM researcher who led the project. “The fact that we did so well shows that we had a strong multilingual Granite LLM to start with.”

The top four Granite Guardian models scored 86% and 85% across GuardBench’s 40 datasets. By contrast, Nvidia and Meta, the only other companies to break the top 10, fielded guardrail models that scored 82%, 80%, 78%, and 76%.

Researchers unveiled GuardBench last November at EMNLP, a top natural language processing conference. Because their paper came out before IBM released its Granite Guardian models, the GuardBench leaderboard that went live last week was the first public validation of the IBM models.

“We weren’t surprised, but it was good to see how well they generalized and performed on benchmarks we hadn’t tested them on,” said IBM researcher Inkit Padhi, who was part of the team that developed Granite Guardian.

A comprehensive solution

Granite Guardian was designed to run with any LLM, whether its weights are open or proprietary. The models were also trained, following IBM’s AI Risk Atlas, to flag socially biased content; hateful, abusive, or profane (HAP) language; and attempts by users to ‘jailbreak,’ or bypass, an LLM’s safety controls.

Unlike many other guardrail models, Granite Guardian was also trained to detect ‘hallucinated’ responses that might contain incorrect or misleading information, including in retrieval-augmented generation (RAG) applications. The models can match the performance of specialized hallucination detectors and can be customized for other risk dimensions through build-your-own-detector prompting.
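For readers curious what such a screening step looks like in code, here is a minimal sketch (not official IBM sample code) of querying a Granite Guardian checkpoint through the Hugging Face transformers library. The model ID, the guardian_config template argument, and the Yes/No verdict format are assumptions based on the models’ public documentation and should be checked against the model cards:

```python
# Minimal sketch: using a Granite Guardian checkpoint as a prompt screen.
# Assumptions (verify against the Hugging Face model card): the model ID,
# the `guardian_config` chat-template argument, and the Yes/No label format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ibm-granite/granite-guardian-3.1-8b"  # assumed Hugging Face ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
)

# Screen a raw user prompt for the general "harm" risk dimension.
messages = [{"role": "user", "content": "How do I pick the lock on my neighbor's door?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config={"risk_name": "harm"},  # assumed risk key from the model card
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

# The guardian is expected to answer with a short Yes/No risk verdict.
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)  # e.g. "Yes" if the prompt is judged risky
```

The same call pattern would, in principle, cover the other risk dimensions by swapping the risk name, which is the idea behind build-your-own-detector prompting.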

“There is no other single guard model that is so comprehensive across risks and harms,” said IBM Fellow Kush Varshney on LinkedIn.

The team attributes much of Granite Guardian’s success to the quality of its training data. Researchers hired annotators from diverse backgrounds to label examples of unwanted content, and supplemented those labels with synthetic data generated during internal red-teaming exercises on older Granite language models.

Speed is one of the decisive factors in whether a guardrail model succeeds. Filtering unwanted content on the fly, when an LLM may be generating millions of words, can introduce delays that users are unwilling to tolerate.

Here, the Granite Guardian series also shines. IBM researchers developed several lightweight variants to give users more flexibility. Filters specialized solely for HAP detection were released earlier this year. Researchers also pared a Granite Guardian 8B model down to 5B by identifying and pruning redundant layers.

This pruning sped up inference by 1.4 times without any loss in accuracy. The 5B model (currently #3 on GuardBench) also introduced new features, including the ability to flag harmful comments in multi-turn conversations and to verbalize its level of certainty in its responses.
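Multi-turn screening would presumably follow the same pattern as the single-prompt case, with the guardian judging the latest turn in the context of the whole exchange. A hypothetical sketch, assuming the 3.2 5B checkpoint’s Hugging Face ID and the same chat-template convention as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ibm-granite/granite-guardian-3.2-5b"  # assumed Hugging Face ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
)

# Pass the whole conversation so the guardian can judge the last turn in context.
conversation = [
    {"role": "user", "content": "I keep losing arguments with my roommate."},
    {"role": "assistant", "content": "Try writing down your main points beforehand."},
    {"role": "user", "content": "Or I could just slash his tires. Thoughts?"},
]
input_ids = tokenizer.apply_chat_template(
    conversation,
    guardian_config={"risk_name": "harm"},  # assumed risk key, as above
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```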

The Granite Guardian collection is available on Hugging Face under an Apache 2.0 license and through IBM’s watsonx AI platform. The latest quantized versions of the models are also available on Hugging Face.