A toxic language filter built for speed

IBM is open sourcing an efficient AI model for detecting hateful, abusive, and profane text in LLM training data and LLM-generated text.

Large language models inevitably pick up foul language during pre-training, in the process of ingesting text taken from the internet. But profanity, as well as hateful and abusive speech, can also creep in later, during fine-tuning or at inference time, when LLMs interact with people in the real world.

Ideally, each stage of the generative AI pipeline would include checks for hate speech, abusive language, and profanity, or HAP, as researchers refer to it. That would be feasible if only HAP filters were quicker than they are today.

IBM’s new filter granite-guardian-hap-38m is built for speed: small enough to run on a CPU, and quick enough to filter data at each phase of the LLM lifecycle, from pre-processing to inferencing. IBM just open sourced the 38-million parameter encoder model on Hugging Face.
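For readers who want to try the filter, here is a minimal sketch of how such a model might be used to screen text on a CPU. It is an illustration, not IBM's pipeline. It assumes the Hugging Face model ID is `ibm-granite/granite-guardian-hap-38m`, that the model loads as a standard sequence classifier, and that class index 1 means "toxic". The helper names `load_hap_scorer` and `filter_hap`, the 0.5 threshold, and the batch size are hypothetical choices made for this example.

```python
from typing import Callable, Iterable, List, Sequence

# A scorer maps a batch of texts to one toxicity probability per text.
Scorer = Callable[[Sequence[str]], List[float]]


def load_hap_scorer(model_id: str = "ibm-granite/granite-guardian-hap-38m") -> Scorer:
    """Build a scorer backed by a Hugging Face sequence classifier (CPU by default)."""
    # Imported lazily so filter_hap below can be used without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    def score(texts: Sequence[str]) -> List[float]:
        inputs = tokenizer(list(texts), padding=True, truncation=True,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Assumption: class index 1 is the "toxic" label.
        return torch.softmax(logits, dim=1)[:, 1].tolist()

    return score


def filter_hap(texts: Iterable[str], scorer: Scorer,
               threshold: float = 0.5, batch_size: int = 32) -> List[str]:
    """Keep only the texts whose toxicity probability is below the threshold."""
    items = list(texts)
    kept: List[str] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for text, prob in zip(batch, scorer(batch)):
            if prob < threshold:
                kept.append(text)
    return kept
```

Calling `filter_hap(corpus, load_hap_scorer())` would then return the passages that score below the threshold. Because the scorer is passed in as a function, the same filtering step could be reused at pre-processing, fine-tuning, or inference time with whichever model fits the latency budget.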

“Extensive checks are needed to make sure no toxic language slips through,” said Yang Zhao, an IBM research scientist focused on AI. “It’s only really practical to do extensive checking if your HAP filter is lightweight and fast.”

In addition to granite-guardian-hap-38m, IBM is open sourcing its close cousin, granite-guardian-hap-125m. In internal testing, the larger, 125-million parameter model outperformed HAP detectors of comparable size, including Meta’s Dynabench HAP filter, on a suite of toxic-language benchmarks.

To improve the model further, IBM researchers set out to shrink its size. Smaller LLMs perform fewer computations, reducing the cost and carbon emissions associated with training and deploying them.

With the help of something called neural architecture search, researchers transferred the big model’s knowledge to a compact format with eight fewer layers of artificial neurons. Granite-guardian-hap-38m was born.

In internal testing, the 38-million parameter model ran eight times faster than its larger cousin on a CPU, and about twice as fast on a GPU, researchers found.

IBM's granite-guardian-hap-38m ran eight times faster than granite-guardian-hap-125m on a CPU and about twice as fast on a GPU.

“We wanted a model with 95% of the big model’s accuracy but fast enough to be useful when low latency and high throughput is needed,” said Bishwaranjan Bhattacharjee, a senior technical leader who led the development of both models.

Similar versions of both HAP detectors have been available on IBM’s watsonx AI platform for more than a year — and in 11 languages, including English. IBM used a version of granite-guardian-hap-38m to vet all the data that went into its Granite language and code models.

By making both HAP detectors available on Hugging Face, IBM is continuing its tradition of promoting trustworthy AI.

“People tend to treat social harm and environmental harm as separate issues with LLMs, but it’s possible to minimize both,” said Kush Varshney, an IBM Fellow who has led IBM’s efforts to adapt its Trust 360 toolkits for generative AI.

What’s next

Both HAP filters currently flag problematic sentences. Researchers are trying to develop more granular tools to isolate offensive words and phrases within a sentence. At the 2023 EMNLP conference, they demoed a multi-lingual HAP visualization tool that could eventually be used to hide offensive content from users.
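Sentence-level flagging of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the regex-based `split_sentences` is a naive stand-in for a real sentence segmenter, and `is_hap` is a hypothetical callback that would wrap whichever HAP classifier is in use.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def flag_sentences(text: str, is_hap: Callable[[str], bool]) -> List[Tuple[int, str]]:
    """Return (index, sentence) pairs for every sentence the classifier flags."""
    return [(i, s) for i, s in enumerate(split_sentences(text)) if is_hap(s)]
```

A sentence is the smallest unit this approach can point to. Isolating the offending words or phrases inside a flagged sentence, which is what the researchers are working toward, would require token-level signals that this sketch does not attempt.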