Granite Guardian: Comprehensive LLM Safeguarding

Inkit Padhi; Manish Nagireddy; Giandomenico Cornacchia; Subhajit Chaudhury; Tejaswini Pedapati; Pierre Dognin; Keerthiram Murugesan; Erik Miehling; Martin Santillan Cooper; Kieran Fraser; Giulio Zizzo; Muhammad Zaid Hameed; Mark Purcell; Michael Desmond; Qian Pan; Inge Vejsbjerg; Elizabeth Daly; Michael Hind; Werner Geyer; Ambrish Rawat; Kush Varshney; Prasanna Sattigeri

NAACL 2025

Conference paper

29 Apr 2025

Abstract

The deployment of language models in real-world applications exposes users to various risks, including hallucinations and harmful or unethical content. These challenges highlight the urgent need for robust safeguards to ensure safe and responsible AI. To address this, we introduce Granite Guardian, a suite of advanced models designed to detect and mitigate risks associated with prompts and responses, enabling seamless integration with any large language model (LLM). Unlike existing open-source solutions, our Granite Guardian models provide comprehensive coverage across a wide range of risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related issues such as context relevance, groundedness, and answer accuracy in retrieval-augmented generation (RAG) scenarios. Trained on a unique dataset combining diverse human annotations and synthetic data, Granite Guardian excels in identifying risks often overlooked by traditional detection systems, particularly jailbreak attempts and RAG-specific challenges.
