LLMs have model cards. Now, benchmarks do, too
To help AI developers more accurately assess a model’s capabilities, IBM Research and the University of Notre Dame are open-sourcing a template and automation tool for creating LLM benchmark cards at scale.
Much of today’s progress in AI can be attributed to benchmarks — rigorous, standardized tests designed to expose what large language models can and can’t do, and to spur further innovation.
Benchmarks provide a structured way to compare models side-by-side and gauge their biases, risks, and fitness for a given task. The popularity of an LLM can rise or fall on its benchmark score. Why, then, are these make-or-break exams that everyone shares so poorly documented?
Elizabeth Daly, a senior technical staff member at IBM Research, kept asking herself that as she surveyed the inconsistent, often incomplete, state of benchmark documentation.
“I started to see this massive disconnect between what information is available and how it's represented in the test harnesses,” she said. “Is the benchmark telling you that the model’s really good at algebra, or only when the test questions are presented as multiple-choice?”
At the time, Daly and other researchers across IBM were working on a model risk evaluation engine for IBM’s AI governance platform, watsonx.governance. Together, they realized that to properly assess an LLM’s capabilities, you had to understand what exactly was being evaluated and how.
Model cards, of course, could tell you how an LLM was designed, trained, and evaluated. But sometimes, the most meaningful information was hidden in the benchmark fine print, which was typically buried in academic research papers.
Had the benchmark been properly implemented, consistently applied, and its results correctly interpreted? The burden of checking off these boxes typically fell to developers. What if, instead, easy-to-read summaries could be provided?
IBM already had a collaboration going with researchers at the University of Notre Dame. Together, they conceived of the BenchmarkCards project, formulating a template and an automated workflow for generating and validating new benchmark cards. This week, they open-sourced 105 validated benchmark cards on Hugging Face. Notre Dame has separately released a dataset of 4,000 benchmark cards.
Some of the most recognizable names in LLM benchmarking are here — University of California at Berkeley’s MT-Bench for measuring an LLM’s conversational skills, Allen AI’s WinoGrande for common-sense ‘reasoning,’ and University of Oxford’s TruthfulQA for tendencies to ‘hallucinate.’
The team is calling on the community to try out the benchmark cards and create cards of their own to help popularize the standard. “In the long run, we hope the cards can serve as a common language for describing evaluation resources, to reduce redundancy and to help the field progress more coherently,” said Anna Sokol, a PhD student at Notre Dame’s Lucy Family Institute for Data and Society, who worked on the project.
Anatomy of a benchmark
In 2019, IBM and Google were among the first tech companies to create AI-specific documentation, called fact sheets and model cards, respectively. Like nutrition labels for food, they gave consumers the facts needed to weigh the pros and cons of applying AI in a specific context, be it lending, resume screening, or policing, among others.
Researchers at IBM and Notre Dame sought to bring a similar transparency and structure to benchmarks. By making the fine print more accessible, they hoped to help developers and other researchers more easily find the right benchmark for their needs, and better predict how their benchmarked model would perform in real life.
The template the team designed includes a high-level overview of the benchmark, followed by five sections:
- Purpose: appropriate and inappropriate uses of the benchmark.
- Data: the data’s sources, scale, type of representation, and annotation process.
- Methodology: how metrics were computed and how they should be interpreted.
- Targeted risks: potential harmful outcomes for models that perform poorly on the test.
- Ethical and legal considerations: data licensing and relevant privacy regulations.
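As a rough illustration, a card following this template could be captured as a simple structured record. The field names below paraphrase the sections above; they are not the project’s official schema.

```python
# Illustrative sketch only: field names paraphrase the template described above,
# not the exact schema used in the open-sourced benchmark cards.
benchmark_card = {
    "overview": "Short description of the benchmark and what it measures.",
    "purpose": {
        "appropriate_uses": ["auditing a QA system for social bias"],
        "inappropriate_uses": ["certifying a model as bias-free"],
    },
    "data": {
        "sources": ["crowdsourced question templates"],
        "scale": "number of examples",
        "representation": "e.g., multiple-choice questions",
        "annotation_process": "how labels were produced and reviewed",
    },
    "methodology": "how metrics are computed and how scores should be interpreted",
    "targeted_risks": ["harms associated with poor performance on this test"],
    "ethical_and_legal_considerations": ["data licensing", "relevant privacy regulations"],
}
```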
Each benchmark card lists benchmarks with similar goals. The standardized format also simplifies benchmark-to-benchmark comparisons, allowing developers to make more informed choices. The researchers show how this could be useful by placing the card for NYU’s 2021 Bias Benchmark for Question Answering (BBQ) alongside the one for Allen AI’s 2020 RealToxicityPrompts (RTP).
Through this apples-to-apples comparison, a social media company could see that RTP would be the best way to test an LLM on its ability to filter harmful outputs, while a researcher auditing a question-answering system for fairness would find BBQ more useful for identifying bias.
“It can be difficult to know which benchmark to trust, how to interpret a score, or whether two benchmarks are even comparable,” said Sokol. “If you’re studying LLM hallucinations, how do you know which of the tens of benchmark options to choose from?”
Filling in the blanks
Extracting and summarizing methodology details from a research paper can be tedious and time-consuming. A previous user study by the team suggests that this could be why benchmarks have been so poorly documented in the first place.
To expedite the documentation process, Aris Hofmann, a data science student at DHBW Stuttgart, designed an automated workflow as part of his internship at IBM. Previously, it took several hours to create a benchmark card, he said. Now, it can be done in 10 minutes. The workflow makes use of several open-source technologies incubated at IBM Research, including unitxt, Docling, Risk Atlas Nexus, and FactReasoner.
The workflow starts (captured in this demo Hofmann recorded) by selecting a benchmark from IBM’s unitxt catalog and downloading the benchmark’s known documentation. The document conversion tool, Docling, translates the material into machine-readable text, which is then passed to an LLM to extract the relevant details and plug them into the template.
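A minimal sketch of the conversion step, assuming Docling is installed: the file name, the TEMPLATE_SECTIONS list, and the prompt are illustrative placeholders, not the project’s actual extraction code.

```python
from docling.document_converter import DocumentConverter

# Convert a benchmark's research paper (PDF path or URL) into machine-readable text.
# The file name below is a placeholder; point it at the paper for your benchmark.
converter = DocumentConverter()
result = converter.convert("benchmark_paper.pdf")
paper_text = result.document.export_to_markdown()

# The converted text is then handed to an LLM along with the template sections,
# asking it to extract the relevant details. The prompt here is illustrative only.
TEMPLATE_SECTIONS = [
    "Purpose", "Data", "Methodology", "Targeted risks",
    "Ethical and legal considerations",
]
prompt = (
    "Fill in each BenchmarkCards section using only the paper below.\n"
    f"Sections: {', '.join(TEMPLATE_SECTIONS)}\n\n{paper_text}"
)
# The prompt would then be sent to an LLM of your choice to produce a draft card.
```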
Once a benchmark card is drafted, IBM’s Risk Atlas Nexus flags potential risk factors and adds them to the card. The draft is then passed to IBM’s fact-checking tool, FactReasoner, to check the accuracy of each claim. The tool flags statements that potentially contradict the supporting material, which either an LLM or a human can then correct.
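The validation pass can be pictured as a flag-and-correct loop. The helpers below are hypothetical stand-ins for Risk Atlas Nexus and FactReasoner; their real APIs are not shown here.

```python
# Hypothetical sketch of the validation pass, not the project's actual code.

def attach_risks(card: dict) -> dict:
    """Stand-in for Risk Atlas Nexus: map the benchmark to known risk factors."""
    card.setdefault("targeted_risks", []).append("example risk factor (placeholder)")
    return card

def check_claims(card: dict, source_text: str) -> list[dict]:
    """Stand-in for FactReasoner: return card claims that contradict the source paper."""
    # A real checker would compare each statement in the card against source_text.
    return []

def validate_card(card: dict, source_text: str) -> dict:
    card = attach_risks(card)
    for finding in check_claims(card, source_text):
        # Flagged statements are routed to an LLM or a human reviewer for correction.
        fix = finding.get("suggested_fix")
        if fix is not None:
            card[finding["field"]] = fix
    return card
```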
“We’re not just putting information into an LLM and asking it to synthesize a bunch of context, we're actually verifying it,” said Daly.
Benchmark cards are intended to pick up where model cards leave off. “They can give you a more contextualized perspective on the model’s behavior,” she added. “Our goal was to help developers communicate more effectively to the stakeholder why a given model is better for their use case.”
To check out benchmark cards already created by the team, head to Hugging Face, or create your own using Auto-BenchmarkCard’s easy-to-follow instructions.