LLMs have model cards. Now, benchmarks do, too
To help AI developers more accurately assess a model’s capabilities, IBM Research and the University of Notre Dame are open-sourcing a template and automation tool for creating LLM benchmark cards at scale.
Much of today’s progress in AI can be attributed to benchmarks — rigorous, standardized tests designed to expose what large language models can and can’t do, and to spur further innovation.
Benchmarks provide a structured way to compare models side-by-side and gauge their biases, risks, and fitness for a given task. The popularity of an LLM can rise or fall on its benchmark score. Why, then, are these make-or-break exams that everyone shares so poorly documented?
Elizabeth Daly, a senior technical staff member at IBM Research, kept asking herself that as she surveyed the inconsistent, often incomplete, state of benchmark documentation.
“I started to see this massive disconnect between what information is available and how it's represented in the test harnesses,” she said. “Is the benchmark telling you that the model’s really good at algebra, or only when the test questions are presented as multiple-choice?”
At the time, Daly and other researchers across IBM were working on a model risk evaluation engine for IBM’s AI governance platform, watsonx.governance. Together, they realized that to properly assess an LLM’s capabilities, you had to understand what exactly was being evaluated and how.
Model cards, of course, could tell you how an LLM was designed, trained, and evaluated. But sometimes, the most meaningful information was hidden in the benchmark fine print, which was typically buried in academic research papers.
Had the benchmark been properly implemented, consistently applied, and its results correctly interpreted? The burden of checking off these boxes typically fell to developers. What if, instead, easy-to-read summaries could be provided?
IBM already had a collaboration going with researchers at the University of Notre Dame. Together, they conceived of the BenchmarkCards project, formulating a template and an automated workflow for generating and validating new benchmark cards. This week, they open-sourced 105 validated benchmark cards on Hugging Face. Notre Dame has separately released a dataset of 4,000 benchmark cards.
Some of the most recognizable names in LLM benchmarking are here — University of California at Berkeley’s MT-Bench for measuring an LLM’s conversational skills, Allen AI’s WinoGrande for common-sense ‘reasoning,’ and University of Oxford’s TruthfulQA for tendencies to ‘hallucinate.’
The team is calling on the community to try out the benchmark cards and create cards of their own to help popularize the standard. “In the long run, we hope the cards can serve as a common language for describing evaluation resources, to reduce redundancy and to help the field progress more coherently,” said Anna Sokol, a PhD student at Notre Dame’s Lucy Family Institute for Data and Society, who worked on the project.
Anatomy of a benchmark
In 2019, IBM and Google were among the first tech companies to create AI-specific documentation, called fact sheets and model cards, respectively. Like nutrition labels for food, they gave consumers the facts needed to weigh the pros and cons of applying AI in a specific context, be it lending, resume screening, or policing, among others.
Researchers at IBM and Notre Dame sought to bring a similar transparency and structure to benchmarks. By making the fine print more accessible, they hoped to help developers and other researchers more easily find the right benchmark for their needs, and better predict how their benchmarked model would perform in real life.
The template the team designed includes a high-level overview of the benchmark, followed by five sections:
- Purpose: appropriate and inappropriate uses of the benchmark.
- Data: the data’s sources, scale, type of representation, and annotation process.
- Methodology: how metrics were computed and how they should be interpreted.
- Targeted risks: potential harmful outcomes for models that perform poorly on the test.
- Ethical and legal considerations: data licensing and relevant privacy regulations.
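As a rough illustration, a card following this template could be captured as a simple structured record. The field names below paraphrase the sections above; they are not the project’s official schema.

```python
# Illustrative sketch only: field names paraphrase the template described above,
# not the exact schema used in the open-sourced benchmark cards.
benchmark_card = {
    "overview": "Short description of the benchmark and what it measures.",
    "purpose": {
        "appropriate_uses": ["auditing a QA system for social bias"],
        "inappropriate_uses": ["certifying a model as bias-free"],
    },
    "data": {
        "sources": ["crowdsourced question templates"],
        "scale": "number of examples",
        "representation": "e.g., multiple-choice questions",
        "annotation_process": "how labels were produced and reviewed",
    },
    "methodology": "how metrics are computed and how scores should be interpreted",
    "targeted_risks": ["harms associated with poor performance on this test"],
    "ethical_and_legal_considerations": ["data licensing", "relevant privacy regulations"],
}
```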
Each benchmark card lists benchmarks with similar goals. The standardized format also simplifies benchmark-to-benchmark comparisons, allowing developers to make more informed choices. The researchers show how this could be useful by placing the card for NYU’s 2021 Bias Benchmark for Question Answering (BBQ) alongside the one for Allen AI’s 2020 RealToxicityPrompts (RTP).
Through this apples-to-apples comparison, a social media company could see that RTP would be the best way to test an LLM on its ability to filter harmful outputs, while a researcher auditing a question-answering system for fairness would find BBQ more useful for identifying bias.
“It can be difficult to know which benchmark to trust, how to interpret a score, or whether two benchmarks are even comparable,” said Sokol. “If you’re studying LLM hallucinations, how do you know which of the tens of benchmark options to choose from?”
Filling in the blanks
Extracting and summarizing methodology details from a research paper can be tedious and time-consuming. A previous user study by the team suggests that this could be why benchmarks have been so poorly documented in the first place.
To expedite the documentation process, Aris Hofmann, a data science student at DHBW Stuttgart, designed an automated workflow as part of his internship at IBM. Previously, it took several hours to create a benchmark card, he said. Now, it can be done in 10 minutes. The workflow makes use of several open-source technologies incubated at IBM Research, including unitxt, Docling, Risk Atlas Nexus, and FactReasoner.
The workflow starts (captured in this demo Hofmann recorded) by selecting a benchmark from IBM’s unitxt catalog and downloading the benchmark’s known documentation. The document conversion tool, Docling, translates the material into machine-readable text, which is then passed to an LLM to extract the relevant details and plug them into the template.
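A minimal sketch of the conversion step, assuming Docling is installed: the file name, the TEMPLATE_SECTIONS list, and the prompt are illustrative placeholders, not the project’s actual extraction code.

```python
from docling.document_converter import DocumentConverter

# Convert a benchmark's research paper (PDF path or URL) into machine-readable text.
# The file name below is a placeholder; point it at the paper for your benchmark.
converter = DocumentConverter()
result = converter.convert("benchmark_paper.pdf")
paper_text = result.document.export_to_markdown()

# The converted text is then handed to an LLM along with the template sections,
# asking it to extract the relevant details. The prompt here is illustrative only.
TEMPLATE_SECTIONS = [
    "Purpose", "Data", "Methodology", "Targeted risks",
    "Ethical and legal considerations",
]
prompt = (
    "Fill in each BenchmarkCards section using only the paper below.\n"
    f"Sections: {', '.join(TEMPLATE_SECTIONS)}\n\n{paper_text}"
)
# The prompt would then be sent to an LLM of your choice to produce a draft card.
```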
Once a benchmark card is drafted, IBM’s Risk Atlas Nexus flags potential risk factors and adds them to the card. The draft is then passed to IBM’s fact-checking tool, FactReasoner, to check the accuracy of each claim. The tool flags statements that potentially contradict the supporting material, which either an LLM or a human can then correct.
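The validation pass can be pictured as a flag-and-correct loop. The helpers below are hypothetical stand-ins for Risk Atlas Nexus and FactReasoner; their real APIs are not shown here.

```python
# Hypothetical sketch of the validation pass, not the project's actual code.

def attach_risks(card: dict) -> dict:
    """Stand-in for Risk Atlas Nexus: map the benchmark to known risk factors."""
    card.setdefault("targeted_risks", []).append("example risk factor (placeholder)")
    return card

def check_claims(card: dict, source_text: str) -> list[dict]:
    """Stand-in for FactReasoner: return card claims that contradict the source paper."""
    # A real checker would compare each statement in the card against source_text.
    return []

def validate_card(card: dict, source_text: str) -> dict:
    card = attach_risks(card)
    for finding in check_claims(card, source_text):
        # Flagged statements are routed to an LLM or a human reviewer for correction.
        fix = finding.get("suggested_fix")
        if fix is not None:
            card[finding["field"]] = fix
    return card
```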
“We’re not just putting information into an LLM and asking it to synthesize a bunch of context, we're actually verifying it,” said Daly.
Benchmark cards are intended to pick up where model cards leave off. “They can give you a more contextualized perspective on the model’s behavior,” she added. “Our goal was to help developers communicate more effectively to the stakeholder why a given model is better for their use case.”
To check out benchmark cards already created by the team, head to Hugging Face, or create your own using Auto-BenchmarkCard’s easy-to-follow instructions.