AI Testing
We’re designing tools to help ensure that AI systems are trustworthy, reliable, and able to optimize business processes. We create tests that simulate real-life scenarios and localize faults in AI systems, and we’re working to automate the testing, debugging, and repair of AI models across a wide range of scenarios.
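To make the idea of scenario-based tests that localize faults concrete, here is a minimal, self-contained sketch of a counterfactual test: it swaps a single attribute in an input and flags any pair of inputs whose predictions diverge. This is an illustration only, not IBM's actual tooling or APIs; `sentiment_model`, the template, and the term list are hypothetical stand-ins.

```python
# Minimal sketch (hypothetical, not IBM tooling): a counterfactual test that
# probes a text classifier for behavior changes when one attribute is swapped.

def sentiment_model(text: str) -> str:
    """Hypothetical classifier under test; replace with a real model call."""
    return "positive" if "great" in text.lower() else "negative"

def counterfactual_pairs(template: str, terms: list[str]) -> list[tuple[str, str]]:
    """Fill a template with each term and pair up the resulting inputs."""
    filled = [template.format(term) for term in terms]
    return [(a, b) for i, a in enumerate(filled) for b in filled[i + 1:]]

def run_counterfactual_test(template: str, terms: list[str]) -> list[tuple[str, str]]:
    """Return the input pairs whose predictions diverge, localizing the failure."""
    failures = []
    for original, counterfactual in counterfactual_pairs(template, terms):
        if sentiment_model(original) != sentiment_model(counterfactual):
            failures.append((original, counterfactual))
    return failures

if __name__ == "__main__":
    failures = run_counterfactual_test(
        "The {} applicant gave a great presentation.", ["male", "female"]
    )
    print("Diverging predictions:" if failures else "No divergence found.", failures)
```

The same pattern extends to generated suites of counterfactuals; each diverging pair points to a concrete failing scenario rather than a single aggregate score.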
Our work
Tiny benchmarks for large language models
- News
- Kim Martineau

What is red teaming for generative AI?
- Explainer
- Kim Martineau

An open-source toolkit for debugging AI models of all data types
- Technical note
- Kevin Eykholt and Taesung Lee

AI diffusion models can be tricked into generating manipulated images
- News
- Kim Martineau

DOFramework: A testing framework for decision optimization model learners
- Technical note
- Orit Davidovich

Managing the risk in AI: Spotting the “unknown unknowns”
- Research
- Orna Raz, Sam Ackerman, and Marcel Zalmanovici

IBM researchers check AI bias with counterfactual text
- Research
- Inkit Padhi, Nishtha Madaan, Naveen Panwar, and Diptikalyan Saha
Publications
Combinatorial Test Design Model Creation using Large Language Models
- Debbie Furman
- Eitan Farchi
- et al.
- 2025
- IWCT 2025
Evolution of catalysis at IBM: From microelectronics to biomedicine to sustainability with AI-driven innovation
- James Hedrick
- Tim Erdmann
- et al.
- 2025
- ACS Spring 2025
Workshop on Data Integrity and Secure Cloud Computing (DISCC)
- Pradip Bose
- Augusto Vega
- et al.
- 2025
- HPCA 2025
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking
- Gabriel Rioux
- Apoorva Nitsure
- et al.
- 2024
- NeurIPS 2024
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios
- Samuel Ackerman
- Ella Rabinovich
- et al.
- 2024
- EMNLP 2024
Towards a Benchmark for Causal Business Process Reasoning with LLMs
- 2024
- BPM 2024