Teams of agents can take the headaches — and potential costs — out of finding IT bugs
IBM Research’s Project ALICE is a new experimental multi-agent system that can help engineers shorten the time it takes to locate and quash software bugs.
The cloud architecture that underpins much of the way we live and work is truly a modern marvel of complexity. But even with the profound capabilities today’s software systems provide, there is always the chance of bugs that are difficult and costly to find and fix. Estimates suggest that the average cost of an IT outage is more than $14,000 for every minute of downtime.
These bugs get so costly so quickly because of how difficult they can be to pin down. Cloud systems often have labyrinthine layers of software, with some parts of the stack more up to date than others. Even small tweaks deep in the stack can have massive implications for the thousands or millions of customers relying on a given system. Imagine a bank is experiencing a massive outage. Services are down, which leaves customers frustrated and sends the IT operations team into a scramble. Using traditional observability tools, they can get metrics, logs, and traces on the incident, but the actual cause remains elusive.
Short of diving into the entire stack, resolving the issue can take hours or days, hurting the company’s bottom line and leaving customers considering its competitors. Roughly 27% of unplanned outages are the result of software updates, and in just the last year, we’ve seen these outages cost companies billions of dollars.
With the advent of agentic AI, IT engineers have seen the potential of agents that can work on their own to systematically root out issues and ensure software is working as intended. Instead of spending hours sifting through logs in the hope of finding the issue, human engineers can be freed up for more complex or strategic tasks, and systems can keep humming along on their own.
This is where Project ALICE comes in. Short for ‘Agentic Logic for Incident and Codebug Elimination,’ it’s a new multi-agent system born out of IBM Research and designed to automate these sorts of IT challenges. It brings together two critical areas of IT operations: site reliability engineering (SRE) and software development. When an incident occurs in a system, engineers can deploy ALICE to investigate the problem.
ALICE uses several IBM-designed tools and agents, working in concert and sequentially, to tackle some of the biggest time sinks in debugging software. First, it initiates the investigation with an incident analysis agent that gathers observability data. Next, a code context agent generates a dependency graph for the interconnected pieces of software and determines which microservices in the application are most likely to be relevant to the problem. Then, a code analysis agent, partially powered by IBM’s CodeLLM DevKit, works to localize the bug and compiles a report to send to the human engineers as a GitHub issue. The team then has the exact observability information and code issue details needed to remedy the problem as quickly as possible.
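To make that hand-off concrete, here is a minimal, purely illustrative sketch of a sequential agent pipeline in Python. The agent names, data structures, and mock findings are placeholders introduced for illustration, not ALICE’s actual code or interfaces; the point is only that each agent enriches a shared incident report before passing it along.

```python
from dataclasses import dataclass, field

# Everything below is an illustrative stand-in, not ALICE's actual interfaces.

@dataclass
class IncidentReport:
    """Shared record that each agent enriches as the investigation proceeds."""
    incident_id: str
    observability: dict = field(default_factory=dict)      # metrics, logs, traces
    suspect_services: list = field(default_factory=list)   # ranked microservices
    bug_location: str = ""

def incident_analysis_agent(report: IncidentReport) -> IncidentReport:
    """Step 1: gather observability data for the incident window (mocked here)."""
    report.observability = {
        "metrics": ["p99 checkout latency spiked at 14:02"],
        "logs": ["ERROR payment-service: connection pool exhausted"],
        "traces": ["trace 7f3a: checkout -> payment -> ledger timed out"],
    }
    return report

def code_context_agent(report: IncidentReport) -> IncidentReport:
    """Step 2: build a dependency graph and rank the services most likely involved."""
    dependency_graph = {"checkout": ["payment", "inventory"], "payment": ["ledger"]}
    # Naive ranking for illustration: services named in the error logs come first.
    mentioned = [svc for svc in dependency_graph
                 if any(svc in line for line in report.observability["logs"])]
    report.suspect_services = mentioned or list(dependency_graph)
    return report

def code_analysis_agent(report: IncidentReport) -> IncidentReport:
    """Step 3: localize the bug and summarize it as a GitHub issue for engineers."""
    top_suspect = report.suspect_services[0]
    report.bug_location = f"{top_suspect}: suspected fault in its connection handling"
    print(f"[mock GitHub issue] Incident {report.incident_id}: {report.bug_location}")
    return report

def run_pipeline(incident_id: str) -> IncidentReport:
    """Run the agents sequentially, each building on the previous agent's findings."""
    report = IncidentReport(incident_id)
    for agent in (incident_analysis_agent, code_context_agent, code_analysis_agent):
        report = agent(report)
    return report

if __name__ == "__main__":
    run_pipeline("INC-1042")
```

Running the sketch on a mock incident ID prints the hypothetical GitHub issue summary that would be handed to engineers, alongside the observability data and suspect list accumulated in the shared report.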
These agents communicate through the open Model Context Protocol (MCP), which lets them coordinate with one another and interoperate with any external models that also use the protocol. Long term, the goal is for ALICE to identify, strategize, and fix bugs on its own, and even look for potential failure points before incidents happen. IBM’s own SREs have started validating ALICE in their workflows. Early results are promising, showing a 10% to 25% improvement in identifying root causes of issues with the addition of agentic code analysis in ALICE. The team used ITBench scenarios to measure their progress, and the work was presented this month at the NeurIPS 2025 conference.
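To give a flavor of what that interoperability looks like, here is a minimal sketch of one agent exposing a capability as an MCP tool, written against the open-source MCP Python SDK’s FastMCP server class. The server name, tool name, and mocked payload are hypothetical placeholders, not ALICE’s real interface.

```python
# Minimal MCP tool server sketch, assuming the MCP Python SDK ("pip install mcp").
# The tool name and payload are hypothetical; they only illustrate how an agent can
# publish a capability that any MCP-compatible agent or model can call.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("incident-analysis")

@mcp.tool()
def fetch_observability_summary(incident_id: str) -> dict:
    """Return a (mocked) summary of metrics and logs for the given incident."""
    return {
        "incident_id": incident_id,
        "metrics": ["p99 checkout latency spiked at 14:02"],
        "logs": ["ERROR payment-service: connection pool exhausted"],
    }

if __name__ == "__main__":
    # Serve the tool over stdio so other MCP clients can discover and call it.
    mcp.run(transport="stdio")
```

Because the other agents speak the same protocol, they can call a tool like this the same way they would call tools offered by any external MCP-compatible model or service.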
The team behind ALICE plans to work on future versions that could detect changes to a codebase as they’re made, potentially remedying incidents in seconds, before they become catastrophes. This work is part of a larger effort within IBM Research to build automation tools that make work easier for engineers, physical plant managers, or just about anyone who uses software to monitor the health of physical or digital systems. The team recently built an “undo button” function for agents like ALICE that are still learning the best way to solve problems, and partnered with Kaggle to turn IBM’s open IT operations benchmarks into leaderboards that help engineers determine which models and agents are best suited to their needs.
