
Ctrl+Z for agents

IBM Research and the University of Illinois propose an undo-and-retry mechanism for cloud engineering agents that could allow them to troubleshoot IT issues more safely and effectively.

The convenience of the cloud can come with risks, as a wave of recent outages has shown. By one estimate, the average cost of an unplanned IT outage is now $14,000 per minute, up nearly 10% from 2022.

Rising costs have put more pressure than ever on site reliability engineers (SREs) to resolve incidents quickly. But with new servers coming online faster than engineers can be hired to keep them safe, cloud providers have looked with hope toward AI.

Today’s AI tools for IT operations, known colloquially as “AIOps,” mostly help SREs perform triage — spotting symptoms and narrowing down suspected points of failure. Operators don’t have enough trust in AI agents to let them fix incidents directly. Without an auditable trail and a way to roll back unsuccessful moves, operators are unlikely to delegate this last mile of a response to an AI.

A novel safety guarantee proposed by researchers at IBM and the University of Illinois Urbana-Champaign (UIUC) could be the first step toward solving the accountability problem. A bit like the keyboard shortcut Ctrl+Z, it’s at the heart of a new multi-agent system called STRATUS. When STRATUS’s mitigation agent makes an unsuccessful move, an ‘undo’ maneuver reverts the system to the last checkpoint so that alternate solutions can be explored.

Named for the low-hanging stratus clouds, STRATUS outperformed other state-of-the-art systems by at least 150% on AIOpsLab and ITBench, open-source cloud engineering benchmarks seeded by Microsoft and IBM, respectively. The team reported their work in a new paper accepted at NeurIPS 2025, attributing STRATUS’s performance boost to its undo-and-retry mechanism.

“We thought the agent might get stuck in a loop and make the same mistake over and over,” said Saurabh Jha, a senior research scientist at IBM who co-led the team behind STRATUS. “Instead, we found it was able to safely explore new mitigation paths and seemed to perform better with each new attempt.”

When fixing becomes breaking

SREs today use a variety of algorithms to detect, localize, and identify the root causes of machine and disk failures in the cloud, but ultimately it’s the humans who operate on live systems to make the fix.

Typically, they collaborate with colleagues and consult technical documents to reason through a problem and minimize mistakes. “When they do err, their training and creative problem-solving skills allow them to diagnose and recover,” said Jha.

Companies trying to replace human ingenuity with AI agents face the very real risk of digging an even deeper hole. LLMs are famously bad at knowing what they don't know. Like an overeager student, they can throw out facts regardless of their accuracy or relevance, and they can be susceptible to ‘groupthink,’ since models trained on similar data tend to share the same blind spots.

LLMs can also act in ways that humans find reckless or impulsive, with potentially dire consequences for the state of the internet and important records. “We've observed AI agents that wouldn’t hesitate to take catastrophic actions, like attempting to delete an entire production cluster,” said Jha. “A human engineer would almost never make such a decision, especially not without extensive verification and consultation.”

STRATUS is designed to bring greater trust and transparency to agentic systems. Its safe rollback strategy is based on a concept familiar to any database manager or software engineer. It’s called transactional-no-regression (TNR), and it ensures that only reversible changes that won’t break existing functionality can be made.

Agents in IBM and the University of Illinois Urbana-Champaign's proposed STRATUS multi-agent system work together to detect, diagnose, and mitigate a cloud failure using a variety of observational data. An undo mechanism (described here as an agent) allows the mitigation agent to reverse unsuccessful actions and try again until the outage is resolved.

On its face, TNR may seem like a logical way to make agents safer. The researchers were unsure, however, whether it could be applied to unpredictable multi-agent systems. “It's much harder to make sure the agent won’t do anything harmful and will complete its goals without getting stuck when you don't have fixed tests or oracles to definitively define ‘correct’ behavior,” said Jha.

To get around the behavior modeling problem, the researchers devised a way for the agent to monitor the state of its environment, to effectively check its work. After the STRATUS mitigation agent takes a series of steps — operations adding up to one unit of work called a “transaction” — the agent assesses the severity level of the system. If the new state is worse off, the agent aborts the transaction and the system reverts to its initial, checkpointed state.
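In spirit, the loop resembles the minimal Python sketch below. It is an illustration under assumptions, not code from the paper: the dict-based state, the severity score, and the propose_alternative planner are hypothetical stand-ins for what STRATUS measures and plans on a live cluster.

```python
import copy

def severity(state):
    """Toy severity metric: the number of unhealthy components.
    (STRATUS derives severity from real observability data.)"""
    return sum(1 for ok in state["components"].values() if not ok)

def run_transaction(state, actions, propose_alternative, max_retries=3):
    """Apply one transaction (a bounded series of steps), keep it only
    if the system is no worse off, otherwise revert and retry."""
    checkpoint = copy.deepcopy(state)       # snapshot before any changes
    baseline = severity(checkpoint)

    for _ in range(max_retries):
        for action in actions:              # one unit of work: the "transaction"
            action(state)
        if severity(state) <= baseline:     # no regression: commit the result
            return state
        state = copy.deepcopy(checkpoint)   # worse off: abort and roll back
        actions = propose_alternative()     # explore a different mitigation path
    raise RuntimeError("Retries exhausted; escalate to a human operator")
```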

“Our core assumption is that every action must be undoable,” said Jha. “If the agent proposes an action, like deleting a database, that the system identifies as destructive and non-recoverable, it will be rejected before it can even run. At that point, the agent must either find an alternative, safe-to-undo solution or escalate the problem to a human.”
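That gate can be expressed in a few lines. The sketch below is hypothetical (the Action type and its undo field are not from the paper); it simply encodes the rule that an action without a registered undo never executes:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    name: str
    run: Callable[[], None]
    undo: Optional[Callable[[], None]] = None   # None means non-recoverable

def gated_execute(action: Action) -> None:
    """Reject destructive, non-recoverable actions before they run."""
    if action.undo is None:
        raise PermissionError(
            f"{action.name!r} has no undo operator: find a safe-to-undo "
            "alternative or escalate to a human"
        )
    action.run()
```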

For the rollback to work, the researchers had to build several constraints into STRATUS. A write lock prevents agents from executing commands at the same time, and those commands are simulated first to catch potential errors. Each action taken by an agent also has a corresponding undo operator, which effectively rules out irreversible changes such as file deletions. Lastly, the number of commands allowed within a transaction is capped to make rollbacks easier.
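Taken together, those constraints might look something like the following sketch, again with hypothetical names: the simulate hook stands in for a dry run (for Kubernetes, something like kubectl apply --dry-run=server), and the cap on operations per transaction is an arbitrary example value.

```python
import threading

WRITE_LOCK = threading.Lock()        # serializes agents: one writer at a time
MAX_OPS_PER_TRANSACTION = 5          # small transactions keep rollbacks tractable

def execute_transaction(ops, simulate, apply):
    """Guardrails around a transaction: cap its size, simulate every
    command first, and only then apply them under the write lock."""
    if len(ops) > MAX_OPS_PER_TRANSACTION:
        raise ValueError("Transaction too large to roll back safely")
    with WRITE_LOCK:
        for op in ops:
            simulate(op)             # dry run: catch errors off the live system
        for op in ops:
            apply(op)                # real execution, one command at a time
```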

As enterprises explore the potential for LLM agents to resolve incidents in the cloud and in other production settings, researchers are increasingly focused on safety. A team at Stanford recently proposed a multi-agent system called SagaLLM that includes transactional safeguards for agents. It doesn’t, however, model the system’s external state, which would allow it to undo actions as a safety backup, said Jha.

The cloud and beyond

The team is currently refining STRATUS and looking to apply its transactional safety concept to software engineering agents. It’s an easier problem, said Jha, because version control systems like Git already provide checkpoint and abort mechanisms (“commit” and “checkout” in Git lingo).
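As a rough analogy, the same checkpoint-and-abort pattern can be scripted on top of Git itself. A minimal sketch, assuming a local repository and a caller-supplied check of whether the change worked (both hypothetical):

```python
import subprocess

def git(*args, repo="."):
    """Run a git command in the given repository, raising on failure."""
    subprocess.run(["git", "-C", repo, *args], check=True)

def checkpointed_change(apply_change, change_is_good, repo="."):
    """Commit a checkpoint, attempt a change, and abort if it fails."""
    git("add", "-A", repo=repo)
    git("commit", "--allow-empty", "-m", "checkpoint before agent edit", repo=repo)
    apply_change()                            # the agent's risky edit
    if not change_is_good():
        # Abort: restore tracked files to the checkpoint.
        # (New untracked files would also need "git clean".)
        git("checkout", "--", ".", repo=repo)
```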

Transactional safety could also be applied to AI agents for database management and performance engineering, a field focused on designing efficiency and stability into software systems. Transaction abstractions allow multiple programs to safely access a shared state, which is why they’re a natural fit for systems with multiple agents operating at once, said Indranil Gupta, a computer science professor at UIUC who was not involved in the work.

“The STRATUS work is groundbreaking in that it understands (and fleshes out) which aspects of the transactional abstraction are relevant to the agentic world,” Gupta said, “specifically ‘no regression,’ which is key to ensuring that a group of agents makes progress, rather than going in circles.”

Transactional abstractions are well studied in database management, he added, but it’s still unclear which can be transferred to agents, and which will need to be discovered. “These all remain open questions,” he said. “The STRATUS paper will hopefully encourage researchers across the community to engage in addressing these questions.”
