Q & A

Meet the IBM researchers trying to raise AI’s intelligence-per-watt ratio

The tech leads behind IBM’s new software library, Mellea, explain how it works and why giving language models explicit requirements can improve performance and lower costs.

Large language models still dominate AI leaderboards, but a new class of lighter-weight models is closing the gap. IBM’s new Granite 4.0 model family, for example, can outperform older and much larger frontier models at a fraction of the price.

It’s why IBM and other tech companies have embraced small language models (SLMs) for many enterprise tasks. Running them requires less computational power, memory, and electricity, and as a team at Stanford recently found, they can capably handle most AI tasks from a laptop or phone. To mark this milestone, the team proposed rating language models not by size alone, but by a so-called “intelligence per watt” ratio.

IBM Granite models already stand out by this measure. But what if their intelligence could be augmented further simply by building applications a bit differently, in a more SLM-friendly way? IBM recently released Mellea, a new open-source library designed to make interacting with a language model as predictable as running any other software by, among other things, imposing requirements at inference time.

Still in its early stages, Mellea is part of a larger research agenda IBM is calling generative computing. Today, AI agents are built in a messy ad-hoc way that requires long, complex prompts that only large frontier models can handle. Generative computing envisions a more structured, simplified design that would allow Granite and other lean, open-source LLMs to perform as well as or better than the heavyweights.

A pair of research scientists at IBM, Nathan Fulton and Hendrik Strobelt, set out to build Mellea nearly a year ago. Both have been coding since they were kids. Raised in East Germany, Strobelt learned BASIC on his dad’s Robotron. In suburban St. Louis, Fulton taught himself Applesoft BASIC on the family’s old Apple II computer long relegated to a closet.

They both also studied computer science, 15 years apart, in graduate school. Strobelt focused on methods to search and visualize massive document collections, while Fulton specialized in the mathematical logic for verifying that automated machines, from cars to planes, would behave as humans intended. They both wound up at IBM Research in Cambridge, Mass., as interest in large language models was ramping up.

We recently caught up with them to talk about Mellea, and their quest to make small open-source language models more reliable and user-friendly.

What’s the core problem you’re trying to solve?

Fulton: We want to do big model things with small models. We think the best way to do that is by getting away from long-winded prompts and magical incantations to get the response you want. We think you can do that by breaking a problem into bite-sized pieces that can be validated and solved iteratively. Decomposing a task into a series of sub-tasks often leads to better results, and small models can do this very efficiently.

Strobelt: LLMs need a failure mode. Any developer who has worked with an LLM immediately understands why getting away from prompts and providing code instead could be useful. Small to mid-sized LLMs have a lot of value that Mellea could unlock.

Is a failure mode that important?

Fulton: Yes. It’s easy to build a demo that works on 90% of examples, but a 10% failure rate when you don’t know where the system will fail is unacceptable. If you’re trying to automate a task where failure matters, and there’s no way to detect your failure modes, then it doesn’t work. Imagine if every tenth email you wrote didn’t send or was sent to everyone. It wouldn’t be a useful business tool.

How are failure modes implemented in Mellea?

Strobelt: Through a pattern we call instruct-validate-repair. I send instructions to the model; I validate what comes back against a set of requirements. Instead of just chatting with the model, I can ask it to write an email inviting colleagues to the office party with two conditions: the email should be engaging, and no longer than 100 words. If both conditions aren’t met, the model goes back and tries to repair its initial work. By adding specifications, you also define failure.

Fulton: If you’re writing a legal brief, for example, we can parse the citations and check them against the case law to see if they exist. If the model produces a bad citation at runtime, you can reject it and move on.
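In code, that pattern can be as simple as a loop. Here is a minimal sketch in Python, an illustration rather than Mellea’s actual API: `call_model` stands in for any text-in, text-out client, and each validator (the word-count check below, or a citation checker like the one Fulton describes) returns a repair hint when a requirement fails.

```python
# Minimal instruct-validate-repair loop. Illustrative sketch, not Mellea's API:
# call_model is a placeholder for any text-in, text-out model client.

def too_long(draft: str, limit: int = 100):
    """Requirement: no longer than `limit` words. Returns a repair hint on failure."""
    n = len(draft.split())
    return f"Cut the email to at most {limit} words (it has {n})." if n > limit else None

def instruct_validate_repair(call_model, instruction, validators, budget=3):
    """Instruct the model, validate the draft, and repair until the budget runs out."""
    prompt = instruction
    for _ in range(budget):
        draft = call_model(prompt)
        failures = [hint for check in validators if (hint := check(draft)) is not None]
        if not failures:
            return draft  # every requirement met
        # Repair step: restate the task along with what was wrong with the last draft.
        prompt = (f"{instruction}\n\nYour previous draft:\n{draft}\n\n"
                  "Fix these problems: " + " ".join(failures))
    raise RuntimeError("Requirements not met within budget")  # an explicit failure mode
```

The same loop accommodates the legal-brief case: a validator that parses citations and looks them up would return a hint, or reject the draft outright, instead of a word count. Either way, a failed requirement is detected rather than silently shipped.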

That makes sense. When an LLM fails, it keeps trying until the sub-task is solved. Is the model invoked every time?

Fulton: Not necessarily. Mellea breaks the problem into pieces and only uses the language model when needed. It makes no sense to run a large language model on state-of-the-art GPUs to solve a relatively simple problem. A language model does its computation in natural language: whether you tell it to write a program or solve a mathematical problem, the problem is processed as text. A math problem can be reformulated as arithmetic, run on a calculator, and the answer returned in natural language.
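A toy illustration of that selectivity, again a sketch rather than Mellea internals: an arithmetic sub-task is answered by ordinary deterministic code, and the model is invoked only for everything else.

```python
import ast
import operator

# Deterministic calculator for +, -, *, / over numeric literals.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Evaluate plain arithmetic without any model call."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not plain arithmetic")
    return ev(ast.parse(expr, mode="eval").body)

def solve_subtask(subtask: str, call_model) -> str:
    """Use cheap code where it suffices; invoke the model only as a fallback."""
    try:
        return str(calc(subtask))   # "(17 + 5) * 3" never touches a GPU
    except (ValueError, SyntaxError):
        return call_model(subtask)  # everything else goes to the language model
```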

So, by decomposing the problem, you can use the LLM more selectively?

Strobelt: Exactly. If you can break a long prompt into smaller pieces, you can reduce model size because each instruction is smaller. This is a classic divide-and-conquer approach. You connect the components, and you can run some in parallel, but each can be optimized individually.
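That structure is easy to picture in code. In this hypothetical sketch, independent sub-prompts run concurrently against a small model (`call_model` is again a placeholder) and a plain Python function merges the answers:

```python
from concurrent.futures import ThreadPoolExecutor

def divide_and_conquer(subtasks, call_model, combine):
    """Fan independent sub-prompts out in parallel, then merge the results.
    Each prompt is short, so a small model can handle it."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(call_model, subtasks))
    return combine(results)

# Hypothetical usage: summarize report sections independently, then join them.
# sections = ["Summarize the introduction: ...",
#             "Summarize the methods: ...",
#             "Summarize the results: ..."]
# report = divide_and_conquer(sections, call_model, "\n\n".join)
```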

Why has IBM embraced small models?

Fulton: They’re more energy efficient, and they use shorter prompts, which consume less compute. LLMs require top-of-the-line chips, which get very hot and drive up inferencing energy costs. Small models don’t need watt-hungry chips or all the cooling apparatus.

What does your collaborative process look like?

Fulton: When we started, we just wrote code together, side by side. As the scope widened, we became co-leads and hired two software developers. That’s been enormously helpful.

Strobelt: We now have a standup meeting every morning at 10 am and have been out advertising Mellea within IBM.

How have you each approached the project?

Fulton: I like thinking about how we should design systems. Hendrik is interested in building things that are a pleasure to use.

Strobelt: I want Mellea to be intuitive and easy to use. Nathan wants to build a software system. His theoretical background and my UX focus can lead to very good discussions.

How does Mellea differ from agent frameworks like LangChain or DSPy?

Fulton: Mellea is designed for writing structured programs that can decompose complex tasks into smaller, checkable steps. It provides a mechanism to enforce step-by-step constraints. You can do this with other frameworks, but Mellea has an opinionated style of programming. We’re building it for software engineers who are designing robust systems that need to work in real life.

Strobelt: Mellea doesn’t lock you into the agentic software pattern, which can be expensive. If you’re a business, you don’t need a cannon to shoot a bird.

What does long-term success look like?

Fulton: A co-designed software stack and model in the open space. It’s already happening in the closed space. It could start happening using the software we provide. IBM benefits by commodifying the stack and models.

Strobelt: We built Mellea for the long tail of the hype cycle. If you can run small models, you can run more tokens because each token is cheaper. You can run validation calls and still save money. Mellea is super simple. It does 10 simpler things instead of one complex thing, using the old-school idea of divide and conquer. You break a problem into smaller pieces. You have a backend and context. It’s much more like engineering.

What excites you about the future of AI?

Strobelt: Creating applications that can help us find cures for diseases or discover underlying principles of how the world works.

Fulton: To build software you used to have to learn a lot of math and engineering. AI is exciting because you can build very powerful things – and anyone can do it. If I were a PhD student now, I’d probably be studying general purpose robotics — in China, where there’s an ecosystem we don’t have. Robotics will probably have its ChatGPT moment in the next few years.

Where do you hope to be when robotics takes the world by storm, as ChatGPT did three years ago?

Strobelt: Living in a less politicized world with my friends and family and being inspired by the creativity of my younger colleagues.

Fulton: Retired and living in the woods!
