
For LLMs, IBM’s NorthPole chip overcomes the tradeoff between speed and efficiency

Lower latency usually comes at the expense of energy efficiency, but in newly released experimental results, the brain-inspired NorthPole research prototype chip achieved considerably lower latency than the next most energy-efficient GPU, at much higher energy efficiency than the next fastest one.

A 2U server blade containing NorthPole cards


As researchers race to develop the next generation of computer chips, AI is at the front of their minds. With the recent explosion of generative AI use, including large language models, it’s become clear that traditional CPUs and GPUs are struggling to provide the necessary combination of speed and energy efficiency. To deliver AI at scale — especially for agentic workflows and digital workers — the hardware running these models will need to work faster. At the same time, the environmental impact of AI’s power consumption is a pressing issue, so it’s critical for AI to consume less power. At IBM Research’s lab in Almaden, California, a team has been rethinking the basics of chip architecture to achieve both aims, and their latest results show how tomorrow’s processors may consume less energy and work faster.

AIU NorthPole is an AI inference accelerator chip that IBM Research first unveiled last year. In inference tests run on a 3-billion-parameter LLM developed from IBM’s Granite-8B-Code-Base model, NorthPole achieved latency below 1 millisecond per token, 46.9 times faster than the next most energy-efficient GPU. In a thin, off-the-shelf 2U server running 16 NorthPole processors communicating over PCIe, the team behind the chip measured a throughput of 28,356 tokens per second on the same model. NorthPole reached these speeds while being 72.7 times more energy efficient than the next lowest-latency GPU.
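To get a feel for how a sub-millisecond per-token latency and a 28,356-token-per-second server throughput fit together, here is a minimal back-of-the-envelope sketch in Python. Only the two headline figures come from the reported results; the assumption that the extra throughput comes from keeping many token streams in flight across the pipelined cards is ours, for illustration.

```python
# Minimal sketch (not IBM's measurement methodology) of how per-token latency
# and pipelining relate to aggregate throughput. Only the two headline figures
# below come from the article; everything else is illustrative.

PER_TOKEN_LATENCY_S = 1e-3        # "below 1 millisecond per token" (upper bound)
MEASURED_THROUGHPUT_TPS = 28_356  # tokens per second on the 16-card server

# A single request generating tokens one at a time can never exceed 1/latency:
single_stream_tps = 1.0 / PER_TOKEN_LATENCY_S
print(f"single-stream ceiling: {single_stream_tps:,.0f} tokens/s")

# The measured server throughput is far higher, which suggests many token
# streams are kept in flight at once across the pipelined cards:
implied_streams = MEASURED_THROUGHPUT_TPS * PER_TOKEN_LATENCY_S
print(f"implied concurrent streams at the 1 ms bound: ~{implied_streams:.0f}")
```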

Chart showing energy efficiency and latency for NorthPole and four commercially available GPUs
The research prototype NorthPole has lower latency and higher energy efficiency than four GPUs that are commonly used for LLM inference.

The team presented their findings today at the IEEE Conference on High-Performance Computing. The new performance figures build on their results from last October, when the team showed that NorthPole is capable of faster and more efficient neural inference in edge applications than other chips on the market. In those experiments, NorthPole was 25 times more energy efficient than commonly available 12 nm GPUs and 14 nm CPUs, as measured by the number of frames interpreted per unit of power.

NorthPole is fabricated in a 12 nm process, and each chip contains 22 billion transistors in 795 square millimeters. In results that were published in Science, the chip also achieved lower latency than all the other chips it was tested against, even those with smaller fabrication processes. Those tests were run on the ResNet-50 image recognition and YOLOv4 object detection models, as the team was focused on visual recognition tasks for applications like autonomous vehicles. A year later, the new results come from trialing NorthPole chips on a much larger 3-billion-parameter Granite LLM.

“What is essential here is qualitative orders of magnitude in improvement. These new results are on par with our Science results — but in a completely different application domain,” says IBM Fellow Dharmendra Modha, who leads the team behind the chip’s development. “Given that NorthPole’s architecture works so well in a completely different domain, these new results underscore not only the broad applicability of the architecture, but also the importance of foundational research.”

An exploded depiction of 16 NorthPole cards in a 2U server
A standard 2U server holds four NorthPole cards in each of its four bays.

Low latency is crucial for AI to run smoothly when businesses deploy agentic workflows, digital workers, and interactive dialogues, Modha says. But there’s a fundamental tension between latency and energy efficiency — typically, improvement in one area comes at the expense of the other.

One of the major obstacles to lowering latency and power consumption in AI inference is the so-called von Neumann bottleneck. Nearly all modern microprocessors, including CPUs and GPUs, adopt a von Neumann architecture, in which memory is physically separated from processing. Although this design has historically had the advantage of being simple and flexible, shuttling data back and forth between memory and compute limits a processor’s speed. This is especially true for AI models, whose calculations are simple but numerous. And although processor efficiency has been tripling every two years, the bandwidth between memory and computation is growing at only about half that rate. Additionally, high-bandwidth memory is expensive.
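Interpreting those growth rates literally, a short sketch shows how quickly the gap compounds. Treating “tripling every two years” and “about half that rate” as 3x versus 1.5x per two-year generation is an assumption made purely for illustration.

```python
# Rough illustration of the widening compute-versus-bandwidth gap described
# above. The 1.5x bandwidth growth per generation is an assumed reading of
# "about half that rate," used only for illustration.

COMPUTE_GROWTH_PER_GEN = 3.0    # processor efficiency, per two-year generation
BANDWIDTH_GROWTH_PER_GEN = 1.5  # memory bandwidth, per two-year generation (assumed)

for years in (2, 6, 10):
    generations = years / 2
    gap = (COMPUTE_GROWTH_PER_GEN / BANDWIDTH_GROWTH_PER_GEN) ** generations
    print(f"after {years:2d} years, compute outpaces memory bandwidth by ~{gap:.0f}x")
```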

NorthPole’s design eliminates this mismatch by situating memory and processing in the same place, an approach known as in-memory computing, built entirely around on-chip memory. Inspired by the brain, which co-locates memory and processing in the connections between neurons, NorthPole tightly couples memory with the chip’s compute units and control logic. The result is a massive 13 terabytes per second of on-chip memory bandwidth.
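As a rough illustration of why that bandwidth matters for token generation: during autoregressive decoding, each new token reads essentially every model weight once, so weight traffic alone sets a latency floor. In the sketch below, the parameter count, the 13 TB/s figure, and the 4-bit weight precision described later in this article come from the reported results; the off-chip bandwidth number is a hypothetical comparison point, not a measurement of any particular GPU.

```python
# Back-of-the-envelope latency floor from weight traffic alone. The parameter
# count, 4-bit weights, and 13 TB/s come from the article; the off-chip
# bandwidth is an assumed HBM-class figure used only for comparison.

PARAMS = 3e9                     # 3-billion-parameter Granite-based LLM
BITS_PER_WEIGHT = 4              # 4-bit quantized weights
weight_bytes = PARAMS * BITS_PER_WEIGHT / 8   # ~1.5 GB of weights

ON_CHIP_BW_BPS = 13e12           # 13 TB/s on-chip memory bandwidth
ASSUMED_OFF_CHIP_BW_BPS = 2e12   # hypothetical 2 TB/s off-chip bandwidth

print(f"weight footprint: {weight_bytes / 1e9:.2f} GB")
print(f"latency floor, on-chip reads:  {weight_bytes / ON_CHIP_BW_BPS * 1e3:.2f} ms/token")
print(f"latency floor, off-chip reads: {weight_bytes / ASSUMED_OFF_CHIP_BW_BPS * 1e3:.2f} ms/token")
```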

A chart showing how the team mapped the large language model's 15 layers onto 16 NorthPole cards
The NorthPole team mapped a 3-billion-parameter LLM onto 16 cards: 14 transformer layers on one card apiece and one output layer on two cards.

The team’s next challenge was to see whether NorthPole, which was designed for edge inference, could also handle language models in data centers. At the outset, this seemed like a difficult task, given that LLMs are too large to fit in a single NorthPole chip’s on-chip memory.

To meet the challenge, the team ran the 3-billion-parameter Granite LLM on the 16-card NorthPole setup, mapping each of the model’s 14 transformer layers onto its own card and splitting the output layer across the remaining two cards. LLM inference is often limited by memory bandwidth, but in this pipelined-parallelism setup, each card’s on-chip memory holds its layer’s weights and the so-called key-value (KV) cache, so only small amounts of data need to move from card to card when generating tokens. PCIe is sufficient, and no high-speed networking is required. The model was quantized to 4-bit weights and activations, and the quantized model was fine-tuned to match the accuracy of the original.
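Here is a minimal sketch, assuming a simple sequential pipeline, of how that layer-to-card mapping might look in code. Only the counts (14 transformer layers on one card each, the output layer split across two cards, 16 cards total) come from the article; the layer names and the stub compute function are hypothetical.

```python
# Illustrative layer-to-card mapping for the 16-card pipeline described above.
# Names and the stub compute function are hypothetical.

NUM_CARDS = 16

# Cards 0..13 each hold one transformer layer; cards 14..15 split the output layer.
card_assignment = {card: [f"transformer_layer_{card}"] for card in range(14)}
card_assignment[14] = ["output_layer_half_0"]
card_assignment[15] = ["output_layer_half_1"]

def run_on_card(card, layer, activations):
    # Stand-in for the real per-card compute: weights and KV cache stay resident
    # in that card's on-chip memory, and only the activation tensor arrives and
    # leaves over PCIe. Here we simply record the pipeline order.
    return activations + [(card, layer)]

def generate_token():
    activations = []  # placeholder for the small activation tensor passed card to card
    for card in range(NUM_CARDS):
        for layer in card_assignment[card]:
            activations = run_on_card(card, layer, activations)
    return activations

trace = generate_token()
print(trace[0], "...", trace[-1])  # (0, 'transformer_layer_0') ... (15, 'output_layer_half_1')
```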

A blueprint schematic showing how the NorthPole chips fit onto PCIe cards, which in turn fit into a server, and how the server racks make up an IBM AIU NorthPole cluster
IBM Research scientists are working toward server racks filled with hundreds of NorthPole cards to perform high volumes of inference operations at faster speeds and with lower energy consumption than comparable GPU-based hardware.

What’s next

Based on the success of their latest experiments, Modha says that his team is now working to build units containing more NorthPole chips, and they have plans to map even larger models onto them.

As groundbreaking as the new performance results are, Modha is confident that his team can push the frontier by further orders of magnitude, increasing NorthPole’s energy efficiency while reducing its latency. The key, he says, is innovating across the entire vertical stack: co-designing algorithms from the ground up to run on this next-generation hardware, taking advantage of technology scaling and packaging, and imagining entirely new systems and inference appliances — advances that he and others at IBM Research are currently working on.