The future of AI is in your hands
New research suggests the AI industry is rapidly approaching a point where local devices that many more people have access to can run smaller, more capable models to answer most of their queries. IBM’s latest Granite models are up to the task.
In photography, there’s an adage: The best camera is the one you have on you. The same has held for the computing revolution of the last two decades. The laptops and phones in our hands are often much easier to reach than powerful cloud clusters, and they’re frequently capable enough to get the job done. That’s quickly proving true for AI, as well.
Large language models have progressed rapidly in capability over the last few years, though often at the cost of ballooning size. The largest frontier models require hundreds of gigabytes to store, and even more memory to run properly. Even high-powered, modern laptops struggle to run these massive LLMs locally, and calls to the cloud can take longer to return an answer than users are willing to wait.
But recent research suggests that much smaller LLMs can adequately handle the large majority of queries an average user makes in a given day. It’s something IBM Research has seen the value of in its own model development strategy. The team recently launched Granite 4.0 Nano, its smallest set of language models yet, designed specifically to run on edge devices like phones, laptops, and tablets.
Hazy Research, an influential computer science research lab at Stanford University, sees a step change coming for the AI industry. In a new blog post today, they argue that we’re currently in an era where “monolithic data centers and frontier models handle most inference traffic.” The world’s infrastructure providers and hyperscalers have scrambled to keep up with the power and computational demands of modern AI models. Hazy’s research suggests this might soon change: the world is on the precipice of a new era for AI, in which smaller models and more powerful local hardware could be enough to disaggregate much of how AI is used in the future.
The Stanford team’s research found that today’s best local language models can accurately respond to “88.7% of today’s single-turn naturalistic chat and reasoning queries.” Keeping those queries on-device, using hardware as quotidian as a MacBook Pro, can drastically reduce energy consumption by avoiding the round trip to an external server. Their argument is that the AI industry should begin to factor these energy costs into how it measures performance. They propose a new metric, which they call “intelligence per watt,” and have released TrafficBench, a new benchmark for comparing models and accelerators with energy efficiency in mind.
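The article doesn’t spell out exactly how the metric is computed, but one simplified reading is task accuracy divided by average power drawn during inference. Here is a minimal sketch under that assumption; the function name and the numbers are illustrative, not values from TrafficBench.

```python
# Hypothetical sketch of an "intelligence per watt" style metric:
# accuracy on a query set divided by average power drawn while answering it.
# The figures below are illustrative assumptions, not TrafficBench results.

def intelligence_per_watt(correct_answers: int, total_queries: int,
                          energy_joules: float, wall_time_s: float) -> float:
    """Accuracy per watt: (correct / total) / (energy / time)."""
    accuracy = correct_answers / total_queries
    avg_power_watts = energy_joules / wall_time_s
    return accuracy / avg_power_watts

# Example: a local model answers 887 of 1,000 queries correctly,
# drawing 30 W on average over a one-hour run.
local = intelligence_per_watt(887, 1000, energy_joules=30 * 3600, wall_time_s=3600)

# A cloud model is slightly more accurate but draws far more power per query.
cloud = intelligence_per_watt(960, 1000, energy_joules=400 * 3600, wall_time_s=3600)

print(f"local: {local:.5f} accuracy/W")  # higher intelligence per watt
print(f"cloud: {cloud:.5f} accuracy/W")
```

Under a framing like this, a local model can win on efficiency even when a frontier model in the cloud is somewhat more accurate.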
The Hazy team found that the gap between local and cloud accelerators is closing, and with newer local hardware, such as Apple’s M4 Max and Nvidia’s DGX Spark, it’s closing quickly: there are now “consumer laptops with 128GB unified memory that can run models approaching 200B parameters.”
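As a rough illustration of why 128GB of unified memory can approach that scale, weights stored at low precision take well under a byte per parameter. The back-of-envelope sketch below assumes common quantization levels and a 20% runtime overhead for cache and activations; both figures are assumptions for illustration.

```python
# Back-of-envelope memory estimate for running a large model locally.
# Bytes-per-parameter values reflect common quantization levels; the 20%
# overhead for KV cache and activations is an illustrative assumption.

def weight_memory_gb(params_billions: float, bytes_per_param: float,
                     overhead: float = 0.20) -> float:
    """Approximate memory needed to hold the weights plus runtime overhead."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return weight_bytes * (1 + overhead) / 1e9

for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"200B params @ {label}: ~{weight_memory_gb(200, bpp):.0f} GB")

# Approximate output:
#   fp16  -> ~480 GB  (far beyond a laptop)
#   int8  -> ~240 GB  (still too large)
#   4-bit -> ~120 GB  (fits within 128GB of unified memory)
```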
The situation becomes even clearer with state-of-the-art local language models. Their tests of OpenAI’s gpt-oss, Qwen’s Qwen3, and IBM’s Granite 4.0 showed intelligence per watt considerably higher than that of older-generation local models running on specialized hardware. The Stanford team predicts that local intelligence efficiency will double to triple each year, driven by further improvements in hardware, model architectures, training, and distillation methods.
The Stanford team tested four Granite models as part of their research: Granite-4.0-H-Micro, Granite-4.0-Micro, Granite-4.0-H-Tiny, and Granite-4.0-H-Small. Recently hailed as the “Western Qwen,” Granite 4.0 is IBM’s family of open, trusted small models. They’re fast and nimble enough to run on the edge with minimal memory requirements, yet ready to run at scale. The models were built with enterprise in mind, with transparent data sources, an ISO 42001-certified build process, and cryptographically signed weights.
IBM’s family of models is released under the business-friendly Apache 2.0 license, and the latest Granite 4.0 Nano models have been trending on Hugging Face while outperforming similarly sized models from Qwen, Liquid AI, and Google on benchmarks across general knowledge, math, code, and safety.
In the future, the way we interact with models will continue to evolve. But in many instances, small, local models that don’t need to call out to a distant server to get a satisfactory answer will prove more valuable, whether in space-constrained places like wearables or in places where internet access isn’t guaranteed. And IBM is continuing to work on models that can run wherever the world needs them.