Expanding AI model training and inference for the open-source community

At PyTorch Conference 2025, IBM is showing off new open-source AI software tools that are redefining what's possible in a multi-accelerator AI world.

At PyTorch Conference 2025, IBM Research is showcasing its deep contributions to the open-source AI ecosystem — from advancing kernel performance in vLLM through Triton, to scaling efficient LLM training with torchtitan. These projects demonstrate IBM’s commitment to open, modular AI infrastructure that spans research and enterprise deployment, built on community-driven technologies like PyTorch and vLLM.

A major focus of this year's presence is enabling performance portability through Triton, the PyTorch-native kernel programming framework. IBM researchers collaborated with AMD and Red Hat to develop a fully contained attention backend for vLLM that runs efficiently on multiple GPU platforms without relying on proprietary libraries. This work not only broadens hardware support but also sets a foundation for the next wave of open, extensible inference systems, directly benefiting the broader PyTorch and vLLM communities.

On the training front, IBM Research teams are demonstrating how open-source training stacks can achieve production-grade efficiency. Using torchtitan, the new PyTorch-native training framework, IBM researchers successfully trained one of the first Llama 3-derived 70B models from an open repository — achieving the same quality with just one-third of the original training budget. This milestone, powered by FP8 precision and a high-throughput data loader contributed by IBM, exemplifies how PyTorch continues to evolve into a fully capable platform for large-scale model development.

IBM recently announced the Spyre AI accelerator for IBM Z and Power Systems, integrated with vLLM and torch.compile, demonstrating IBM’s commitment to advancing heterogeneous compute. In 2026, IBM plans to deepen this integration across the PyTorch and vLLM ecosystems — enabling Spyre to participate as a first-class backend in open-source inference workflows. This next phase will focus on unified compiler pathways, optimized kernel generation, and transparent multi-accelerator scheduling, ensuring that developers can deploy large language models seamlessly on Spyre.

Together, these efforts illustrate IBM's role as a leading contributor to the PyTorch Foundation’s open-innovation ecosystem. By improving vLLM with hardware-agnostic kernels, scaling training efficiency with torchtitan, and exploring new accelerator integrations, IBM is helping define the blueprint for open AI infrastructure evolution: efficient, transparent, and built by the community.

Improving vLLM with hardware-agnostic kernels

A robust project with more than 1,600 contributors, vLLM recently underwent a redesign, called vLLM V1, intended to simplify its codebase and make it more extensible, while also enabling all of its performance enhancements by default and making them work together seamlessly. As a result of the overhaul's architectural changes, though, the only attention kernels that could run on a GPU came from the external FlashAttention library, which runs only on NVIDIA GPUs.

"Red Hat is committed to enabling vLLM to help organizations more easily build and deploy AI solutions with any model, using any accelerator, across any cloud," said Taneem Ibrahim, director of engineering for AI inference at Red Hat. "We are pleased to collaborate with AMD and IBM Research to deliver enhanced efficiency and performance for vLLM on multi-GPU instances."

That’s where IBM Research principal research scientist Thomas Parnell and his team came in. They developed a solution for AMD GPUs built on Triton kernels. Triton, an open-source project used by torch.compile, is a framework for writing GPU kernels, and to some extent it is platform-independent, making it more portable than custom NVIDIA CUDA code, said Parnell. "You can write a Triton kernel and it will work on NVIDIA GPUs, AMD GPUs, Intel GPUs."
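For readers who haven't used Triton, the sketch below shows what that portability looks like in practice. It is the standard vector-add example rather than anything from vLLM: the kernel is written once in Python, and Triton compiles it for whichever GPU backend is present.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-sized chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# The same source runs on any GPU backend Triton supports (e.g. CUDA or ROCm);
# device="cuda" also maps to AMD GPUs on ROCm builds of PyTorch.
x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```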

His team's first effort worked on AMD GPUs, but it was slow: 6x slower than vLLM V0, which was implemented using custom C++/HIP kernels. The challenge was understanding how Triton kernels behave. Part of the reason for the slowdown was that vLLM V1's scheduler interleaved prefill and decode requests in the same batch, which is hard on Triton kernels that tend to favor either prefill or decode performance, not both at the same time. Over the next few months, the team performed a series of kernel-level optimizations to close that gap, along with other improvements targeting models that use grouped query attention, like IBM's Granite 3.0 family. Benchmark testing showed they were able to improve token throughput for vLLM V1 on AMD hardware by more than 10% over vLLM V0.
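The reference to grouped query attention is worth unpacking: it shrinks the key-value cache by letting several query heads share a single key/value head. Below is a toy PyTorch sketch of the idea, with made-up head counts; it materializes the shared heads with repeat_interleave for clarity, whereas an optimized Triton kernel would index them directly.

```python
import torch

# Illustrative shapes only (hypothetical, not Granite's actual configuration):
# 32 query heads share 8 key/value heads, i.e. 4 query heads per KV head.
batch, seq, n_q_heads, n_kv_heads, head_dim = 2, 128, 32, 8, 64
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # KV cache is 4x smaller
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each KV head across its group of query heads, then run standard attention.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 32, 128, 64])
```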

This week at PyTorch Conference 2025, Parnell is scheduled to present the technical details of the project: how the team developed the kernels, what worked and what didn't, and the sequence of optimizations on the IBM side. He will present alongside AMD principal member of technical staff Aleksandr Malyshev, who will outline the kernel improvements and additional optimizations AMD contributed.

The benefit to the community, according to Parnell, is that vLLM now has a fully self-contained attention backend. The attention mechanism is at the heart of modern transformer-based LLMs. "It doesn't use any external dependencies, it doesn't use any libraries like FlashAttention or FlashInfer, and it works on multiple GPU platforms," Parnell said. And because these kernels are owned and maintained by the vLLM team, there is great flexibility to adapt them to new models as they come out.
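For developers who want to try the backend, vLLM exposes the choice of attention implementation through the VLLM_ATTENTION_BACKEND environment variable. The snippet below is a hedged sketch: the variable is real, but the exact name of the Triton backend varies across releases, so treat the value as an assumption and check the documentation for your installed version.

```python
# Hedged sketch: selecting vLLM's attention implementation when serving a model.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"  # assumed backend name

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.0-8b-instruct")  # any Hugging Face model ID works
params = SamplingParams(max_tokens=64)
outputs = llm.generate(["Briefly explain paged attention."], params)
print(outputs[0].outputs[0].text)
```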

Moving forward, Parnell and his colleagues are working to reduce the launch overhead of these Triton kernels to achieve better performance and lower latency. They're also expanding the scope of optimizations beyond attention, benefiting IBM's recently released Granite 4, a family of hybrid models that uses Mamba layers in addition to attention layers. Parnell will be demonstrating this work at PyTorch Conference 2025. "We contributed a lot of kernels, and a lot of optimizations that improve Granite 4, as well as the way we manage the state for hybrid models," Parnell said. That collaboration on making the Triton kernels work is also what allowed AMD to offer day-0 support for Granite 4.

On the topic of kernels, Meta's PyTorch Team is scheduled to announce its public beta for Helion, a domain-specific language for authoring kernels. This DSL compiles down to Triton and is intended to simplify kernel development.

A new model training milestone on IBM's journey with torchtitan

About a year ago, research scientist Linsong Chu's team at IBM Research switched from a proprietary software stack to torchtitan. Part of this switch involved integrating a high-throughput data loader that lets developers save training checkpoints more efficiently, spread LLM training workloads across machines, and reconfigure allocations mid-job. For Chu, the choice was clear: keep spending significant effort maintaining IBM's own training stack to stay competitive, or participate actively in the open-source community.
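As a rough illustration of the pattern (not the torchtitan or IBM API), a data loader becomes checkpointable and rescalable when it exposes its position as state that can be saved and restored alongside the model:

```python
# Minimal sketch of a checkpointable data loader; illustrative only.
from typing import Iterator

class StatefulLoader:
    """Streams documents and can save/restore its position for mid-job restarts."""

    def __init__(self, dataset: list, rank: int, world_size: int):
        self.dataset = dataset
        self.world_size = world_size
        self.position = rank               # each rank starts at its own offset

    def __iter__(self) -> Iterator[str]:
        while self.position < len(self.dataset):
            item = self.dataset[self.position]
            self.position += self.world_size   # stride across ranks; advance before yielding
            yield item

    def state_dict(self) -> dict:
        # Saved alongside the model checkpoint so training can resume exactly here.
        return {"position": self.position, "world_size": self.world_size}

    def load_state_dict(self, state: dict) -> None:
        # A production loader would also remap shards if world_size changed on restart.
        self.position = state["position"]

loader = StatefulLoader([f"doc_{i}" for i in range(10)], rank=0, world_size=2)
print(next(iter(loader)))          # "doc_0"
ckpt = loader.state_dict()         # persisted with the training checkpoint
```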

At this year's PyTorch conference, Chu and his team are presenting the fruits of that project: they've trained a new branch of the Llama 3 70B model, yielding one of the first models trained with torchtitan from an open repository. IBM researchers trained it with fewer resources than the original required, according to Chu. Part of the savings came from using half as many tokens, thanks to careful data curation, and part came from training in FP8 precision, which wasn't yet available when the original model was trained in FP16 precision.

"FP8 gave us 1.5 times training speed, plus half the token count gave us only one-third of the training budget — but with the same quality of model," Chu said. This proof-of-concept model was made possible by torchtitan and the features IBM contributed, including the data loader released last year. While this new model is just meant to be a research vehicle for internal use, this accomplishment shows that PyTorch has the capability to train a production model out of an open-source repo, added Chu.

This project and the others presented at PyTorch 2025 emphasize IBM's commitment to participating in the open-source community. "We've been contributing kernels and new models," Chu said. "We want to make sure the device can support transformers, hybrid models, and whatever comes next."

A vLLM and PyTorch layer for the Spyre AI accelerator

IBM has adopted vLLM as its inference runtime and torch.compile as the frontend compiler for integrating emerging accelerators such as Spyre. As part of this work, IBM Research developed a new backend compiler that interfaces cleanly with torch.compile, making it possible to add Spyre to a user's stack with minimal effort. The team also contributed a Spyre plugin for vLLM that enables paged attention, an optimization that improves memory efficiency and scalability for large language model inference.
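To make the torch.compile integration concrete, here is a minimal sketch of how any backend compiler hooks in; it is a generic PyTorch example, not IBM's Spyre compiler. torch.compile hands the captured FX graph to the backend callable, and a real accelerator backend would compile that graph for its hardware.

```python
# Minimal custom torch.compile backend (not IBM's Spyre compiler).
import torch

def toy_backend(gm: torch.fx.GraphModule, example_inputs):
    print(gm.graph)        # inspect the operations torch.compile captured
    return gm.forward      # return a callable that runs the graph (here: eagerly)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
compiled = torch.compile(model, backend=toy_backend)
print(compiled(torch.randn(2, 16)).shape)   # torch.Size([2, 16])
```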

Paged attention divides the key-value (KV) cache of an LLM into smaller, addressable memory blocks — or "pages" — that can be fetched on demand. Since the KV cache functions as the model's short-term memory, it can quickly become a bottleneck for long or complex outputs. The Spyre plugin exposes hardware-level paging constructs to vLLM, ensuring efficient retrieval and management of these pages while maintaining compatibility with the broader PyTorch and vLLM stack.
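The sketch below is a toy version of that bookkeeping, illustrative only and not vLLM's or Spyre's actual data structures: a block table maps each sequence's logical pages to physical blocks in a shared, non-contiguous pool, which is exactly the kind of mapping described next.

```python
# Toy paged KV cache: logical pages per sequence, physical blocks in a shared pool.
import torch

page_size, num_pages, num_heads, head_dim = 16, 64, 8, 64
kv_pool = torch.zeros(num_pages, 2, page_size, num_heads, head_dim)  # shared physical pool
free_pages = list(range(num_pages))
block_tables: dict[int, list[int]] = {}   # sequence id -> list of physical page ids

def append_token(seq_id: int, pos: int, k: torch.Tensor, v: torch.Tensor):
    """Write one token's K/V into the page that covers position `pos`."""
    table = block_tables.setdefault(seq_id, [])
    if pos // page_size == len(table):        # this logical page has no physical block yet
        table.append(free_pages.pop())
    page = table[pos // page_size]            # logical page -> physical page
    kv_pool[page, 0, pos % page_size] = k
    kv_pool[page, 1, pos % page_size] = v

# Two sequences grow independently but share the same physical pool.
for pos in range(20):
    append_token(0, pos, torch.randn(num_heads, head_dim), torch.randn(num_heads, head_dim))
    append_token(1, pos, torch.randn(num_heads, head_dim), torch.randn(num_heads, head_dim))
print(block_tables)   # e.g. {0: [63, 61], 1: [62, 60]}
```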

According to Mudhakar Srivatsa, IBM Research distinguished engineer and PyTorch layer technical lead for Spyre, the key challenges included mapping logical page identifiers to physical hardware addresses, and implementing paged attention kernels capable of operating on non-contiguous memory regions. IBM's implementation addresses both, providing a plug-and-play pathway for Spyre to participate in vLLM inference pipelines. "Any sort of agentic application built against a vLLM endpoint really sees no change from a development perspective," said Srivatsa. "BeeAI, LangChain, or any framework that already supports vLLM can transparently substitute this Spyre endpoint."

Christian Jacobi, IBM Fellow and CTO of IBM Systems Development, will deliver a keynote on purpose-built enterprise AI hardware on Wednesday. Conference attendees are also invited to visit IBM's booth, where they can see a new z17 up close, as well as demos showcasing Granite, vLLM, and llm-d.

Looking ahead, IBM's work with the Spyre AI accelerator represents an important step toward a more diverse and open compute landscape. While early integration with vLLM and torch.compile demonstrates the feasibility of bringing Spyre into open-source inference workflows, deeper enablement is planned for 2026. This next phase will focus on tighter compiler integration, unified runtime interfaces, and expanded support for multi-accelerator scheduling — continuing IBM’s broader mission to make heterogeneous compute a seamless part of the PyTorch and vLLM ecosystems.
