Training AI models faster than ever

At PyTorch 2024, IBM demonstrated a world-class data loader and new milestones in high-throughput AI model training that aim to empower the open-source AI community.

At this year’s PyTorch Conference, IBM Research scientists are presenting their contributions to the open-source model training framework, including a data loader that can seamlessly handle massive quantities of data, and new improvements to large language model training throughput.

To deliver LLMs’ ever-increasing capabilities cost effectively, we need to continuously improve the efficiency and robustness of the cloud infrastructure that supports their training, tuning, and inference. The open-source PyTorch framework and ecosystem have played a major role in supporting the AI revolution that’s poised to change our lives, and in recognition that it can’t be done alone, IBM joined the PyTorch Foundation last year and continues contributing new techniques and technologies to the AI community.

Along with IBM’s previous contributions, these new tools are shoring up PyTorch’s ability to meet the ever-growing needs of the community — whether they have to do with using GPUs more efficiently, pushing data loading to be nimbler, or making checkpointing more cost-effective.

A world-class data loader for training and tuning foundation models

PyTorch users can now take advantage of a high-throughput data loader that lets them seamlessly spread LLM training workloads across machines, and even reconfigure their allocations mid-job. It also lets developers save checkpoints more efficiently when training models to ensure that work isn’t duplicated. And it’s all thanks to a team of researchers who were simply building the tools they needed to get a job done.

IBM Research scientist Davis Wertheimer and his colleagues didn’t initially set out to build a world-class data loader, but when they started working with the IBM Natural Language Processing (NLP) team on model training tools, there wasn’t one out there that could do what they needed it to do. Their work required something that could manage and stream massive quantities of data across multiple devices, while keeping pace with increasingly efficient GPUs.

They started with working sessions where they walked through the data loader the NLP team was already using, to make sure the new one would cover everything it could do. Take, for example, a team training a model on eight nodes with eight GPUs each. With their previous, Megatron-based data loader, a single CPU process streamed all the data and then sent it out to the worker machines. Only one process was doing data loading, which created a training bottleneck. “You can't stream 64 GPUs’ worth of text through a single CPU at this scale,” says Wertheimer. “So, we had to build something that was distributed, partitioned, and asynchronous, with the ability to coordinate across devices without constant communication overhead.”
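
The core idea, partitioning the stream across ranks and asynchronous worker processes instead of funneling everything through one CPU, can be sketched in a few lines of PyTorch. This is a minimal illustration of the pattern, not IBM’s implementation; `list_shards` and `read_documents` are hypothetical helpers standing in for whatever reads your corpus.

```python
# Minimal sketch of rank-partitioned, streaming data loading in PyTorch.
# Illustrative only: `list_shards` and `read_documents` are hypothetical helpers.
import torch.distributed as dist
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class PartitionedTextDataset(IterableDataset):
    def __init__(self, shard_paths):
        self.shard_paths = shard_paths  # e.g. one file per data shard

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        wid = info.id if info else 0
        nworkers = info.num_workers if info else 1
        # Each (rank, worker) pair owns a disjoint slice of the shards,
        # so no single process has to stream data for everyone.
        stride = world * nworkers
        offset = rank * nworkers + wid
        for path in self.shard_paths[offset::stride]:
            yield from read_documents(path)  # hypothetical shard reader

loader = DataLoader(PartitionedTextDataset(list_shards("corpus/")),
                    batch_size=8, num_workers=4)  # asynchronous workers per GPU
```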

Through a process of iterating, learning from experience, and fixing along the way, they built their PyTorch-native data loader. “The whole time, we were being very ad hoc,” Wertheimer says. “Here’s a problem — let’s fix it. Here’s a new problem — let’s fix it.” The whole idea was to eventually move over to something more formalized, but instead this became the final product.

The resulting tool is well suited to LLM training in research contexts where you have all your raw text data and want to try a different tokenizer or maximum sequence length, or rerun a training job with a different weighting of sub-datasets in the mixture. The data loader means you don’t have to recreate your dataset every time you make changes like these; you can just tell it on the fly what you want. Even halfway through a job, you can scale your number of GPUs up or down if your resource quota changes, and the data loader will ensure that previously seen data isn’t revisited. “It’s meant to be adaptable and dynamic,” Wertheimer says.
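
Checkpointing and rescaling hinge on the loader itself being stateful: if the dataset can report how far it has read, a resumed or reconfigured job can skip what it has already consumed. Below is a much-simplified sketch of that idea, not IBM’s data loader; `stream_documents` is a hypothetical helper, and a production loader would track position per shard and per rank rather than naively re-streaming and skipping.

```python
# Simplified sketch of a stateful, resumable dataset. `stream_documents` is a
# hypothetical helper that mixes sub-datasets according to the given weights.
from torch.utils.data import IterableDataset

class ResumableTextDataset(IterableDataset):
    def __init__(self, shard_paths, mixture_weights=None):
        self.shard_paths = shard_paths
        self.mixture_weights = mixture_weights  # sampling weights, changeable per run
        self.docs_seen = 0                      # progress saved with each checkpoint

    def state_dict(self):
        return {"docs_seen": self.docs_seen}

    def load_state_dict(self, state):
        self.docs_seen = state["docs_seen"]

    def __iter__(self):
        stream = stream_documents(self.shard_paths, self.mixture_weights)
        for i, doc in enumerate(stream):
            if i < self.docs_seen:              # don't revisit data already used
                continue
            self.docs_seen = i + 1
            yield doc
```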

In stress testing, they streamed 2 trillion tokens through it over a month of continuous operation. “Nothing freaked out or died,” says Wertheimer proudly. In further testing, they’ve observed it loading over 90,000 tokens per second per worker, which on 64 GPUs translates to roughly half a trillion tokens per day. “We know it can go faster, because when we ran this, it was a training job for a tiny model on GPU, and training the tiny model was the bottleneck — not the data loader.”
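
The daily figure is simple arithmetic on the per-worker rate, assuming one data-loader worker per GPU:

```python
# Back-of-the-envelope check on the quoted throughput.
tokens_per_sec_per_worker = 90_000
workers = 64                               # one worker per GPU in this scenario
seconds_per_day = 24 * 60 * 60
print(tokens_per_sec_per_worker * workers * seconds_per_day)  # ~4.98e11 tokens/day
```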

The team is now working with PyTorch to integrate the data loader into torchtitan, PyTorch’s open-source training platform. In the meantime, the data team is already using it, and the last few fixes they’ve pushed have taken it from a good working solution to a fully fledged data loader, says Wertheimer.

Maximizing training throughput

When it comes to model training at scale, everything moves at the speed of the slowest piece — that’s just how bottlenecks happen. In AI workloads, this bottleneck is often how efficiently the GPU is being used. Another IBM team is working on this problem, looking at efficient ways to communicate across nodes and across GPUs.

One part of this strategy is fully sharded data parallel (FSDP), which shards a model’s parameters, gradients, and optimizer states across multiple machines while each one trains on its own slice of the data, so no individual GPU has to hold the entire model. This approach, which enables faster AI training with fewer GPUs, has been shown to significantly improve model training and tuning speed and efficiency. IBM Research staff and their partners have found major gains in throughput when using it with torch.compile, which optimizes how PyTorch code is executed, speeding up training and inference of machine learning models.
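
In PyTorch the two pieces compose directly: a model is wrapped in FSDP so its states are sharded across GPUs, and the wrapped module is then passed through torch.compile. The sketch below shows the general pattern under a standard torchrun launch; `build_model`, `train_loader`, and the loss convention are placeholders rather than IBM’s training stack.

```python
# Minimal sketch of FSDP combined with torch.compile; run under torchrun.
# `build_model` and `train_loader` are hypothetical placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()
model = FSDP(model, use_orig_params=True)   # shard params, grads, optimizer state
model = torch.compile(model)                # optimize the per-GPU compute graph

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for batch in train_loader:                  # placeholder data loader
    loss = model(**batch).loss              # assumes the model returns its own loss
    loss.backward()                         # FSDP handles gradient communication
    optimizer.step()
    optimizer.zero_grad()
```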

“We at IBM were among the first, along with Meta, to train a model using torch.compile and FSDP,” says IBM Research scientist Linsong Chu. “In March of this year, we demonstrated a proof point of the fastest training possible to date.” In those results, the team trained a Granite 7B model on 2 trillion tokens at a rate of 4,550 tokens per second per GPU on A100 GPUs. This model was released earlier this month on Red Hat Enterprise Linux AI (RHEL AI).

To a significant degree, he says, these gains are made possible by the raw capacity that’s already sitting on servers — but which developers didn’t previously have ways to harness. Using these two tools together makes it possible to link up these capabilities and go even further.

This work ran in parallel with the data loader advances: as the team drove GPU utilization up with FSDP and torch.compile, the data loader, rather than the GPUs, became the bottleneck. “torch.compile made it so we could go faster on the GPUs, but it killed the data loader,” Chu says. “So, we had to fix the data loader so we could actually go faster.” Together, these tools were used to train IBM’s Granite 7B base model.

Since then, the target has moved. With this much horsepower in the GPUs, the question is, “how can we make it even faster?” says IBM Research scientist Raghu Ganti. Using FSDP with torch.compile helped the team reach their internal training milestone, but now they’re adding FP8 (8-bit floating point), a low-precision datatype supported by Nvidia H100 GPUs, into the mix. This combination has shown up to 50% gains in throughput, Ganti says. “A 50% throughput improvement is not a joke in this field. Think about what that means for infrastructure cost reduction. That’s a big deal.”
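
The gist of FP8 training is that tensors are rescaled into the narrow FP8 range before casting, with the scale carried alongside in higher precision; in practice, libraries apply this scaling per tensor inside the matmul kernels on H100-class hardware. The sketch below only illustrates that scaling idea with PyTorch’s float8_e4m3fn dtype on a recent PyTorch build; it is not the team’s training recipe.

```python
# Illustrative sketch of dynamic FP8 scaling, not IBM's training recipe.
# Requires a recent PyTorch build with float8 dtypes; the throughput benefit
# comes from running the scaled matmuls on H100-class GPUs.
import torch

E4M3_MAX = 448.0                      # largest finite value in float8_e4m3fn

def to_float8(x: torch.Tensor):
    # Scale so the tensor's largest magnitude lands near the FP8 range limit,
    # then cast; the scale stays in higher precision for dequantization.
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4096, 4096)
x_fp8, scale = to_float8(x)
x_back = x_fp8.to(torch.float32) / scale      # dequantize for comparison
print(x_fp8.dtype, (x - x_back).abs().max())  # small quantization error
```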

What’s next

FP8 isn’t widely available for developers to use yet, but the team is working on experiments that will demonstrate its capabilities, says Ganti. In other work, they’re using torch.compile to optimize model training and tuning on IBM’s Artificial Intelligence Unit (AIU).

Ganti, Wertheimer, and their colleagues will also be focusing on Triton, OpenAI’s open-source language for GPU programming. Triton lets developers write kernels in Python and compiles them into code for the specific hardware, whether that’s from Nvidia or Intel, speeding up compute. So far, Triton is about 10-15% behind the speed of CUDA, the standard software platform for programming Nvidia GPUs, but the researchers recently used Triton to run the first-ever end-to-end CUDA-free inference. The effort is picking up speed, and they’re optimistic that Triton will both close this gap and further optimize training.
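
To give a sense of what writing a kernel in Triton looks like, here is the standard vector-add example from Triton’s tutorials; it is unrelated to IBM’s kernels, but it shows the Python-level programming model that then gets compiled for the target hardware.

```python
# Canonical Triton vector-add kernel: Python code compiled to a GPU kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                   # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                   # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                # one program per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)
```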

Taken together, these tools are moving faster cloud-based model training out of the lab and into the hands of the community.
