About
IBM is proud to sponsor the PyTorch Conference 2025 – the world’s premier event dedicated to the framework powering today’s most groundbreaking AI innovations. Connect with AI pioneers, researchers, developers, and startup founders through deep-dive technical sessions, panels, and workshops on AI, from bare metal all the way up to the application and agent layers. Our program features keynotes from visionary AI leaders, interactive sessions on scaling and benchmarking models, and special tracks focusing on AI safety and ethical development.
Whether you’re an experienced ML engineer, researcher, or developer, PyTorch Conference 2025 is your gateway to the future of AI. Join the community that’s creating the AI revolution, not just witnessing it.
Why attend
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
- Booth Schedule - Demos & Staff (by time)
- List of booth demos (by title)
- IBM staff on site list & schedule
- IBM @ PyTorch invited talks (see agenda below)
What's Next?
Join us at IBM Z Day, Nov 12, 2025, 8 AM to 5 PM (ET) - a free, one-day virtual enterprise computing conference for anyone and everyone! Hear the latest about IBM Z and LinuxONE, and join our lineup of global thought leaders who will highlight industry trends and innovation spanning AI, Hybrid Cloud, Quantum-Safe Security, and more.
Looking for more from IBM Research?
- Check out Circuit Breaker on LinkedIn
- Visit our YouTube Channel
- Stay up to date on news and announcements from IBM Research → Future Forward Newsletter
Career opportunities
Agenda
- Description:
Speaker: Christian Jacobi, IBM Fellow
From fraud detection to core banking, AI is reshaping mission-critical systems—see how PyTorch and IBM’s Spyre accelerator bring dataflow to the enterprise.
AI at enterprise scale isn’t just about building bigger models—it’s about running them with the reliability, security, and performance that mission-critical workloads demand. IBM Fellow Christian Jacobi will share how IBM Z, LinuxONE, Power, and Storage systems are bringing AI directly into business operations, powering everything from fraud detection to RAG pipelines. He will also highlight the Spyre Accelerator—a scalable PCIe card for AI expansion—and show how its integration with PyTorch is enabling the development of secure, efficient, and resilient AI systems at scale.
Speakers: Christian Jacobi, IBM Fellow and CTO, IBM Systems Development, IBM
- Description:
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
- Description:
In this session, we will share our journey with TorchTitan over the past year and a half, starting from early 2024. During this journey, we went from using TorchTitan as a secondary codebase solely for throughput benchmarking to leveraging it for several internal production trainings; from being an end user to becoming an active contributor within the TorchTitan community.
Our story will cover why we adopted TorchTitan in our production trainings, what we've accomplished with it, and what lies ahead. Highlights include training an in-house 70B model earlier this year that matches the performance of the LLaMA 3 family - while requiring significantly fewer GPU hours - thanks to the latest features such as FP8 training. We'll also discuss our current work with TorchTitan, including our ongoing MoE training enabled by integrating our fast MoE kernel into TorchTitan, as well as exploring additional MoE kernels with FP8 row-wise and MXFP8, which are currently being developed within the TorchTitan community.
We’ll also share key lessons learned along the way and explain why we think this is a great community for everyone to explore and contribute to.
Speaker(s): Linsong Chu & Garrett Goot
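For context on the FP8 training mentioned in this abstract, here is a minimal, hedged sketch of how FP8 training is typically enabled in the PyTorch ecosystem via torchao, which TorchTitan builds on. This is an illustration, not TorchTitan's own configuration; the `convert_to_float8_training` API and its behavior may differ across torchao versions.

```python
# Hedged sketch: enabling FP8 training via torchao (which TorchTitan builds on).
# Not TorchTitan's configuration; the API may differ across torchao versions.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumption: recent torchao

model = nn.Sequential(
    nn.Linear(4096, 4096, bias=False),
    nn.GELU(),
    nn.Linear(4096, 4096, bias=False),
).to("cuda", dtype=torch.bfloat16)

# Swap eligible nn.Linear modules for float8 training variants in place.
convert_to_float8_training(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # dummy loss for illustration only
loss.backward()
optimizer.step()
```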
- Description:
Join an informal discussion, provide feedback, and uncover opportunities to collaborate. Developers:
- Andrea Frittoli: Open Source Developer Advocate from IBM (specializing in CI/CD)
- Thanh Ha: Engineer from Linux Foundation (CI/CD Infrastructure)
- Jordan Conway: Engineer from Linux Foundation (CI/CD Infrastructure)
- Zhe Zhang: Distinguished Engineer from NVIDIA (CI/CD Infrastructure)
- Eli Uregas: Leads CI/CD infrastructure efforts in collaboration with the PyTorch Foundation
- Andrey Talman: Engineer (PyTorch releases)
- Nikita Shulga: Core PyTorch OSS Developer and domain expert.
- Anita Katahoire: Technical Program Manager leading PyTorch release activities
- Yang Wang: Engineer (benchmarking, monitoring)
- Armen Donigian: Engineering manager
- Description:
Mert Toslali & Yu Chin Fabian Lim, IBM Research
Training LLMs with online RL methods like GRPO presents a unique challenge: inference is required at every training step. In the standard Hugging Face TRL setup, inference is handled by vLLM running as a separate server on dedicated GPUs, communicating via HTTP. This creates a “ping-pong” inefficiency—training GPUs wait during generation, and inference GPUs wait during training—leading to poor GPU utilization and high cost.
Our talk introduces co-located vLLM, a key optimization that enables training and inference to run on the same GPUs. Built on vLLM’s external_launcher, it allows in-process, torch-compatible execution. We contributed a now-merged PR to TRL that eliminates the need for HTTP calls or separate servers. It supports torchrun and TP/DP, and scales to training large models (like 72B). The co-located setup improves training throughput by up to 1.7×, reduces the number of GPUs needed, and is now part of the official TRL repo.
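As a rough illustration of the co-located setup described above, the sketch below shows how GRPO training with vLLM generation is commonly configured in TRL. The option names (`use_vllm`, `vllm_mode="colocate"`) are assumptions that may vary by TRL version, and the reward function, dataset, and model are placeholders.

```python
# Hedged sketch of GRPO training with co-located vLLM in TRL.
# Option names (use_vllm, vllm_mode) are assumptions and may differ by TRL version.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 100 characters (placeholder for a real reward).
    return [-abs(len(c) - 100) / 100.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="grpo-colocated",
    use_vllm=True,         # generate rollouts with vLLM instead of HF generate
    vllm_mode="colocate",  # run vLLM in-process on the training GPUs (no HTTP server)
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small placeholder model
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```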
- Description:
Martin Hickey, IBM & Junchen Jiang, University of Chicago
Session: Poster Presentations - Generative & Large Models
- Description:
Fine-tuning on multiple datasets? Static mixing with pre-determined percentages can often lead to overfitting and demands extensive ablations for the right mix. Dynamic data mixing addresses this using signals/rewards like training loss. While this has been studied (aclanthology.org/2024.emnlp-main.787) in research, full-fledged tooling is limited. In this session, we present a PyTorch-native (uses DataLoader and IterableDataset), online, reward-based data mixing framework that is: (a) composable with existing training loops with minimal code changes, (b) plug-and-play with user-defined mixing strategies and rewards, and (c) compatible with distributed training. We demonstrate its flexibility through 5 reward-driven data mixing recipes and its scalability via large-scale multi-GPU experiments, with insights on mixing. We believe our session will motivate PyTorch developers to adopt our framework for their use cases involving multiple fine-tuning datasets. The code is available at github.com/foundation-model-stack/fms-hf-tuning/tree/online-dyn-reward-data-mixing.
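To make the idea concrete, here is a simplified, hypothetical sketch of online reward-based mixing built on a PyTorch IterableDataset. It is not the fms-hf-tuning API linked above, just an illustration of sampling dataset proportions from reward-updated weights.

```python
# Hypothetical sketch of online, reward-based data mixing with PyTorch primitives.
# Not the fms-hf-tuning API; just an illustration of the idea.
import random
import torch
from torch.utils.data import DataLoader, IterableDataset

class RewardMixedDataset(IterableDataset):
    """Interleaves several iterable sources, sampling each with a weight
    that the training loop can update from a reward signal (e.g. loss)."""

    def __init__(self, sources, init_weights=None):
        self.sources = sources
        self.weights = init_weights or [1.0] * len(sources)

    def update_weights(self, rewards, lr=0.1):
        # Simple multiplicative-weights update from per-source rewards.
        self.weights = [w * torch.exp(torch.tensor(lr * r)).item()
                        for w, r in zip(self.weights, rewards)]

    def __iter__(self):
        iters = [iter(s) for s in self.sources]
        while True:
            i = random.choices(range(len(iters)), weights=self.weights)[0]
            try:
                yield i, next(iters[i])
            except StopIteration:
                return  # simplification: stop when any source is exhausted

# Usage: two toy sources; a real loop would call update_weights() from a reward.
mixed = RewardMixedDataset([range(0, 100), range(1000, 1100)])
loader = DataLoader(mixed, batch_size=8)
batch_sources, batch = next(iter(loader))
mixed.update_weights(rewards=[0.2, -0.1])  # e.g. derived from per-source loss
```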
Authors: Amal Joe R S, IBM; Romit Jain, IBM; and co-authors
- Description:
Poster Presentations: Generative & Large Models | Exhibit Hall
Routing Stateful AI Workloads in Kubernetes: Optimizing PyTorch LLM Inference - Maroon Ayoub, IBM
- Description:
Yidi Wu, Meta & Thomas Ortner, IBM Research Europe
Session: Poster Presentations - PyTorch Core
- Description:
Sahdev Zala, IBM
Session: Poster Presentations - PyTorch Core
- Description:
Andrea Frittoli, IBM
Session: Poster Presentations - Responsible AI & Community
- Description:
Cong Liu, Google; Carlos Costa, IBM
Session: Poster Presentations - Generative & Large Models
- Description:
Maroon Ayoub, IBM & Tyler Michael Smith, Red Hat
Session: Poster Presentations - Generative & Large Models
- Description:
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
- Description:
In January 2025, vLLM announced the alpha release of V1: a major upgrade to vLLM’s core architecture. One of the key goals of V1 was to enable all of vLLM’s inference performance optimizations (e.g., continuous batching, paged attention, chunked prefill, prefix caching, speculative decoding) to work seamlessly together. Achieving this required architectural changes that propagated down to the kernel level. In fact, in the alpha release of V1, only NVIDIA GPUs were supported due to a lack of V1-compliant attention kernels.
In this talk, we will describe how we enabled vLLM V1 to run with state-of-the-art performance on AMD GPUs. We will begin by describing an initial attempt to enable V1 using a relatively old Triton kernel and explain why this approach was not performant. We will then describe a sequence of kernel-level optimizations made by the teams from IBM, Red Hat and AMD that, when combined, allowed us to improve the performance of V1 on AMD GPUs by up to 5x.
This talk will provide deep insights into how vLLM V1 works from community and industry experts. It will also provide Triton kernel developers with tips and tricks on how to achieve maximum performance.
Speaker(s): Thomas Parnell (IBM), Aleksandr Malyshev (AMD)
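As background for this session, the snippet below shows the kind of user-facing vLLM call that stays the same whether the backend kernels target NVIDIA or AMD GPUs; engine and kernel selection happens under the hood. The VLLM_USE_V1 flag and the model id are assumptions (newer vLLM releases enable V1 by default).

```python
# Hedged sketch: the vLLM user API is identical on NVIDIA and AMD GPUs;
# the V1 engine and its attention kernels are selected under the hood.
import os
os.environ.setdefault("VLLM_USE_V1", "1")  # assumption: newer releases default to V1

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-2b-instruct")  # placeholder model id
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)
```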
- Description:
Maroon Ayoub, IBM Research & Cong Liu, Google
As PyTorch-based LLMs scale in complexity and user concurrency, their inference demands diverge across stages. Prefill is compute-heavy; decode is latency-sensitive. In this talk, we introduce a disaggregated serving pattern for PyTorch LLMs using llm-d—a Kubernetes-native, open-source framework co-developed by IBM Research, Google, and Red Hat. We'll walk through how llm-d separates prefill and decode into orchestrated sidecars, improving GPU utilization and QoS alignment. You'll learn how the Gateway API Inference Extension (GIE) enables routing based on load, cache locality, and session affinity. The talk includes real-world benchmarks and a visual demo of llm-d serving PyTorch models with vLLM across heterogeneous hardware on Kubernetes.
- Description:
Today, vLLM (part of the PyTorch ecosystem) is the de facto industry standard for serving large language models. vLLM is increasingly being adopted in production and can be executed on NVIDIA GPUs and AMD GPUs, as well as custom accelerators like AWS Inferentia.
However, for most of its history, vLLM’s state-of-the-art performance has depended largely on a number of hand-written CUDA or HIP kernels. These kernels are typically carefully optimized for a specific GPU platform and can pose a serious obstacle to the portability of vLLM across different hardware.
Leveraging OpenAI Triton, we were able to introduce a “Triton backend” to vLLM that delivers state-of-the-art performance across GPU platforms with a single code base, without relying on hand-written CUDA or HIP kernels.
In this talk, we will present recent advances that lead to state-of-the-art performance on both NVIDIA and AMD GPUs with a single Triton-only code base. We will cover the engineering and science behind this Triton-only backend, including autotuning for different platforms, system aspects such as the launch overhead of Triton’s just-in-time compiler, and various kernel optimizations.
Authors: Jamie Yang, IBM; Jan van Lunteren, IBM; Sara Kokkila Schumacher, IBM; and co-authors
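To give a flavor of the autotuning mentioned in this abstract, here is a minimal, generic Triton kernel decorated with @triton.autotune. It is an illustrative toy (vector add), not one of the vLLM backend kernels, and the config choices are arbitrary examples.

```python
# Hedged sketch: platform autotuning in Triton, illustrated on a toy vector add.
# Not a vLLM backend kernel; configs are arbitrary examples.
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune when the problem size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    # BLOCK_SIZE is supplied by the selected autotune config at launch time.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n)
    return out
```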
Upcoming events
AI Compute Symposium 2025
- Yorktown Heights, NY, USA and virtual
- —
IBM Quantum Developer Conference 2025
- Atlanta, Georgia, USA
- —
AI Hardware Forum 2025
- Yorktown Heights, NY, USA