IBM at KubeCon + CloudNativeCon North America 2025
- Atlanta, GA, USA
IBM is proud to sponsor KubeCon + CloudNativeCon North America 2025.
The Cloud Native Computing Foundation’s flagship conference brings together adopters and technologists from leading open source and cloud native communities.
Be a part of the conversation as CNCF Graduated, Incubating, and Sandbox Projects unite for four days of collaboration, learning, and innovation to drive the future of cloud native computing.
Boaz Michaely, Red Hat & Adi Sosnovich, IBM Research
Kubernetes networking is, by default, a malicious actor's heaven.
Why? Because by default, any pod can send and receive traffic to and from any other pod, ignoring namespace and privilege boundaries. External traffic in both directions is allowed as well, as far as Kubernetes is concerned.
Indeed, best practices rightfully dictate that this default be modified using Kubernetes NetworkPolicies.
Yet most teams find this too difficult to implement. Authoring NetworkPolicy YAML is very challenging. Baseline/AdminNetworkPolicy fills a gap for cluster administrators, but authoring these policies and understanding their impact is a new, additional challenge. Furthermore, policy authors may not know what the application’s communication needs are.
What if there were a way to automatically produce tight network policy rules, in YAML, and see the impact of applied B/ANP network policies?
Join this session to see the magic yourself, and learn how you can leverage this technology today!
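For readers new to the topic, a minimal hand-written pair of policies gives a sense of what generated rules look like: a default-deny baseline for a namespace plus one explicit allow rule. The namespace, labels, and port here are illustrative:

```yaml
# Deny all ingress and egress for every pod in the "shop" namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: shop
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow only the frontend to reach the backend on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Even this tiny example hints at the authoring burden the session addresses: every legitimate flow must be enumerated, or it is silently dropped once the default-deny baseline is in place.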
Maia Iyer, Alan Cha & Mariusz Sabath, IBM Research; Anjali Telang & Andrew Block, Red Hat
Agentic platforms are redefining how cloud-native applications interact—but behind every action lies a critical question: who is allowed to do what, and why? Emerging standards such as MCP allow AI agents to easily connect with tools, but organizations looking to support agents must maintain security and transparency. They can do so by combining the power of OAuth 2.0 with strongly attested workload identity from SPIFFE.
In this hands-on workshop, we’ll dive into the mechanics of secure workload identity for agents and tools—no prior experience required. Attendees will work with a live agentic stack, including MCP for agentic tool calling, and integrate it with cloud-native tools such as SPIRE for workload identity and Keycloak for user management. These existing technologies are key to enabling granular access control and rich audit trails across the full agentic flow. This workshop lays the foundations for building identity-first, zero-trust agentic platforms.
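As a taste of the stack, the sketch below shows one common way a Kubernetes workload obtains its SPIFFE identity: mounting the SPIRE agent's Workload API socket through the SPIFFE CSI driver. The pod name, image, and mount path are illustrative, and the workshop's actual environment may be wired differently:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mcp-tool-server              # illustrative name
spec:
  containers:
    - name: tool
      image: example.com/mcp-tool:latest   # illustrative image
      env:
        # Standard env var that SPIFFE-aware libraries use to find the socket.
        - name: SPIFFE_ENDPOINT_SOCKET
          value: unix:///spiffe-workload-api/spire-agent.sock
      volumeMounts:
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
  volumes:
    # The SPIFFE CSI driver projects the SPIRE agent's Workload API
    # socket into the pod, so the workload can fetch its SVID at runtime.
    - name: spiffe-workload-api
      csi:
        driver: csi.spiffe.io
        readOnly: true
```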
Moderated by Katie Norton; Alex Zenla, Edera; Jason Hall, Chainguard; Jon Ceanfaglione, IBM
After a decade of "microservices all the things," the industry is experiencing a fascinating recalibration. Organizations that rushed to decompose monoliths are now grappling with distributed system complexity, operational overhead, and the cognitive load on development teams. This panel explores how modern organizations are making more intentional architectural choices and evolving their approach to software consumption and deployment.
This panel will cover:
Blaine Gardner, IBM
This session introduces the Rook project to attendees of all levels of experience. Rook is an open source cloud-native storage operator for Kubernetes, providing the platform, framework, and support for Ceph to integrate natively with Kubernetes. The talk will walk through various scenarios showing how Rook configures Ceph to provide stable block, shared file system, and object storage for your production data. Rook was accepted as a graduated project by the Cloud Native Computing Foundation in October 2020.
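For a sense of Rook's declarative model, a minimal CephCluster resource looks roughly like the following; the image tag and sizing are illustrative, and production clusters need more deliberate settings:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18   # illustrative tag
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                       # odd monitor count for quorum
  storage:
    useAllNodes: true
    useAllDevices: true            # let Rook consume all empty devices
```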
Ryan Jarvinen, Red Hat & Daniel Oh, IBM
Running Java applications in Kubernetes brings a set of performance expectations: fast startup, low memory usage, and efficient container images. This session is a hands-on walkthrough of tools and techniques to help meet those goals. You'll learn how to use Jib to build lean container images, accelerate cold starts with GraalVM native image compilation, and improve runtime responsiveness with Class Data Sharing (CDS) and Coordinated Restore at Checkpoint (CRaC). We'll dive into real-world configuration examples, discuss trade-offs, and demonstrate how to combine these tools to boost performance in Kubernetes-native Java workloads.
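As a hedged sketch of one of these techniques, the CI snippet below builds a GraalVM native image for a Maven project. It assumes the project exposes a `native` profile (as Quarkus and Spring Boot builds typically do); the workflow layout itself is illustrative:

```yaml
# .github/workflows/native-image.yaml (illustrative)
name: native-image
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Installs a GraalVM JDK with the native-image tool available.
      - uses: graalvm/setup-graalvm@v1
        with:
          java-version: '21'
          distribution: 'graalvm'
      - name: Compile a native executable
        run: ./mvnw -Pnative package   # assumes a Maven profile named "native"
```

The trade-off the session examines applies here too: native compilation is slow and memory-hungry at build time in exchange for near-instant startup at run time.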
Sunyanan Choochotkaew, IBM Research & John Belamaric, Google
Divvying up a network card using Kubernetes is really hard to do. If you need to spin up virtual interfaces on top of a NIC, limit their bandwidth, and hand them out to different Pods, you will have a rough time.
Come find out how the Kubernetes project will make sharing network hardware just as easy as sharing node CPU and memory! And networking is just the initial use case - this functionality can work with any device. Being able to sub-divide devices will really improve utilization of your pricey hardware.
In this talk, we detail a new way to request resources from attached devices like NICs, GPUs, and DPUs. Building on the recently released Dynamic Resource Allocation (DRA) API, this feature performs on-demand provisioning based on resource requests, allowing a physical device to be independently shared among Pods multiple times. It extends K8s multi-tenancy to the sub-device level. We’ll dive deep and explore real-world use cases, under-the-hood details, and future extensions.
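A rough sketch of what such a request could look like under DRA's structured-parameters API: a claim template asks a driver's device class for a virtual NIC, and a Pod consumes it. The device class name and all object names are assumptions for illustration, not a published driver:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: vnic-template                     # illustrative
spec:
  spec:
    devices:
      requests:
        - name: vnic
          deviceClassName: vnic.example.com   # hypothetical driver's class
---
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest  # illustrative
      resources:
        claims:
          - name: nic                     # container consumes the claim
  resourceClaims:
    - name: nic
      resourceClaimTemplateName: vnic-template
```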
Daniel Oh & Kevin Dubois, IBM
This session delves into the critical aspects of developing production-ready Large Language Model (LLM) applications using Java. We'll explore how to leverage Java's strengths to build scalable and efficient LLM systems, addressing key challenges such as performance optimization, resource management, and seamless integration with existing infrastructures.
Attendees will gain practical knowledge on handling massive datasets, optimizing model inference, and fine-tuning LLMs for optimal performance. We'll discuss strategies for ensuring the reliability and scalability of your LLM deployments, empowering you to create robust and high-performing AI applications. Whether you're a seasoned Java developer or new to the AI domain, this session will provide valuable insights and guidance for your LLM development journey, equipping you with the tools and knowledge to navigate the complexities of building production-grade LLM systems.
Maroon Ayoub, IBM & Michey Mehta, Red Hat
Kubernetes excels at stateless service routing - but modern AI workloads are not stateless. Generative workloads demand context-aware routing that maximizes performance while reducing costs.
This talk explores layered routing strategies for stateful LLM workloads on Kubernetes - from round-robin to full KV-Cache-aware load balancing. We’ll explain when each level applies, and its effects on performance.
Based on our experience developing llm-d - a framework using the K8s Gateway API Inference Extension, a collaboration between Google, IBM Research, and Red Hat - we’ll cover:
- Why traditional Kubernetes routing falls short for generative AI
- Routing patterns for long-context, sessionful traffic
- Global cache indices and local offloading for smart routing
- Benchmarks showing latency, cache hit rates, and GPU utilization
- Practical ways to adopt cache-aware routing without major infra changes
If you’re scaling multi-turn, agentic, or LLM-powered workloads, this session is for you.
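As a flavor of the machinery involved, the sketch below declares an InferencePool from the Gateway API Inference Extension, which delegates endpoint choice to an extension service that can make cache-aware decisions. Field names follow the v1alpha2 API; all names are illustrative rather than taken from llm-d's documentation:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool                 # illustrative
spec:
  targetPortNumber: 8000           # port the model servers listen on
  selector:
    app: vllm-llama                # pods serving this model
  extensionRef:
    name: llama-endpoint-picker    # extension that scores endpoints, e.g.
                                   # by KV-cache locality or queue depth
```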
Jing Chen, IBM Research; Junchen Jiang, University of Chicago; Ganesh Kudleppanavar, NVIDIA; Samuel Monson, Red Hat; Jason Kramberger, Google
As organizations deploy LLMs as distributed stacks in production Kubernetes environments, optimizing inference performance has become critical. This collaborative tutorial brings together experts from Google, NVIDIA, Red Hat, IBM, and the University of Chicago (LMCache) to provide practical benchmarking techniques for impactful LLM optimization strategies.
Using identified use cases as examples, we'll show how to benchmark key optimization strategies: KV Cache offloading, autoscaling, prefix/session-aware routing, KVCache-aware routing, and xPyD for prefill decode disaggregation. Attendees will learn a unified benchmarking approach integrating tools including vLLM, LMBenchmark, GuideLLM, GenAIperf, inference-perf, and fmperf. Through live demonstrations, participants gain hands-on experience with production-tested methodologies reflecting real-world scenarios. Attendees will be equipped to implement these approaches for data-driven LLM serving optimizations on Kubernetes.
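As a hedged illustration of running one such tool in a cluster, the Job below runs a GuideLLM benchmark sweep against an in-cluster vLLM endpoint. The image, service URL, model, and flags are assumptions for the sketch, not the tutorial's actual configuration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-bench
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: guidellm
          image: ghcr.io/example/guidellm:latest   # hypothetical image
          command:
            - guidellm
            - benchmark
            # Target the in-cluster vLLM service (illustrative URL).
            - --target=http://vllm.default.svc.cluster.local:8000
            - --model=meta-llama/Llama-3.1-8B-Instruct
            # Sweep request rates to map the latency/throughput curve.
            - --rate-type=sweep
```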
Prajakta Kashalkar-Joshi & Socheat Sou, IBM
When working on a cloud-native product that is an aggregate of separate products, each with its own cadence, clear communication is critical to smooth integration. Ideally, each product team wants to know when the other related products have published a new release and whether it's ready to be integrated with their product. Having every team subscribe to every other team's notifications quickly becomes a logistical nightmare that grows with the number of teams. Using an "integration repository" solves both the business and technical needs of product integration and currency. In this talk, learn how the Fusion DevOps team turned the pull request into a mechanism for streamlining team handoffs, notifying the appropriate focal points, and clearly defining boundaries of responsibility between teams.
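A minimal sketch of the pattern (not the Fusion team's actual pipeline): the integration repository listens for an upstream release event and opens a pull request, so the hand-off shows up as a reviewable change with clear ownership. The event type, file layout, and action versions are illustrative:

```yaml
# .github/workflows/integrate-release.yaml (illustrative)
name: integrate-release
on:
  repository_dispatch:
    types: [product-release]       # fired by an upstream product's pipeline
jobs:
  open-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Record the new release version
        run: echo "${{ github.event.client_payload.version }}" > versions/product-a.txt
      # The PR becomes the notification, the review gate, and the audit trail.
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "Integrate product-a ${{ github.event.client_payload.version }}"
          branch: integrate/product-a
```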
Chen Wang, IBM Research & Huamin Chen, Red Hat
This research-driven talk introduces a novel architecture paradigm that complements recent advances in timely, intelligent inference routing for large language models. By integrating proxy-based classification and reranking techniques, we've developed a system that efficiently routes incoming prompts to domain-specialized LLMs based on rapid content analysis. Our approach creates a meta-layer of intelligence above traditional model-serving infrastructures, enabling specialized models to handle the queries they're optimized for while maintaining a unified API interface. We'll present performance research comparing this distributed approach against monolithic inference-time scaling, demonstrating how intelligent routing can achieve superior results for complex, multi-domain workloads while reducing computational overhead. The session includes a Kubernetes-based reference implementation and quantitative analysis of throughput, latency, and accuracy across diverse prompt categories.
Martin Bartoš & Ryan Emerson, IBM
To mitigate the impact of CVEs and allow continuous delivery of features, it is crucial that upgrades can be rolled out seamlessly. For stateless applications, zero-downtime upgrades are a solved problem, but for stateful applications, upgrades can present a significant challenge.
As the leading open-source identity and access management solution, Keycloak is a critical component in many organizations' infrastructure. Achieving maximum uptime is vital in order for dependent services to function.
Join us to discover how Keycloak has evolved to support zero-downtime rollouts of configuration changes and patch upgrades. In this talk we explain the technical and project-management challenges we faced, the measures taken to overcome them, and the best practices you can leverage in your own projects to enable zero-downtime upgrades. Key focus areas will be the Keycloak Operator, how we ensure clustering compatibility, testing strategies, and our plans for the future.
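For context, a minimal Keycloak custom resource as managed by the Keycloak Operator looks like the sketch below; the talk covers how the operator rolls changes across such a multi-instance deployment. The hostname, database wiring, and secret names are illustrative:

```yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
spec:
  instances: 3                    # multiple replicas so updates can roll
  db:
    vendor: postgres
    host: postgres-db             # illustrative
    usernameSecret:
      name: keycloak-db-secret
      key: username
    passwordSecret:
      name: keycloak-db-secret
      key: password
  hostname:
    hostname: id.example.com      # illustrative
```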
Jitendra Singh, IBM India Pvt. Ltd.
Kubernetes observability tools, like Fluent Bit, OpenTelemetry, and Loki, provide deep visibility, but they also handle sensitive data: user identifiers, tokens, and internal service metadata. Even with encryption at rest and in transit, telemetry data is often exposed during collection and processing.
In this lightning talk, we’ll show how to secure observability pipelines on Kubernetes using confidential computing-enabled nodes. We demonstrate how observability components (e.g., Fluent Bit, OpenTelemetry Collector, Loki) can run inside hardware-isolated Kubernetes nodes, ensuring that telemetry data is encrypted at the source and only processed by trusted, attested workloads. Attendees will learn a practical, zero-intrusion design that combines Kubernetes-native observability tools with confidential compute infrastructure to deliver end-to-end encrypted, trusted observability, ideal for regulated workloads in finance, healthcare, and government.
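A hedged sketch of the scheduling side of this design: running a collector under a confidential RuntimeClass so it lands on TEE-backed nodes. The RuntimeClass name is environment-specific and hypothetical here:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      # Hypothetical RuntimeClass backed by confidential (TEE) hardware;
      # the actual name depends on the cluster's confidential runtime setup.
      runtimeClassName: kata-cc
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector:latest
```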
Ezra Silvera, IBM & Michael Hrivnak, Red Hat
Running AI/ML workloads in Pods on bare metal is common for maximizing GPU performance, but it lacks strong isolation and flexibility.
In this talk, we share how we use KubeVirt to run high-performance AI workloads inside VMs with NVIDIA GPUs and NVLink, achieving near bare-metal speeds. This enables multi-tenancy, improved security, and resource partitioning—critical for service providers and cost-efficient for customers. We’ll show how VM-based worker nodes enable virtual Kubernetes clusters on shared infrastructure, supporting both full bare-metal nodes and partitioned-node use cases. We'll also dive into challenges like integrating NVIDIA Fabric Manager with the Kubernetes/KubeVirt workflow, optimizing NUMA and PCI topology, and aligning Kubernetes scheduling with VM-based GPU layouts. Finally, we’ll share customer use cases demonstrating the need for isolated, high-performance AI environments using Kubernetes-native tooling.
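For a sense of how a GPU reaches a VM-based worker, the sketch below requests GPU passthrough in a KubeVirt VirtualMachineInstance. The device name depends on what the device plugin advertises and is illustrative, as are the sizing and disk image:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-worker                 # illustrative
spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true  # pin vCPUs for NUMA-sensitive workloads
    memory:
      guest: 64Gi
    devices:
      gpus:
        - name: gpu0
          # Illustrative passthrough resource name; matches whatever the
          # host's GPU device plugin advertises for the physical card.
          deviceName: nvidia.com/GA100_A100_PCIE_40GB
      disks:
        - name: rootdisk
          disk:
            bus: virtio
  volumes:
    - name: rootdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest
```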
Paolo Patierno, IBM & Michael Morris, Ericsson Software Technology
Strimzi is best known for its operators, but its ecosystem includes a rich set of components that make Apache Kafka on Kubernetes truly production-ready. This talk dives into the broader Strimzi landscape: the HTTP Bridge for RESTful Kafka access, the Drain Cleaner for safe node maintenance, the OAuth library for secure authentication, the Access Operator for declarative user and ACL management, and the Metrics Reporter for enhanced observability. We’ll also touch on other complementary tools like the Kubernetes Config Provider for dynamic configuration and the MQTT Bridge for IoT integration. Whether you're running Kafka at scale or exploring cloud-native streaming for the first time, this session will offer a practical look at how the full Strimzi ecosystem works together to simplify and strengthen your deployment.
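As one concrete example from that ecosystem, deploying the HTTP Bridge is a single custom resource; the bootstrap address below assumes a Strimzi-managed Kafka cluster named `my-cluster` in the same namespace:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaBridge
metadata:
  name: my-bridge
spec:
  replicas: 1
  # Service created by the Strimzi cluster operator for "my-cluster".
  bootstrapServers: my-cluster-kafka-bootstrap:9092
  http:
    port: 8080        # REST endpoint for producing and consuming messages
```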