FedExP: Speeding up Federated Averaging via Extrapolation
Divyansh Jhunjhunwala, Shiqiang Wang, et al.
ICLR 2023
Large Language Models (LLMs) require preprocessing vast amounts of data, a process that can span days due to the complexity and scale involved, often covering petabytes of data. This talk demonstrates how Kubeflow Pipelines (KFP) simplifies LLM data processing with flexibility, repeatability, and scalability. Such pipelines are used daily at IBM Research to build indemnified LLMs tailored for enterprise applications. Data preparation toolkits are built on Kubernetes, Rust, Slurm, or Spark; how would you choose one for your own LLM experiments or enterprise use cases, and why should you consider Kubernetes and KFP? This talk describes how the open-source Data Prep Toolkit leverages KFP and KubeRay for scalable orchestration of pipeline steps such as deduplication, content classification, and tokenization. We share challenges, lessons, and insights from our experience with KFP, highlighting its applicability to diverse LLM tasks such as data preprocessing, RAG retrieval, and model fine-tuning.
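To illustrate one of the preprocessing steps mentioned above, here is a minimal exact-deduplication sketch in plain Python. This is a hypothetical illustration of the general technique (content hashing), not the Data Prep Toolkit's actual implementation, which runs distributed over Ray/KubeRay.

```python
import hashlib

def dedup_exact(docs):
    # Drop documents whose content hash has already been seen
    # (exact deduplication); preserves first-occurrence order.
    seen = set()
    out = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out

docs = ["hello world", "foo bar", "hello world"]
print(dedup_exact(docs))  # → ['hello world', 'foo bar']
```

At petabyte scale this same idea is sharded: each worker hashes its partition and duplicates are resolved via a distributed set, which is the kind of step KFP and KubeRay orchestrate.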
Helgi I. Ingolfsson, Chris Neale, et al.
PNAS
Romeo Kienzler, Johannes Schmude, et al.
Big Data 2023
Christopher Giblin, Sean Rooney, et al.
BigData Congress 2021