FedExP: Speeding up Federated Averaging via Extrapolation
Divyansh Jhunjhunwala, Shiqiang Wang, et al.
ICLR 2023
Large Language Models (LLMs) require preprocessing vast amounts of data, a process that can span days due to the complexity and scale involved, often covering petabytes of data. This talk demonstrates how Kubeflow Pipelines (KFP) simplifies LLM data processing with flexibility, repeatability, and scalability. Such pipelines are used daily at IBM Research to build indemnified LLMs tailored for enterprise applications. Data preparation toolkits are built on Kubernetes, Rust, Slurm, or Spark; how would you choose one for your own LLM experiments or enterprise use cases, and why should you consider Kubernetes and KFP? This talk describes how the open-source Data Prep Toolkit leverages KFP and KubeRay for scalable orchestration of pipeline steps such as deduplication, content classification, and tokenization. We share challenges, lessons, and insights from our experience with KFP, highlighting its applicability to diverse LLM tasks such as data preprocessing, RAG retrieval, and model fine-tuning.
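To illustrate one of the preprocessing steps mentioned above, here is a minimal exact-deduplication sketch in plain Python. This is a hypothetical illustration of the general technique (content hashing), not the Data Prep Toolkit's actual implementation, which runs distributed over Ray/KubeRay.

```python
import hashlib

def dedup_exact(docs):
    # Drop documents whose content hash has already been seen
    # (exact deduplication); preserves first-occurrence order.
    seen = set()
    out = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out

docs = ["hello world", "foo bar", "hello world"]
print(dedup_exact(docs))  # → ['hello world', 'foo bar']
```

At petabyte scale this same idea is sharded: each worker hashes its partition and duplicates are resolved via a distributed set, which is the kind of step KFP and KubeRay orchestrate.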
Helgi I. Ingolfsson, Chris Neale, et al.
PNAS
Romeo Kienzler, Johannes Schmude, et al.
Big Data 2023
Christopher Giblin, Sean Rooney, et al.
BigData Congress 2021