Conference paper

Data preparation for fine-tuning Large Language Models

Abstract

Data preparation is a vital early step in building Large Language Model (LLM) applications, whether the goal is pre-training, fine-tuning, or instruct-tuning. It is widely acknowledged that the quality of a model is heavily influenced by the quality of the data it is trained on, as demonstrated in [4, 6, 8]. This tutorial focuses on data preparation for LLM application development, with an emphasis on the latest techniques. We will begin by surveying state-of-the-art methods for preparing data for LLMs, and then provide a hands-on walkthrough of the data-prep-kit [7], an open-source toolkit for implementing various data preparation steps. To give LLM app developers a practical understanding, we will build a data processing pipeline for a specific LLM application use case, offering an end-to-end experience that attendees can then apply to their own projects.
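
As a minimal illustrative sketch only (it does not use the data-prep-kit API, and all names below are hypothetical), the following Python snippet shows two steps that such a preparation pipeline commonly chains together: exact deduplication and simple length-based quality filtering.

```python
# Illustrative sketch, NOT the data-prep-kit API: two common data
# preparation steps (exact deduplication and length filtering).
import hashlib


def exact_dedup(docs):
    """Drop documents whose text is byte-for-byte identical to an earlier one."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def length_filter(docs, min_chars=200):
    """Keep only documents long enough to be plausibly useful for training."""
    return [doc for doc in docs if len(doc["text"]) >= min_chars]


if __name__ == "__main__":
    corpus = [
        {"id": 1, "text": "A" * 300},
        {"id": 2, "text": "A" * 300},   # exact duplicate of id 1
        {"id": 3, "text": "too short"},  # fails the length filter
    ]
    prepared = length_filter(exact_dedup(corpus))
    print([doc["id"] for doc in prepared])  # -> [1]
```

In the tutorial, steps of this kind are composed into a larger pipeline (e.g. adding language identification, document-quality scoring, and tokenization) using the toolkit's own transforms rather than hand-rolled functions like the ones above.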