Conference paper

Data preparation for fine-tuning Large Language Models

Abstract

Data preparation is a vital early step in building Large Language Model (LLM) applications, whether the goal is pre-training, fine-tuning, or instruct-tuning. It is widely acknowledged that the quality of a model is heavily influenced by the quality of the data it is trained on, as demonstrated in [4, 6, 8]. This tutorial focuses on data preparation for LLM application development, with an emphasis on the latest techniques. We will begin by surveying state-of-the-art methods for preparing data for LLMs, and then provide a hands-on walkthrough of the data-prep-kit [7], an open-source toolkit for implementing various data preparation steps. To give LLM app developers a practical understanding, we will build a data processing pipeline for a specific LLM application use case, offering an end-to-end experience that attendees can then apply to their own projects.
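
As a minimal illustrative sketch only (it does not use the data-prep-kit API, and all names below are hypothetical), the following Python snippet shows two steps that such a preparation pipeline commonly chains together: exact deduplication and simple length-based quality filtering.

```python
# Illustrative sketch, NOT the data-prep-kit API: two common data
# preparation steps (exact deduplication and length filtering).
import hashlib


def exact_dedup(docs):
    """Drop documents whose text is byte-for-byte identical to an earlier one."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def length_filter(docs, min_chars=200):
    """Keep only documents long enough to be plausibly useful for training."""
    return [doc for doc in docs if len(doc["text"]) >= min_chars]


if __name__ == "__main__":
    corpus = [
        {"id": 1, "text": "A" * 300},
        {"id": 2, "text": "A" * 300},   # exact duplicate of id 1
        {"id": 3, "text": "too short"},  # fails the length filter
    ]
    prepared = length_filter(exact_dedup(corpus))
    print([doc["id"] for doc in prepared])  # -> [1]
```

In the tutorial, steps of this kind are composed into a larger pipeline (e.g. adding language identification, document-quality scoring, and tokenization) using the toolkit's own transforms rather than hand-rolled functions like the ones above.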