Building robust and reliable Large Language Models (LLMs) hinges on the quality of the data used for training. This talk delves into the critical role of data quality in LLM development, particularly for code generation tasks. We will discuss the challenges posed by low-quality data and explore state-of-the-art techniques for data preparation.
The session will then provide a hands-on demonstration of building data pipelines using the Data Prep Kit, an open-source library. Participants will learn to construct and execute data processing pipelines, clean and filter data, and prepare it for fine-tuning LLMs. By the end of this session, participants will be equipped to build effective data pipelines for their own LLM projects.
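As a taste of the kind of cleaning and filtering step such a pipeline performs, here is a minimal sketch in plain Python. This is a generic illustration, not Data Prep Kit's actual API: the function name, length thresholds, and deduplication-by-hash strategy are assumptions chosen for the example.

```python
import hashlib

def clean_and_filter(samples, min_chars=20, max_chars=2000):
    """Illustrative cleaning pass (hypothetical, not Data Prep Kit's API):
    drop records outside a length window and exact duplicates."""
    seen = set()
    kept = []
    for text in samples:
        text = text.strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # drop too-short or too-long records
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        kept.append(text)
    return kept

samples = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a + b",  # exact duplicate
    "x = 1",                             # too short
]
print(clean_and_filter(samples))  # only the first sample survives
```

Real pipelines chain many such transforms (language identification, license filtering, near-duplicate detection) and run them at scale; the library's role is to make those stages composable and reusable.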