DataLore: Can a large language model find all lost scrolls in a data repository?
2024
How can we effectively generate missing data transformations among tables in a data repository? As data scientists and machine learning engineers iteratively fine-tune their ML pipelines, each incremental improvement produces another version of the same tables. This process often involves data transformation and augmentation that produce an augmented table from its base version and related tables. However, these data transformations are often poorly documented or missing entirely, resulting in poor traceability, reproducibility, and explainability of ML pipelines. In this paper, we propose DATALORE, a framework that explains data changes between an initial dataset and its augmented version to improve traceability. Given a base table, DATALORE first discovers potentially related tables in the data repository using a variety of data discovery techniques. DATALORE then leverages a large language model (LLM) to generate a variety of data transformations that lead to the augmented table. Finally, DATALORE validates these transformations and selects the minimum number of related tables to ensure traceability and reproducibility of the ML pipelines. A preliminary experiment shows that DATALORE is able to effectively recover data transformations on two benchmark datasets.
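The validation and table-selection steps described above can be illustrated with a minimal sketch. It is not DATALORE's implementation: the names `Candidate`, `lookup`, `validate`, and `select_min_tables`, and the toy tables, are all hypothetical. The candidate transformations are written by hand here as a stand-in for LLM-generated ones, and the exhaustive search over subsets of related tables is an assumed strategy that is only practical for a handful of tables.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, NamedTuple

Table = Dict[str, List]  # toy in-memory table: column name -> values


class Candidate(NamedTuple):
    """Hypothetical candidate transformation explaining one augmented column."""
    name: str
    target: str              # column of the augmented table it should produce
    sources: FrozenSet[str]  # related tables it reads from
    apply: Callable[[Table, Dict[str, Table]], List]


def lookup(base: Table, other: Table, key: str, column: str) -> List:
    """Key-based lookup join: fetch `column` from `other` for each base row."""
    index = dict(zip(other[key], other[column]))
    return [index[k] for k in base[key]]


def validate(cands, base, related, augmented):
    """Keep candidates whose output reproduces the target column exactly."""
    valid = []
    for c in cands:
        try:
            out = c.apply(base, related)
        except (KeyError, TypeError, ValueError):
            continue  # candidate does not run on this data
        if out == augmented.get(c.target):
            valid.append(c)
    return valid


def select_min_tables(valid, new_columns, table_names):
    """Smallest set of related tables that explains every new column.

    Exhaustive search by increasing subset size (assumption: few tables).
    """
    for k in range(len(table_names) + 1):
        for subset in combinations(sorted(table_names), k):
            allowed = set(subset)
            plan = {}
            for col in new_columns:
                match = next((c for c in valid
                              if c.target == col and c.sources <= allowed), None)
                if match is None:
                    break  # this subset cannot explain `col`
                plan[col] = match
            else:
                return allowed, plan
    return None, {}


# Toy data (illustrative only).
base = {"id": [1, 2, 3], "price": [10.0, 20.0, 30.0], "qty": [2, 1, 4]}
related = {
    "regions": {"id": [1, 2, 3], "region": ["EU", "US", "EU"]},
    "geo": {"id": [1, 2, 3], "region": ["EU", "US", "EU"],
            "country": ["DE", "US", "FR"]},
}
augmented = dict(base, total=[20.0, 20.0, 120.0],
                 region=["EU", "US", "EU"], country=["DE", "US", "FR"])

candidates = [
    Candidate("price*qty", "total", frozenset(),
              lambda b, r: [p * q for p, q in zip(b["price"], b["qty"])]),
    Candidate("price+qty", "total", frozenset(),
              lambda b, r: [p + q for p, q in zip(b["price"], b["qty"])]),
    Candidate("region via regions", "region", frozenset({"regions"}),
              lambda b, r: lookup(b, r["regions"], "id", "region")),
    Candidate("region via geo", "region", frozenset({"geo"}),
              lambda b, r: lookup(b, r["geo"], "id", "region")),
    Candidate("country via geo", "country", frozenset({"geo"}),
              lambda b, r: lookup(b, r["geo"], "id", "country")),
]

valid = validate(candidates, base, related, augmented)
new_columns = [c for c in augmented if c not in base]
tables, plan = select_min_tables(valid, new_columns, related.keys())
print(tables, {col: c.name for col, c in plan.items()})
```

In this toy setting, the `price+qty` candidate is rejected during validation, and a single related table suffices because `geo` can explain both `region` and `country`. Exact equality is a deliberately strict validation rule; a realistic check would likely need to tolerate row reordering and floating-point noise.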
Research areas