Machine learning (ML) has become a central component in modern software applications, giving rise to many new challenges [8, 15, 20]. Tremendous progress has been made in this context with respect to model serving [1, 6, 10], experiment tracking [14, 16, 22, 23], model diagnosis [4, 5, 11, 21] and data validation [4, 18].
In this paper, we focus on the emerging challenge of automating the operation of deployed ML applications, especially with respect to monitoring the quality of their input data. Existing approaches [1, 18, 22] for this problem have not yet reached broad adoption. One reason is that they often require substantial domain knowledge, e.g., to define “data unit tests” and corresponding similarity metrics and thresholds for detecting data shifts. Additionally, it is very challenging to test data at early stages of a pipeline (e.g., during integration) without explicit knowledge of how the data will be processed by downstream applications. In other cases, the engineers in charge of operating a deployed ML model may not have access to the model internals, for example if they leverage a popular cloud ML service such as Google AutoML for training and inference. Integrating and automating data quality monitoring into ML applications is also difficult due to the lack of agreed-upon abstractions for defining and deploying such applications.
We summarize three recent approaches that tackle data quality in ML applications and outline our vision towards their automation and synthesis, as depicted in Fig. 1: (i) measuring data quality with “data unit tests” using the Deequ [18] library; (ii) improving data quality with missing value imputation using the DataWig [3] library; and (iii) quantifying the impact of data quality issues on the predictive performance of a deployed ML model [19]. We sketch each of these approaches below. Finally, we discuss challenges and potential directions for combining these approaches and for automating their configuration in real-world deployment settings.
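To make the first approach concrete: Deequ itself is a library for Apache Spark; the following minimal sketch uses its Python wrapper PyDeequ, with a hypothetical review dataset and hypothetical column names and thresholds.

    from pyspark.sql import SparkSession
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

    df = spark.read.parquet("reviews.parquet")  # hypothetical input data

    # A "data unit test": declarative constraints on the data, verified by Deequ.
    check = (Check(spark, CheckLevel.Error, "review data unit test")
             .isComplete("review_id")                               # no missing ids
             .isNonNegative("star_rating")                          # ratings >= 0
             .hasCompleteness("customer_id", lambda c: c >= 0.95))  # <= 5% missing

    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    VerificationResult.checkResultsAsDataFrame(spark, result).show()

Missing values detected by such tests can then be repaired with DataWig, which trains an imputation model on the complete rows using the other columns as features. A minimal sketch, again with hypothetical column names:

    import datawig
    import pandas as pd

    products = pd.read_csv("products.csv")  # hypothetical input data
    train, test = datawig.utils.random_split(products)

    # Learn to impute the "color" column from textual feature columns.
    imputer = datawig.SimpleImputer(
        input_columns=["title", "description"],
        output_column="color",
        output_path="imputer_model")  # directory where the model is stored

    imputer.fit(train_df=train)
    imputed = imputer.predict(test)  # adds a "color_imputed" column

Finally, the impact of a data quality issue on a deployed model can be quantified by injecting the corresponding error into held-out data and measuring the change in predictive performance, in the spirit of [19]. The helper below is a hypothetical illustration of this idea; it assumes a scikit-learn-style model pipeline that tolerates missing values.

    import numpy as np
    from sklearn.metrics import accuracy_score

    def missingness_impact(model, X_test, y_test, column, fraction=0.5):
        """Accuracy drop when `fraction` of `column` is set to missing."""
        baseline = accuracy_score(y_test, model.predict(X_test))
        corrupted = X_test.copy()
        mask = np.random.rand(len(corrupted)) < fraction
        corrupted.loc[mask, column] = np.nan
        return baseline - accuracy_score(y_test, model.predict(corrupted))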