Machine Learning (ML) algorithms are a standard component of modern software systems. The validation of data ingested and produced by ML components has become a central challenge in the deployment and maintenance of ML systems. Subtle changes in the input data can cause an ML algorithm to behave unpredictably, leading to unreliable or unfair predictions. Responsible use of ML components thus requires well-calibrated and scalable data validation systems. Here, we highlight some challenges associated with data validation in ML systems. We review some of the solutions developed to validate data at the various stages of a data pipeline in modern ML systems, discuss their strengths and weaknesses, and assess to what extent these solutions are being used in practice. The research reviewed indicates that the increasing need for data validation in ML systems has driven enormous progress in an emerging community of ML and Database Management Systems (DBMS) researchers. While this research has led to a number of technical solutions, we find that many ML systems deployed in industrial applications do not leverage the full potential of data validation. The reasons are not only technical: cultural, ethical and legal aspects also need to be taken into account when building data validation solutions for ML systems. We identify the lack of automation in data validation as one of the key factors slowing the adoption of validation solutions and the translation of research into useful and robust ML applications. We conclude with an outlook on research directions at the intersection of ML and DBMS research to improve the development, deployment and maintenance of ML systems.