Updates to Machine Learning as a Service (MLaaS) APIs may affect downstream systems that depend on their predictions. However, performance changes introduced by these updates are poorly documented by providers and seldom studied in the literature. As a result, API producers and consumers are left wondering: do model updates introduce performance changes that could adversely affect users' systems? Ideally, producers and consumers would have access to a detailed ChangeList specifying the slices of data where model performance has improved and degraded since the update. But producing a ChangeList is challenging because it requires (1) discovering slices in the absence of detailed annotations or metadata, (2) accurately attributing coherent concepts to the discovered slices, and (3) communicating them to the user in a digestible manner. In this work, we demonstrate, discuss, and critique one approach for building, verifying, and releasing ChangeLists that aims to address these challenges. Using this approach, we analyze six real-world MLaaS API updates, including GPT-3 and Google Cloud Vision. We produce a prototype ChangeList for each, identifying over 100 coherent data slices on which the model's performance changed significantly. Notably, we find 63 instances where an update improves performance globally but hurts performance on a coherent slice, a phenomenon not previously documented at scale in the literature. Finally, with diverse participants from industry, we conduct a think-aloud user study that explores the importance of releasing ChangeLists and highlights the strengths and weaknesses of our approach. This serves to validate some parts of our approach and uncover important areas for future work.
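To make the ChangeList idea concrete, the sketch below shows one simple way a per-slice comparison between a pre-update and post-update model could be computed. This is an illustrative assumption, not the paper's implementation: the names (`SliceDelta`, `compute_changelist`, `min_delta`) and the accuracy-based significance check are hypothetical, and the actual approach must also discover and name the slices, which is taken as given here.

```python
# Illustrative sketch of a prototype ChangeList entry: compare accuracy of the
# pre- and post-update models on each candidate slice and surface slices whose
# performance shifted. Names and thresholds are assumptions for this example.
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class SliceDelta:
    slice_name: str    # human-readable concept attributed to the slice
    acc_before: float  # accuracy of the pre-update model on the slice
    acc_after: float   # accuracy of the post-update model on the slice
    n_examples: int    # slice size, useful when judging significance


def compute_changelist(
    labels: Sequence[int],
    preds_before: Sequence[int],
    preds_after: Sequence[int],
    slices: Dict[str, Callable[[int], bool]],  # slice name -> membership test on example index
    min_delta: float = 0.05,                   # hypothetical significance threshold
) -> List[SliceDelta]:
    """Return slices whose accuracy changed by at least `min_delta` across the update."""
    changelist: List[SliceDelta] = []
    for name, in_slice in slices.items():
        idx = [i for i in range(len(labels)) if in_slice(i)]
        if not idx:
            continue
        acc_b = sum(preds_before[i] == labels[i] for i in idx) / len(idx)
        acc_a = sum(preds_after[i] == labels[i] for i in idx) / len(idx)
        if abs(acc_a - acc_b) >= min_delta:
            changelist.append(SliceDelta(name, acc_b, acc_a, len(idx)))
    # Sort so regressions (largest performance drops) surface first for the consumer.
    return sorted(changelist, key=lambda d: d.acc_after - d.acc_before)
```

Under this framing, an entry where `acc_after < acc_before` despite an overall accuracy gain corresponds to the "globally improved, locally degraded" cases reported in the abstract.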