Database performance troubleshooting is a complex multi-step process that broadly involves three key stages– (a) Detection: determining what’s wrong and when; (b) Root Cause Analysis (RCA): reasoning about why is the performance poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they hardly work in real-world at scale. First, real-world customer workloads are noisy, non-stationary and quasi-periodic in nature rendering traditional detectors ineffective. Second, real-world production databases execute a highly diverse set of queries that skew the database statistics into long-tail distributions causing traditional RCA methods to fail. Third, these databases typically execute millions of such diverse queries every minute rendering traditional methods inefficient when deployed at scale.
In this paper we describe Vista, a machine learning based performance troubleshooting framework for databases, and dive-deep into how it addresses the 3 real-world problems outlined above. Vista deploys a deep auto-regressive model trained on a large and diverse Amazon Relational Database Service (RDS) fleet with cus-tom skip connections and periodicity alignment features to model long range and varying periodicity in customer workloads, and detects performance bottlenecks in the form of outliers. Furthermore, it efficiently filters only a top few dominating SQL queries from millions in a problematic workload, and uses a robust causal inference framework to identify the culprit queries and their statistics leading to a low false-positive and false-negative rate. Currently, Vista runs on hundreds of thousands of RDS databases, analyzes millions of workloads every day bringing down the troubleshooting time for RDS customers from hours to seconds. At the end, we also describe several challenges and learnings from implementing and deploying Vista at Amazon scale.
Vista: Machine learning based database performance troubleshooting framework in Amazon RDS
2024
Research areas