This project contains metrics generated from an adapted PetShop application (see https://aws.amazon.com/blogs/mt/improve-your-application-availability-with-aws-observability-solutions/ for a description of the application and https://github.com/aws-samples/one-observability-demo/tree/main/PetAdoptions/petsite/petsite/Controllers for the code).
Performance issues were triggered in various microservices of this application, leading to increased latency and reduced availability.
The task of root-cause analysis (RCA) is to identify the source of such issues, which is an important step toward mitigating and resolving them. Finding the root cause, however, can be extremely cumbersome and time-consuming, particularly in complex applications composed of tens or hundreds of microservices.
The dataset encompasses latency, requests, and availability metrics, gathered from a distributed application comprising 41 components, including databases, load balancers, queues, storage systems, and containerized microservices. In addition to normal operation metrics, the dataset includes 68 injected performance issues, such as request overload, memory leaks, CPU hog, and misconfigurations, which increase latency and reduce availability throughout the system. The metrics are annotated with the corresponding issues, serving as ground truth for the analysis.
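To make the role of the ground-truth annotations concrete, here is a minimal sketch of how a ranking-based RCA method could be scored against them. Everything in it is an assumption for illustration: the component names, the latency values, and the z-score baseline are made up and do not reflect the dataset's actual file format or any method shipped with it.

```python
from statistics import mean, stdev

# Hypothetical latency samples in milliseconds. Component names and values
# are invented for illustration; they are not taken from the dataset.
normal = {
    "frontend": [100, 102, 98, 101, 99],
    "search-service": [40, 42, 39, 41, 40],
    "database": [10, 11, 9, 10, 10],
}
during_issue = {
    "frontend": [130, 128, 135],
    "search-service": [90, 95, 92],
    "database": [10, 11, 10],
}
# Hypothetical annotation: the component where the issue was injected.
ground_truth = "search-service"


def anomaly_score(baseline, observed):
    """Absolute z-score of the observed mean relative to the baseline."""
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0
    return abs(mean(observed) - mean(baseline)) / sigma


def rank_components(normal, during_issue):
    """Order components from most to least anomalous."""
    scores = {c: anomaly_score(normal[c], during_issue[c]) for c in normal}
    return sorted(scores, key=scores.get, reverse=True)


def top_k_hit(ranking, truth, k):
    """True if the annotated root cause is among the first k candidates."""
    return truth in ranking[:k]


ranking = rank_components(normal, during_issue)
print(ranking)
print("top-1 hit:", top_k_hit(ranking, ground_truth, 1))
```

Note that in this toy example the frontend also looks anomalous because the latency increase propagates upstream. A plain anomaly ranking can therefore be misleading in larger systems, which is one motivation for methods that take the dependency structure between components into account.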
We illustrate how to use the dataset and how to apply several RCA methods released as part of PyRCA (sfr-pyrca) and DoWhy.