Compute elasticity is a primary benefit of using cloud-based data processing platforms such as Amazon EMR, where clusters can be scaled both horizontally and vertically. For example, a query scanning petabytes of data can run faster in a cluster with thousands of nodes compared to one with only a few hundred. However, not all workloads require the same computational power or have the same resource utilization patterns throughout their lifetime. Optimizing solely for performance by over-provisioning clusters can result in extra costs for customers stemming from unused capacity. Under-provisioning clusters, on the other hand, will keep costs low but can delay job completion time and cause SLAs to be missed. To further complicate the problem, a single cluster may run concurrent workloads, each with different resource needs. Typical solutions to managing time-varying resource needs involve frequent manual cluster rescaling or using different-sized clusters for different types of queries, which increases the cluster administrator’s workload. This paper presents Amazon EMR Managed Scaling, a feature that continuously and automatically resizes EMR clusters with a goal to optimize the cost/performance ratio, with minimal user input. We describe how we iteratively built our engine-agnostic scaling approach that uses a cluster’s telemetry, topology and past behavior to adjust its resource utilization in response to changing resource requirements. We show how each of the individual techniques presented herein solve specific workload execution patterns, and how EMR combines them into a single, unified algorithm that can make correct scaling decisions within seconds.
Research areas