Profiling deep learning workloads at scale using Amazon SageMaker
2022
With the rise of deep learning (DL), machine learning (ML) has become compute and data intensive, typically requiring multi-node multi-GPU clusters. As state-of-the-art models grow in size in the order of trillions of parameters, their computational complexity and cost also increase rapidly. Since 2012, the cost of deep learning doubled roughly every quarter, and this trend is likely to continue. ML practitioners have to cope with common challenges of efficient resource utilization when training such large models. In this paper, we propose a new profiling tool that cross-correlates relevant system utilization metrics and framework operations. The tool supports profiling DL models at scale, identifies performance bottlenecks, and provides insights with recommendations. We deployed the profiling functionality as an add-on to Amazon SageMaker Debugger, a fully-managed service that leverages an on-the-fly analysis system (called rules) to automatically identify complex issues in DL training jobs. By presenting deployment results and customer case studies, we show that it enables users to identify and fix issues caused by inefficient hardware resource usage, thereby reducing training time and cost.
Research areas