Distributed training of large language models on AWS Trainium
2024
Large language models (LLMs) are powerful across a wide range of tasks but prohibitively expensive to train, often requiring thousands of compute devices, typically GPUs. To reduce the cost of training LLMs for customers, Amazon Web Services (AWS) launched the Amazon EC2 trn1 instances, powered by AWS Trainium, Amazon’s homegrown deep-learning accelerator, as an alternative platform for distributed LLM training. The trn1 instances provide a high-performance LLM training solution at a lower cost than their GPU-based counterpart, the p4d instances, which are powered by NVIDIA A100 GPUs. This paper describes the design and development of the Neuron Distributed Training Library, a component of the AWS Neuron SDK that enables distributed training of large language models on AWS Trainium. The library supports a variety of existing distributed training techniques through unified interfaces and provides additional features to address trn1-specific challenges. Our evaluation shows that trn1 instances, specifically the trn1.32xlarge, achieve comparable or better performance (up to 24.6% improvement) at significantly lower cost (up to 46.3% savings) on selected workloads compared to p4d.24xlarge instances. As a result, AWS Trainium has been adopted for training numerous external and internal models, demonstrating its high performance and cost-effectiveness. Several supported open-source LLMs are accessible via HuggingFace Optimum Neuron.
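As a rough illustration of the kind of workflow the paper describes, the sketch below shows how a supported open-source LLM might be fine-tuned on a trn1.32xlarge through HuggingFace Optimum Neuron. The checkpoint, dataset, and parallelism settings are illustrative assumptions, not configurations taken from the paper.

```python
# Illustrative sketch only: fine-tuning an open-source LLM on AWS Trainium via
# HuggingFace Optimum Neuron. The model checkpoint, dataset, and parallelism
# settings below are assumptions for demonstration, not values from the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "meta-llama/Llama-2-7b-hf"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Assumed toy dataset, tokenized for causal language modeling.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
)

# Tensor parallelism splits each layer across Neuron cores; a trn1.32xlarge
# exposes 32 cores, so tensor_parallel_size=8 leaves 4 data-parallel replicas.
training_args = NeuronTrainingArguments(
    output_dir="llama2-trn1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    tensor_parallel_size=8,
    max_steps=1000,
)

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A script like this would typically be launched across all Neuron cores of the instance with torchrun, e.g. `torchrun --nproc_per_node=32 train.py` on a trn1.32xlarge; the script name here is hypothetical.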
Research areas