Near-linear scaling of gigantic-model training on AWS

A new distributed-training library achieves near-linear efficiency in scaling from tens to hundreds of GPUs.

State-of-the-art language models have billions of parameters. Training these models within a manageable time requires distributing the workload across a large computing cluster. Ideally, training time would decrease linearly as the cluster size scales up. However, linear scaling is difficult to achieve because the communication required to coordinate the work of the cluster nodes eats into the gains from parallelization.

Related content
Amazon researchers optimize the distributed-training tool to run efficiently on the Elastic Fabric Adapter network interface.

Recently, we put some effort into optimizing the communication efficiency of Microsoft’s DeepSpeed distributed-training library, dramatically improving performance for up to 64 GPUs. However, when we scale from tens of GPUs to hundreds, in the public cloud environment, communication overhead again begins to overwhelm efficiency gains.

In a paper that we'll present in 2023 at the International Conference on Very Large Data Bases (VLDB), we propose a method to make model training scale efficiently on hundreds of GPUs in the cloud. We call this method MiCS, because it minimizes communication scale to bring down communication overhead.

Specifically, where existing distributed-training frameworks such as DeepSpeed and FairScale divide a model state across all GPUs, MiCS makes multiple replicas of the model state and partitions each replica within a subset of GPUs. Depending on the model size, a replica may fit on a single computing node — a single machine with high-speed connections between its GPUs — or on multiple nodes.

Thus, in MiCS, frequent communication operations, like parameter gathering, are restricted to a subset of GPUs. In this way, when we scale a cluster up — by adding new replicas across new nodes — the communication latency of frequent communication operations remains fixed, rather than growing with the size of the cluster.

We also reduce the data volume transmitted between nodes in the event that a copy of the model state won’t fit in a single node. Lastly, MiCS includes a gradient synchronization schedule that amortizes expensive gradient synchronization among all workers.

Our experimental results show significant improvement in throughput and scaling efficiency on different-sized BERT models evaluated on clusters consisting of p3dn.24xlarge instances. MiCS is able to achieve near-linear scalability (denoted by the rectangular frames in the figure below) and provides up to 2.82-fold throughput compared to the second and third states of the three-stage zero-redundancy optimizer, or ZeRO, the communication management method built into DeepSpeed-v0.5.6 .

We have also compared MiCS with our earlier optimizations of ZeRO’s third stage (see figure below), demonstrating improvements even at the lower GPU counts that we investigated previously. We report all these findings in greater detail in a preprint paper on the arXiv.

MiCS results.png
A comparison of MiCS and our earlier optimizations of DeepSpeed Zero’s third stage.

AWS P4d provides up to 400Gbps networking bandwidth for high-performance computing. Unfortunately, the distributed system may not be able to fully utilize 400Gbps efficiently because of communication overhead — especially latency, which increases when adding more GPUs to the cluster.

Related content
Optimizing placement of configuration data ensures that it’s available and consistent during “network partitions”.

We have deployed MiCS to train proprietary models with up to 175 billion parameters on p4d.24xlarge (40GB A100) and p4de.24xlarge (80GB A100) instances. When training a 175-billion-parameter model with a sequence length of 2,048 on 16 p4de.24xlarge instances, we are able to achieve 169-teraflops (54.2% of the theoretical peak) performance on each GPU. When we train a 100-billion-parameter model on 64 p4d.24xlarge instances (512 A100 GPUs), MiCS maintains over 170 teraflops per GPU (54.5% of the theoretical peak).

When the size of the cluster is scaled from 128 GPUs to 512 GPUs, MiCS achieves 99.4% of the linear-scaling efficiency (as measured by the “weak scaling” metric). In contrast, DeepSpeed ZeRO’s third stage achieves only 72% weak-scaling efficiency and saturates at 62 teraflops per GPU (19.9% of the theoretical peak).

Scale-aware model partitioning

By default, DeepSpeed partitions model states across all devices, a strategy that lowers the memory consumption on each GPU in the cluster but incurs large communication overhead in training. More importantly, the overhead scales up with the size of the cluster, which causes the scalability to drop significantly at large scale.

Instead of partitioning model states to all GPUs, MiCS divides GPUs in the cluster into multiple groups and partitions model states within each group. We call these groups partition groups. Each group holds a complete replica of model states. The following figure gives an example of partition groups consisting of two consecutive GPUs. Those GPUs holding the same part of the model state form another kind of group, a replication group.

Graphic shows the relationship between partition groups and replication groups in MiCS.
The relationship between partition groups and replication groups in MiCS.

Partitioning model states within each partition group restricts the most frequent communications, parameter gathering and gradient synchronization, within a fixed number of GPUs. This strategy effectively controls the communication overhead and does not let it grow with the size of the cluster.

Hierarchical communication strategy

When the memory requirement for a single replica of the model state is larger than the total amount of GPU memory in a single node, we need to store the replica on GPUs spanning multiple nodes. In that case, we have to rely on less-efficient internode communication.

Related content
Earlier this year, we reported a speech recognition system trained on a million hours of data, a feat possible through semi-supervised learning, in which training data is annotated by machines rather than by people. These sorts of massive machine learning projects are becoming more common, and they require distributing the training process across multiple processors. Otherwise, training becomes too time consuming.

The volume of transmitted data and the latency in a collective communication are determined by the message size and the number of participants. Particularly, the communication volume is proportional to (p - 1)/p, where p denotes the number of participants, and if the participants use the standard ring-shaped communication pattern, the latency has a linear dependency on the number of participants.

The message size cannot be reduced without compromising data integrity, but we can reduce the number of participants in internode communications. This lowers the communication volume factor to (p - k)/p and latency by p/(p/k + k) times, where k is the number of GPUs on a single node.

Consider the simple example below, involving two nodes with two GPUs each. The standard ring-shaped communication pattern would aggregate data across nodes (left) by passing messages from each GPU to the next, so a single internode communication involves four GPUs.

Internode communication.png
MiCS reduces the number of GPUs that participate in any given internode communication.

MiCS, by contrast, executes these internode operations in parallel, so each internode communication involves only two GPUs (right), which exchange only half the information that we want to communicate. Each node then aggregates the internode data locally to assemble the full message. In this case, the communication volume factor is reduced from ¾ ((4-1)/4) to ½ ((4-2/4).

Two-hop gradient synchronization

Synchronizing gradients among all workers is an expensive operation, required to keep workers working on the same model states. During the training of large neural nets, batch size is typically limited by GPU memory. Gradient accumulation is a technique that splits a batch of samples into several microbatches that will be run sequentially in multiple microsteps.

Related content
“Anytime query” approach adapts to the available resources.

With MiCS, we can accumulate gradients inside each partition group in multiple microbatches until the last microbatch is processed. That is, for each microstep, we can accumulate the full set of gradients for each model replica inside a subset of GPUs (i.e., a partition group). Then, after the last microbatch is handled, each GPU synchronizes gradients with the other GPUs representing the same part of the model state.

This allows us to amortize the synchronization overhead across replication groups to multiple microsteps. The following figure gives an example of two-hop gradient synchronization for training with four microsteps.

Gradient accumulation.png
Two-hop gradient synchronization.

Because of these three techniques, MiCS shows great scalability on large clusters and delivers excellent training throughput performance, and it enables us to achieve a new state-of-the-art performance on AWS p4de.24xlarge machines.

We are working to open-source MiCS for public use, in the belief that it will greatly reduce the time and cost of large-model training on the Amazon EC2 platform. Please refer to our preprint for a more detailed explanation of our system and analysis of its performance.

Acknowledgements: Yida Wang, Justin Chiu, Roshan Makhijani, RJ, Stephen Rawls, Xin Jin

Research areas

Related content

AU, VIC, Melbourne
Are you excited about leveraging state-of-the-art Computer Vision algorithms and large datasets to solve real-world problems? Join Amazon as an Applied Scientist Intern and be at the forefront of AI innovation! As an Applied Scientist Intern, you'll work in a fast-paced, cross-disciplinary team of pioneering researchers. You'll tackle complex problems, developing solutions that either build on existing academic and industrial research or stem from your own innovative thinking. Your work may even find its way into customer-facing products, making a real-world impact. Please note: This internship is a duration of 6 months full time with a start date in Jan-March 2027. The successful intern is required to be based in Melbourne and relocation allowance will be provided if you are based outside of Melbourne. Key job responsibilities - Develop novel solutions and build prototypes - Work on complex problems in Computer Vision and Machine Learning - Contribute to research that could significantly impact Amazon's operations - Collaborate with a diverse team of experts in a fast-paced environment - Collaborate with scientists on writing and submitting papers to Tier-1 conferences (e.g., CVPR, ICCV, NeurIPS, ICML) - Present your research findings to both technical and non-technical audiences Key Opportunities - Collaborate with leading machine learning researchers - Access Amazon tools and hardware (large GPU clusters) - Address challenges at an unparalleled scale - Become a disruptor, innovator, and problem solver in the field of computer vision - Potentially deliver solutions to production in customer-facing applications - Opportunities to become an FTE after the internship Join us in shaping the future of AI at Amazon. Apply now and turn your research into real-world solutions!
IN, KA, Bengaluru
The Trust CX Innovations team is looking for an Applied Scientist with strong background in Generative AI space to build solutions that help in upholding customer trust for Alexa+. As an Applied Scientist in Trust CX innovations, you will be at the forefront of developing innovative solutions to critical challenges in AI trust and privacy. You'll lead research in trust-preserving machine learning techniques. We are working on revolutionizing the way Amazonians work and collaborate. You will help us achieve new heights of productivity through the power of advanced generative AI technologies. Key job responsibilities - Lead research initiatives in generative AI, focusing on LLMs, multimodal models, and frontier AI capabilities - Develop innovative approaches for model optimization, including prompt engineering, few-shot learning, and efficient fine-tuning - Pioneer new methods for AI safety, alignment, and responsible AI development - Design and execute sophisticated experiments to evaluate model performance and behavior - Lead the development of production-ready AI solutions that scale efficiently - Collaborate with product teams to translate research innovations into practical applications - Guide engineering teams in implementing AI models and systems at scale - Author technical papers for top-tier conferences - File patents for novel AI technologies and applications A day in the life You will be working with a group of talented scientists on researching algorithm and running experiments to test scientific proposal/solutions to improve our trust-preserving experiences. This will involve collaboration with partner teams including engineering, PMs, data annotators, and other scientists to discuss data quality, policy, and model development. You work closely with partner teams across Alexa to deliver platform features that require cross-team leadership. About the team Who We Are: Trust CX Innovations is a strategic innovation team within Amazon Devices & Services that focuses on advancing AI technology while prioritizing customer trust and experience. Our team operates at the intersection of artificial intelligence, privacy engineering and customer-centric design. Our Mission: To pioneer trustworthy AI innovations that delight customers while setting new standards for privacy and responsible technology development. We aim to transform how Amazon builds AI products by creating solutions that balance innovation with customer trust.
US, WA, Redmond
We are searching for a talented candidate with expertise in orbital mechanics and spaceflight navigation, including LEO Satellite Orbit Determination. This position requires experience in simulation and analysis of spacecraft orbital mechanics and sequential orbit determination methods, including Extended Kalman Filters (EKF) and/or Unscented Kalman Filter (UKF). Strong analysis skills are required to develop engineering studies of complex large-scale dynamical systems. This position requires demonstrated expertise in computational analysis automation and tool development. Key job responsibilities - Perform spacecraft maneuver or navigation analysis in support of multi-disciplinary trades within the Amazon Leo team. - Contribute to prototype software development of flight algorithms. - Test and assess navigation software for integration into flight systems. - Assess and trouble-shoot the performance of Leo on-board GNSS hardware and software systems. - Work closely with GNC engineers to manage on-orbit performance and develop flight dynamics operations processes. Export Control Requirement: Due to applicable export control laws and regulations, candidates must be a U.S. citizen or national, U.S. permanent resident (i.e., current Green Card holder), or lawfully admitted into the U.S. as a refugee or granted asylum. A day in the life - Interacting with GNC teams to evaluate and troubleshoot satellite issues. - Working within the Flight Dynamics Research team to prioritize tasks. - Performing analysis, simulation, testing and documentation to address assigned tasks.
US, CA, Pasadena
The Amazon Web Services (AWS) Center for Quantum Computing in Pasadena, CA, is looking to hire a Research Scientist with experience in semiconductor process development who will aid in AWS’s effort to bring cloud quantum computing services to its worldwide customer base. You will join a multi-disciplinary team of scientists, and hardware and software engineers working at the forefront of quantum computing. Through your work inside and outside of the cleanroom environment in the fabrication research and development group, you will solve problems related to developing next-generation quantum processors. Candidates must have a demonstrated background in sound scientific and engineering principles, and must have excellent data analysis, bias for action, problem solving, and communication skills, and be highly motivated and curious to research and learn new technical topics as needed. As a research scientist you will be expected to work on new ideas and stay abreast of novel approaches in fabricating and packaging superconducting quantum processors. Working effectively within a team environment is critical. Key job responsibilities Responsibilities include developing novel processes to fabricate high-coherence superconducting qubits; developing advanced 3DI interconnect and routing technologies for integrating superconducting quantum technologies; analyzing inline metrology and electrical test data; writing production standard operating procedures to transfer newly-developed processes to production teams; interacting with project leads to provide feedback that continuously improves different processes. A day in the life The candidate will develop novel technologies using micro-/nano-fabrication techniques inside the cleanroom (independently or in collaboration with other scientists and engineers) for next-generation quantum computing. Outside the cleanroom, the candidate will plan experiments, analyze data, and conceive future innovations. About the team AWS Utility Computing (UC) provides product innovations — from foundational services such as Amazon’s Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2), to consistently released new product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Internet of Things (Iot), Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services. Diverse Experiences AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Why AWS? Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. Inclusive Team Culture Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empower us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (diversity) conferences, inspire us to never stop embracing our uniqueness. Mentorship & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud. Hybrid Work We value innovation and recognize this sometimes requires uninterrupted time to focus on a build. We also value in-person collaboration and time spent face-to-face. Our team affords employees options to work in the office every day or in a flexible, hybrid work model near one of our U.S. Amazon offices.
IN, KA, Bengaluru
Are you passionate about solving complex business problems at scale through Generative AI? Do you want to help build intelligent systems that reason, act, and learn from minimal supervision? If so, we have an exciting opportunity for you on Amazon's Trustworthy Shopping Experience (TSE) team. At TSE, our vision is to guarantee customers a worry-free shopping experience by earning their trust that the products they buy are safe, authentic, and compliant with regulations and policy. We do this in close partnership with our selling partners, empowering them with best-in-class tools and expertise to offer a high-quality, compliant selection that customers trust. As an Applied Scientist I, you will bring subject matter expertise in at least one relevant discipline (e.g., NLP, computer vision, representation learning, agentic architecture) to contribute to next-generation agentic AI solutions that automate complex manual investigation processes at Amazon scale. Working alongside senior scientists, you will map business goals—such as reducing cost-of-serving while maintaining trust and safety standards—to well-defined scientific problems and metrics. You will invent, refine, and experiment with solutions spanning agentic reasoning, self-supervised representation learning, few-shot adaptation, multimodal understanding, and model compression. With guidance from senior scientists, you will stay current on research trends and benchmark your results against the state of the art. You will help design and execute experiments to identify optimal solutions, initiating the development and implementation of small components with team guidance. You will write secure, stable, testable, and well-documented production code at the level of an SDE I, rigorously evaluating models and quantifying performance. You will handle data in accordance with Amazon policies, troubleshoot issues to root cause, and ensure your work does not put the company at risk. Your scope of influence will typically be at the self-level, with the possibility of mentoring interns. You will participate in team design and prioritization discussions, learn the business context behind TSE's products, and escalate problems with proposed solutions. You will publish internal technical reports and may contribute to peer-reviewed publications and external review activities when aligned with business needs. This role offers a unique opportunity to contribute to end-to-end AI development—from research through production—with your contributions serving hundreds of millions of customers within months, not years. Key job responsibilities • Contribute to the design and development of agentic AI systems with multi-step reasoning, autonomous task execution, and multimodal intelligence, including feedback and memory mechanisms, leveraging reinforcement learning techniques for agent decision-making and policy optimization, with input and guidance from senior scientists • Help productionize models built on top of SFT (Supervised Fine-tuning) and RFT (Reinforced Fine-tuning) approaches, as well as few-shot approaches based on multimodal datasets spanning text, images, and structured data, applying mathematical optimization techniques to improve efficiency, resource allocation, and decision-making in complex workflows, working alongside senior scientists to identify optimal solutions • Contribute to building production-ready deep learning and conventional ML solutions, including multimodal fusion and cross-modal alignment techniques that seamlessly connect visual, textual, and relational understanding, to support automation requirements within your team's scope • Help identify customer and business problems; use reasonable assumptions, data, and customer requirements to solve well-defined scientific problems involving multimodal inputs such as unstructured text, documents, product images, and relational data, developing representations that capture complementary signals across modalities and mapping business goals to scientific metrics • May co-author research papers for peer-reviewed internal and/or external venues, including contributions in areas such as multimodal representation learning and vision-language modeling, and contribute to the wider scientific community by reviewing research submissions, when aligned with business needs • Prototype rapidly, iterate based on feedback, and deliver small components at SDE I level—including multimodal data pipelines and inference modules—that integrate into production-scale systems • Write secure, stable, testable, maintainable, and well-documented code, balancing model capability, deployment cost, and resource usage across multimodal architectures while understanding state-of-the-art data structures, algorithms, and performance tradeoffs • Rigorously test code and evaluate models across individual and combined modalities, quantifying their performance; troubleshoot issues, research root causes, and thoroughly resolve defects, leaving systems more maintainable • Participate in team design, scoping, and prioritization discussions through clear verbal and written communication; seek to learn the business context, science, and engineering behind your team's products, including how multimodal signals contribute to trust and safety decisions • Participate in engineering best practices with peer reviews; clearly document approaches and communicate design decisions; publish internal technical reports to institutionalize scientific learning • Help train and mentor scientist interns; identify and escalate problems with proposed solutions, taking ownership or ensuring clear hand-off to the right owner About the team Trustworthy Shopping Experience Product team in TSE is responsible for the human-in-the-loop products and technology used in the risk investigations at Amazon. The team is also responsible for reducing the cost of performing the investigations, by automating wherever possible and optimizing the experience where manual interventions are needed. The team leverages state-of-the art technology and GenAI to deliver the products and associated goals.
IN, KA, Bengaluru
Do you want to join an innovative team of scientists who use machine learning and statistical techniques to create state-of-the-art solutions for providing better value to Amazon’s customers? Do you want to build and deploy advanced ML systems that help optimize millions of transactions every day? Are you excited by the prospect of analyzing and modeling terabytes of data to solve real-world problems? Do you like to own end-to-end business problems/metrics and directly impact the profitability of the company? Do you like to innovate and simplify? If yes, then you may be a great fit to join the Machine Learning team for India Consumer Businesses. Machine Learning, Big Data and related quantitative sciences have been strategic to Amazon from the early years. Amazon has been a pioneer in areas such as recommendation engines, ecommerce fraud detection and large-scale optimization of fulfillment center operations. As Amazon has rapidly grown and diversified, the opportunity for applying machine learning has exploded. We have a very broad collection of practical problems where machine learning systems can dramatically improve the customer experience, reduce cost, and drive speed and automation. These include product bundle recommendations for millions of products, safeguarding financial transactions across by building the risk models, improving catalog quality via extracting product attribute values from structured/unstructured data for millions of products, enhancing address quality by powering customer suggestions We are developing state-of-the-art machine learning solutions to accelerate the Amazon India growth story. Amazon India is an exciting place to be at for a machine learning practitioner. We have the eagerness of a fresh startup to absorb machine learning solutions, and the scale of a mature firm to help support their development at the same time. As part of the India Machine Learning team, you will get to work alongside brilliant minds motivated to solve real-world machine learning problems that make a difference to millions of our customers. We encourage thought leadership and blue ocean thinking in ML. Key job responsibilities Use machine learning and analytical techniques to create scalable solutions for business problems Analyze and extract relevant information from large amounts of Amazon’s historical business data to help automate and optimize key processes Design, develop, evaluate and deploy, innovative and highly scalable ML models Work closely with software engineering teams to drive real-time model implementations Work closely with business partners to identify problems and propose machine learning solutions Establish scalable, efficient, automated processes for large scale data analyses, model development, model validation and model maintenance Work proactively with engineering teams and product managers to evangelize new algorithms and drive the implementation of large-scale complex ML models in production Leading projects and mentoring other scientists, engineers in the use of ML techniques About the team International Machine Learning Team is responsible for building novel ML solutions that attack India first (and other Emerging Markets across MENA and LatAm) problems and impact the bottom-line and top-line of India business. Learn more about our team from https://www.amazon.science/working-at-amazon/how-rajeev-rastogis-machine-learning-team-in-india-develops-innovations-for-customers-worldwide
GB, Cambridge
Alexa is looking for an Applied Scientist with a strong background in Natural Language Processing (NLP) and Large Language Models (LLMs) to help build state-of-the-art conversational systems. In this role, you will collaborate with a large team of scientists training the Large Language Models that power the Alexa stack, as well as software engineers serving them in production systems. You will own solutions end-to-end: from ideation and research through to production deployment, enabling conversational assistants to support external tools, leverage diverse sources of information, and deliver novel reasoning capabilities to millions of Alexa customers. Key job responsibilities As an Applied Scientist, you will develop innovative solutions to complex problems to extend the functionalities of conversational assistants. You will use your technical expertise to research and implement novel algorithms and modelling solutions in collaboration with other scientists and engineers. You will analyze customer behaviors and define metrics to enable the identification of actionable insights and measure improvements in customer experience. You will communicate results and insights to both technical and non-technical audiences through written reports, presentations and external publications. You would be able to bi-modal on science and engineering: someone who combines strong scientific foundations with the execution skills to ship high-quality solutions. A day in the life As an Applied Scientist on the Alexa Science team, you'll drive innovation in evaluating new product experiences while discovering novel approaches to enhance model capabilities and enrich customer interactions. You'll collaborate with cross-functional teams of engineers and scientists to identify root causes of model and system integration issues, continuously improving the end-to-end customer experience. You'll partner closely with scientists developing and fine-tuning large language models, engineers building low-latency inference infrastructure, and product teams defining customer experience metrics. About the team We are a team of applied scientists and engineers building the intelligence layer that powers Alexa+. Our work sits at the intersection of large language models, decision-making under uncertainty, and production ML systems. What we build directly shapes the customer experience: determining which models serve their requests, optimizing response latency, and creating natural, seamless interactions. We're a collaborative team that values rigorous experimentation, clear communication, and delivering solutions that perform at scale in real-world environments.
US, WA, Seattle
Applied Scientists in AWS Science of Security are dedicated to making AWS the best computing service in the world for customers who require advanced and rigorous solutions for security, privacy, and sovereignty. Key job responsibilities The successful candidate will: * Solve large or significantly complex problems that require deep knowledge and understanding of your domain and scientific innovation. * Own strategic problem solving, and take the lead on the design, implementation, and delivery for solutions that have a long-term quantifiable impact. *Provide cross-organizational technical influence, increasing productivity and effectiveness by sharing your deep knowledge and experience. * Develop strategic plans to identify fundamentally new solutions for business problems. * Assist in the career development of others, actively mentoring individuals and the community on advanced technical issues. A day in the life This is a unique and rare opportunity to get in early on a fast-growing segment of AWS and help shape the technology, product and the business. You will have a chance to utilize your deep technical experience within a fast moving, start-up environment and make a large business and customer impact. About the team Diverse Experiences Amazon Security values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Why Amazon Security? At Amazon, security is central to maintaining customer trust and delivering delightful customer experiences. Our organization is responsible for creating and maintaining a high bar for security across all of Amazon’s products and services. We offer talented security professionals the chance to accelerate their careers with opportunities to build experience in a wide variety of areas including cloud, devices, retail, entertainment, healthcare, operations, and physical stores. Inclusive Team Culture In Amazon Security, it’s in our nature to learn and be curious. Ongoing DEI events and learning experiences inspire us to continue learning and to embrace our uniqueness. Addressing the toughest security challenges requires that we seek out and celebrate a diversity of ideas, perspectives, and voices. Training & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, training, and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why flexible work hours and arrangements are part of our culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve.
US, MA, Boston
Our team is involved with pre-silicon design verification for custom IP. A critical requirement of the verification flow is the requirement of legal and realistic stimulus of a custom Machine Learning Accelerator Chip. Content creation is built using formal methods that model legal behavior of the design and then solving the problem to create the specific assembly tests. The entire frame work for creating these custom tests is developed using a SMT solver and custom software code to guide the solution space into templated scenarios. This highly visible and innovative role requires the design of this solving framework and collaborating with design verification engineers, hardware architects and designers to ensure that interesting content can be created for the projects needs. Key job responsibilities Develop an understanding for a custom machine learning instruction set architecture. Model correctness of instruction streams using first order logic. Create custom API's to allow control over scheduling and randomness. Deploy algorithms to ensure concurrent code is safely constructed. Create coverage metrics to ensure solution space coverage. Use novel methods like machine learning to automate content creation. About the team Utility Computing (UC) AWS Utility Computing (UC) provides product innovations — from foundational services such as Amazon’s Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2), to consistently released new product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Internet of Things (Iot), Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for customers who require specialized security solutions for their cloud services. Annapurna Labs (our organization within AWS UC) designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago—even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.
US, WA, Seattle
We are seeking an Applied Scientist to join the Amazon Precision Match (APM) team within Customer Journey, Network Solutions. APM is a transformative initiative replacing Amazon's legacy queue-based customer service routing with intelligent algorithmic matching — connecting customers with the best available service option based on their needs and Customer Service Associates (CSA) capabilities. This role will drive the science behind a high-scale system with significant projected impact on operational efficiency and customer experience. You will work at the intersection of recommendation systems, real-time ML inference, and large-scale experimentation to redefine how Amazon serves its customers. Key job responsibilities - Design, develop, and optimize ML-based matching algorithms that pair customers with optimal CSAs based on contact complexity, intent, and CSA skill profiles. - Build and iterate on feature engineering pipelines across CSA-level (skills, tenure, sentiment handling), contact-level (intent, complexity, urgency), and customer-level (language, communication style) attributes. - Run offline simulations on large-scale historical contact data and design statistically rigorous A/B experiments to validate matching improvements. - Develop real-time low-latency scoring and inference systems for production contact routing. - Address the cold start problem for new CSAs and build continuous model retraining infrastructure using production feedback. - Partner with CS Economics, Capacity Planning, and Quality teams on experiment design and results interpretation. - Evolve the matching framework from individual CSA ranking to set-based optimization balancing performance and operational sustainability. A day in the life You will spend your days iterating on matching models, analyzing experiment results from live production traffic, and collaborating with engineers and product managers to translate science insights into system improvements. You'll partner with the Customer Service Economics team to design experiments, review simulation outputs, and present findings to senior leadership. You'll also deep-dive into CSA behavioral patterns, contact transcripts, and performance data to identify new matching signals and continuously improve the algorithm. About the team The Amazon Precision Match team is a high-impact, fast-moving science and engineering team within Customer Journey, Network Solutions. Our mission is to ensure every Amazon customer is connected with the right service option at the right time — improving customer experience while driving operational efficiency at scale. We value intellectual curiosity, rigorous experimentation, and a bias for action. We operate with a continuous improvement flywheel: offline simulation, A/B testing, and production rollout. We collaborate closely with Customer Service Operations, Capacity Planning, Quality, and partner science teams across Amazon.