Lessons learned from 10 years of DynamoDB

Prioritizing predictability over efficiency, adapting data partitioning to traffic, and continuous verification are a few of the principles that help ensure stability, availability, and efficiency.

Amazon DynamoDB is one of the most popular NoSQL database offerings on the Internet, designed for simplicity, predictability, scalability, and reliability. To celebrate DynamoDB’s 10th anniversary, the DynamoDB team wrote a paper describing lessons we’d learned in the course of expanding a fully managed cloud-based database system to hundreds of thousands of customers. The paper was presented at this year’s USENIX ATC conference.

The paper captures the following lessons that we have learned over the years:

  • Designing systems for predictability over absolute efficiency improves system stability. While components such as caches can improve performance, they should not introduce bimodality, in which the system has two radically different ways of responding to similar requests (e.g., one for cache misses and one for cache hits). Consistent behaviors ensure that the system is always provisioned to handle the unexpected. 
  • Adapting to customers’ traffic patterns to redistribute data improves customer experience. 
  • Continuously verifying idle data is a reliable way to protect against both hardware failures and software bugs in order to meet high durability goals. 
  • Maintaining high availability as a system evolves requires careful operational discipline and tooling. Mechanisms such as formal proofs of complex algorithms, game days (chaos and load tests), upgrade/downgrade tests, and deployment safety provide the freedom to adjust and experiment with the code without the fear of compromising correctness. 
Related content
Amazon DynamoDB was introduced 10 years ago today; one of its key contributors reflects on its origins, and discusses the 'never-ending journey' to make DynamoDB more secure, more available and more performant.

Before we dig deeper into these topics, a little terminology. A DynamoDB table is a collection of items (e.g., products), and each item is a collection of attributes (e.g., name, price, category, etc.). Each item is uniquely identified by its primary key. In DynamoDB, tables are typically partitioned, or divided into smaller sub-tables, which are assigned to nodes. A node is a set of dedicated computational resources — a virtual machine — running on a single server in a datacenter.

DynamoDB stores three copies of each partition, in different availability zones. This makes the partition highly available and durable because the availability zones’ storage resources share nothing and are substantially independent. For instance, we wouldn’t assign a partition and one of its copies to nodes that share a power supply, because a power outage would take both of them offline. The three copies of the same partition are known as a replication group, and there is a leader for the group that is responsible for replicating all the customer mutations and serving strongly consistent reads.

DynamoDB architecture.png
The DynamoDB architecture, including a request router, the partition metadata system, and storage nodes in different availability zones (AZs).

Those definitions in hand, let’s turn to our lessons learned.

Predictability over absolute efficiency

DynamoDB employs a lot of metadata caches in order to reduce latency. One of those caches stores the routing metadata for data requests. This cache is deployed on a fleet of thousands of request routers, DynamoDB’s front-end service.

In the original implementation, when the request router received the first request for a table, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changed, the cache hit rate was approximately 99.75%.

Related content
How Alexa scales machine learning models to millions of customers.

This was an amazing hit rate. However, on the flip side, the fallback mechanism for this cache was to hit the metadata table directly. When the cache becomes ineffective, the metadata table needs to instantaneously scale from handling 0.25% of requests to 100%. The sudden increase in traffic can cause the metadata table to fail, causing cascading failure in other parts of the system. To mitigate against such failures, we redesigned our caches to behave predictably.

First, we built an in-memory datastore called MemDS, which significantly reduced request routers’ and other metadata clients’ reliance on local caches. MemDS stores all the routing metadata in a highly compressed manner and replicates it across a fleet of servers. MemDS scales horizontally to handle all incoming requests to DynamoDB.

Second, we deployed a new local cache that avoids the bimodality of the original cache. All requests, even if satisfied by the local cache, are asynchronously sent to the MemDS. This ensures that the MemDS fleet is always serving a constant volume of traffic, regardless of cache hit or miss. The regular exercise of the fallback code helps prevent surprises during fallback.

DDB-MemDS.png
DynamoDB architecture with MemDS.

Unlike conventional local caches, MemDS sees traffic that is proportional to the customer traffic seen by the service; thus, during cache failures, it does not see a sudden amplification of traffic. Doing constant work removed the need for complex logic to handle edge cases around cache misses and reduced the reliance on local caches, improving system stability.

Reshaping partitioning based on traffic

Partitions offer a way to dynamically scale both the capacity and performance of tables. In the original DynamoDB release, customers explicitly specified the throughput that a table required in terms of read capacity units (RCUs) and write capacity units (WCUs). The original system assigned partitions to nodes based on both available space and computational capacity.

Related content
Optimizing placement of configuration data ensures that it’s available and consistent during “network partitions”.

As the demands on a table changed (because it grew in size or because the load increased), partitions could be further split to allow the table to scale elastically. Partition abstraction proved really valuable and continues to be central to the design of DynamoDB.

However, the early version of DynamoDB assigned both space and capacity to individual partitions on the basis of size, evenly distributing computational resources across table entries. This led to challenges of “hot partitions” and throughput dilution.

Hot partitions happened because customer workloads were not uniformly distributed and kept hitting a subset of items. Throughput dilution happened when partitions that had been split to handle increased load ended up with so few keys that they could quickly max out their meager allocated capacity.

Our initial response to these challenges was to add bursting and adaptive capacity (along with other features such as split for consumption) to DynamoDB. This line of work also led to the launch of on-demand tables.

Bursting is a way to absorb temporal spikes in workloads at a partition level. It’s based on the observation that not all partitions hosted by a storage node use their allocated throughput simultaneously.

Related content
Amazon researchers describe new method for distributing database tables across servers.

The idea is to let applications tap into unused capacity at a partition level on a best-effort basis to absorb short-lived spikes. DynamoDB still maintains workload isolation by ensuring that a partition can burst only if there is unused throughput at the node level.

DynamoDB also launched adaptive capacity to handle long-lived spikes that cannot be absorbed by the burst capacity. Adaptive capacity monitors traffic patterns and repartitions tables so that heavily accessed items reside on different nodes.

Both bursting and adaptive capacity had limitations, however. Bursting was helpful only for short-lived spikes in traffic, and it was dependent on nodes’ having enough throughput to support it. Adaptive capacity was reactive and kicked in only after transmission rates had been throttled down to avoid overloads.

To address these limitations, the DynamoDB team replaced adaptive capacity with global admission control (GAC). GAC builds on the idea of token buckets, in which bandwidth is allocated to network nodes as tokens, and the nodes “cash in” tokens in order to transmit data. Each request router maintains a local token bucket and communicates with GAC to replenish tokens at regular intervals (on the order of every few seconds). For an extra layer of defense, DynamoDB also uses token buckets at the partition level.

Continuous verification 

To provide durability and crash recovery, DynamoDB uses write-ahead logs, which record data writes before they occur. In the event of a crash, DynamoDB can use the write-ahead logs to reconstruct lost data writes, bringing partitions up to date.

Write-ahead logs are stored in all three replicas of a partition. For higher durability, the write-ahead logs are periodically archived to S3, an object store that is designed for more than 99.99% (in fact, 11 nines) durability. Each replica contains the most recent write-ahead logs, which are usually waiting to be archived. The unarchived logs are typically a few hundred megabytes in size.

Storage replica vs. log replica.png
Healing a storage replica by copying the B-tree can take several minutes, while adding a log replica, which takes only a few seconds, ensures that there is no impact on durability.

DynamoDB continuously verifies data at rest. Our goal is to detect any silent data errors or “bit rot” — bit errors caused by degradation of the storage medium. An example of continuous verification is the scrub process.

The scrub process verifies two things: that all three copies in a replication group have the same data and that the live replicas match a reference replica built offline using the archived write-ahead-log entries.

The verification is done by computing the checksum of the live replica and matching that with a snapshot of the reference replica. A similar technique is used to verify replicas of global tables. Over the years, we have learned that continuous verification of data at rest is the most reliable method of protecting against hardware failures, silent data corruption, and even software bugs.

Availability

DynamoDB regularly tests its resilience to node, rack, and availability zone (AZ) failures. For example, to test the availability and durability of the overall service, DynamoDB performs power-off tests. Using realistic simulated traffic, a job scheduler powers off random nodes. At the end of all the power-off tests, the test tools verify that the data stored in the database is logically valid and not corrupted.

Related content
Amazon Athena reduces query execution time by 14% by eliminating redundant operations.

The first point about availability is that it needs to be measurable. DynamoDB is designed for 99.999% availability for global tables and 99.99% availability for regional tables. To ensure that these goals are being met, DynamoDB continuously monitors availability at the service and table levels. The tracked availability data is used to estimate customer-perceived availability trends and trigger alarms if the number of errors that customers see crosses a certain threshold.

These alarms are called customer-facing alarms (CFAs). The goal of these alarms is to report any availability-related problems and proactively mitigate them either automatically or through operator intervention. The key point to note here is that availability is measured not only on the server side but on the client side.

We also use two sets of clients to measure the user-perceived availability. The first set of clients is internal Amazon services using DynamoDB as the data store. These services share the availability metrics for DynamoDB API calls as observed by their software.

The second set of clients is our DynamoDB canary applications. These applications are run from every AZ in the region, and they talk to DynamoDB through every public endpoint. Real application traffic allows us to reason about DynamoDB availability and latencies as seen by our customers. The canary applications offer a good representation of what our customers might be experiencing both long and short term.

The second point is that read and write availability need to be handled differently. A partition’s write availability depends on the health of its leader and of its write quorum, meaning two out of the three replicas from different AZs. A partition remains available as long as there are enough healthy replicas for a write quorum and a leader.

Related content
“Anytime query” approach adapts to the available resources.

In a large service, hardware failures such as memory and disk failures are common. When a node fails, all replication groups hosted on the node are down to two copies. The process of healing a storage replica can take several minutes because the repair process involves copying the B-tree — a data structure that maps partitions to storage locations — and write-ahead logs.

Upon detecting an unhealthy storage replica, the leader of a replication group adds a log replica to ensure there is no impact on durability. Adding a log replica takes only a few seconds, because the system has to copy only the most recent write-ahead logs from a healthy replica; reconstructing the more memory-intensive B-tree can wait. Quick healing of affected replication groups using log replicas thus ensures the high durability of the most recent writes. Adding a log replica is the fastest way to ensure that the write quorum of the group is always met. This minimizes disruption to write availability due to an unhealthy write quorum. The leader replica serves consistent reads.

Introducing log replicas was a big change to the system, but the Paxos consensus protocol, which is formally provable, gave us the confidence to safely tweak and experiment with the system to achieve higher availability. We have been able to run millions of Paxos groups in a region with log replicas. Eventually, consistent reads can be served by any of the replicas. In case a leader fails, other replicas detect its failure and elect a new leader to minimize disruptions to the availability of consistent reads.

Research areas

Related content

US, VA, Arlington
Are you fascinated by the power of Large Language Models (LLM) and Artificial Intelligence (AI) to transform the way we learn and interact with technology? Are you passionate about applying advanced machine learning (ML) techniques to solve complex challenges in the cloud learning space? If so, AWS Training & Certification (T&C) team has an exciting opportunity for you as an Applied Scientist. At AWS T&C, we strive to be leaders in not only how we learn about the latest AI/ML development and AWS services, but also how the same technologies transform the way we learn about them. As an Applied Scientist, you will join a talented and collaborative team that is dedicated to driving innovation and delivering exceptional experiences in our Skill Builder platform for both new learners and seasoned developers. You will be a part of a global team that is focused on transforming how people learn. The position will interact with global leaders and teams across the globe as well as different business and technical organizations. Join us at the AWS T&C Science Team and become a part of a global team that is redefining the future of cloud learning. With access to vast amounts of data, exciting new technology, and a diverse community of talented individuals, you will have the opportunity to make a meaningful impact on the ways how worldwide learners engage with our learning system and builders develop on our platform. Together, we will drive innovation, solve complex problems, and shape the future of future-generation cloud builders. Please visit https://skillbuilder.awsto learn more. Key job responsibilities - Apply your expertise in LLM to design, develop, and implement scalable machine learning solutions that address challenges in discovery and engagement for our international audiences. - Collaborate with cross-functional teams, including software engineers, data engineers, scientists, and product managers, to define project requirements, establish success metrics, and deliver high-quality solutions. - Conduct thorough data analysis to gain insights, identify patterns, and drive actionable recommendations that enhance operational performance and customer experiences across Skill Builder. - Continuously explore and evaluate state-of-the-art techniques and methodologies to improve the accuracy and efficiency of AI/ML systems. - Communicate complex technical concepts effectively to both technical and non-technical stakeholders, providing clear explanations and guidance on proposed solutions and their potential impact. About the team Why AWS? Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. Inclusive Team Culture Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empower us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon conferences, inspire us to never stop embracing our uniqueness. Diverse Experiences AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Mentorship & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud.
US, MA, N.reading
Amazon Industrial Robotics is seeking exceptional talent to help develop the next generation of advanced robotics systems that will transform automation at Amazon's scale. We're building revolutionary robotic systems that combine cutting-edge AI, sophisticated control systems, and advanced mechanical design to create adaptable automation solutions capable of working safely alongside humans in dynamic environments. This is a unique opportunity to shape the future of robotics and automation at an unprecedented scale, working with world-class teams pushing the boundaries of what's possible in robotic dexterous manipulation, locomotion, and human-robot interaction. This role presents an opportunity to shape the future of robotics through innovative applications of deep learning and large language models. At Amazon Industrial Robotics we leverage advanced robotics, machine learning, and artificial intelligence to solve complex operational challenges at an unprecedented scale. Our fleet of robots operates across hundreds of facilities worldwide, working in sophisticated coordination to fulfill our mission of customer excellence. We are pioneering the development of robotics dexterous hands that: - Enable unprecedented generalization across diverse tasks - Are compliant and durable - Can span tasks from power grasps to fine dexterity and nonprehensile manipulation - Can navigate the uncertainty of the environment - Leverage mechanical intelligence, multi-modal sensor feedback and advanced control techniques. The ideal candidate will contribute to research that bridges the gap between theoretical advancement and practical implementation in robotics. You will be part of a team that's revolutionizing how robots learn, adapt, and interact with their environment. Join us in building the next generation of intelligent robotics systems that will transform the future of automation and human-robot collaboration. Key job responsibilities - Design and implement robust sensing for dexterous manipulation, including but not limited to: Tactile sensing, Position sensing, Force sensing, Non-contact sensing - Prototype the various identified sensing strategies, considering the constraints of the rest of the hand design - Build and test full hand sensing prototypes to validate the performance of the solution - Develop testing and validation strategies, supporting fast integration into the rest of the robot - Partner with cross-functional teams to iterate on concepts and prototypes - Work with Amazon's robotics engineering and operations customers to deeply understand their requirements and develop tailored solutions - Document the designs, performance, and validation of the final system
IL, Tel Aviv
Come build the future of entertainment with us. Are you interested in helping shape the future of movies and television? Do you want to help define the next generation of how and what Amazon customers are watching? Prime Video is a premium streaming service that offers customers a vast collection of TV shows and movies - all with the ease of finding what they love to watch in one place. We offer customers thousands of popular movies and TV shows from Originals and Exclusive content to exciting live sports events. We also offer our members the opportunity to subscribe to add-on channels which they can cancel at any time and to rent or buy new release movies and TV box sets on the Prime Video Store. Prime Video is a fast-paced, growth business - available in over 240 countries and territories worldwide. The team works in a dynamic environment where innovating on behalf of our customers is at the heart of everything we do. If this sounds exciting to you, please read on We are seeking an exceptional Applied Scientist to join our Prime Video Sports personalization team in Israel. Our team is dedicated to developing state-of-the-art science to personalize the customer experience and help customers seamlessly find any live event in our selection. You will have the opportunity to work on innovative, large-scale projects that push the boundaries of what's possible in sports content delivery and engagement. Your expertise will be crucial in tackling complex challenges such as information retrieval, sequential modeling, realtime model optimizations, utilizing Large Language Models (LLMs), and building state-of-the-art complex recommender systems. Key job responsibilities We are looking for an Applied Scientist with domain expertise in Personalization, Information Retrieval, and Recommender Systems, or general ML to develop new algorithms and end-to-end solutions. As part of our team of applied scientists and software development engineers, you will be responsible for researching, designing, developing, and deploying algorithms into production pipelines. Your role will involve working with cutting-edge technologies in recommender systems and search. You'll also tackle unique challenges like temporal information retrieval to improve real-time sports content recommendations. As a technologist, you will drive the publication of original work in top-tier conferences in Machine Learning and Recommender Systems. We expect you to thrive in ambiguous situations, demonstrating outstanding analytical abilities and comfort in collaborating with cross-functional teams and systems. The ideal candidate is a self-starter with the ability to learn and adapt quickly in our fast-paced environment. About the team We are the Prime Video Sports team. In September 2018 Prime Video launched its first full-scale live streaming experience to world-wide Prime customers with NFL Thursday Night Football. That was just the start. Now Amazon has exclusive broadcasting rights to major leagues like NFL Thursday Night Football, Tennis majors like Roland-Garros and English Premier League to list a few and are broadcasting live events across 30+ sports world-wide. Prime Video is expanding not just the breadth of live content that it offers, but the depth of the experience. This is a transformative opportunity, the chance to be at the vanguard of a program that will revolutionize Prime Video, and the live streaming experience of customers everywhere.
US, WA, Seattle
Within Amazon’s Corporate Financial Planning & Analysis team (FP&A), we enjoy a unique vantage point into everything happening within Amazon. This is exciting opportunity for scientist to join our Financial Transformation team, where you will get to harness the power of statistical and machine learning models to revolutionize finance forecasting that spans entire company and business units. As a key player in this innovative group, you'll be at the forefront of applying state-of-the-art scientific approaches and emerging technologies to solve complex financial challenges. Your deep domain expertise will be instrumental in identifying and addressing customer needs, often venturing into uncharted territories where textbook solutions don't exist. You'll have the chance to author Finance AI articles, showcasing your novel work to both internal and external audiences. Key job responsibilities Your role will involve developing production-ready science models/components that directly impact large-scale systems and services, making critical decisions on implementation complexity and technology adoption. You'll be a driving force in MLOps, optimizing compute and inference usage and enhancing system performance. Beyond technical prowess, you'll contribute to financial strategic planning, mentor team members, and represent our tech. organization in the broader scientific community. This role offers a perfect blend of hands-on development, strategic thinking, and thought leadership in the exciting intersection of finance and advanced analytics. Ready to shape the future of financial forecasting? Join us and let's transform the industry together!
CA, QC, Montreal
Join the next revolution in robotics at Amazon's Frontier AI & Robotics team, where you'll work alongside world-renowned AI pioneers to push the boundaries of what's possible in robotic intelligence. As an Applied Scientist, you'll be at the forefront of developing breakthrough foundation models that enable robots to perceive, understand, and interact with the world in unprecedented ways. You'll drive independent research initiatives in areas such as perception, manipulation, scene understanding, sim2real transfer, multi-modal foundation models, and multi-task learning, designing novel algorithms that bridge the gap between state-of-the-art research and real-world deployment at Amazon scale. In this role, you'll balance innovative technical exploration with practical implementation, collaborating with platform teams to ensure your models and algorithms perform robustly in dynamic real-world environments. You'll have access to Amazon's vast computational resources, enabling you to tackle ambitious problems in areas like very large multi-modal robotic foundation models and efficient, promptable model architectures that can scale across diverse robotic applications. Key job responsibilities - Design and implement novel deep learning architectures that push the boundaries of what robots can understand and accomplish - Drive independent research initiatives in robotics foundation models, focusing on breakthrough approaches in perception, and manipulation, for example open-vocabulary panoptic scene understanding, scaling up multi-modal LLMs, sim2real/real2sim techniques, end-to-end vision-language-action models, efficient model inference, video tokenization - Lead technical projects from conceptualization through deployment, ensuring robust performance in production environments - Collaborate with platform teams to optimize and scale models for real-world applications - Contribute to the team's technical strategy and help shape our approach to next-generation robotics challenges A day in the life - Design and implement novel foundation model architectures, leveraging our extensive compute infrastructure to train and evaluate at scale - Collaborate with our world-class research team to solve complex technical challenges - Lead technical initiatives from conception to deployment, working closely with robotics engineers to integrate your solutions into production systems - Participate in technical discussions and brainstorming sessions with team leaders and fellow scientists - Leverage our massive compute cluster and extensive robotics infrastructure to rapidly prototype and validate new ideas - Transform theoretical insights into practical solutions that can handle the complexities of real-world robotics applications About the team At Frontier AI & Robotics, we're not just advancing robotics – we're reimagining it from the ground up. Our team is building the future of intelligent robotics through ground breaking foundation models and end-to-end learned systems. We tackle some of the most challenging problems in AI and robotics, from developing sophisticated perception systems to creating adaptive manipulation strategies that work in complex, real-world scenarios. What sets us apart is our unique combination of ambitious research vision and practical impact. We leverage Amazon's massive computational infrastructure and rich real-world datasets to train and deploy state-of-the-art foundation models. Our work spans the full spectrum of robotics intelligence – from multimodal perception using images, videos, and sensor data, to sophisticated manipulation strategies that can handle diverse real-world scenarios. We're building systems that don't just work in the lab, but scale to meet the demands of Amazon's global operations. Join us if you're excited about pushing the boundaries of what's possible in robotics, working with world-class researchers, and seeing your innovations deployed at unprecedented scale.
CA, QC, Montreal
Join the next revolution in robotics at Amazon's Frontier AI & Robotics team, where you'll work alongside world-renowned AI pioneers to push the boundaries of what's possible in robotic intelligence. As an Applied Scientist, you'll be at the forefront of developing breakthrough foundation models that enable robots to perceive, understand, and interact with the world in unprecedented ways. You'll drive independent research initiatives in areas such as perception, manipulation, scene understanding, sim2real transfer, multi-modal foundation models, and multi-task learning, designing novel algorithms that bridge the gap between state-of-the-art research and real-world deployment at Amazon scale. In this role, you'll balance innovative technical exploration with practical implementation, collaborating with platform teams to ensure your models and algorithms perform robustly in dynamic real-world environments. You'll have access to Amazon's vast computational resources, enabling you to tackle ambitious problems in areas like very large multi-modal robotic foundation models and efficient, promptable model architectures that can scale across diverse robotic applications. Key job responsibilities - Design and implement novel deep learning architectures that push the boundaries of what robots can understand and accomplish - Drive independent research initiatives in robotics foundation models, focusing on breakthrough approaches in perception, and manipulation, for example open-vocabulary panoptic scene understanding, scaling up multi-modal LLMs, sim2real/real2sim techniques, end-to-end vision-language-action models, efficient model inference, video tokenization - Lead technical projects from conceptualization through deployment, ensuring robust performance in production environments - Collaborate with platform teams to optimize and scale models for real-world applications - Contribute to the team's technical strategy and help shape our approach to next-generation robotics challenges A day in the life - Design and implement novel foundation model architectures, leveraging our extensive compute infrastructure to train and evaluate at scale - Collaborate with our world-class research team to solve complex technical challenges - Lead technical initiatives from conception to deployment, working closely with robotics engineers to integrate your solutions into production systems - Participate in technical discussions and brainstorming sessions with team leaders and fellow scientists - Leverage our massive compute cluster and extensive robotics infrastructure to rapidly prototype and validate new ideas - Transform theoretical insights into practical solutions that can handle the complexities of real-world robotics applications About the team At Frontier AI & Robotics, we're not just advancing robotics – we're reimagining it from the ground up. Our team is building the future of intelligent robotics through ground breaking foundation models and end-to-end learned systems. We tackle some of the most challenging problems in AI and robotics, from developing sophisticated perception systems to creating adaptive manipulation strategies that work in complex, real-world scenarios. What sets us apart is our unique combination of ambitious research vision and practical impact. We leverage Amazon's massive computational infrastructure and rich real-world datasets to train and deploy state-of-the-art foundation models. Our work spans the full spectrum of robotics intelligence – from multimodal perception using images, videos, and sensor data, to sophisticated manipulation strategies that can handle diverse real-world scenarios. We're building systems that don't just work in the lab, but scale to meet the demands of Amazon's global operations. Join us if you're excited about pushing the boundaries of what's possible in robotics, working with world-class researchers, and seeing your innovations deployed at unprecedented scale.
US, WA, Seattle
The Sponsored Products and Brands (SPB) team at Amazon Ads is transforming advertising through generative AI technologies. We help millions of customers discover products and engage with brands across Amazon.com and beyond. Our team combines human creativity with artificial intelligence to reinvent the entire advertising lifecycle—from ad creation and optimization to performance analysis and customer insights. We develop responsible AI technologies that balance advertiser needs, enhance shopping experiences, and strengthen the marketplace. Our team values innovation and tackles complex challenges that push the boundaries of what's possible with AI. Join us in shaping the future of advertising. Key job responsibilities This role will redesign how ads create personalized, relevant shopping experiences with customer value at the forefront. Key responsibilities include: - Design and develop solutions using GenAI, deep learning, multi-objective optimization and/or reinforcement learning to transform ad retrieval, auctions, whole-page relevance, and shopping experiences. - Partner with scientists, engineers, and product managers to build scalable, production-ready science solutions. - Apply industry advances in GenAI, Large Language Models (LLMs), and related fields to create innovative prototypes and concepts. - Improve the team's scientific and technical capabilities by implementing algorithms, methodologies, and infrastructure that enable rapid experimentation and scaling. - Mentor junior scientists and engineers to build a high-performing, collaborative team. A day in the life As an Applied Scientist on the Sponsored Products and Brands Off-Search team, you will contribute to the development in Generative AI (GenAI) and Large Language Models (LLMs) to revolutionize our advertising flow, backend optimization, and frontend shopping experiences. This is a rare opportunity to redefine how ads are retrieved, allocated, and/or experienced—elevating them into personalized, contextually aware, and inspiring components of the customer journey. You will have the opportunity to fundamentally transform areas such as ad retrieval, ad allocation, whole-page relevance, and differentiated recommendations through the lens of GenAI. By building novel generative models grounded in both Amazon’s rich data and the world’s collective knowledge, your work will shape how customers engage with ads, discover products, and make purchasing decisions. If you are passionate about applying frontier AI to real-world problems with massive scale and impact, this is your opportunity to define the next chapter of advertising science. About the team The Off-Search team within Sponsored Products and Brands (SPB) is focused on building delightful ad experiences across various surfaces beyond Search on Amazon—such as product detail pages, the homepage, and store-in-store pages—to drive monetization. Our vision is to deliver highly personalized, context-aware advertising that adapts to individual shopper preferences, scales across diverse page types, remains relevant to seasonal and event-driven moments, and integrates seamlessly with organic recommendations such as new arrivals, basket-building content, and fast-delivery options. To execute this vision, we work in close partnership with Amazon Stores stakeholders to lead the expansion and growth of advertising across Amazon-owned and -operated pages beyond Search. We operate full stack—from backend ads-retail edge services, ads retrieval, and ad auctions to shopper-facing experiences—all designed to deliver meaningful value.
US, CA, Santa Clara
The AWS Neuron Science Team is looking for talented scientists to enhance our software stack, accelerating customer adoption of Trainium and Inferentia accelerators. In this role, you will work directly with external and internal customers to identify key adoption barriers and optimization opportunities. You'll collaborate closely with our engineering teams to implement innovative solutions and engage with academic and research communities to advance state-of-the-art ML systems. As part of a strategic growth area for AWS, you'll work alongside distinguished engineers and scientists in an exciting and impactful environment. We actively work on these areas: - AI for Systems: Developing and applying ML/RL approaches for kernel/code generation and optimization - Machine Learning Compiler: Creating advanced compiler techniques for ML workloads - System Robustness: Building tools for accuracy and reliability validation - Efficient Kernel Development: Designing high-performance kernels optimized for our ML accelerator architectures A day in the life AWS Utility Computing (UC) provides product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services. Additionally, this role may involve exposure to and experience with Amazon's growing suite of generative AI services and other cloud computing offerings across the AWS portfolio. About the team AWS Neuron is the software of Trainium and Inferentia, the AWS Machine Learning chips. Inferentia delivers best-in-class ML inference performance at the lowest cost in the cloud to our AWS customers. Trainium is designed to deliver the best-in-class ML training performance at the lowest training cost in the cloud, and it’s all being enabled by AWS Neuron. Neuron is a Software that include ML compiler and native integration into popular ML frameworks. Our products are being used at scale with external customers like Anthropic and Databricks as well as internal customers like Alexa, Amazon Bedrocks, Amazon Robotics, Amazon Ads, Amazon Rekognition and many more. About the team Diverse Experiences AWS values diverse experiences. Even if you do not meet all of the preferred qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Why AWS? Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. Inclusive Team Culture AWS values curiosity and connection. Our employee-led and company-sponsored affinity groups promote inclusion and empower our people to take pride in what makes us unique. Our inclusion events foster stronger, more collaborative teams. Our continual innovation is fueled by the bold ideas, fresh perspectives, and passionate voices our teams bring to everything we do. Mentorship & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve.
US, CA, Sunnyvale
The Artificial General Intelligence (AGI) team is looking for a highly skilled and experienced Applied Scientist, to support the development and implementation of state-of-the-art algorithms and models for supervised fine-tuning and reinforcement learning through human feedback and and complex reasoning; with a focus across text, image, and video modalities. As an Applied Scientist, you will play a critical role in supporting the development of Generative AI (Gen AI) technologies that can handle Amazon-scale use cases and have a significant impact on our customers' experiences. Key job responsibilities - Collaborate with cross-functional teams of engineers, product managers, and scientists to identify and solve complex problems in Gen AI - Design and execute experiments to evaluate the performance of different algorithms and models, and iterate quickly to improve results - Think big about the arc of development of Gen AI over a multi-year horizon, and identify new opportunities to apply these technologies to solve real-world problems - Communicate results and insights to both technical and non-technical audiences, including through presentations and written reports
US, WA, Seattle
Application deadline: Applications will be accepted on an ongoing basis Amazon Ads is re-imagining advertising through cutting-edge generative artificial intelligence (AI) technologies. We combine human creativity with AI to transform every aspect of the advertising life cycle—from ad creation and optimization to performance analysis and customer insights. Our solutions help advertisers grow their brands while enabling millions of customers to discover and purchase products through delightful experiences. We deliver billions of ad impressions and millions of clicks daily, breaking fresh ground in product and technical innovations. If you're energized by solving complex challenges and pushing the boundaries of what's possible with AI, join us in shaping the future of advertising. Why you’ll love this role: This role offers unprecedented breadth in ML applications and access to extensive computational resources and rich datasets that will enable you to build truly innovative solutions. You'll work on projects that span the full advertising life cycle, from sophisticated ranking algorithms and real-time bidding systems to creative optimization and measurement solutions. You'll work alongside talented engineers, scientists, and product leaders in a culture that encourages innovation, experimentation, and bias for action, and you’ll directly influence business strategy through your scientific expertise. What makes this role unique is the combination of scientific rigor with real-world impact. You’ll re-imagine advertising through the lens of advanced ML while solving problems that balance the needs of advertisers, customers, and Amazon's business objectives. Your impact and career growth: Amazon Ads is investing heavily in AI and ML capabilities, creating opportunities for scientists to innovate and make their marks. Your work will directly impact millions. Whether you see yourself growing as an individual contributor or moving into people management, there are clear paths for career progression. This role combines scientific leadership, organizational ability, technical strength, and business understanding. You'll have opportunities to lead technical initiatives, mentor other scientists, and collaborate with senior leadership to shape the future of advertising technology. Most importantly, you'll be part of a community that values scientific excellence and encourages you to push the boundaries of what's possible with AI. Watch two Applied Scientists at Amazon Ads talk about their work: https://www.youtube.com/watch?v=vvHsURsIPEA Learn more about Amazon Ads: https://advertising.amazon.com/ Key job responsibilities As a Senior Applied Scientist in Amazon Ads, you will: - Research and implement cutting-edge ML approaches, including applications of generative AI and large language models - Develop and deploy innovative ML solutions spanning multiple disciplines – from ranking and personalization to natural language processing, computer vision, recommender systems, and large language models - Drive end-to-end projects that tackle ambiguous problems at massive scale, often working with petabytes of data - Build and optimize models that balance multiple stakeholder needs - helping customers discover relevant products while enabling advertisers to achieve their goals efficiently - Build ML models, perform proof-of-concept, experiment, optimize, and deploy your models into production, working closely with cross-functional teams including engineers, product managers, and other scientists - Design and run A/B experiments to validate hypotheses, gather insights from large-scale data analysis, and measure business impact - Develop scalable, efficient processes for model development, validation, and deployment that optimize traffic monetization while maintaining customer experience