Lessons learned from 10 years of DynamoDB

Prioritizing predictability over efficiency, adapting data partitioning to traffic, and continuous verification are a few of the principles that help ensure stability, availability, and efficiency.

Amazon DynamoDB is one of the most popular NoSQL database offerings on the Internet, designed for simplicity, predictability, scalability, and reliability. To celebrate DynamoDB’s 10th anniversary, the DynamoDB team wrote a paper describing lessons we’d learned in the course of expanding a fully managed cloud-based database system to hundreds of thousands of customers. The paper was presented at this year’s USENIX ATC conference.

The paper captures the following lessons that we have learned over the years:

  • Designing systems for predictability over absolute efficiency improves system stability. While components such as caches can improve performance, they should not introduce bimodality, in which the system has two radically different ways of responding to similar requests (e.g., one for cache misses and one for cache hits). Consistent behaviors ensure that the system is always provisioned to handle the unexpected. 
  • Adapting to customers’ traffic patterns to redistribute data improves customer experience. 
  • Continuously verifying idle data is a reliable way to protect against both hardware failures and software bugs in order to meet high durability goals. 
  • Maintaining high availability as a system evolves requires careful operational discipline and tooling. Mechanisms such as formal proofs of complex algorithms, game days (chaos and load tests), upgrade/downgrade tests, and deployment safety provide the freedom to adjust and experiment with the code without the fear of compromising correctness. 

Before we dig deeper into these topics, a little terminology. A DynamoDB table is a collection of items (e.g., products), and each item is a collection of attributes (e.g., name, price, category, etc.). Each item is uniquely identified by its primary key. In DynamoDB, tables are typically partitioned, or divided into smaller sub-tables, which are assigned to nodes. A node is a set of dedicated computational resources — a virtual machine — running on a single server in a datacenter.

DynamoDB stores three copies of each partition, in different availability zones. This makes the partition highly available and durable because the availability zones’ storage resources share nothing and are substantially independent. For instance, we wouldn’t assign a partition and one of its copies to nodes that share a power supply, because a power outage would take both of them offline. The three copies of the same partition are known as a replication group, and there is a leader for the group that is responsible for replicating all the customer mutations and serving strongly consistent reads.

The DynamoDB architecture, including a request router, the partition metadata system, and storage nodes in different availability zones (AZs).

Those definitions in hand, let’s turn to our lessons learned.

Predictability over absolute efficiency

DynamoDB employs a lot of metadata caches in order to reduce latency. One of those caches stores the routing metadata for data requests. This cache is deployed on a fleet of thousands of request routers, DynamoDB’s front-end service.

In the original implementation, when the request router received the first request for a table, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changed, the cache hit rate was approximately 99.75%.

This was an excellent hit rate. The flip side was that the fallback mechanism for this cache was to hit the metadata table directly. If the cache became ineffective, the metadata table had to scale instantaneously from handling 0.25% of requests to handling 100%, a 400-fold jump. The sudden increase in traffic could cause the metadata table to fail, triggering cascading failures in other parts of the system. To guard against such failures, we redesigned our caches to behave predictably.

First, we built an in-memory datastore called MemDS, which significantly reduced request routers’ and other metadata clients’ reliance on local caches. MemDS stores all the routing metadata in a highly compressed manner and replicates it across a fleet of servers. MemDS scales horizontally to handle all incoming requests to DynamoDB.

Second, we deployed a new local cache that avoids the bimodality of the original cache. Every request, even one satisfied by the local cache, also triggers an asynchronous call to MemDS. This ensures that the MemDS fleet always serves a constant volume of traffic, regardless of whether the local cache hits or misses. Regularly exercising the fallback code also helps prevent surprises during fallback.

DynamoDB architecture with MemDS.

Unlike conventional local caches, MemDS sees traffic that is proportional to the customer traffic seen by the service; thus, during cache failures, it does not see a sudden amplification of traffic. Doing constant work removed the need for complex logic to handle edge cases around cache misses and reduced the reliance on local caches, improving system stability.
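
The constant-work idea can be illustrated with a minimal Python sketch. It is not DynamoDB's implementation: the class name ConstantWorkCache and the fake_memds function are hypothetical stand-ins, and a thread pool plays the role of the asynchronous call. The point it shows is that the backing store receives one call per request whether the local cache hits or misses.

```python
from concurrent.futures import ThreadPoolExecutor


class ConstantWorkCache:
    """Local cache that calls the backing store on every lookup, hit or miss."""

    def __init__(self, fetch, max_workers=4):
        self._fetch = fetch  # hypothetical stand-in for a MemDS lookup
        self._entries = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _refresh(self, key):
        value = self._fetch(key)
        self._entries[key] = value
        return value

    def get(self, key):
        if key in self._entries:
            # Hit: answer from local memory, but still refresh in the background
            # so the backing store sees the same load as it would on a miss.
            self._pool.submit(self._refresh, key)
            return self._entries[key]
        # Miss: wait for the backing store before answering.
        return self._refresh(key)

    def close(self):
        # Wait for all background refreshes to finish.
        self._pool.shutdown(wait=True)


calls = []


def fake_memds(key):
    # Hypothetical backing store that records every call it receives.
    calls.append(key)
    return f"route-for-{key}"


cache = ConstantWorkCache(fake_memds)
for _ in range(5):
    cache.get("table-A")
cache.close()
print(len(calls))  # 5: one miss plus four hits, each reaching the backing store
```

With a conventional cache the same five lookups would reach the backing store only once, so a cold cache would multiply its load; here the load is the same either way.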

Reshaping partitioning based on traffic

Partitions offer a way to dynamically scale both the capacity and performance of tables. In the original DynamoDB release, customers explicitly specified the throughput that a table required in terms of read capacity units (RCUs) and write capacity units (WCUs). The original system assigned partitions to nodes based on both available space and computational capacity.

As the demands on a table changed (because it grew in size or because the load increased), partitions could be further split to allow the table to scale elastically. The partition abstraction proved valuable and continues to be central to the design of DynamoDB.

However, the early version of DynamoDB assigned both space and capacity to individual partitions on the basis of size, evenly distributing computational resources across table entries. This led to challenges of “hot partitions” and throughput dilution.

Hot partitions happened because customer workloads were not uniformly distributed and kept hitting a subset of items. Throughput dilution happened when partitions that had been split to handle increased load ended up with so few keys that they could quickly max out their meager allocated capacity.
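
A small worked example makes throughput dilution concrete. The numbers below are made up, and the sketch assumes the simplest reading of the early design, in which a table's provisioned throughput is divided evenly across its partitions.

```python
def per_partition_throughput(table_throughput, partition_count):
    """Evenly divide a table's provisioned throughput across its partitions."""
    return table_throughput / partition_count


table_wcu = 1000            # hypothetical provisioned write capacity for the table
hot_partition_demand = 200  # hypothetical write load on the busiest partition

before_split = per_partition_throughput(table_wcu, 4)  # 250.0 per partition
after_split = per_partition_throughput(table_wcu, 8)   # 125.0 per partition

print(before_split, after_split)
print(hot_partition_demand <= before_split)  # True: the hot partition fits
print(hot_partition_demand <= after_split)   # False: it is now throttled
```

The table's total capacity never changed, yet after the split the same hot keys exceed their partition's share.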

Our initial response to these challenges was to add bursting and adaptive capacity (along with other features such as split for consumption) to DynamoDB. This line of work also led to the launch of on-demand tables.

Bursting is a way to absorb temporal spikes in workloads at a partition level. It’s based on the observation that not all partitions hosted by a storage node use their allocated throughput simultaneously.

The idea is to let applications tap into unused capacity at a partition level on a best-effort basis to absorb short-lived spikes. DynamoDB still maintains workload isolation by ensuring that a partition can burst only if there is unused throughput at the node level.
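
The bursting rule can be sketched as a simple admission decision. This is an illustration under stated assumptions, not the production logic: the function name admit, its parameters, and the capacity figures are all hypothetical, and real accounting would track throughput over time rather than as single numbers.

```python
def admit(units, partition_used, partition_allocated, node_used, node_capacity):
    """Decide how a request of `units` capacity is served (hypothetical model)."""
    within_allocation = partition_used + units <= partition_allocated
    node_has_headroom = node_used + units <= node_capacity
    if within_allocation:
        return "allocated"  # served from the partition's own throughput
    if node_has_headroom:
        return "burst"      # best effort: borrow unused node-level throughput
    return "throttled"      # no spare capacity, so isolation is preserved


print(admit(10, 50, 100, 200, 300))   # allocated
print(admit(10, 100, 100, 200, 300))  # burst
print(admit(10, 100, 100, 295, 300))  # throttled
```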

DynamoDB also launched adaptive capacity to handle long-lived spikes that cannot be absorbed by the burst capacity. Adaptive capacity monitors traffic patterns and repartitions tables so that heavily accessed items reside on different nodes.

Both bursting and adaptive capacity had limitations, however. Bursting was helpful only for short-lived spikes in traffic, and it was dependent on nodes’ having enough throughput to support it. Adaptive capacity was reactive and kicked in only after transmission rates had been throttled down to avoid overloads.

To address these limitations, the DynamoDB team replaced adaptive capacity with global admission control (GAC). GAC builds on the idea of token buckets, in which bandwidth is allocated to network nodes as tokens, and the nodes “cash in” tokens in order to transmit data. Each request router maintains a local token bucket and communicates with GAC to replenish tokens at regular intervals (on the order of every few seconds). For an extra layer of defense, DynamoDB also uses token buckets at the partition level.
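
The token-bucket arrangement described above can be sketched roughly as follows. The class names, the grant method, and the budget figures are assumptions made for illustration; the source does not describe GAC's interface, and a real system would replenish on a timer rather than on an explicit call.

```python
class TokenBucket:
    """Fixed-capacity bucket; each admitted request consumes tokens."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = 0

    def try_consume(self, amount=1):
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False

    def refill(self, amount):
        self.tokens = min(self.capacity, self.tokens + amount)


class GlobalAdmissionControl:
    """Hypothetical stand-in for GAC: holds a table-wide token budget."""

    def __init__(self, table_tokens):
        self.table_tokens = table_tokens

    def grant(self, requested):
        granted = min(requested, self.table_tokens)
        self.table_tokens -= granted
        return granted


class RequestRouter:
    """Admits requests from a local bucket that GAC tops up periodically."""

    def __init__(self, gac, local_capacity):
        self.gac = gac
        self.bucket = TokenBucket(local_capacity)

    def replenish(self):
        # Called at regular intervals in a real system (every few seconds).
        wanted = self.bucket.capacity - self.bucket.tokens
        self.bucket.refill(self.gac.grant(wanted))

    def handle(self, cost=1):
        return self.bucket.try_consume(cost)


gac = GlobalAdmissionControl(table_tokens=15)
router_a = RequestRouter(gac, local_capacity=10)
router_b = RequestRouter(gac, local_capacity=10)
router_a.replenish()  # receives 10 tokens
router_b.replenish()  # receives the remaining 5

admitted_a = sum(router_a.handle() for _ in range(12))
admitted_b = sum(router_b.handle() for _ in range(12))
print(admitted_a, admitted_b)  # 10 5
```

Because the budget is held globally, the two routers together admit no more than the table's 15 tokens, regardless of which router the traffic arrives at.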

Continuous verification 

To provide durability and crash recovery, DynamoDB uses write-ahead logs, which record data writes before they occur. In the event of a crash, DynamoDB can use the write-ahead logs to reconstruct lost data writes, bringing partitions up to date.
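
As a minimal sketch of the write-ahead idea, the toy Partition class below appends each write to a log before applying it, then rebuilds its state from the log after a simulated crash. The class and its methods are hypothetical, and a Python list stands in for durable log storage.

```python
import json


class Partition:
    """Toy partition: every write is logged before it is applied."""

    def __init__(self):
        self.log = []   # stand-in for the durable write-ahead log
        self.data = {}  # in-memory state, lost on a crash

    def put(self, key, value):
        # Record the write first, then apply it.
        self.log.append(json.dumps({"key": key, "value": value}))
        self.data[key] = value

    def crash(self):
        # Simulate losing everything that was not logged.
        self.data = {}

    def recover(self):
        # Replay the log in order to reconstruct the lost writes.
        for record in self.log:
            entry = json.loads(record)
            self.data[entry["key"]] = entry["value"]


partition = Partition()
partition.put("a", 1)
partition.put("b", 2)
partition.put("a", 3)
partition.crash()
partition.recover()
print(partition.data)  # {'a': 3, 'b': 2}
```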

Write-ahead logs are stored in all three replicas of a partition. For higher durability, the write-ahead logs are periodically archived to S3, an object store that is designed for 99.999999999% (11 nines) durability. Each replica contains the most recent write-ahead logs, which are usually waiting to be archived. The unarchived logs are typically a few hundred megabytes in size.

Healing a storage replica by copying the B-tree can take several minutes, while adding a log replica, which takes only a few seconds, ensures that there is no impact on durability.

DynamoDB continuously verifies data at rest. Our goal is to detect any silent data errors or “bit rot” — bit errors caused by degradation of the storage medium. An example of continuous verification is the scrub process.

The scrub process verifies two things: that all three copies in a replication group have the same data and that the live replicas match a reference replica built offline using the archived write-ahead-log entries.

The verification is done by computing the checksum of the live replica and matching that with a snapshot of the reference replica. A similar technique is used to verify replicas of global tables. Over the years, we have learned that continuous verification of data at rest is the most reliable method of protecting against hardware failures, silent data corruption, and even software bugs.
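
The checksum comparison can be sketched in a few lines of Python. This is an illustration only: the function names, the use of SHA-256, and the dictionary representation of a replica are assumptions, since the source does not say which checksum or data layout the scrub process uses.

```python
import hashlib


def replica_checksum(items):
    """Checksum a replica's contents in a key order that is stable across replicas."""
    digest = hashlib.sha256()
    for key in sorted(items):
        digest.update(repr((key, items[key])).encode("utf-8"))
    return digest.hexdigest()


def scrub(live_replicas, reference):
    """Return the names of live replicas whose checksum differs from the reference."""
    expected = replica_checksum(reference)
    return [
        name
        for name, items in live_replicas.items()
        if replica_checksum(items) != expected
    ]


# Hypothetical reference replica rebuilt offline from archived logs.
reference = {"item-1": "widget", "item-2": "gadget"}
live = {
    "az1": {"item-1": "widget", "item-2": "gadget"},
    "az2": {"item-2": "gadget", "item-1": "widget"},  # same data, different order
    "az3": {"item-1": "widget", "item-2": "gadgeu"},  # one corrupted value
}
print(scrub(live, reference))  # ['az3']
```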

Availability

DynamoDB regularly tests its resilience to node, rack, and availability zone (AZ) failures. For example, to test the availability and durability of the overall service, DynamoDB performs power-off tests. Using realistic simulated traffic, a job scheduler powers off random nodes. At the end of all the power-off tests, the test tools verify that the data stored in the database is logically valid and not corrupted.

The first point about availability is that it needs to be measurable. DynamoDB is designed for 99.999% availability for global tables and 99.99% availability for regional tables. To ensure that these goals are being met, DynamoDB continuously monitors availability at the service and table levels. The tracked availability data is used to estimate customer-perceived availability trends and trigger alarms if the number of errors that customers see crosses a certain threshold.

These alarms are called customer-facing alarms (CFAs). The goal of these alarms is to report any availability-related problems and proactively mitigate them either automatically or through operator intervention. The key point to note here is that availability is measured not only on the server side but on the client side.
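
A simple way to picture such an alarm is as a comparison between measured availability and a target. The sketch below is an assumption-laden illustration: the function names and request counts are hypothetical, and the 99.99% figure is used only because it is the stated design goal for regional tables, not because it is known to be the alarm threshold.

```python
def availability(successes, total):
    """Fraction of requests that succeeded; an idle window counts as fully available."""
    return 1.0 if total == 0 else successes / total


def cfa_should_fire(successes, total, target):
    """Hypothetical alarm rule: fire when measured availability drops below target."""
    return availability(successes, total) < target


REGIONAL_TARGET = 0.9999  # assumed threshold, taken from the regional design goal

# 50 errors in a million requests: 99.995% available, no alarm.
print(cfa_should_fire(999_950, 1_000_000, REGIONAL_TARGET))  # False
# 200 errors in a million requests: 99.98% available, alarm fires.
print(cfa_should_fire(999_800, 1_000_000, REGIONAL_TARGET))  # True
```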

We also use two sets of clients to measure the user-perceived availability. The first set of clients is internal Amazon services using DynamoDB as the data store. These services share the availability metrics for DynamoDB API calls as observed by their software.

The second set of clients is our DynamoDB canary applications. These applications are run from every AZ in the region, and they talk to DynamoDB through every public endpoint. Real application traffic allows us to reason about DynamoDB availability and latencies as seen by our customers. The canary applications offer a good representation of what our customers might be experiencing both long and short term.

The second point is that read and write availability need to be handled differently. A partition’s write availability depends on the health of its leader and of its write quorum, meaning two out of the three replicas from different AZs. A partition remains available as long as there are enough healthy replicas for a write quorum and a leader.
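
The write-availability condition can be written down as a short predicate. The function name, the replica names, and the dictionary of health flags are hypothetical; the sketch only encodes the rule stated above, that writes need a healthy leader and at least two healthy replicas out of three.

```python
def write_available(replica_health, leader, quorum=2):
    """True if the leader is healthy and enough replicas are healthy for a quorum."""
    healthy_count = sum(1 for healthy in replica_health.values() if healthy)
    return replica_health.get(leader, False) and healthy_count >= quorum


# One replica down: the leader and a quorum of two remain.
print(write_available({"az1": True, "az2": True, "az3": False}, leader="az1"))  # True
# Two replicas down: the leader is healthy but there is no quorum.
print(write_available({"az1": True, "az2": False, "az3": False}, leader="az1"))  # False
# Leader down: writes wait until a new leader is elected.
print(write_available({"az1": False, "az2": True, "az3": True}, leader="az1"))  # False
```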

In a large service, hardware failures such as memory and disk failures are common. When a node fails, all replication groups hosted on the node are down to two copies. The process of healing a storage replica can take several minutes because the repair process involves copying the B-tree — a data structure that maps partitions to storage locations — and write-ahead logs.

Upon detecting an unhealthy storage replica, the leader of a replication group adds a log replica to ensure there is no impact on durability. Adding a log replica takes only a few seconds, because the system has to copy only the most recent write-ahead logs from a healthy replica; reconstructing the more memory-intensive B-tree can wait. Quick healing of affected replication groups using log replicas thus ensures the high durability of the most recent writes. Adding a log replica is the fastest way to ensure that the write quorum of the group is always met. This minimizes disruption to write availability due to an unhealthy write quorum. The leader replica serves consistent reads.

Introducing log replicas was a big change to the system, but the Paxos consensus protocol, which is formally provable, gave us the confidence to safely tweak and experiment with the system to achieve higher availability. We have been able to run millions of Paxos groups in a region with log replicas. Eventually consistent reads can be served by any of the replicas. If a leader fails, the other replicas detect its failure and elect a new leader to minimize disruptions to the availability of consistent reads.
