Lessons learned from 10 years of DynamoDB

Prioritizing predictability over efficiency, adapting data partitioning to traffic, and continuous verification are a few of the principles that help ensure stability, availability, and efficiency.

Amazon DynamoDB is one of the most popular NoSQL database offerings on the Internet, designed for simplicity, predictability, scalability, and reliability. To celebrate DynamoDB’s 10th anniversary, the DynamoDB team wrote a paper describing lessons we’d learned in the course of expanding a fully managed cloud-based database system to hundreds of thousands of customers. The paper was presented at this year’s USENIX ATC conference.

The paper captures the following lessons that we have learned over the years:

  • Designing systems for predictability over absolute efficiency improves system stability. While components such as caches can improve performance, they should not introduce bimodality, in which the system has two radically different ways of responding to similar requests (e.g., one for cache misses and one for cache hits). Consistent behaviors ensure that the system is always provisioned to handle the unexpected. 
  • Adapting to customers’ traffic patterns to redistribute data improves customer experience. 
  • Continuously verifying idle data is a reliable way to protect against both hardware failures and software bugs in order to meet high durability goals. 
  • Maintaining high availability as a system evolves requires careful operational discipline and tooling. Mechanisms such as formal proofs of complex algorithms, game days (chaos and load tests), upgrade/downgrade tests, and deployment safety provide the freedom to adjust and experiment with the code without the fear of compromising correctness. 
Related content
Amazon DynamoDB was introduced 10 years ago today; one of its key contributors reflects on its origins, and discusses the 'never-ending journey' to make DynamoDB more secure, more available and more performant.

Before we dig deeper into these topics, a little terminology. A DynamoDB table is a collection of items (e.g., products), and each item is a collection of attributes (e.g., name, price, category, etc.). Each item is uniquely identified by its primary key. In DynamoDB, tables are typically partitioned, or divided into smaller sub-tables, which are assigned to nodes. A node is a set of dedicated computational resources — a virtual machine — running on a single server in a datacenter.

DynamoDB stores three copies of each partition, in different availability zones. This makes the partition highly available and durable because the availability zones’ storage resources share nothing and are substantially independent. For instance, we wouldn’t assign a partition and one of its copies to nodes that share a power supply, because a power outage would take both of them offline. The three copies of the same partition are known as a replication group, and there is a leader for the group that is responsible for replicating all the customer mutations and serving strongly consistent reads.

DynamoDB architecture.png
The DynamoDB architecture, including a request router, the partition metadata system, and storage nodes in different availability zones (AZs).

Those definitions in hand, let’s turn to our lessons learned.

Predictability over absolute efficiency

DynamoDB employs a lot of metadata caches in order to reduce latency. One of those caches stores the routing metadata for data requests. This cache is deployed on a fleet of thousands of request routers, DynamoDB’s front-end service.

In the original implementation, when the request router received the first request for a table, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changed, the cache hit rate was approximately 99.75%.

Related content
How Alexa scales machine learning models to millions of customers.

This was an amazing hit rate. However, on the flip side, the fallback mechanism for this cache was to hit the metadata table directly. When the cache becomes ineffective, the metadata table needs to instantaneously scale from handling 0.25% of requests to 100%. The sudden increase in traffic can cause the metadata table to fail, causing cascading failure in other parts of the system. To mitigate against such failures, we redesigned our caches to behave predictably.

First, we built an in-memory datastore called MemDS, which significantly reduced request routers’ and other metadata clients’ reliance on local caches. MemDS stores all the routing metadata in a highly compressed manner and replicates it across a fleet of servers. MemDS scales horizontally to handle all incoming requests to DynamoDB.

Second, we deployed a new local cache that avoids the bimodality of the original cache. All requests, even if satisfied by the local cache, are asynchronously sent to the MemDS. This ensures that the MemDS fleet is always serving a constant volume of traffic, regardless of cache hit or miss. The regular exercise of the fallback code helps prevent surprises during fallback.

DDB-MemDS.png
DynamoDB architecture with MemDS.

Unlike conventional local caches, MemDS sees traffic that is proportional to the customer traffic seen by the service; thus, during cache failures, it does not see a sudden amplification of traffic. Doing constant work removed the need for complex logic to handle edge cases around cache misses and reduced the reliance on local caches, improving system stability.

Reshaping partitioning based on traffic

Partitions offer a way to dynamically scale both the capacity and performance of tables. In the original DynamoDB release, customers explicitly specified the throughput that a table required in terms of read capacity units (RCUs) and write capacity units (WCUs). The original system assigned partitions to nodes based on both available space and computational capacity.

Related content
Optimizing placement of configuration data ensures that it’s available and consistent during “network partitions”.

As the demands on a table changed (because it grew in size or because the load increased), partitions could be further split to allow the table to scale elastically. Partition abstraction proved really valuable and continues to be central to the design of DynamoDB.

However, the early version of DynamoDB assigned both space and capacity to individual partitions on the basis of size, evenly distributing computational resources across table entries. This led to challenges of “hot partitions” and throughput dilution.

Hot partitions happened because customer workloads were not uniformly distributed and kept hitting a subset of items. Throughput dilution happened when partitions that had been split to handle increased load ended up with so few keys that they could quickly max out their meager allocated capacity.

Our initial response to these challenges was to add bursting and adaptive capacity (along with other features such as split for consumption) to DynamoDB. This line of work also led to the launch of on-demand tables.

Bursting is a way to absorb temporal spikes in workloads at a partition level. It’s based on the observation that not all partitions hosted by a storage node use their allocated throughput simultaneously.

Related content
Amazon researchers describe new method for distributing database tables across servers.

The idea is to let applications tap into unused capacity at a partition level on a best-effort basis to absorb short-lived spikes. DynamoDB still maintains workload isolation by ensuring that a partition can burst only if there is unused throughput at the node level.

DynamoDB also launched adaptive capacity to handle long-lived spikes that cannot be absorbed by the burst capacity. Adaptive capacity monitors traffic patterns and repartitions tables so that heavily accessed items reside on different nodes.

Both bursting and adaptive capacity had limitations, however. Bursting was helpful only for short-lived spikes in traffic, and it was dependent on nodes’ having enough throughput to support it. Adaptive capacity was reactive and kicked in only after transmission rates had been throttled down to avoid overloads.

To address these limitations, the DynamoDB team replaced adaptive capacity with global admission control (GAC). GAC builds on the idea of token buckets, in which bandwidth is allocated to network nodes as tokens, and the nodes “cash in” tokens in order to transmit data. Each request router maintains a local token bucket and communicates with GAC to replenish tokens at regular intervals (on the order of every few seconds). For an extra layer of defense, DynamoDB also uses token buckets at the partition level.

Continuous verification 

To provide durability and crash recovery, DynamoDB uses write-ahead logs, which record data writes before they occur. In the event of a crash, DynamoDB can use the write-ahead logs to reconstruct lost data writes, bringing partitions up to date.

Write-ahead logs are stored in all three replicas of a partition. For higher durability, the write-ahead logs are periodically archived to S3, an object store that is designed for more than 99.99% (in fact, 11 nines) durability. Each replica contains the most recent write-ahead logs, which are usually waiting to be archived. The unarchived logs are typically a few hundred megabytes in size.

Storage replica vs. log replica.png
Healing a storage replica by copying the B-tree can take several minutes, while adding a log replica, which takes only a few seconds, ensures that there is no impact on durability.

DynamoDB continuously verifies data at rest. Our goal is to detect any silent data errors or “bit rot” — bit errors caused by degradation of the storage medium. An example of continuous verification is the scrub process.

The scrub process verifies two things: that all three copies in a replication group have the same data and that the live replicas match a reference replica built offline using the archived write-ahead-log entries.

The verification is done by computing the checksum of the live replica and matching that with a snapshot of the reference replica. A similar technique is used to verify replicas of global tables. Over the years, we have learned that continuous verification of data at rest is the most reliable method of protecting against hardware failures, silent data corruption, and even software bugs.

Availability

DynamoDB regularly tests its resilience to node, rack, and availability zone (AZ) failures. For example, to test the availability and durability of the overall service, DynamoDB performs power-off tests. Using realistic simulated traffic, a job scheduler powers off random nodes. At the end of all the power-off tests, the test tools verify that the data stored in the database is logically valid and not corrupted.

Related content
Amazon Athena reduces query execution time by 14% by eliminating redundant operations.

The first point about availability is that it needs to be measurable. DynamoDB is designed for 99.999% availability for global tables and 99.99% availability for regional tables. To ensure that these goals are being met, DynamoDB continuously monitors availability at the service and table levels. The tracked availability data is used to estimate customer-perceived availability trends and trigger alarms if the number of errors that customers see crosses a certain threshold.

These alarms are called customer-facing alarms (CFAs). The goal of these alarms is to report any availability-related problems and proactively mitigate them either automatically or through operator intervention. The key point to note here is that availability is measured not only on the server side but on the client side.

We also use two sets of clients to measure the user-perceived availability. The first set of clients is internal Amazon services using DynamoDB as the data store. These services share the availability metrics for DynamoDB API calls as observed by their software.

The second set of clients is our DynamoDB canary applications. These applications are run from every AZ in the region, and they talk to DynamoDB through every public endpoint. Real application traffic allows us to reason about DynamoDB availability and latencies as seen by our customers. The canary applications offer a good representation of what our customers might be experiencing both long and short term.

The second point is that read and write availability need to be handled differently. A partition’s write availability depends on the health of its leader and of its write quorum, meaning two out of the three replicas from different AZs. A partition remains available as long as there are enough healthy replicas for a write quorum and a leader.

Related content
“Anytime query” approach adapts to the available resources.

In a large service, hardware failures such as memory and disk failures are common. When a node fails, all replication groups hosted on the node are down to two copies. The process of healing a storage replica can take several minutes because the repair process involves copying the B-tree — a data structure that maps partitions to storage locations — and write-ahead logs.

Upon detecting an unhealthy storage replica, the leader of a replication group adds a log replica to ensure there is no impact on durability. Adding a log replica takes only a few seconds, because the system has to copy only the most recent write-ahead logs from a healthy replica; reconstructing the more memory-intensive B-tree can wait. Quick healing of affected replication groups using log replicas thus ensures the high durability of the most recent writes. Adding a log replica is the fastest way to ensure that the write quorum of the group is always met. This minimizes disruption to write availability due to an unhealthy write quorum. The leader replica serves consistent reads.

Introducing log replicas was a big change to the system, but the Paxos consensus protocol, which is formally provable, gave us the confidence to safely tweak and experiment with the system to achieve higher availability. We have been able to run millions of Paxos groups in a region with log replicas. Eventually, consistent reads can be served by any of the replicas. In case a leader fails, other replicas detect its failure and elect a new leader to minimize disruptions to the availability of consistent reads.

Research areas

Related content

IN, KA, Bengaluru
RBS (Retail Business Services) Tech team works towards enhancing the customer experience (CX) and their trust in product data by providing technologies to find and fix Amazon CX defects at scale. Our platforms help in improving the CX in all phases of customer journey, including selection, discoverability & fulfilment, buying experience and post-buying experience (product quality and customer returns). The team also develops GenAI platforms for automation of Amazon Stores Operations. As a Sciences team in RBS Tech, we focus on foundational ML research and develop scalable state-of-the-art ML solutions to solve the problems covering customer experience (CX) and Selling partner experience (SPX). We work to solve problems related to multi-modal understanding (text and images), task automation through multi-modal LLM Agents, supervised and unsupervised techniques, multi-task learning, multi-label classification, aspect and topic extraction for Customer Anecdote Mining, image and text similarity and retrieval using NLP and Computer Vision for product groupings and identifying duplicate listings in product search results. Key job responsibilities As an Applied Scientist, you will be responsible to design and deploy scalable GenAI, NLP and Computer Vision solutions that will impact the content visible to millions of customer and solve key customer experience issues. You will develop novel LLM, deep learning and statistical techniques for task automation, text processing, image processing, pattern recognition, and anomaly detection problems. You will define the research and experiments strategy with an iterative execution approach to develop AI/ML models and progressively improve the results over time. You will partner with business and engineering teams to identify and solve large and significantly complex problems that require scientific innovation. You will independently file for patents and/or publish research work where opportunities arise. The RBS org deals with problems that are directly related to the selling partners and end customers and the ML team drives resolution to organization level problems. Therefore, the Applied Scientist role will impact the large product strategy, identifies new business opportunities and provides strategic direction which is very exciting.
IN, KA, Bengaluru
Selection Monitoring team is responsible for making the biggest catalog on the planet even bigger. In order to drive expansion of the Amazon catalog, we develop advanced ML/AI technologies to process billions of products and algorithmically find products not already sold on Amazon. We work with structured, semi-structured and Visually Rich Documents using deep learning, NLP and image processing. The role demands a high-performing and flexible candidate who can take responsibility for success of the system and drive solutions from research, prototype, design, coding and deployment. We are looking for Applied Scientists to tackle challenging problems in the areas of Information Extraction, Efficient crawling at internet scale, developing ML models for website comprehension and agents to take multi-step decisions. You should have depth and breadth of knowledge in text mining, information extraction from Visually Rich Documents, semi structured data (HTML) and advanced machine learning. You should also have programming and design skills to manipulate Semi-Structured and unstructured data and systems that work at internet scale. You will encounter many challenges, including: - Scale (build models to handle billions of pages), - Accuracy (requirements for precision and recall) - Speed (generate predictions for millions of new or changed pages with low latency) - Diversity (models need to work across different languages, market places and data sources) You will help us to - Build a scalable system which can algorithmically extract information from world wide web. - Intelligently cluster web pages, segment and classify regions, extract relevant information and structure the data available on semi-structured web. - Build systems that will use existing Knowledge Base to perform open information extraction at scale from visually rich documents. Key job responsibilities - Use AI, NLP and advances in LLMs/SLMs and agentic systems to create scalable solutions for business problems. - Efficiently Crawl web, Automate extraction of relevant information from large amounts of Visually Rich Documents and optimize key processes. - Design, develop, evaluate and deploy, innovative and highly scalable ML models, esp. leveraging latest advances in RL-based fine tuning methods like DPO, GRPO etc. - Work closely with software engineering teams to drive real-time model implementations. - Establish scalable, efficient, automated processes for large scale model development, model validation and model maintenance. - Lead projects and mentor other scientists, engineers in the use of ML techniques. - Publish innovation in research forums.
US, CA, Santa Clara
We are seeking an Applied Scientist II to join Amazon Customer Service's Science team, where you will build AI-based automated customer service solutions using state-of-the-art techniques in retrieval-augmented generation (RAG), agentic AI, and post-training of large language models. You will work at the intersection of research and production, developing intelligent systems that directly impact millions of customers while collaborating with scientists, engineers, and product managers in a fast-paced, innovative environment. Key job responsibilities - Design, develop, and deploy information retrieval systems and RAG pipelines using embedding models, reranking algorithms, and generative models to improve customer service automation - Conduct post-training of large language models using techniques such as Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO) to optimize model performance for customer service tasks - Build and curate high-quality datasets for model training and evaluation, ensuring data quality and relevance for customer service applications - Design and implement comprehensive evaluation frameworks, including data curation, metrics development, and methods such as LLM-as-a-judge to assess model performance - Develop AI agents for automated customer service, understanding their advantages and common pitfalls, and implementing solutions that balance automation with customer satisfaction - Independently perform research and development with minimal guidance, staying current with the latest advances in machine learning and AI - Collaborate with cross-functional teams including engineering, product management, and operations to translate research into production systems - Publish findings and contribute to the broader scientific community through papers, patents, and open-source contributions - Monitor and improve deployed models based on real-world performance metrics and customer feedback A day in the life As an Applied Scientist II, you will start your day reviewing metrics from deployed models and identifying opportunities for improvement. You might spend your morning experimenting with new post-training techniques to improve model accuracy, then collaborate with engineers to integrate your latest model into production systems. You will participate in design reviews, share your findings with the team, and mentor junior scientists. You will balance research exploration with practical implementation, always keeping the customer experience at the forefront of your work. You will have the autonomy to drive your own research agenda while contributing to team goals and deliverables. About the team The Amazon Customer Service Science team is dedicated to revolutionizing customer support through advanced AI and machine learning. We are a diverse group of scientists and engineers working on some of the most challenging problems in natural language understanding and AI automation. Our team values innovation, collaboration, and a customer-obsessed mindset. We encourage experimentation, celebrate learning from failures, and are committed to maintaining Amazon's high bar for scientific rigor and operational excellence. You will have access to world-class computing resources, massive datasets, and the opportunity to work alongside some of the brightest minds in AI and machine learning.
US, MA, N.reading
Amazon is seeking exceptional talent to help develop the next generation of advanced robotics systems that will transform automation at Amazon's scale. We're building revolutionary robotic systems that combine cutting-edge AI, sophisticated control systems, and advanced mechanical design to create adaptable automation solutions capable of working safely alongside humans in dynamic environments. This is a unique opportunity to shape the future of robotics and automation at an unprecedented scale, working with world-class teams pushing the boundaries of what's possible in robotic dexterous manipulation, locomotion, and human-robot interaction. This role presents an opportunity to shape the future of robotics through innovative applications of deep learning and large language models. At Amazon we leverage advanced robotics, machine learning, and artificial intelligence to solve complex operational challenges at an unprecedented scale. Our fleet of robots operates across hundreds of facilities worldwide, working in sophisticated coordination to fulfill our mission of customer excellence. The ideal candidate will contribute to research that bridges the gap between theoretical advancement and practical implementation in robotics. You will be part of a team that's revolutionizing how robots learn, adapt, and interact with their environment. Join us in building the next generation of intelligent robotics systems that will transform the future of automation and human-robot collaboration. Key job responsibilities - Design and implement whole body control methods for balance, locomotion, and dexterous manipulation - Utilize state-of-the-art in methods in learned and model-based control - Create robust and safe behaviors for different terrains and tasks - Implement real-time controllers with stability guarantees - Collaborate effectively with multi-disciplinary teams to co-design hardware and algorithms for loco-manipulation - Mentor junior engineer and scientists
US, CA, Sunnyvale
Amazon's AGI Information is seeking an exceptional Applied Scientist to drive science advancements in the Amazon Knowledge Graph team (AKG). AKG is re-inventing knowledge graphs for the LLM era, optimizing for LLM grounding. At the same time, AKG is innovating to utilize LLMs in the knowledge graph construction pipelines to overcome obstacles that traditional technologies could not overcome. As a member of the AKG IR team, you will have the opportunity to work on interesting problems with immediate customer impact. The team is addressing challenges in web-scale knowledge mining, fact verification, multilingual information retrieval, and agent memory operating over Graphs. You will also have the opportunity to work with scientists working on the other challenges, and with the engineering teams that deliver the science advancements to our customers. A successful candidate has a strong machine learning and agent background, is a master of state-of-the-art techniques, has a strong publication record, has a desire to push the envelope in one or more of the above areas, and has a track record of delivering to customers. The ideal candidate enjoys operating in dynamic environments, is self-motivated to take on new challenges, and enjoys working with customers, stakeholders, and engineering teams to deliver big customer impact, shipping solutions via rapid experimentation and then iterating on user feedback and interactions. Key job responsibilities As an Applied Scientist, you will leverage your technical expertise and experience to demonstrate leadership in tackling large complex problems. You will collaborate with applied scientists and engineers to develop novel algorithms and modeling techniques to build the knowledge graph that delivers fresh factual knowledge to our customers, and that automates the knowledge graph construction pipelines to scale to many billions of facts. Your first responsibility will be to solve entity resolution to enable conflating facts from multiple sources into a single graph entity for each real world entity. You will develop generic solutions that work fo all classes of data in AKG (e.g., people, places, movies, etc.), that cope with sparse, noisy data, that scale to hundreds of millions of entities, and that can handle streaming data. You will define a roadmap to make progress incrementally and you will insist on scientific rigor, leading by example.
US, CA, Sunnyvale
Amazon is seeking exceptional talent to help develop the next generation of advanced robotics systems that will transform automation at Amazon's scale. We're building revolutionary robotic systems that combine innovative AI, sophisticated control systems, and advanced mechanical design to create adaptable automation solutions capable of working safely alongside humans in dynamic environments. This is a unique opportunity to shape the future of robotics and automation at unprecedented scale, working with world-class teams pushing the boundaries of what's possible in robotic manipulation, locomotion, and human-robot interaction. This role presents an opportunity to shape the future of robotics through innovative applications of deep learning and large language models. We leverage advanced robotics, machine learning, and artificial intelligence to solve complex operational challenges at unprecedented scale. Our fleet of robots operates across hundreds of facilities worldwide, working in sophisticated coordination to fulfill our mission of customer excellence. We are pioneering the development of robotics foundation models that: - Enable unprecedented generalization across diverse tasks - Integrate multi-modal learning capabilities (visual, tactile, linguistic) - Accelerate skill acquisition through demonstration learning - Enhance robotic perception and environmental understanding - Streamline development processes through reusable capabilities The ideal candidate will contribute to research that bridges the gap between theoretical advancement and practical implementation in robotics. You will be part of a team that's revolutionizing how robots learn, adapt, and interact with their environment. Join us in building the next generation of intelligent robotics systems that will transform the future of automation and human-robot collaboration. As a Senior Applied Scientist, you will develop and improve machine learning systems that help robots perceive, reason, and act in real-world environments. You will leverage state-of-the-art models (open source and internal research), evaluate them on representative tasks, and adapt/optimize them to meet robustness, safety, and performance needs. You will invent new algorithms where gaps exist. You’ll collaborate closely with research, controls, hardware, and product-facing teams, and your outputs will be used by downstream teams to further customize and deploy on specific robot embodiments. Key job responsibilities As a Senior Applied Scientist in the Foundations Model team, you will: - Leverage state-of-the-art models for targeted tasks, environments, and robot embodiments through fine-tuning and optimization. - Execute rapid, rigorous experimentation with reproducible results and solid engineering practices, closing the gap between sim and real environments. - Build and run capability evaluations/benchmarks to clearly profile performance, generalization, and failure modes. - Contribute to the data and training workflow: collection/curation, dataset quality/provenance, and repeatable training recipes. - Write clean, maintainable, well commented and documented code, contribute to training infrastructure, create tools for model evaluation and testing, and implement necessary APIs - Stay current with latest developments in foundation models and robotics, assist in literature reviews and research documentation, prepare technical reports and presentations, and contribute to research discussions and brainstorming sessions. - Work closely with senior scientists, engineers, and leaders across multiple teams, participate in knowledge sharing, support integration efforts with robotics hardware teams, and help document best practices and methodologies.
JP, 13, Tokyo
Amazon.com strives to be Earth's most customer-centric company where people can find and discover anything they want to buy. We hire the world's brightest minds and offer them a fast-paced, technologically sophisticated, and collaborative work environment. We are seeking a talented, customer-focused Economist to join our JCI Measurement and Optimization Science Team (JCI MOST). In this role, you will design experiments and build econometric models to measure intervention impacts and deliver data-driven insights that inform leadership decisions. Amazon Economists leverage our world-class data systems to build sophisticated econometric models, drawing from diverse methodological approaches including econometric theory, empirical IO, empirical health, labor, and public economics—all highly valued skillsets at Amazon. You will work in a fast-moving environment solving critical business problems as part of cross-functional teams embedded within business units or our central science and economics organization. This role requires exceptional Causal Inference expertise, strong cross-functional collaboration skills, business acumen, and an entrepreneurial spirit to drive measurable improvements in our pricing quality and business outcomes.
CN, 31, Shanghai
As a Sr. Applied Scientist, you will be responsible for bringing new product designs through to manufacturing. You will work closely with multi-disciplinary groups including Product Design, Industrial Design, Hardware Engineering, and Operations, to drive key aspects of engineering of consumer electronics products. In this role, you will use expertise in physical sciences, theoretical, numerical or empirical techniques to create scalable models representing response of physical systems or devices, including: * Applying domain scientific expertise towards developing innovative analysis and tests to study viability of new materials, designs or processes * Working closely with engineering teams to drive validation, optimization and implementation of hardware design or software algorithmic solutions to improve product and customer risks * Establishing scalable, efficient, automated processes to handle large scale design and data analysis * Conducting research into use conditions, materials and analysis techniques * Tracking general business activity including device health in field and providing clear, compelling reports to management on a regular basis * Developing, implementing guidelines to continually optimize design processes * Using simulation tools like LS-DYNA, and Abaqus for analysis and optimization of product design * Using of programming languages like Python and Matlab for analytical/statistical analyses and automation * Demonstrating strong understanding across multiple physical science domains, e.g. structural, thermal, fluid dynamics, and materials * Developing, analyzing and testing structural solutions from concept design, feature development, product architecture, through system validation * Supporting product development and optimization through application of analysis and testing of complex electronic assemblies using advanced simulation and experimentation tools and techniques
US, WA, Redmond
Amazon Leo is an initiative to launch a constellation of Low Earth Orbit satellites that will provide low-latency, high-speed broadband connectivity to unserved and underserved communities around the world. As a Communications Engineer in Modeling and Simulation, this role is primarily responsible for the developing and analyzing high level system resource allocation techniques for links to ensure optimal system and network performance from the capacity, coverage, power consumption, and availability point of view. Be part of the team defining the overall communication system and architecture of Amazon Leo’s broadband wireless network. This is a unique opportunity to innovate and define novel wireless technology with few legacy constraints. The team develops and designs the communication system of Leo and analyzes its overall system level performance, such as overall throughput, latency, system availability, packet loss, etc., as well as compatibility for both connectivity and interference mitigation with other space and terrestrial systems. This role in particular will be responsible for 1) evaluating complex multi-disciplinary trades involving RF bandwidth and network resource allocation to customers, 2) understanding and designing around hardware/software capabilities and constraints to support a dynamic network topology, 3) developing heuristic or solver-based algorithms to continuously improve and efficiently use available resources, 4) demonstrating their viability through detailed modeling and simulation, 5) working with operational teams to ensure they are implemented. This role will be part of a team developing the necessary simulation tools, with particular emphasis on coverage, capacity, latency and availability, considering the yearly growth of the satellite constellation and terrestrial network. Export Control Requirement: Due to applicable export control laws and regulations, candidates must be a U.S. citizen or national, U.S. permanent resident (i.e., current Green Card holder), or lawfully admitted into the U.S. as a refugee or granted asylum. Key job responsibilities • Work within a project team and take the responsibility for the Leo's overall communication system design and architecture • Extend existing code/tools and create simulation models representative of the target system, primarily in MATLAB • Design interconnection strategies between fronthaul and backhaul nodes. Analyze link availability, investigate link outages, and optimize algorithms to study and maximize network performance • Use RF and optical link budgets with orbital constellation dynamics to model time-varying system capacity • Conduct trade-off analysis to benefit customer experience and optimization of resources (costs, power, spectrum), including optimization of satellite constellation design and link selection • Work closely with implementation teams to simulate expected system level performance and provide quick feedback on potential improvements • Analyze and minimize potential self-interference or interference with other communication systems • Provide visualizations, document results, and communicate them across multi-disciplinary project teams to make key architectural decisions
US, WA, Seattle
Amazon is seeking exceptional talent to help develop the next generation of advanced robotics systems that will transform automation at Amazon's scale. We're building revolutionary robotic systems that combine cutting-edge AI, sophisticated control systems, and advanced electromechanical design to create adaptable automation solutions capable of working safely alongside humans in dynamic environments. This is a unique opportunity to shape the future of robotics and automation at an unprecedented scale, working with world-class teams pushing the boundaries of what's possible in robotic manipulation, locomotion, and human-robot interaction. Amazon is seeking a talented and motivated Principal Applied Scientist to develop tactile sensors and guide the sensing strategy for our gripper design. The ideal candidate will have extensive experience in sensor development, analysis, testing and integration. This candidate must have the ability to work well both independently and in a multidisciplinary team setting. Key job responsibilities - Author functional requirements, design verification plans and test procedures - Develop design concepts which meet the requirements - Work with engineering team members to implement the concepts in a product design - Support product releases to manufacturing and customer deployments - Work efficiently to support aggressive schedules