On what are known as game days to teams inside Amazon, millions of virtual “customers” log on to the Amazon Store to search for items, browse product pages, load shopping carts, and check out as if they were real customers hunting for bargains during a sale such as Prime Day.
“It’s like a fire drill, a planned practice,” said Molly McElheny, a principal technical program manager in Central Reliability Engineering at Amazon. McElheny is responsible for helping to oversee those game days, which her organization runs at strategically chosen times in advance of big sales. Their goal? Make sure the Amazon Store and the many teams who help it run smoothly are ready ahead of time for potentially massive spikes in traffic.
That planned practice draws on forecasts of traffic and loads on Amazon services generated by CloudTune, a system that serves as a communications vehicle between the teams who plan events such as Prime Day and service teams that own infrastructure components and help run the Amazon Store.
CloudTune Forecasting originated in Amazon’s central economics team in 2015 as an improved methodology for capacity planning to handle major events such as Prime Day and Black Friday, explained Oleksiy Mnyshenko, a senior manager and economist at Amazon.
“These events have large peak-to-mean spreads,” he noted. “This means we need to proactively model the expected peak load and continuously assess our AWS capacity needs to support it.”
Demand forecasting
The CloudTune Forecasting system has expanded over the years. It began by generating peak computation-load forecasts one year in advance for the United States; it now produces a series of forecasts that range from per-week forecasts up to two years out to per-minute forecasts several months into the future. In addition, those forecasts, which are continually refreshed with new data, are now also generated for a wide variety of Amazon teams and regions around the world.
The need for region-specific forecasts may be obvious: a Mother’s Day sale forecast in the United States will not be relevant for a Diwali sale in India. But the many individual service teams that support the Amazon Store also rely on these forecasts.
One team may be responsible for the home page in a specific region, whereas another team is responsible for the shopping cart experience there, and yet another handles the checkout process. Each team experiences traffic differently and, as a result, consumes AWS computing power differently. Over time, teams at Amazon have collaborated to make CloudTune forecasts useful for each of those teams and their specific concerns.
“When you go to the Amazon Store, it feels very seamless as you go from searching for something to navigating to details about the product to then checking out, but in the background, there are thousands of software systems that together constitute what the experience is, and all of these systems and teams owning them need to be ready for these peak events,” Mnyshenko said.
In the early years, CloudTune forecasts were geared primarily to help service teams know how much computational capacity they needed for peak events. Since then, improvements have focused on differentiating across teams and regions. As the Amazon Store continued to grow, it became important to extend the demand outlook to an aggregate per-region forecast two years out, to help inform AWS decisions about computing power, networking, and data center planning.
“A data center is not built in a day,” noted Chunpeng Wang, a senior applied scientist at Amazon who works on the CloudTune forecast team. “Our forecasts are an important input into long-term capacity planning for AWS.”
What’s more, the Amazon Store is not alone in contending with peak events, noted Ben Mildenhall, a senior manager in cloud computing and auto scaling.
“Many AWS external customers have Black Friday and Cyber Monday events as well,” Mildenhall said. “So it’s important we optimize to give all of our customers a great experience.”
CloudTune forecasts provide inputs to AWS to help size infrastructure in a way that maximizes utilization efficiency, noted Mnyshenko. “The way CloudTune specifically helps here is continuously getting better at anticipating the mix of capacity we’re using by generation, by type, by location, so that we can have those conversations and provide this feedback to AWS,” he said.
Granular, flexible, and explainable
Like many demand-forecasting applications, CloudTune is a time-series forecasting system. What’s unique about it is the ability to predict demand at one-minute granularity, noted Mnyshenko. This level of granularity provides insight into patterns such as short-duration spikes in website traffic. Teams use the forecasts as inputs to determine their computing capacity not just for peak events like back-to-school sales but also for peak times during any given day, week, or month.
“Our comparative advantage is intra-day load predictions at one-minute granularity, allowing us to track actuals during peak events, highlighting these sharp edges where checkout spikes way beyond the natural peak for the period,” Mnyshenko said.
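To see why one-minute granularity matters, consider a minimal Python sketch. It is an illustration only, not CloudTune code: the function names and the traffic numbers are hypothetical. It compares an hourly average of a per-minute load series with the per-minute peak, showing how a short spike disappears when the data is averaged over a coarser interval.

```python
def downsample_mean(loads, bucket):
    """Average a per-minute series into buckets of `bucket` minutes."""
    loads = list(loads)
    means = []
    for start in range(0, len(loads), bucket):
        chunk = loads[start:start + bucket]
        means.append(sum(chunk) / len(chunk))
    return means


def peak_to_mean(loads):
    """Ratio of the highest per-minute load to the average load."""
    loads = list(loads)
    if not loads:
        raise ValueError("load series is empty")
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical hour of traffic: a steady 1,000 requests per minute,
# with a two-minute spike to 4,000 requests per minute.
minute_loads = [1000] * 60
minute_loads[30] = 4000
minute_loads[31] = 4000

hourly_view = downsample_mean(minute_loads, 60)  # a single hourly average
minute_peak = max(minute_loads)                  # what capacity must absorb
spread = peak_to_mean(minute_loads)
```

In this made-up hour, the hourly average sits only slightly above the baseline, while the per-minute peak is four times the baseline. Capacity sized from the hourly view would fall well short during the spike.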
In addition, CloudTune forecasts need to be flexible to accommodate changes in the day and duration of events, such as the evolution of Prime Day from a 24-hour event to a 48-hour event on different days each year.
At other times, CloudTune needs to make forecasts for special events such as the launch of popular gaming consoles, which may sell out in a matter of minutes.
“That can create a huge spike, and we have to predict the traffic spike and the order spike,” explained Ebrahim Nasrabadi, a senior manager of applied science who leads the CloudTune Forecasting science team.
The team responsible for CloudTune Forecasting has developed modular and configurable models to address these and other challenges, he noted.
For example, built-in functionality allows forecast teams to remove outliers from the data used in the forecast. Such outliers can stem from things like a spike in robot traffic, which can unexpectedly increase or decrease website traffic and order rates in ways that depart from predictable seasonal behavior and known calendar events. Because these interruptions do not occur regularly, the tool lets forecast teams exclude them.
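The article does not describe how CloudTune detects outliers, so the following Python sketch shows one common, generic approach rather than the team's actual method. It flags a point when it sits far from the median of its neighbors, measured in units of the median absolute deviation (MAD). The function name `mask_outliers` and its parameters are hypothetical.

```python
from statistics import median


def mask_outliers(values, window=5, threshold=3.0, min_scale=1.0):
    """Return a copy of `values` with suspected outliers replaced by None.

    Each point is compared with the median of its neighbors inside a
    centered window (the point itself is left out). It is flagged when
    the gap exceeds `threshold` times the neighbors' median absolute
    deviation. `min_scale` is a floor on that deviation so that a
    perfectly flat neighborhood does not flag tiny wiggles.
    """
    values = list(values)
    cleaned = list(values)
    half = window // 2
    for i, value in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbors = values[lo:i] + values[i + 1:hi]
        if len(neighbors) < 2:
            continue  # not enough context to judge this point
        center = median(neighbors)
        mad = median(abs(x - center) for x in neighbors)
        scale = max(mad, min_scale)
        if abs(value - center) > threshold * scale:
            cleaned[i] = None
    return cleaned


# Hypothetical per-minute request counts with one robot-driven spike.
traffic = [100, 102, 101, 99, 500, 100, 103, 99, 101]
cleaned_traffic = mask_outliers(traffic)
```

Here the spike of 500 is replaced by `None` and the ordinary fluctuations are kept. A forecasting pipeline could then drop or interpolate the masked points before fitting seasonal patterns.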
“Our models are simple and quite flexible to include additional variables and seasonality,” noted Nasrabadi. The models also take into account significant changes in a trend within a dataset, also known as a slope break.
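A slope break can be illustrated with a small Python sketch. This is an assumption-laden toy, not the CloudTune model: it fits two separate straight lines to a series, tries every allowed split point, and keeps the split with the smallest total squared error. The names `fit_line` and `find_slope_break` are hypothetical, and production models would typically handle trend changes jointly with seasonality.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def find_slope_break(ys, min_segment=3):
    """Find the split index that best separates two linear trends.

    Returns (break_index, slope_before, slope_after), where the first
    segment is ys[:break_index] and the second is ys[break_index:].
    """
    if min_segment < 2:
        raise ValueError("each segment needs at least two points")
    ys = list(ys)
    xs = list(range(len(ys)))
    best = None
    for k in range(min_segment, len(ys) - min_segment + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total_error = left[2] + right[2]
        if best is None or total_error < best[0]:
            best = (total_error, k, left[0], right[0])
    if best is None:
        raise ValueError("series is too short for two segments")
    _, break_index, slope_before, slope_after = best
    return break_index, slope_before, slope_after


# Hypothetical weekly demand: growth of 1 unit per week, then 2 per week.
demand = [0, 1, 2, 3, 4, 5, 7, 9, 11, 13, 15]
break_index, slope_before, slope_after = find_slope_break(demand)
```

For this made-up series, the search recovers a slope of about 1 before the break and about 2 after it, with the break located where the growth rate changes.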
The CloudTune team also emphasizes forecast models that are explainable.
“We have to be very crisp about what we are doing, very transparent about our expectations,” said Wang.
Hundreds of Amazon Store software teams use these forecasts to help determine their AWS capacity needs for peak events. The better these teams understand the forecasts, the more trust they have in them, noted Mnyshenko.
“We need to be able to explain what goes into the ingredients and, more importantly, what we are doing to reduce the spread in errors,” he said.
Continuous automation
Currently, service teams that are not yet using the automation enhancements translate the CloudTune forecasts into capacity orders for servers on Amazon Elastic Compute Cloud (Amazon EC2), using many different manual tools and processes, said Doug Smith, a senior technical program manager responsible for delivering improvements and features to the CloudTune toolset.
A key future direction for CloudTune is to continuously enhance these tools and automate as many manual processes as possible, Smith noted.
“We’re moving into automation so that we can take our CloudTune forecasts as inputs into these new products that we’re building to provide a hands-off experience,” he said.
And while the game days McElheny’s team runs in advance of these major events will continue apace, she has a vision for the future there as well. Today, she said, the forecasts enable simulations of high-level customer journeys. She’d like to get to a forecast that allows her team to simulate an event down to the types of products customers are ordering, and when and where they order them.
“This matters because different services get called depending on a lot of different factors. The closer we can simulate the real traffic the better, because we’re actually hitting services with the traffic they expect to see during the event,” McElheny said.
To get there, McElheny, Smith, and their colleagues work together to make sure the forecasts provide the best data for the most realistic simulations.
“The world we’re envisioning between our team and CloudTune is one where services teams don’t have to worry about scaling at all,” McElheny said. “CloudTune does it for them, and then we run a game day, and as we find issues during game day, CloudTune goes and places orders to scale things up for those customers.”