Jonathan and Kathy, an Orlando-based couple, were out visiting a neighbor a few days before Christmas 2020 when Jonathan got an unusual alert from their Amazon Echo. Using the Alexa App, they dropped in on their Echo device, which allowed them to hear what was happening in their home in real time.
"You could hear things crackling and popping, and the smoke alarm was going off like crazy," Jonathan told About Amazon. He then rushed home. "Upon rolling into the neighborhood, it was very smoky," Jonathan said. "I pulled up into the driveway, opened the garage, and smoke just started billowing out. I went into our house, and more black smoke poured out. It was so thick you couldn't see six inches in front of your face. The only thing I could think of was Cooper."
Jonathan managed to get Cooper, the couple's French bulldog, from his pen as smoke billowed from the house. The fire department was also able to extinguish the fire and minimize damage. However, neither outcome might have occurred without a Smart Alert mobile notification from Alexa.
The feature that alerted Jonathan is called Alexa Guard, a smart-home capability that relies on acoustic event detection (AED). AED is an emerging field that focuses on training models to detect and process sounds.
“The technology behind Alexa Guard was developed in an effort to augment the utility of Echo devices,” said Angel Calvo, director of software for the Alexa Smart Home team.
How Guard works
When set to away mode, Guard is trained to identify sounds related to home security and safety events, like a smoke alarm sounding, and to distinguish those sounds from something more prosaic, like a microwave beeping.
I am so glad this couple and their pet are OK - We built #AlexaGuard with this customer use case in mind, so learning that we helped with Guard to save this puppy from a fire, emphasis why I love my job... kudos to the Alexa Guard team! https://t.co/rX48tbCNko
— Angel Calvo (@ANGELCALVOS) January 6, 2021
The detection service relies on two models applied in a two-step system, one on the device, another in the cloud.
The first step runs a recurrent neural network (RNN), a type of deep learning model that learns from sequential or time-series data, on the Echo device itself. The on-device detector converts the audio input into features that feed into the RNN.
The device uses long short-term memory (LSTM) — a type of recurrent neural network that has shown a significant improvement in speech recognition and has high accuracy, “particularly when it’s applied to sequential data,” said Ming Sun, applied science manager for AED. This is particularly important for determining when a specific sound occurred.
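To make the idea of an LSTM carrying information across time steps concrete, here is a minimal sketch of a single scalar LSTM cell run over a short sequence. It is not Guard's model: the weights are hand-picked, and the `frame_energy` values stand in for whatever audio features the device actually computes, which the article does not specify.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    `w` maps a gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much old memory to keep
    o = gate("output", sigmoid)       # how much memory to expose
    g = gate("candidate", math.tanh)  # the new information itself
    c = f * c_prev + i * g            # updated cell (memory) state
    h = o * math.tanh(c)              # updated hidden state / output
    return h, c


def run_lstm(features, w):
    """Feed a feature sequence through the cell, one frame at a time."""
    h, c = 0.0, 0.0
    outputs = []
    for x in features:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


# Hypothetical, hand-picked weights (a real model would learn these).
WEIGHTS = {
    "input": (2.0, 0.0, 0.0),
    "forget": (0.0, 0.0, 2.0),
    "output": (0.0, 0.0, 2.0),
    "candidate": (2.0, 0.0, 0.0),
}

# Hypothetical per-frame feature: silence, a burst of sound, then silence.
frame_energy = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
outputs = run_lstm(frame_energy, WEIGHTS)
```

In this toy run the output keeps rising while the sound persists and stays above zero on the final silent frame, because the cell state remembers the preceding frames. That memory of earlier frames is what makes this family of models a natural fit for sequential audio.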
The Echo must also occasionally be able to distinguish between multiple sounds at once. Layered over the RNN is a multi-task learning framework that is trained to detect multiple events. These multiple output layers work like branches off the base neural network, each trained to recognize a different event in the captured audio.
This helps Echo devices detect multiple concurrent incidents (those which customers have selected for detection), such as footsteps and glass breaking.
Layering multiple output layers over a single neural network also makes the detection system in Echo devices very scalable; the device can be trained to recognize new sounds with minimal additions.
“Without this design, we would need to update the whole model every time we update one existing sound event or add a new sound event,” Sun said. “Now, we only have to update the output layer for a target existing event, or add a new output layer for a new event.”
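The design Sun describes, one shared base with a separate output layer per event, can be sketched as follows. This is an illustrative toy, not Amazon's implementation: `shared_base`, `EventHead`, the event names, and all weights are hypothetical, and the "base network" is reduced to two summary statistics.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def shared_base(features):
    """Stand-in for the shared network: maps features to an embedding."""
    mean = sum(features) / len(features)
    peak = max(features)
    return [mean, peak]


class EventHead:
    """One output layer (branch) that scores a single sound event."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def score(self, embedding):
        total = sum(w * e for w, e in zip(self.weights, embedding))
        return sigmoid(total + self.bias)


def detect(features, heads):
    """Run the shared base once, then every head on the same embedding."""
    embedding = shared_base(features)
    return {name: head.score(embedding) for name, head in heads.items()}


# Hypothetical heads with hand-picked weights.
heads = {
    "glass_break": EventHead([0.0, 4.0], -2.0),
    "footsteps": EventHead([4.0, 0.0], -2.0),
}
clip = [0.2, 0.9, 0.1]
scores_before = detect(clip, heads)

# Adding a new event only adds a head; the base and other heads are untouched.
heads["smoke_alarm"] = EventHead([3.0, 3.0], -3.0)
scores_after = detect(clip, heads)
```

Because each head reads the same embedding independently, the scores for existing events are unchanged after the new head is added, and several events can score highly on the same clip.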
When one of the sounds a customer has selected for detection triggers Guard on the Echo device, that audio is then sent to the cloud for the second verification step to confirm the on-device detection. The cloud runs a much more powerful recognition system to filter out false triggers that might be linked to ambient noise around the home, Sun said.
If the validation process confirms the sound is the one that the device is actively monitoring for, the customer gets a notification in their Alexa app along with an audio clip of the detection.
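The two-step flow described above can be sketched as a simple gate: a cheap on-device check decides whether audio goes to the cloud at all, and a second check decides whether the customer is notified. The function names, thresholds, and toy models below are assumptions made for illustration; the article does not describe the real scoring or thresholds.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    event: str
    device_score: float
    cloud_score: float


def two_stage_detect(
    clip: List[float],
    event: str,
    device_model: Callable[[List[float]], float],
    cloud_model: Callable[[List[float]], float],
    device_threshold: float = 0.5,  # hypothetical value
    cloud_threshold: float = 0.8,   # hypothetical value
) -> Optional[Detection]:
    """Return a Detection only if both stages agree, else None."""
    device_score = device_model(clip)
    if device_score < device_threshold:
        return None  # stage 1 rejects: the clip is never sent to the cloud
    cloud_score = cloud_model(clip)
    if cloud_score < cloud_threshold:
        return None  # stage 2 rejects: treated as a false trigger
    return Detection(event, device_score, cloud_score)


cloud_calls = []  # records which clips actually reached the "cloud"


def toy_device_model(clip):
    """Cheap stand-in: reacts to any loud frame."""
    return max(clip)


def toy_cloud_model(clip):
    """Stricter stand-in: requires the whole clip to be loud."""
    cloud_calls.append(clip)
    return sum(clip) / len(clip)


quiet = [0.1, 0.2, 0.1]
microwave_beep = [0.1, 0.9, 0.1]
smoke_alarm = [0.9, 0.95, 0.9]
results = [
    two_stage_detect(c, "smoke_alarm", toy_device_model, toy_cloud_model)
    for c in (quiet, microwave_beep, smoke_alarm)
]
```

In this toy run the quiet clip stops at the device, the single beep reaches the cloud but is filtered out there, and only the sustained alarm produces a detection that would lead to a notification.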
Getting creative to teach Guard sounds
Because home security events are relatively rare — and the data sets for these audio events are quite meager — semi-supervised learning and self-supervised learning have been critical as Sun’s team expands and refines Guard’s capabilities.
“Semi-supervised learning relies on small sets of annotated training data to leverage larger sets of unannotated data,” Sun said. “While self-supervised learning utilizes larger sets of unannotated data with training targets derived from data itself in an unsupervised way, with no human annotations.”
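One common way to use a small annotated set to leverage a larger unannotated one is self-training with pseudo-labels. The article does not say which semi-supervised method the Guard team uses, so the sketch below is only an example of the general idea, using a toy nearest-centroid classifier on a single made-up feature.

```python
def centroids(labeled):
    """Average feature value per label."""
    sums = {}
    for x, y in labeled:
        total, n = sums.get(y, (0.0, 0))
        sums[y] = (total + x, n + 1)
    return {y: total / n for y, (total, n) in sums.items()}


def predict(x, cents):
    """Return (nearest label, margin over the second-nearest label)."""
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - cents[runner_up]) - abs(x - cents[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=0.3):
    """Fit on labeled data, pseudo-label confident unlabeled points, refit."""
    cents = centroids(labeled)
    pseudo = []
    for x in unlabeled:
        label, margin = predict(x, cents)
        if margin >= min_margin:  # keep only confident pseudo-labels
            pseudo.append((x, label))
    return centroids(labeled + pseudo), pseudo


# Hypothetical data: two annotated examples, five unannotated ones.
labeled = [(0.1, "background"), (0.9, "alarm")]
unlabeled = [0.0, 0.2, 0.5, 0.8, 1.0]
model, pseudo = self_train(labeled, unlabeled)
```

The ambiguous point at 0.5 sits equally far from both classes, so it is left out of the retraining set, while the four confident points are pseudo-labeled and used to refit the model.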
“Another technique is to detect for a longer time and aggregate events to be more accurate,” Sun said. To improve the accuracy of sounds with repeating patterns, the detectors look for shorter repeating patterns, such as an appliance beeping. This allows Guard to distinguish between that type of repetitive beeping and an alarm, which can run for 30 seconds or longer. Guard can also detect the difference between a smoke alarm and a carbon monoxide alarm, and notify customers of a specific risk.
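The aggregation idea can be illustrated with a sketch that looks at per-frame detector hits over a longer window and measures how long the repeating pattern lasts. The frame length, the allowed gap between beeps, and the use of 30 seconds as a decision threshold are all assumptions chosen for the example; the article only says an alarm can run for 30 seconds or longer.

```python
def longest_active_span(frame_hits, max_gap):
    """Length, in frames, of the longest run of hits.

    Gaps of up to `max_gap` silent frames are bridged, so a repeating
    beep pattern counts as one continuous span.
    """
    best = 0
    start = None
    last_hit = None
    for i, hit in enumerate(frame_hits):
        if not hit:
            continue
        if start is None or i - last_hit - 1 > max_gap:
            start = i  # gap too long: begin a new span
        last_hit = i
        best = max(best, last_hit - start + 1)
    return best


def classify(frame_hits, seconds_per_frame=1.0, max_gap=2, alarm_seconds=30.0):
    """Label a window by how long its beeping pattern persists."""
    span = longest_active_span(frame_hits, max_gap) * seconds_per_frame
    if span >= alarm_seconds:
        return "alarm"
    if span > 0:
        return "short_beeping"
    return "silence"


# Hypothetical per-frame detector output (1 = beep detected in that frame).
microwave = [1, 0, 1, 0, 1] + [0] * 40   # a few beeps, then quiet
alarm = [1, 0] * 20                      # beeping that keeps going
labels = [classify(microwave), classify(alarm), classify([0] * 10)]
```

Looking at a single frame, the microwave and the alarm are indistinguishable here; only the aggregated duration separates them.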
Guard Plus, a subscription service launched in January, detects sounds that could indicate an intruder, such as footsteps, a door closing, or glass breaking, and can send a Smart Alert mobile notification or play a siren on the Echo device. Alexa can also notify customers about the sound of smoke alarms or carbon monoxide alarms. Because the ambient sounds in places like dense urban environments or apartment complexes can make this tricky, the team added a feature allowing customers to adjust the sensitivity to accommodate the noise in their home environments.
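A sensitivity setting of this kind is often implemented as a mapping from the customer's choice to a score threshold. The article does not describe how Guard does it, so the level names and threshold values below are purely hypothetical.

```python
# Hypothetical mapping: lower sensitivity demands a higher detector score.
SENSITIVITY_THRESHOLDS = {"low": 0.9, "medium": 0.7, "high": 0.5}


def should_alert(score: float, sensitivity: str = "medium") -> bool:
    """Alert only if the detector score clears the chosen threshold."""
    return score >= SENSITIVITY_THRESHOLDS[sensitivity]


# The same borderline score alerts or not depending on the setting.
borderline = 0.75
decisions = {level: should_alert(borderline, level) for level in SENSITIVITY_THRESHOLDS}
```

Under these assumed values, a customer in a noisy apartment building could choose the low setting so that borderline detections no longer trigger a notification.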
The limited annotated data the Guard team had access to has also required them to get creative. Glass breaking, for example, is a rare sound: it’s over in two to three seconds, and it varies based on the type of glass. To bolster their data set, the Guard team rented a warehouse and contracted a construction crew to break hundreds of windows: single pane, double pane, different compositions. This allowed the team to assemble an authentic data set for training the initial model, also called a seed model, before deploying Guard to beta testers.
All of the strategies Sun’s team employed to optimize the recognition system on Echo devices have minimized the error rate.
This is where the powerful AED models in the cloud, Guard’s second validation step, are so essential. The chances of a false alarm are much smaller when audio is processed through both local and cloud systems, Sun said. And, he emphasized, to protect privacy, audio is sent to the cloud only after it has passed through the device-side model.
“Since the very beginning, it’s been critical to build accurate models that consume less resources,” Sun said. “We apply lots of optimization so that this system can be as small and efficient as possible.”
Edge devices like Echo only send data to the cloud when it’s essential. In the case of Guard, that means the majority of the audio data is processed and discarded by the neural network on the device. Only potential triggers make it to the cloud. For those events, customers are able to view, listen to, and delete the audio that Guard detects directly from their Guard History in the Alexa app, or from the Alexa Privacy Settings page.