In June, one of us (Prem) wrote a blog post for Amazon Science arguing that Alexa and, more generally, the field of AI are entering a new “age of self” in which AI systems will become more self-aware, self-learning, and self-service.
Amazon’s progress toward self-service AI was on display today at a virtual event in which we unveiled our new lineup of devices and services.
Among these were three self-service features for Alexa-enabled devices: preference teaching, Custom Sound Event Detection, and, for camera-based Ring devices, Custom Event Alerts.
- Preference teaching allows customers to explicitly teach Alexa which skills should handle particular types of requests, which sports teams they follow, and which cuisines they prefer;
- Custom Sound Event Detection allows customers to teach Alexa to recognize particular household sounds — a doorbell sound, for instance — and to initiate particular Alexa Routines when it hears them;
- Ring Custom Event Alerts let the customer designate a particular region of the image captured by a Ring Video Doorbell camera or Spotlight camera as a region of interest and teach the camera to discriminate different states for that region — a shed door as either open or shut, for instance.
All of these are examples of ways in which Amazon is working to democratize AI by enabling customers to configure machine learning systems as they see fit, without the need for expertise in programming or machine learning.
Preference teaching
Preference teaching allows customers to teach Alexa their preferences using natural language — for instance, “Alexa, I’m a big fan of the Patriots”, or “Alexa, I love Thai food”.
It’s an extension of interactive teaching by customers, which we launched last year. The salient difference is that with preference teaching, customers initiate the teaching; previously, Alexa initiated it in response to a command it could not understand.
At the core of both applications are two models: a natural-language-understanding (NLU) model that identifies the user's intent, along with entity names and entity types, and a dialogue management model that manages the interaction with the customer and decides what actions to take.
An important technical advance this year is that the dialogue management model is, like the NLU model, a deep-neural-network model. We trained it using Alexa Conversations, which allows the designer to simply provide examples of the types of dialogues the model should be able to handle. Alexa Conversations then analyzes the examples and automatically generates variations of them, increasing the amount of data available to train the neural dialogue management model 100-fold.
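As a rough illustration of how a few designer-provided examples can be multiplied into many training examples, the sketch below expands seed templates by substituting slot values and carrier prefixes. This is plain template expansion over single utterances, not the method Alexa Conversations actually uses; the templates, prefixes, slot values, and the `expand` function are all hypothetical.

```python
from itertools import product

# Hypothetical seed data; none of it comes from Alexa Conversations.
SEED_TEMPLATES = [
    "I'm a big fan of the {team}",
    "I love the {team}",
    "my favorite team is the {team}",
]
PREFIXES = ["", "Alexa, ", "hey, "]
SLOT_VALUES = {"team": ["Patriots", "Seahawks", "Giants", "Packers"]}


def expand(templates, prefixes, slot_values):
    """Return every combination of prefix, template, and slot value."""
    variations = []
    for template, prefix, team in product(templates, prefixes, slot_values["team"]):
        variations.append(prefix + template.format(team=team))
    return variations


variations = expand(SEED_TEMPLATES, PREFIXES, SLOT_VALUES)
print(len(SEED_TEMPLATES), "seed templates ->", len(variations), "utterances")
```

Three templates, three prefixes, and four slot values give 36 utterances, a 12-fold increase; larger slot catalogs and multi-turn variations are what would push the multiplier higher.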
At launch, preference teaching will support three classes of preferences: preferred skills for handling weather requests, preferred sports teams, and food preferences. Once our model has identified a customer preference, it searches the relevant knowledge base for a match. If necessary, it will follow up with a request for more information. For instance, if the customer expresses a preference for “the Giants”, and Alexa finds more than one matching name in the sports knowledge base, it might ask, “Did you mean the New York Giants or the San Francisco Giants?”
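The lookup-then-clarify flow described above can be sketched as follows. This is a minimal illustration, not Alexa's implementation: the toy knowledge base, the `Resolution` type, and the `resolve_preference` function are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base mapping canonical names to known aliases.
SPORTS_KB = {
    "New York Giants": {"giants", "new york giants"},
    "San Francisco Giants": {"giants", "san francisco giants"},
    "New England Patriots": {"patriots", "new england patriots"},
}


@dataclass
class Resolution:
    status: str       # "resolved", "ambiguous", or "not_found"
    matches: list     # canonical names that matched the mention
    prompt: str = ""  # follow-up question, if one is needed


def normalize(mention):
    """Lowercase the mention and drop a leading article."""
    text = mention.lower().strip()
    return text[4:] if text.startswith("the ") else text


def resolve_preference(mention, knowledge_base):
    """Look the mention up and decide whether a follow-up question is needed."""
    key = normalize(mention)
    matches = sorted(name for name, aliases in knowledge_base.items() if key in aliases)
    if len(matches) == 1:
        return Resolution("resolved", matches)
    if len(matches) > 1:
        options = " or the ".join(matches)
        return Resolution("ambiguous", matches, f"Did you mean the {options}?")
    return Resolution("not_found", [], "Sorry, I couldn't find that team.")


print(resolve_preference("the Giants", SPORTS_KB).prompt)
```

A mention with exactly one match is stored directly; only an ambiguous mention triggers the extra dialogue turn.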
In ongoing research, Alexa AI scientists are working to add common-sense reasoning to the preference extraction model, so that, for instance, if a customer says, “I don’t eat meat,” Alexa will interpret that as a preference for vegetarian restaurants and recipes.
Custom Sound Event Detection and Ring Custom Event Alerts
Custom Sound Event Detection and Ring Custom Event Alerts take a similar approach to few-shot learning, that is, learning a new classification task from just a handful of examples.
With Custom Sound Event Detection, the customer provides six to ten examples of a new sound — say, the doorbell ringing — when prompted by Alexa. Alexa uses these samples to build a detector for the new sound. Subsequently, when Alexa detects the sound, it will execute a routine set by the customer — say, flashing the lights in the farthest room of the house.
Similarly, with Ring Custom Event Alerts, the customer uses a cursor or, on a touch screen, a finger to outline a region of interest — say, the door of a shed — within the field of view of a particular camera.
Then, by sorting through historical image captures from that camera, the customer identifies five examples of a particular state of that region — say, the shed door open — and five examples of an alternative state — say, the shed door closed. Ring Custom Event Alerts can be configured to send the customer an alert if the state of the region of interest changes.
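The alerting behavior can be pictured as a small loop that classifies each new capture of the region of interest and reports only transitions. The sketch below assumes some classifier already exists; `state_change_alerts` and the brightness-based stand-in classifier are hypothetical and chosen only to keep the example self-contained.

```python
def state_change_alerts(observations, classify):
    """Yield (index, old_state, new_state) whenever the classified state changes."""
    previous = None
    for index, observation in enumerate(observations):
        state = classify(observation)
        if previous is not None and state != previous:
            yield index, previous, state
        previous = state


def toy_classifier(score):
    """Stand-in for a trained model: a single score per capture, thresholded."""
    return "open" if score > 0.5 else "closed"


captures = [0.1, 0.2, 0.9, 0.8, 0.1]  # hypothetical per-capture scores
alerts = list(state_change_alerts(captures, toy_classifier))
print(alerts)
```

Reporting only transitions means that a shed door left open produces one alert rather than one per capture.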
In both cases, we train neural models on classification tasks — audio classification in one case, video in the other. The models are encoder-decoder models, meaning they have encoder modules that embed inputs, or convert them into vector representations. On the basis of those embeddings, the decoders make predictions.
For event detection — whether audio or visual — we use the encoders only. When examples of the same type of event pass through the encoder, the resulting embeddings define a region in the embedding space. Recognizing later instances of the same event is just a matter of gauging their embeddings’ distance from those of the examples.
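One simple way to realize this idea is to average the example embeddings into a centroid and accept any later embedding that falls within some radius of it. The sketch below makes that assumption; the production distance measure and decision rule are not described here, and the `FewShotDetector` class, its `margin` parameter, and the two-dimensional embeddings are illustrative only.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


class FewShotDetector:
    """Accept embeddings that lie close to the centroid of the enrolled examples."""

    def __init__(self, example_embeddings, margin=1.5):
        self.center = centroid(example_embeddings)
        # Radius of the enrolled examples, widened by a safety margin (assumed value).
        spread = max(math.dist(e, self.center) for e in example_embeddings)
        self.threshold = margin * spread

    def matches(self, embedding):
        return math.dist(embedding, self.center) <= self.threshold


# Six hypothetical encoder outputs for the customer's doorbell recordings.
doorbell = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2], [0.8, 0.9], [1.0, 0.9]]
detector = FewShotDetector(doorbell)
print(detector.matches([1.1, 1.0]), detector.matches([3.0, -1.0]))
```

Because only distances in the embedding space are computed, enrolling a new sound or region requires no retraining of the encoder itself.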
To train the encoder for Custom Sound Event Detection, the Alexa team took advantage of self-supervised learning. In the first stage of training, we trained the network simply to reproduce the input signal: that is, from the embedding, the decoder had to reconstruct the encoder’s input. This enabled us to develop a strong encoder using only unlabeled data.
Then we fine-tuned the model on labeled data, namely sound recordings labeled by type. This enabled the encoder to learn finer distinctions between different types of sounds. Ring Custom Event Alerts uses this approach too, drawing on publicly available data.
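The two training stages can be illustrated with a deliberately tiny model. In the sketch below, a one-dimensional linear autoencoder stands in for the deep networks: stage one minimizes reconstruction error on unlabeled vectors, and stage two continues training the same encoder weights together with a logistic classification head. The data, learning rates, and function names are all assumptions made for the example.

```python
import math


def encode(w, x):
    """Linear encoder: project the input onto a single embedding dimension."""
    return sum(wi * xi for wi, xi in zip(w, x))


def reconstruction_loss(w, v, data):
    total = 0.0
    for x in data:
        z = encode(w, x)
        total += sum((vi * z - xi) ** 2 for vi, xi in zip(v, x))
    return total / len(data)


def pretrain(w, v, data, lr=0.05, epochs=200):
    """Stage 1: full-batch gradient descent on reconstruction error (no labels)."""
    w, v = list(w), list(v)
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_v = [0.0] * len(w), [0.0] * len(v)
        for x in data:
            z = encode(w, x)
            err = [vi * z - xi for vi, xi in zip(v, x)]
            dz = 2 * sum(e * vi for e, vi in zip(err, v))
            for i in range(len(v)):
                grad_v[i] += 2 * err[i] * z / n
            for j in range(len(w)):
                grad_w[j] += dz * x[j] / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
    return w, v


def fine_tune(w, labeled, lr=0.1, epochs=100):
    """Stage 2: update the encoder and a logistic head on labeled examples."""
    w, a, b = list(w), 0.0, 0.0
    n = len(labeled)
    for _ in range(epochs):
        grad_w, grad_a, grad_b = [0.0] * len(w), 0.0, 0.0
        for x, y in labeled:
            z = encode(w, x)
            delta = 1.0 / (1.0 + math.exp(-(a * z + b))) - y
            grad_a += delta * z / n
            grad_b += delta / n
            for j in range(len(w)):
                grad_w[j] += delta * a * x[j] / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        a, b = a - lr * grad_a, b - lr * grad_b
    return w, a, b


unlabeled = [(t, 2 * t) for t in (-1.0, -0.5, 0.5, 1.0)]  # hypothetical features
labeled = [(x, 1 if x[0] > 0 else 0) for x in unlabeled]  # hypothetical labels

before = reconstruction_loss([0.5, 0.5], [0.5, 0.5], unlabeled)
w, v = pretrain([0.5, 0.5], [0.5, 0.5], unlabeled)
after = reconstruction_loss(w, v, unlabeled)
w, a, b = fine_tune(w, labeled)
accuracy = sum((a * encode(w, x) + b > 0) == bool(y) for x, y in labeled) / len(labeled)
print(f"reconstruction loss {before:.3f} -> {after:.3f}, accuracy {accuracy:.2f}")
```

The point of the ordering is that the encoder starts fine-tuning from weights that already capture the structure of the unlabeled data, so the labeled set can be comparatively small.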
Preference teaching and custom event detection are just a few of the ways in which we are working to democratize AI. We continue to advance the science of self-service to make AI more customizable and useful for everyone.