Alexa currently has more than 90,000 skills, or abilities contributed by third-party developers — the NPR skill, the Find My Phone skill, the Jeopardy! skill, and so on.
For each skill, the developer has to specify both slots — the types of data the skill will act on — and slot values — the particular values that the slots can assume. A restaurant-finding skill, for instance, would probably have a slot called something like CUISINE_TYPE, which could take on values such as “Indian”, “Chinese”, “Mexican”, and so on.
For some skills, exhaustively specifying slot values is a laborious process. We’re trying to make it easier with a tool we’re calling catalogue value suggestions, which is currently available to English-language skill developers and will soon expand to other languages.
With catalogue value suggestions, the developer supplies a list of slot values, and based on that list, a neural network suggests a range of additional slot values. So if, for example, the developer provided the CUISINE_TYPEs “Indian”, “Chinese”, and “Mexican”, the network might suggest “Ethiopian” and “Peruvian”. The developer can then choose whether to accept or reject each suggestion.
“This will definitely improve the dev process of creating a skill,” says José Chavez Marino, an Xbox developer with Microsoft. “The suggestions were very good, but even if they were not accurate, you just don't use them. I only see positive things on implementing this in the Alexa dev console.”
The system depends centrally on the idea of embeddings, or representing text strings as points in a multidimensional space, such that strings with similar semantic content are close together. We use proximity in the embedding space as the basis for three distinct tasks: making the slot value suggestions themselves; weeding offensive terms out of the value suggestion catalogue; and identifying slots whose values are so ambiguous that suggestions would be unproductive.
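Proximity here is simply distance between vectors. As a rough illustration of the core operation (not the production system), the sketch below uses cosine similarity over a small, hypothetical embedding matrix to find a token's nearest neighbors; the three tasks described below all reduce to lookups of this kind.

```python
import numpy as np

def nearest_neighbors(query_vec, embeddings, vocab, k=5):
    """Return the k tokens whose embeddings are closest (by cosine similarity)
    to query_vec. `embeddings` is a (vocab_size, dim) matrix aligned with `vocab`."""
    # Normalize so that dot products equal cosine similarities.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = emb_norm @ q
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

# Hypothetical toy data: a four-token vocabulary with three-dimensional embeddings.
vocab = ["indian", "chinese", "mexican", "jazz"]
embeddings = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.95],
])
print(nearest_neighbors(embeddings[0], embeddings, vocab, k=3))
```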
The first step in building our catalogue of slot value suggestions was to assemble a list of phrases, since slot values frequently consist of more than one word (restaurant names and place names, for instance). When training our embedding network, we treated both phrases and non-phrasal words as tokens, or semantic units.
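The post doesn't specify how phrases were identified. One common approach, shown below purely as a sketch, is to merge adjacent words into a single token when they co-occur much more often than their individual frequencies would predict, using a word2vec-style scoring formula; the parameter names and threshold are illustrative.

```python
from collections import Counter

def find_phrases(sentences, min_count=5, threshold=10.0):
    """Merge adjacent word pairs into single tokens (e.g. 'new_york') when the
    pair occurs far more often than chance. Each sentence is a list of words.
    Score = (count(a,b) - min_count) * total_tokens / (count(a) * count(b));
    pairs scoring above `threshold` are treated as phrases."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total = sum(unigrams.values())
    phrases = {
        (a, b)
        for (a, b), c in bigrams.items()
        if (c - min_count) * total / (unigrams[a] * unigrams[b]) > threshold
    }

    def retokenize(sent):
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in phrases:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        return out

    return [retokenize(s) for s in sentences]
```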
We then fed the network training data in overlapping five-token chunks. For any given input token, the network would learn to predict the two tokens that preceded it and the two that followed it. The outputs of the network thus represented the frequencies with which tokens co-occurred, which we used to group tokens together in the embedding space.
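Concretely, this is a skip-gram-style setup with a context window of two tokens on each side of the center token. A minimal sketch of how such training pairs could be generated from a token stream (the model that consumes them is omitted):

```python
def skipgram_pairs(tokens, window=2):
    """Slide over the token stream and, for each center token, emit
    (center, context) training pairs for the `window` tokens on either side.
    With window=2, each five-token chunk yields up to four pairs per center token."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["book", "a", "table", "at", "an", "indian_restaurant"]
for center, context in skipgram_pairs(tokens):
    print(center, "->", context)
```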
Next, we removed offensive content from the catalogue. We combined and pruned several publicly available blacklists of offensive terms, embedded their contents, and identified words near them in the embedding space. For each of those nearby neighbors, we looked at its 10 nearest neighbors. If at least five of these were already on the blacklist, we blacklisted the new term as well.
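The following is a simplified sketch of that neighbor-voting rule. For brevity it applies the 10-nearest-neighbor vote to every candidate token, rather than only to terms already found near blacklisted entries; the function and parameter names are illustrative.

```python
import numpy as np

def expand_blacklist(embeddings, vocab, blacklist, k=10, min_hits=5):
    """Flag any candidate token whose k nearest neighbors in the embedding space
    include at least `min_hits` already-blacklisted terms.
    `embeddings` is a (vocab_size, dim) matrix aligned with `vocab`."""
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    vocab_set = set(vocab)
    blacklisted = {w for w in blacklist if w in vocab_set}
    flagged = set()
    for i, token in enumerate(vocab):
        if token in blacklisted:
            continue
        sims = emb_norm @ emb_norm[i]
        sims[i] = -np.inf                      # exclude the token itself
        neighbors = np.argsort(-sims)[:k]
        hits = sum(vocab[j] in blacklisted for j in neighbors)
        if hits >= min_hits:
            flagged.add(token)
    return flagged
```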
When a developer provides us with a list of values for a particular slot, our system finds their average embedding and selects its nearest neighbors as slot value suggestions. If the developer-provided values lie too far from their average (see figure, above), the system concludes that the slot is too ambiguous to yield useful suggestions.
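A sketch of that suggestion step, under the same assumptions as the snippets above; the spread threshold used for the ambiguity check is an illustrative stand-in, since the post doesn't give the actual criterion.

```python
import numpy as np

def suggest_values(seed_values, embeddings, vocab, k=10, max_spread=0.6):
    """Average the embeddings of the developer-provided values and return the
    nearest catalogue entries as suggestions. If the seed values lie too far
    from their centroid (spread above `max_spread`, an illustrative threshold),
    treat the slot as too ambiguous and suggest nothing."""
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(vocab)}
    seed_ids = [index[v] for v in seed_values if v in index]
    if not seed_ids:
        return []
    centroid = emb_norm[seed_ids].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Spread = average cosine distance of the seed values from their centroid.
    spread = float(np.mean(1.0 - emb_norm[seed_ids] @ centroid))
    if spread > max_spread:
        return []                              # slot judged too ambiguous
    sims = emb_norm @ centroid
    sims[seed_ids] = -np.inf                   # don't re-suggest the given values
    top = np.argsort(-sims)[:k]
    return [vocab[i] for i in top]
```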
To test our system, we extracted 500 random slots from the 1,000 most popular Alexa skills and used half the values for each slot to generate suggestions. On average, the system provided 6.51 suggestions per slot, and human reviewers judged that 88.5% of them were situationally appropriate.
Acknowledgments: Markus Dreyer, Likhitha Patha, Ben Overholts, Sam Sussman, Yash Naik, Adam Hasham