At this year’s Interspeech, the largest annual conference on speech-related technologies, Shehzad Mevawalla, the director of automatic speech recognition for Alexa, will deliver a keynote address on “successes, challenges, and opportunities for speech technology in conversational agents.”
Ironically, Mevawalla is, in his own description, “not a speech person.”
“I'm a software engineer by trade and by training,” Mevawalla says. “When I started with Amazon, I ran Amazon's business intelligence. Following that, I was running seller performance for third-party sellers, making sure that they were providing a great experience for our customers. Then I spent the next four years in inventory management, where my systems were basically purchasing all of Amazon's inventory worldwide and figuring out how to get it into our warehouses.”
That experience with large-scale systems made Mevawalla an attractive candidate to lead the Alexa speech recognition engineering team, a responsibility he assumed three years ago.
“Alexa has tens of millions of devices out there, and with that kind of scale it’s definitely a challenge,” says Mevawalla. “For example, Christmas morning, millions of people unwrap Alexa devices. They all fire up at the same time. You get a tremendous spike of activity, and you cannot disappoint customers on Christmas morning. So your systems have to be able to scale.
“And it has to be always on, always reliable. You cannot miss an alarm. You cannot miss a timer. And we continue to add more functionality. So, for example, we added the ability for customers to whisper, and Alexa will whisper back. That cannot add latency. A couple of years ago, we added speaker ID, so we can personalize customers’ Alexa experiences. Again, that has to run with no added latency. Another example is automatic language switching, so you can set your US device to dual languages — for example, an English-Spanish mix. We have to detect whether the customer is speaking English or Spanish in real time and respond in kind. You can't say, ‘Okay, let me listen to this person first for a while, and let's see if they are speaking English or Spanish.’ You have to make the decision immediately.”
Going end-to-end
Mevawalla was no stranger to machine learning when he joined the speech recognition effort. Among other things, his work with Amazon’s third-party sellers involved fraud detection, a problem he had worked on for nine years before joining Amazon and one that depends heavily on machine learning models. Two years ago, his responsibilities expanded to include Alexa speech recognition’s scientific research.
Looking forward to this years' keynote speakers at #INTERSPEECH2020!
🔹 Janet Pierrehumbert (Uni. of Oxford)
🔹 Barbara @ShinnCunningham (Carnegie Mellon Uni.)
🔹 Lin-shan Lee (National Taiwan Uni.)
🔹 Shehzad Mevawalla (Amazon Alexa)
Only 6 weeks to go: https://t.co/u0ZU2GsrBi
— INTERSPEECH 2020 (@interspeech20) September 14, 2020
Even in the short time he’s led the science team, Mevawalla says, he’s seen dramatic changes. “I think we’ve made huge strides in the last couple of years,” he says. “For example, we can now run full-capability speech recognition on-device. The models that used to be many gigabytes in size, required huge amounts of memory, and ran on massive servers in the cloud — we're now able to take those models and shrink them into tiny footprints and fit them into devices that are no larger than a tin can.”
In large part, Mevawalla explains, that’s because of a move to end-to-end models, neural networks that take acoustic speech signals as input and directly output transcribed speech. In the past, by contrast, Alexa’s speech recognizers had specialized components that processed inputs in sequence, such as an acoustic model and a language model.
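For a concrete picture of what “end-to-end” means here, the sketch below shows a single PyTorch network that maps audio features directly to character probabilities and is trained with a CTC objective. It is a deliberately tiny illustration with made-up dimensions, not Alexa’s production model.

```python
import torch
import torch.nn as nn

class TinyEndToEndASR(nn.Module):
    """Toy end-to-end recognizer: log-mel frames in, character logits out.

    Illustrative only -- dimensions and vocabulary are invented, and this is
    not Alexa's actual architecture.
    """
    def __init__(self, n_mels=80, hidden=256, vocab_size=29):  # 26 letters + space + apostrophe + CTC blank
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, features):            # features: (batch, time, n_mels)
        encoded, _ = self.encoder(features)
        return self.classifier(encoded)      # (batch, time, vocab_size)

model = TinyEndToEndASR()
ctc_loss = nn.CTCLoss(blank=0)               # the whole network is trained with one CTC objective

features = torch.randn(1, 200, 80)           # ~2 seconds of dummy log-mel frames
logits = model(features)
print(logits.shape)                          # torch.Size([1, 200, 29])
```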
“The acoustic model is trying to recognize the phonemes that are coming in — the basic units of speech,” Mevawalla explains. “Then the language model would do a search using n-grams” — sequences of between three and five words. “You could have multiple combinations of sentences that are possibilities given a string of phonemes,” he says. “Which path do you go down? These large search trees are built by processing large amounts of text data and noting the probability that words occur in a particular order.
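The toy snippet below illustrates the language-model half of that older pipeline: a hand-built bigram table scores two acoustically similar candidate transcriptions, and the more probable word sequence wins. A real system would use far larger n-gram inventories and a beam search over the acoustic model’s phoneme hypotheses; every probability here is invented.

```python
import math

# Toy bigram probabilities P(word | previous word) -- values invented for illustration.
bigram_logprob = {
    ("play", "some"):  math.log(0.20),
    ("some", "music"): math.log(0.30),
    ("play", "sum"):   math.log(0.001),
    ("sum", "music"):  math.log(0.0005),
}
FLOOR = math.log(1e-6)  # crude back-off for unseen word pairs

def score(sentence):
    """Sum of log bigram probabilities for a candidate transcription."""
    words = sentence.split()
    return sum(bigram_logprob.get((w1, w2), FLOOR)
               for w1, w2 in zip(words, words[1:]))

# Two candidates the acoustic model might find nearly indistinguishable:
for candidate in ("play some music", "play sum music"):
    print(f"{candidate!r}: {score(candidate):.2f}")
# The language model strongly prefers 'play some music'.
```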
“With an end-to-end model, you no longer have the overhead of the various separate components, or those giant search trees. Instead, you now have a full neural representation, which by itself reduces the model to one-hundredth of its former size. Various quantization techniques minimize the memory and compute footprint even further without losing any accuracy. These models can then be deployed on our devices and executed using our own Amazon neural processor, AZ1 — a neural accelerator that is optimized to run DNNs [deep neural networks].”
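As a rough illustration of that quantization step, the hedged sketch below shrinks a stand-in network with PyTorch’s stock post-training dynamic quantization, converting its 32-bit float weights to 8-bit integers. The network and its dimensions are invented; Alexa’s on-device pipeline and the AZ1 toolchain are naturally far more elaborate.

```python
import io
import torch
import torch.nn as nn

# Stand-in network (invented sizes), used only to show the effect of
# post-training dynamic quantization on stored weight size.
float_model = nn.Sequential(
    nn.Linear(80, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 29),
)

# Convert the Linear layers' float32 weights to int8.
int8_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(model):
    """Size of the saved weights, in bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes

print("fp32 weights:", serialized_size(float_model), "bytes")
print("int8 weights:", serialized_size(int8_model), "bytes")  # roughly 4x smaller
```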
Speaker ID
Alexa’s speaker ID function — which recognizes which customer is speaking, so that Alexa can personalize responses — has also moved to an end-to-end model, Mevawalla says.
“We've made a lot of huge improvements,” he says. “We have a two-model approach to speaker ID, where we combine text-dependent and text-independent models. The text-dependent model knows what you're saying ahead of time, so it can match it. With Alexa, most utterances start with the wake word, whether it be ‘Alexa’, ‘Computer’, or ‘Amazon’. So it matches the way you say ‘Alexa’ with the way you said ‘Alexa’ before. Then there's the text-independent model, which matches your voice independent of what you're saying. Moving both these systems to full neural and combining them has given us an order of magnitude improvement in speaker identification systems.”
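In outline, that combination can be pictured as two voice embeddings per utterance — one from the text-dependent model, one from the text-independent model — each compared against a customer’s enrolled profile and then fused into a single score. The sketch below uses placeholder encoders, weights, and threshold purely for illustration; it is not Alexa’s actual speaker ID system.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(td_test, ti_test, profiles, weight_td=0.5, threshold=0.7):
    """Fuse text-dependent (wake-word) and text-independent scores.

    `profiles` maps a user name to a pair of enrolled embeddings:
    (text_dependent_embedding, text_independent_embedding).
    The fusion weight and threshold are placeholders.
    """
    best_user, best_score = None, -1.0
    for user, (td_enrolled, ti_enrolled) in profiles.items():
        score = (weight_td * cosine(td_test, td_enrolled)
                 + (1 - weight_td) * cosine(ti_test, ti_enrolled))
        if score > best_score:
            best_user, best_score = user, score
    return (best_user, best_score) if best_score >= threshold else (None, best_score)

# Dummy 128-dimensional embeddings standing in for neural encoder outputs.
rng = np.random.default_rng(0)
profiles = {"alice": (rng.normal(size=128), rng.normal(size=128)),
            "bob":   (rng.normal(size=128), rng.normal(size=128))}
td_test = profiles["alice"][0] + 0.1 * rng.normal(size=128)
ti_test = profiles["alice"][1] + 0.1 * rng.normal(size=128)
print(identify_speaker(td_test, ti_test, profiles))  # identifies 'alice' with a high fused score
```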
Mevawalla also points to the speech recognition team’s expanded use of semi-supervised learning (SSL), in which a machine learning model trained on a small body of annotated data itself labels a much larger body of data, which is in turn used for additional training — either of the original model or of a lighter, faster model.
“The volume of data that we can process using SSL is something we've enhanced over the last year,” Mevawalla says. “Language pooling, when combined with SSL, is another technique that we have leveraged very effectively. And that’s completely non-reviewed, unannotated data that a machine transcribed.”
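The self-training loop behind this kind of semi-supervised learning can be sketched in a few lines: a teacher model trained on human-transcribed audio labels a much larger pool of untranscribed audio, and only its confident machine transcriptions are folded back into training. The trainer, decoder, and confidence threshold below are placeholders standing in for a real pipeline.

```python
# Structural sketch of semi-supervised self-training for speech recognition.
# train() and transcribe_with_confidence() are placeholders for whatever
# trainer and decoder a real system would use; the threshold is invented.

CONFIDENCE_THRESHOLD = 0.9

def self_training(labeled_data, unlabeled_audio, train, transcribe_with_confidence):
    """One round of pseudo-labeling.

    labeled_data:    list of (audio, human_transcript) pairs
    unlabeled_audio: list of audio clips with no transcripts
    """
    # 1. Train a teacher on the small human-annotated set.
    teacher = train(labeled_data)

    # 2. Let the teacher transcribe the large unannotated pool.
    pseudo_labeled = []
    for audio in unlabeled_audio:
        transcript, confidence = transcribe_with_confidence(teacher, audio)
        if confidence >= CONFIDENCE_THRESHOLD:  # keep only confident machine transcripts
            pseudo_labeled.append((audio, transcript))

    # 3. Train a student (or retrain the same model) on human plus machine labels.
    student = train(labeled_data + pseudo_labeled)
    return student
```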
At this year’s Interspeech, “I'm really excited to share all the innovations that have happened within Alexa,” Mevawalla adds. “I think we've moved the needle in a lot of really new ways.”