Jimmy Kunzmann, a senior manager for applied science with Alexa AI, is one of the sponsorship chairs at this year’s IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). His research team is also presenting two papers at the conference, both on the topic of “signal-to-interpretation”, or the integration of automatic speech recognition and natural-language understanding into a single machine learning model.
“Signal-to-interpretation derives the domain, intent, and slot values directly from the audio signal, and it’s becoming more and more of a hot topic in research land,” Kunzmann says. “Research is driven largely by what algorithm gives the best performance in terms of accuracy, and signal-to-interpretation can drive accuracy up and latency and memory footprint down.”
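In other words, where a traditional pipeline first transcribes the audio and then hands the transcript to a separate understanding model, a signal-to-interpretation model maps the signal straight to a structured interpretation. The toy Python sketch below is meant only to show the shape of that output; every function and label in it is a made-up stand-in, not Alexa’s actual interface.

```python
# Toy illustration of the two approaches; all functions and labels here are
# invented stand-ins, not Alexa's real interfaces or label sets.

def asr(audio):
    return "turn on the lights"                      # stage 1: speech -> text

def nlu(transcript):
    return {"domain": "SmartHome",
            "intent": "TurnOnApplianceIntent",
            "slots": {"appliance": "lights"}}        # stage 2: text -> interpretation

def traditional_pipeline(audio):
    return nlu(asr(audio))                           # two separate, sequential models

def signal_to_interpretation(audio):
    # A single jointly trained model produces the same structured output
    # directly from the audio signal, with no intermediate transcript.
    return {"domain": "SmartHome",
            "intent": "TurnOnApplianceIntent",
            "slots": {"appliance": "lights"}}

print(signal_to_interpretation("switch_on_lights.wav"))
```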
The Alexa AI team is constantly working to improve Alexa’s accuracy, but its interest in signal-to-interpretation stemmed from the need to ensure Alexa’s availability on resource-constrained devices with intermittent Internet connections.
“If Internet connectivity drops all of a sudden, and nothing is working anymore, in a home or car environment, that's frustrating — when your lights are not switched on anymore, or you can’t call your favorite contacts in your car,” Kunzmann says.
Kunzmann says that his team’s early work concentrated on finding techniques to dramatically reduce the memory footprint of models that run on-device — techniques such as perfect hashing. But that work still approached automatic speech recognition (ASR) and natural-language understanding (NLU) as separate, sequential tasks.
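Perfect hashing saves memory because, once a collision-free hash function has been found for a fixed set of keys, only the values need to be stored; the keys themselves are never kept on the device. The toy sketch below shows the idea with a brute-force salt search and made-up n-gram weights; it is purely illustrative, not Alexa’s production scheme.

```python
import hashlib

def slot(salt, key, table_size):
    """Map a key to a table slot using a salted hash."""
    digest = hashlib.md5(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % table_size

def find_perfect_salt(keys, table_size):
    """Brute-force a salt that maps every key to a distinct slot.
    (A toy construction; scalable schemes such as CHD are used in practice.)"""
    for salt in range(1_000_000):
        if len({slot(salt, k, table_size) for k in keys}) == len(keys):
            return salt
    raise RuntimeError("no collision-free salt found; try a larger table")

# Hypothetical n-gram log-probabilities we want to keep on-device.
ngram_weights = {"turn on": -0.7, "turn off": -0.9, "the lights": -1.2, "next page": -1.5}

table_size = 8
salt = find_perfect_salt(ngram_weights, table_size)

# Store only the values, indexed by the perfect hash: the string keys
# themselves never have to be held in memory.
table = [0.0] * table_size
for ngram, weight in ngram_weights.items():
    table[slot(salt, ngram, table_size)] = weight

print(table[slot(salt, "turn on", table_size)])   # -0.7
```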
More recently, he says, the team has moved to end-to-end neural-network-based models that tightly couple ASR and NLU, enabling more compact on-device models.
“By replacing traditional techniques with neural techniques, we could get a smaller footprint — and faster and more accurate models, actually,” Kunzmann says. “And the closer we couple all system components, the more we increase reliability.”
Running end-to-end models on device can also improve responsiveness, Kunzmann says.
“Fire TV customers said that when we process requests like switching channels or proceeding to the next page on-device, we are much faster, and usability goes up,” he says.
At ASRU, Kunzmann’s team is reporting on two new projects to make on-device, neural, signal-to-interpretation models even more useful.
Dynamic content
One paper, “Context-aware Transformer transducer for speech recognition”, considers the problem of how to incorporate personalized content — for instance, names from an address book, or the custom names of smart appliances — into neural models at run time.
“In the old days, they had so-called class-based language models, and at inference time, you could load these lists dynamically and get the user’s personalized content decoded,” Kunzmann says. “With neural approaches, you have a huge parameter set, but it is all pretrained. So you have to invent means of ingesting user data at run time.
“The neural network has numerous layers, represented typically as vectors of probabilities. If you are going from one layer to the other, you feed updated probabilities forward. You can ingest information by changing these probabilities based on dynamic content, which allows you to change output probabilities to recognize user context — like your personal address book or your location of interest.”
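What Kunzmann is describing is a form of contextual biasing: the decoder’s hidden representation attends over embeddings of the user’s dynamic entries, and the resulting context vector nudges the output probabilities toward those entries. The numpy sketch below is a minimal illustration of that idea, with random stand-in weights and a toy vocabulary rather than the Transformer-transducer architecture the paper actually uses.

```python
import hashlib
import numpy as np

d = 16                                   # illustrative hidden size
vocab = ["<blank>", "call", "play", "mom", "jimmy"]

def embed(text):
    """Stand-in for a learned context encoder over catalog entries."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).normal(size=d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def biased_logits(decoder_state, base_logits, catalog, w_context):
    """Cross-attend from the decoder state over embeddings of the user's
    dynamic catalog (contacts, device names, ...) and add the resulting
    context vector's contribution to the output logits."""
    keys = np.stack([embed(entry) for entry in catalog])      # (n_entries, d)
    attn = softmax(keys @ decoder_state / np.sqrt(d))         # attention over catalog
    context = attn @ keys                                     # (d,)
    return base_logits + w_context @ context                  # shifted logits

rng = np.random.default_rng(0)
decoder_state = rng.normal(size=d)
base_logits = rng.normal(size=len(vocab))
w_context = rng.normal(size=(len(vocab), d)) * 0.1

catalog = ["jimmy", "mom"]                                    # the user's address book
probs = softmax(biased_logits(decoder_state, base_logits, catalog, w_context))
print(dict(zip(vocab, probs.round(3))))
```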
Multilingual processing
The other ASRU paper from Kunzmann’s team, “In pursuit of babel: Multilingual end-to-end spoken language understanding”, tackles the problem of bringing multilingual models, which can respond in kind to requests in any of several languages, on-device.
In the cloud-based version of Alexa’s multilingual service, the same customer utterance is sent to multiple ASR models at once. Once a separate language identification model has determined what language is being spoken, the output of the appropriate ASR model is used for further processing. This prevents delays, because it enables the ASR models to begin working before the language has been identified.
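The cloud flow is a fan-out: every ASR model begins decoding immediately, and the language-ID result decides which model’s output is kept. A minimal sketch of that pattern, with made-up stand-in functions in place of the real model endpoints, might look like this:

```python
import concurrent.futures as cf

# Hypothetical stand-ins for the per-language recognizers and the
# language-identification model; in the cloud service these are separate
# model endpoints, not local functions.
def run_asr(lang, audio):
    return f"<{lang} transcript of {audio}>"

def identify_language(audio):
    return "es"

def recognize(audio, langs=("en", "es", "fr")):
    with cf.ThreadPoolExecutor() as pool:
        # Fan out: every ASR model starts decoding immediately ...
        futures = {lang: pool.submit(run_asr, lang, audio) for lang in langs}
        # ... while language ID runs concurrently; its answer decides
        # which model's output is kept for downstream processing.
        detected = identify_language(audio)
        return futures[detected].result()

print(recognize("utterance.wav"))
```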
“On-device, we cannot afford that, because we don't have compute fleets running in parallel,” Kunzmann says. “Remember, signal-to-interpretation is one system that tightly couples ASR and NLU. In a nutshell, we show that we can train the signal-to-interpretation models on data from three different locales — in this case, English, Spanish, and French — and that improves accuracy and shrinks the model footprint. We could improve these systems’ performance by an order of magnitude and run these models on-device.”
“I think this is a core aspect of what we want to do in science at Amazon — driving the research community to new areas,” Kunzmann says. “Performance improvements, like dynamic content processing, are helping research generally, but they’re also helping solve our customer problems.”