- AMTA 2020: We present Sockeye 2, a modernized and streamlined version of the Sockeye neural machine translation (NMT) toolkit. New features include a simplified code base through the use of MXNet’s Gluon API, a focus on state-of-the-art model architectures, distributed mixed-precision training, and efficient CPU decoding with 8-bit quantization. These improvements result in faster training and inference, higher automatic… (a toy sketch of int8 quantization appears after this list).
- Interspeech 2020: Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human levels. What is still lacking for human-like communication, however, is the dynamic variation and adaptability of human speech in more complex scenarios. This work attempts to achieve more dynamic and natural intonation in TTS systems, particularly for stylistic…
- Interspeech 2020: A modern Spoken Language Understanding (SLU) system usually contains two sub-systems, Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), where ASR transcribes the voice signal to text and NLU performs intent classification and slot filling on that text. In practice, such a decoupled ASR/NLU design facilitates fast model iteration for both components. However, this makes downstream NLU… (the pipeline sketch after this list illustrates the decoupled interface).
- Interspeech 2020: Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesizing speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness from a source audio at a very granular level and transferring them when synthesizing speech in a different target speaker’s voice. Current approaches for fine-grained PT suffer…
- Interspeech 2020: The speech recognition training data corresponding to digital voice assistants is dominated by wake-words. Training end-to-end (E2E) speech recognition models without careful attention to such data results in sub-optimal performance, as models prioritize learning wake-words. To address this problem, we propose a novel discriminative initialization strategy by introducing a regularization term to penalize… (a generic form of such a penalty is sketched after this list).
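On the Sockeye 2 item above: 8-bit quantized decoding stores weights as int8 alongside a floating-point scale and reconstructs (or computes directly with) them at inference time. Below is a minimal NumPy sketch of symmetric per-tensor int8 quantization; it illustrates the general idea only and is not Sockeye’s actual implementation.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)  # toy weight matrix
q, s = quantize_int8(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, s)).max())
```

Storing `q` instead of `w` cuts weight memory by roughly 4x, which, together with int8 matrix multiplication, is what drives the CPU decoding speedups such systems report.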
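On the SLU item above: the decoupled ASR/NLU design can be pictured as two independently trained stages connected only by text. The sketch below uses hypothetical `asr()` and `nlu()` stubs (names and outputs invented here) to show the interface; note that NLU only ever sees ASR’s possibly erroneous transcript, which is the coupling problem the abstract alludes to.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class NLUResult:
    intent: str
    slots: List[Tuple[str, str]]  # (slot_name, value) pairs

def asr(audio: bytes) -> str:
    """Hypothetical ASR stage: raw audio in, text transcript out."""
    return "play jazz in the kitchen"  # stubbed transcript

def nlu(text: str) -> NLUResult:
    """Hypothetical NLU stage: text in, intent and slots out."""
    return NLUResult(intent="PlayMusic",
                     slots=[("genre", "jazz"), ("room", "kitchen")])

# Decoupled pipeline: each stage can be retrained independently,
# but NLU inherits any recognition errors ASR makes.
result = nlu(asr(b"...raw audio bytes..."))
print(result.intent, result.slots)
```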
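On the wake-word item above: one generic way to keep a model from over-allocating capacity to wake-words is to add a penalty term to the training loss. The snippet below shows that pattern with a hypothetical penalty on the probability mass assigned to wake-word tokens; the paper’s actual regularizer and discriminative initialization are not specified in the excerpt, so treat this purely as an illustration of the loss-plus-lambda-times-penalty structure.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, target: int) -> float:
    """Negative log-likelihood of the target token."""
    return -float(np.log(probs[target] + 1e-12))

def regularized_loss(probs: np.ndarray, target: int,
                     wake_word_ids: list, lam: float = 0.1) -> float:
    """Cross-entropy plus a penalty on the probability mass given to
    wake-word tokens (hypothetical form, for illustration only)."""
    penalty = float(probs[wake_word_ids].sum())
    return cross_entropy(probs, target) + lam * penalty

probs = np.array([0.7, 0.2, 0.1])  # toy 3-token output distribution
print(regularized_loss(probs, target=1, wake_word_ids=[0]))
```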
Related content
- May 24, 2018: Amazon scientists are continuously expanding Alexa’s natural-language-understanding (NLU) capabilities to make Alexa smarter, more useful, and more engaging.
- May 11, 2018: Smart speakers, such as the Amazon Echo family of products, are growing in popularity among consumer and business audiences. In order to improve the automatic speech recognition (ASR) and full-duplex voice communication (FDVC) performance of these smart speakers, acoustic echo cancellation (AEC) and noise reduction systems are required. These systems reduce the noises and echoes that can impact operation, such as an Echo device accurately hearing the wake word “Alexa.” (A textbook echo-cancellation sketch appears after this list.)
- May 04, 2018: In recent years, the amount of textual information produced daily has increased exponentially. This information explosion has been accelerated by the ease with which data can be shared across the web. Most of this textual information is generated as free-form text, and only a small fraction is available in a structured format (Wikidata, Freebase, etc.) that can be processed and analyzed directly by machines.
- April 25, 2018: This morning, I am delivering a keynote talk at the World Wide Web Conference in Lyon, France, with the title “Conversational AI for Interacting with the Digital and Physical World.”
- April 12, 2018: The Amazon Echo is a hands-free smart home speaker you control with your voice. The first important step in enabling a delightful customer experience with an Echo or other Alexa-enabled device is wake word detection, so accurate detection of “Alexa” or substitute wake words is critical. It is challenging to build a wake word system with low error rates when computational resources on the device are limited and background noise such as speech or music is present.
- April 10, 2018: Just as Alexa can wake up without the need to press a button, she also automatically detects when a user finishes her query and expects a response. This task is often called “end-of-utterance detection,” “end-of-query detection,” “end-of-turn detection,” or simply “end-pointing.” (A simple energy-based end-pointing sketch appears after this list.)
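On the May 11, 2018 item above: acoustic echo cancellation is classically done with an adaptive filter that estimates the echo path from the loudspeaker (far-end) signal and subtracts the estimated echo from the microphone signal. Below is a minimal normalized-LMS (NLMS) sketch, a textbook baseline rather than the system the post describes; all parameter values are illustrative.

```python
import numpy as np

def nlms_aec(far: np.ndarray, mic: np.ndarray,
             taps: int = 128, mu: float = 0.5, eps: float = 1e-8):
    """Normalized LMS echo canceller.

    far: loudspeaker (far-end) signal.
    mic: microphone signal = near-end speech + echo of `far`.
    Returns the echo-reduced error signal."""
    w = np.zeros(taps)                   # adaptive FIR estimate of echo path
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far[n - taps:n][::-1]        # most recent far-end samples
        e = mic[n] - w @ x               # mic minus estimated echo
        w += mu * e * x / (x @ x + eps)  # NLMS weight update
        out[n] = e
    return out
```

When only echo is present (no near-end speech), `out` converges toward zero; when the user talks over playback, their speech remains in `out` for the recognizer.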
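On the April 10, 2018 item above: a crude baseline for end-pointing is to declare end-of-utterance once a fixed number of consecutive low-energy frames has been observed. The sketch below illustrates only that idea; the thresholds are invented, and production end-pointers (including Alexa’s) are model-based rather than a simple energy rule.

```python
import numpy as np

def endpoint(frames: np.ndarray,
             energy_thresh: float = 1e-3,
             trailing_silence: int = 30):
    """Return the frame index at which end-of-utterance is declared,
    i.e. after `trailing_silence` consecutive low-energy frames,
    or None if no end-point is found.

    frames: (num_frames, frame_len) array of audio samples."""
    silent_run = 0
    for i, frame in enumerate(frames):
        if np.mean(frame ** 2) < energy_thresh:  # low-energy frame
            silent_run += 1
            if silent_run >= trailing_silence:
                return i
        else:                                    # speech resets the counter
            silent_run = 0
    return None
```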