-
ICASSP 2021, 2021. Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network mainly consists of three
-
ICASSP 2021, 2021. Automatic dubbing aims at seamlessly replacing the speech in a video document with synthetic speech in a different language. The task implies many challenges, one of which is generating translations that not only convey the original content, but also match the duration of the corresponding utterances. In this paper, we focus on the problem of controlling the verbosity of machine translation output, so that
-
ICASSP 2021, 2021. Automatic dubbing is an extension of speech-to-speech translation such that the resulting target speech is carefully aligned in terms of duration, lip movements, timbre, emotion, prosody, etc. of the speaker in order to achieve audiovisual coherence. Dubbing quality strongly depends on isochrony, i.e., arranging the translation of the original speech to optimally match its sequence of phrases and pauses
-
ICASSP 2021, 2021. Accent mismatch is a critical problem for end-to-end ASR. This paper aims to address this problem by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to minimizing
-
ICASSP 2021, 2021. We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference time latency on speech recognition tasks. The architecture enables a dynamic pivot for its runtime compute pathway, namely taking advantage of keyword spotting to select which component of the network to execute for a given audio frame. To accomplish this, we leverage a
Related content
-
February 28, 2022: Novel pretraining method enables increases of 5% to 14% on five different evaluation metrics.
-
February 22, 2022: Integrating symbolic reasoning and learning efficiently from interactions with the world are two major remaining challenges, says vice president and distinguished scientist Nikko Ström.
-
February 15, 2022: Amazon Science is now the destination for information on the SocialBot, TaskBot, and SimBot challenges, including FAQs, team updates, publications, and other program information.
-
February 11, 2022: Paper deals with detecting and answering out-of-domain requests for task-oriented dialogue systems.
-
February 02, 2022: Sessions on multidevice scenarios, inclusive and fair speech technologies, trustworthy speech processing, and speech intelligibility prediction seek paper submissions.
-
January 31, 2022: How Alexa scales machine learning models to millions of customers.