-
ICASSP 2020: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence system with a Variational Autoencoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics.
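For readers unfamiliar with the technique, a Householder flow enriches the VAE's diagonal-Gaussian posterior by applying a chain of orthogonal Householder reflections to the latent sample, at zero log-det-Jacobian cost in the KL term. Below is a minimal numpy sketch of that transformation; the shapes, step count, and names are illustrative assumptions, not the paper's implementation.

```python
# Minimal Householder-flow sketch (illustrative, not the paper's code).
import numpy as np

def householder_flow(z, vs):
    """Apply a chain of Householder reflections H = I - 2 v v^T / ||v||^2.

    Each reflection is orthogonal (|det| = 1), so the flow turns the
    diagonal-Gaussian posterior into a full-covariance one without
    changing the density's volume element.
    """
    for v in vs:                         # vs: one direction vector per
        v = v / np.linalg.norm(v)        # flow step (in practice these
        z = z - 2.0 * np.dot(v, z) * v   # would be predicted by the encoder)
    return z

# Toy usage: reparameterized sample z = mu + sigma * eps, then the flow.
rng = np.random.default_rng(0)
mu, sigma = np.zeros(8), np.ones(8)
z0 = mu + sigma * rng.standard_normal(8)
z_k = householder_flow(z0, [rng.standard_normal(8) for _ in range(4)])
```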
-
ICASSP 2020: In large-scale domain classification, an utterance can be handled by multiple domains with overlapping capabilities. However, only a limited number of ground-truth domains are provided for each training utterance in practice, while knowing as many correct target labels as possible helps improve model performance. In this paper, given one ground-truth domain for each training utterance, we…
-
ICASSP 2020: Wake word (WW) spotting is challenging in the far field, not only because of interference in signal transmission but also because of the complexity of the acoustic environment. Traditional WW model training requires a large amount of in-domain, WW-specific data with substantial human annotation, which prevents model building when such data are lacking. In this paper we present data-efficient solutions to…
-
ICASSP 2020: End-to-end approaches to automatic speech recognition (ASR) benefit from directly modeling the probability of the word sequence given the input audio stream in a single neural network. However, compared to conventional ASR systems, these models typically require more data to achieve comparable results. Well-known model adaptation techniques for handling differences in domain and style are not easily applicable…
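For reference, the direct modeling the teaser mentions is usually written as an autoregressive factorization over output tokens; this is the standard formulation for attention-based end-to-end ASR, not a claim about this paper's specific model:

```latex
P(W \mid X) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, X)
```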
-
ICASSP 2020: Grapheme-to-phoneme (G2P) models convert a written word into its corresponding pronunciation and are essential components of automatic speech recognition and text-to-speech systems. Recently, the use of neural encoder-decoder architectures has substantially improved G2P accuracy for monolingual and multilingual cases. However, most multilingual G2P studies focus on sets of languages that share similar graphemes…
Related content
-
April 04, 2019: Customer interactions with Alexa are constantly growing more complex, and on the Alexa science team, we strive to stay ahead of the curve by continuously improving Alexa’s speech recognition system. Increasingly, keeping pace with Alexa’s expanding capabilities will require automating the learning process, through techniques such as semi-supervised learning, which leverages a small amount of annotated data to extract information from a much larger store of unannotated data.
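As a concrete illustration of the semi-supervised idea described above, here is a minimal self-training (pseudo-labeling) sketch; the scikit-learn model, confidence threshold, and single retraining round are assumptions for the example, not Alexa's production recipe.

```python
# Self-training sketch: a seed model labels the unannotated pool, and
# only its confident guesses are added back as training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95):
    # Train a seed model on the small annotated set.
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    # Pseudo-label the large unannotated pool.
    proba = model.predict_proba(X_unlab)
    y_hat = model.predict(X_unlab)
    keep = proba.max(axis=1) >= threshold   # keep only confident guesses
    # Retrain on annotated data plus confident pseudo-labels.
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, y_hat[keep]])
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```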
-
April 01, 2019: The idea of using arrays of microphones to improve automatic speech recognition (ASR) is decades old. The acoustic signal generated by a sound source reaches multiple microphones with different time delays. This information can be used to create virtual directivity, emphasizing a sound arriving from a direction of interest and diminishing signals coming from other directions. In voice recognition, one of the more popular methods for doing this is known as “beamforming.”
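To make the delay idea concrete, here is a minimal delay-and-sum beamformer sketch in numpy; the linear array geometry, integer-sample alignment, and parameter names are simplifying assumptions for illustration, not the method used in any particular product.

```python
# Delay-and-sum beamforming: align each microphone to the look
# direction's plane-wave delay, then average.
import numpy as np

def delay_and_sum(signals, mic_positions, angle_rad, fs, c=343.0):
    """signals: (n_mics, n_samples) array; mic_positions: meters along
    the array axis; angle_rad: look direction relative to broadside;
    fs: sample rate in Hz; c: speed of sound in m/s."""
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Arrival-time offset of a plane wave from the look direction.
        delay = mic_positions[m] * np.sin(angle_rad) / c
        shift = int(round(delay * fs))
        # Integer-sample alignment (real systems use fractional delays;
        # np.roll also wraps at the edges, which a sketch can tolerate).
        out += np.roll(signals[m], -shift)
    return out / n_mics
```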
-
Animation by Nick Little. March 28, 2019: Audio watermarking is the process of adding a distinctive sound pattern, undetectable to the human ear, to an audio signal to make it identifiable to a computer. It’s one of the ways that video sites recognize copyrighted recordings that have been posted illegally. To identify a watermark, a computer usually converts a digital file into an audio signal, which it processes internally.
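As a toy illustration of the general idea (not any site's actual scheme), the sketch below embeds a low-amplitude keyed pseudorandom pattern in a signal and detects it by correlation; the key, amplitude, and threshold are arbitrary assumptions, and real systems shape the pattern psychoacoustically so it stays inaudible.

```python
# Toy spread-spectrum watermark: embed a keyed +/-1 pattern at low
# amplitude, detect by correlating against the same keyed pattern.
import numpy as np

def _pattern(key, n):
    # Keyed pseudorandom pattern shared by embedder and detector.
    return np.random.default_rng(key).choice([-1.0, 1.0], size=n)

def embed(audio, key, alpha=0.01):
    # Add the pattern at low amplitude (toy stand-in for inaudibility).
    return audio + alpha * _pattern(key, audio.size)

def detect(audio, key, threshold=0.005):
    # Watermarked audio correlates at ~alpha; clean audio near zero.
    score = np.dot(audio, _pattern(key, audio.size)) / audio.size
    return score > threshold

# Usage: mark a signal and check for the watermark.
x = np.random.default_rng(1).standard_normal(16000) * 0.1
assert detect(embed(x, key=42), key=42) and not detect(x, key=42)
```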
-
March 21, 2019: Sentiment analysis is the computational attempt to determine, from someone’s words, how he or she feels about something. It has a host of applications in market research, media analysis, customer service, and product recommendation, among other things. Sentiment classifiers are typically machine learning systems, and any given application of sentiment analysis may suffer from a lack of annotated data for training purposes.
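For concreteness, a sentiment classifier in its simplest form can be a bag-of-words model; the toy data and scikit-learn pipeline below are illustrative assumptions, not the systems discussed in the post.

```python
# Minimal bag-of-words sentiment classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; real applications need far more annotated text.
texts = ["great product, works well", "terrible, broke in a day",
         "love it, sounds fantastic", "waste of money, very poor"]
labels = [1, 0, 1, 0]                  # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["works great"]))    # expected: [1]
```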
-
March 20, 2019: Although deep neural networks have enabled accurate large-vocabulary speech recognition, training them requires thousands of hours of transcribed data, which is time-consuming and expensive to collect. So Amazon scientists have been investigating techniques that will let Alexa learn with minimal human involvement, techniques that fall into the categories of unsupervised and semi-supervised learning.
-
March 11, 2019: In experiments involving sound recognition, a technique reduces error rates by 15% to 30%.