- Interspeech 2020: End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands; however, no study has yet addressed how these models …
- Interspeech 2020: Voice assistants such as Siri and Alexa usually adopt a pipeline to process users' utterances, which generally includes transcribing the audio into text, understanding the text, and finally responding to users. One potential issue is that some utterances are devoid of any interesting speech and are thus not worth processing through the entire pipeline. Examples of uninteresting utterances …
- Interspeech 2020: Traditional hybrid speech recognition systems use a fixed vocabulary for recognition, which is a challenge for agglutinative and compounding languages because of their large numbers of rare words. This causes a high out-of-vocabulary rate and leads to poor probability estimates for rare words. It is also important to keep the vocabulary size in check for a low-latency WFST-based speech recognition system …
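The out-of-vocabulary problem described above can be made concrete with a small sketch. This is not code from the paper; the function name and the Finnish-like toy corpus are illustrative, showing how inflected and compounded word forms inflate the OOV rate under a fixed vocabulary.

```python
# Hypothetical sketch: measuring the out-of-vocabulary (OOV) rate of a corpus
# against a fixed recognition vocabulary.

def oov_rate(tokens, vocabulary):
    """Fraction of corpus tokens not covered by the fixed vocabulary."""
    vocab = set(vocabulary)
    misses = sum(1 for t in tokens if t not in vocab)
    return misses / len(tokens) if tokens else 0.0

# Agglutinative languages produce many surface forms per lemma,
# so a word-level vocabulary misses the inflected/compounded variants.
corpus = "talo talossa taloissakin auto autossa".split()
vocab = ["talo", "auto", "autossa"]
print(oov_rate(corpus, vocab))  # talossa and taloissakin are OOV -> 0.4
```

Subword units (explored in other entries on this list) shrink this rate by letting rare forms be composed from smaller pieces.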
- Interspeech 2020: In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to pool per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical representations. As …
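The conventional pipeline in that abstract — alignment-based pooling followed by fusion — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation; the pooling (frame averaging) and fusion (concatenation) choices are common defaults, and all names are hypothetical.

```python
import numpy as np

def word_level_acoustic(frames, alignment):
    """Pool per-frame acoustic features into word-level features by
    averaging the frames that the forced alignment assigns to each word."""
    return np.stack([frames[start:end].mean(axis=0) for start, end in alignment])

def fuse(acoustic_words, lexical_words):
    """Simple multimodal fusion: concatenate acoustic and lexical
    word-level representations along the feature axis."""
    return np.concatenate([acoustic_words, lexical_words], axis=-1)

frames = np.random.randn(100, 40)            # 100 frames of 40-dim filterbank features
alignment = [(0, 30), (30, 55), (55, 100)]   # frame spans for 3 aligned words
lexical = np.random.randn(3, 300)            # e.g. 300-dim word embeddings

fused = fuse(word_level_acoustic(frames, alignment), lexical)
print(fused.shape)  # (3, 340): one fused vector per word
```

A downstream punctuation classifier would then predict a punctuation label per fused word vector.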
- Interspeech 2020: Subwords are the most widely used output units in end-to-end speech recognition. They combine the best of two worlds by modeling the majority of frequent words directly while allowing open-vocabulary speech recognition by backing off to shorter units or characters to construct words unseen during training. However, mapping text to subwords is ambiguous, and often multiple segmentation variants …
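The segmentation ambiguity mentioned above is easy to demonstrate: given a subword vocabulary, a single word typically admits several valid decompositions. The sketch below is a toy enumeration (not the paper's method, and all names are illustrative); production systems such as BPE or unigram language models pick among such variants deterministically or by sampling.

```python
def segmentations(word, vocab):
    """Enumerate every way to split a word into units drawn from a subword vocabulary."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            # Recurse on the remainder and prepend the matched piece.
            for rest in segmentations(word[i:], vocab):
                results.append([piece] + rest)
    return results

vocab = {"un", "u", "n", "s", "e", "en", "see", "seen"}
for seg in segmentations("unseen", vocab):
    print(" ".join(seg))  # e.g. "un seen", "un see n", "u n seen", ...
```

Because all of these variants spell the same word, an end-to-end recognizer's text-to-subword mapping must either fix one canonical segmentation or marginalize over the alternatives.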