Speech

Accelerator-aware training for transducer-based speech recognition

Suhaila Shakiah, Rupak Vignesh Swaminathan, Hieu Duy Nguyen, Raviteja Chinta, Tariq Afzal, Nathan Susanj, Thanasis Mouchtaris, Grant Strimel, Ariya Rastrow

SLT 2022

2022

Machine learning model weights and activations are represented in full precision during training. This leads to performance degradation at runtime when models are deployed on neural network accelerator (NNA) chips, which leverage highly parallelized fixed-point arithmetic to improve runtime memory and latency. In this work, we replicate the NNA operators during the training phase, accounting for the degradation due

Conversational AI

Contextual-utterance training for automatic speech recognition

Alejandro Gomez Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan, Simon Wiesler

iberSPEECH 2022

2022

Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to do an implicit adaptation

Conversational AI

Self-supervised speech representation learning: A review

Abdel-Rahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing

2022

Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide

Conversational AI

Exploration of language-specific self-attention parameters for multilingual end-to-end speech recognition

Brady Houston, Katrin Kirchhoff

SLT 2022

2022

In the last several years, end-to-end (E2E) ASR models have mostly surpassed the performance of hybrid ASR models. E2E is particularly well suited to multilingual approaches because it doesn’t require language-specific phone alignments for training. Recent work has improved multilingual E2E modeling over naive data pooling on up to several dozen languages by using both language-specific and language-universal

Machine learning

Residual adapters for targeted updates in RNN-transducer based speech recognition system

Sungjun Han, Deepak Baby, Valentin Mendelev

SLT 2022

2022

This paper investigates an approach for adapting an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) model to improve the recognition of words unseen during training. Prior work has shown that it is possible to incrementally fine-tune the ASR model to recognize multiple sets of new words. However, this creates a dependency between the updates which is not ideal for the hot-fixing use-case where

Conversational AI

Guided contrastive self-supervised pre-training for automatic speech recognition

Aparna Khare, Minhua Wu, Saurabhchand Bhati, Jasha Droppo, Roland Maas

SLT 2022

2022

Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model. It can be used to effectively initialize the encoder of an Automatic Speech Recognition (ASR) model. We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC). Our proposed method maximizes

Machine learning

On granularity of prosodic representations in expressive text-to-speech

Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafal Sienkiewicz, Daniel Korzekwa, Viacheslav Klimkov

SLT 2022

2022

In expressive speech synthesis, latent prosody representations are widely used to deal with variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup, to complement phonetic input

Conversational AI

Personalization of CTC speech recognition models

Saket Dingliwal, Monica Sunkara, Srikanth Ronanki, Jeff Farris, Katrin Kirchhoff, Sravan Bodapati

SLT 2022

2022

End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption that prevents output tokens from previous time

Conversational AI

Sub-8-bit quantization for on-device speech recognition: a regularization-free approach

Kai Zhen, Martin Radfar, Hieu Nguyen, Grant Strimel, Nathan Susanj, Thanasis Mouchtaris

SLT 2022

2022

For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is widely used to balance model predictive performance against efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, “soft-to-hard” compression mechanism with self-adjustable

Conversational AI

A quick guide to Amazon’s 50-plus ICASSP papers 2022

Staff writer

May 10, 2022

Topics range from the predictable, such as speech recognition and signal processing, to time series forecasting and personalization.

Conversational AI

Speech
