Speech

L2-GEN: A neural phoneme paraphrasing approach to L2 speech synthesis for mispronunciation diagnosis

Daniel Zhang, Ashwinkumar Ganesan, Sarah Campbell, Daniel Korzekwa

Interspeech 2022

2022

In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity issue in MDD research where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain

Conversational AI

Adversarial reweighting for speaker verification fairness

Minho Jin, Chelsea J.-T. Ju, Zeya Chen, Yi Chieh Liu, Jasha Droppo, Andreas Stolcke

Interspeech 2022

2022

We address performance fairness for speaker verification using the adversarial reweighting (ARW) method. ARW is reformulated for speaker verification with metric learning, and shown to improve results across different subgroups of gender and nationality, without requiring annotation of subgroups in the training data. An adversarial network learns a weight for each training sample in the batch so that the

Machine learning

Incremental learning for RNN-Transducer based speech recognition models

Deepak Baby, Pasquale D'Alterio, Valentin Mendelev

Interspeech 2022

2022

This paper investigates an incremental learning framework for a real-world voice assistant employing an RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with the changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model

Conversational AI

Creating new voices using normalizing flows

Piotr Bilinski, Tom Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa

Interspeech 2022

2022

Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities. Firstly, we

Conversational AI

Computer-assisted pronunciation training — speech synthesis is almost all you need

Daniel Korzekwa, Jaime Lorenzo Trueba, Thomas Drugman, Bozena Kostek

Computer Assisted Language Learning Journal

2022

The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers have focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation

Conversational AI

Prosodic alignment for off-screen automatic dubbing

Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote

Interspeech 2022

2022

The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence. This entails isochrony, i.e., translating the original speech by also matching its prosodic structure into phrases and pauses, especially when the speaker’s mouth is visible. In previous work, we introduced a prosodic alignment model to address isochrone or on-screen dubbing. In this work, we

Conversational AI

Automatic evaluation of speaker similarity

Kamil Deja, Ariadna Sanchez, Julian Roth, Marius Cotescu

Interspeech 2022

2022

We introduce a new automatic evaluation method for speaker similarity assessment that is consistent with human perceptual scores. Modern neural text-to-speech models require a vast amount of clean training data, which is why many solutions switch from single-speaker models to solutions trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as a faster creation

Conversational AI

Graph-based multi-view fusion and local adaptation: Mitigating within household confusability for speaker identification

Long Chen, Yixiong Meng, Venkatesh Ravichandran, Andreas Stolcke

Interspeech 2022

2022

Speaker identification (SID) in the household scenario (e.g., for smart speakers) is an important but challenging problem due to a limited number of labeled (enrollment) utterances, confusable voices, and demographic imbalances. Conventional speaker recognition systems generalize from a large random sample of speakers, causing the recognition to underperform for households drawn from specific cohorts or otherwise

Conversational AI

Expressive, variable, and controllable duration modelling in TTS

Ammar Abbas, Tom Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

Interspeech 2022

2022

Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve duration

Conversational AI

A multimodal strategy for singing language identification

Wo Jae Lee, Emanuele Coviello

Interspeech 2022

2022

Identification of the language of performance of songs is important for applications such as personalized recommendations, discovery, and search. In this paper, we present an automated multimodal approach to identify the singing language of songs that scales to millions of songs. The proposed model uses a variety of song-level features, including a consumption embedding derived from sessions listening data

Machine learning

Speech
