eCat: An end-to-end model for multi-speaker TTS & many-to-many fine-grained prosody transfer

Ammar Abbas; Sri Karlapati; Bastian Schnell; Penny Karanasou; Marcel Granero Moya; Amith Nagaraj; Ayman Boustati; Nicole Peinelt; Alexis Moinet; Thomas Drugman

2023

We present eCat, a novel end-to-end multi-speaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer (FPT) between any pair of seen speakers. eCat is trained in two stages. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict these prosody representations from the contextual information available in text. We compare eCat to CopyCat2, a model capable of both FPT and multi-speaker TTS. We show that eCat reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, a statistically significant improvement, and achieves better target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference for eCat.
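
The abstract does not say how the reduction in the naturalness gap is computed. One common reading is the share of the baseline-to-human gap that the new system closes. The sketch below follows that assumption. The function name `gap_reduction` and the scores are hypothetical and are not taken from the paper.

```python
def gap_reduction(baseline: float, candidate: float, reference: float) -> float:
    """Percentage of the baseline-to-reference gap closed by the candidate.

    Assumed definition (not confirmed by the abstract):
        100 * (candidate - baseline) / (reference - baseline)
    """
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must score higher than baseline")
    return 100.0 * (candidate - baseline) / gap


# Hypothetical listening-test scores, chosen only for illustration.
example = gap_reduction(baseline=60.0, candidate=66.0, reference=72.0)
print(f"gap closed: {example:.1f}%")
```

Under this reading, an average over speakers or locales would be the mean of the per-condition values returned by the function.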
