Audio samples for the paper "BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data".
Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS.
Below are selected samples produced by the model. There's no annotation on the input text and no post-processing on the audio.
English
At the conference, the professor, Mark Curtis, who researched the phenomena that the student who presented earlier had focused on made a surprising revelation that shocked the audience.
His latest invention (a device meant to assist in everyday chores (something he never seemed to run out of)), was nothing short of brilliant.
Overwhelmed with confusion and despair, David Darlan cried out, "What do you want from me? Why can't you just tell me what's wrong?"
After getting to his car he said, "Oh great, another Monday, I just can't wait to sit in traffic for an hour and spend the next 8 hours staring at a computer screen."
With an ample supply of joie de vivre, Mary danced through the streets of Nice, stopping only to enjoy a nice café with a warm croissant. How French!
His face lit up with pure delight as he exclaimed, "We did it! We won the championship! I knew we could do it together!"
"I went through all of this trouble, buying flowers, chocolate, and even organizing a flash mob, and she's still rejecting me?"
A profound sense of realization washed over Matty as he whispered, "You've been there for me all along, haven't you? I never truly appreciated you until now."
Beth collapsed into his arms, sobbing uncontrollably, "I failed them, I failed them all. They’re all dead! Nothing we can do will ever bring them back. How can I ever live with myself again? How?"
"Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it's higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it's worth the climb!"
Spanish