In the last five years, speech synthesis technology has moved to all-neural models, which allow the different elements of speech — prosody, accent, language, and speaker identity (voice) — to be controlled independently. It’s the technology that enabled the Amazon Text-to-Speech group to teach the feminine-sounding, English-language Alexa voice to speak in perfectly accented U.S. Spanish and the masculine-sounding U.S. voice to speak with a British accent.
In both of those cases, however, we had two advantages: (1) abundant annotated speech samples with the target accent, which the existing voice model could learn from, and (2) a set of rules for mapping graphemes — sequences of characters — to the phonemes — the minimal units of phonetic information, and the input to our text-to-speech models — of the target accent.
In the case of the Irish-accented, female-sounding English Alexa voice, which launched late last year, we had neither of those advantages — no grapheme-to-phoneme rules and a dataset that was an order of magnitude smaller than those for British English and U.S. Spanish. When we tried using the same approach to accent transfer that had worked in the previous cases, the results were poor.
So instead of taking an existing voice and teaching it a new accent, we took recordings of accented speech and changed their speaker ID. This provided us with additional training data for our Irish-accent text-to-speech model in the target voice, which greatly improved the accent quality.
More precisely, to train a multispeaker, multiaccent text-to-speech (TTS) model, we first synthesized training data using a separate voice conversion (VC) model.
The inputs to the voice conversion model include a speaker embedding, which is a vector representation of the acoustic characteristics of a given speaker’s voice; a mel-spectrogram, which is a snapshot of the frequency spectrum of the speech signal at short intervals; and the phoneme sequence associated with the spectrogram.
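To make those three inputs concrete, here is a minimal sketch of how they might be bundled together. All dimensions (embedding size, mel channels, frame count) and the function name `make_vc_inputs` are illustrative assumptions, not the actual configuration of the VC model described in the post:

```python
import numpy as np

# Illustrative shapes, not Amazon's actual configuration.
N_MELS = 80      # mel filterbank channels per spectrogram frame
EMB_DIM = 192    # speaker embedding size
N_FRAMES = 400   # spectrogram frames (~4 s at a 10 ms hop)

def make_vc_inputs(target_speaker_emb, mel_spec, phoneme_ids):
    """Bundle the three VC inputs: the embedding of the *target* speaker,
    the source utterance's mel-spectrogram, and the phoneme sequence
    aligned with that spectrogram."""
    assert target_speaker_emb.shape == (EMB_DIM,)
    assert mel_spec.shape[1] == N_MELS
    return {
        "speaker_embedding": target_speaker_emb,
        "mel_spectrogram": mel_spec,
        "phonemes": phoneme_ids,
    }

# Toy example: pair an Irish-accented recording with the target voice's embedding.
emb = np.random.randn(EMB_DIM)
mel = np.random.randn(N_FRAMES, N_MELS)
phones = [12, 40, 7, 33]  # arbitrary phoneme IDs
inputs = make_vc_inputs(emb, mel, phones)
```

Because the speaker embedding names the target voice while the spectrogram and phonemes come from the accented source recording, the VC model's output is accented speech in the target voice — exactly the augmented training data described above.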
During training, the TTS model, too, receives a speaker embedding, mel-spectrograms, and phoneme sequences, but at inference time, it does not receive the spectrograms. It’s a multiaccent, multispeaker model, so at training time, it also receives an accent ID, a simple ordinal indicator of the input speech’s accent. At inference time, the accent ID still controls the accent of the output speech.
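The train/inference asymmetry can be sketched as follows. The accent ID values and the helper name `tts_inputs` are hypothetical; the post describes the accent ID only as a simple ordinal indicator:

```python
# Hypothetical accent IDs — illustrative values only.
ACCENT_IDS = {"en-US": 0, "en-GB": 1, "en-IE": 2}

def tts_inputs(phoneme_ids, speaker_embedding, accent_id, mel_spec=None):
    """Assemble TTS model inputs. The ground-truth mel-spectrogram is
    available only at training time; at inference the model must generate
    speech from phonemes, speaker embedding, and accent ID alone."""
    inputs = {
        "phonemes": phoneme_ids,
        "speaker_embedding": speaker_embedding,
        "accent_id": accent_id,
    }
    if mel_spec is not None:  # training-time conditioning only
        inputs["mel_spectrogram"] = mel_spec
    return inputs

# Inference call: no spectrogram, but the accent ID still selects the accent.
infer = tts_inputs([12, 40, 7], [0.1] * 4, ACCENT_IDS["en-IE"])
```

The point of the sketch is the conditional: the spectrogram is a training-time signal, while the accent ID is present in both regimes and is what lets a single model switch accents at synthesis time.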
Using a multiaccent model is not essential to our approach, but at Alexa AI, we’ve found, empirically, that multiaccent models tend to yield more-natural-sounding synthetic speech than single-accent models.
The TTS model’s inputs also include information, extracted from the input speech signal, about the duration of the individual input phonemes, which gives the model better control of the accent rhythm. Again, at inference time, there is no input speech signal; instead, the durations of the phonemes are predicted by a separate duration model, which is trained in parallel with the TTS model.
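A toy stand-in for the duration model illustrates the idea: per-phoneme durations, measured in spectrogram frames, tell the TTS model how long each phoneme should last, which shapes the rhythm of the accent. The frame counts below are made-up averages, and a real duration model would be a learned network rather than a lookup table:

```python
# Illustrative mean durations (in frames) per ARPAbet-style phoneme.
AVG_FRAMES = {"K": 5, "AA1": 11, "AE1": 9, "N": 6, "T": 4}

def predict_durations(phonemes):
    """Toy duration model: return a frame count for each input phoneme,
    falling back to a default for unseen phonemes."""
    return [AVG_FRAMES.get(p, 7) for p in phonemes]

# Durations for the word "can't" (British-style phonemes): the TTS model
# would expand each phoneme to this many spectrogram frames.
durs = predict_durations(["K", "AA1", "N", "T"])
```

At training time these durations come from the input speech signal itself; at inference time they come from the separately trained duration model, as described above.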
Although we had no grapheme-to-phoneme (G2P) rules for Irish-accented English speech, we still had to generate the input phonemes for our TTS model somehow, so we experimented with the G2P rules for both British English and American English. Neither of these is entirely accurate: for instance, the vowel sound of the word “can’t” — and thus the associated phoneme — is different in Irish English than in either of the other two accent groups. But we were able to get credible results with both British English and American English G2P rules.
American English worked slightly better, however, and this is probably because of rhoticity: American English speakers, like Irish English speakers, pronounce their r’s; British English speakers usually drop them.
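The two differences mentioned above can be made concrete with a pair of toy accent-specific lexica. The entries are illustrative ARPAbet-style pronunciations, not the actual G2P rules used in the project: the vowel of “can’t” differs between the American and British lexica, and the British entry for “car” drops the final r (non-rhotic) where the American one keeps it:

```python
# Toy ARPAbet-style lexica for two accents; entries are illustrative.
G2P_US = {"can't": ["K", "AE1", "N", "T"], "car": ["K", "AA1", "R"]}
G2P_GB = {"can't": ["K", "AA1", "N", "T"], "car": ["K", "AA1"]}  # non-rhotic

def phonemize(word, lexicon):
    """Look up a word's phoneme sequence in an accent-specific lexicon."""
    return lexicon[word]

# American English is rhotic, like Irish English: "car" keeps its R.
us_car = phonemize("car", G2P_US)
gb_car = phonemize("car", G2P_GB)
```

Because Irish English is rhotic, the American lexicon’s r-final pronunciations are closer to the target accent, which is the likely reason the American G2P rules transferred slightly better.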
To evaluate our method, we asked reviewers to compare Irish English speech synthesized by our method to recordings of four different Irish English speakers, one of whom was our source speaker — the one who provided the speech that was the basis of our augmented data. In terms of accent, reviewers rated recordings of the source speaker as about 72.56% similar to other recordings of the same speaker; they rated our synthesized speech (in a different voice) 61.4% similar to recordings of the source speaker.
When reviewers were asked to compare the accent of the source speaker to those of the other three Irish English speakers, however, the similarity score fell to 53%; when asked to do the same with our synthesized speech, the similarity score was 51%. In other words, reviewers thought that our synthesized speech approximated the “average” Irish accent about as well as the source speaker did. That the agreement is so low — for both real and synthetic speech — is a testimony to the diversity of accents in Irish English (sometimes called the language of a million accents).
To establish a baseline, we also asked the reviewers to compare speech generated through our approach to speech generated through the leading prior approach. Overall, they found that our approach improved accent similarity by 50% over the prior approach.
Acknowledgements: We would like to acknowledge Andre Canelas for identifying the opportunity and driving the project and Dennis Stansbury, Seán Mac Aodha, Laura Teefy, and Rikki Price for their support in making the experience authentic.