More-natural prosody for synthesized speech

Prosody transfer technique addresses the problem of “source speaker leakage”, while prosody selection model better matches prosody to semantic content.

At this year’s Interspeech, the Amazon text-to-speech team presented two new papers about controlling prosody — the rhythm, emphasis, melody, duration, and loudness of speech — in speech synthesis.

One paper, “CopyCat: many-to-many fine-grained prosody transfer for neural text-to-speech”, is about transferring prosody from recorded speech to speech synthesized in a different voice. In particular, it addresses the problem of “source speaker leakage”, in which the speech synthesis model sometimes produces speech in the source speaker’s voice, rather than the target speaker’s voice.

According to listener studies using the industry-standard MUSHRA (multiple stimuli with hidden reference and anchor) methodology, our model improved on the state-of-the-art system by 47% in prosody transfer quality and by 14% in retention of the target speaker’s identity.

Audio samples (two sets): source reference; target identity; speech with target identity + source prosody.

The other paper, “Dynamic prosody generation for speech synthesis using linguistics-driven acoustic embedding selection”, is about achieving more dynamic and natural intonation in speech synthesized by text-to-speech (TTS) systems. It describes a model that uses syntactic and semantic properties of the utterance to determine the prosodic features.

Again according to tests using the MUSHRA methodology, our model reduced the discrepancy between the naturalness of synthesized speech and that of recorded speech by about 6% for complex utterances and 20% on the task of long-form reading.

"Does he wear a black suit or a blue one?"

"Who ate the rest of my pizza?"

"Get scores, schedules, and listen to live audio streams."

Audio samples for each utterance above, synthesized with four embedding-selection methods: Centroid, Syntactic, BERT, and BERT + Syntactic.

CopyCat

When prosody transfer (PT) involves very fine-grained characteristics — the inflections of individual words, as opposed to general speaking styles — it’s more likely to suffer from source speaker leakage. This issue is exacerbated when the PT model is trained on non-parallel data — i.e., without having the same utterances spoken by the source and target speaker.

The core of CopyCat is a novel reference encoder, whose inputs are a mel-spectrogram of the source speech (a snapshot of the frequency spectrum); an embedding, or vector representation, of the source speech phonemes (the smallest units of speech); and a vector indicating the speaker’s identity. 

The reference encoder outputs speaker-independent representations of the prosody of the input speech. These prosodic representations are robust to source speaker leakage despite being trained on non-parallel data. In the absence of parallel data, we train the model to transfer prosody from speakers onto themselves. 

The CopyCat architecture.

During inference, the phonemes of the speech to be synthesized pass first through a phoneme encoder and then to the reference encoder. The output of the reference encoder, together with the encoded phonemes and the speaker identity vector, then passes to the decoder, which generates speech with the target speaker’s voice and the source speaker's prosody.

To evaluate the efficacy of our method, we compared CopyCat to a state-of-the-art model over five target voices, onto which the source prosody from 12 different unseen speakers had been transferred. CopyCat showed a statistically significant 47% increase in prosody transfer quality over the baseline. In another evaluation involving native speakers of American English, CopyCat showed a statistically significant 14% improvement over the baseline in its ability to retain the target speaker’s identity. CopyCat achieves both results with a significantly simpler decoder than the baseline requires, with no drop in naturalness.

Prosody selection

Text-to-speech (TTS) has improved dramatically in recent years, but it still lacks the dynamic variation and adaptability of human speech.

One popular way to encode prosody in TTS systems is to use a variational autoencoder (VAE), which learns a distribution of prosodic characteristics from sample speech. Selecting a prosodic style for a synthetic utterance is a matter of picking a point — an acoustic embedding — in that distribution. 

In practice, most VAE-based TTS systems simply choose a point in the center of the distribution — a centroid — for all utterances. But rendering all the samples with the exact same prosody gets monotonous. 
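To make the monotony concrete, here is a minimal sketch of centroid selection. It assumes the centroid can be approximated by averaging the acoustic embeddings of the training utterances; the embedding values and the function names (`centroid`, `select_embedding_centroid`) are hypothetical and are not taken from any system described here.

```python
from statistics import fmean

# Hypothetical 3-dimensional acoustic embeddings of four training utterances.
train_embeddings = [
    [0.2, -1.0, 0.5],
    [0.4, -0.6, 0.1],
    [-0.2, 0.2, 0.3],
    [0.0, 0.6, -0.1],
]


def centroid(embeddings):
    """Average the embeddings dimension by dimension."""
    return [fmean(dim) for dim in zip(*embeddings)]


def select_embedding_centroid(text, embeddings):
    """Centroid selection ignores the text entirely."""
    return centroid(embeddings)


a = select_embedding_centroid("Who ate the rest of my pizza?", train_embeddings)
b = select_embedding_centroid(
    "Get scores, schedules, and listen to live audio streams.", train_embeddings
)
print(a == b)  # True: every utterance gets the same prosody embedding
```

Because the text never enters the computation, a question and a list-like command end up with identical prosodic conditioning.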

In our Interspeech paper, we present a novel way of exploiting linguistic information to select acoustic embeddings in VAE systems to achieve a more dynamic and natural intonation in TTS systems, particularly for stylistic speech such as the newscaster speaking style.

Syntax, semantics, or both?

We experiment with three different systems for generating vector representations of the inputs to a TTS system, which allows us to explore the impact of both syntax and semantics on the overall quality of speech synthesis.

The first system uses syntactic information only; the second relies solely on BERT embeddings, which capture semantic information about strings of text, on the basis of word co-occurrence in large text corpora; and the third uses a combination of BERT and syntactic information. Based on these representations, our model selects acoustic embeddings to characterize the prosody of synthesized utterances.
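The selection rule itself is not spelled out above. One plausible reading, sketched here as an assumption, is a nearest-neighbor lookup: find the training utterance whose linguistic representation is most similar to that of the input text (cosine similarity in this sketch) and reuse its acoustic embedding. All vectors and function names below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_acoustic_embedding(query_ling, train_ling, train_acoustic):
    """Return (index, acoustic embedding) of the most similar training utterance."""
    best = max(
        range(len(train_ling)),
        key=lambda i: cosine_similarity(query_ling, train_ling[i]),
    )
    return best, train_acoustic[best]


# Hypothetical linguistic representations (syntactic, BERT, or both concatenated)
# and the acoustic embeddings paired with them in the training set.
train_ling = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_acoustic = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

index, embedding = select_acoustic_embedding([0.9, 0.1], train_ling, train_acoustic)
print(index, embedding)  # 0 [0.1, 0.2]
```

Swapping the linguistic representation (syntactic only, BERT only, or both) changes which training utterance counts as the nearest neighbor, and therefore which prosody is chosen.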

To explore whether syntactic information can aid prosody selection, we use the notion of syntactic distance, a measure based on constituency trees, which map syntactic relationships between the words of a sentence. Large syntactic distances correlate with acoustically relevant events such as phrasing breaks or prosodic resets.

A constituency tree featuring syntactic-distance measures (orange circles).
credit: Glynis Condon

The figure shows the constituency tree of the sentence “The brown fox is quick, and it is jumping over the lazy dog”. Parts of speech are labeled with Penn Treebank part-of-speech tags: “DT”, for instance, indicates a determiner; “VBZ” indicates a third-person singular present verb, while “VBG” indicates a gerund or present participle; and so on.

The structure of the tree indicates syntactic relationships: for instance, “the”, “brown”, and “fox” together compose a noun phrase (NP), while “is” and “quick” compose a verb phrase (VP). 

Syntactic distance assigns each pair of consecutive words a value that reflects the height, within the tree, of their lowest common ancestor. Only the rank ordering matters: any values that preserve that ordering are valid.

One valid distance vector for this sentence is d = [0 2 1 3 1 8 7 6 5 4 3 2 1]. The completion of the subject noun phrase (after “fox”) triggers a prosodic reset, reflected in the distance of 3 between “fox” and “is”. There should also be a more emphasized reset at the end of the first clause, represented by the distance of 8 between “quick” and “and”.
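The following sketch shows one way such a vector could be computed. It assumes a binarized bracketing of the sentence, written by hand as nested tuples with the comma dropped, which is not necessarily the exact tree in the figure. The distance between two consecutive words is taken to be the height of their lowest common ancestor, and the first word gets 0 because it has no predecessor. The names `TREE`, `walk`, and `syntactic_distances` are illustrative.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]

# Hypothetical binarized constituency tree; leaves are words.
TREE: Tree = (
    (("The", ("brown", "fox")), ("is", "quick")),
    ("and", ("it", ("is", ("jumping", ("over", ("the", ("lazy", "dog"))))))),
)


def walk(tree: Tree) -> Tuple[List[str], List[int], int]:
    """Return (words, distances between consecutive words, subtree height)."""
    if isinstance(tree, str):
        return [tree], [], 0
    results = [walk(child) for child in tree]
    height = 1 + max(h for _, _, h in results)
    words: List[str] = []
    dists: List[int] = []
    for i, (child_words, child_dists, _) in enumerate(results):
        if i > 0:
            # Words on either side of a child boundary meet at this node.
            dists.append(height)
        words.extend(child_words)
        dists.extend(child_dists)
    return words, dists, height


def syntactic_distances(tree: Tree) -> Tuple[List[str], List[int]]:
    """Pair each word with its distance to the preceding word (0 for the first)."""
    words, dists, _ = walk(tree)
    return words, [0] + dists


words, d = syntactic_distances(TREE)
print(d)  # [0, 2, 1, 3, 1, 8, 7, 6, 5, 4, 3, 2, 1]
```

Under this bracketing the result matches the vector quoted above, with the largest value (8) at the clause boundary before “and”. Since only the ordering matters, a differently shaped tree could yield different numbers and still be valid, provided the ranks agree.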

We compared VAE models with linguistically informed acoustic-embedding selection against a VAE model that uses centroid selection on two tasks, sentence synthesis and long-form reading.

The sentence synthesis data set had four categories: complex utterances, sentences with compound nouns, and two types of questions, each with its characteristic prosody (a rising inflection at the end, for instance). The two question types were questions beginning with “wh” words (who, what, why, etc.) and “or” questions, which present a choice.

The model that uses syntactic information alone improves on the baseline model across the board, while the addition of semantic information improves performance still further in some contexts. 

On the “wh” questions, the combination of syntactic and semantic data delivered an 8% improvement over the baseline, and on the “or” questions, the improvement was 21%. This demonstrates that questions have closely related syntactic structures, information that can be used to achieve better prosody.

On long-form reading, the syntactic model alone delivered the best results, reducing the gap between the baseline and recorded speech by approximately 20%.
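As a reading aid for figures like this one, the sketch below shows how a gap reduction is commonly computed from listener scores: the share of the distance between the baseline and recorded speech that the new system closes. The formula is an assumption about how the figure was derived, and the scores are invented solely to illustrate the arithmetic; they are not results from the paper.

```python
def gap_reduction(baseline: float, system: float, recorded: float) -> float:
    """Fraction of the baseline-to-recording gap closed by `system`."""
    gap = recorded - baseline
    if gap <= 0:
        raise ValueError("recorded score must exceed baseline score")
    return (system - baseline) / gap


# Hypothetical MUSHRA-style scores on a 0-100 scale.
reduction = gap_reduction(baseline=60.0, system=64.0, recorded=80.0)
print(f"{reduction:.0%}")  # 20%
```

On this reading, a 20% gap reduction means the system recovers one-fifth of the quality difference between the baseline and human recordings, not that its absolute score rose by 20%.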
