More-natural prosody for synthesized speech

A prosody transfer technique addresses the problem of “source speaker leakage”, while a prosody selection model better matches prosody to semantic content.

At this year’s Interspeech, the Amazon text-to-speech team presented two new papers about controlling prosody — the rhythm, emphasis, melody, duration, and loudness of speech — in speech synthesis.

One paper, “CopyCat: many-to-many fine-grained prosody transfer for neural text-to-speech”, is about transferring prosody from recorded speech to speech synthesized in a different voice. In particular, it addresses the problem of “source speaker leakage”, in which the speech synthesis model sometimes produces speech in the source speaker’s voice, rather than the target speaker’s voice.

According to listener studies using the industry-standard MUSHRA (multiple stimuli with hidden reference and anchor) methodology, the speech produced by our model improved over the state-of-the-art system’s by 47% in prosody transfer quality and 14% in retention of speaker identity.

(Audio samples, two sets: source reference; target identity; speech with target identity + source prosody.)

The other paper, “Dynamic prosody generation for speech synthesis using linguistics-driven acoustic embedding selection”, is about achieving more dynamic and natural intonation in synthesized speech from text-to-speech (TTS) systems. It describes a model that uses syntactic and semantic properties of the utterance to determine the prosodic features.

Again according to tests using the MUSHRA methodology, our model reduced the discrepancy between the naturalness of synthesized speech and that of recorded speech by about 6% for complex utterances and 20% on the task of long-form reading.

"Does he wear a black suit or a blue one?"

"Who ate the rest of my pizza?"

"Get scores, schedules, and listen to live audio streams."

(Audio samples of each of the three utterances above, synthesized with four embedding-selection methods: centroid, syntactic, BERT, and BERT + syntactic.)

CopyCat

When prosody transfer (PT) involves very fine-grained characteristics — the inflections of individual words, as opposed to general speaking styles — it’s more likely to suffer from source speaker leakage. This issue is exacerbated when the PT model is trained on non-parallel data — i.e., without having the same utterances spoken by the source and target speaker.

The core of CopyCat is a novel reference encoder, whose inputs are a mel-spectrogram of the source speech (a representation of how the signal’s frequency content changes over time); an embedding, or vector representation, of the source speech phonemes (the smallest units of speech); and a vector indicating the speaker’s identity.

The reference encoder outputs speaker-independent representations of the prosody of the input speech. These prosodic representations are robust to source speaker leakage even though the model is trained on non-parallel data. In the absence of parallel data, we train the model to transfer prosody from speakers onto themselves; that is, each training utterance serves as both the prosody source and the synthesis target.

The CopyCat architecture.

During inference, the phonemes of the speech to be synthesized pass first through a phoneme encoder and then to the reference encoder. The output of the reference encoder, together with the encoded phonemes and the speaker identity vector, then passes to the decoder, which generates speech with the target speaker’s voice and the source speaker's prosody.
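The wiring of inputs at inference time can be sketched as below. This is only a toy illustration of the data flow described above, not the CopyCat networks: the three functions are hypothetical stand-ins that return small lists of numbers, and the choice to give the reference encoder the source speaker's identity and the decoder the target speaker's identity is our reading of the description.

```python
from typing import List

Vector = List[float]


def phoneme_encoder(phonemes: List[str]) -> List[Vector]:
    # Hypothetical stand-in: one tiny "encoding" per phoneme.
    return [[float(len(p))] for p in phonemes]


def reference_encoder(source_mel: List[Vector],
                      encoded_phonemes: List[Vector],
                      source_speaker: Vector) -> List[Vector]:
    # Hypothetical stand-in for the prosody representation. A real encoder
    # would also use source_speaker so that speaker traits can be factored
    # out; this toy version ignores it and just averages the spectrogram.
    energy = sum(sum(frame) for frame in source_mel) / max(len(source_mel), 1)
    return [[energy] for _ in encoded_phonemes]


def decoder(prosody: List[Vector],
            encoded_phonemes: List[Vector],
            target_speaker: Vector) -> List[Vector]:
    # Hypothetical stand-in: combine the three inputs for each phoneme.
    return [p + e + target_speaker for p, e in zip(prosody, encoded_phonemes)]


def transfer_prosody(source_mel: List[Vector], phonemes: List[str],
                     source_speaker: Vector,
                     target_speaker: Vector) -> List[Vector]:
    encoded = phoneme_encoder(phonemes)                       # step 1
    prosody = reference_encoder(source_mel, encoded, source_speaker)  # step 2
    return decoder(prosody, encoded, target_speaker)          # step 3


output = transfer_prosody(
    source_mel=[[1.0, 3.0], [2.0, 2.0]],
    phonemes=["HH", "AH"],
    source_speaker=[1.0, 0.0],
    target_speaker=[0.0, 1.0],
)
```

The point of the sketch is that only the decoder ever sees the target speaker's identity, while the prosody representation is computed from the source recording.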

To evaluate the efficacy of our method, we compared CopyCat to a state-of-the-art model over five target voices, onto which the source prosody from 12 different unseen speakers had been transferred. CopyCat showed a statistically significant 47% increase in prosody transfer quality over the baseline. In another evaluation involving native speakers of American English, CopyCat showed a statistically significant 14% improvement over baseline in its ability to retain the target speaker’s identity. CopyCat achieves both results with a significantly simpler decoder than the baseline requires, with no drop in naturalness.

Prosody selection

Text-to-speech (TTS) has improved dramatically in recent years, but it still lacks the dynamic variation and adaptability of human speech.

One popular way to encode prosody in TTS systems is to use a variational autoencoder (VAE), which learns a distribution of prosodic characteristics from sample speech. Selecting a prosodic style for a synthetic utterance is a matter of picking a point — an acoustic embedding — in that distribution. 

In practice, most VAE-based TTS systems simply choose a point in the center of the distribution — a centroid — for all utterances. But rendering all the samples with the exact same prosody gets monotonous. 
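As a minimal sketch of centroid selection, assume the centroid is simply the mean of the acoustic embeddings of the training utterances (the embeddings below are made-up two-dimensional values, not real model outputs):

```python
from typing import List


def centroid(embeddings: List[List[float]]) -> List[float]:
    # Average each dimension across all training embeddings.
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]


# Hypothetical acoustic embeddings of three training utterances.
training_embeddings = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]]

baseline_embedding = centroid(training_embeddings)  # [2.0, 2.0]
```

Under this scheme every synthesized utterance is conditioned on the same `baseline_embedding`, whatever its text, which is why the resulting prosody tends to sound uniform.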

In our Interspeech paper, we present a novel way of exploiting linguistic information to select acoustic embeddings in VAE-based TTS systems. The goal is more dynamic and natural intonation, particularly for stylistic speech such as the newscaster speaking style.

Syntax, semantics, or both?

We experiment with three different systems for generating vector representations of the inputs to a TTS system, which allows us to explore the impact of both syntax and semantics on the overall quality of speech synthesis.

The first system uses syntactic information only; the second relies solely on BERT embeddings, which capture semantic information about strings of text, on the basis of word co-occurrence in large text corpora; and the third uses a combination of BERT and syntactic information. Based on these representations, our model selects acoustic embeddings to characterize the prosody of synthesized utterances.
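The text above does not spell out the selection rule, so the following is only one plausible sketch: assume each training utterance has both a linguistic vector (syntactic, BERT, or a combination) and an acoustic embedding, and pick the acoustic embedding of the training utterance whose linguistic vector is most similar to that of the input, using cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_acoustic_embedding(
    query: Sequence[float],
    bank: List[Tuple[Sequence[float], List[float]]],
) -> List[float]:
    # bank holds (linguistic_vector, acoustic_embedding) pairs, one per
    # training utterance. Return the acoustic embedding whose linguistic
    # vector is closest to the query (assumed nearest-neighbor rule).
    best = max(bank, key=lambda pair: cosine_similarity(query, pair[0]))
    return best[1]


# Toy bank: two training utterances with made-up vectors.
bank = [
    ([1.0, 0.0], [0.1, 0.9]),
    ([0.0, 1.0], [0.8, 0.2]),
]

chosen = select_acoustic_embedding([0.9, 0.1], bank)  # [0.1, 0.9]
```

Unlike centroid selection, the chosen embedding now varies with the linguistic content of each input sentence.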

To explore whether syntactic information can aid prosody selection, we use the notion of syntactic distance, a measure based on constituency trees, which map syntactic relationships between the words of a sentence. Large syntactic distances correlate with acoustically relevant events such as phrasing breaks or prosodic resets.

A constituency tree featuring syntactic-distance measures (orange circles).
credit: Glynis Condon

At left is the constituency tree of the sentence “The brown fox is quick, and it is jumping over the lazy dog”. Parts of speech are labeled according to the Penn Treebank part-of-speech tags: “DT”, for instance, indicates a determiner; “VBZ” indicates a third-person singular present verb, while “VBG” indicates a gerund or present participle; and so on.

The structure of the tree indicates syntactic relationships: for instance, “the”, “brown”, and “fox” together compose a noun phrase (NP), while “is” and “quick” compose a verb phrase (VP). 

Syntactic distance is a rank ordering that indicates the difference in the heights, within the tree, of the common ancestors of consecutive words; any values that preserve that ordering are valid.

One valid distance vector for this sentence is d = [0 2 1 3 1 8 7 6 5 4 3 2 1]. The completion of the subject noun phrase (after “fox”) triggers a prosodic reset, reflected in the distance of 3 between “fox” and “is”. There should also be a more emphasized reset at the end of the first clause, represented by the distance of 8 between “quick” and “and”.
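The distance vector above can be reproduced with a short sketch. It assumes a binarized bracketing of the sentence written as nested tuples; this bracketing is our own choice, made so that the result matches the vector quoted above, and is not necessarily the exact tree in the figure. The distance between two consecutive words is taken to be the height of their lowest common ancestor, with 0 assigned to the first word.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def height(tree: Tree) -> int:
    # A word has height 0; a phrase is one level above its tallest child.
    if isinstance(tree, str):
        return 0
    return 1 + max(height(child) for child in tree)


def syntactic_distances(tree: Tree) -> List[int]:
    # One value per word: 0 for the first word, then, for each later word,
    # the height of the lowest common ancestor it shares with the previous word.
    def gaps(node: Tree) -> List[int]:
        if isinstance(node, str):
            return []
        h = height(node)
        out: List[int] = []
        for i, child in enumerate(node):
            if i > 0:
                out.append(h)  # boundary between two children of this node
            out.extend(gaps(child))
        return out

    return [0] + gaps(tree)


# Assumed binarized bracketing (hypothetical, chosen for illustration).
TREE: Tree = (
    (("The", ("brown", "fox")), ("is", "quick")),
    ("and", ("it", ("is", ("jumping", ("over", ("the", ("lazy", "dog"))))))),
)

d = syntactic_distances(TREE)
# d == [0, 2, 1, 3, 1, 8, 7, 6, 5, 4, 3, 2, 1]
```

Because only the ordering matters, any monotone rescaling of these heights would be an equally valid distance vector; the largest value still marks the clause boundary between “quick” and “and”.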

We compared VAE models with linguistically informed acoustic-embedding selection against a VAE model that uses centroid selection on two tasks, sentence synthesis and long-form reading.

The sentence synthesis data set had four categories: complex utterances; sentences with compound nouns; and two types of questions, which have their own characteristic prosody (the rising inflection at the end, for instance). The two question types were questions beginning with “wh” words (who, what, why, etc.) and “or” questions, which present a choice.

The model that uses syntactic information alone improves on the baseline model across the board, while the addition of semantic information improves performance still further in some contexts. 

On the “wh” questions, the combination of syntactic and semantic data delivered an 8% improvement over the baseline, and on the “or” questions, the improvement was 21%. This suggests that questions of the same type have closely related syntactic structures and that this information can be used to achieve better prosody.

On long-form reading, the syntactic model alone delivered the best results, reducing the gap between the baseline and recorded speech by approximately 20%.
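For readers unfamiliar with this kind of figure, here is one way a gap reduction could be computed from mean listener scores. The formula is an assumption about how such a percentage is typically derived, and the scores in the example are invented for illustration; they are not results from the paper.

```python
def gap_reduction(baseline: float, system: float, recording: float) -> float:
    # Fraction of the baseline-to-recording gap that the new system closes.
    gap = recording - baseline
    if gap <= 0:
        raise ValueError("recording score must exceed baseline score")
    return (system - baseline) / gap


# Hypothetical mean scores on a 0-100 scale.
reduction = gap_reduction(baseline=60.0, system=64.0, recording=80.0)  # 0.2
```

With these made-up scores, a four-point gain closes 20% of a 20-point gap between the baseline and recorded speech.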
