Amazon Text-to-Speech group's research at ICASSP 2022

Papers focus on speech conversion and data augmentation — and sometimes both at once.

The automatic conversion of text to speech is crucial to Alexa: it’s how Alexa communicates with customers. The models developed by the Amazon Text-to-Speech group are also available to Amazon Web Services (AWS) customers through Polly, the AWS text-to-speech service.

The Text-to-Speech (TTS) group has four papers at this year’s International Conference on Acoustics, Speech, and Signal Processing (ICASSP), all of which deal with either voice conversion (preserving prosodic features while converting one synthetic voice to another), data augmentation, or both.


In “Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module”, the Amazon TTS group addresses the problem of few-shot speaker adaptation, or learning a new synthetic voice from just a handful of training examples. The paper reformulates the problem as learning a voice conversion model that’s applied to the output of a high-quality TTS model, a conceptual shift from the existing few-shot-TTS paradigm.

In “Cross-speaker style transfer for text-to-speech using data augmentation”, the team shows how to build a TTS model capable of expressive speech, even when the only available training data for the target voice consists of neutral speech. The idea is to first train a voice conversion model, which converts samples of expressive speech in other voices into the target voice, and then use the converted speech as additional training data for the TTS model.

In “Distribution augmentation for low-resource expressive text-to-speech”, the TTS group expands the range of texts used to train a TTS model by recombining excerpts from existing examples to produce new examples. The trick is to maintain the syntactic coherence of the synthetic examples, so that the TTS model won’t waste resources learning improbable sequences of phonemes. (This is the one data augmentation paper that doesn’t rely on voice conversion.)

In this example of data augmentation through recombination of existing training examples, the verb phrase “shook her head”, as identified by a syntactic parse, is substituted for the verb phrase “lied” in the sentence “he never lied”. The original acoustic signals (bottom row) are cut and spliced at the corresponding points. From "Distribution augmentation for low-resource expressive text-to-speech".

Finally, in “Text-free non-parallel many-to-many voice conversion using normalising flows”, the team adapts the concept of normalizing flows, which have been used widely for TTS, to the problem of voice conversion. Like most deep-learning models, normalizing flows learn functions that produce vector representations of input data. The difference is that the functions are invertible, so the inputs can be recovered from the representations. The team hypothesized that preserving more information from the input data would yield better voice conversion, and early experiments bear that hypothesis out.

Voice filter

The idea behind “Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module” is that for few-shot learning, it’s easier to take the output of an existing, high-quality TTS model — a voice spectrogram — and adapt that to a new target voice than it is to adapt the model itself.

The key to the approach is that the voice filter, which converts the TTS model’s output to a new voice, is trained on synthetic data created by the TTS model itself.

The training procedure for the voice filter.

The TTS model is duration controllable, meaning that the input text is encoded to indicate the duration that each phoneme should have in the output speech. This enables the researchers to create two parallel corpora of training data. One corpus consists of real training examples, from 120 different speakers. The other corpus is synthetic speech generated by the TTS model, but with durations that match those of the multispeaker examples.
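As a rough sketch of how such a parallel corpus could be assembled, the Python snippet below assumes a hypothetical duration-controllable synthesizer and forced aligner; `duration_controllable_tts` and `forced_align` are illustrative names, not interfaces from the paper.

```python
# Hedged sketch: building parallel real/synthetic corpora with matched
# phoneme durations. `duration_controllable_tts` and `forced_align` are
# assumed interfaces for illustration, not APIs from the paper.

def build_parallel_corpora(recordings, duration_controllable_tts, forced_align):
    """recordings: list of (speaker_id, text, audio) drawn from the 120 real speakers."""
    real_corpus, synthetic_corpus = [], []
    for speaker_id, text, audio in recordings:
        # Per-phoneme durations of the real recording.
        durations = forced_align(text, audio)
        # Synthesize the same text with the single high-quality TTS voice,
        # forcing each phoneme to last as long as it does in the recording.
        synthetic_audio = duration_controllable_tts(text, durations)
        real_corpus.append((speaker_id, text, audio))
        synthetic_corpus.append((speaker_id, text, synthetic_audio))
    # The voice filter is then trained to map the synthetic utterances to the
    # corresponding real-speaker utterances, frame by frame.
    return real_corpus, synthetic_corpus
```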

The voice filter is trained on the parallel corpora, and then, for few-shot learning, the researchers simply fine-tune it on a new speaker. In experiments, the researchers found that this approach produced speech whose quality was comparable to that produced by conventional models trained on 30 times as much data.

Cross-speaker style transfer

The voice conversion model that the researchers use in “Cross-speaker style transfer for text-to-speech using data augmentation” is based on the CopyCat model previously reported on the Amazon Science blog. The converted expressive data is added to the neutral data to produce the dataset used to train the TTS model.

The TTS model takes two inputs: a text sequence and a style vector. During training, the text sequence passes to the TTS model, and the spectrogram of the target speech sample passes to a reference encoder, which produces the style embedding. At inference time, of course, there is no input spectrogram. But the researchers show that they can control the style of the TTS model’s output by feeding it a precomputed style embedding.
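A minimal PyTorch-style sketch of this arrangement follows; the module sizes and names are illustrative assumptions, not the architecture described in the paper. At training time the style embedding is computed from the target spectrogram; at inference time a precomputed embedding is supplied instead.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative reference encoder: summarizes a mel-spectrogram into a style vector."""
    def __init__(self, n_mels=80, style_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, style_dim, batch_first=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        _, hidden = self.rnn(mel)
        return hidden[-1]                        # (batch, style_dim)

class StyleConditionedTTS(nn.Module):
    def __init__(self, text_dim=256, style_dim=128, n_mels=80):
        super().__init__()
        self.ref_encoder = ReferenceEncoder(n_mels, style_dim)
        self.decoder = nn.Linear(text_dim + style_dim, n_mels)  # stand-in decoder

    def forward(self, text_encoding, target_mel=None, style_embedding=None):
        # Training: derive the style embedding from the target spectrogram.
        # Inference: pass in a precomputed style embedding instead.
        if style_embedding is None:
            style_embedding = self.ref_encoder(target_mel)
        style = style_embedding.unsqueeze(1).expand(-1, text_encoding.size(1), -1)
        return self.decoder(torch.cat([text_encoding, style], dim=-1))
```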

Cross-speaker style transfer.png
The voice conversion model (left) and text-to-speech model (right) used for cross-speaker style transfer. The reference encoders are used only during training. From "Cross-speaker style transfer for text-to-speech using data augmentation".

The researchers assessed the model through human evaluation using the MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) methodology. Human evaluators reported that, relative to a benchmark model, the new model reduced the gap in perceived style similarity between synthesized and real speech by an average of 58% across 14 different speakers.

Distribution augmentation

“Distribution augmentation for low-resource expressive text-to-speech” considers the case in which training data for a new voice is scarce. The goal is to permute the texts of the existing examples, producing new examples, and recombine excerpts from the corresponding speech samples to produce new samples. This does not increase the acoustic diversity of the training targets, but it does increase the linguistic diversity of the training inputs.

To ensure that the synthetic training examples do not become too syntactically incoherent, the researchers construct parse trees for the input texts and then swap syntactically equivalent branches across trees (see figure, above). Swapping the corresponding sections of the acoustic signal requires good alignment between text and signal, which is accomplished by existing forced-alignment models.
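The following sketch illustrates the recombination step under simplified assumptions: constituent spans and word-level timings are given as plain dictionaries, which are illustrative data structures rather than the paper's implementation.

```python
import numpy as np

def swap_constituent(example_a, example_b, label="VP", sr=24000):
    """Replace example_a's `label`-typed constituent with example_b's, splicing the
    audio at the forced-alignment boundaries. Each example is a dict with 'words',
    'audio' (1-D np.array), 'constituents' mapping labels to (word_start, word_end)
    spans, and 'word_times' giving (start_sec, end_sec) per word."""
    def cut(example):
        w0, w1 = example["constituents"][label]
        t0 = int(example["word_times"][w0][0] * sr)
        t1 = int(example["word_times"][w1 - 1][1] * sr)
        return example["words"][w0:w1], example["audio"][t0:t1], (w0, w1), (t0, t1)

    _, _, (a_w0, a_w1), (a_t0, a_t1) = cut(example_a)
    words_b, audio_b, _, _ = cut(example_b)

    # New text: example A with its constituent replaced by B's.
    new_words = example_a["words"][:a_w0] + words_b + example_a["words"][a_w1:]
    # New audio: splice B's constituent audio into A at the aligned sample boundaries.
    new_audio = np.concatenate(
        [example_a["audio"][:a_t0], audio_b, example_a["audio"][a_t1:]])
    return new_words, new_audio
```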

During training, to ensure that the resulting TTS model doesn’t become overbiased toward the synthetic examples, the researchers also include a special input token to indicate points at which two existing samples have been fused together. The expectation is that the model will learn to privilege phonemic sequences internal to the real samples over phonemic sequences that cross boundaries between fused samples. At inference time, the value of the token is simply set to 0 across all inputs.
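In spirit, the tag can be thought of as one extra flag per phoneme, as in this hedged sketch (the representation is an assumption, not the paper's exact encoding):

```python
def augmentation_tags(phonemes, splice_index=None):
    """Return one tag per phoneme: 1 at the splice boundary of a fused training
    example, 0 elsewhere; at inference time splice_index is None, so all tags are 0."""
    tags = [0] * len(phonemes)
    if splice_index is not None:          # training-time fused example
        tags[splice_index] = 1
    return tags
```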

The “augmentation tag” marks the boundary between acoustic signals taken from two different training examples, to prevent overbiasing the TTS model toward synthetic data. From "Distribution augmentation for low-resource expressive text-to-speech".

The quality of the model’s speech output was assessed by 60 human evaluators, who compared it to speech output by a baseline model on five different datasets. Across the board, the output of the new model received better scores than the output of the baseline.

Normalizing flows

A normalizing flow learns to map input data to a representational space in a way that maximizes the approximation of some prior distribution. The word “flow” indicates that the mapping results from passing the data through a series of invertible transformations, and fitting the representations to the prior distribution supplies the normalization.
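In the standard formulation (not specific to this paper), an invertible map f sends data x to a latent z = f(x), and training maximizes the exact log-likelihood given by the change-of-variables formula,

\[
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,
\]

where p_Z is the prior (typically a standard Gaussian) and the Jacobian term accounts for how the transformation expands or contracts volume.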

In “Text-free non-parallel many-to-many voice conversion using normalising flows”, Amazon TTS researchers consider a flow whose inputs are a source spectrogram, a phoneme embedding, a speaker identity embedding, the fundamental frequency of the acoustic signal, and a flag denoting whether a frame of input audio is voiced or unvoiced. The flow maps the inputs to a distribution of phoneme frequencies in a particular application domain.

Typically, a normalizing flow will learn both the distribution and the mapping from the training data. But here, the researchers pretrain the flow on a standard TTS task, for which training data is plentiful, to learn the distribution in advance.

Because the flow is invertible, a vector in the representational space can be mapped back to a spectrogram, provided that the other model inputs (phoneme embedding, speaker ID, and so on) are available. To use normalizing flows to perform voice conversion, the researchers simply substitute one speaker for another during this reverse mapping.
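Conceptually, the conversion step looks something like the sketch below, where `flow.forward` / `flow.inverse` and the conditioning arguments are assumed interfaces rather than the paper's code.

```python
def convert_voice(flow, source_mel, phonemes, f0, voiced_flags,
                  source_speaker_emb, target_speaker_emb):
    # Forward pass: map the source spectrogram to the latent space,
    # conditioned on the source speaker's embedding.
    z = flow.forward(source_mel, phonemes=phonemes, speaker=source_speaker_emb,
                     f0=f0, voiced=voiced_flags)
    # Inverse pass: reconstruct a spectrogram from the same latent, but
    # conditioned on the target speaker's embedding instead.
    converted_mel = flow.inverse(z, phonemes=phonemes, speaker=target_speaker_emb,
                                 f0=f0, voiced=voiced_flags)
    return converted_mel
```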

An overview of TTS researchers' use of normalizing flows to do voice conversion. From "Text-free non-parallel many-to-many voice conversion using normalising flows".

The researchers examine two different experimental settings, one in which the voice conversion model takes both text sequences and spectrograms as inputs and one in which it takes spectrograms only. In the second case, the pretrained normalizing-flow model significantly outperformed the benchmarks. A normalizing-flow model that learned the phoneme distribution directly from the training data didn’t fare as well, indicating the importance of the pretraining step.
