Two new approaches to synthesizing speech with appropriate prosody

Methods share a two-stage training process in which a model learns a representation from audio data, then learns to predict that representation from text.

At ICASSP 2021, the Amazon text-to-speech team presented two new papers about synthesizing speech from text with contextually appropriate prosody — the rhythm, emphasis, melody, duration, and loudness of speech.

Text-to-speech (TTS) is a one-to-many problem where a single piece of text may have more than one appropriate prosodic rendition. Determining the prosody of a piece of text is a non-trivial problem, but it can increase the naturalness of synthesized speech considerably.

The approaches we describe in the two papers share a general philosophy, but the ways in which they tackle the problem are fundamentally different. 

Kathaka-TTSModel.png
The Kathaka architecture, with separate encoders for spectrogram and phoneme sequence inputs.

One of our papers, “Prosodic representation learning and contextual sampling for neural text-to-speech”, introduces Kathaka, a model trained using a novel two-stage approach. In the first stage, the model learns a distribution of the prosody of all the speech samples in the training data by exploiting a variational learning approach. In stage two, the model learns to sample from this distribution based on semantic and syntactic characteristics of the texts associated with the speech samples.

According to listener studies using the industry-standard MUSHRA (multiple stimuli with hidden reference and anchor) methodology, the speech produced by Kathaka improved over the baseline TTS model by 13.2% in terms of naturalness.

The other paper, “CAMP: A two-stage approach to modelling prosody in context”, introduces CAMP, the context-aware model of prosody. Like Kathaka, CAMP is trained using a two-stage approach. In the first stage, CAMP learns a representation of prosody for every word of each speech sample in the training data in a non-variational way. In stage two, the model learns to predict these learned representations based on the semantic and syntactic characteristics of the associated texts. 

According listener studies with MUSHRA evaluations, the speech produced by CAMP improved over the baseline TTS model by 26% in terms of naturalness. 

Kathaka

Since TTS is a one-to-many problem, where the same text can be said in different ways, TTS models often synthesize speech with neutral prosody. This decreases the naturalness of synthesized speech, as there is no relation between the prosody and what is being said. 

Kathaka’s two-stage learning approach tackles this problem by exploiting the semantics and syntax of the text. The Kathaka architecture has two encoders: one, the reference encoder, takes a mel-spectrogram (a snapshot of the frequency spectrum) of the speech signal as input; the other takes the associated text, represented as a sequence of phonemes, the smallest units of speech. 

Based on the mel-spectrogram, the reference encoder outputs the parameters of a prosody distribution (the mean and variance, µ and σ in the diagram above), and a sample is selected from that distribution. This sample, along with the phoneme encoding, is used to synthesize a new mel-spectrogram. The model is an autoencoder, meaning that it’s trained to output the same mel-spectrogram given to the reference encoder as input.

At inference time, of course, mel-spectrograms aren’t available as input, as they are to be synthesized. Thus, in step two, we train “Samplers”, which predict the parameters of the prosody distribution directly from the text.

Kathaka sampler.png
The architecture of the Kathaka sampler, which factors in both semantic information from BERT embeddings and syntactic information from parse trees.

To encode the text, we use a BERT model, which is pretrained to provide contextual word embeddings — representations of words as vectors in a multidimensional space — that capture semantic and some syntactic information about the text. We also apply graph neural networks to syntax parse trees of the text, to produce representations of just the texts’ syntactic information. 

From these representations, the Sampler learns to predict the parameters of the prosody distribution. At inference time, a sample from this distribution is used in place of the sampled point from the reference encoder to synthesize the mel-spectrogram.

In order to evaluate the efficacy of Kathaka, we compared it to our neural-text-to-speech (NTTS) baseline and showed a statistically significant 13.2% increase in naturalness. 

CAMP

CAMP uses a similar two-step approach to training, but instead of learning a distribution of prosodies, it learns specific mappings between individual words and prosodic representations, conditioned on semantic and syntactic features of the text.

In stage one, CAMP learns word-level representations of prosody using a word-level reference encoder. This encoder takes a mel-spectrogram as input and produces a word-level representation of the speech sample’s prosody. This word-level representation is then aligned with the phonemes that constitute the word, which, again, are encoded by a separate encoder. Both sets of features are then used to synthesize a mel-spectrogram as output, and the training target is the same mel-spectrogram that the reference encoder took as input. Through this process, CAMP learns word-level prosodic representations. 

CAMP architecture.png
During training (left), CAMP, like Kathaka, learns a prosody representation through the reference encoder and learns to predict that representation from the syntactic and semantic content of the input text. At inference, it replaces the prosody representations from the reference encoder (center) with representations predicted from the syntactic and semantic content of the input text (right).

In stage two, CAMP uses semantic and syntactic information from the input texts to predict the word-level prosody representations learned in stage one. To encode the text, we again use BERT embeddings, and we also use word-level syntax tags such as (1) part of speech (POS); (2) word class (“open” words such as nouns or verbs, which can proliferate indefinitely, versus “closed” words such as pronouns and articles, which are fixed and limited); (3) noun structure; and (4) punctuation structure. This information is then used to predict the word-level prosody representations learned in stage one. 

As with Kathaka, during inference, we replace the prosody representations from the reference encoder with the representations predicted from the syntactic and semantic content of the input text. 

Compared to our NTTS baseline, CAMP showed a statistically significant 26% increase in naturalness. 

Related content

GB, London
Are you excited about applying economic models and methods using large data sets to solve real world business problems? Then join the Economic Decision Science (EDS) team. EDS is an economic science team based in the EU Stores business. The teams goal is to optimize and automate business decision making in the EU business and beyond. An internship at Amazon is an opportunity to work with leading economic researchers on influencing needle-moving business decisions using incomparable datasets and tools. It is an opportunity for PhD students and recent PhD graduates in Economics or related fields. We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Knowledge of econometrics, as well as basic familiarity with Stata, R, or Python is necessary. Experience with SQL would be a plus. As an Economics Intern, you will be working in a fast-paced, cross-disciplinary team of researchers who are pioneers in the field. You will take on complex problems, and work on solutions that either leverage existing academic and industrial research, or utilize your own out-of-the-box pragmatic thinking. In addition to coming up with novel solutions and prototypes, you may even need to deliver these to production in customer facing products. Roughly 85% of previous intern cohorts have converted to full time economics employment at Amazon.
US, CA, Cupertino
We're looking for an Applied Scientist to help us secure Amazon's most critical data. In this role, you'll work closely with internal security teams to design and build AR-powered systems that protect our customers' data. You will build on top of existing formal verification tools developed by AWS and develop new methods to apply those tools at scale. You will need to be innovative, entrepreneurial, and adaptable. We move fast, experiment, iterate and then scale quickly, thoughtfully balancing speed and quality. Inclusive Team Culture Here at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Work/Life Balance Our team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives. Mentorship & Career Growth Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future. Key job responsibilities Deeply understand AR techniques for analyzing programs and other systems, and keep up with emerging ideas from the research community. Engage with our customers to develop understanding of their needs. Propose and develop solutions that leverage symbolic reasoning services and concepts from programming languages, theorem proving, formal verification and constraint solving. Implement these solutions as services and work with others to deploy them at scale across Payments and Healthcare. Author papers and present your work internally and externally. Train new teammates, mentor others, participate in recruiting and interviewing, and participate in our tactical and strategic planning. About the team Our small team of applied scientists works within a larger security group, supporting thousands of engineers who are developing Amazon's payments and healthcare services. Security is a rich area for automated reasoning. Most other approaches are quite ad-hoc and take a lot of human effort. AR can help us to reason deliberately and systematically, and the dream of provable security is incredibly compelling. We are working to make this happen at scale. We partner closely with our larger security group and with other automated reasoning teams in AWS that develop core reasoning services.
US, NY, New York
Search Thematic Ad Experience (STAX) team within Sponsored Products is looking for a leader to lead a team of talented applied scientists working on cutting-edge science to innovate on ad experiences for Amazon shoppers!. You will manage a team of scientists, engineers, and PMs to innovate new widgets on Amazon Search page to improve shopper experience using state-of-the-art NLP and computer vision models. You will be leading some industry first experiences that has the potential to revolutionize how shopping looks and feels like on Amazon, and e-commerce marketplaces in general. You will have the opportunity to design the vision on how ad experiences look on Amazon search page, and use the combination of advanced techniques and continuous experimentation to realize this vision. Your work will be core to Amazon’s advertising business. You will be a significant contributor in building the future of sponsored advertising, directly impacting the shopper experience for our hundreds of millions of shoppers worldwide, while delivering significant value for hundreds of thousands of advertisers across the purchase journey with ads on Amazon. Key job responsibilities * Be the technical leader in Machine Learning; lead efforts within the team, and collaborate and influence across the organization. * Be a critic, visionary, and execution leader. Invent and test new product ideas that are powered by science that addresses key product gaps or shopper needs. * Set, plan, and execute on a roadmap that strikes the optimal balance between short term delivery and long term exploration. You will influence what we invest in today and tomorrow. * Evangelize the team’s science innovation within the organization, company, and in key conferences (internal and external). * Be ruthless with prioritization. You will be managing a team which is highly sought after. But not all can be done. Have a deep understanding of the tradeoffs involved and be fierce in prioritizing. * Bring clarity, direction, and guidance to help teams navigate through unsolved problems with the goal to elevate the shopper experience. We work on ambiguous problems and the right approach is often unknown. You will bring your rich experience to help guide the team through these ambiguities, while working with product and engineering in crisply defining the science scope and opportunities. * Have strong product and business acumen to drive both shopper improvements and business outcomes. A day in the life * Lead a multidisciplinary team that embodies “customer obsessed science”: inventing brand new approaches to solve Amazon’s unique problems, and using those inventions in software that affects hundreds of millions of customers * Dive deep into our metrics, ongoing experiments to understand how and why they are benefitting our shoppers (or not) * Design, prototype and validate new widgets, techniques, and ideas. Take end-to-end ownership of moving from prototype to final implementation. * Be an advocate and expert for STAX science to leaders and stakeholders inside and outside advertising. About the team We are the Search thematic ads experience team within Sponsored products - a fast growing team of customer-obsessed engineers, technologists, product leaders, and scientists. We are focused on continuous exploration of contexts and creatives to drive value for both our customers and advertisers, through continuous innovation. We focus on new ads experiences globally to help shoppers make the most informed purchase decision while helping shortcut the time to discovery that shoppers are highly likely to engage with. We also harvest rich contextual and behavioral signals that are used to optimize our backend models to continually improve the shopper experience. We obsess about our customers and are continuously seeking opportunities to delight them.
US, CA, Palo Alto
Amazon is the 4th most popular site in the US. Our product search engine, one of the most heavily used services in the world, indexes billions of products and serves hundreds of millions of customers world-wide. We are working on a new initiative to transform our search engine into a shopping engine that assists customers with their shopping missions. We look at all aspects of search CX, query understanding, Ranking, Indexing and ask how we can make big step improvements by applying advanced Machine Learning (ML) and Deep Learning (DL) techniques. We’re seeking a thought leader to direct science initiatives for the Search Relevance and Ranking at Amazon. This person will also be a deep learning practitioner/thinker and guide the research in these three areas. They’ll also have the ability to drive cutting edge, product oriented research and should have a notable publication record. This intellectual thought leader will help enhance the science in addition to developing the thinking of our team. This leader will direct and shape the science philosophy, planning and strategy for the team, as we explore multi-modal, multi lingual search through the use of deep learning . We’re seeking an individual that can enhance the science thinking of our team: The org is made of 60+ applied scientists, (2 Principal scientists and 5 Senior ASMs). This person will lead and shape the science philosophy, planning and strategy for the team, as we push into Deep Learning to solve problems like cold start, discovery and personalization in the Search domain. Joining this team, you’ll experience the benefits of working in a dynamic, entrepreneurial environment, while leveraging the resources of Amazon [Earth's most customer-centric internet company]. We provide a highly customer-centric, team-oriented environment in our offices located in Palo Alto, California.
JP, 13, Tokyo
Our mission is to help every vendor drive the most significant impact selling on Amazon. Our team invent, test and launch some of the most innovative services, technology, processes for our global vendors. Our new AVS Professional Services (ProServe) team will go deep with our largest and most sophisticated vendor customers, combining elite client-service skills with cutting edge applied science techniques, backed up by Amazon’s 20+ years of experience in Japan. We start from the customer’s problem and work backwards to apply distinctive results that “only Amazon” can deliver. Amazon is looking for a talented and passionate Applied Science Manager to manage our growing team of Applied Scientists and Business Intelligence Engineers to build world class statistical and machine learning models to be delivered directly to our largest vendors, and working closely with the vendors' senior leaders. The Applied Science Manager will set the strategy for the services to invent, collaborating with the AVS business consultants team to determine customer needs and translating them to a science and development roadmap, and finally coordinating its execution through the team. In this position, you will be part of a larger team touching all areas of science-based development in the vendor domain, not limited to Japan-only products, but collaborating with worldwide science and business leaders. Our current projects touch on the areas of causal inference, reinforcement learning, representation learning, anomaly detection, NLP and forecasting. As the AVS ProServe Applied Science Manager, you will be empowered to expand your scope of influence, and use ProServe as an incubator for products that can be expanded to all Amazon vendors, worldwide. We place strong emphasis on talent growth. As the Applied Science Manager, you will be expected in actively growing future Amazon science leaders, and providing mentoring inside and outside of your team. Key job responsibilities The Applied Science Manager is accountable for: (1) Creating a vision, a strategy, and a roadmap tackling the most challenging business questions from our leading vendors, assess quantitatively their feasibility and entitlement, and scale their scope beyond the ProServe team. (2) Coordinate execution of the roadmap, through direct reports, consisting of scientists and business intelligence engineers. (3) Grow and manage a technical team, actively mentoring, developing, and promoting team members. (4) Work closely with other science managers, program/product managers, and business leadership worldwide to scope new areas of growth, creating new partnerships, and proposing new business initiatives. (5) Act as a technical supervisor, able to assess scientific direction, technical design documents, and steer development efforts to maximize project delivery.
US, NY, New York
Amazon Advertising is one of Amazon's fastest growing and most profitable businesses. As a core product offering within our advertising portfolio, Sponsored Products (SP) helps merchants, retail vendors, and brand owners succeed via native advertising, which grows incremental sales of their products sold through Amazon. The SP team's primary goals are to help shoppers discover new products they love, be the most efficient way for advertisers to meet their business objectives, and build a sustainable business that continuously innovates on behalf of customers. Our products and solutions are strategically important to enable our Retail and Marketplace businesses to drive long-term growth. We deliver billions of ad impressions and millions of clicks and break fresh ground in product and technical innovations every day! The Supply team (within Sponsored Products) is looking for an Applied Scientist to join a fast-growing team with the mandate of creating new ad experiences that elevate the shopping experience for hundreds of millions customers worldwide. The Applied Scientist will take end-to-end ownership of driving new product/feature innovation by applying advanced statistical and machine learning models. The role will handle petabytes of unstructured data (images, text, videos) to extract insights into what metadata can be useful for us to highlight to simplify purchase decisions, and propose new experiences that increase shopper engagement. Why you love this opportunity Amazon is investing heavily in building a world-class advertising business. This team is responsible for defining and delivering a collection of advertising products that drive discovery and sales. Our solutions generate billions in revenue and drive long-term growth for Amazon’s Retail and Marketplace businesses. We deliver billions of ad impressions, millions of clicks daily, and break fresh ground to create world-class products. We are highly motivated, collaborative, and fun-loving team with an entrepreneurial spirit - with a broad mandate to experiment and innovate. Key job responsibilities As an Applied Scientist on this team you will: Build machine learning models and perform data analysis to deliver scalable solutions to business problems. Perform hands-on analysis and modeling with very large data sets to develop insights that increase traffic monetization and merchandise sales without compromising shopper experience. Work closely with software engineers on detailed requirements, technical designs and implementation of end-to-end solutions in production. Design and run A/B experiments that affect hundreds of millions of customers, evaluate the impact of your optimizations and communicate your results to various business stakeholders. Work with scientists and economists to model the interaction between organic sales and sponsored content and to further evolve Amazon's marketplace. Establish scalable, efficient, automated processes for large-scale data analysis, machine-learning model development, model validation and serving. Research new predictive learning approaches for the sponsored products business. Write production code to bring models into production. A day in the life You will invent new experiences and influence customer-facing shopping experiences to help suppliers grow their retail business and the auction dynamics that leverage native advertising; this is your opportunity to work within the fastest-growing businesses across all of Amazon! Define a long-term science vision for our advertising business, driven fundamentally from our customers' needs, translating that direction into specific plans for research and applied scientists, as well as engineering and product teams. This role combines science leadership, organizational ability, technical strength, product focus, and business understanding.
US, CA, Palo Alto
The Amazon Search team creates powerful, customer-focused search solutions and technologies. Whenever a customer visits an Amazon site worldwide and types in a query or browses through product categories, Amazon Search services go to work. We design, develop, and deploy high performance, fault-tolerant distributed search systems used by millions of Amazon customers every day. We are seeking a Principal Scientist with deep expertise in Search. Your responsibility will be to advance the state-of-the-art for search science that leads to world-class products that impact Amazon's customers. Key job responsibilities You will be responsible for defining key research directions, adopting or inventing new machine learning techniques, conducting rigorous experiments, publishing results, and ensuring that research is translated into practice. You will develop long-term strategies, persuade teams to adopt those strategies, propose goals and deliver on them. You will also participate in organizational planning, hiring, mentorship and leadership development. You will be technically fearless and with a passion for building scalable science and engineering solutions. You will serve as a key scientific resource in full-cycle development (conception, design, implementation, testing to documentation, delivery, and maintenance). About the team This is a position on Core Ranking and Experimentation team in Palo Alto, CA. The team works on a variety of topics in search ranking and relevance, such as multi-objective optimization, personalization, and fast online experimentation. We work closely with teams in various parts of the stack to ensure that our science is translated to customer facing products.
US, WA, Bellevue
Amazon is looking for a passionate, talented, and inventive Applied Scientists with a strong machine learning background to help build industry-leading Speech and Language technology. Our mission is to provide a delightful experience to Amazon’s customers by pushing the envelope in Automatic Speech Recognition (ASR), Machine Translation (MT), Natural Language Understanding (NLU), Machine Learning (ML) and Computer Vision (CV). As part of our AI team in Amazon AWS, you will work alongside internationally recognized experts to develop novel algorithms and modeling techniques to advance the state-of-the-art in human language technology. Your work will directly impact millions of our customers in the form of products and services that make use of speech and language technology. You will gain hands on experience with Amazon’s heterogeneous speech, text, and structured data sources, and large-scale computing resources to accelerate advances in spoken language understanding. We are hiring in all areas of human language technology: ASR, MT, NLU, text-to-speech (TTS), and Dialog Management, in addition to Computer Vision.
CA, ON, Toronto
WE ARE OPEN TO HIRING THIS ROLE IN SF BAY AREA, SEATTLE, TORONTO, OR VIRGINIA (HQ2). Amazon Advertising is one of Amazon's fastest growing and most profitable businesses, responsible for defining and delivering a collection of advertising products that drive discovery and sales. Our products and solutions are strategically important to enable our Retail and Marketplace businesses to drive long-term growth. We deliver billions of ad impressions and millions of clicks and break fresh ground in product and technical innovations every day! The Sponsored Products Demand Identification and Campaign Optimization (DEICO) team is responsible for building experiences across all seller and advertiser touchpoints to generate performant Sponsored Products demand and improved campaign performance through guardrail-based controls. We do this through 1) Creating performant demand focusing on Amazon supplier use cases (e.g. launching a new ASIN) through presets for campaign creation. 2) Making existing demand performant through diagnosing gaps and providing proactive and targeted recommendations, and 3) Developing and launching new campaign optimization rules that allow advertisers to provide performance guardrails and automate optimization of their campaigns. Within these domains, we focus on creating holistic recommendations through machine learning for products to advertise and optimal presets across every parameter that go into Sponsored Products ad campaign setup, as well as consolidated recommendations that improve performance of existing Sponsored Product campaigns. As an Applied Scientist on this team, you will: • Drive end-to-end Machine Learning projects that have a high degree of ambiguity, scale, complexity. • Perform hands-on analysis and modeling of enormous data sets to develop insights that increase traffic monetization and merchandise sales, without compromising the shopper experience. • Build machine learning models, perform proof-of-concept, experiment, optimize, and deploy your models into production; work closely with software engineers to assist in productionizing your ML models. • Run A/B experiments, gather data, and perform statistical analysis. • Establish scalable, efficient, automated processes for large-scale data analysis, machine-learning model development, model validation and serving. • Research new and innovative machine learning approaches. • Recruit Applied Scientists to the team and provide mentorship. Why you will love this opportunity: Amazon is investing heavily in building a world-class advertising business. This team defines and delivers a collection of advertising products that drive discovery and sales. Our solutions generate billions in revenue and drive long-term growth for Amazon’s Retail and Marketplace businesses. We deliver billions of ad impressions, millions of clicks daily, and break fresh ground to create world-class products. We are a highly motivated, collaborative, and fun-loving team with an entrepreneurial spirit - with a broad mandate to experiment and innovate. Impact and Career Growth: You will invent new experiences and influence customer-facing shopping experiences to help suppliers grow their retail business and the auction dynamics that leverage native advertising; this is your opportunity to work within the fastest-growing businesses across all of Amazon! Define a long-term science vision for our advertising business, driven from our customers' needs, translating that direction into specific plans for research and applied scientists, as well as engineering and product teams. This role combines science leadership, organizational ability, technical strength, product focus, and business understanding. Team video https://youtu.be/zD_6Lzw8raE
IN, KA, Bangalore
Amazon is investing heavily in building a world class advertising business and we are responsible for defining and delivering a collection of self-service performance advertising products that drive discovery and sales. Our products are strategically important to our Retail and Marketplace businesses driving long term growth. We deliver billions of ad impressions and millions of clicks daily and are breaking fresh ground to create world-class products. The ATT team, based in Bangalore, is responsible for ensuring that ads are compliant to world-wide advertising policies and are of high quality, leading to higher conversion for the advertisers and providing a great experience for the shoppers. Machine learning, particularly multi-modal data understanding, is fundamental to the way we drive our business, meet our goals and satisfy our customers. ATT team invests in researching and developing state of art models that analyze various type of ad assets – text, audio, images and videos - to ensure compliance to advertising policies. We also help advertisers create more successful ads by creating ML models to assist ad generation as well as to provide data-driven interpretable insights. Key job responsibilities Major responsibilities · Deliver key goals to enhance advertiser experience and protect shopper trust by innovative use of computer vision, NLP and statistical techniques · Drive core business analytics and data science explorations to inform key business decisions and algorithm roadmap · Establish scalable, efficient, automated processes for large scale data analyses, model development, model validation and model implementation · Hire and develop top talent in machine learning and data science and accelerate the pace of innovation in the group · Work proactively with engineering teams and product managers to evangelize new algorithms and drive the implementation of large-scale complex ML models in production