How to build highly expressive speech models

New voice for Alexa’s Reading Sidekick feature avoids the instabilities common to models with variable prosody.

In June, Alexa announced a new feature called Reading Sidekick, which helps kids grow into confident readers by taking turns reading with Alexa, while Alexa provides encouragement and support. To make this an engaging and entertaining experience, the Amazon Text-to-Speech team developed a version of the Alexa voice that speaks more slowly and with more expressivity than the standard, neutral voice.

A child enjoying Reading Sidekick with her panda Echo Dot Kids.

Because expressive speech is more variable than neutral speech, expressive-speech models are prone to stability issues, such as sudden stoppages or harsh inflections. To tackle this problem, model developers might collect data that represents a dedicated style, but that is costly and time consuming. Alternatively, they might deliver a model that is not based on attention, that is, one that doesn’t focus on particular words of prior inputs when determining how to handle the current word. However, attentionless models are more complex, requiring more effort to deploy and often causing additional latency.

Our goal was to develop a highly expressive voice without increasing the burden of either data collection or model deployment. We did this in two ways: by developing new approaches for data preprocessing and by delivering models adapted to expressive speech. We also collaborated closely with user experience (UX) researchers, both before and after building our models.


To determine what training data to collect, we ran a UX study before the start of the project, in which children and their parents listened to a baseline voice synthesizing narrative passages. The results indicated that a slower speech rate and enhanced expressivity would improve customer experience. When recording training data, we actively controlled both the speaking rate and the expressivity level.

After we’d built our models, we did a second UX study and found that, for story reading, subjects preferred our new voice over the standard Alexa voice by a two-to-one margin.

Data curation

The instability of highly expressive voice models is due to “extreme prosody”, which is common in the reading of children’s books. Prosody is the rhythm, emphasis, melody, duration, and loudness of speech; adults reading to young children will often exaggerate inflections, change volume dramatically, and extend or shorten the duration of words to convey meaning and hold their listeners’ attention.

The Reading Sidekick book list screen for the Echo Show.

Although we want our dataset to capture a wide range of expressivity, some utterances may be too extreme. We developed a new approach to preprocessing training data that removes such outliers. For each utterance, we calculate the speaker embedding — a vector representation that captures prosodic features of the speaker’s voice. If the distance between a given speaker embedding and the average one is too large, we discard the utterance from the training set.
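The outlier-removal step can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: it assumes the embeddings are already computed, uses Euclidean distance (the article does not name a metric), and the function name `filter_outlier_utterances` and the `max_distance` threshold are hypothetical.

```python
import numpy as np


def filter_outlier_utterances(embeddings, max_distance):
    """Return (indices to keep, distance of each utterance to the mean embedding).

    Assumption: Euclidean distance and a fixed threshold; the source
    specifies neither.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Average embedding over the whole recording set.
    centroid = embeddings.mean(axis=0)
    # Distance of each utterance embedding from that average.
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    # Keep only utterances that are close enough to the average.
    keep = np.flatnonzero(distances <= max_distance)
    return keep, distances


# Toy data: four similar utterances and one extreme one.
toy_embeddings = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [5.0, 5.0],
]
kept, dists = filter_outlier_utterances(toy_embeddings, max_distance=2.0)
print(kept)
```

In practice the threshold would need tuning: too tight and the dataset loses the expressivity it was recorded for, too loose and the extreme utterances stay in.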

Next, from each speech sample, we remove segments that cannot be automatically transcribed from audio to text. Since most such segments are dead air, removing them prevents the model from pausing too long between words.
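A simplified sketch of this trimming step is shown below. It assumes each recording has already been split into segments and passed through a speech recognizer, with an empty or missing transcript marking a segment that could not be transcribed. The `Segment` type and `drop_untranscribed` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Segment:
    start: int                  # first sample index (inclusive)
    end: int                    # last sample index (exclusive)
    transcript: Optional[str]   # None or "" when no text was recognized


def drop_untranscribed(samples: Sequence[float], segments: List[Segment]) -> List[float]:
    """Concatenate only the segments that produced a non-empty transcript."""
    kept: List[float] = []
    for seg in segments:
        if seg.transcript and seg.transcript.strip():
            kept.extend(samples[seg.start:seg.end])
    return kept


# Toy waveform with a stretch of silence in the middle.
audio = [0.1, 0.2, 0.0, 0.0, 0.0, 0.3, 0.4]
segments = [
    Segment(0, 2, "once upon"),
    Segment(2, 5, None),        # dead air: nothing transcribed
    Segment(5, 7, "a time"),
]
trimmed = drop_untranscribed(audio, segments)
print(trimmed)
```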

Modeling

On the modeling side, we use regularization and data augmentation to increase stability. A neural-network-based text-to-speech (NTTS) system consists of two components: (1) a mel-spectrogram generator and (2) a vocoder. The mel-spectrogram generator takes as input a sequence of phones (the shortest phonetic units) and outputs a mel-spectrogram, which represents the amplitude of the signal at audible frequencies. It is responsible for the prosody of the voice.

The vocoder adds phase information to the mel-spectrogram, to create the synthetic speech signal. Without the phase information, the speech would be robotic. Our team previously developed a universal vocoder that works well for this application.

During training, we apply an L2 penalty to the weights of the mel-spectrogram generator; that is, large weights are assessed a penalty during training, and the penalty varies with the square of each weight’s magnitude. This is a form of regularization, which reduces overfitting on the recording data.
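In its common form, an L2 penalty adds the sum of the squared weights, scaled by a coefficient, to the training loss. The sketch below shows that arithmetic on toy values. The coefficient and the function names are assumptions, since the article does not give the value or the framework used.

```python
import numpy as np


def l2_penalty(weights, coefficient):
    """Sum of squared weights over all parameter arrays, times a coefficient."""
    return coefficient * sum(float(np.sum(np.square(w))) for w in weights)


def regularized_loss(data_loss, weights, coefficient=1e-4):
    """Training objective: data-fitting loss plus the L2 penalty."""
    return data_loss + l2_penalty(weights, coefficient)


# Toy parameters standing in for the generator's weight tensors.
toy_weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0])]
total = regularized_loss(data_loss=0.8, weights=toy_weights, coefficient=0.01)
print(total)
```

Because the penalty grows quadratically, a few very large weights cost more than many small ones, which discourages the model from fitting individual extreme recordings too closely.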

We also use data augmentation to improve the output voice. We add neutral recordings to the training recordings, providing less extreme prosodic trajectories for the model to learn from.

As an additional input, for both types of training data, we provide the model with a style ID, which helps it learn to distinguish the storytelling style from other styles available through Alexa. The combination of recording, processing, and regularization makes the model stable.

The text-to-speech processing pipeline, with style ID as an input.
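One way to picture the augmented training set is as a single pool of examples, each tagged with the style it came from. The sketch below is illustrative only: the integer style values, the `TrainingExample` fields, and the file paths are all hypothetical, and the article does not describe how the two kinds of recordings are actually mixed or sampled.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical style identifiers; the real values are not given in the article.
STYLE_NEUTRAL = 0
STYLE_STORYTELLING = 1


@dataclass
class TrainingExample:
    phones: List[str]   # input phone sequence
    audio_path: str     # recording used to derive the target mel-spectrogram
    style_id: int       # extra model input marking the speaking style


def build_training_set(
    storytelling: List[Tuple[List[str], str]],
    neutral: List[Tuple[List[str], str]],
    seed: int = 0,
) -> List[TrainingExample]:
    """Merge expressive and neutral recordings, tagging each with its style ID."""
    examples = [TrainingExample(p, a, STYLE_STORYTELLING) for p, a in storytelling]
    examples += [TrainingExample(p, a, STYLE_NEUTRAL) for p, a in neutral]
    # Shuffle so that batches contain both styles.
    random.Random(seed).shuffle(examples)
    return examples


training_set = build_training_set(
    storytelling=[(["W", "AH", "N", "S"], "story_0001.wav")],
    neutral=[(["HH", "AH", "L", "OW"], "neutral_0001.wav"),
             (["Y", "EH", "S"], "neutral_0002.wav")],
)
print([ex.style_id for ex in training_set])
```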

Evaluation

To evaluate the Reading Sidekick voice, we asked adult crowdsourced testers which voice they preferred for reading stories to children. The standard Alexa voice was our baseline. We tested 100 short passages with a mean duration of around 15 seconds, each of which was evaluated 30 times by different crowdsourced testers. The testers were native speakers of English; no other constraint was imposed on the tester selection.

Participants in a user study preferred the new storytelling voice to Alexa's standard voice by a two-to-one margin.

The results favor the Reading Sidekick voice by a large margin (61.16% for Reading Sidekick vs. 30.46% for the baseline, with p < 0.001), particularly considering the very noisy nature of crowdsourced evaluations and the fact that we did not discard any of the data received.
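The article does not say which statistical test produced the reported significance level. As one possible illustration, a two-sided sign test on the non-tied votes can be sketched as below, using a normal approximation with continuity correction. The function name and the toy counts are assumptions and are not the study's data.

```python
import math


def sign_test_p_value(prefer_a: int, prefer_b: int) -> float:
    """Two-sided sign test (normal approximation, continuity-corrected).

    Ties ("no preference" votes) must be excluded by the caller.
    """
    n = prefer_a + prefer_b
    if n == 0:
        return 1.0
    # Under the null hypothesis each non-tied vote is a fair coin flip,
    # so the count for A has mean n/2 and variance n/4.
    diff = abs(prefer_a - n / 2.0)
    z = max(diff - 0.5, 0.0) / math.sqrt(n * 0.25)
    return math.erfc(z / math.sqrt(2.0))


# Toy counts only: 61 votes for one voice, 30 for the other, ties dropped.
p_value = sign_test_p_value(61, 30)
print(p_value)
```

With thousands of ratings rather than a hundred, the same proportions would yield a far smaller p-value, since the standard error shrinks with the square root of the number of votes.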

Thanks to Marco Nicolis and Arnaud Joly for their contributions to this research.

About the Author
Elena Sokolova is an applied-science manager in the Amazon Text-to-Speech group.
