Scalable framework lets multiple text-to-speech models coexist

Thanks to a set of simple abstractions, models with different architectures can be integrated and optimized for particular hardware accelerators.

Voice agents like Alexa often have a variety of different speech synthesizers, which differ in attributes such as expressivity, personality, language, and speaking style. The machine learning models underlying these different applications can have completely different architectures, and integrating those architectures in a single voice service can be a time-consuming and challenging process.

To make that process easier and faster, Amazon’s Text-to-Speech group has developed a universal model integration framework that allows us to customize production voice models in a quick and scalable way.

Model variety

State-of-the-art voice models typically use two large neural networks to synthesize speech from text inputs.

The first network, called an acoustic model, takes text as input and generates a mel-spectrogram, an image that represents acoustic parameters such as pitch and energy of speech over time. The second network, called a vocoder, takes the mel-spectrogram as an input and produces an audio waveform of speech as the final output.

While we have released a universal architecture for the vocoder model that supports a wide variety of speaking styles, we still use different acoustic-model architectures to generate this diversity of speaking styles.

The most common architecture for the acoustic model relies on an attention mechanism, which learns which elements of the input text are most relevant to the current time slice — or “frame” — of the output spectrogram. With this mechanism, the network implicitly models the speech duration of different chunks of the text.

The same model is also autoregressive: the previously generated frame of speech is used as an input to produce the next one. (During training, the technique of “teacher-forcing” feeds in the ground-truth previous frame instead.) While such an architecture can generate expressive and natural-sounding speech, it is prone to intelligibility errors such as mumbling, dropped words, or repeated words, and errors easily compound from one frame to the next.

More-modern architectures address these issues by explicitly modeling the durations of text chunks and generating speech frames in parallel, which is more efficient and stable than relying on previously generated frames as input. To align the text and speech sequences, the model simply “upsamples”, or repeats its encoding of a chunk of text (its representation vector), for as many speech frames as are dictated by the external duration model.
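
The upsampling step can be illustrated with a minimal sketch. It is not the production implementation: the function name `upsample`, the toy two-dimensional encodings, and the duration values are all assumptions made for illustration. Each text chunk's representation vector is simply repeated for the number of frames that a duration model would predict.

```python
from typing import List, Sequence


def upsample(encodings: Sequence[Sequence[float]],
             durations: Sequence[int]) -> List[List[float]]:
    """Repeat each text-chunk encoding for its predicted number of frames.

    encodings: one representation vector per text chunk.
    durations: one non-negative integer frame count per text chunk.
    """
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per text chunk")
    frames: List[List[float]] = []
    for vector, n_frames in zip(encodings, durations):
        # Copy the vector once per output frame it should cover.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Three hypothetical text chunks with toy 2-dimensional encodings.
encodings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
durations = [2, 1, 3]  # frames per chunk, as a duration model might predict

frames = upsample(encodings, durations)
```

The result has one row per speech frame (six here), so the text and speech sequences are aligned before any frame is decoded, and the decoder no longer depends on its own previous output.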

The continuous evolution of complex TTS models employed in different contexts — such as Alexa Q&A, storytelling for children, and smart-home automation — creates the need for a scalable framework that can handle them all.

The challenge of integration

To integrate acoustic models into production, we need a component that takes an input text utterance and returns a mel-spectrogram. The first difficulty is that speech is usually generated in sequential chunks, rather than being synthesized all at once. To minimize latency, our framework should return data as quickly as possible. A naive solution that wraps the whole model in code and processes everything with a single function call will be unacceptably slow.

Another challenge is adjusting the model to work with various hardware accelerators. As an example, to benefit from the high-performance AWS Inferentia runtime, we need to ensure that all tensors have fixed sizes (set once, during the model compilation phase). This means that we need to

  • add logic that splits longer utterances into smaller chunks that fit specific input sizes (depending on the model);
  • add logic that ensures proper padding; and
  • decide which functionality should be handled directly by the model and which by the integration layer.

When we want to run the same model on general-purpose GPUs, we probably don’t need these changes, so it would be useful if the framework could switch easily between the two contexts. We therefore decouple the TTS model into a set of more specialized integration components that together handle all the required logic.
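
The splitting and padding logic required for fixed-size tensors can be sketched as follows. This is only an illustration under stated assumptions: `split_and_pad`, `PAD_ID`, and the chunk size are hypothetical names and values, and a real system would likely split at linguistically sensible boundaries rather than at fixed token counts.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id


def split_and_pad(token_ids: List[int],
                  chunk_size: int) -> List[Tuple[List[int], int]]:
    """Split a token sequence into fixed-size chunks, padding the last one.

    Returns (padded_chunk, valid_length) pairs so that a later stage can
    discard any output that corresponds to padding.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[Tuple[List[int], int]] = []
    for start in range(0, len(token_ids), chunk_size):
        piece = token_ids[start:start + chunk_size]
        valid = len(piece)
        # Every chunk handed to the compiled model has exactly chunk_size items.
        chunks.append((piece + [PAD_ID] * (chunk_size - valid), valid))
    return chunks


chunks = split_and_pad([11, 12, 13, 14, 15, 16, 17], chunk_size=3)
```

Keeping the valid length next to each padded chunk is one way to let the integration layer, rather than the model, decide which outputs to drop; on a GPU the same utterance could be passed through unsplit.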

Integration components

The integration layer encapsulates the model in a set of components capable of transforming an input utterance into a mel-spectrogram. As the model usually operates in two stages — preprocessing data and generating data on demand — it is convenient to use two types of components:

  • a SequenceBlock, which takes an input tensor and returns a transformed tensor (the input can be the result of applying another SequenceBlock), and
  • a StreamableBlock, which generates data (e.g., frames) on demand. As an input it takes the results of another StreamableBlock (blocks can form a pipeline) and/or data generated by a SequenceBlock.
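
As a rough sketch of how the two abstractions might look in code, the snippet below defines them as abstract base classes with one toy subclass each. The method names `process` and `stream` and the classes `Scale` and `Chunker` are assumptions for illustration; the article does not specify the framework's actual interfaces.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class SequenceBlock(ABC):
    """Transforms a whole input sequence in a single call."""

    @abstractmethod
    def process(self, inputs: List[float]) -> List[float]:
        ...


class StreamableBlock(ABC):
    """Produces output incrementally, a few frames at a time."""

    @abstractmethod
    def stream(self, inputs: List[float]) -> Iterator[List[float]]:
        ...


class Scale(SequenceBlock):
    """Toy SequenceBlock: multiplies every element by a constant."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def process(self, inputs: List[float]) -> List[float]:
        return [x * self.factor for x in inputs]


class Chunker(StreamableBlock):
    """Toy StreamableBlock: yields its input in chunks of `size` frames."""

    def __init__(self, size: int) -> None:
        self.size = size

    def stream(self, inputs: List[float]) -> Iterator[List[float]]:
        for start in range(0, len(inputs), self.size):
            yield inputs[start:start + self.size]


encoded = Scale(2.0).process([1.0, 2.0, 3.0])
stream = Chunker(2).stream(encoded)
first_chunk = next(stream)  # available before the remaining frames are produced
remaining = list(stream)
```

The generator-based `stream` method reflects the latency concern raised earlier: a caller can start consuming the first frames without waiting for the whole utterance.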

These simple abstractions offer great flexibility in creating variants of acoustic models. Here’s an example:

An example of an acoustic model built using the SequenceBlock and StreamableBlock abstractions.

The acoustic model consists of

  • two encoders (SequenceBlocks), which convert the input text embedding into one-dimensional representation tensors, one for encoded text and one for predicted durations;
  • an upsampler (a StreamableBlock, which takes the encoders’ results as an input), which creates intermediary, speech-length sequences, according to the data returned by the encoders; and 
  • a decoder (a StreamableBlock), which generates mel-spectrogram frames.

The whole model is encapsulated in a specialized StreamableBlock called StreamablePipeline, which contains exactly one SequenceBlock and one StreamableBlock:

  • the SequenceBlockContainer is a specialized SequenceBlock that consists of a set of nested SequenceBlocks capable of running neural-network encoders; and
  • the StreamableStack is a specialized StreamableBlock that decodes outputs from the upsampler and creates mel-spectrogram frames.

The integration framework ensures that all components are run in the correct order, and depending on the specific versions of components, it allows for the use of various hardware accelerators.
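
The order of execution described above (encoders once, then upsampler and decoder on demand) can be sketched with plain Python generators. All functions here are toy stand-ins with assumed names; the arithmetic merely takes the place of the neural networks, and the constant duration of two frames per token is invented for the example.

```python
from typing import Iterator, List, Tuple


def encoders(text_ids: List[int]) -> Tuple[List[float], List[int]]:
    """Stand-in for the two encoders: one encoding and one duration per token."""
    encoded = [float(t) for t in text_ids]
    durations = [2 for _ in text_ids]  # toy constant: 2 frames per token
    return encoded, durations


def upsampler(encoded: List[float], durations: List[int]) -> Iterator[float]:
    """Lazily repeats each encoding for its predicted number of frames."""
    for value, n_frames in zip(encoded, durations):
        for _ in range(n_frames):
            yield value


def decoder(upsampled: Iterator[float],
            frames_per_call: int = 3) -> Iterator[List[float]]:
    """Stand-in decoder: consumes upsampled values, emits small batches of frames."""
    batch: List[float] = []
    for value in upsampled:
        batch.append(value * 0.5)  # placeholder for the neural-network decoder
        if len(batch) == frames_per_call:
            yield batch
            batch = []
    if batch:  # flush a final, shorter batch
        yield batch


def streamable_pipeline(text_ids: List[int]) -> Iterator[List[float]]:
    """Runs the sequence stage once, then streams frames from the stacked stages."""
    encoded, durations = encoders(text_ids)
    return decoder(upsampler(encoded, durations))


batches = list(streamable_pipeline([1, 2, 3]))
```

Because the upsampler and decoder are chained lazily, each batch of frames is produced only when the caller asks for it, which mirrors the role of the StreamableStack inside the StreamablePipeline.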

The integration layer

The acoustic model is provided as a plugin, which we call an “addon”. An addon consists of exported neural networks, each represented as a named set of symbols and parameters (encoder, decoder, etc.), along with configuration data. One of the configuration attributes, called “stack”, specifies how integration components should be connected together to build a working integration layer. Here’s the code for the stack attribute that describes the architecture above:

{
  "stack": [
    {
      "type": "StreamablePipeline",
      "sequence_block": {"type": "Encoders"},
      "streamable_block": {
        "type": "StreamableStack",
        "stack": [
          {"type": "Upsampler"},
          {"type": "Decoder"}
        ]
      }
    }
  ]
}

This definition will create an integration layer consisting of a StreamablePipeline with

  • all encoders specified in the addon (the framework will automatically create all required components);
  • an upsampler, which generates intermediate data for the decoder; and
  • the decoder specified in the addon, which generates the final frames.

The JSON format allows us to make easy changes. For example, we can create a specialized component that runs all sequence blocks in parallel on a specific hardware accelerator and name it CustomizedEncoders. In this case, the only change in the configuration specification is to replace the name “Encoders” with “CustomizedEncoders”.
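
One plausible way to turn such a configuration into components is a name-to-class registry with a recursive builder. The sketch below is an assumption about how this could work, not the framework's actual code: `REGISTRY`, `Component`, and `build` are hypothetical, and the component classes are empty placeholders.

```python
from typing import Any, Dict

REGISTRY: Dict[str, type] = {}  # hypothetical mapping from "type" names to classes


class Component:
    """Base class for this sketch; stores already-built child components."""

    def __init__(self, **children: Any) -> None:
        self.children = children

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls  # register each subclass under its class name


class Encoders(Component): ...
class CustomizedEncoders(Encoders): ...
class Upsampler(Component): ...
class Decoder(Component): ...
class StreamableStack(Component): ...
class StreamablePipeline(Component): ...


def build(spec: Any) -> Any:
    """Recursively turn a config fragment into component instances."""
    if isinstance(spec, list):
        return [build(item) for item in spec]
    if isinstance(spec, dict) and "type" in spec:
        kwargs = {key: build(value) for key, value in spec.items() if key != "type"}
        return REGISTRY[spec["type"]](**kwargs)
    return spec


config = {
    "stack": [
        {"type": "StreamablePipeline",
         "sequence_block": {"type": "Encoders"},
         "streamable_block": {"type": "StreamableStack",
                              "stack": [{"type": "Upsampler"}, {"type": "Decoder"}]}}
    ]
}
layer = build(config["stack"])

# Swapping the encoder implementation only requires changing one name.
config["stack"][0]["sequence_block"]["type"] = "CustomizedEncoders"
custom_layer = build(config["stack"])
```

With this design, adding a new component means defining one more subclass; the configuration and the builder stay unchanged apart from the name.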

Running experiments using components with additional diagnostic or digital-signal-processing effects is also straightforward. A new component’s only requirement is that it extend one of the two generic abstractions, SequenceBlock or StreamableBlock; there are no other restrictions. Even replacing one StreamableBlock with a whole nested sequence-to-sequence stack is consistent with the framework design.

This framework is already used in production. It is a vital pillar of our recent, successful integration of state-of-the-art TTS architectures (without attention) and legacy models.

Acknowledgments: Daniel Korzekwa
