At Amazon, we’re excited to introduce the new AI-powered Dialogue Boost technology available on select Echo smart speakers and Fire TV devices. Dialogue Boost enhances the clarity of movie and TV dialogue while adaptively suppressing background music and sound effects. Thanks to machine learning and advanced audio separation techniques, Dialogue Boost helps people hear conversations in their favorite TV shows, movies, and podcasts without having to blast the volume. Dialogue Boost can improve the viewing experience for all our customers, but it’s especially useful for the nearly 20% of the global population with hearing loss.
Dialogue Boost originally launched on Prime Video in 2022; the new version leverages breakthroughs in deep-neural-network compression to run directly on-device, making it available for content from any app, including Netflix, YouTube, and Disney+.
Clearer dialogue for movie nights
For people with hearing loss, increasing the overall volume of a movie or TV show doesn’t make dialogue clearer, since music and other background sounds are also amplified. Most people solve this problem by using closed captions, but that isn’t the preferred viewing style for every customer.
The problem of hard-to-hear dialogue in movies has been getting worse over the last decade. This is due in part to the increased complexity and variety of modern theater and home sound systems, which means there isn’t a single mix that works well on all playback configurations.
For example, Hollywood sound editors may target a theater system with dozens of channels, including separate dialogue channels coming from the front of the theater and sound effects emanating from the sides. In the TV version, however, sound effects, music, and dialogue are all “down-mixed” into the same stereo channels, making it even harder to understand what’s being said.
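To make the fold-down concrete, here is a minimal Python sketch using the widely published ITU-style stereo downmix coefficients (the coefficients and toy signals are illustrative; the post does not specify a formula):

```python
import numpy as np

def downmix_5_1_to_stereo(left, right, center, lfe, left_s, right_s):
    # Common ITU-style fold-down: the dedicated center (dialogue) channel
    # is attenuated by ~3 dB and summed into the same two channels as the
    # music and effects, so speech loses its dedicated channel.
    c = 0.7071  # ~ -3 dB
    out_left = left + c * center + c * left_s
    out_right = right + c * center + c * right_s
    return np.stack([out_left, out_right])  # the LFE channel is often dropped

# Toy one-second example at 16 kHz: "dialogue" in the center channel
# competes with broadband "effects" everywhere else after the fold-down.
sr = 16_000
t = np.arange(sr) / sr
dialogue = 0.5 * np.sin(2 * np.pi * 220 * t)
effects = 0.2 * np.random.randn(sr)
stereo = downmix_5_1_to_stereo(effects, effects, dialogue,
                               np.zeros(sr), effects, effects)
```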
Sound source separation
We realized that, to improve our customers’ experience, we needed a way to suppress the music and sound effects while boosting the dialogue. We achieve this using a sound source separation system that processes audio in several stages.
The first stage is analysis, where the incoming audio stream is transformed into a time-frequency representation, which maps energy in different frequency bands against time.
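The post doesn’t name the transform, but a short-time Fourier transform (STFT) is the standard way to build such a time-frequency representation. The sketch below uses SciPy; the sample rate and window length are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

sr = 16_000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(sr)

# Map the waveform onto a time-frequency grid: each column is a short
# (~32 ms) frame, each row a frequency band; the magnitude is the energy
# per band per frame that the separation model sees.
freqs, frames, Z = stft(audio, fs=sr, nperseg=512, noverlap=256)
magnitude = np.abs(Z)
print(magnitude.shape)  # (257 frequency bins, ~65 time frames)
```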
The next stage involves a neural network trained on thousands of hours of speech covering various languages, accents, and recording conditions, combined with diverse sound effects and background noises. This model analyzes the time-frequency representation in real time to distinguish speech from other sounds.
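The post doesn’t describe the network’s output, but a common design in speech enhancement, assumed here for illustration, is a time-frequency mask that the model predicts and that scales each cell of the representation:

```python
import numpy as np

def apply_speech_mask(mixture_magnitude, mask):
    # Mask values near 1 mark cells dominated by speech; values near 0
    # mark music and effects. Multiplying suppresses non-speech energy
    # while leaving speech-dominated cells largely untouched.
    return mixture_magnitude * mask

rng = np.random.default_rng(0)
spec = rng.random((257, 64))   # |STFT| of the mixture (bins x frames)
mask = rng.random((257, 64))   # in practice, the network's prediction
speech_only = apply_speech_mask(spec, mask)
```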
Two key innovations allowed the team to bring Dialogue Boost to Fire TV Sticks and Echo smart speakers: a more efficient separation architecture that processes audio in frequency sub-bands and a training methodology that relies on pseudo-labeling, where a model is fine-tuned on data that it has labeled itself.
Sub-band processing
Many existing networks process all frequency content together through temporal sequence modeling, which is similar to token sequence modeling in LLMs and computationally intensive.
Dividing the audio spectrum into frequency sub-bands enables inference to be parallelized, and each sub-band needs to be processed only along the time axis, a much simpler computational task. We also implemented a lightweight bridging module to merge sub-bands, improving cross-band consistency.
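As a rough illustration of this idea (not the production architecture), the PyTorch sketch below splits the frequency axis into bands, runs a small recurrent model per band, and merges the bands with a single linear bridging layer; all names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class SubBandSeparator(nn.Module):
    """Toy sub-band separator: each frequency band gets its own small
    temporal model, so the expensive sequence modeling runs over short
    feature vectors, in parallel across bands; a light bridging layer
    then mixes information across bands for consistency."""

    def __init__(self, n_freq=256, n_bands=8, hidden=32):
        super().__init__()
        assert n_freq % n_bands == 0
        self.band_width = n_freq // n_bands
        self.band_rnns = nn.ModuleList(
            [nn.GRU(self.band_width, hidden, batch_first=True)
             for _ in range(n_bands)])
        self.bridge = nn.Linear(n_bands * hidden, n_bands * hidden)
        self.mask_head = nn.Linear(n_bands * hidden, n_freq)

    def forward(self, spec):                      # spec: (batch, time, freq)
        bands = spec.split(self.band_width, dim=-1)
        feats = [rnn(b)[0] for rnn, b in zip(self.band_rnns, bands)]
        merged = self.bridge(torch.cat(feats, dim=-1))  # cross-band mixing
        return torch.sigmoid(self.mask_head(merged))    # speech mask

mask = SubBandSeparator()(torch.rand(1, 64, 256))       # -> (1, 64, 256)
```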
This architecture enables our model to match or surpass the previous state of the art, competing with much larger models while using less than 1% as many operations and about 2% as many model parameters.
Pseudo-labeling
In most prior work, training relied heavily on synthetic mixtures of speech, background sounds, and effects. But this synthetic data didn’t cover all real-world conditions, such as live broadcasts and music events.
Inspired by recent work on training multimodal LLMs, where state-of-the-art models benefit from pseudo-labeling pipelines, we created a system that generates training targets for real media content, better handling these rare scenarios. First, we train a large, powerful model on synthetic data and use it to extract speech signals from real data. Then we combine the pseudo-labeled real data with synthetic data and retrain the model.
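A single round of this recipe might look like the sketch below, where `pseudo_label_round`, the identity stand-in for the trained teacher, and the tensor shapes are all hypothetical:

```python
import torch

def pseudo_label_round(teacher, real_mixtures, synthetic_pairs):
    # The teacher, trained on synthetic mixtures, estimates the speech
    # in real recordings; those estimates become training targets that
    # are pooled with the synthetic pairs for the next training round.
    with torch.no_grad():
        pseudo_targets = [teacher(x) for x in real_mixtures]
    return synthetic_pairs + list(zip(real_mixtures, pseudo_targets))

teacher = torch.nn.Identity()                   # stand-in for the big model
real = [torch.rand(64, 257) for _ in range(4)]  # unlabeled real media
synthetic = [(torch.rand(64, 257), torch.rand(64, 257))]
train_set = pseudo_label_round(teacher, real, synthetic)
```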
This process continues until further training epochs no longer improve the model’s accuracy. At this point, in a process known as knowledge distillation, we use the fully trained large model to generate training targets for a model that’s small and efficient enough to process audio signals in real time.
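The distillation step itself can be sketched in a few lines; the layer sizes and mean-squared-error loss are illustrative assumptions, with the frozen teacher’s output serving as the student’s training target:

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(257, 1024), nn.ReLU(), nn.Linear(1024, 257))
student = nn.Sequential(nn.Linear(257, 64), nn.ReLU(), nn.Linear(64, 257))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

mixture = torch.rand(32, 257)        # a batch of spectrogram frames
with torch.no_grad():                # the teacher is frozen
    target = teacher(mixture)        # its output becomes the target
loss = nn.functional.mse_loss(student(mixture), target)
opt.zero_grad()
loss.backward()
opt.step()
```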
The final stage is intelligent mixing, which goes beyond simple volume adjustment. The system combines multiple techniques to enhance dialogue while preserving the artistic intent of the original mix: it identifies speech-dominant audio channels, applies source separation to isolate dialogue, emphasizes frequency bands critical for speech intelligibility, and remixes these elements with the original audio. Viewers can adjust dialogue prominence while the system maintains overall sound quality and artistic balance.
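Here is a simplified view of the remixing idea, with the boost amount and peak normalization as assumptions (per the description above, the production system also performs channel selection and speech-band equalization):

```python
import numpy as np

def remix_with_dialogue_boost(original, dialogue_estimate, boost_db=6.0):
    # Add the separated dialogue back on top of the untouched original
    # mix, so music and effects keep their balance while speech is
    # raised by the user-selected amount.
    gain = 10 ** (boost_db / 20) - 1.0       # extra linear gain for speech
    boosted = original + gain * dialogue_estimate
    peak = np.max(np.abs(boosted))
    return boosted / peak if peak > 1.0 else boosted   # simple clip guard

mix = 0.1 * np.random.randn(16_000)
speech = 0.05 * np.random.randn(16_000)      # from the separation model
out = remix_with_dialogue_boost(mix, speech, boost_db=6.0)
```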
When Amazon Prime Video first introduced Dialogue Boost, it relied on cloud-based processing to pre-enhance audio tracks. Knowledge distillation helped us compress the original AI models to less than 1% of their size. Our models are now able to run in real time, within device constraints, while maintaining nearly identical performance to cloud-based techniques.
The listening experience
Our research shows that in discriminative listening tests, over 86% of participants preferred the clarity of Dialogue-Boost-enhanced audio to that of unprocessed audio, particularly during scenes with complex soundscapes, such as action sequences.
For users with hearing loss, our research shows 100% feature approval, with participants reporting significantly reduced listening effort while watching movies.
Customers have reported that Dialogue Boost also helps them understand whispered conversations, content with varied accents or dialects, and dialogue during action-heavy scenes, and that it lets them enjoy movies without the distraction of subtitles. The technology has proven particularly valuable for late-night viewers and for people who watch TV while others are sleeping: rather than constantly adjusting the volume or relying on subtitles, they can maintain a comfortable listening level while ensuring that dialogue remains clear and understandable.
Acknowledgements
Dialogue Boost is the result of collaboration across Amazon Lab126 and Prime Video teams. We would like to thank Gordon Han, Berkant Tacer, Phil Hilmes, Peter Korn, Rui Wang, Ali Milani, Scott Isabelle, Vimal Bhat, Linda Liu, Mohamed Omar, Lakshmi Ziskin, Rohith Mysore, and Vijaya Kumar.