Signal processor improves Echo’s bass response, loudness, and speech recognition accuracy

By Jun Yang

April 11, 2019

Multiband dynamics processing, which separately modifies volume in different frequency bands of an audio signal, is known to improve listeners’ audio experiences. But in the context of voice-controlled systems like the Amazon Echo family of products, it can also improve automatic speech recognition by making echo cancellation easier.

Traditional multiband dynamics processors (MBDPs) have a few drawbacks, however. When splitting a signal into its component frequencies, they don’t always achieve clean separation; and they tend to use fixed frequency bands, which can’t be adjusted to the characteristics of specific audio devices.

Both drawbacks can affect the listener’s perception of an audio signal’s loudness and bass response. They can also cause distortions that make echo cancellation more difficult.

At this year’s International Conference on Acoustics, Speech and Signal Processing, my colleagues and I present a novel MBDP design that addresses both these drawbacks. The technology began shipping in Alexa-enabled devices in 2017, and extensive user testing indicates that it improves listener perception of loudness and bass. In tests, it also significantly improved performance on a fundamental speech recognition task: recognizing the wake word while the device is playing audio. Moreover, the computational complexity of our MBDP system is low.

[Figure: Three waveforms. Top: the original audio signal. Middle: the signal after processing by a conventional MBDP system, with spiky deformations throughout. Bottom: the signal after processing by our novel system, which limits the distortion while better preserving the waveform’s shape.]

An MBDP has two main functions. One is compression, or keeping the ratio of a signal’s maximum volume to its minimum volume within a prescribed range. The other is peak limiting, or cutting off sudden volume spikes that can cause distortion or even cause the signal to cut out momentarily, a condition called brownout.
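As a rough illustration of those two functions, the sketch below computes a static compressor curve in decibels and a hard peak limiter. The threshold, ratio, and ceiling values are illustrative assumptions rather than parameters of the shipped system, and a production design would also smooth gain changes over time with attack and release settings.

```python
import math

def compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static compressor curve: above the threshold, every `ratio` dB of
    extra input yields only 1 dB of extra output."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

def amplitude_to_db(amplitude):
    # The small floor avoids log10(0) for silent samples.
    return 20.0 * math.log10(max(abs(amplitude), 1e-12))

def db_to_amplitude(level_db):
    return 10.0 ** (level_db / 20.0)

def compress_sample(sample, threshold_db=-20.0, ratio=4.0):
    """Apply the static curve to one sample, preserving its sign."""
    out_db = compress_db(amplitude_to_db(sample), threshold_db, ratio)
    return math.copysign(db_to_amplitude(out_db), sample)

def limit_peak(sample, ceiling=1.0):
    """Hard peak limiter: clamp a sample to [-ceiling, +ceiling]."""
    return max(-ceiling, min(ceiling, sample))
```

With these assumed settings, a level 12 dB above the threshold comes out only 3 dB above it, while levels below the threshold pass through unchanged.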

Applying different compressors and limiters to different frequency bands provides greater signal control. But it also depends on filters that can provide clean frequency separation. So the key to our system’s performance is its configurable filter-bank design.

Our filter bank consists of a cascade of filters, of which all or only a few may be used at a time. An incoming signal is split into two branches. One branch passes through two sequential high-pass filters, which filter out frequencies below a cutoff frequency, and the other branch passes through two sequential low-pass filters, which filter out frequencies above the same cutoff frequency.
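A single split of this kind can be sketched in a few lines. The example below assumes second-order Butterworth sections, with coefficients taken from the widely used audio EQ cookbook formulas; the filter type, the 1 kHz cutoff, and the 16 kHz sample rate are all assumptions made for illustration, not details of our implementation.

```python
import math

def butterworth_biquad(kind, cutoff_hz, sample_rate_hz):
    """Second-order Butterworth low-pass or high-pass coefficients."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q), Q = 1/sqrt(2)
    if kind == "low":
        b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    elif kind == "high":
        b = [(1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0]
    else:
        raise ValueError("kind must be 'low' or 'high'")
    a0 = 1.0 + alpha
    b = [coefficient / a0 for coefficient in b]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a

def run_biquad(coefficients, samples):
    """Filter a list of samples with one second-order section."""
    b, a = coefficients
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def split_two_bands(samples, cutoff_hz, sample_rate_hz):
    """Send the signal through two low-pass and two high-pass sections."""
    low_pass = butterworth_biquad("low", cutoff_hz, sample_rate_hz)
    high_pass = butterworth_biquad("high", cutoff_hz, sample_rate_hz)
    low = run_biquad(low_pass, run_biquad(low_pass, samples))
    high = run_biquad(high_pass, run_biquad(high_pass, samples))
    return low, high

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# Usage: a 100 Hz tone split at an assumed 1 kHz crossover.
SAMPLE_RATE_HZ = 16000
tone = [math.sin(2.0 * math.pi * 100.0 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ)]
low, high = split_two_bands(tone, 1000.0, SAMPLE_RATE_HZ)
```

Comparing `rms(low)` and `rms(high)` once the filters have settled shows the tone landing almost entirely in the low band, which is the clean separation the per-band processing relies on.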

The signal from the high-pass filter may be split again, and again passed to separate banks of high-pass and low-pass filters. This process may repeat an arbitrary number of times, and at each stage, the output of the low-pass filter passes to an “all-pass” filter, which leaves the magnitude of the signal unchanged but shifts its phase, enabling the synchronization of all the bands. The high-pass and low-pass cutoff frequencies may be set to arbitrary values, so that the frequency bands of the filter bank can be tailored to specific applications.
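To see why the all-pass stage matters, the sketch below evaluates the frequency response of a three-band tree and checks whether the bands sum back to a flat magnitude. It assumes Linkwitz-Riley-style branches (two cascaded second-order Butterworth sections each), which is one common choice and not necessarily the design shipped in devices; the 500 Hz and 2 kHz cutoffs and the function names are likewise hypothetical.

```python
import cmath
import math

def section_response(kind, cutoff_hz, freq_hz, sample_rate_hz):
    """Complex response of one second-order section at freq_hz
    (audio EQ cookbook coefficients with Q = 1/sqrt(2))."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)
    if kind == "low":
        b = ((1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0)
    elif kind == "high":
        b = ((1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0)
    elif kind == "all":
        b = (1.0 - alpha, -2.0 * cos_w0, 1.0 + alpha)
    else:
        raise ValueError("kind must be 'low', 'high', or 'all'")
    a = (1.0 + alpha, -2.0 * cos_w0, 1.0 - alpha)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    numerator = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    denominator = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return numerator / denominator

def three_band_sum(freq_hz, low_cut_hz, high_cut_hz, sample_rate_hz,
                   align=True):
    """Magnitude of the recombined low, mid, and high bands at freq_hz."""
    def squared(kind, cutoff_hz):
        # Two identical sections in series square the response.
        return section_response(kind, cutoff_hz, freq_hz, sample_rate_hz) ** 2

    high_branch = squared("high", low_cut_hz)   # this branch is split again
    low_band = squared("low", low_cut_hz)
    if align:
        # The all-pass gives the low band the same phase shift that the
        # second split imposes on the recombined mid and high bands.
        low_band *= section_response("all", high_cut_hz, freq_hz,
                                     sample_rate_hz)
    mid_band = high_branch * squared("low", high_cut_hz)
    high_band = high_branch * squared("high", high_cut_hz)
    return abs(low_band + mid_band + high_band)
```

Under these assumptions, the aligned bands recombine to unit magnitude at every frequency, whereas calling `three_band_sum(..., align=False)` shows a dip near the lower crossover, where the low band and the re-split high branch overlap with mismatched phase.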

[Figure: Our proposed reconfigurable filter bank.]

The signal in each frequency band passes to its own dedicated compressor and then to a limiter. At that point, the frequency-specific signals are recombined and passed to a full-band limiter, which ensures that the frequency-specific modifications don’t cause the signal as a whole to distort.
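The signal flow described above can be sketched as follows. This is a deliberately simplified, sample-by-sample version with no attack or release smoothing, and every threshold, ratio, and ceiling in it is a hypothetical value chosen for illustration.

```python
import math

def compress(sample, threshold, ratio):
    """Static compression: shrink the part of the level above the threshold
    (equivalent to dividing the excess in decibels by `ratio`)."""
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample
    compressed = threshold * (magnitude / threshold) ** (1.0 / ratio)
    return math.copysign(compressed, sample)

def limit(sample, ceiling):
    """Hard limiter: clamp a sample to [-ceiling, +ceiling]."""
    return max(-ceiling, min(ceiling, sample))

def process_bands(bands, threshold=0.5, ratio=2.0, band_ceiling=0.8,
                  full_band_ceiling=1.0):
    """bands: equal-length lists of samples, one list per frequency band."""
    shaped = [
        [limit(compress(s, threshold, ratio), band_ceiling) for s in band]
        for band in bands
    ]
    # Recombine the bands, then guard the sum with the full-band limiter.
    mixed = [sum(column) for column in zip(*shaped)]
    return [limit(s, full_band_ceiling) for s in mixed]

# Usage: two bands that peak at the same instant.
output = process_bands([[0.9, 0.2], [0.9, -0.1]])
```

Each band stays under its own ceiling here, yet their sum at the first sample would exceed full scale, which is why the final full-band limiter is still needed.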

Echo cancellation systems like the one found in Amazon Echo devices subtract a known audio signal — the electrical signal sent to the device’s loudspeaker — from the signal received by the device’s microphones. The more distortion the audio signal suffers, the less it will resemble the reference signal, and the less successful the subtraction will be.
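A toy example makes the effect concrete. The sketch below models the room as a single hypothetical gain and fits that gain by least squares, whereas real echo cancellers use adaptive multi-tap filters; the test tone, the clipping level, and the function names are assumptions for illustration only.

```python
import math

def clip(sample, ceiling):
    return max(-ceiling, min(ceiling, sample))

def residual_power(mic, reference):
    """Fit one gain by least squares, subtract the scaled reference from
    the microphone signal, and return the mean power of what is left."""
    gain = (sum(m * r for m, r in zip(mic, reference))
            / sum(r * r for r in reference))
    residual = [m - gain * r for m, r in zip(mic, reference)]
    return sum(e * e for e in residual) / len(residual)

SAMPLE_RATE_HZ = 8000
reference = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE_HZ)
             for n in range(SAMPLE_RATE_HZ)]

ROOM_GAIN = 0.5  # hypothetical loudspeaker-to-microphone gain

# Case 1: the loudspeaker reproduces the reference faithfully.
mic_clean = [ROOM_GAIN * r for r in reference]
# Case 2: the playback path distorts (here, hard clipping) before the
# sound reaches the microphone.
mic_distorted = [ROOM_GAIN * clip(r, 0.6) for r in reference]

clean_residual = residual_power(mic_clean, reference)
distorted_residual = residual_power(mic_distorted, reference)
```

In the first case the subtraction removes essentially all of the echo; in the second, the harmonics created by the distortion have no counterpart in the reference signal, so they survive the subtraction and remain in the microphone signal.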

Our MBDP system reduces distortion in three ways. First, the greater precision of the filter bank enables better control of the compression ratios in different frequencies. That means that the system can reduce a loudspeaker’s total harmonic distortion without compromising the overall loudness and bass response of the audio signal.

Second, the frequency-specific and full-band peak limiters ensure that the loudspeaker stays in its “linear dynamic range,” meaning that the sound pressure level doesn’t exceed the threshold at which it will begin to cause distortion.

The linear dynamic range is a mechanical property of the loudspeaker. But the electrical signal can become distorted before it even reaches the loudspeaker, if the amplifier attempts to output too high a voltage. This is known as clipping, and the full-band limiter prevents it as well, which is the third way our system reduces distortion.
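The distortion that clipping introduces can be quantified as total harmonic distortion (THD), the ratio of the energy in the harmonics to the energy in the fundamental. The sketch below estimates THD for a clean tone and for the same tone clipped at a hypothetical amplifier ceiling; the 200 Hz tone, the ceiling of 0.6, and the choice to sum harmonics 2 through 9 are illustrative assumptions.

```python
import math

def component_amplitude(samples, cycles):
    """Amplitude of the sinusoid that completes `cycles` whole periods over
    the buffer, found by correlating with a cosine and a sine."""
    n = len(samples)
    real = sum(s * math.cos(2.0 * math.pi * cycles * k / n)
               for k, s in enumerate(samples))
    imag = sum(s * math.sin(2.0 * math.pi * cycles * k / n)
               for k, s in enumerate(samples))
    return 2.0 * math.hypot(real, imag) / n

def total_harmonic_distortion(samples, fundamental_cycles, harmonics=9):
    """Ratio of the combined amplitude of harmonics 2..`harmonics` to the
    amplitude of the fundamental."""
    fundamental = component_amplitude(samples, fundamental_cycles)
    overtones = [component_amplitude(samples, h * fundamental_cycles)
                 for h in range(2, harmonics + 1)]
    return math.sqrt(sum(a * a for a in overtones)) / fundamental

SAMPLE_RATE_HZ = 8000
NUM_SAMPLES = 4000            # half a second
TONE_HZ = 200.0               # completes exactly 100 cycles in the buffer
clean = [math.sin(2.0 * math.pi * TONE_HZ * n / SAMPLE_RATE_HZ)
         for n in range(NUM_SAMPLES)]

AMPLIFIER_CEILING = 0.6       # hypothetical maximum output level
clipped = [max(-AMPLIFIER_CEILING, min(AMPLIFIER_CEILING, s)) for s in clean]

thd_clean = total_harmonic_distortion(clean, fundamental_cycles=100)
thd_clipped = total_harmonic_distortion(clipped, fundamental_cycles=100)
```

The clean tone has essentially no harmonic content, while the clipped tone shows substantial THD. A limiter that keeps the signal under the ceiling avoids entering that regime in the first place.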

We conducted extensive listening tests, in which study participants reported that audio processed using our reconfigurable MBDP scheme sounded much better and louder than audio processed using the traditional MBDP scheme. Spectral analyses also demonstrated that our system increases bass response by about five decibels.

[Figure: Our system (blue line) significantly reduced the rate at which an Echo device falsely rejected Alexa’s wake word (false reject rate, or FRR), as a function of device audio volume.]

To evaluate our system’s effect on speech recognition, we tested Echo devices’ responses to Alexa’s wake word — usually “Alexa” — when they were broadcasting audio at a range of volumes. We found that using our MBDP scheme instead of the traditional scheme significantly reduced the number of false rejects, or instances in which the Echo failed to recognize the wake word. We also found that the higher the Echo’s output volume, the greater the advantage offered by our approach.
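For reference, the false reject rate is simply the fraction of wake word utterances that the device failed to detect. The helper below is a minimal sketch with a hypothetical function name; the counts in the usage line are made-up placeholders, not results from our experiments.

```python
def false_reject_rate(wake_word_utterances, detections):
    """Fraction of spoken wake words that the device failed to recognize."""
    if wake_word_utterances <= 0:
        raise ValueError("need at least one wake word utterance")
    if not 0 <= detections <= wake_word_utterances:
        raise ValueError("detections must be between 0 and the utterance count")
    return (wake_word_utterances - detections) / wake_word_utterances

# Placeholder counts for illustration only.
example_frr = false_reject_rate(wake_word_utterances=200, detections=190)
```

Computing this rate separately at each playback volume gives the kind of curve shown in the figure above.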

Acknowledgments: Amit S. Chhetri, Carlo Murgia, Philip Hilmes

About the Author

Jun Yang

Jun Yang is a senior research scientist in Amazon Devices' Hardware Technology and Architecture group.
