Amazon Scientist Outlines Multilayer System For Smart Speaker Echo Cancellation And Voice Enhancement
Smart speakers, such as the Amazon Echo family of products, are growing in popularity among consumer and business audiences. To improve the automatic speech recognition (ASR) and full-duplex voice communication (FDVC) performance of these smart speakers, acoustic echo cancellation (AEC) and noise reduction systems are required. These systems reduce the noises and echoes that can impair operation, such as an Echo device's ability to accurately hear the wake word “Alexa.”
Existing echo cancellation schemes usually employ an adaptive linear filter (ALF) in the time, frequency, or sub-band domain to model or approximate the real acoustic echo path between the smart speaker’s loudspeaker and microphone. They then subtract the estimated echo from the microphone signal.
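To make the adaptive-filter idea concrete, here is a minimal time-domain sketch using the normalized least-mean-squares (NLMS) update, one common choice of adaptive linear filter. It is an illustration only, not the system described in the paper: the function name, tap count, step size, and the four-tap echo path are all assumptions chosen for the example.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=64, step=0.5, eps=1e-8):
    """Subtract a linearly estimated echo of far_end from mic (NLMS sketch)."""
    weights = np.zeros(num_taps)   # adaptive estimate of the echo path
    history = np.zeros(num_taps)   # recent loudspeaker samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        history = np.roll(history, 1)
        history[0] = far_end[n]
        echo_estimate = weights @ history
        out[n] = mic[n] - echo_estimate          # echo-cancelled sample
        # Normalized LMS update: step scaled by the input energy.
        weights += step * out[n] * history / (history @ history + eps)
    return out

def power_db(x):
    """Mean power of a signal in decibels."""
    return 10 * np.log10(np.mean(x ** 2))

# Hypothetical setup: white-noise playback through a short, purely linear echo path.
rng = np.random.default_rng(0)
far_end = rng.standard_normal(8000)
echo_path = np.array([0.6, 0.3, -0.2, 0.1])      # assumed room response
mic = np.convolve(far_end, echo_path)[: len(far_end)]

residual = nlms_echo_canceller(far_end, mic)

# Echo reduction measured over the last 2000 samples, after convergence.
erle_db = power_db(mic[-2000:]) - power_db(residual[-2000:])
print(f"Echo reduction after convergence: {erle_db:.1f} dB")
```

In this idealized case the echo path is linear, noise-free, and shorter than the filter, so the filter can cancel the echo almost completely. Real rooms violate all three assumptions.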
However, there is always a residual echo after the linear adaptive subtraction. This is because:
- The ALF cannot model the transfer function of the echo path with perfect accuracy;
- The length of the ALF is often insufficient; or
- There might be nonlinearity in the echo path that is impossible for the ALF to model.
Therefore, a nonlinear processing technique is necessary to further reduce the residual echo. But such additional processing can be challenging when various echoes and noises are present simultaneously.
As a result, techniques that can efficiently suppress these various types of complex echoes and noises are highly desirable. To achieve this goal, my paper, Multilayer Adaptation Based Complex Echo Cancellation and Voice Enhancement, which I presented at the recent IEEE ICASSP Conference, proposes a multilayer processing system that can significantly improve smart speaker ASR and FDVC performance.
The multilayer processing system comprises joint perceptual sub-band residual echo suppression (SBRES), sub-band noise reduction (SBNR) and adaptation-based nonlinear echo cancellation (NLEC) layers. The proposed multilayer system is shown below with a single channel, without loss of generality. For context, Amazon Echo devices are multichannel, meaning they employ more than one microphone.
In our subjective and objective testing on smart speakers powered by Alexa, this multilayer signal processing system delivered significant improvements in word error rate (WER), echo reduction, and noise reduction. Specifically, the relative WER improvements of the SBNR, SBRES, and NLEC layers were about 22%, 14%, and 17%, respectively. The echo and noise reduction improvements totaled about 40 decibels and 19 decibels, respectively. Moreover, the additional processing power required for this multilayer system is minimal, making it a viable option for smart speaker voice enhancement. In fact, for about a year now, we have utilized this multilayer noise and echo reduction approach with several Amazon Echo devices that have shipped to customers.
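For readers less familiar with these units, the short sketch below converts the decibel figures above into plain ratios and shows what a relative WER improvement means. The helper names and the 10% baseline WER are hypothetical and used purely for illustration; absolute WER values are not given here.

```python
def db_to_power_ratio(db):
    """Power ratio corresponding to a level difference in decibels."""
    return 10 ** (db / 10)

def db_to_amplitude_ratio(db):
    """Amplitude ratio corresponding to a level difference in decibels."""
    return 10 ** (db / 20)

def apply_relative_improvement(baseline_wer, relative_improvement):
    """WER remaining after a relative improvement (both given as fractions)."""
    return baseline_wer * (1 - relative_improvement)

# Figures quoted above: about 40 dB of echo reduction, about 19 dB of noise reduction.
echo_power = db_to_power_ratio(40)        # echo power divided by 10,000
echo_amplitude = db_to_amplitude_ratio(40)  # echo amplitude divided by 100
noise_power = db_to_power_ratio(19)       # noise power divided by roughly 79

# Hypothetical baseline: a 10% WER improved by 22% relative becomes about 7.8%.
example_wer = apply_relative_improvement(0.10, 0.22)

print(f"40 dB: power / {echo_power:.0f}, amplitude / {echo_amplitude:.0f}")
print(f"19 dB: power / {noise_power:.1f}")
print(f"Hypothetical WER: 10.0% -> {example_wer * 100:.1f}%")
```

A relative improvement scales the existing error rate rather than subtracting percentage points, which is why the hypothetical 10% baseline falls to about 7.8% and not to a negative number.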