Amazon Alexa’s new wake word research at Interspeech

Work aims to improve accuracy of models both on- and off-device.

Every interaction with Alexa begins with the wake word: usually “Alexa”, but sometimes “Amazon”, “Echo”, or “Computer” — or, now, “Hey Samuel”. Only after positively identifying the wake word does an Alexa-enabled device send your request to the cloud for further processing.

Six years after the announcement of the first Amazon Echo, the Alexa science team continues to develop new approaches to wake word recognition, improving Alexa’s responsiveness and accuracy.

At this year’s Interspeech, for instance, Alexa researchers presented five different papers about new techniques for wake word recognition. One of these — “Building a robust word-level wakeword verification network” — describes models that run in the cloud to confirm on-device wake word detections.

Wake word spectrogram
Because audio signals can be represented as two-dimensional mappings of frequency (y-axis) against time (x-axis), convolutional neural networks apply naturally to them.
From "Accurate detection of wake word start and end using a CNN"

Another paper, “Metadata-aware end-to-end keyword spotting”, describes a new system that uses metadata about the state of the Alexa-enabled device — such as the type of device and whether it’s playing music or sounding an alarm — to improve the accuracy of the on-device wake word detector.

The wake word detectors reported in both papers rely, at least in part, on convolutional neural networks. Originally developed for image processing, convolutional neural nets, or CNNs, repeatedly apply the same “filter” to small chunks of input data. For object recognition, for instance, a CNN might step through an image file in eight-by-eight blocks of pixels, inspecting each block for patterns associated with particular objects. 

Since audio signals can be represented as two-dimensional mappings of frequency against time, CNNs apply naturally to them as well. Each of the filters applied to a CNN’s inputs defines a channel through the first layer of the CNN, and usually, the number of channels increases with every layer.
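As a minimal sketch of the filtering idea, the loop below slides a single 8-by-8 filter over a toy spectrogram, producing one feature map (one channel). All shapes here are illustrative placeholders, not dimensions taken from the papers:

```python
import numpy as np

def conv2d_valid(spectrogram, kernel):
    """Slide one filter over a (freq x time) spectrogram with 'valid' padding."""
    kh, kw = kernel.shape
    h, w = spectrogram.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value summarizes one small block of the input,
            # just as the text describes for 8x8 pixel blocks in images.
            out[i, j] = np.sum(spectrogram[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 64-band x 100-frame "spectrogram" and a single 8x8 filter.
rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 100))
filt = rng.standard_normal((8, 8))
feature_map = conv2d_valid(spec, filt)
print(feature_map.shape)  # (57, 93)
```

A real wake word CNN would apply many such filters per layer (one channel each) and stack layers, but the sliding-block computation is the same.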

Varying norms

“Metadata-aware end-to-end keyword spotting” is motivated by the observation that if a device is emitting sound — music, synthesized speech, or an alarm sound — it causes a marked shift in the input signal’s log filter bank energies, or LFBEs. The log filter banks are a set of differently sized frequency bands chosen to emphasize the frequencies in which human hearing is most acute.

Graph showing average values of acoustic properties of wake word signals when a device is emitting sound and when it’s not.
Average values of acoustic properties — log filter-bank energies — of wake word signals as measured on-device when the device is emitting sound (orange) and when it’s not (blue).
From “Metadata-aware end-to-end keyword spotting”

To address this problem, applied scientists Hongyi Liu and Apurva Abhyankar and their colleagues include device metadata as an input to their wake word model. The model embeds the metadata, or represents it as points in a multidimensional space, such that location in the space conveys information useful to the model. The model uses the embeddings in two different ways.

One is as an additional input to the last few layers of the network, which decide whether the acoustic input signal includes the wake word. The final outputs of the convolutional layers are flattened, or strung together into a single long vector. The metadata embedding vector is fed into a fully connected layer — a layer all of whose processing nodes pass their outputs to all of the nodes of the next layer — and the output is concatenated to the flattened audio feature vector. 

This fused vector passes to a final fully connected layer, which issues a judgment about whether the audio signal contains the wake word or not.
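The fusion step can be sketched roughly as follows; the layer widths and embedding size are arbitrary placeholders, not the dimensions used in the paper:

```python
import numpy as np

def dense(x, w, b):
    """A fully connected layer: every input feeds every output."""
    return x @ w + b

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)

# Hypothetical sizes: flattened conv output (256) and metadata embedding (8).
conv_flat = rng.standard_normal(256)
meta_embed = rng.standard_normal(8)

# Project the metadata embedding through a fully connected layer...
w_meta, b_meta = rng.standard_normal((8, 16)), np.zeros(16)
meta_features = dense(meta_embed, w_meta, b_meta)

# ...concatenate it with the flattened audio features...
fused = np.concatenate([conv_flat, meta_features])  # shape (272,)

# ...and let a final fully connected layer score wake word presence.
w_out, b_out = rng.standard_normal((272, 1)), np.zeros(1)
score = sigmoid(dense(fused, w_out, b_out))
print(fused.shape)  # (272,)
```

In a trained model the weights would of course be learned, and the score thresholded to accept or reject the detection.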

The other use of the metadata embedding is to modulate the outputs of the convolutional layers while they’re processing the input signal. The filters that a CNN applies to inputs are learned during training, and they can vary greatly in size. Consequently, the magnitude of the values passing through the network’s various channels can vary as well.

With CNNs, it’s common practice to normalize the channels’ outputs between layers, so that they’re all on a similar scale, and no one channel swamps the others. But Liu, Abhyankar, and their colleagues train their model to vary the normalization parameters depending on the metadata vector, which improves the network’s ability to generalize to heterogeneous data sets.
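This kind of metadata-conditioned normalization can be sketched in the spirit of conditional-normalization schemes; the linear mapping from embedding to per-channel scale and shift parameters is an illustrative assumption, not the paper’s exact formulation:

```python
import numpy as np

def metadata_conditioned_norm(x, meta_embed, w_gamma, w_beta, eps=1e-5):
    """Normalize each channel, then scale and shift it with parameters
    predicted from the metadata embedding."""
    mean = x.mean(axis=(1, 2), keepdims=True)      # per-channel statistics
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)        # standard normalization
    gamma = (meta_embed @ w_gamma)[:, None, None]  # one scale per channel
    beta = (meta_embed @ w_beta)[:, None, None]    # one shift per channel
    return gamma * x_hat + beta

rng = np.random.default_rng(2)
channels, meta_dim = 4, 8
x = rng.standard_normal((channels, 16, 16))        # (channels, freq, time)
meta = rng.standard_normal(meta_dim)
w_gamma = rng.standard_normal((meta_dim, channels))
w_beta = rng.standard_normal((meta_dim, channels))
out = metadata_conditioned_norm(x, meta, w_gamma, w_beta)
print(out.shape)  # (4, 16, 16)
```

With fixed weights this reduces to ordinary normalization; letting the metadata choose gamma and beta is what allows the network to treat, say, music-playback audio differently from silence.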

The researchers believe that this model better captures the characteristics of the input audio signal when the Alexa-enabled device is emitting sound. In their paper, they report experiments showing that, on average, a model trained with metadata information achieves a 14.6% improvement in false-reject rate relative to a baseline CNN model.

Paying attention

The metadata-aware wake word detector runs on-device, but the next paper describes models that run in the cloud. On-device models must have small memory footprints, which means that they sacrifice some processing power. If an on-device model thinks it has detected a wake word, it sends a short snippet of audio to the cloud for confirmation by a larger, more powerful model.

The on-device model tries to identify the start of the wake word, but sometimes it misses slightly. To ensure that the cloud-based model receives the whole wake word, the snippet sent by the device includes the half-second of audio preceding the device’s estimate of the wake word’s start.
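A back-of-the-envelope version of that windowing looks like the following, assuming a 16 kHz sample rate (the actual rate and buffer handling are not specified in the text):

```python
# Hypothetical snippet extraction: back up half a second from the
# estimated wake word start so the cloud model sees the whole word.
SAMPLE_RATE = 16_000          # assumed sample rate (samples per second)
LEAD = SAMPLE_RATE // 2       # half a second of preceding audio

def snippet_bounds(est_start, est_end, n_samples, lead=LEAD):
    """Return (start, end) sample indices, clamped to the audio buffer."""
    start = max(0, est_start - lead)
    end = min(n_samples, est_end)
    return start, end

# e.g., the detector thinks the wake word spans samples 20,000-32,000
print(snippet_bounds(20_000, 32_000, 48_000))  # (12000, 32000)
```

The clamping matters at buffer edges: if the estimated start is within half a second of the beginning of the buffer, the snippet simply starts at sample zero.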

Model depicting variations in alignment of wake word signals sent to the cloud for verification.
Wake word signals sent to the cloud for verification vary in the quality of their alignment. Sometimes, in trying to identify the start of the wake word, the device misses by a fraction of a second, which can cause difficulty for cloud models trained on well-aligned data.
From “Building a robust word-level wakeword verification network”

When CNNs are trained on well-aligned data, convolutional-layer outputs that focus on particular regions of the input can become biased toward finding wake word features in those regions. This can result in weaker performance when the alignment is noisy.

In “Building a robust word-level wakeword verification network”, applied scientist Rajath Kumar and his colleagues address this problem by adding recurrent layers to their network, to process the outputs of the convolutional layers. Recurrent layers can capture information as time sequences. Instead of learning where the wake word occurs in the input, the recurrent layers learn how the sequence changes temporally when the wake word is present. 

This allows the researchers to train their network on well-aligned data without suffering much of a performance drop-off on noisy data. To further improve performance, the researchers also use an attention layer that re-weights the sequential outputs of the recurrent layers, emphasizing the outputs most relevant to wake word verification. The model is thus a convolutional-recurrent-attention (CRA) model.
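An attention layer of this general kind can be sketched as a learned softmax pooling over the recurrent outputs; the hidden width and the single-score weighting scheme here are illustrative assumptions, not the paper’s exact architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def attention_pool(hidden_states, w_att):
    """Re-weight a sequence of recurrent outputs with learned attention
    scores, then sum them into a single vector for verification."""
    scores = hidden_states @ w_att   # one scalar score per time step
    weights = softmax(scores)        # emphasis distribution over the sequence
    return weights @ hidden_states, weights

rng = np.random.default_rng(3)
T, d = 195, 32                            # 195 frames, hypothetical width
hidden = rng.standard_normal((T, d))      # stand-in for recurrent outputs
w_att = rng.standard_normal(d)
pooled, weights = attention_pool(hidden, w_att)
print(pooled.shape)  # (32,)
```

Because the weights form a distribution over time steps, the pooled vector can lean on whichever frames look most wake-word-like, wherever they fall in the 195-frame window — which is exactly what makes the model tolerant of noisy alignment.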

Diagrams indicating differences between a conventional CNN architecture and a convolutional-recurrent-attention architecture.
These diagrams indicate the differences between a conventional CNN architecture (top) and a convolutional-recurrent-attention (CRA) architecture (bottom).
From “Building a robust word-level wakeword verification network”

To evaluate their CRA model, the researchers compared its performance to that of several CNN-only models. Each example in the training data included 195 input frames, or sequential snapshots of the frequency spectrum. Within that 195-frame span, two of the CNN models looked at sliding windows of 76 frames or 100 frames. A third CNN model, and the CRA model, looked at all 195 frames. The models’ performance was assessed relative to a baseline wake word detector that combines a deep neural network with a hidden Markov model (DNN-HMM), an architecture that was the industry standard for some time.

On accurately aligned inputs, the CRA model offers only a slight improvement over the 195-frame CNN model. Compared to the baseline, the CNN model reduced the false-acceptance rate by 53%, while the CRA reduced it by 55%. On the same task, the 100-frame CNN model achieved only a 35% reduction.

Percentage decrease in false-acceptance rate (FAR) relative to the two-stage DNN-HMM baseline.

On noisily aligned inputs, the CRA model offered a much more dramatic improvement. Relative to baseline, it reduced the false-acceptance rate by 60%; the 195-frame CNN model managed only a 31% reduction, and the 100-frame model 44%.
