Amazon releases 51-language dataset for language understanding

MASSIVE dataset and Massively Multilingual NLU (MMNLU-22) competition and workshop will help researchers scale natural-language-understanding technology to every language on Earth.

Imagine that all people around the world could use voice AI systems such as Alexa in their native tongues.

Multilingual Alexa.png
The MASSIVE dataset is a step toward the creation of multilingual natural-language-understanding models that can generalize easily to new languages.

One promising approach to realizing this vision is massively multilingual natural-language understanding (MMNLU), a paradigm in which a single machine learning model can parse and understand inputs from many typologically diverse languages. By learning a shared data representation that spans languages, the model can transfer knowledge from languages with abundant training data to those in which training data is scarce.

Today we are pleased to make three announcements related to MMNLU.

First, we are releasing a new dataset called MASSIVE, which is composed of one million labeled utterances spanning 51 languages, along with open-source code, which provides examples of how to perform massively multilingual NLU modeling and allows practitioners to re-create baseline results for intent classification and slot filling that are presented in our paper..

Related content
Neural text-to-speech enables new multilingual model to use the same voice for Spanish and English responses.

Second, we are launching a new competition using the MASSIVE dataset called Massively Multilingual NLU 2022 (MMNLU-22).

And third, we will cohost a workshop at EMNLP 2022 in Abu Dhabi and online, also called Massively Multilingual NLU 2022, which will highlight the results from the competition and include presentations from invited speakers and oral and poster sessions from submitted papers on multilingual natural-language processing (NLP).

“We are very excited to share this large multilingual dataset with the worldwide language research community,” says Prem Natarajan, vice president of Alexa AI Natural Understanding. “We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”

The MASSIVE dataset

MASSIVE is a parallel dataset, meaning that every utterance is given in all 51 languages. This enables models to learn shared representations of utterances with the same intents, regardless of language, facilitating cross-linguistic training on natural-language-understanding (NLU) tasks. It also allows for adaptation to other NLP tasks such as machine translation, multilingual paraphrasing, new linguistic analyses of imperative morphologies, and more.

Related content
In experiments, multilingual models outperform monolingual models.

NLU — a subdiscipline of NLP — is a machine's ability to understand the meaning of a text and identify the relevant entities. For instance, given the utterance “What is the temperature in New York?”, an NLU model might classify the intent as “weather_query” and recognize relevant entities as “weather_descriptor: temperature” and “place_name: new york.”

Our particular focus is on NLU as a component of spoken-language understanding (SLU), in which audio is converted to text before NLU is performed. Although SLU-based virtual assistants like Alexa have made major capability advances in the past decade, academic and industrial NLU efforts worldwide are still limited to a small subset of the world's 7,000+ languages. One difficulty in creating massively multilingual NLU models is the lack of labeled data for training and evaluation — particularly data that is realistic for a given task and natural for a given language. High naturalness typically requires human vetting, which is often costly.

MASSIVE — Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation — contains one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize or translate the English-only SLURP dataset into 50 typologically diverse languages from 29 genera, including low-resource languages.

Name # Lang Utt/Lang DomainsIntents Slots
MASSIVE5119,521186055
SLURP (Bastianelli et al., 2020)116,521186055
NLU Evaluation Data (Liu et al., 2019)125,716185456
Airline Travel Information System (ATIS) (Price, 1990)15,871126129
ATIS with Hindi and Turkish (Upadhyay et al., 2018)31,315-5,871 126129
MultiATIS++ (Xu et al., 2020)91,422-5,897 121-2699-140
Snips (Coucke et al., 2018)114,484 - 753
Snips with French (Saade et al., 2019)24,818214-1511-12
Task Oriented Parsing (TOP) (Gupta et al., 2018)144,87322536
Multilingual Task-Oriented Semantic Parsing
(MTOP) (Li et al., 2021)
615,195-22,288 11104-113 72-75
Cross-Lingual Multilingual Task Oriented Dialog
(Schuster et al., 2019)
35,083-43,323 31211
Microsoft Dialog Challenge (Li et al., 2018)138,27631129
Fluent Speech Commands (FSC)
(Lugosch et al., 2019)
130,043 - 31 -
Chinese Audio-Textual Spoken Language
Understanding (CATSLU) (Zhu et al., 2019)
116,2584 - 94

We have released a paper describing the dataset and presenting baseline modeling results on XLM-R and mT5 models. Tools for the dataset, as well as the modeling code used for our baseline results, are available in our Github repository. MASSIVE is licensed under the CC BY 4.0 license, encouraging its broadest possible use across academia and industry.

MMNLU competition and workshop

The MASSIVE leaderboard and the Massively Multilingual NLU 2022 competition, hosted on eval.ai, are composed of two tasks. In the first, called MMNLU-22-Full, each competitor trains and tests a single model on all 51 languages of the full MASSIVE dataset. In the second task, called MMNLU-22-ZeroShot, each competitor fine-tunes a pretrained model only with English-labeled data and tests it on all 50 non-English languages.

Related content
As Alexa expands into new countries, she usually has to be trained on new languages. But sometimes, she has to be re-trained on languages she’s already learned. British English, American English, and Indian English, for instance, are different enough that for each of them, we trained a new machine learning model from scratch.

This assesses the model’s ability to generalize to new languages, an important consideration given the number of languages around the world for which there is little-to-no labeled data. Zero-shot learning is a key technology for scaling NLU technology to many more low-resource languages worldwide.

The permanent MASSIVE leaderboard has been launched, and on July 25 the Massively Multilingual NLU 2022 evaluation split will be released. Participants will then have until August 8 to perform inference on the evaluation set and submit their predictions, which will be used to determine the winners. Winners will be invited to give an oral presentation at the Massively Multilingual NLU 2022 workshop.

The Massively Multilingual NLU 2022 workshop is collocated with EMNLP 2022 and will take place on either December 7 or 8, both in person in Abu Dhabi and online. Paper submissions spanning the breadth of multilingualism in NLU are sought, and the first call for papers will be released soon. The workshop will feature speakers on various topics related to multilingualism and NLU, as well as talks from the top performers from the MMNLU-22 competition.

Related content
In a paper we’re presenting at this year’s Conference on Empirical Methods in Natural Language Processing, we describe experiments with a new data selection technique.

Let’s scale natural-language-understanding technology to every language on Earth. Come build with us!

Acknowledgments

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan for core dataset contributions; Andrew Turner for product and program management; Anna-Karin Johansson for vendor management; Saleh Soltan for text-to-text modeling discussions; Anne Yoder, Zheng Xie, Adeetee Bhide, Misa Sunaga, Trang Doan, and Satyam Dwivedi for program management and language expertise; Wayne Blossom, Brendan Egan, Columbine Marshall, Todd Tieuli, and Augusta Niles for creating the hidden evaluation split of the dataset; Jack FitzGerald, Kay Rottmann, Julia Hirschberg, Anna Rumshisky, and Mohit Bansal for workshop organization; and Charith Peris and Jack FitzGerald for leaderboard and competition setup.

Related content

US, WA, Seattle
The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon’s on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon’s goods and services are aligned with Amazon’s corporate goals. We are seeking an experienced high-energy Economist to help envision, design and build the next generation of retail pricing capabilities. You will work at the intersection of economic theory, statistical inference, and machine learning to design new methods and pricing strategies to deliver game changing value to our customers. Roughly 85% of previous intern cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com. Key job responsibilities Amazon’s Pricing Science and Research team is seeking an Economist to help envision, design and build the next generation of pricing capabilities behind Amazon’s on-line retail business. As an economist on our team, you will work at the intersection of economic theory, statistical inference, and machine learning to design new methods and pricing strategies with the potential to deliver game changing value to our customers. This is an opportunity for a high-energy individual to work with our unprecedented retail data to bring cutting edge research into real world applications, and communicate the insights we produce to our leadership. This position is perfect for someone who has a deep and broad analytic background and is passionate about using mathematical modeling and statistical analysis to make a real difference. You should be familiar with modern tools for data science and business analysis. We are particularly interested in candidates with research background in applied microeconomics, econometrics, statistical inference and/or finance. A day in the life Discussions with business partners, as well as product managers and tech leaders to understand the business problem. Brainstorming with other scientists and economists to design the right model for the problem in hand. Present the results and new ideas for existing or forward looking problems to leadership. Deep dive into the data. Modeling and creating working prototypes. Analyze the results and review with partners. Partnering with other scientists for research problems. About the team The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon’s on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon’s goods and services are aligned with Amazon’s corporate goals.
US, CA, San Francisco
The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon's on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon's goods and services are aligned with Amazon's corporate goals. We are seeking an experienced high-energy Economist to help envision, design and build the next generation of retail pricing capabilities. You will work at the intersection of statistical inference, experimentation design, economic theory and machine learning to design new methods and pricing strategies for assessing pricing innovations. Roughly 85% of previous intern cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com. Key job responsibilities Amazon's Pricing Science and Research team is seeking an Economist to help envision, design and build the next generation of pricing capabilities behind Amazon's on-line retail business. As an economist on our team, you will will have the opportunity to work with our unprecedented retail data to bring cutting edge research into real world applications, and communicate the insights we produce to our leadership. This position is perfect for someone who has a deep and broad analytic background and is passionate about using mathematical modeling and statistical analysis to make a real difference. You should be familiar with modern tools for data science and business analysis. We are particularly interested in candidates with research background in experimentation design, applied microeconomics, econometrics, statistical inference and/or finance. A day in the life Discussions with business partners, as well as product managers and tech leaders to understand the business problem. Brainstorming with other scientists and economists to design the right model for the problem in hand. Present the results and new ideas for existing or forward looking problems to leadership. Deep dive into the data. Modeling and creating working prototypes. Analyze the results and review with partners. Partnering with other scientists for research problems. About the team The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon's on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon's goods and services are aligned with Amazon's corporate goals.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Some knowledge of econometrics, as well as basic familiarity with Python is necessary, and experience with SQL and UNIX would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of interns from previous cohorts have converted to full time economics employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US
The Amazon Supply Chain Optimization Technology (SCOT) organization is looking for an Intern in Economics to work on exciting and challenging problems related to Amazon's worldwide inventory planning. SCOT provides unique opportunities to both create and see the direct impact of your work on billions of dollars’ worth of inventory, in one of the world’s most advanced supply chains, and at massive scale. We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. We are looking for a PhD candidate with exposure to Program Evaluation/Causal Inference. Knowledge of econometrics and Stata/R/or Python is necessary, and experience with SQL, Hadoop, and Spark would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US, WA, Seattle
The Selling Partner Fees team owns the end-to-end fees experience for two million active third party sellers. We own the fee strategy, fee seller experience, fee accuracy and integrity, fee science and analytics, and we provide scalable technology to monetize all services available to third-party sellers. We are looking for an Intern Economist with excellent coding skills to design and develop rigorous models to assess the causal impact of fees on third party sellers’ behavior and business performance. As a Science Intern, you will have access to large datasets with billions of transactions and will translate ambiguous fee related business problems into rigorous scientific models. You will work on real world problems which will help to inform strategic direction and have the opportunity to make an impact for both Amazon and our Selling Partners.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. We are looking for a PhD candidate with exposure to Program Evaluation/Causal Inference. Some knowledge of econometrics, as well as basic familiarity with Stata or R is necessary, and experience with SQL, Hadoop, Spark and Python would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US, MA, Westborough
Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart team of doers that work passionately to apply cutting edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Amazon Robotics is seeking interns and co-ops with a passion for robotic research to work on cutting edge algorithms for robotics. Our team works on challenging and high-impact projects, including allocating resources to complete a million orders a day, coordinating the motion of thousands of robots, autonomous navigation in warehouses, identifying objects and damage, and learning how to grasp all the products Amazon sells. We are seeking internship candidates with backgrounds in computer vision, machine learning, resource allocation, discrete optimization, search, and planning/scheduling. You will be challenged intellectually and have a good time while you are at it! Please note that by applying to this role you would be considered for Applied Scientist summer intern, spring co-op, and fall co-op roles on various Amazon Robotics teams. These teams work on robotics research within areas such as computer vision, machine learning, robotic manipulation, navigation, path planning, perception, artificial intelligence, human-robot interaction, optimization and more.
US, MA, Westborough
Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart team of doers that work passionately to apply cutting edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Amazon Robotics is seeking interns and co-ops with a passion for robotic research to work on cutting edge algorithms for robotics. Our team works on challenging and high-impact projects, including allocating resources to complete a million orders a day, coordinating the motion of thousands of robots, autonomous navigation in warehouses, identifying objects and damage, and learning how to grasp all the products Amazon sells. We are seeking internship candidates with backgrounds in computer vision, machine learning, resource allocation, discrete optimization, search, and planning/scheduling. You will be challenged intellectually and have a good time while you are at it! Please note that by applying to this role you would be considered for Applied Scientist summer intern, spring co-op, and fall co-op roles on various Amazon Robotics teams. These teams work on robotics research within areas such as computer vision, machine learning, robotic manipulation, navigation, path planning, perception, artificial intelligence, human-robot interaction, optimization and more.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Some knowledge of econometrics, as well as basic familiarity with Python is necessary, and experience with SQL and UNIX would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US, MA, Boston
Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart team of doers that work passionately to apply cutting edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Amazon Robotics, a wholly owned subsidiary of Amazon.com, empowers a smarter, faster, more consistent customer experience through automation. Amazon Robotics automates fulfillment center operations using various methods of robotic technology including autonomous mobile robots, sophisticated control software, language perception, power management, computer vision, depth sensing, machine learning, object recognition, and semantic understanding of commands. Amazon Robotics has a dedicated focus on research and development to continuously explore new opportunities to extend its product lines into new areas. AR is seeking uniquely talented and motivated data scientists to join our Global Services and Support (GSS) Tools Team. GSS Tools focuses on improving the supportability of the Amazon Robotics solutions through automation, with the explicit goal of simplifying issue resolution for our global network of Fulfillment Centers. The candidate will work closely with software engineers, Fulfillment Center operation teams, system engineers, and product managers in the development, qualification, documentation, and deployment of new - as well as enhancements to existing - operational models, metrics, and data driven dashboards. As such, this individual must possess the technical aptitude to pick-up new BI tools and programming languages to interface with different data access layers for metric computation, data mining, and data modeling. This role is a 6 month co-op to join AR full time (40 hours/week) from July – December 2023. The Co-op will be responsible for: Diving deep into operational data and metrics to identify and communicate trends used to drive development of new tools for supportability Translating operational metrics into functional requirements for BI-tools, models, and reporting Collaborating with cross functional teams to automate AR problem detection and diagnostics