Amazon releases 51-language dataset for language understanding
MASSIVE dataset and Massively Multilingual NLU (MMNLU-22) competition and workshop will help researchers scale natural-language-understanding technology to every language on Earth.
Imagine that all people around the world could use voice AI systems such as Alexa in their native tongues.
One promising approach to realizing this vision is massively multilingual natural-language understanding (MMNLU), a paradigm in which a single machine learning model can parse and understand inputs from many typologically diverse languages. By learning a shared data representation that spans languages, the model can transfer knowledge from languages with abundant training data to those in which training data is scarce.
Today we are pleased to make three announcements related to MMNLU.
First, we are releasing a new dataset called MASSIVE, which is composed of one million labeled utterances spanning 51 languages. It comes with open-source code that provides examples of how to perform massively multilingual NLU modeling and allows practitioners to re-create the baseline results for intent classification and slot filling that are presented in our paper.
Second, we are launching a new competition using the MASSIVE dataset called Massively Multilingual NLU 2022 (MMNLU-22).
And third, we will cohost a workshop at EMNLP 2022 in Abu Dhabi and online, also called Massively Multilingual NLU 2022, which will highlight the results from the competition and include presentations from invited speakers and oral and poster sessions from submitted papers on multilingual natural-language processing (NLP).
“We are very excited to share this large multilingual dataset with the worldwide language research community,” says Prem Natarajan, vice president of Alexa AI Natural Understanding. “We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”
The MASSIVE dataset
MASSIVE is a parallel dataset, meaning that every utterance is given in all 51 languages. This enables models to learn shared representations of utterances with the same intents, regardless of language, facilitating cross-linguistic training on natural-language-understanding (NLU) tasks. It also allows for adaptation to other NLP tasks such as machine translation, multilingual paraphrasing, new linguistic analyses of imperative morphologies, and more.
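As a rough illustration of what "parallel" buys you, the sketch below groups records by a shared utterance ID so that the same request can be looked up in any language or paired across two languages. The field names (`id`, `locale`, `utt`, `intent`), the locale codes, and the sample utterances are assumptions made up for this example; they are not taken from the released files.

```python
from collections import defaultdict

# Hypothetical records: field names and utterances are illustrative only.
records = [
    {"id": 1, "locale": "en-US", "utt": "wake me up at nine", "intent": "alarm_set"},
    {"id": 1, "locale": "de-DE", "utt": "weck mich um neun", "intent": "alarm_set"},
    {"id": 2, "locale": "en-US", "utt": "play some jazz", "intent": "play_music"},
    {"id": 2, "locale": "de-DE", "utt": "spiel etwas jazz", "intent": "play_music"},
]


def group_parallel(rows):
    """Map each utterance id to {locale: text} so translations line up."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["id"]][row["locale"]] = row["utt"]
    return dict(grouped)


def translation_pairs(rows, src, tgt):
    """Return (source, target) text pairs for ids present in both locales."""
    grouped = group_parallel(rows)
    return [
        (by_locale[src], by_locale[tgt])
        for _, by_locale in sorted(grouped.items())
        if src in by_locale and tgt in by_locale
    ]


parallel = group_parallel(records)
pairs = translation_pairs(records, "en-US", "de-DE")
```

Because every ID appears in every language, the same grouping could feed a translation or paraphrasing task without any extra alignment step.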
NLU — a subdiscipline of NLP — is a machine's ability to understand the meaning of a text and identify the relevant entities. For instance, given the utterance “What is the temperature in New York?”, an NLU model might classify the intent as “weather_query” and recognize relevant entities as “weather_descriptor: temperature” and “place_name: new york.”
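To make the two outputs concrete, here is a minimal Python sketch of how an intent label and its slot annotations might be represented and extracted. The bracketed `[slot_name : value]` annotation style, the `NLUResult` class, and the `parse_annotated` helper are assumptions for illustration and may differ from the format of the released data.

```python
import re
from dataclasses import dataclass

# Assumed annotation style: "[slot_name : slot value]" embedded in the text.
SLOT_PATTERN = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")


@dataclass
class NLUResult:
    intent: str  # utterance-level label, e.g. "weather_query"
    slots: list  # list of (slot_name, value) tuples
    text: str    # utterance with the slot markup removed


def parse_annotated(intent, annotated):
    """Split an annotated utterance into plain text and slot pairs."""
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annotated)]
    # Replace each bracketed span with just its value to recover the text.
    plain = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return NLUResult(intent=intent, slots=slots, text=plain)


result = parse_annotated(
    "weather_query",
    "what is the [weather_descriptor : temperature] in [place_name : new york]",
)
```

Intent classification predicts the single `intent` field for the whole utterance, while slot filling predicts the `slots` list, one label per relevant span.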
Our particular focus is on NLU as a component of spoken-language understanding (SLU), in which audio is converted to text before NLU is performed. Although SLU-based virtual assistants like Alexa have made major capability advances in the past decade, academic and industrial NLU efforts worldwide are still limited to a small subset of the world's 7,000+ languages. One difficulty in creating massively multilingual NLU models is the lack of labeled data for training and evaluation — particularly data that is realistic for a given task and natural for a given language. High naturalness typically requires human vetting, which is often costly.
MASSIVE — Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation — contains one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize or translate the English-only SLURP dataset into 50 typologically diverse languages from 29 genera, including low-resource languages.
|Name||Languages||Utterances per language||Domains||Intents||Slots|
|SLURP (Bastianelli et al., 2020)||1||16,521||18||60||55|
|NLU Evaluation Data (Liu et al., 2019)||1||25,716||18||54||56|
|Airline Travel Information System (ATIS) (Price, 1990)||1||5,871||1||26||129|
|ATIS with Hindi and Turkish (Upadhyay et al., 2018)||3||1,315-5,871||1||26||129|
|MultiATIS++ (Xu et al., 2020)||9||1,422-5,897||1||21-26||99-140|
|Snips (Coucke et al., 2018)||1||14,484||-||7||53|
|Snips with French (Saade et al., 2019)||2||4,818||2||14-15||11-12|
|Task Oriented Parsing (TOP) (Gupta et al., 2018)||1||44,873||2||25||36|
|Multilingual Task-Oriented Semantic Parsing (MTOP) (Li et al., 2021)|
|Cross-Lingual Multilingual Task Oriented Dialog (Schuster et al., 2019)|
|Microsoft Dialog Challenge (Li et al., 2018)||1||38,276||3||11||29|
|Fluent Speech Commands (FSC) (Lugosch et al., 2019)|
|Chinese Audio-Textual Spoken Language Understanding (CATSLU) (Zhu et al., 2019)|
We have released a paper describing the dataset and presenting baseline modeling results on XLM-R and mT5 models. Tools for the dataset, as well as the modeling code used for our baseline results, are available in our GitHub repository. MASSIVE is released under the CC BY 4.0 license to encourage its broadest possible use across academia and industry.
MMNLU competition and workshop
The MASSIVE leaderboard and the Massively Multilingual NLU 2022 competition, both hosted on eval.ai, comprise two tasks. In the first, called MMNLU-22-Full, each competitor trains and tests a single model on all 51 languages of the full MASSIVE dataset. In the second, called MMNLU-22-ZeroShot, each competitor fine-tunes a pretrained model using only labeled English data and tests it on the 50 non-English languages.
The zero-shot task assesses a model’s ability to generalize to languages for which it has seen no labeled examples, an important consideration given the number of languages around the world for which there is little to no labeled data. Zero-shot learning is a key technology for scaling NLU to many more low-resource languages worldwide.
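The zero-shot setup can be sketched in a few lines of Python: fine-tune on one locale, then score every other locale separately. This is only an illustration under stated assumptions, not the competition's official evaluation code. The field names, the locale codes, the sample utterances, the choice of intent accuracy as the metric, and the keyword-based `dummy_predict` stand-in for a fine-tuned model are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: field names, locales, and utterances are made up.
rows = [
    {"locale": "en-US", "utt": "set an alarm", "intent": "alarm_set"},
    {"locale": "en-US", "utt": "play jazz", "intent": "play_music"},
    {"locale": "de-DE", "utt": "stelle einen wecker", "intent": "alarm_set"},
    {"locale": "fr-FR", "utt": "mets du jazz", "intent": "play_music"},
    {"locale": "fr-FR", "utt": "mets une alarme", "intent": "alarm_set"},
]


def zero_shot_split(data, train_locale="en-US"):
    """Training data comes from one locale; every other locale is held out."""
    train = [r for r in data if r["locale"] == train_locale]
    test = [r for r in data if r["locale"] != train_locale]
    return train, test


def accuracy_by_locale(test_data, predict):
    """Intent accuracy per held-out locale for a given predict(utt) function."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in test_data:
        total[r["locale"]] += 1
        correct[r["locale"]] += int(predict(r["utt"]) == r["intent"])
    return {loc: correct[loc] / total[loc] for loc in total}


def dummy_predict(utt):
    # Stand-in for a model fine-tuned on the English rows only.
    return "play_music" if "jazz" in utt else "alarm_set"


train_rows, test_rows = zero_shot_split(rows)
scores = accuracy_by_locale(test_rows, dummy_predict)
```

Reporting scores per locale, rather than one pooled number, makes it easier to see which languages benefit least from cross-lingual transfer.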
The permanent MASSIVE leaderboard has been launched, and on July 25 the Massively Multilingual NLU 2022 evaluation split will be released. Participants will then have until August 8 to perform inference on the evaluation set and submit their predictions, which will be used to determine the winners. Winners will be invited to give an oral presentation at the Massively Multilingual NLU 2022 workshop.
The Massively Multilingual NLU 2022 workshop is co-located with EMNLP 2022 and will take place on December 7 or 8, in person in Abu Dhabi and online. We welcome paper submissions spanning the breadth of multilingualism in NLU, and the first call for papers will be released soon. The workshop will feature speakers on various topics related to multilingualism and NLU, as well as talks from the top performers in the MMNLU-22 competition.
Let’s scale natural-language-understanding technology to every language on Earth. Come build with us!
Acknowledgments: Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan for core dataset contributions; Andrew Turner for product and program management; Anna-Karin Johansson for vendor management; Saleh Soltan for text-to-text modeling discussions; Anne Yoder, Zheng Xie, Adeetee Bhide, Misa Sunaga, Trang Doan, and Satyam Dwivedi for program management and language expertise; Wayne Blossom, Brendan Egan, Columbine Marshall, Todd Tieuli, and Augusta Niles for creating the hidden evaluation split of the dataset; Jack FitzGerald, Kay Rottmann, Julia Hirschberg, Anna Rumshisky, and Mohit Bansal for workshop organization; and Charith Peris and Jack FitzGerald for leaderboard and competition setup.