Cross-lingual transfer learning for multilingual voice agents

In experiments, multilingual models outperform monolingual models.

January 13, 2021

For voice agents like Alexa, expanding into a new language has traditionally meant training a new natural-language-understanding model from scratch, an approach that doesn’t scale well.

An alternative is to train a multilingual model, a single model that can handle multiple languages simultaneously. Supporting a single large model requires less effort than supporting a swarm of smaller ones, and a multilingual model lets users make requests in a mix of languages, which is closer to what we would expect from artificial intelligence in the 21st century.

In a paper we presented last month at the International Conference on Computational Linguistics (COLING), we investigate the use of transfer learning and data mixing to train a multilingual model. We show that the resulting model’s performance is similar to or better than that of the monolingual models currently used in production.

Multilingual-model architecture

Multilingual modeling has become a popular topic in the last few years, with a particular focus on transferring knowledge from models trained on large corpora in one language to models trained on a small amount of data in other languages. This problem is known as low-resource cross-lingual transfer learning. In our paper, we also experiment with high-resource to high-resource transfer, to mimic real-world situations.

Single-language models are trained on data in different languages, but they otherwise generally share the same architecture. It follows that, with that same architecture, we should be able to train a generalized multilingual model on data from multiple languages.

The architecture of our slot-filling and intent classification model.

A voice agent’s natural-language-understanding (NLU) model is first trained to recognize the utterance domain, such as music or weather. Then separate models are trained to perform domain-specific intent-classification and slot-filling tasks.

For example, if the request is “play ‘Bad Romance’ by Lady Gaga”, the intent will be “play music”, while in order to fulfill this request, the system needs to capture the slots and slot values {song name = Bad Romance} and {artist name = Lady Gaga}.
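
To make the example concrete, here is a minimal Python sketch of how such an interpretation might be represented. The `Frame` class, the `tags_to_slots` helper, and the BIO tagging scheme are assumptions for illustration; the article does not specify the data format used in production.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Frame:
    """Hypothetical container for one NLU interpretation of an utterance."""
    domain: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# The example request from the text, written out as a frame.
frame = Frame(
    domain="music",
    intent="play music",
    slots={"song name": "Bad Romance", "artist name": "Lady Gaga"},
)

# Slot filling is commonly cast as per-token tagging. The BIO scheme below is
# an assumption for illustration, not something the article specifies.
tokens = ["play", "Bad", "Romance", "by", "Lady", "Gaga"]
tags = ["O", "B-song_name", "I-song_name", "O", "B-artist_name", "I-artist_name"]


def tags_to_slots(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Collapse BIO tags into {slot name: slot value} pairs."""
    slots: Dict[str, str] = {}
    name, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new slot begins; close any slot that was still open.
            if name is not None:
                slots[name] = " ".join(words)
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            words.append(token)
        else:
            # An "O" tag (or inconsistent "I-" tag) ends the open slot.
            if name is not None:
                slots[name] = " ".join(words)
            name, words = None, []
    if name is not None:
        slots[name] = " ".join(words)
    return slots


extracted = tags_to_slots(tokens, tags)
```

Here `extracted` holds the two slot values from the request, keyed by the hypothetical tag names `song_name` and `artist_name`.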

In our experiments, the domain classification model is a max-entropy logistic regression model.

For intent classification and slot filling, we build a multitask deep-neural-network model. We first map input tokens into shared-space word embeddings and then feed them into a bidirectional long short-term memory (LSTM) encoder to obtain context information. This context information then propagates to the downstream tasks, with a conditional random field used for slot filling and a multilayer perceptron used for intent classification.

Knowledge transfer and results

We trained our models using data in four languages. Three of them, UK English, Spanish, and Italian, are relatively closely related. The fourth, Hindi, is a low-resource language that is lexically and grammatically different from the other three.

In our transfer learning experiments, we transferred different blocks of information — embeddings and encoder and decoder weights — from a model trained in English to multilingual models that combined English with each of the other three languages. We also experimented with data mixing, training one model on English and Spanish and another on English and Italian and transferring them to multilingual models that included Italian and Spanish, respectively.

After transfer, we fine-tuned each of our models on data from all four languages in our data set.
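
The block-wise transfer described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the parameter dictionaries, the placeholder values, and the `transfer_blocks` helper are all hypothetical, and a real model would hold tensors rather than lists.

```python
import copy

# Hypothetical names for the transferable parameter blocks.
BLOCKS = ("embeddings", "encoder", "decoder")


def transfer_blocks(source_params, target_params, blocks):
    """Return a copy of target_params with the named blocks taken from source."""
    unknown = set(blocks) - set(BLOCKS)
    if unknown:
        raise ValueError(f"unknown blocks: {sorted(unknown)}")
    result = copy.deepcopy(target_params)
    for block in blocks:
        # Deep-copy so later fine-tuning of the target cannot alter the source.
        result[block] = copy.deepcopy(source_params[block])
    return result


# Source: a model trained on English only (values are placeholders).
english_model = {
    "embeddings": [0.1, 0.2],
    "encoder": [0.3, 0.4],
    "decoder": [0.5, 0.6],
}

# Target: a freshly initialized multilingual model (placeholder values).
multilingual_init = {
    "embeddings": [0.0, 0.0],
    "encoder": [0.0, 0.0],
    "decoder": [0.0, 0.0],
}

# One configuration: transfer the encoder weights only.
encoder_only = transfer_blocks(english_model, multilingual_init, ["encoder"])

# Another configuration: additionally transfer the decoder weights.
encoder_decoder = transfer_blocks(
    english_model, multilingual_init, ["encoder", "decoder"]
)
```

In this sketch, each returned dictionary would serve as the starting point for fine-tuning on the multilingual data. Blocks that are not transferred keep their fresh initialization.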

We evaluated our models according to four metrics: domain accuracy for the domain classification task; intent accuracy for the intent classification task; micro-averaged slot F1 for slot filling; and frame accuracy, which is the fraction of utterances for which the domain, intent, and all slots are correctly identified.
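
As a rough illustration of how these four metrics relate, the following Python sketch computes them over a toy data set. The `evaluate` function and the dictionary format are assumptions, and slots are matched here as exact (name, value) pairs; the article does not say whether slot F1 is computed at the token or span level.

```python
def evaluate(gold, predicted):
    """Compute the four metrics over parallel lists of annotated utterances.

    Each item is a dict with keys "domain", "intent", and "slots", where
    "slots" maps slot names to slot values (a hypothetical format).
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    n = len(gold)
    domain_hits = intent_hits = frame_hits = 0
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        domain_ok = g["domain"] == p["domain"]
        intent_ok = g["intent"] == p["intent"]
        gold_slots = set(g["slots"].items())
        pred_slots = set(p["slots"].items())
        # Micro-averaging pools slot counts across all utterances.
        tp += len(gold_slots & pred_slots)
        fp += len(pred_slots - gold_slots)
        fn += len(gold_slots - pred_slots)
        domain_hits += domain_ok
        intent_hits += intent_ok
        # A frame counts only if domain, intent, and every slot are right.
        frame_hits += domain_ok and intent_ok and gold_slots == pred_slots
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "domain_accuracy": domain_hits / n,
        "intent_accuracy": intent_hits / n,
        "slot_f1": 2 * precision * recall / denom if denom else 0.0,
        "frame_accuracy": frame_hits / n,
    }


gold = [
    {"domain": "music", "intent": "play music",
     "slots": {"song name": "Bad Romance", "artist name": "Lady Gaga"}},
    {"domain": "weather", "intent": "get forecast",
     "slots": {"city": "Seattle"}},
]
predicted = [
    # Domain and intent are right, but one slot value is wrong.
    {"domain": "music", "intent": "play music",
     "slots": {"song name": "Bad Romance", "artist name": "Gaga"}},
    {"domain": "weather", "intent": "get forecast",
     "slots": {"city": "Seattle"}},
]
metrics = evaluate(gold, predicted)
```

In this toy example, domain and intent accuracy are both perfect, yet frame accuracy is only one half, because a single wrong slot value invalidates the whole first frame. That strictness is why frame accuracy is the most demanding of the four metrics.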

We compared each multilingual model’s performance on each of its languages to that of a state-of-the-art single-language model for the same language. The baseline models use maximum-entropy models rather than deep neural nets as encoders.

All performance metrics show a similar pattern: multilingual deep-neural-net models usually perform better than monolingual models. The best results come from transfer of the encoder weights from source models to target models, with an average improvement in frame accuracy of about 1%. The additional transfer of the decoder weights slightly degrades performance, although the resulting model still beats the baseline.

Data mixing during training of the source model does improve performance, but only slightly.

Interestingly, the greatest improvement in frame accuracy — around 1.2% — comes from transferring models into Hindi. This may be because the baseline model for Hindi was trained on a low-resource data set. The multilingual models may learn general linguistic information from other languages that the monolingual model can’t extract from the Hindi data set alone.

About the Author

Olga Golovneva

Olga Golovneva is a research scientist in Alexa AI's Natural Understanding group.
