How Alexa's new Live Translation for conversations works
Parallel speech recognizers, language ID, and translation models geared to conversational speech are among the modifications that make Live Translation possible.
Today, Amazon launched Alexa’s new Live Translation feature, which allows individuals speaking in two different languages to converse with each other, with Alexa acting as an interpreter and translating both sides of the conversation.
With this new feature, a customer can ask Alexa to initiate a translation session for a pair of languages. Once the session has commenced, customers can speak phrases or sentences in either language. Alexa will automatically identify which language is being spoken and translate each side of the conversation.
At launch, the feature will work with six language pairs, each combining English with Spanish, French, German, Italian, Brazilian Portuguese, or Hindi, on Echo devices with the locale set to English US.
The Live Translation feature leverages several existing Amazon systems, including Alexa’s automatic-speech-recognition (ASR) system, Amazon Translate, and Alexa’s text-to-speech system, with the overall architecture and machine learning models designed and optimized for conversational-speech translation.
During a translation session, Alexa runs two ASR models in parallel, along with a separate model for language identification. Input speech passes to both ASR models at once. Based on the language ID model’s classification result, however, only one ASR model’s output is sent to the translation engine.
This parallel implementation is necessary to keep the latency of the translation request acceptable, as waiting to begin speech recognition until the language ID model has returned a result would delay the playback of the translated audio.
Moreover, we found that the language ID model works best when it bases its decision on both acoustic information about the speech signal and the outputs of both ASR models. The ASR data often helps, for instance, in the case of non-native speakers of a language, whose speech often has consistent acoustic properties regardless of the language being spoken, which makes acoustic information alone a weaker cue.
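One way to picture how acoustic evidence and ASR evidence might be combined is a simple weighted score per language. The sketch below is only an illustration: the function name, the log-linear weighting, and the example numbers are all assumptions, and the actual language ID model is a trained classifier whose form is not described here.

```python
import math

def fuse_language_scores(acoustic_logprob, asr_confidence, acoustic_weight=0.5):
    """Combine per-language acoustic log-probabilities with ASR confidences.

    Both arguments are dicts keyed by language code. The fixed weighting is a
    hypothetical stand-in for a learned classifier.
    """
    fused = {}
    for lang in acoustic_logprob:
        asr_logprob = math.log(asr_confidence[lang])
        fused[lang] = (acoustic_weight * acoustic_logprob[lang]
                       + (1.0 - acoustic_weight) * asr_logprob)
    return max(fused, key=fused.get), fused

# Made-up scores for a non-native speaker: the acoustics lean slightly toward
# English, but the Spanish recognizer is far more confident in its transcript.
acoustic = {"en": math.log(0.55), "es": math.log(0.45)}
asr_conf = {"en": 0.20, "es": 0.90}
best_lang, scores = fuse_language_scores(acoustic, asr_conf)
```

With these invented numbers, the fused score selects Spanish, whereas relying on the acoustic scores alone (`acoustic_weight=1.0`) would select English.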
Once the language ID system has selected a language, the associated ASR output is post-processed and sent to Amazon Translate. The resulting translation is passed to Alexa’s text-to-speech system for playback.
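The flow described above can be sketched roughly as follows. Every function here is a hypothetical placeholder that returns canned values, not a real Alexa or Amazon Translate API. The sketch also simplifies two things: language ID runs independently of the recognizers, and the post-processing step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_english(audio):          # hypothetical stand-in for one ASR model
    return "hello how are you"

def recognize_spanish(audio):          # hypothetical stand-in for the other ASR model
    return "jelou jau ar yu"

def identify_language(audio):          # hypothetical stand-in for language ID
    return "en"

def translate(text, source, target):   # hypothetical stand-in for the translation engine
    return f"[{source}->{target}] {text}"

def handle_utterance(audio, pair=("en", "es")):
    recognizers = {"en": recognize_english, "es": recognize_spanish}
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Start both recognizers and language ID together, so that speech
        # recognition never waits for the language decision.
        asr_futures = {lang: pool.submit(recognizers[lang], audio) for lang in pair}
        lang_future = pool.submit(identify_language, audio)
        spoken = lang_future.result()
        # Only the transcript for the identified language is used.
        transcript = asr_futures[spoken].result()
    target = pair[1] if spoken == pair[0] else pair[0]
    return translate(transcript, spoken, target)

result = handle_utterance(b"\x00\x01")
```

The transcript from the recognizer that was not selected is simply discarded. That wasted work is the price of keeping latency low.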
Like most ASR systems, the ones we use for live translation include both an acoustic model and a language model. The acoustic model converts audio into phonemes, the smallest units of speech; the language model encodes the probabilities of particular strings of words, which helps the ASR system decide between alternative interpretations of the same sequence of phonemes.
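To illustrate how a language model helps decide between interpretations that sound alike, here is a toy bigram scorer. The probabilities are invented for the example, and the names `BIGRAM_PROB`, `sentence_logprob`, and `pick_best` are hypothetical. A production model would be estimated from large text corpora and combined with acoustic scores.

```python
import math

# Invented bigram probabilities, for illustration only.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.01,
}
UNSEEN_PROB = 1e-6  # floor for bigrams missing from the table

def sentence_logprob(words):
    """Sum the log-probability of each word given the word before it."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), UNSEEN_PROB))
        previous = word
    return total

def pick_best(hypotheses):
    """Return the hypothesis the toy language model finds most probable."""
    return max(hypotheses, key=lambda h: sentence_logprob(h.split()))

best = pick_best(["wreck a nice beach", "recognize speech"])
```

Both hypotheses could plausibly match the same phoneme sequence, but the word-sequence probabilities favor "recognize speech".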
Each of the ASR systems used for Live Translation, like Alexa’s existing ASR models, includes two types of language models: a traditional language model, which encodes probabilities for relatively short strings of words (typically around four), and a neural language model, which can account for longer-range dependencies. The Live Translation language models were trained to handle speech that is more conversational and covers a wider range of topics than the speech Alexa’s existing ASR models handle.
To train our acoustic models, we used connectionist temporal classification (CTC), followed by multiple passes of state-level minimum-Bayes-risk (sMBR) training. To make the acoustic model more robust, we also mixed noise into the training set, enabling the model to focus on characteristics of the input signal that vary less under different acoustic conditions.
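As a rough illustration of mixing noise into training audio, the sketch below scales a noise signal so that the mixture reaches a chosen signal-to-noise ratio. Controlling the mix by SNR is an assumption made for this example, as are the function name and the synthetic signals. The text above does not say how the noise was actually mixed.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Pick the gain so that
    # speech_power / (gain**2 * noise_power) == 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

rng = random.Random(0)
# One second of a 220 Hz tone at 16 kHz stands in for clean speech.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Training on several noisy copies of each utterance, at different SNRs and with different noise types, is a common way to push a model toward features that stay stable across acoustic conditions.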
Adapting to conversational speech also required modification of Alexa’s end-pointer, which determines when a customer has finished speaking. The end-pointer already distinguishes between pauses at the ends of sentences, indicating that the customer has stopped speaking and that Alexa needs to follow up, and mid-sentence pauses, which may be permitted to go on a little longer. For Live Translation, we modified the end-pointer to tolerate longer pauses at the ends of sentences, as speakers engaged in long conversations will often take time between sentences to formulate their thoughts.
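The end-pointer adjustment can be pictured as changing one of two pause thresholds. The sketch below is a deliberately simplified rule-based version. The threshold values, profile names, and the `sentence_complete` flag are hypothetical, and a production end-pointer is more sophisticated than a fixed timeout.

```python
def should_end_turn(silence_ms, sentence_complete,
                    end_of_sentence_ms=600, mid_sentence_ms=1200):
    """Decide whether the speaker has finished, given the current pause length."""
    threshold = end_of_sentence_ms if sentence_complete else mid_sentence_ms
    return silence_ms >= threshold

# Hypothetical profiles: the conversational one waits longer after a complete
# sentence, while the mid-sentence tolerance is unchanged.
DEFAULT = {"end_of_sentence_ms": 600, "mid_sentence_ms": 1200}
CONVERSATIONAL = {"end_of_sentence_ms": 1000, "mid_sentence_ms": 1200}

pause = 800  # milliseconds of silence following a complete sentence
ends_by_default = should_end_turn(pause, True, **DEFAULT)
ends_in_conversation = should_end_turn(pause, True, **CONVERSATIONAL)
```

With these illustrative values, an 800 ms pause after a finished sentence ends the turn under the default profile but not under the conversational one, which gives the speaker time to begin another sentence.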
Finally, because Amazon Translate’s neural-machine-translation system was designed to work with textual input, the Live Translation system adjusts for common disfluencies and punctuates and formats the ASR output. This ensures that the inputs to Amazon Translate look more like the written text that it’s used to seeing.
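A minimal sketch of this kind of cleanup appears below, using a few regular expressions. The filler list, the rules, and the function name are assumptions chosen for illustration, and the actual system is certainly more capable. In particular, this version would also collapse legitimate repetitions such as "that that".

```python
import re

# Common filler words, optionally followed by a comma and whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|hmm+)\b,?\s*", re.IGNORECASE)
# A word immediately repeated one or more times ("i i want").
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)

def normalize_transcript(text):
    """Strip simple disfluencies, then capitalize and punctuate the result."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text

cleaned = normalize_transcript("um i i want to uh book a table for two")
# cleaned == "I want to book a table for two."
```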
In ongoing work, we’re exploring several approaches to improving the Live Translation feature further. One of these is semi-supervised learning, in which Alexa’s existing models annotate unlabeled data, and we use the highest-confidence outputs as additional training examples for our translation-specific ASR and language ID models.
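The selection step in this kind of semi-supervised learning can be sketched as a confidence filter over machine-annotated data. The record fields, the threshold of 0.9, and the example predictions are hypothetical.

```python
def select_pseudo_labels(predictions, min_confidence=0.9):
    """Keep machine-annotated examples whose confidence clears the threshold."""
    return [
        (p["audio_id"], p["transcript"])
        for p in predictions
        if p["confidence"] >= min_confidence
    ]

# Invented outputs of an existing model run over unlabeled audio.
predictions = [
    {"audio_id": "utt-001", "transcript": "where is the station", "confidence": 0.97},
    {"audio_id": "utt-002", "transcript": "wear is the stay shun", "confidence": 0.41},
    {"audio_id": "utt-003", "transcript": "thank you very much", "confidence": 0.93},
]
extra_training_examples = select_pseudo_labels(predictions)
```

Only the first and third utterances pass the filter and would be added to the training set. The threshold trades the amount of additional data against the risk of training on incorrect labels.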
To improve the fluency of the translation and its robustness to spoken-language input, we are also working on adapting the neural-machine-translation engine to conversational-speech data and on generating translations that take relevant context into account, such as tone of voice or whether a formal or informal register is appropriate. Finally, we are continuously working on improving the overall quality of the translations, particularly those of colloquial and idiomatic expressions.