RescoreBERT: Using BERT models to improve ASR rescoring
Knowledge distillation and discriminative training enable efficient use of a BERT-based model to rescore automatic-speech-recognition hypotheses.
When someone speaks to a voice agent like Alexa, an automatic speech recognition (ASR) model converts the speech to text. Typically, the core ASR model is trained on limited data, which means that it can struggle with rare words and phrases. So the ASR model’s hypotheses usually pass to a language model — a model that encodes the probabilities of sequences of words — trained on a much larger body of texts. The language model reranks the hypotheses, with the goal of improving ASR accuracy.
In natural-language processing, one of the most widely used language models is BERT (bidirectional encoder representations from Transformers). To use BERT as a rescoring model, one typically masks each input token in turn and computes its log-likelihood from the rest of the input, then sums those per-token scores into a total score called the PLL (pseudo log-likelihood). This computation is very expensive, however, which makes it impractical for real-time ASR. For rescoring, most ASR systems instead use more efficient long short-term memory (LSTM) language models.
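As a rough illustration of why PLL scoring is costly, the summation can be sketched in a few lines. The toy `masked_log_prob` function below is a hypothetical stand-in for a real masked-language-model call; an actual scorer would run a full BERT forward pass for every masked position:

```python
import math

# Hypothetical stand-in for a masked-language-model call. A real scorer
# would run a full BERT forward pass with position `mask_index` masked
# and read off the log-probability of the true token given its context.
def masked_log_prob(tokens, mask_index):
    toy_probs = {"is": 0.20, "fission": 0.05, "the": 0.25,
                 "opposite": 0.10, "of": 0.25, "fusion": 0.05}
    return math.log(toy_probs.get(tokens[mask_index], 0.01))

def pseudo_log_likelihood(tokens):
    # One MLM call per token: this linear blowup in forward passes is
    # what makes PLL scoring too slow for real-time rescoring.
    return sum(masked_log_prob(tokens, i) for i in range(len(tokens)))

hypothesis = "is fission the opposite of fusion".split()
pll = pseudo_log_likelihood(hypothesis)
```

A hypothesis of n tokens thus needs n forward passes, versus a single pass for a model that emits one sentence-level score.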
At this year’s International Conference on Acoustics, Speech, and Signal Processing (ICASSP), we presented a paper proposing a new model, RescoreBERT, that leverages BERT’s power for second-pass rescoring.
In our experiments, RescoreBERT reduced an ASR model’s error rate by up to 13% relative to a traditional LSTM-based rescoring model. At the same time, thanks to a combination of knowledge distillation and discriminative training, it remains efficient enough for commercial deployment. In fact, we recently partnered with the Alexa team working on the Alexa Teacher Model — a large, pretrained, multilingual model with billions of parameters that encodes language as well as salient patterns of interactions with Alexa — and deployed RescoreBERT to production to delight Alexa customers.
To get a sense for the value of rescoring, suppose that an ASR model outputs these hypotheses, from most to least likely: (a) “is fishing the opposite of fusion”, (b) “is fission the opposite of fusion”, and (c) “is fission the opposite of fashion”. A language model trained from scratch on a limited set of data will often struggle with rare words such as “fission”, so without second-pass rescoring, ASR would give an incorrect output: “is fishing the opposite of fusion”. If the second-pass language model does its job well, it should give priority to the hypothesis “is fission the opposite of fusion” and correctly rerank the hypotheses.
To reduce the computational expense of computing PLL scores, we adapt previous work from Amazon and pass the BERT model’s output through a neural network trained to mimic the PLL scores assigned by a larger, “teacher” model. We name this method MLM (masked language model) distillation, because the distilled model is trained to match the teacher’s predictions of masked inputs.
The score output by the distilled model is interpolated with the first-pass ASR score to produce a final score. By distilling PLL scores from a large BERT model into a much smaller one, this approach reduces latency.
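A minimal sketch of the distillation objective, assuming the student emits a single sentence-level score that is regressed onto the teacher’s PLL; the scores below are illustrative, not values from the paper:

```python
# MLM distillation objective (a sketch): a small student model predicts
# one score per sentence, trained to match the teacher's expensive PLL.
# Mean squared error is the simplest choice of regression loss.
def mlm_distillation_loss(student_scores, teacher_plls):
    n = len(student_scores)
    return sum((s - t) ** 2 for s, t in zip(student_scores, teacher_plls)) / n

student = [-12.1, -9.8, -15.3]   # student's one-pass sentence scores
teacher = [-11.5, -10.0, -14.9]  # teacher's PLL targets (precomputed)
loss = mlm_distillation_loss(student, teacher)
```

At inference time only the student runs, so each hypothesis costs a single forward pass.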
Because the first- and second-pass scores are linearly interpolated, it’s not enough for the rescoring model to assign the correct hypothesis a better (in this case, lower) score; the interpolated score for the correct hypothesis also has to be the lowest among all hypotheses.
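A toy reranking example makes this concrete; the scores and the interpolation weight below are invented for illustration, with lower scores meaning better hypotheses:

```python
# Linear interpolation of first-pass (ASR) and second-pass (rescorer)
# scores. Lower is better; the weight is a hypothetical tuning parameter.
def interpolate(first_pass, second_pass, weight=0.5):
    return [f + weight * s for f, s in zip(first_pass, second_pass)]

hyps = ["is fishing the opposite of fusion",
        "is fission the opposite of fusion",
        "is fission the opposite of fashion"]
first = [4.0, 4.6, 5.1]   # first pass favors "fishing" (lowest score)
second = [3.0, 1.2, 2.8]  # rescorer favors "fission ... fusion"
final = interpolate(first, second)          # [5.5, 5.2, 6.5]
best = hyps[min(range(len(final)), key=final.__getitem__)]
```

Note that the rescorer preferring the correct hypothesis is not enough on its own; its margin must be large enough to overcome the first-pass score after interpolation, which is exactly why training should see the first-pass scores.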
As a result, it would be beneficial to account for first-pass scores when training the second-pass rescoring model. However, the MLM distillation aims to distill the PLL scores and hence does not account for the first-pass scores. To account for the first-pass scores, we apply discriminative training after MLM distillation.
Specifically, we train RescoreBERT with the objective that, if one uses the linearly interpolated score between the first-pass and second-pass scores to rerank the hypotheses, it will minimize ASR errors. To capture this objective, previous research has used the loss function MWER (minimum word error rate), minimizing the expected number of word errors predicted from ASR hypothesis scores.
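One common way to write an MWER-style objective is the expected word-error count under a softmax distribution over hypothesis scores, with the mean error subtracted as a baseline. The sketch below assumes lower scores are better and is an illustration of the idea, not code from the paper:

```python
import math

# MWER loss (a sketch): expected word errors under the hypothesis
# distribution induced by the interpolated scores.
def mwer_loss(scores, word_errors):
    # Softmax over negated scores, so better (lower-score) hypotheses
    # receive more probability mass.
    exps = [math.exp(-s) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Subtracting the mean error is a common variance-reduction baseline.
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(probs, word_errors))
```

Minimizing this loss pushes probability mass (and hence low interpolated scores) toward hypotheses with few word errors.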
We introduce a new loss function, named MWED (matching word error distribution). This loss function matches the distribution of the hypothesis scores to the distribution of word errors for individual hypotheses. We show that MWED is a strong alternative to the standard MWER, improving performance in English, although it slightly degrades performance in Japanese.
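One plausible reading of the MWED objective is a cross-entropy between two softmax distributions, one over hypothesis scores and one over per-hypothesis word-error counts; the exact normalization in the paper may differ, so treat this purely as an illustrative sketch:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# MWED loss (a hedged sketch): match the distribution of hypothesis
# scores to the distribution of word errors. With lower score = better,
# high-score (bad) hypotheses should carry mass where errors are high.
def mwed_loss(scores, word_errors):
    target = softmax(word_errors)  # more errors -> more target mass
    model = softmax(scores)
    return -sum(t * math.log(m) for t, m in zip(target, model))
```

The loss is smallest when the score distribution mirrors the error distribution, i.e., when relative scores track relative word errors across the whole n-best list rather than only identifying the single best hypothesis.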
Finally, to demonstrate the advantage of discriminative training, we show that while a BERT model trained with MLM distillation alone improves WER by 3%-6% relative to an LSTM rescorer, RescoreBERT, trained with a discriminative objective, improves it by 7%-13% on the same test sets.