Using warped language models to correct speech recognition errors
Model using ASR hypotheses as extra inputs reduces word error rate of human transcriptions by almost 11%.
Language-related machine learning applications have made great strides in recent years, thanks in part to masked language models such as BERT: during training, sentences are fed to the models with certain words either masked out or replaced with random substitutions, and the models learn to output complete and corrected sentences.
The success of masked language models has led to the development of warped language models, which add insertions and deletions to the menu of possible alterations. Warped language models were designed specifically to address the types of errors common in automatic speech recognition (ASR), so they could serve as the basis for ASR models.
In a paper we presented at this year’s Interspeech, we describe how to use warped language models, not during ASR, but to correct ASR output — or to correct the human transcriptions of speech used to train ASR models.
This new use case required us to modify the design of warped language models so that they not only output text strings but also classify the errors in the input strings. From that information, we can produce a corrected text even when its word count differs from that of the input.
Since ASR models output not a single transcript of input speech but a ranked list of hypotheses, we also experimented with using multiple hypotheses as inputs to our error correction model. With the human transcriptions, we generated the hypotheses by subjecting the transcribed speech to ASR.
We found that the multiple-hypothesis approach had particular benefits for the correction of human transcription errors. There, it was able to reduce the word error rate by about 11%. For ASR outputs, the same model reduced the word error rate by almost 6%.
Warped language models
A warped language model outputs a token — either a word or a special symbol, such as a blank when it detects a spurious insertion — for each word of its input. This means, however, that it can’t fully correct word deletions: it has to choose between outputting the dropped word or the input word at the current position.
We adapt the basic architecture of the warped language model so that, for each input token, it predicts both an output token and a warping operation.
Our model still outputs a single token for each input token, but from the combination of the token and the warping operation, a simple correction algorithm can deduce the original input.
The figure below, for example, depicts our model’s handling of the input sentence “I saying that table I [mask] apples place oranges.” The middle row indicates our model’s output: first, an operation name, and second, an output token. When our model replaces the input “saying” with the output “was” and flags the operation as “drop”, the correction algorithm deduces that the sentence should have begun “I was saying”, not “I saying”.
The great advantage of masked (and warped) language models is that they are unsupervised: the masking (and warping) operations can be performed automatically, enabling an effectively unlimited amount of training data. Our model is similarly unsupervised: we simply modify the warping algorithm so that, when it applies an operation, it also tags the output with the operation’s name.
After training our model on a corpus of English-language texts, we fine-tuned it on the output of an ASR model for a separate set of spoken-language utterances. For each utterance, we kept the top five ASR hypotheses.
Algorithms automatically aligned the tokens of the hypotheses and standardized their lengths, adding blank tokens where necessary. We treated hypotheses two through five as warped versions of the top hypothesis, automatically computing the minimum number of warping operations required to transform the top hypothesis into an alternate hypothesis and labeling the hypothesis tokens appropriately.
For each input, our model combines all five hypotheses to produce a single vector representation (an embedding) that the model’s decoder uses to produce output strings.
During training, the model outputs a separate set of predictions for each hypothesis. This ensures the fine-tuning of the operation predictor as well as the token predictor, as the operation classifications will differ for each hypothesis, even if the token strings are identical. At run time, however, we keep only the output corresponding to the top-ranked ASR hypothesis.
Without fine-tuning on the ASR hypotheses, our model reduced the word error rate of ASR model outputs by 5%. But it slightly increased the word error rate on the human transcriptions of speech. This is probably because the human-transcribed speech, even when erroneous, is still syntactically and semantically coherent, so errors are hard to identify. The addition of the alternate ASR hypotheses, however, enables the correction model to exploit additional information in the speech signal itself, leading to a pronounced reduction in word error rate.