On benchmark data set, question-answering system halves error rate
Improvements come from new transfer learning method, new publicly released data set.
The last few years have seen great advances in the design of language models, which are a critical component of language-based AI systems. Language models can be used to compute the probability of any given sequence (even discontinuous sequences) of words, which is useful in natural-language processing.
The new language models are all built atop the Transformer neural architecture, which is particularly good at learning long-range dependencies among input data, such as the semantic and syntactic relationships between individual words of a sentence. At the annual meeting of the Association for the Advancement of Artificial Intelligence, my colleagues and I will present a method for adapting these new models — BERT, for instance — to the problem of answer selection, a central topic in the field of question answering.
In tests on an industry-standard benchmark data set, our new model demonstrated a 10% absolute improvement in mean average precision over the previous state-of-the-art answer selection model. That translates to an error rate reduction of 50%.
Our approach uses transfer learning, in which a machine learning model pretrained on one task (here, word sequence prediction) is fine-tuned on another (here, answer selection). Our innovation is to introduce an intermediate step between the pretraining of the source model and its adaptation to new target domains.
In the intermediate step, we fine-tune the language model on a large corpus of general question-answer pairs. Then we fine-tune it even further, on a small body of topic-specific questions and answers (the target domain). We call our system TANDA, for transfer and adapt.
The corpus we use for the intermediate step is based on the public data set Natural Questions (NQ), which was designed for the training of reading comprehension systems. We transformed NQ so that it could be used to train answer selection systems instead, and the public release of our modified data set — dubbed ASNQ, for answer selection NQ — is itself an important contribution to the research community.
In addition to its performance gains, our system has several other advantages:
1. It can be fine-tuned on target data without a heavy hyperparameter search. Hyperparameters are settings such as the number of layers in a neural network, the number of nodes per layer, and the learning rate of the training algorithm, which are often determined through trial and error. The stability of our model means that it can be adapted to a target domain with very little training data.
2. It’s robust to noise: in our tests, errors in the target domain data had little effect on the system’s accuracy. Again, this is important because of the difficulty of acquiring high-quality data.
3. The most time-consuming part of our procedure — the intermediate step — need be performed only once. The resulting model can be adapted to an indefinite number of target domains.
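The transfer-then-adapt sequence, and the reuse described in point 3, can be sketched in a few lines. This is only an illustration under loud assumptions: a tiny logistic regression stands in for the Transformer, the three numeric features and all data are made up, and `fine_tune` is a hypothetical helper, not part of any released code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(weights, examples, learning_rate=0.5, epochs=50):
    """Continue training from `weights` on (features, label) pairs.

    Returns a new weight list; the starting weights are left untouched,
    so one checkpoint can seed several later fine-tuning runs.
    """
    w = list(weights)
    for _ in range(epochs):
        for features, label in examples:
            score = sum(wi * xi for wi, xi in zip(w, features))
            error = sigmoid(score) - label  # derivative of log loss w.r.t. score
            for i, xi in enumerate(features):
                w[i] -= learning_rate * error * xi
    return w


# Toy features per question-answer pair (assumed): [bias, word overlap, length ratio].
pretrained = [0.0, 0.0, 0.0]  # stand-in for the pretrained language model
general_qa = [([1.0, 0.8, 0.5], 1), ([1.0, 0.2, 0.5], 0)]  # stand-in for ASNQ
sports_qa = [([1.0, 0.9, 0.2], 1), ([1.0, 0.1, 0.8], 0)]   # small target domain
finance_qa = [([1.0, 0.3, 0.9], 1), ([1.0, 0.7, 0.1], 0)]  # another target domain

# Transfer: the expensive intermediate step, run once.
transferred_model = fine_tune(pretrained, general_qa)

# Adapt: cheap per-domain steps that all start from the same checkpoint.
sports_model = fine_tune(transferred_model, sports_qa, epochs=5)
finance_model = fine_tune(transferred_model, finance_qa, epochs=5)
```

Because `fine_tune` copies its starting weights, `transferred_model` is unchanged after each adaptation and can serve as the starting point for any number of target domains.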
Nuts and bolts
Answer selection assumes that for a given question, the system has access to a set of candidate answers; in practice, candidates are often assembled through standard keyword search. Answer selection systems are thus trained on pairs of sentences — one question and one candidate answer at a time — and try to learn which candidates are viable answers.
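As a rough illustration of how candidates might be assembled and paired with a question, here is a minimal keyword-overlap sketch. The stopword list, the scoring rule, and the function name `keyword_candidates` are assumptions made for the example; production keyword search is considerably more sophisticated.

```python
import re

# A deliberately tiny, assumed stopword list.
STOPWORDS = {"the", "a", "an", "did", "when", "in", "of"}


def content_terms(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def keyword_candidates(question, sentences, top_k=2):
    """Return (question, candidate) pairs for the top_k overlapping sentences."""
    q_terms = content_terms(question)
    scored = [(len(q_terms & content_terms(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [(question, s) for overlap, s in scored[:top_k] if overlap > 0]


question = "When did the Eagles play the Fog Bowl?"
corpus = [
    "The Eagles played the Fog Bowl in Chicago.",
    "The stock market closed higher today.",
    "Fog delayed several flights.",
]
pairs = keyword_candidates(question, corpus)
```

Each resulting pair is one input to the answer selection model, which then decides whether the candidate is a viable answer.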
In the past, researchers have attempted to adapt Transformer-based language models to answer selection by directly fine-tuning them on small sets of domain-specific data, but we hypothesized that the addition of an intermediate fine-tuning step would lead to better results.
In reading comprehension, the system receives a question and a block of text; its job is to select the one sentence in the block that best answers the question. NQ is a set of text blocks in each of which one sentence has been labeled as the best answer. To convert NQ into the answer selection data set ASNQ, we extracted the best answers from their text blocks and labeled them as successful answer sentences. The other sentences in each block we labeled unsuccessful answers.
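The conversion just described can be sketched as follows. The function name, argument layout, and example text are hypothetical, and the sketch follows only the simplified description above; it does not reproduce the actual NQ schema or the exact ASNQ construction rules.

```python
def to_answer_selection(question, sentences, best_index):
    """Turn one reading-comprehension example into labeled sentence pairs.

    The sentence at `best_index` gets label 1 (successful answer);
    every other sentence in the block gets label 0 (unsuccessful answer).
    """
    return [
        (question, sentence, 1 if i == best_index else 0)
        for i, sentence in enumerate(sentences)
    ]


# Illustrative text block, not taken from NQ.
question = "When did the Philadelphia Eagles play the Fog Bowl?"
block = [
    "The Fog Bowl was a playoff game between the Eagles and the Bears.",
    "It was played on December 31, 1988.",
    "Dense fog covered the field for much of the game.",
]
examples = to_answer_selection(question, block, best_index=1)
```

Each text block thus yields one positive pair and several negatives, which is the kind of supervision an answer selection model trains on.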
Generally, Transformer-based language models are trained on sentences in which particular words have been hidden, or “masked”, and the models must learn to fill in the blanks.
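A minimal sketch of the masking idea, under stated assumptions: whitespace tokens stand in for the subword tokens real models use, the default `mask_rate` of 0.15 follows the rate commonly cited for BERT, and `mask_tokens` is a hypothetical helper that expects a non-empty token list.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden.

    Returns the masked sequence and a {position: original_token} dict,
    which is what the model would be trained to predict.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets


tokens = "the eagles played the bears in heavy fog".split()
masked, targets = mask_tokens(tokens)
```

Filling each `[MASK]` position back in from `targets` recovers the original sentence, which is the prediction task the model learns during pretraining.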
BERT models are also trained on a second objective, which is to determine whether a second input sentence follows naturally from the first. The input to a BERT model is thus a pair of (masked) sentences.
This makes BERT particularly well suited to the task of answer sentence selection, where the inputs are also pairs of sentences. Our procedure is first to fine-tune a Transformer-based model on ASNQ and then fine-tune it again on a smaller, domain-specific data set (a sports news set, for instance), which might include questions such as “When did the Philadelphia Eagles play the Fog Bowl?”
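To make the sentence-pair input concrete, the sketch below packs a question and a candidate answer in the layout BERT uses: a `[CLS]` token, the first sentence, a `[SEP]` token, the second sentence, and a final `[SEP]`, with segment ids marking which sentence each token belongs to. Whitespace splitting is an assumed stand-in for BERT's subword tokenizer, and `pack_pair` is a hypothetical helper.

```python
def pack_pair(question, candidate):
    """Build BERT-style tokens and segment ids for a sentence pair."""
    q_tokens = question.lower().split()
    c_tokens = candidate.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the candidate and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(
    "When did the Eagles play the Fog Bowl?",
    "It was played on December 31, 1988.",
)
```

During fine-tuning, a classifier on top of the model's output for such a pair would score whether the candidate answers the question.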
We tested our approach using two public data sets, WikiQA and TREC-QA, and evaluated the system’s performance according to mean average precision (MAP) and mean reciprocal rank (MRR). Intuitively, MAP measures the quality of a sorted list of answers according to the correctness of the complete ranking, while MRR measures how close to the top of the list the first correct answer appears.
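For readers who want the two metrics spelled out, here is a short sketch using their standard definitions. Each question is represented by a list of 0/1 labels in ranked order (1 marks a correct answer); the toy rankings are made up, and how questions with no correct answer are treated is an assumption here (they score 0), since evaluation setups differ on that point.

```python
def average_precision(ranked_labels):
    """Mean of precision@k taken at each rank k that holds a correct answer."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def mean(values):
    return sum(values) / len(values)


# Three toy questions, candidates already sorted by model score.
rankings = [[1, 0, 0], [0, 1, 1, 0], [0, 0, 1]]
map_score = mean([average_precision(r) for r in rankings])
mrr_score = mean([reciprocal_rank(r) for r in rankings])
print(round(map_score, 3), round(mrr_score, 3))  # 0.639 0.611
```

The second question shows the difference between the metrics: MRR looks only at the first correct answer (rank 2), while average precision also rewards the second correct answer at rank 3.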
On WikiQA and TREC-QA, our system’s MAP was 92% and 94.3%, respectively, a significant improvement over the previous records of 83.4% and 87.5%. MRR for our system was 93.3% and 97.4%, up from 84.8% and 94%, respectively.
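The relative error reduction implied by these MAP figures can be checked with a little arithmetic. The sketch assumes that "error" means 100 minus the MAP score, which is an interpretation and not a definition given in the text.

```python
def relative_error_reduction(old_score, new_score):
    """Fraction of the remaining error (100 - score) that was eliminated."""
    old_error = 100.0 - old_score
    new_error = 100.0 - new_score
    return (old_error - new_error) / old_error


wikiqa = relative_error_reduction(83.4, 92.0)   # MAP on WikiQA
trecqa = relative_error_reduction(87.5, 94.3)   # MAP on TREC-QA
print(round(wikiqa * 100, 1), round(trecqa * 100, 1))  # 51.8 54.4
```

Under that assumption, both data sets show the error rate dropping by a little more than half.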