The FEVER data set: What doesn’t kill it will make it stronger
This year at the Conference on Empirical Methods in Natural-Language Processing (EMNLP), we will cohost the Second Workshop on Fact Extraction and Verification — or FEVER — which will explore techniques for automatically assessing the veracity of factual assertions online.
Fact verification is an important part of Alexa’s question-answering service, enabling Alexa to validate the answers she provides and to justify those answers with evidence. The Alexa team’s interest in fact verification is widely shared, as is evidenced by a host of recent challenges, papers, and conferences — including the Truth and Trust Online conference.
The workshop originated from a public data set — the FEVER data set — that we created together with colleagues at the University of Sheffield. The data set contains 185,000 factual assertions, both true and false, which are correlated with Wikipedia excerpts that either substantiate or refute them.
Like the first workshop, the second will feature invited talks from leaders in the field, papers on a range of topics related to fact verification, and presentations by contestants in an open, FEVER-based competition announced the previous spring.
In the first FEVER competition, contestants used the FEVER data set to train machine learning systems to verify facts. The systems were evaluated according to their FEVER scores, which measure both the accuracy of their truth assessments and the quality of the supporting evidence they supply.
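To make the idea of a score that rewards both the verdict and the evidence concrete, here is a minimal sketch in Python. It assumes the strict rule used in the publicly described FEVER metric: a claim counts as correct only if the predicted label matches the gold label and, for claims that can be verified, the predicted evidence contains at least one complete gold evidence set. The function names, the tuple layout, and the sample data are hypothetical, chosen only for illustration; this is not the official scorer.

```python
from typing import List, Set, Tuple

# Assumption: an evidence sentence is identified by (Wikipedia page title, sentence index).
Sentence = Tuple[str, int]
# Assumption: one example is (predicted label, predicted evidence, gold label, gold evidence sets).
Example = Tuple[str, List[Sentence], str, List[Set[Sentence]]]

NOT_ENOUGH_INFO = "NOT ENOUGH INFO"


def is_strictly_correct(example: Example) -> bool:
    """Return True if both the label and the evidence are acceptable."""
    pred_label, pred_evidence, gold_label, gold_evidence_sets = example
    if pred_label != gold_label:
        return False
    # Claims with no verifying evidence are judged on the label alone.
    if gold_label == NOT_ENOUGH_INFO:
        return True
    predicted = set(pred_evidence)
    # At least one gold evidence set must be fully covered by the prediction.
    return any(gold <= predicted for gold in gold_evidence_sets)


def fever_score(examples: List[Example]) -> float:
    """Fraction of examples that are strictly correct."""
    if not examples:
        return 0.0
    return sum(is_strictly_correct(e) for e in examples) / len(examples)


# Hypothetical sample data, not taken from the FEVER data set.
sample: List[Example] = [
    # Right label, evidence covers a gold set: correct.
    ("SUPPORTS", [("Page_A", 0), ("Page_B", 3)], "SUPPORTS", [{("Page_A", 0)}]),
    # Right label, wrong evidence: incorrect.
    ("SUPPORTS", [("Page_C", 1)], "SUPPORTS", [{("Page_A", 0)}]),
    # Wrong label: incorrect.
    ("REFUTES", [], NOT_ENOUGH_INFO, []),
    # Right label, no evidence required: correct.
    (NOT_ENOUGH_INFO, [], NOT_ENOUGH_INFO, []),
]

print(fever_score(sample))  # 0.5
```

Under this rule, a system that guesses the right label but cites the wrong sentences gets no credit for that claim, which is why a FEVER score is typically lower than plain label accuracy.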
This year’s FEVER competition was designed to help augment the FEVER data set through the well-studied machine learning technique of adversarial example generation. The technique has long been a staple of computer vision research but has recently gained ground in natural-language-processing research; Stanford University’s SQuAD data set is one prominent example.
Contestants were invited to produce test cases, either algorithmically or manually, that would elicit mistaken responses from fact verification systems trained on FEVER data. Our aim was that, by identifying characteristics of the error-inducing test cases, we would learn new ways to augment the FEVER data, so that the resulting systems would be both more accurate and more resilient.
At the first FEVER workshop, we reported the performance of 23 teams that participated in the first challenge. The top four finishers allowed us to create versions of their systems that we could host online, so that participants in the second FEVER challenge could attack them at will.
Since the first workshop, however, another 39 teams have submitted fact verification systems trained on FEVER data, pushing the top FEVER score from 64% up to 70%. Three of those teams also submitted hostable versions of their systems, bringing the total number of targets for the second challenge to seven. Following the taxonomy of the Build It, Break It, Fix It contest model, we call the designers of target systems “Builders”.
Three “Breaker” teams submitted adversarial examples. One of these — the Columbia University Natural-Language Processing group, or CUNLP — was also a Builder. CUNLP submitted 501 algorithmically generated adversarial examples; TMLab, from the Samsung R&D Institute Poland, submitted 79 examples, most of which were algorithmically generated but a few of which were manual; and NbAuzDrLqg, from the University of Massachusetts Amherst Center for Intelligent Information Retrieval, submitted 102 manually generated examples.
Only texts that look like valid assertions require verification, so we discounted adversarial examples if they were semantically or syntactically incoherent or if they could not be substantiated or refuted by Wikipedia data. On that basis, we created a weighted FEVER score called the resilience score, which we used to evaluate the Breakers’ submissions.
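The discounting described above can be pictured as a weighted average. The sketch below is an illustration under stated assumptions, not the formula used in the competition: it assumes each adversarial example carries a validity weight between 0 and 1 (0 for an example judged incoherent or unverifiable against Wikipedia, 1 for a fully valid one) and that a per-example FEVER-style correctness flag is already available. The function name and the sample numbers are hypothetical.

```python
from typing import Sequence


def weighted_score(correct: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted fraction of correctly handled examples.

    correct[i] -- whether the target system handled example i correctly
                  (label and evidence), as in a FEVER-style strict check.
    weights[i] -- assumed validity weight of example i; a discounted
                  example contributes less, and a weight of 0 removes it.
    """
    if len(correct) != len(weights):
        raise ValueError("correct and weights must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for c, w in zip(correct, weights) if c) / total


# Hypothetical data: the third example was judged invalid, so it is
# ignored even though the system answered it correctly.
correct = [True, False, True, False]
weights = [1.0, 1.0, 0.0, 0.5]
print(weighted_score(correct, weights))  # 0.4
```

The design point is that a Breaker gains nothing by submitting nonsense: an example that fools a system only lowers that system's score to the extent that the example is itself a valid, verifiable claim.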
We tested all three sets of adversarial examples — plus an in-house baseline consisting of 498 algorithmically generated examples — against all seven target systems. The average resilience of the Builder models was 28.5%, whereas their average FEVER score on the original data set was 58.3%. This demonstrates that the adversarial examples were indeed exposing omissions in the original data set.
TMLab’s examples were the most potent, producing more errors per example than either of the others. They were generated using a variation of the GPT-2 language model, which (like all language models) was designed to predict the next word in a sequence of words on the basis of those that preceded it.
The CUNLP researchers used their successful adversarial examples as templates for generating additional training data. The idea was that if the model was re-trained on the type of data that tended to stump it, it would learn how to handle that data. CUNLP thus became not only a Builder and a Breaker but also our one “Fixer”. After re-training, the CUNLP system became 11% more resilient to adversarial examples, and its FEVER score on the original task also increased, by 2%.
In addition to presentations by Builders and Breakers, the workshop will feature two oral paper presentations and 10 posters. The papers cover a range of topics: some are theoretical explorations of what it means to verify an assertion, drawing on work in areas such as stance detection, argumentation theory, and psychology; others are more concrete experiments with natural-language-processing and search systems.
The invited speakers include William Wang of the University of California, Santa Barbara; Emine Yilmaz of University College London, an Amazon Scholar; Hoifung Poon of Microsoft Research; Sameer Singh of the University of California, Irvine; and David Corney of Full Fact.
The problem of fact verification is far from solved. That’s why we’re excited to be cohosting this Second Workshop and pleased to see both the wide adoption of the FEVER data set and the FEVER score and the contributions they’re making to continuing progress in the field.