Dataset development

Amazon and University of Sheffield researchers make large-scale fact extraction and verification dataset publicly available

Arpit Mittal

May 4, 2018


In recent years, the amount of textual information produced daily has increased exponentially. This information explosion has been accelerated by the ease with which data can be shared across the web. Most textual information is generated as free-form text, and only a small fraction is available in a structured format (Wikidata, Freebase, etc.) that can be processed and analyzed directly by machines.

Search and information retrieval

RarePlanes soar higher: Self-supervised pretraining for resource constrained and synthetic datasets

Justin Downes, Will Gleave, Dan Nakada

WACV 2023 Workshop on Pretraining Large Vision and Multimodal Models

2022

Self-supervised pretraining has advanced the capabilities of many computer vision tasks without requiring additional labels. One drawback is that this technique requires extensive datasets and computational resources. This requirement for large pretraining datasets has often precluded the use of smaller, more niche datasets. Recently a method of pretraining has been developed that uses several stages of

Computer vision

GEMv2: Multilingual NLG benchmarking in a single line of code

Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanch, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza Jolly, Simon Mille, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou

EMNLP 2022

2022

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods

Conversational AI

ExPUNations: Augmenting puns with keywords and explanations

Jiao Sun, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Tagyoung Chung, Jing Huang, Yang Liu, Nanyun Peng

EMNLP 2022

2022

The tasks of humor understanding and generation are challenging and subjective even for humans, requiring commonsense and real-world knowledge to master. Puns, in particular, add the challenge of fusing that knowledge with the ability to interpret lexical-semantic ambiguity. In this paper, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns with detailed crowdsourced

Conversational AI

Self-supervised pretraining for large-scale point clouds

Zaiwei Zhang, Min Bai, Erran Li

NeurIPS 2022

2022

Pretraining on large unlabeled datasets has been proven to improve downstream task performance on many computer vision tasks, such as 2D object detection and video classification. However, for large-scale 3D scenes, such as outdoor LiDAR point clouds, pretraining is not widely used. Due to the special data characteristics of large 3D point clouds, approaches for 2D pretraining frameworks tend to not

Computer vision

Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering

Priyanka Sen, Alham Fikri Aji, Amir Saffari

COLING 2022

2022

We introduce MINTAKA, a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. Mintaka is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex

Conversational AI

Large scale real-world multi-person tracking

Bing Shuai, Alessandro Bergamo, Uta Buechler, Andrew Berneshawi, Alyssa Boden, Joe Tighe

ECCV 2022

2022

This paper presents a new large-scale multi-person tracking dataset, PersonPath22, which is over an order of magnitude larger than currently available high-quality multi-object tracking datasets such as MOT17, HiEve, and MOT20. The lack of large-scale training and test data for this task has limited the community’s ability to understand the performance of their tracking systems on a wide range

Computer vision

FEVER: Fact Extraction and VERification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Arpit Mittal

2018

In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as SUPPORTED, REFUTED or NOTENOUGHINFO by annotators achieving 0.6841

Conversational AI

DIVA: Dataset derivative of a learning task

Yonatan Dukler, Alessandro Achille, Giovanni Paolini, Avinash Ravichandran, Marzia Polito, Stefano Soatto

ICLR 2022

2022

We present a method to compute the derivative of a learning task with respect to a dataset. A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN). The “dataset derivative” is a linear operator, computed around the trained model, that informs how perturbations of the weight of each training sample affect the validation error

Machine learning

ABO: Dataset and benchmarks for real-world 3D object understanding

Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik

CVPR 2022

2022

We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically-based materials that correspond to real, household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current

Computer vision
