Code and datasets

Data augmentation using pre-trained transformer models

Varun Kumar, Ashutosh Choudhary, Eunah Cho

2020

Language model based pre-trained models such as BERT have provided significant gains across different NLP tasks. In this paper, we study different types of transformer based pretrained models such as auto-regressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART) for conditional data augmentation. We show that prepending the class labels to text sequences provides a simple yet effective

Conversational AI

Web Question Answering

Luca Soldaini, Alessandro Moschitti

2020

Large transformer-based language models have been shown to be very effective in many classification tasks. However, their computational complexity prevents their use in applications requiring the classification of a large set of candidates. While previous works have investigated approaches to reduce model size, relatively little attention has been paid to techniques to improve batch throughput during inference

Search and information retrieval

The Schema-Guided Natural Language Generation (SG-NLG) Dataset

Yuheng Du, Shereen Oraby, Vittorio Perera, Minmin Shen, Anjali Narayan-Chen, Tagyoung Chung, Anushree Venkatesh, Dilek Hakkani-Tür, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, Pranav Khaitan

2020

The SG-NLG dataset is a pre-processed version of the DSTC8 Schema-Guided Dialogue (SGD) dataset, designed specifically for data-to-text NLG. The original DSTC8 SGD contains ~20,000 dialogues spanning ~20 domains. This SG-NLG dataset is designed to make it easier to conduct NLG experiments on the SGD data. We pre-process SGD by pairing the schema for each system turn with the corresponding set of natural

Conversational AI

TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection

Siddhant Garg, Thuy Vu, Alessandro Moschitti

2020

We propose TANDA, an effective technique for fine-tuning pre-trained Transformer models for natural language tasks. Specifically, we first transfer a pre-trained model into a model for a general task by fine-tuning it with a large and high quality dataset. We then perform a second fine-tuning step to adapt the transferred model to the target domain. We demonstrate the benefits of our approach for answer

Conversational AI

Multi-domain goal-oriented dialogues (MultiDoGO): Strategies toward curating and annotating large scale dialogue data

Denis Peskov, Nancy Clarke, Jason Krone, Brigi Fodor, Yi Zhang, Adel Youssef, Mona Diab

2019

The need for high-quality, large-scale, goal-oriented dialogue datasets continues to grow as virtual assistants become increasingly widespread. However, existing publicly available datasets useful for this area are limited either in their size, linguistic diversity, domain coverage, or annotation granularity. We introduce the MultiDoGO dataset to overcome these limitations. With a total of over 65,000 dialogues

Conversational AI

Task2Vec

Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, Pietro Perona

2019

We introduce a method to generate vectorial representations of visual classification tasks that can be used to reason about the nature of those tasks and their relations. Given a dataset with ground-truth labels and a loss function, we process images through a “probe network” and compute an embedding based on estimates of the Fisher information matrix associated with the probe network parameters. This provides

Computer vision

OR RL benchmarks

Bharathan Balaji, Jordan Bell-Masterson, Andreas Damianou, Pablo Garcia Moreno, Runfei Luo, Alvaro Maggiar, Balakrishnan (Murali) Narayanaswamy, Chun Ye

2019

Reinforcement Learning (RL) has achieved state-of-the-art results in domains such as robotics and games. We build on this previous work by applying RL algorithms to a selection of canonical online stochastic optimization problems with a range of practical applications: Bin Packing, Newsvendor, and Vehicle Routing. While there is a nascent literature that applies RL to these problems, there are no commonly

Machine learning

Contextual Query Rewrite (CQR) Dataset for spoken dialogue

Pushpendre Rastogi, Arpit Gupta, Tongfei Chen, Lambert Mathias

2019

Dialogue assistants are used by millions of people today to fulfill a variety of tasks. Such assistants also serve as a digital marketplace where any developer can build a domain-specific, task-oriented, dialogue agent offering a service such as booking cabs, ordering food, listening to music, shopping etc. Also, these agents may interact with each other, when completing a task on behalf of the user. Accomplishing

Conversational AI

Topic modeling with Wasserstein autoencoders

Feng Nan, Ran Ding, Ramesh Nallapati, Bing Xiang

2019

We propose a novel neural topic model in the Wasserstein autoencoders (WAE) framework. Unlike existing variational autoencoder based models, we directly enforce a Dirichlet prior on the latent document-topic vectors. We exploit the structure of the latent space and apply a suitable kernel in minimizing the Maximum Mean Discrepancy (MMD) to perform distribution matching. We discover that MMD performs much

Conversational AI

Amazon SageMaker Debugger

Nathalie Rauschmayr, Vikas Kumar, Rahul Huilgol, Andrea Olgiati, Satadal Bhattacharjee, Nihal Harish, Vandana Kannan, Amol Lele, Anirudh Acharya, Jared Nielsen, Lakshmi Ramakrishnan, Ishaaq Chandy, Ishan Bhatt, Zhihan Li, Kohen Chia, Neelesh Dodda, Jiacheng Gu, Miyoung Choi, Balajee Nagarajan, Jeffrey Geevarghes, Denis Davydenko, Sifei Li, Lu Huang, Edward Kim, Tyler Hill, Krishnaram Kenthapadi

2019

Amazon SageMaker Debugger automates the debugging process of machine learning training jobs. From training jobs, Debugger allows you to run your own training script (Zero Script Change experience) using Debugger built-in features—Hook and Rule—to capture tensors, have flexibility to build customized Hooks and Rules for configuring tensors as you want, and make the tensors available for analysis by saving

Machine learning

MLIO

Can Balioglu, Rizwan Gilani

2019

MLIO is a high performance data access library for machine learning tasks with support for multiple data formats. It makes it easy for scientists to train models on their data without worrying about the format or where it's stored. Algorithm developers can also use MLIO to build production-quality algorithms that support a rich variety of data formats and provide helpful parsing and validation messages

Machine learning

Joint biased embeddings

Esma Balkir, Masha Naslidnyk, Dave Palfrey, Arpit Mittal, Sophie Durrant

2019

In this paper we study techniques to improve the performance of bilinear embedding methods for knowledge graph completion on large datasets, where at each epoch the model sees a very small percentage of the training data, and the number of generated negative examples for each positive example is limited to a small portion of the entire set of entities. We first present a heuristic method to infer the types

Machine learning
