Code and datasets

Crowd coachable recommendations / retrieval (CCR)

Zhuotong Chen, Yifei Ma, Branislav Kveton, Anoop Deoras

2023

Contents: al_demo_prime_pantry.ipynb provides a notebook template to run oracle-labeled active learning experiments on a small-scale dataset. al_demo_nq.ipynb provides oracle-labeled experiments on the larger-scale Natural Questions dataset. The only change between the two notebooks is DATA_NAME="nq" in the configuration line. One may also change it to DATA_NAME="msmarco" for the larger-scale MS-MARCO oracle-labeled

Conversational AI

Controlling LLM memorization

Mustafa Ozdayi, Charith Peris, Jack G. M. FitzGerald, Christophe Dupuy, Jimit Majmudar, Haidar Khan, Rahil Parikh, Rahul Gupta

2023

Large language models (LLMs) are known to memorize significant portions of their training data. Parts of this memorized content have been shown to be extractable by simply querying the model, which poses a privacy risk. We present a novel approach which uses prompt-tuning to control the extraction rates of memorized content in LLMs. We present two prompt training strategies to increase and decrease extraction

Conversational AI

PhraseSumm: Abstractive short phrase summarization

Kasturi Bhattacharjee, Kathleen McKeown, Rashmi Gangadharaiah

2023

This repository contains the dataset released alongside the paper PhraseSumm: Abstractive short phrase summarization, to appear as a Findings publication at AACL-IJCNLP 2023. The dataset is released to aid further research and experimentation on the new task of PhraseSumm, i.e., short phrase summarization, introduced in that paper. The data folder contains the relevant files including train

Conversational AI

Learning action embeddings for off-policy evaluation

Matej Cief, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, Artur Bekasov

2023

This repository contains code for evaluating the methods proposed in Learning action embeddings for off-policy evaluation. To get started, we recommend checking the Example.ipynb notebook, as it clearly demonstrates the benefits of the proposed method from Section 3 and implements everything in a few lines of code. To run the notebook, you only need Python 3 with standard machine learning libraries. To run the

Machine learning

ContraCLM: Contrastive learning for causal language model

Nihal Jain, Dejiao Zhang, Wasi Ahmad, Zijian Wang, Feng Nan, Xiaopeng LI, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang

2023

Despite exciting progress in causal language models, the expressiveness of their representations is largely limited due to poor discrimination ability. To remedy this issue, we present CONTRACLM, a novel contrastive learning framework at both the token-level and the sequence-level. We assess CONTRACLM on a variety of downstream tasks. We show that CONTRACLM enhances the discrimination of representations

Conversational AI

Robust table question answering

Weizhe Lin, Rexhina Blloshmi, Bill Byrne, Adrià de Gispert, Gonzalo Iglesias

2023

Inner Table Retriever (ITR) is a general-purpose approach for handling long tables in TableQA that extracts sub-tables to preserve the most relevant information for a question. ITR can be easily integrated into existing systems to improve their accuracy and achieve state-of-the-art results.

Search and information retrieval

Unique batches

Donato Crisostomi, Andrea Caciolai, Alessandro Pedrani, Alessandro Manzotti, Enrico Palumbo, Kay Rottmann, Davide Bernardi

2023

This package contains the code to implement and test a new approach to model training. Its goal is to reduce training time while keeping the final accuracy on par. We do this by taking advantage of (that is, reducing) data redundancy. We call it Unique Batches because we deduplicate the data at the batch level rather than across the entire dataset, keeping the model learning trajectory very close to the full dataset
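The batch-level deduplication described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function name `unique_batches`, the use of plain strings as examples, and the choice to keep a per-example count (which a training loop might use as a loss weight) are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def unique_batches(
    examples: Iterable[str], batch_size: int
) -> Iterator[List[Tuple[str, int]]]:
    """Yield batches in which duplicates are collapsed into (example, count) pairs.

    Deduplication happens inside each batch only, so an example that appears
    in several batches is still seen in each of them.
    Hypothetical helper: the real package may differ.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            # Counter keeps first-seen order of keys (Python 3.7+).
            yield list(Counter(batch).items())
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield list(Counter(batch).items())


data = ["play music", "play music", "stop", "play music", "stop", "stop"]
batches = list(unique_batches(data, batch_size=3))
```

In this toy input, each batch of three utterances shrinks to two unique entries, while the counts preserve how often each one occurred in that batch.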

Data augmentation for entity resolution

Dae Yon Hwang, Yaroslav Nechaev, Cyprien delichy, Renxian Zhang

2023

In this work, we investigate data augmentation methods to improve the performance of state-of-the-art models for four different downstream tasks. Specifically, we propose a Generative Adversarial Network using Language Models (GAN-LM) approach that combines a deep generative model with a pre-trained language model to produce diverse augmentations. We compare the GAN-LM to various conventional methods in non-contextual

Machine learning

Optimizing multi-task training through dynamic pipelines

Chenyu Jiang, Zhen Jia, Shuai Zheng, Yida Wang, Chuan Wu

2023

During multi-task training, the model commonly receives input sequences of highly different lengths due to the diverse contexts of different tasks. Padding (to the same sequence length) or packing (short examples into long sequences of the same length) is usually adopted to prepare input samples for model training, which is nonetheless not space or computation efficient. This project adopts a dynamic micro-batching
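The overhead of padding and the mechanics of packing mentioned above can be made concrete with a small sketch. It only counts token slots for the two baseline strategies; it is not the project's dynamic micro-batching method. The function names, the first-fit-decreasing heuristic, and the example lengths are assumptions chosen for illustration.

```python
from typing import List


def padded_slots(lengths: List[int]) -> int:
    """Token slots used when every sequence is padded to the longest one."""
    return max(lengths) * len(lengths) if lengths else 0


def pack_first_fit(lengths: List[int], max_len: int) -> List[List[int]]:
    """Pack sequence lengths into bins of capacity max_len.

    Uses a greedy first-fit-decreasing heuristic (an illustrative choice).
    """
    bins: List[List[int]] = []
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len={max_len}")
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)  # fits into an existing bin
                break
        else:
            bins.append([n])  # no bin had room, so open a new one
    return bins


lengths = [512, 32, 48, 400, 16, 64]   # hypothetical per-task sequence lengths
real_tokens = sum(lengths)             # tokens that carry actual content
padding_cost = padded_slots(lengths)   # slots consumed with padding
packed = pack_first_fit(lengths, max_len=512)
packing_cost = len(packed) * 512       # slots consumed with packing
```

With these example lengths, padding uses 3072 slots for 1072 real tokens, and packing into 512-token sequences uses 1536 slots, so both strategies still spend part of the budget on filler.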

Cloud and systems

Supervised intent clustering

Giorgio Barnabo, Antonio Uva, Sandro Pollastrini, Chiara Rubagotti, Davide Bernardi

2023

This is a package to fine-tune language models in order to create clustering-friendly embeddings. It is based on the paper Supervised clustering loss for clustering-friendly sentence embeddings: An application to intent clustering. Modern virtual assistants are trained to classify customer requests into a taxonomy of predesigned intents. Requests that fall outside of this taxonomy, however, are often unhandled

Conversational AI

NameGuess: Column name expansion for tabular data

Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Shen Wang, Huzefa Rangwala, George Karypis

2023

Recent advances in large language models have revolutionized many sectors, including the database industry. One common challenge when dealing with large volumes of tabular data is the pervasive use of abbreviated column names, which can negatively impact performance on various data search, access, and understanding tasks. To address this issue, we introduce a new task, called NameGuess, to expand column

Conversational AI

Carbon assessment with machine learning

Bharathan Balaji, Venkata Sai Gargeya Vunnava, Shikhar Gupta, Nina Domingo, Harsh Gupta, Geoffrey Guest, Jared Kramer, Aravind Srinivasan

2023

This code repository presents a machine-learning-based method for selecting an Environmental Impact Factor (EIF) for a given product, material, or activity, which is a fundamental step of carbon footprinting. The code documents the methods in the following research papers. EIF matching with generative AI, published in CCAI@NeurIPS 2024 -- Parakeet: Emission Factor Recommendation for Carbon Footprinting

Conversational AI
