Multimodal interaction

MMT4: Multi modality to text transfer transformer

Amir Tavanaei, Karim Bouyarmane, Iman Keivanloo, Ismail Tutar

KDD 2022

2022

Recent studies have demonstrated the ability of auto-regressive and seq-to-seq generative models to reach state-of-the-art performance on various Natural Language Understanding (NLU) and Natural Language Processing (NLP) tasks. They operate by framing all the tasks in a single formulation: text auto-completion or text-to-text encoding-decoding. These models can be trained on the products corpus in order

Conversational AI

X-DETR: A versatile architecture for instance-wise vision-language tasks

Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, Stefano Soatto

ECCV 2022

2022

In this paper, we study the challenging instance-wise vision-language tasks, where the free-form language is required to align with the objects instead of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams are independent until the end and they are

Computer vision

CMA-CLIP: Cross-modality attention clip for text-image classification

Jinmiao Fu, Shaoyuan Xu, Huidong Liu, Yang Liu, Ning Xie, Chien-Chih Wang, Jia Liu, Yi Sun, Bryan Wang

IEEE ICIP 2022

2022

Multi-modal learning with both text and images benefits multiple applications, such as attribute extraction for e-commerce products. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new multi-modal architecture to jointly learn the fine-grained inter-modality relationship. It fuses CLIP with a sequence-wise attention module and a modality-wise attention

Computer vision

Semantic VL-BERT: Visual grounding via attribute learning

Prashan Wanigasekara, Kechen Qin, Emre Barut, Fan Yang, Weitong Ruan, Chengwei Su

IJCNN 2022

2022

In recent years, Smart Home Assistants have expanded into tens of thousands of devices and transformed from a voice only assistant to a much more versatile smart assistant, that uses a connected display to provide a multi-modal customer experience. In order to further improve on the multimodality experience, comprehension systems need models that can work with multisensory inputs. We focus on the problem

Computer vision

ASD-transformer: Efficient active speaker detection using self and multimodal transformers

Gourav Datta, Tyler Etchart, Vivek Yadav, Varsha Hedau, Pradeep Natarajan, Shih-Fu Chang

ICASSP 2022

2022

Multimodal active speaker detection (ASD) methods assign a speaking/not-speaking label per individual in a video clip. ASD is critical for applications such as natural human-computer interaction, speaker diarization, and video reframing. Recent work has shown the success of transformers in multimodal settings, thus we propose a novel framework that leverages modern transformer and concatenation mechanisms

Computer vision

Helping voice shoppers make purchase decisions

Gustavo Penha, Eyal Krikon, Vanessa Murdock, Sandeep Avula

CHI 2022

2022

Online shoppers have a lot of information at their disposal when making a purchase decision. They can look at images of the product, read reviews, make comparisons with other products, do research online, read expert reviews, and more. Voice shopping (purchasing items via a Voice assistant such as Amazon Alexa or Google Assistant) is different. Voice introduces novel challenges as the communication channel

Search and information retrieval

Vision-language pre-training with triple contrastive learning

Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang

CVPR 2022

2022

Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded

Computer vision

Multi-modal pre-training for automated speech recognition

David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

ICASSP 2022

2022

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that

Machine learning

FashionVLP: Vision language transformer for fashion retrieval with feedback

Sonam Goenka, Zhaoheng Zheng, Ayush Jaiswal, Rakesh Chada, Yue (Rex) Wu, Varsha Hedau, Pradeep Natarajan

CVPR 2022

2022

Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion related information from visual and textual modalities simultaneously. We propose a new vision-language transformer based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval

Search and information retrieval

LaTr: Layout-aware transformer for scene-text VQA

Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha

CVPR 2022

2022

We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named LayoutAware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single objective

Computer vision

Multimodal interaction

Work with us