Computer vision

Helping devices see and understand our visual world.

X-Former: Unifying contrastive and reconstruction learning for MLLMs

Swetha Sirnam, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah

ECCV 2024

2024

Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing

Computer vision
HVCLIP: High-dimensional vector in CLIP for unsupervised domain adaptation

Sol Vesdapunt, Kah Kuen Fu, Yue (Rex) Wu, Xu Zhang, Pradeep Natarajan

ECCV 2024

2024

Recent advances in large-scale image-text pre-training models (such as CLIP) have significantly improved unsupervised domain adaptation (UDA) by leveraging pre-trained knowledge to bridge the gap between source and target domains. However, catastrophic forgetting remains the main challenge, since traditional fine-tuning methods that adjust CLIP model weights on a target domain can quickly override

Computer vision
Correspondence-free SE(3) point cloud registration in RKHS via unsupervised equivariant learning

Ray Zhang, Zheming Zhou, Min Sun, Omid Alizadeh, Cheng-Hao Kuo, Ryan M. Eustice, Maani Ghaffari, Arnie Sen

ECCV 2024

2024

This paper introduces a robust unsupervised SE(3) point cloud registration method that operates without requiring point correspondences. The method frames point clouds as functions in a reproducing kernel Hilbert space (RKHS), leveraging SE(3)-equivariant features for direct feature space registration. A novel RKHS distance metric is proposed, offering reliable performance amidst noise, outliers, and asymmetrical

Computer vision
Open vocabulary multi-label video classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah

ECCV 2024

2024

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously

Computer vision
REFINESUMM: Self-refining MLLM for generating a multimodal summarization dataset

Vaidehi Patil, Leonardo Ribeiro, Mengwen Liu, Mohit Bansal, Markus Dreyer

ACL 2024

2024

Multimodal Large Language Models (MLLMs) excel at synthesizing key information from diverse sources. However, generating accurate and faithful multimodal summaries is challenging, primarily due to the lack of appropriate multimodal datasets for fine-tuning that meaningfully integrate textual and visual modalities. To address this gap, we present a new dataset specifically designed for image-text multimodal

Computer vision

Kai Weinsziehr/MPG

Amazon and Max Planck Society launch Science Hub

Staff writer

May 27, 2022

The first Amazon Science Hub to exist outside the US will focus on driving AI research and development throughout Germany.

Machine learning
Paper on translating images into maps wins ICRA best-paper award

Chris Russell

May 26, 2022

Reformulating the mapping problem to take advantage of sequence-to-sequence Transformers improves performance by an average of 15%.

Computer vision
Courtesy of Ankan Bansal

Ankan Bansal’s long journey into the world of computer vision

Staff writer

May 3, 2022

How a math-loving student travelled 7,000 miles to pursue a passion and wound up becoming an applied scientist.

Computer vision
How does Astro localize itself in an ever-changing home?

Jianbo Ye, Arnie Sen

April 19, 2022

Deep learning to produce invariant representations, estimations of sensor reliability, and efficient map representations all contribute to Astro’s superior spatial intelligence.

Computer vision
“Robin deals with a world where things are changing all around it”

Alan S. Brown

April 18, 2022

An advanced perception system, which detects and learns from its own mistakes, enables Robin robots to select individual objects from jumbled packages — at production scale.

Robotics
How Prime Video uses machine learning to ensure video quality

Sathya Balakrishnan, Ihsan Ozcelik

March 4, 2022

Detectors for block corruption, audio artifacts, and errors in audio-video synchronization are just three of Prime Video’s quality assurance tools.

Computer vision
