Computer vision

Helping devices see and understand our visual world.

Towards image copy detection at e-commerce scale

Vishnu Prabhakaran, Vishruit Kulshreshtha, Purav Aggarwal, Gokul Swamy

IEEE ICIP 2025

2025

A copy detection system aims to identify whether a query image is an edited or manipulated copy of an image from a large reference database containing millions of images. While global image descriptors can retrieve visually similar images, they struggle to differentiate near-duplicates from semantically similar instances. We propose a dual-triplet metric learning (DTML) technique to learn global image features that group

Search and information retrieval
VADE: Visual attention guided hallucination detection and elimination

Vishnu Prabhakaran, Purav Aggarwal, Vinay Kumar Verma, Gokul Swamy, Anoop S V K K Saladi

ACL 2025

2025

Vision Language Models (VLMs) have achieved significant advancements in complex visual understanding tasks. However, VLMs are prone to hallucinations—generating outputs that lack alignment with visual content. This paper addresses hallucination detection in VLMs by leveraging the visual grounding information encoded in transformer attention maps. We identify three primary challenges in this approach: the

Conversational AI
POp-GS: Next best view in 3D-Gaussian splatting with P-Optimality

Joey Wilson, Marcelino Almeida, Sachit Mahajan, Martin Labrie, Maani Ghaffari, Omid Alizadeh, Min Sun, Cheng-Hao Kuo, Arnab Sen

CVPR 2025

2025

In this paper, we present a novel algorithm for quantifying uncertainty and information gained within 3D Gaussian Splatting (3D-GS) through P-Optimality. While 3D-GS has proven to be a useful world model with high-quality rasterizations, it does not natively quantify uncertainty or information, posing a challenge for real-world applications such as 3D-GS SLAM. We propose to quantify information gain in

Computer vision
UA-Pose: Uncertainty-aware 6D object pose estimation and online object completion with partial references

Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, Cheng-Hao Kuo

CVPR 2025

2025

6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object’s appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware

Computer vision
Video quality assessment for resolution cross-over in live sports

Jingwen Zhu, Yixu Chen, Hai Wei, Sriram Sethuraman, Yongjun Wu

ICME 2025

2025

In adaptive bitrate streaming, resolution cross-over refers to the point on the convex hull where the encoding resolution should switch to achieve better quality. Accurate cross-over prediction is crucial for streaming providers to optimize resolution at given bandwidths. Most existing works rely on objective Video Quality Metrics (VQM), particularly VMAF, to determine the resolution cross-over. However

Computer vision

A quick guide to Amazon’s papers at CVPR 2024

Staff writer

June 13, 2024

As in other areas of AI, generative models and foundation models — such as vision-language models — are a hot topic.

Computer vision
How Project P.I. helps Amazon remove imperfect products

John Roach

June 03, 2024

A combination of generative AI and computer vision imaging tunnels is helping Amazon proactively improve the customer experience.

Computer vision
More reliable nearest-neighbor search with deep metric learning

Qin Zhang

May 31, 2024

Novel loss term that can be added to any loss function regularizes interclass and intraclass distances.

Machine learning
Generalizing diffusion modeling to multimodal, multitask settings

Changyou Chen

May 17, 2024

A novel loss function and a way to aggregate multimodal input data are key to dramatic improvements on some test data.

Computer vision
Virtual try-all: Visualizing any product in any personal setting

Karim Bouyarmane

April 16, 2024

First model to work across a wide range of products uses a second U-Net encoder to capture fine-grained product details.

Computer vision
New pretraining tasks enable better document understanding

Srikar Appalaraju

March 07, 2024

DocFormerV2 makes sense of documents using local features, outperforming much bigger models.

Computer vision
