Computer vision

Helping devices see and understand our visual world.

Prompting vision-language models for aspect-controlled generation of referring expressions

Danfeng Guo, Sanchit Agarwal, Arpit Gupta, Jiun-Yu Kao, Emre Barut, Tagyoung Chung, Jing Huang, Mohit Bansal

NAACL 2024

2024

Referring Expression Generation (REG) is the task of generating a description that unambiguously identifies a given target in the scene. Different from Image Captioning (IC), REG requires learning fine-grained characteristics of not only the scene objects but also their surrounding context. Referring expressions are usually not singular; an object can often be uniquely referenced in numerous ways, for instance

Computer vision
FairRAG: Fair human generation via fair retrieval augmentation

Robik Shrestha, Yang Zou, James Chen, Zhiheng Li, Yusheng Xie, Tiffany Deng

CVPR 2024

2024

Existing text-to-image generative models reflect or even amplify societal biases ingrained in their training data. This is especially concerning for human image generation where models are biased against certain demographic groups. Existing attempts to rectify this issue are hindered by the inherent limitations of the pre-trained models and fail to substantially improve demographic diversity. In this work

Computer vision
M3T: A new benchmark dataset for multi-modal document-level machine translation

Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nădejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

NAACL 2024

2024

Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents

Computer vision
De-noised vision-language fusion guided by visual cues for e-commerce product search

Zhizhang Hu, Shasha Li, Ming Du, Arnab Dhua, Doug Gray

CVPR 2024 Workshop on Multimodal Learning and Applications

2024

In e-commerce applications, vision-language multimodal transformer models play a pivotal role in product search. The key to successfully training a multimodal model lies in the alignment quality of image-text pairs in the dataset. However, the data in practice is often automatically collected with minimal manual intervention. Hence the alignment of image-text pairs is far from ideal. In e-commerce, this

Computer vision
Benchmarking zero-shot recognition with vision-language models: Challenges on granularity and specificity

Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joe Tighe, Davide Modolo

CVPR 2024 Workshop on "What is Next in Multimodal Foundation Models?"

2024

This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs’ consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show

Computer vision

The surprisingly subtle challenge of automating damage detection

Sean O'Neill

September 19, 2022

Why detecting damage is so tricky at Amazon’s scale — and how researchers are training robots to help with that gargantuan task.

Robotics
Ying Ding’s human-centered approach to AI-enhanced medical imaging diagnosis

Staff writer

August 16, 2022

ARA recipient is using artificial intelligence to help doctors make decisions based on radiological data.

Machine learning
"Among all sources of information, visual information may be the most interesting"

Mariana Lenharo

July 20, 2022

Violetta Shevchenko, an Amazon applied scientist and former intern, combines vision and language to create solutions to challenging problems.

Computer vision
Better joint representations of image and text

Liqun Chen

July 1, 2022

Two methods presented at CVPR achieve state-of-the-art results by imposing additional structure on the representational space.

Computer vision
A little public data makes privacy-preserving AI models more accurate

Alessandro Achille, Yu-Xiang Wang

June 24, 2022

Technique that mixes public and private training data can meet differential-privacy criteria while cutting error increase by 60%-70%.

Computer vision
How a passion for reinforcement learning guided Alexander Long’s trajectory

Mariana Lenharo

June 24, 2022

The field motivated him to pursue a PhD, which eventually led him to Amazon.

Computer vision
