A quick guide to Amazon’s papers at CVPR 2024

As in other areas of AI, generative models and foundation models — such as vision-language models — are a hot topic.

By Staff writer

June 13, 2024

In the past few years, foundation models and generative-AI models, particularly large language models (LLMs), have become a major topic of AI research. That’s true even in the field of computer vision, with its increased focus on vision-language models that yoke LLMs and image encoders.

This shift can be seen in the topics of the Amazon papers accepted to this year’s Computer Vision and Pattern Recognition Conference (CVPR 2024). A plurality of the papers deal with vision-language models, while a number of others concern related topics such as visual question answering, hallucination mitigation, and retrieval-augmented generation. At the same time, however, classical computer vision topics such as 3-D reconstruction, object tracking, and pose estimation remain well represented.

3-D reconstruction

No more ambiguity in 360° room layout via bi-layout estimation
Yu-Ju Tsai, Jin-Cheng Jhang, Jingjing Zheng, Wei Wang, Albert Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang

ViewFusion: Towards multi-view consistency via interpolated denoising
Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, Anton van den Hengel

The object views produced by standard diffusion models are often realistic, but adjacent views may lack alignment *(left)*. ViewFusion incorporates an autoregressive process that fosters consistency across views *(right)*. From "ViewFusion: Towards multi-view consistency via interpolated denoising".

Algorithmic information theory

Interpretable measures of conceptual similarity by complexity-constrained descriptive auto-encoding
Alessandro Achille, Greg Ver Steeg, Tian Yu Liu, Matthew Trager, Carson Klingenberg, Stefano Soatto

Geospatial analysis

Bridging remote sensors with multisensor geospatial foundation models
Boran Han, Shuai Zhang, Xingjian Shi, Markus Reichstein

Hallucination mitigation

Multi-modal hallucination control by visual information grounding
Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto

THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models
Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, Stefano Soatto

Metric learning

Learning for transductive threshold calibration in open-world recognition
Qin Zhang, Dongsheng An, Tianjun Xiao, Tong He, Qingming Tang, Ying Nian Wu, Joe Tighe, Yifan Xing, Stefano Soatto

Model robustness

GDA: Generalized diffusion for robust test-time adaptation
Yun Yun Tsai, Fu-Chen Chen, Albert Chen, Junfeng Yang, Che-Chun Su, Min Sun, Cheng-Hao Kuo

Object-centric learning

Adaptive slot attention: Object discovery with dynamic slot number
Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, Zheng Zhang

Object tracking

Self-supervised multi-object tracking with path consistency
Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo

Pose estimation

MRC-Net: 6-DoF pose estimation with multiscale residual correlation
Yuelong Li, Yafei Mao, Raja Bala, Sunil Hadap

Responsible AI

FairRAG: Fair human generation via fair retrieval augmentation
Robik Shrestha, Yang Zou, James Chen, Zhiheng Li, Yusheng Xie, Tiffany Deng

Retrieval-augmented generation

CPR: Retrieval augmented generation for copyright protection
Aditya Golatkar, Alessandro Achille, Luca Zancato, Yu-Xiang Wang, Ashwin Swaminathan, Stefano Soatto

Security

Sharpness-aware optimization for real-world adversarial attacks for diverse compute platforms with enhanced transferability
Muchao Ye, Xiang Xu, Qin Zhang, Jon Wu

Video-language models

VidLA: Video-language alignment at scale
Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi

Vision-language models

Accept the modality gap: An exploration in the hyperbolic space
Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, Ajanthan Thalaiyasingam

"Accept the modality gap: An exploration in the hyperbolic space" proposes a new angle-based contrastive loss that permits the placement of images anywhere along the axis emanating from a text embedding, enabling a hierarchy among images.

Enhancing vision-language pre-training with rich supervisions
Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

GROUNDHOG: Grounding large language models to holistic segmentation
Yichi Zhang, Martin Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi (QZ) Gao, Joyce Chai

Hyperbolic learning with synthetic captions for open-world detection
Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

Non-autoregressive sequence-to-sequence vision-language models
Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

On the scalability of diffusion-based text-to-image generation
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto

The effect of UNet scaling on text-image alignment. In "On the scalability of diffusion-based text-to-image generation", Amazon researchers vary a UNet along two dimensions: channel number *(left)* and transformer depth *(right)*. The prompts are (1) "square blue apples on a tree with circular yellow leaves"; (2) "five frosted glass bottles"; (3) "a yellow box to the right of a blue sphere"; (4) "the International Space Station flying in front of the moon".

Visual question answering

GRAM: Global reasoning for multi-page VQA
Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman

Question aware vision transformer for multimodal reasoning
Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman

Synthesize step-by-step: Tools, templates and LLMs as data generators for reasoning-based chart VQA
Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar
