Computer vision

Helping devices see and understand our visual world.

Capturing gaze shifts for guidance: Cross-modal fusion enhancement for VLM hallucination mitigation

Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas

ICML 2026

2026

Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook

Computer vision
SMPRO: Self-supervised visual preference alignment via differentiable multi-preference multi-group ranking

Swetha Sirnam, Rui Meng, Shwetha Ram, Tal Neiman, Son Tran, Mubarak Shah

AAAI 2026

2026

Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning models with human preferences. However, existing DPO-based methods suffer from three key drawbacks: they rely on only a single positive-negative preference pair per question, restricting the diversity and richness of feedback; they often emphasize minimizing negative preference scores while neglecting to strengthen

Computer vision
Playgen-Mog: Framework for diverse multi-agent play generation via mixture-of-Gaussians trajectory prediction

Kevin Song

CVPR IEEE 2026 Workshop on Computer Vision in Sports

2026

Multi-agent trajectory generation in team sports requires models that capture both the diversity of possible plays and realistic spatial coordination between players. Standard generative approaches such as Conditional Variational Autoencoders (CVAE) and diffusion models struggle with this task, exhibiting posterior collapse or convergence to the dataset mean. Moreover, most trajectory prediction

Computer vision
Beyond disjoint tasks: Towards more natural continual learning for vision-language models

Xiang Xu, Yiyang Su, Tianchen Zhao, Zheng Zhang, Zhuowen Tu, Anil Jain, Jon Wu

ICML 2026

2026

Continual learning methods for vision-language models are developed on benchmarks where each new task introduces entirely new domain knowledge. Real-world task sequences are more natural: they routinely share visual concepts, language patterns, and even training samples across stages. However, existing mixture-of-expert methods that assign one expert per task with fixed routing can split similar inputs

Computer vision
Adaptive geometry routing for vision–language understanding

Sarthak Srivastava, Kathy Wu

KDD 2026

2026

Vision language models face a fundamental geometry trade-off: Euclidean representations excel at instance-level discrimination, while hyperbolic representations naturally encode semantic hierarchies. Hybrid training is challenging because one geometry may dominate early, leaving the other under-trained, a failure mode we term geometry dominance. We introduce Adaptive Geometry Routing (AGR), a framework that

Computer vision

Long-form-video understanding and synthesis

Raffay Hamid

June 28, 2023

Four CVPR papers from Prime Video examine a broad set of topics related to efficient model training for understanding and synthesizing long-form cinematic content.

Computer vision
A quick guide to Amazon's 20-plus papers at CVPR 2023

Staff writer

June 20, 2023

Image segmentation, multimodal models, and innovative machine learning methods are among the Amazon researchers' areas of focus.

Computer vision
A user-controllable framework that unifies style transfer methods

Yue (Rex) Wu

February 14, 2023

A diversity of outputs ensures that a style transfer model can satisfy any user’s tastes.

Computer vision
Computer vision for automated quality inspection

Staff writer

January 5, 2023

How an AWS customer uses Lookout for Vision to build custom computer vision models to automate quality inspection and detect defects.

Computer vision
WACV: Where application-based research finds a home

Larry Hardesty

January 4, 2023

As video scales up — in both duration and resolution — it raises new research questions.

Computer vision
More-efficient annotation for semantic segmentation in video

Nan Qiao

January 3, 2023

Automated methods with a little human guidance use annotators’ time much more efficiently.
