-
Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring
-
2025
Video summarization aims to generate a condensed textual version of an original video. Summaries may consist of either plain text or a shortlist of salient events, possibly including temporal or spatial references. Video Large Language Models (VLLMs) exhibit impressive zero-shot capabilities in video analysis. However, their performance varies significantly according to the LLM prompt, the characteristics
-
SPIE Defense + Commercial Sensing 2025
Transformer models have revolutionized the field of image captioning, offering advanced capabilities through self-attention mechanisms that capture intricate visual and textual relationships. This paper presents an innovative approach to applying transformer models for image captioning. Current State-of-the-Art (SOTA) performance has only been achieved by large vision-language models (LVLMs). Our approach
-
IEEE ICIP 2025
A copy detection system aims to identify whether a query image is an edited or manipulated copy of an image from a large reference database with millions of images. While global image descriptors can retrieve visually similar images, they struggle to differentiate near-duplicates from semantically similar instances. We propose a dual-triplet metric learning (DTML) technique to learn global image features that group
-
2025
Vision Language Models (VLMs) have achieved significant advancements in complex visual understanding tasks. However, VLMs are prone to hallucinations, generating outputs that lack alignment with visual content. This paper addresses hallucination detection in VLMs by leveraging the visual grounding information encoded in transformer attention maps. We identify three primary challenges in this approach: the
Related content
-
February 14, 2023
A diversity of outputs ensures that a style transfer model can satisfy any user’s tastes.
-
January 5, 2023
How an AWS customer uses Lookout for Vision to build custom computer vision models to automate quality inspection and detect defects.
-
January 4, 2023
As video scales up, in both duration and resolution, it raises new research questions.
-
January 3, 2023
Automated methods with a little human guidance use annotators’ time much more efficiently.
-
December 26, 2022
Combining contrastive training and selection of hard negative examples establishes new benchmarks.
-
December 16, 2022
University of Wisconsin-Madison associate professor and ARA recipient has authored a series of pioneering papers on real-time object instance segmentation.