Characterizing video question answering with sparsified inputs
2024
In Video Question Answering, videos are often processed as full-length frame sequences to minimize information loss. Recent works have shown evidence that sparse video inputs can be sufficient to maintain high performance, but they usually consider only single-frame selection. In our work, we extend this setting to various input lengths and additional modalities, characterizing the task under different input sparsities and providing a tool for doing so. Specifically, we propose a Multi-Gumbel-based sparsification module that adaptively finds the best video inputs for the final task. We experiment on public VideoQA benchmarks and analyze how sparsified inputs affect performance. We observed only a 5.2%−5.8% loss in performance when using only 10% of the video length. We also observed complementary behaviour between visual and textual inputs, even under highly sparsified settings: by adding just 100 words, we can even surpass a state-of-the-art model. Our work suggests the potential of improving data efficiency for video-and-language tasks.
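The abstract does not detail the Multi-Gumbel module itself. As an illustrative sketch only (the function name, scoring, and temperature parameter are our assumptions, not the paper's implementation), adaptively keeping a sparse subset of frames can be done with the Gumbel-top-k trick, which samples k items without replacement from a softmax over per-frame relevance scores:

```python
import numpy as np

def gumbel_topk_select(scores, k, tau=1.0, rng=None):
    """Sample k frame indices without replacement via the Gumbel-top-k trick.

    Adding i.i.d. Gumbel(0, 1) noise to the scores (divided by a
    temperature tau) and taking the top-k is equivalent to sampling k
    items without replacement from the softmax distribution over scores.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    perturbed = scores / tau + gumbel
    # Indices of the k largest perturbed scores, in temporal order.
    return np.sort(np.argsort(perturbed)[-k:])

# Keep 10% of a 100-frame video; scores here favour later frames.
scores = np.linspace(0.0, 1.0, 100)
kept = gumbel_topk_select(scores, k=10, rng=np.random.default_rng(0))
```

In a trained model the hard top-k choice would typically be relaxed (e.g. with a straight-through Gumbel-softmax estimator) so that the selection remains differentiable end-to-end; the sketch above shows only the sampling step.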