Characterizing video question answering with sparsified inputs
2024
In Video Question Answering, videos are often processed as full-length frame sequences to minimize information loss. Recent works have shown evidence that sparse video inputs can be sufficient to maintain high performance, but they usually consider only single-frame selection. In our work, we extend this setting to various input lengths and additional modalities, characterizing the task under different input sparsities and providing a tool for doing so. Specifically, we propose a Multi-Gumbel-based sparsification module that adaptively finds the best video inputs for the final task. We experiment on public VideoQA benchmarks and analyze how sparsified inputs affect performance. We observed only a 5.2%−5.8% loss in performance when using only 10% of the video length. We also observed complementary behaviour between visual and textual inputs, even under highly sparsified settings: by adding just 100 words, we can even surpass a state-of-the-art model. Our work suggests the potential of improving data efficiency for video-and-language tasks.
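The abstract does not detail the Multi-Gumbel module itself. As an illustrative sketch only (the function name, scoring, and temperature parameter are our assumptions, not the paper's implementation), adaptively keeping a sparse subset of frames can be done with the Gumbel-top-k trick, which samples k items without replacement from a softmax over per-frame relevance scores:

```python
import numpy as np

def gumbel_topk_select(scores, k, tau=1.0, rng=None):
    """Sample k frame indices without replacement via the Gumbel-top-k trick.

    Adding i.i.d. Gumbel(0, 1) noise to the scores (divided by a
    temperature tau) and taking the top-k is equivalent to sampling k
    items without replacement from the softmax distribution over scores.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    perturbed = scores / tau + gumbel
    # Indices of the k largest perturbed scores, in temporal order.
    return np.sort(np.argsort(perturbed)[-k:])

# Keep 10% of a 100-frame video; scores here favour later frames.
scores = np.linspace(0.0, 1.0, 100)
kept = gumbel_topk_select(scores, k=10, rng=np.random.default_rng(0))
```

In a trained model the hard top-k choice would typically be relaxed (e.g. with a straight-through Gumbel-softmax estimator) so that the selection remains differentiable end-to-end; the sketch above shows only the sampling step.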