Video story question answering with character-centric scene parsing and question-aware temporal attention
With the explosive growth of videos, there is increasing interest in automatic video understanding. Video Story Question Answering (VSQA) has proven to be an effective way to benchmark the comprehension ability of a model. Recent VSQA approaches merely extract visual features from the whole scene or from detected objects in each frame. However, it is hard to claim that a method truly understands a video without considering the characters in it. Additionally, the relations and actions acquired by scene parsing are indispensable for comprehending video stories. In this work, we incorporate character-centric scene parsing to assist the VSQA task. Our reasoning framework consists of two parts: the first uses question-aware temporal attention to locate the relevant frame intervals; the second employs a cross-attention transformer for multi-stream fusion. We train and test our VSQA model on the recently released TVQA dataset, the largest VSQA dataset to date. Experiments show that all modules in our framework work collaboratively and significantly improve overall performance. Ablation studies demonstrate that our scene-parsing-based framework is efficacious for a deeper understanding of video semantics.
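The question-aware temporal attention mentioned in the abstract can be illustrated with a minimal sketch: score each frame's features against a question embedding, normalize the scores over time with a softmax, and pool the frames by those weights. The function name, shapes, and scaled-dot-product scoring here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def question_aware_temporal_attention(frame_feats, question_feat):
    """Pool frame features by their relevance to the question.

    frame_feats: (T, d) array of per-frame visual features (assumed shape).
    question_feat: (d,) question embedding (assumed shape).
    Returns a (d,) question-conditioned video summary and the (T,) weights.
    """
    d = frame_feats.shape[1]
    scores = frame_feats @ question_feat / np.sqrt(d)  # scaled dot-product relevance
    weights = np.exp(scores - scores.max())            # numerically stable softmax
    weights = weights / weights.sum()                  # temporal attention distribution
    pooled = weights @ frame_feats                     # weighted sum over frames
    return pooled, weights
```

High-weight frames indicate the temporal interval most relevant to the question, which is the localization role the abstract assigns to this module.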