Video story question answering with character-centric scene parsing and question-aware temporal attention
With the explosive growth of videos, there is increasing interest in automatic video understanding. Video Story Question Answering (VSQA) has proven to be an effective way to benchmark the comprehension ability of a model. Recent VSQA approaches merely extract visual features from the whole scene or from detected objects in each frame. However, it is hard to claim that a method truly understands a video without considering the characters in it. Additionally, the relations and actions acquired by scene parsing are indispensable for comprehending video stories. In this work, we incorporate character-centric scene parsing to assist the VSQA task. Our reasoning framework consists of two parts: the first uses question-aware temporal attention to locate the relevant frame intervals; the second employs a cross-attention transformer for multi-stream fusion. We train and test our VSQA model on the recently released TVQA dataset, the largest VSQA dataset to date. Experiments show that all modules in our framework work collaboratively and significantly improve overall performance. Ablation studies demonstrate that our scene-parsing-based framework is efficacious for a deeper understanding of video semantics.
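The question-aware temporal attention mentioned in the abstract can be illustrated with a minimal sketch: score each frame's features against a question embedding, normalize the scores over time with a softmax, and pool the frames by those weights. The function name, shapes, and scaled-dot-product scoring here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def question_aware_temporal_attention(frame_feats, question_feat):
    """Pool frame features by their relevance to the question.

    frame_feats: (T, d) array of per-frame visual features (assumed shape).
    question_feat: (d,) question embedding (assumed shape).
    Returns a (d,) question-conditioned video summary and the (T,) weights.
    """
    d = frame_feats.shape[1]
    scores = frame_feats @ question_feat / np.sqrt(d)  # scaled dot-product relevance
    weights = np.exp(scores - scores.max())            # numerically stable softmax
    weights = weights / weights.sum()                  # temporal attention distribution
    pooled = weights @ frame_feats                     # weighted sum over frames
    return pooled, weights
```

High-weight frames indicate the temporal interval most relevant to the question, which is the localization role the abstract assigns to this module.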