ASD-transformer: Efficient active speaker detection using self and multimodal transformers
Multimodal active speaker detection (ASD) methods assign a speaking/not-speaking label to each individual in a video clip. ASD is critical for applications such as natural human-computer interaction, speaker diarization, and video reframing. Recent work has shown the success of transformers in multimodal settings, so we propose a novel framework that leverages modern transformer and concatenation mechanisms to efficiently capture the interaction between the audio and video modalities for ASD. We achieve mAP comparable to the state of the art (93.0% vs 93.5%) on the AVA-ActiveSpeaker dataset. Further, our model is ∼3× smaller (15.23 MB vs 49.82 MB), with a reduced FLOP count (11.8 vs 14.3) and lower training time (15 h vs 38 h). To verify that our model makes predictions from the right visual cues, we computed saliency maps over the input images. We found that, in addition to the mouth region, the nose, cheeks, and areas under the eyes help identify active speakers. Our ablation study reveals that the mouth region alone achieves a lower mAP (91.9% vs 93.0%) than the full face region, supporting our hypothesis that facial expressions beyond the mouth are useful for ASD.
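To make the fusion idea concrete, below is a minimal numpy sketch of the general pattern the abstract describes: self-attention applied within each modality, followed by concatenation of the attended audio and video features into a joint representation. All shapes, the single-head attention without learned projections, and the toy random inputs are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # single-head scaled dot-product self-attention
    # (learned Q/K/V projections omitted for brevity; this is a sketch)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (T, T) frame-to-frame affinities
    return softmax(scores) @ x             # (T, d) attended features

# toy per-frame embeddings: T frames, d-dim features per modality (assumed sizes)
T, d = 8, 16
rng = np.random.default_rng(0)
audio = rng.standard_normal((T, d))
video = rng.standard_normal((T, d))

# self-attention within each modality, then concatenation for multimodal fusion
a = self_attention(audio)
v = self_attention(video)
fused = np.concatenate([a, v], axis=-1)    # (T, 2d) joint audio-visual features
print(fused.shape)                         # (8, 32)
```

A downstream classifier head would then map each fused frame vector to a speaking/not-speaking score; that head, and the cross-modal transformer details, are left out here.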