SAAML: A framework for semi-supervised affective adaptation via metric learning
Socially intelligent systems such as home robots should be able to perceive emotions and social behaviors. Affect recognition datasets contain limited labeled data, and the large unlabeled datasets suitable for pre-training, e.g., VoxCeleb2, mostly contain neutral expressions, which limits their usefulness for affective downstream tasks. We introduce a novel Semi-supervised Affective Adaptation framework via Metric Learning (SAAML) that adapts pre-trained audiovisual models (e.g., AV-HuBERT) to expressive behaviors associated with emotions and social communication. The proposed framework automatically retrieves a large number of emotional excerpts (more than 100 hours) from the VoxCeleb2 dataset via metric learning on two emotion recognition datasets (MSP-IMPROV and CREMA-D), and learns domain-invariant, emotion-aware representations. Experimental results show that fine-tuning the proposed affect-aware AV-HuBERT (AW-HuBERT) improves emotion recognition accuracy by 3-6% compared to fine-tuning the original pre-trained models. We further validate the effectiveness of AW-HuBERT on human-centered visual understanding tasks, namely facial expression recognition, video highlight detection, and continuous emotion recognition. The proposed approach consistently outperforms AV-HuBERT and delivers performance competitive with existing methods. With this work, we demonstrate that adaptive pre-training of existing models on domain-specific data enhances their performance on human-centered tasks.
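To make the retrieval step concrete, the following is a minimal sketch of how emotional excerpts could be selected from an unlabeled pool once a metric-learned embedding space is available. It is an illustration under stated assumptions, not the SAAML implementation: the nearest-centroid rule, the cosine similarity threshold `min_sim`, the function names, and the two-dimensional toy embeddings are all hypothetical, and the encoder that would produce real embeddings is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def class_centroids(embeddings, labels):
    """Average the labeled embeddings per class and normalize the centroids."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    cents = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, l2_normalize(cents)


def retrieve_emotional(unlabeled, classes, centroids,
                       neutral="neutral", min_sim=0.5):
    """Keep unlabeled clips whose nearest centroid is a non-neutral class
    and whose cosine similarity to it is at least min_sim (assumed rule)."""
    sims = l2_normalize(unlabeled) @ centroids.T
    nearest = sims.argmax(axis=1)
    best = sims.max(axis=1)
    keep = [i for i in range(len(unlabeled))
            if classes[nearest[i]] != neutral and best[i] >= min_sim]
    pseudo = [classes[nearest[i]] for i in keep]
    return keep, pseudo


# Toy stand-ins for embeddings of labeled clips (hypothetical values).
labeled = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["neutral", "neutral", "happy", "happy"]

# Toy stand-ins for embeddings of unlabeled clips (hypothetical values).
unlabeled = np.array([[1.0, 0.05], [0.05, 1.0], [0.2, 0.8], [-1.0, -1.0]])

classes, centroids = class_centroids(labeled, labels)
keep, pseudo = retrieve_emotional(unlabeled, classes, centroids)
print(keep, pseudo)
```

In this toy run, the first unlabeled clip is discarded because it lies closest to the neutral centroid, and the last is discarded because it is far from every centroid. The retained clips would then form the pool used for adaptive pre-training, with the pseudo-labels optional.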