Speaker identification for household scenarios with self-attention and adversarial training
Speaker identification based on voice input is a fundamental capability in speech processing that enables versatile downstream applications, such as personalization and authentication. Most state-of-the-art methods rely on deep learning and derive acoustic embeddings from utterances with convolutional neural networks (CNNs) and recurrent neural networks (RNNs). This paper addresses two inherent limitations of these approaches. First, voice characteristics that span long time ranges might not be fully captured by CNNs and RNNs, which are designed for local feature extraction and for modeling adjacent dependencies, respectively. Second, complex deep learning models can be fragile with regard to subtle but intentional changes in model inputs, known as adversarial perturbations. To distill informative global acoustic embeddings from utterances while remaining robust to adversarial perturbations, we propose a Self-Attentive Adversarial Speaker-Identification method (SAASI). In experiments on the VCTK dataset, SAASI significantly outperforms four state-of-the-art baselines in identifying both known and new speakers.
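The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the SAASI architecture, which the abstract does not specify. It assumes single-head scaled dot-product self-attention over frame-level features, mean pooling into an utterance embedding, a linear softmax speaker classifier, and an L2-normalized gradient perturbation applied to the embedding. All names, shapes, and the value of `epsilon` are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frames, wq, wk, wv):
    # frames: (T, d). Every frame attends to every other frame,
    # so dependencies across the whole utterance are modeled directly.
    q, k, v = frames @ wq, frames @ wk, frames @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)  # (T, T), each row sums to 1
    return attn @ v, attn

def utterance_embedding(frames, wq, wk, wv):
    # Assumption: mean pooling over time gives the utterance embedding.
    context, _ = self_attention(frames, wq, wk, wv)
    return context.mean(axis=0)

def cross_entropy(embedding, w_cls, label):
    # Loss of a hypothetical linear softmax speaker classifier.
    probs = softmax(w_cls @ embedding)
    return -np.log(probs[label])

def adversarial_embedding(embedding, w_cls, label, epsilon=0.1):
    # Gradient of the cross-entropy w.r.t. the embedding is
    # w_cls^T (p - onehot); step along it with L2 norm epsilon.
    probs = softmax(w_cls @ embedding)
    probs[label] -= 1.0
    grad = w_cls.T @ probs
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return embedding.copy()
    return embedding + epsilon * grad / norm

# Toy data: 50 frames of 16-dimensional features, 4 speakers.
rng = np.random.default_rng(0)
T, d, n_speakers = 50, 16, 4
frames = rng.normal(size=(T, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w_cls = rng.normal(size=(n_speakers, d))

emb = utterance_embedding(frames, wq, wk, wv)
adv = adversarial_embedding(emb, w_cls, label=2)
clean_loss = cross_entropy(emb, w_cls, 2)
adv_loss = cross_entropy(adv, w_cls, 2)
print(clean_loss, adv_loss)
```

In an adversarial training setup, the loss on the perturbed embedding would typically be added to the clean loss so that the model learns to classify both correctly. Whether SAASI perturbs the input features or an intermediate representation is not stated in the abstract.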