Investigating self-supervised features for expressive, multilingual voice conversion

Álvaro Martín Cortinas; Daniel Sáez-Trigueros; Jaime Lorenzo Trueba; Grzegorz Beringer; Ivan Valles; Roberto Barra-Chicote; Biel Tura Vecino; Adam Gabrys; Piotr Bilinski; Tom Merritt

2024

Voice conversion (VC) systems are used in applications ranging from speaker anonymisation to personalised speech synthesis. Supervised approaches learn a mapping between different speakers using parallel data, which is expensive to produce. Unsupervised approaches are typically trained to reconstruct the input signal, which is composed of the content and the speaker information. Disentangling these components is a challenge and often leads to speaker leakage or the removal of prosodic information. In this paper, we explore voice conversion by leveraging the potential of self-supervised learning (SSL). A combination of the latent representations of SSL models, concatenated with speaker embeddings, is fed to a vocoder which is trained to reconstruct the input. Zero-shot voice conversion results show that this approach preserves the prosody and content of the source speaker while matching the speaker similarity of a VC system based on phonetic posteriorgrams (PPGs).
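
The conditioning step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the SSL representations are combined, so the normalised weighted sum over layers used here is an assumption, as are the function names, the toy dimensions, and the idea of repeating one utterance-level speaker embedding on every frame. The vocoder itself is not modelled.

```python
from typing import List, Sequence

Frames = List[List[float]]  # one feature vector per time frame


def weighted_layer_sum(layers: Sequence[Frames], weights: Sequence[float]) -> Frames:
    """Combine per-layer SSL features with a normalised weighted sum.

    Assumption: the abstract only says "a combination"; a weighted sum is
    one common choice, not necessarily the one used in the paper.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    num_frames = len(layers[0])
    feat_dim = len(layers[0][0])
    return [
        [
            sum(norm[l] * layers[l][t][d] for l in range(len(layers)))
            for d in range(feat_dim)
        ]
        for t in range(num_frames)
    ]


def build_vocoder_input(
    layers: Sequence[Frames],
    weights: Sequence[float],
    speaker_embedding: Sequence[float],
) -> Frames:
    """Concatenate the combined SSL features with a speaker embedding.

    Assumption: a single utterance-level embedding is appended to every frame.
    """
    combined = weighted_layer_sum(layers, weights)
    return [frame + list(speaker_embedding) for frame in combined]


# Toy example: 2 layers, 2 frames, 2-dimensional features, 1-dimensional speaker embedding.
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
vocoder_input = build_vocoder_input(toy_layers, [1.0, 1.0], [9.0])
```

In a setup like this, training would pair each utterance with its own speaker's embedding, and conversion would presumably swap in the target speaker's embedding while keeping the source utterance's SSL features.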
