Tie your embeddings down: Cross-modal latent spaces for end-to-end spoken language understanding

Bhuvan Agrawal; Markus Müller; Samridhi Choudhary; Martin Radfar; Thanasis Mouchtaris; Ross McGowan; Nathan Susanj; Siegfried Kunzmann

2022

End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we consider an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, where a shared latent space is learned between the ‘acoustic’ and ‘text’ embeddings. We propose using different multi-modal losses to explicitly align the acoustic embeddings to the text embeddings (obtained via a semantically powerful pre-trained BERT model) in the latent space. We train the CMLS model on two publicly available E2E datasets and one internal dataset, across different cross-modal losses. Our proposed triplet loss function achieves the best performance. It achieves a relative improvement of 22.1% over an E2E model without a cross-modal space and a relative improvement of 2.8% over a previously published CMLS model using L2 loss on our internal dataset.
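To make the contrast between an L2 alignment loss and a triplet alignment loss concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the Euclidean distance, the margin value, and the choice to treat every other utterance's text embedding in the batch as a negative are all assumptions made for illustration. The embeddings are plain lists of floats standing in for the acoustic encoder outputs and the BERT-derived text embeddings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_alignment_loss(acoustic, text):
    """Mean squared Euclidean distance between paired embeddings.

    acoustic[i] and text[i] are assumed to come from the same utterance.
    """
    pairs = list(zip(acoustic, text))
    return sum(euclidean(a, t) ** 2 for a, t in pairs) / len(pairs)


def triplet_alignment_loss(acoustic, text, margin=1.0):
    """Margin-based triplet loss over a batch (hypothetical formulation).

    Anchor:   acoustic[i]
    Positive: text[i] (same utterance)
    Negative: text[j] for every j != i (assumption: all in-batch negatives)
    """
    total, count = 0.0, 0
    for i, anchor in enumerate(acoustic):
        d_pos = euclidean(anchor, text[i])
        for j, negative in enumerate(text):
            if j == i:
                continue
            d_neg = euclidean(anchor, negative)
            # Penalize only when the positive is not closer than the
            # negative by at least the margin.
            total += max(0.0, d_pos - d_neg + margin)
            count += 1
    return total / count if count else 0.0


# Toy batch of two utterances with 2-dimensional embeddings.
acoustic_batch = [[0.0, 0.0], [1.0, 1.0]]
aligned_text = [[0.0, 0.0], [1.0, 1.0]]
swapped_text = [[1.0, 1.0], [0.0, 0.0]]

aligned_triplet = triplet_alignment_loss(acoustic_batch, aligned_text)
swapped_triplet = triplet_alignment_loss(acoustic_batch, swapped_text)
aligned_l2 = l2_alignment_loss(acoustic_batch, aligned_text)
swapped_l2 = l2_alignment_loss(acoustic_batch, swapped_text)
```

In this sketch the L2 loss only pulls each acoustic embedding toward its own text embedding, whereas the triplet loss additionally pushes it away from the text embeddings of other utterances, which is one plausible reading of why a triplet objective could align the shared space more effectively.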
