Phonetic embedding for ASR robustness in entity resolution
Entity Resolution (ER) in spoken dialog systems can suffer from phonetic variation in search queries caused by Automatic Speech Recognition (ASR) errors. In this paper, we propose a phonetic embedding technique that improves the robustness of the ER system to this variation. The technique comprises a phonetic embedding model, a training-data augmentation and sampling method, and an ASR robustness evaluation methodology. We test the technique on two use cases: voice search for videos and for books in the e-commerce domain. Combined with a semantic embedding neural vector search (NVS) model, phonetic embedding reduces the retrieval error rate by 7.07% relative for video and by 4.23% for books compared to NVS without phonetic embedding, and by 49.9% for video and by 35.3% for books compared to a lexical search baseline.
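
For readers unfamiliar with the "relative" convention used for the reported gains, the following is a minimal sketch of how a relative error-rate reduction is usually computed. The function name and the error rates in the example are hypothetical and chosen only for illustration; they are not figures from this paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative reduction in error rate, in percent.

    Both arguments are error rates on the same scale (for example,
    fractions in [0, 1]). The result is measured against the baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0


# Hypothetical error rates, NOT taken from the paper.
baseline = 0.20        # assumed error rate of a system without phonetic embedding
with_phonetic = 0.18   # assumed error rate after adding phonetic embedding

reduction = relative_error_reduction(baseline, with_phonetic)
print(f"{reduction:.2f}% relative reduction")
```

Under this convention, a drop from an error rate of 0.20 to 0.18 is a 2-point absolute reduction but a reduction of about 10% relative to the baseline.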