Domain-Specific Utterance End-Point Detection for Speech Recognition
2017
Automatically detecting the end of a device-directed user request is particularly challenging when users switch between short commands and long free-form utterances. While low-latency end-pointing configurations typically lead to a good user experience for short requests, such as “play music”, they can be too aggressive in domains with longer free-form queries, where users tend to pause noticeably between words and are therefore easily cut off prematurely. We previously proposed an approach for accurate end-pointing that continuously estimates pause duration features over all active recognition hypotheses. In this paper, we study the behavior of these pause duration features and infer domain-dependent parametrizations. We furthermore propose to adapt the end-pointer aggressiveness on the fly by comparing the Viterbi scores of active short-command and long free-form decoding hypotheses. The experimental evaluation shows an 18% relative reduction in word error rate on free-form requests while maintaining low latency on short queries.
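
The score-comparison idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `Hypothesis` fields, the domain labels, the frame thresholds, and the use of the best-scoring hypothesis's trailing pause (rather than a feature estimated over all active hypotheses) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One active decoding hypothesis (hypothetical structure)."""
    domain: str           # "command" or "free_form" (assumed labels)
    viterbi_score: float  # log-domain score, higher is better
    pause_frames: int     # trailing non-speech frames on this path


def pause_threshold(hyps, short_frames=30, long_frames=90, margin=0.0):
    """Pick a pause threshold from the best per-domain Viterbi scores.

    short_frames, long_frames and margin are illustrative values,
    not parameters reported in the paper.
    """
    neg_inf = float("-inf")
    best_cmd = max((h.viterbi_score for h in hyps if h.domain == "command"),
                   default=neg_inf)
    best_free = max((h.viterbi_score for h in hyps if h.domain == "free_form"),
                    default=neg_inf)
    # Be patient only when the free-form reading is clearly ahead.
    return long_frames if best_free > best_cmd + margin else short_frames


def should_endpoint(hyps, **kwargs):
    """Declare end of utterance once the best path has paused long enough.

    Simplification: only the best hypothesis's pause is inspected.
    """
    if not hyps:
        return False
    best = max(hyps, key=lambda h: h.viterbi_score)
    return best.pause_frames >= pause_threshold(hyps, **kwargs)
```

With these assumed defaults, a 40-frame pause ends the request when a command hypothesis is ahead, but the end-pointer keeps listening when a free-form hypothesis has the higher score.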