-
Interspeech 2016. We present a new model called LATTICERNN, which generalizes recurrent neural networks (RNNs) to process weighted lattices as input, instead of sequences. A LATTICERNN can encode the complete structure of a lattice into a dense representation, which makes it suitable for a variety of problems, including rescoring, classifying, parsing, or translating lattices using deep neural networks (DNNs). In this paper…
-
Interspeech 2016. The goal of this paper is to simulate the benefits of jointly applying active learning (AL) and semi-supervised training (SST) in a new speech recognition application. Our data selection approach relies on confidence filtering, and its impact on both the acoustic and language models (AM and LM) is studied. While AL is known to be beneficial to AM training, we show that it also yields substantial improvements…
-
Interspeech 2015. In an online automatic speech recognition system, the role of the endpoint detector is to infer when a user has finished speaking a query. Accurate and low-latency endpoint detection is crucial for natural voice interaction. Classic voice activity detector (VAD) based approaches monitor the incoming audio and trigger when a sufficiently long pause is detected. Such approaches are typically limited due to…
-
Interspeech 2015. In the past, conventional i-vectors based on a Universal Background Model (UBM) have been successfully used as input features to adapt a Deep Neural Network (DNN) Acoustic Model (AM) for Automatic Speech Recognition (ASR). In contrast, this paper introduces Hidden Markov Model (HMM) based i-vectors that use HMM state alignment information from an ASR system for estimating i-vectors. Further, we propose passing…
-
Interspeech 2015. We investigate the problem of speaker adaptation of DNN acoustic models in two settings: the traditional unsupervised adaptation and a supervised adaptation (SuA) where a few minutes of transcribed speech is available. SuA presents additional difficulties when a test speaker’s adaptation information does not match the registered speaker’s information. Employing feature-space maximum likelihood linear regression…