- ICASSP 2022: We present a general model for acoustic wave decomposition (AWD) on a rigid surface for a general microphone array configuration. The decomposition is modeled as a sparse recovery optimization problem that is independent of the shape of the rigid surface or the microphone array geometry. We describe an efficient algorithm for solving the optimization problem for broadband signals, and establish its effectiveness… (a generic sparse-recovery sketch follows this list)
- ICASSP 2022: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that…
- ICASSP 2022: Automatic dubbing (AD) addresses the problem of replacing speech in a video with speech in another language while preserving the viewer experience. The most important requirement of AD is isochrony, i.e. dubbed speech has to closely match the timing of speech and pauses of the original audio. In our automatic dubbing system, isochrony is modeled by controlling the verbosity of machine translation; inserting…
- ICASSP 2022: This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows us to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that…
- ICASSP 2022: Confidence estimation for Speech Emotion Recognition (SER) is instrumental in improving the reliability of downstream applications. In this work we propose (1) a novel confidence metric for SER based on the relationship between emotion primitives (arousal, valence, and dominance, AVD) and emotion categories (ECs), and (2) EmoConfidNet, a DNN trained alongside the EC recognizer to predict the…
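The first entry above frames acoustic wave decomposition as a sparse recovery optimization problem. As a point of reference only, here is a minimal, generic sketch of that kind of formulation, solving min_x 0.5*||A x - y||_2^2 + lam*||x||_1 with the standard ISTA iteration. The dictionary A, the regularization weight lam, and the toy dimensions are hypothetical placeholders; this is not the paper's AWD model or its efficient broadband solver.

```python
# Generic sparse-recovery sketch via ISTA (iterative shrinkage-thresholding).
# Illustrative only: A, y, lam, and all sizes are made-up placeholders, not the
# AWD formulation from the paper above.
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.1, n_iter=500):
    """Solve min_x 0.5*||A x - y||_2^2 + lam*||x||_1 with ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the smooth data-fit term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy example: recover a sparse coefficient vector from noisy linear measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))         # hypothetical overcomplete dictionary
x_true = np.zeros(256)
x_true[rng.choice(256, size=5, replace=False)] = rng.standard_normal(5)
y = A @ x_true + 0.01 * rng.standard_normal(64)
x_hat = ista(A, y, lam=0.05)
```

In an AWD setting one would expect the columns of A to correspond to candidate wave components and y to the microphone measurements, but that mapping, and the handling of broadband signals, is specific to the paper and not reproduced here.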
Related content
- September 28, 2020: Hear Tur discuss his experience working on DARPA programs, how he's seen the field of conversational AI evolve, and more.
- September 24, 2020: A combination of audio and visual signals guides the device's movement, so the screen is always in view.
- September 24, 2020: Adjusting prosody and speaking style to conversational context is a first step toward "concept-to-speech".
- September 24, 2020: Natural turn-taking uses multiple cues (acoustic, linguistic, and visual) to help Alexa interact more naturally, without the need to repeat the wake word.
- September 24, 2020: Deep learning and reasoning enable customers to explicitly teach Alexa how to interpret their novel requests.
- September 18, 2020: Learn how Alexa Conversations helps developers author complex dialogue management rules.