- Interspeech 2022: The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced, where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and between attention…
- Interspeech 2022: This paper investigates an incremental learning framework for a real-world voice assistant employing an RNN-Transducer-based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with the changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model…
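A common way to realize such an incremental update is rehearsal: fine-tune on all new data plus a replayed sample of old data to limit forgetting. The sketch below is illustrative only; the helper name `build_update_set` and the `old_fraction` ratio are assumptions for demonstration, not details from the paper:

```python
import random

def build_update_set(old_data, new_data, old_fraction=0.3, seed=0):
    # Rehearsal-style mix: keep all new data, replay a random sample of
    # old data so fine-tuning does not forget earlier behavior.
    # old_fraction is an illustrative hyperparameter.
    rng = random.Random(seed)
    k = int(len(old_data) * old_fraction)
    replay = rng.sample(list(old_data), k)
    mixed = replay + list(new_data)
    rng.shuffle(mixed)
    return mixed

# Hypothetical utterance IDs standing in for training examples.
old = [f"old_utt_{i}" for i in range(1000)]
new = [f"new_utt_{i}" for i in range(200)]
train_set = build_update_set(old, new)
```

The fixed seed makes the sampled replay set reproducible across model updates, which helps when comparing successive fine-tuned checkpoints.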
- Interspeech 2022: Inference with large deep learning models in resource-constrained settings is increasingly a bottleneck in real-world applications of state-of-the-art AI. Here we address this with low-precision weight quantization. We achieve very low accuracy degradation by reparameterizing the weights in a way that leaves the weight distribution approximately uniform. We show lower bit-width quantization and less accuracy…
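The abstract's point is that a near-uniform weight distribution lets uniform low-bit quantization lose very little information. As a minimal sketch of the quantization step only (the paper's reparameterization is not reproduced here, and `quantize_uniform` with a 4-bit setting is an illustrative choice), affine uniform quantization of a weight tensor looks like this:

```python
import numpy as np

def quantize_uniform(w, bits=4):
    # Affine (uniform) quantization: map weights onto 2**bits evenly
    # spaced levels spanning [w.min(), w.max()], then dequantize.
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((w - lo) / scale)   # integer codes in [0, 2**bits - 1]
    return q * scale + lo            # dequantized approximation of w

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
w_q = quantize_uniform(w, bits=4)
err = np.abs(w - w_q).max()          # bounded by half a quantization step
```

For a uniform grid, the worst-case per-weight error is half a step, `scale / 2`; a Gaussian-shaped tensor like `w` wastes many levels in the tails, which is exactly the inefficiency a uniformizing reparameterization targets.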
- Interspeech 2022: Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities. Firstly, we…
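Normalizing flows are exactly invertible maps between data and a simple latent distribution, which is what makes it possible to sample latent points and decode them into identities never seen in training. A minimal single-layer affine flow (purely illustrative; real TTS/VC flows stack many conditioned coupling layers) demonstrates the invertibility:

```python
import numpy as np

class AffineFlow:
    # Minimal invertible affine transform, the simplest building block
    # of a normalizing flow (hypothetical sketch, not the paper's model).
    def __init__(self, log_scale, shift):
        self.log_scale = np.asarray(log_scale, dtype=float)
        self.shift = np.asarray(shift, dtype=float)

    def forward(self, x):
        # data -> latent: normalize toward the base distribution
        return (x - self.shift) * np.exp(-self.log_scale)

    def inverse(self, z):
        # latent -> data: sampling fresh z here yields new "identities"
        return z * np.exp(self.log_scale) + self.shift

flow = AffineFlow(log_scale=[0.5, -0.2], shift=[1.0, 2.0])
x = np.array([3.0, 4.0])
z = flow.forward(x)
x_rec = flow.inverse(z)   # recovers x exactly, up to float rounding
```

Because `inverse` undoes `forward` exactly, the model can be trained by maximum likelihood in the latent space and still generate by sampling and inverting, which is the property the paper probes for speaker extrapolation.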
-
Computer Assisted Language Learning Journal2022The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation
Related content
- March 10, 2025: Inaugural global university competition focused on advancing secure, trusted AI-assisted software development.
- February 20, 2025: Using large language models to generate training data and updating models through both fine-tuning and reinforcement learning improves the success rate of code generation by 39%.
- February 6, 2025: Novel training procedure and decoding mechanism enable model to outperform much larger foundation model prompted to perform the same task.
- December 11, 2024: LLM-augmented clustering enables QualIT to outperform other topic-modeling methods in both topic coherence and topic diversity.
- December 9, 2024: The Amazon AGI SF Lab will focus on developing new foundational capabilities for enabling useful AI agents.
- December 4, 2024: Amazon Nova Canvas and Amazon Nova Reel use diffusion transformers to deliver studio-quality visual content.