Interspeech: Where speech recognition and synthesis converge
Senior principal scientist Jasha Droppo on the shared architectures of large language models and spectrum quantization text-to-speech models — and other convergences between the two fields.
As the start of this year’s Interspeech draws near, “generative AI” has become a watchword in both the machine learning community and the popular press, where it generally refers to models that synthesize text or images.
Text-to-speech (TTS) models, which are a major area of research at Interspeech, have, in some sense, always been “generative.” But as Jasha Droppo, a senior principal scientist in the Alexa AI organization, explains, TTS, too, has been reshaped by the new generative-AI paradigm.
The first neural TTS models were trained in a “point estimate” fashion, says Droppo, whose own Interspeech paper is on speech synthesis.
“Let's say you're estimating spectrograms — and a spectrogram is basically an image where every pixel, every little element of the image, is how much energy is in the signal at that time and that frequency,” Droppo explains. “We would estimate one time slice of the spectrogram, say, and have energy content over frequency for that particular time slice. And the best we could do at the time was to look at the distance between that and the speech sounds that we wanted the model to create.
“But in text-to-speech data, there are many valid ways of expressing the text. You could change the pacing; you could change the stress; you could insert pauses in different places. So this concept that there is one single point estimate that's the correct answer was just flawed.”
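To make the flaw concrete, here is a minimal NumPy sketch of point estimate training as Droppo describes it: compute a spectrogram, then score a prediction by its distance to a single reference rendition. The signals, frame sizes, and the inserted pause are all illustrative, not taken from any real TTS system.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: each cell is the energy in the signal
    at one time slice and one frequency."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    window = np.hanning(frame_len)
    return np.abs(np.fft.rfft(np.stack(frames) * window, axis=1))

def point_estimate_loss(predicted, target):
    """The 'one correct answer' objective: distance between the predicted
    spectrogram and a single reference rendition of the text."""
    return np.mean((predicted - target) ** 2)

# Two equally valid readings of the same text: the second simply
# inserts a pause at the start (a toy stand-in for changed pacing).
t = np.linspace(0, 1, 4000)
reading_a = np.sin(2 * np.pi * 220 * t)
reading_b = np.concatenate([np.zeros(400), reading_a[:-400]])

# A point estimate loss penalizes reading_b even though a listener
# would accept it -- the flaw Droppo describes.
loss = point_estimate_loss(spectrogram(reading_a), spectrogram(reading_b))
```

The nonzero loss between two acceptable renditions is exactly why treating one recording as the single correct answer breaks down.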
Generative AI offers an alternative to point estimate training. Large language models (LLMs), for instance, compute probability distributions over sequences of words; at generation time, they simply sample from those distributions.
“The advances in generative modeling for text-to-speech have this characteristic that they don't have one single correct answer,” Droppo says. “You're estimating the probability of being correct over all possible answers.”
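The contrast between the two decoding styles can be sketched in a few lines. The vocabulary and probabilities below are hypothetical; in a real LLM the distribution would come from a softmax over the model's logits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token distribution over a tiny vocabulary.
vocab = ["the", "cat", "sat", "down", "<pause>"]
probs = np.array([0.4, 0.25, 0.2, 0.1, 0.05])

# Point estimate decoding: commit to the single most likely answer.
point_estimate = vocab[int(np.argmax(probs))]

# Generative decoding: sample from the distribution, so every
# valid continuation has a chance of being produced.
samples = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(5)]
```

Sampling is what lets a generative model "estimate the probability of being correct over all possible answers" rather than betting everything on one.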
The first of these generative approaches to TTS, Droppo says, was normalizing flows, which pass data through a sequence of invertible transformations (the flow) that map it to a simple prior distribution, typically a Gaussian (the normalization). Next came diffusion modeling, which incrementally adds noise to data samples and trains a model to denoise the results, until, ultimately, it can generate data from random inputs.
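The forward half of the diffusion recipe, incrementally adding noise, can be shown in a short sketch. The noise schedule and the sine-wave "data sample" here are assumptions for illustration; real diffusion TTS models operate on spectrograms or waveforms with learned denoisers.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x0, betas, rng):
    """Forward process of a diffusion model: at each step, shrink the
    signal slightly and mix in Gaussian noise. A denoising model is
    then trained to reverse these steps."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

x0 = np.sin(np.linspace(0, 2 * np.pi, 100))  # a clean "data sample"
betas = np.linspace(1e-4, 0.2, 50)           # illustrative noise schedule
xT = forward_diffusion(x0, betas, rng)       # nearly pure noise
```

Run in reverse, a trained denoiser walks from random noise back to a plausible data sample, which is how diffusion models generate speech from random inputs.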
Most recently, Droppo says, a new approach known as spectrum quantization has generated excitement among TTS researchers.
“If we were to have an acoustic tokenizer — that is, something that takes, say, a 100-millisecond segment of the spectrogram and turns it into an integer — if we have the right component like that, we've taken this continuous problem, this image-processing problem of modeling the spectrogram, and turned it into a unit prediction problem,” Droppo says. “The model doesn't care where these integers came from. It just knows there's a sequence, and there's some structure at a high level.”
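A toy version of such a tokenizer is nearest-neighbor lookup against a codebook: each spectrogram segment becomes the integer index of its closest codebook entry. The random frames, 64-bin dimensionality, and 256-entry codebook below are assumptions; a real acoustic tokenizer learns its codebook from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize(frames, codebook):
    """Map each spectrogram segment to the integer index of its nearest
    codebook entry -- a toy stand-in for a learned acoustic tokenizer."""
    # Pairwise distances: (num_frames, num_codes)
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

# 20 hypothetical 100-ms spectrogram segments, 64 frequency bins each.
frames = rng.random((20, 64))
# 256 hypothetical "acoustic units" the tokenizer can emit.
codebook = rng.random((256, 64))

tokens = tokenize(frames, codebook)
# The continuous spectrogram is now just a sequence of integers.
```

After this step, the "image-processing problem" of modeling a spectrogram really has become a unit prediction problem over integer sequences.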
In this respect, Droppo explains, a spectrum quantization model is very much like a causal LLM, which is trained on the task of predicting the next word in a sequence of words.
“That's all a causal LLM sees as well,” Droppo says. “It doesn't see the text; it sees text tokens. Spectrum quantization allows the model to look at speech in the exact same way the model looks at text. And now we can take all of the code and modeling and insights that we've used to scale large language models and bring that to bear on speech modeling. This is what I find exciting these days.”
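Droppo's point, that the model only ever sees integers, can be made concrete with a deliberately tiny causal model: a bigram counter that predicts the next token from the current one. The token IDs below are arbitrary; the same code runs unchanged whether they came from a text tokenizer or an acoustic one.

```python
from collections import Counter, defaultdict

def train_bigram(token_seqs):
    """Tiny causal next-token model: count which token follows which.
    It never sees text or audio -- only integer tokens."""
    counts = defaultdict(Counter)
    for seq in token_seqs:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequently observed successor of `token`."""
    return counts[token].most_common(1)[0][0]

# Arbitrary token IDs: these could equally be text pieces or
# 100-ms acoustic units from a spectrum quantizer.
sequences = [[5, 9, 2, 5, 9, 7], [5, 9, 2]]
counts = train_bigram(sequences)
```

Scaled-up versions of this next-token objective are exactly the machinery that spectrum quantization lets speech modeling borrow from LLMs.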
Droppo’s work, however, is not confined to TTS; the bulk of the papers he’s coauthored at Amazon are on automatic speech recognition (ASR) and related techniques for processing acoustic input signals. The breadth of his work gives him a more holistic view of speech as a research topic.
“In my experience as a human, I can't separate the process of generating speech and understanding speech,” Droppo says. “It seems very unified to me. And I think that if I were to build the perfect machine, it would also not really differentiate between trying to understand what I'm talking about and trying to understand what the other party in the conversation is talking about.”
More specifically, Droppo says, “the problems with doing speech recognition end to end and doing TTS end to end share similar aspects, such as being able to handle words that aren't well represented in the data. An ASR system will struggle to transcribe a word it has never heard before, and a TTS system will struggle to correctly pronounce a word it has never encountered before. And so the problem spaces between these two systems, even though they're inverses of each other, tend to overlap, and the solutions that you come up with to solve one can also be applied to the other.”
As a case in point, Alexa AI researchers have used audio data generated by TTS models to train ASR models. But, Droppo says, this is just the tip of the iceberg. “At Amazon,” he says, “it's been my mission to bring text to speech and speech to text closer together.”