Speech synthesizer learns expressive style from one-second voice sample
Users find speech with transferred expression 9% more natural than standard synthesized speech.
Text-to-speech (TTS) systems, like Alexa’s or the ones available to Amazon Web Services customers through Amazon Polly, convert text into synthetic speech. In recent years, most TTS systems have moved from concatenative approaches — which strung together ultrashort snippets of pre-recorded sounds — to neural networks, which synthesize speech sounds from scratch.
The great advantage of neural TTS is that it enables much more efficient adaptation to new voices or speaking patterns. In a paper we presented last week at the International Conference on Acoustics, Speech, and Signal Processing, we showed just how efficient that adaptation can be. Our paper describes a system that can vary its expressive style — the degree of excitement in its synthetic voice — on the strength of just one example, lasting about a second.
In experiments, we compared our system to a state-of-the-art, neutral-expression TTS system using both empirical analyses and human-perception studies. According to Kullback–Leibler divergence, which measures how much one probability distribution differs from another, our system was 22% better than baseline at discovering independent latent factors underlying the speech generation process.
In the paper, we also report the results of a user study, which relied on the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) methodology. Subjects found speech generated by our system 9% more natural than baseline. These results demonstrate that it should be possible to greatly expand the expressive range of voice agents like Alexa with minimal development overhead.
| Neutral TTS | VAE | VAE + flow | Recordings |
| --- | --- | --- | --- |
| High excitation | High excitation | High excitation | High excitation |
| Medium excitation | Medium excitation | Medium excitation | Medium excitation |
| Low excitation | Low excitation | Low excitation | Low excitation |
Table: Audio samples of the output of our system (VAE + flow) compared to that of a neutral TTS system, a "plain vanilla" VAE system, and live recordings.
Our system is a modification of a state-of-the-art TTS system that uses a type of neural network known as a variational autoencoder (VAE). A VAE has two components, an encoder and a decoder. The encoder learns to produce a probability distribution that represents characteristics of a given input. Samples drawn from that distribution pass to the decoder, which uses them to produce outputs.
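As a concrete sketch, the encoder's output can be treated as the mean and log-variance of a diagonal Gaussian, sampled with the standard reparameterization trick. The dimensions and linear maps below are illustrative stand-ins, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
input_dim, latent_dim = 16, 4

# Toy "encoder": random linear maps producing a mean and log-variance
# for a diagonal Gaussian over the latent space.
W_mu = rng.normal(size=(latent_dim, input_dim)) * 0.1
W_logvar = rng.normal(size=(latent_dim, input_dim)) * 0.1

def encode(x):
    """Map an input to the parameters of a diagonal Gaussian."""
    return W_mu @ x, W_logvar @ x

def sample(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x = rng.normal(size=input_dim)   # stand-in for a speech sample's features
mu, logvar = encode(x)
z = sample(mu, logvar)           # latent code passed on to the decoder
print(z.shape)                   # (4,)
```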
In a typical TTS application, the input to the VAE is a speech sample. The system also has a second encoder, which takes a text string as input. At run time, the encoded representation of the text string is concatenated with the sample from the VAE encoder, and the combined representation passes to the decoder. The output of the decoder is synthesized speech.
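The run-time combination step amounts to a simple concatenation of the two representations. The sizes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sizes for the two representations.
text_enc = rng.normal(size=32)   # output of the text encoder
style_z = rng.normal(size=4)     # sample drawn from the VAE encoder's distribution

# The combined representation that passes to the decoder.
decoder_input = np.concatenate([text_enc, style_z])
print(decoder_input.shape)       # (36,)
```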
In our work, we add another component to the VAE encoder. To reduce computational complexity, the distribution learned by the encoder is typically a diagonal Gaussian. A diagonal Gaussian represents the probable values of each variable in the distribution, but it doesn’t represent the relationships between pairs of variables, known as covariances. As such, it is an approximation of the true distribution of speech sample characteristics.
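To see what the diagonal assumption leaves out, compare samples from a diagonal Gaussian with samples from a full-covariance Gaussian. The numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Diagonal Gaussian: each dimension varies independently, so all
# off-diagonal covariances are (in expectation) zero.
sigma = np.array([1.0, 2.0])
diag_samples = rng.normal(size=(100_000, 2)) * sigma

# A full-covariance Gaussian can also encode correlation between
# dimensions, here a covariance of 0.8 between the two.
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])
L = np.linalg.cholesky(cov)
full_samples = rng.normal(size=(100_000, 2)) @ L.T

print(np.cov(diag_samples.T))  # off-diagonal entries near 0
print(np.cov(full_samples.T))  # off-diagonal entries near 0.8
```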
To flesh out the diagonal Gaussian into a full-covariance Gaussian, we use a technique called Householder flows. A Householder flow is a series of operations that fills in the off-diagonal entries of the covariance matrix.
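A minimal sketch of the underlying mechanism, assuming the standard formulation of Householder flows: each operation is a Householder reflection, and composing several reflections yields an orthogonal matrix that maps a diagonal-covariance sample to a full-covariance one. The vectors here are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def householder(v):
    """Householder reflection H = I - 2 v v^T / ||v||^2 (orthogonal)."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

dim, num_flows = 4, 3
# In a Householder flow these vectors are learned; here they are random.
vs = [rng.normal(size=dim) for _ in range(num_flows)]

# Composing k reflections gives an orthogonal matrix U, so a sample z
# from a diagonal Gaussian N(mu, D) maps to U z, which has the full
# covariance U D U^T -- at the cost of k rank-one updates rather than
# a full d-by-d covariance parameterization.
U = np.eye(dim)
for v in vs:
    U = householder(v) @ U

print(np.allclose(U @ U.T, np.eye(dim)))  # True: U is orthogonal
```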
In the original implementation of the Householder flow, the network learns to tailor the first operation in the sequence to specific characteristics of a given speech sample. The subsequent operations are mathematical transformations of the initial operation, also learned during training. In our experiments, we compared this implementation of the Householder flow to two others, for a total of three candidate architectures.
In our second implementation of the Householder flow, we let all of the operations in the sequence depend directly on the input to the VAE.
In the third implementation, which we introduce in this paper, the operations are all independent of the input. That is, the network learns how to transform speech representations in general rather than learning to transform each speech sample in a different way. In experiments, this implementation proved the most successful.
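The three variants can be caricatured as follows. The parameterizations here are hypothetical stand-ins, intended only to show where the input dependence lies in each case:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, num_flows = 4, 3

# Stand-ins for learned components; the paper's exact parameterizations
# are not reproduced here.
h = rng.normal(size=8)                     # encoder state for one input
W_first = rng.normal(size=(dim, 8)) * 0.1  # maps input to a flow vector
W_step = rng.normal(size=(dim, dim)) * 0.5 # maps one flow vector to the next
free_vs = [rng.normal(size=dim) for _ in range(num_flows)]

def variant_1():
    """Original: first vector depends on the input; later vectors are
    learned transformations of the previous one."""
    v = W_first @ h
    vs = [v]
    for _ in range(num_flows - 1):
        v = W_step @ v
        vs.append(v)
    return vs

def variant_2():
    """Second: every vector depends directly on the input (in practice
    each step would use its own learned map, not a shared one)."""
    return [W_first @ h for _ in range(num_flows)]

def variant_3():
    """Third (the paper's best-performing one): the vectors are free
    parameters, independent of the input."""
    return free_vs

print(len(variant_1()), len(variant_2()), len(variant_3()))  # 3 3 3
```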
Past research suggests that low Kullback–Leibler divergence (KLD) indicates better “disentanglement” between the data features extracted by the VAE encoder. That is, the features may correspond better to distinct properties of the data. Our hypothesis: better disentanglement would improve the network’s ability to do one-shot learning, or learning from single examples.
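For a diagonal Gaussian posterior and a standard-normal prior, the KLD has a simple closed form, sketched below:

```python
import numpy as np

def kld_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), in nats (closed form)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# The KLD is zero when the posterior matches the standard-normal prior...
print(kld_diag_gaussian(np.zeros(4), np.zeros(4)))  # 0.0
# ...and grows as the posterior moves away from it.
print(kld_diag_gaussian(np.ones(4), np.zeros(4)))   # 2.0
```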
Consequently, we used our implementation of the Householder flow — the one with the lowest KLD — in the human perceptual tests. Those tests compared our VAE system (with Householder flow) to the baseline VAE system (no Householder flow) and to a standard, neutral-expression TTS system. According to listeners’ scores, our system produced more-natural-sounding speech than either baseline.
In future work, we will expand this approach to other expressive characteristics of speech, to see if we can maintain or even improve upon the combination of naturalness and expressiveness.