Effect of Data Reduction on Seq-to-seq Acoustic Models for Speech Synthesis
2019
Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech that is almost indistinguishable from human recordings. To work properly, these models require large amounts of training data. However, they are more efficient at dealing with less homogeneous data, which might make it possible to compensate for the lack of data from one speaker with data from other speakers. This paper evaluates this hypothesis by training several Tacotron-like models with different blends of data. The mel-spectrograms generated by these models were converted to audio with a WaveRNN-like neural vocoder trained on 74 speakers from 17 different languages. Our experiments show that the naturalness of models trained on a blend of 5k utterances from 7 speakers is better than that of speaker-dependent (SD) models trained on 15k utterances, and very close to that of SD models trained on 25k utterances. We also demonstrate that models mixing only 1,250 utterances from a target speaker with 5k utterances from 6 other speakers can produce significantly better quality than state-of-the-art DNN-guided unit-selection systems that use more than 10 times as many utterances from that target speaker.