Duration modeling of neural TTS for automatic dubbing
Automatic dubbing (AD) addresses the problem of replacing the speech in a video with speech in another language while preserving the viewer experience. A key requirement of AD is isochrony, i.e., the dubbed speech has to closely match the timing of speech and pauses in the original audio. In our automatic dubbing system, isochrony is modeled by controlling the verbosity of machine translation; inserting pauses into the translations, a.k.a. prosodic alignment; and controlling the duration of text-to-speech (TTS) utterances. The latter two steps rely heavily on speech duration information, either to predict or to control TTS duration. So far, duration prediction has been based on a proxy method, while duration control has used linear warping of the TTS speech spectrogram. In this study, we propose novel duration models for neural TTS that can be leveraged both to predict and to control TTS duration. Experimental results show that, compared to previous work, the new models improve or match the performance of prosodic alignment and significantly enhance neural TTS speech quality at both slow and fast speaking rates.
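To make the baseline duration-control method concrete, the sketch below illustrates linear warping of a spectrogram along the time axis, as referenced above. This is a generic NumPy illustration, not the authors' implementation: the function name, the (frames, bins) layout, and the use of per-bin linear interpolation are assumptions for exposition.

```python
import numpy as np

def warp_spectrogram(spec: np.ndarray, target_frames: int) -> np.ndarray:
    """Linearly warp a (frames, bins) spectrogram to target_frames frames
    by interpolating each frequency bin along the time axis.
    Illustrative sketch only; layout and naming are assumptions."""
    src_frames = spec.shape[0]
    # Sample positions on the source time axis, one per target frame.
    positions = np.linspace(0.0, src_frames - 1, target_frames)
    # Interpolate each frequency bin independently at those positions.
    warped = np.stack(
        [np.interp(positions, np.arange(src_frames), spec[:, b])
         for b in range(spec.shape[1])],
        axis=1,
    )
    return warped

# Example: stretch a 100-frame, 80-bin spectrogram to 130 frames,
# i.e., slow the utterance down by a factor of ~1.3.
spec = np.random.rand(100, 80)
print(warp_spectrogram(spec, 130).shape)  # (130, 80)
```

Because the warp is uniform, every frame is stretched or compressed by the same factor, which is what degrades speech quality at extreme speaking rates and motivates the learned duration models proposed in this work.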