BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data

Mateusz Lajszczak; Guillermo Cambara Ruiz; Yang Li; Fatih Beyhan; Arent van Korlaar; Fan Yang; Arnaud Joly; Álvaro Martín Cortinas; Ammar Abbas; Adam Michalski; Alexis Moinet; Sri Karlapati; Ewa Muszynska; Haohan Guo; Bartosz Putrycz; Soledad López Gambino; Kayeon Yoo; Elena Sokolova; Thomas Drugman

2024

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, and it achieves a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating it against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS.
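
The abstract mentions compressing speechcodes with byte-pair encoding but does not spell out the procedure. As a rough illustration only, the sketch below applies textbook byte-pair encoding to sequences of integer codes: it repeatedly finds the most frequent adjacent pair and replaces it with a new code ID, which shortens the sequences. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`), the toy inputs, and the choice of ID numbering are assumptions made for this example and are not taken from BASE TTS.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most frequent adjacent pair of codes, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Learn up to `num_merges` merges over integer code sequences.

    `first_new_id` is assumed to be larger than every code in the input,
    so merged IDs never collide with the original codebook (hypothetical
    convention chosen for this sketch).
    """
    seqs = [list(s) for s in seqs]
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs
```

Calling `learn_bpe([[1, 2, 1, 2, 3], [1, 2, 3, 3]], num_merges=2, first_new_id=100)` returns the learned merge table together with the shortened sequences. Shorter sequences mean fewer autoregressive steps per utterance, which is the usual motivation for this kind of compression.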
