Text-to-speech models based on large language models (LLMs) have gotten very good at producing natural-sounding speech, even in voices cloned from short audio files. But several problems persist.
One is accent leakage in polyglot text-to-speech. It should be possible to transfer a voice recorded in English to French, German, or Spanish with the correct accent and without loss of voice identity. But with most systems, the reference speaker's native accent leaks into the target language, or the target language's accent overwrites characteristics of the speaker's voice.
Expressiveness is another challenge, including the laughs, sighs, hesitations, and other indications of emotion that make speech engaging.
And then there’s reliability. Unlike traditional text-to-speech (TTS) systems, LLM-based systems are autoregressive, meaning they generate speech tokens one at a time, without explicitly modeling duration. This can cause hallucinated repetitions, unexpected cutoffs, and inconsistent pronunciation.
At Amazon, we're working to address all these issues.
Mitigating accent leakage in polyglot TTS
We use a locale-specific data augmentation approach to address the problem of accent leakage. Specifically, we use low-rank adaptation (LoRA) to fine-tune our polyglot models on data that is heavily weighted toward target locales. This also allows us to do accent-free polyglot voice cloning: the cloned voice speaks the target language with native-like pronunciation but without loss of speaker identity.
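The core of LoRA is to freeze the pretrained weights and train only a low-rank additive update to selected layers. As a minimal sketch (in PyTorch, with illustrative rank and scaling values, not our production adapter configuration), a LoRA-augmented linear layer looks like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update,
    y = W x + (alpha / r) * B A x, as in LoRA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.scale = alpha / r
        # B starts at zero, so the adapter is a no-op before fine-tuning.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only the small A and B matrices are trained, a separate adapter can be fit per target locale on the locale-weighted data without disturbing the base model.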
Audio samples: American English (original voice); transfer to Spanish; transfer to German.
Improving expressiveness
We use classifier-free guidance (CFG) to generate synthetic reference audio samples with enhanced expressiveness. Using these as conditioning during inference pushes the model toward more expressive prosodic styles.
Originally developed for diffusion modeling, CFG controls how strongly generation follows conditioning. CFG-based reference samples decouple speaker identity from accent, teaching the model to preserve voice characteristics while adopting native pronunciation in the target language.
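At sampling time, CFG combines two forward passes, one with the conditioning and one without, and extrapolates between them. A minimal sketch of the logit combination (the guidance-scale values are illustrative):

```python
import numpy as np

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional logits
    toward the conditional ones. scale = 1 recovers ordinary conditional
    generation; scale > 1 pushes the model harder toward the conditioning
    (here, an expressive reference sample)."""
    return uncond + scale * (cond - uncond)
```

With scale greater than 1, the model's output distribution is pulled further in the direction the conditioning indicates, which is what makes CFG useful for amplifying expressive prosodic styles.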
This allows us to scale a small number of recorded voices to many new locales and languages, while increasing expressiveness. Scored according to MUSHRA (multiple stimuli with hidden reference and anchor) listening tests, the quality of our models’ polyglot outputs across nine locales spanning English, French, Italian, German, and Spanish improved 5% to 20% over those of our previous model family.
| Locale | Improvement over baseline |
| --- | --- |
| US-English | +12.43% |
| Southern US-English | +20.05% |
| Great Britain-English | +5.97% |
| Australia-English | +5.50% |
| US-Spanish | +11.78% |
| Spain-Spanish | +13.23% |
| France-French | +8.44% |
| Germany-German | +14.12% |
| Italy-Italian | +9.80% |
Robustness
Traditional TTS had failure modes, but hallucination and random truncation weren't chief among them. LLM-based TTS can generate confident-sounding speech that doesn't match the input, and it will sometimes stop mid-sentence.
Chain-of-thought for autoregressive TTS
Traditional TTS pipelines have explicit stages: grapheme-to-phoneme conversion, duration prediction, and acoustic generation. More recent, non-autoregressive end-to-end models like FastSpeech predict durations explicitly before speech generation.
LLM-based TTS takes a different approach: duration emerges implicitly from autoregressive generation. There's no explicit plan for how long the utterance should be or how long each phoneme should take. This is why these models sometimes hallucinate (keep generating past the intended content) or truncate (stop too early).
To address this problem, we add chain-of-thought reasoning to the model: before generating speech tokens, the model predicts phoneme sequences and estimates duration (total length and per-phoneme timing).
This isn't the same as traditional TTS pipelines. Bolting duration prediction onto an autoregressive architecture is a different problem than building it into a non-autoregressive one, and it has its own challenges.
Phoneme prediction enables the model to handle heteronyms ("read," "lead") and unusual names more reliably. Duration prediction gives the model a timing plan, which reduces both hallucination and truncation. These predictions are also useful for debugging, as you can see what the model "thought" it was going to generate before it started generating.
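One way to picture the chain-of-thought output is as a serialized plan that precedes the speech tokens. The sketch below is hypothetical: the delimiter tokens, the phoneme notation, and the frame-count format are illustrative, not the actual training format.

```python
def cot_prefix(phonemes: list[str], frames_per_phoneme: list[int]) -> str:
    """Hypothetical serialization of the chain-of-thought plan the model
    emits before any speech tokens: the phoneme sequence, a per-phoneme
    frame count, and the total utterance length."""
    plan = " ".join(f"{p}:{d}" for p, d in zip(phonemes, frames_per_phoneme))
    total = sum(frames_per_phoneme)
    return f"<phonemes> {plan} <total_frames> {total} <speech>"
```

Because the plan is generated first, it is available both to condition the speech tokens that follow and to downstream checks after generation.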
Guardrails
Our guardrails use the chain-of-thought predictions as checkpoints. We know the expected phoneme count and approximate speech duration before generation starts. After generation, we do a pair of checks: does the output duration match the prediction, and is the output length reasonable given the phoneme count? Large deviations flag likely hallucinations or truncations.
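The pair of checks can be sketched as a small validation function. The tolerance and the plausible seconds-per-phoneme band below are illustrative assumptions, not the production thresholds:

```python
def check_output(pred_duration_s: float, pred_phoneme_count: int,
                 out_duration_s: float, tol: float = 0.2,
                 min_s_per_phoneme: float = 0.03,
                 max_s_per_phoneme: float = 0.5) -> list[str]:
    """Post-generation sanity checks against the chain-of-thought plan.
    Returns a list of detected issues (empty list = output accepted)."""
    issues = []
    # Check 1: does the realized duration match the predicted duration?
    if abs(out_duration_s - pred_duration_s) > tol * pred_duration_s:
        issues.append("duration_mismatch")
    # Check 2: is the duration plausible given the phoneme count?
    per_phoneme = out_duration_s / max(pred_phoneme_count, 1)
    if not (min_s_per_phoneme <= per_phoneme <= max_s_per_phoneme):
        issues.append("implausible_rate")
    return issues
```

A prediction of 10 seconds for 100 phonemes that comes back at 20 seconds, for instance, would be flagged as a likely hallucination.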
When an agent detects problems, it can prompt the TTS system to regenerate with different sampling parameters or fall back to alternative approaches.
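A regeneration loop of this kind might look like the following sketch, where `tts_generate` stands in for the TTS system plus its guardrail checks, and the temperature schedule is an illustrative assumption:

```python
def generate_with_retries(tts_generate, text: str, max_attempts: int = 3):
    """Hypothetical agentic retry loop: regenerate with progressively
    more conservative sampling when guardrail checks flag the output.
    tts_generate(text, temperature) -> (audio, list_of_issues)."""
    temperature = 0.8
    for _ in range(max_attempts):
        audio, issues = tts_generate(text, temperature=temperature)
        if not issues:  # empty issue list: output accepted
            return audio
        temperature *= 0.7  # sample more conservatively on retry
    return audio  # fall back to the last attempt
```

Lowering the temperature on retry is one simple "different sampling parameters" policy; a fuller agent could also switch decoding strategies or fall back to a non-LLM voice.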
Data filtering
To filter the data used to train the TTS model, we combine speech-recognition-based metrics with metrics based on the LLM's attention mechanism. Automatic speech recognition (ASR) catches actual transcription errors, while the attention-based metrics distinguish genuine text-audio misalignment from expressive delivery that ASR would mistake for error. Taken together, the metrics keep data that's genuinely well aligned while preserving expressiveness that ASR-only filtering would discard.
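One common attention-based alignment signal is how sharply each text token is attended to across the speech frames; diffuse attention suggests poor alignment. The sketch below combines such a score with a word-error-rate cutoff. The scoring rule and both thresholds are illustrative assumptions, not our production filter:

```python
import numpy as np

def alignment_score(attn: np.ndarray) -> float:
    """Sharpness of text-to-speech cross-attention. attn has shape
    (speech_frames, text_tokens); we average, over text tokens, the
    maximum attention weight each token receives from any frame."""
    return float(attn.max(axis=0).mean())

def keep_sample(wer: float, attn: np.ndarray,
                wer_max: float = 0.15, attn_min: float = 0.5) -> bool:
    """Keep a (text, audio) pair only when ASR word error rate is low
    AND the attention alignment is sharp."""
    return wer <= wer_max and alignment_score(attn) >= attn_min
```

An expressive but correctly aligned sample with a slightly elevated WER can still be rescued by loosening `wer_max` when the attention score is high, which is the kind of trade-off an ASR-only filter cannot make.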
On generic long-form text, our full array of techniques reduces critical errors to an average of less than one second per hour, where “critical errors” include hallucinations, cutoffs beyond one word, and mismatches between input text and output speech.
Conclusion
LLM-based TTS models sound noticeably more natural than traditional systems. However, in our experience, they introduce new failure modes that need to be addressed before they can be deployed reliably in production. We have found that LoRA-based fine-tuning addresses the heavy accent leakage observed in polyglot TTS, while classifier-free guidance is a useful tool for improving expressiveness. As for reliability, we find that smart data filtering and chain-of-thought reasoning, coupled with guardrails and agentic regeneration, can significantly reduce hallucination.