Text-to-speech models based on large language models (LLMs) have gotten very good at producing natural-sounding speech, even in voices cloned from short audio files. But several problems persist.
One is accent leakage in polyglot text-to-speech. It should be possible to transfer a voice recorded in English to French, German, or Spanish with the correct accent and without loss of voice identity. But with most systems, the reference speaker's native accent leaks into the target language, or the target language's accent overwrites characteristics of the speaker's voice.
Expressiveness is another challenge, including the laughs, sighs, hesitations, and other indications of emotion that make speech engaging.
And then there’s reliability. Unlike traditional text-to-speech (TTS) systems, LLM-based systems are autoregressive, meaning they generate speech tokens one at a time, without explicitly modeling duration. This can cause hallucinated repetitions, unexpected cutoffs, and inconsistent pronunciation.
At Amazon, we're working to address all these issues.
Mitigating accent leakage in polyglot TTS
We use a locale-specific data augmentation approach to address the problem of accent leakage. Specifically, we use low-rank adaptation (LoRA) to fine-tune our polyglot models on data that is heavily weighted toward target locales. This also allows us to do accent-free polyglot voice cloning: the cloned voice speaks the target language with native-like pronunciation but without loss of speaker identity.
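The core of LoRA is to freeze the pretrained weights and train only a low-rank additive update to selected layers. As a minimal sketch (in PyTorch, with illustrative rank and scaling values, not our production adapter configuration), a LoRA-augmented linear layer looks like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update,
    y = W x + (alpha / r) * B A x, as in LoRA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.scale = alpha / r
        # B starts at zero, so the adapter is a no-op before fine-tuning.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only the small A and B matrices are trained, a separate adapter can be fit per target locale on the locale-weighted data without disturbing the base model.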
Audio samples: American English (original voice); transfer to Spanish; transfer to German.
Improving expressiveness
We use classifier-free guidance (CFG) to generate synthetic reference audio samples with enhanced expressiveness. Using these as conditioning during inference pushes the model toward more expressive prosodic styles.
Originally developed for diffusion modeling, CFG controls how strongly generation follows conditioning. CFG-based reference samples decouple speaker identity from accent, teaching the model to preserve voice characteristics while adopting native pronunciation in the target language.
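At sampling time, CFG combines two forward passes, one with the conditioning and one without, and extrapolates between them. A minimal sketch of the logit combination (the guidance-scale values are illustrative):

```python
import numpy as np

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional logits
    toward the conditional ones. scale = 1 recovers ordinary conditional
    generation; scale > 1 pushes the model harder toward the conditioning
    (here, an expressive reference sample)."""
    return uncond + scale * (cond - uncond)
```

With scale greater than 1, the model's output distribution is pulled further in the direction the conditioning indicates, which is what makes CFG useful for amplifying expressive prosodic styles.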
This allows us to scale a small number of recorded voices to many new locales and languages, while increasing expressiveness. Scored according to MUSHRA (multiple stimuli with hidden reference and anchor) listening tests, the quality of our models’ polyglot outputs across nine locales spanning English, French, Italian, German, and Spanish improved 5% to 20% over those of our previous model family.
| Locale | Improvement over baseline |
| --- | --- |
| US-English | +12.43% |
| Southern US-English | +20.05% |
| Great Britain-English | +5.97% |
| Australia-English | +5.50% |
| US-Spanish | +11.78% |
| Spain-Spanish | +13.23% |
| France-French | +8.44% |
| Germany-German | +14.12% |
| Italy-Italian | +9.80% |
Robustness
Traditional TTS had failure modes, but hallucination and random truncation weren't chief among them. LLM-based TTS can generate confident-sounding speech that doesn't match the input, and it will sometimes stop mid-sentence.
Chain-of-thought for autoregressive TTS
Traditional TTS pipelines have explicit stages: grapheme-to-phoneme conversion, duration prediction, and acoustic generation. More recent, non-autoregressive end-to-end models like FastSpeech predict durations explicitly before speech generation.
LLM-based TTS takes a different approach: duration emerges implicitly from autoregressive generation. There's no explicit plan for how long the utterance should be or how long each phoneme should take. This is why these models sometimes hallucinate (keep generating past the intended content) or truncate (stop too early).
To address this problem, we add chain-of-thought reasoning to the model: before generating speech tokens, the model predicts phoneme sequences and estimates duration (total length and per-phoneme timing).
This isn't the same as traditional TTS pipelines. Bolting duration prediction onto an autoregressive architecture is a different problem than building it into a non-autoregressive one, and it has its own challenges.
Phoneme prediction enables the model to handle heteronyms ("read," "lead") and unusual names more reliably. Duration prediction gives the model a timing plan, which reduces both hallucination and truncation. These predictions are also useful for debugging, as you can see what the model "thought" it was going to generate before it started generating.
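One way to picture the chain-of-thought output is as a serialized plan that precedes the speech tokens. The sketch below is hypothetical: the delimiter tokens, the phoneme notation, and the frame-count format are illustrative, not the actual training format.

```python
def cot_prefix(phonemes: list[str], frames_per_phoneme: list[int]) -> str:
    """Hypothetical serialization of the chain-of-thought plan the model
    emits before any speech tokens: the phoneme sequence, a per-phoneme
    frame count, and the total utterance length."""
    plan = " ".join(f"{p}:{d}" for p, d in zip(phonemes, frames_per_phoneme))
    total = sum(frames_per_phoneme)
    return f"<phonemes> {plan} <total_frames> {total} <speech>"
```

Because the plan is generated first, it is available both to condition the speech tokens that follow and to downstream checks after generation.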
Guardrails
Our guardrails use the chain-of-thought predictions as checkpoints. We know the expected phoneme count and approximate speech duration before generation starts. After generation, we do a pair of checks: does the output duration match the prediction, and is the output length reasonable given the phoneme count? Large deviations flag likely hallucinations or truncations.
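The pair of checks can be sketched as a small validation function. The tolerance and the plausible seconds-per-phoneme band below are illustrative assumptions, not the production thresholds:

```python
def check_output(pred_duration_s: float, pred_phoneme_count: int,
                 out_duration_s: float, tol: float = 0.2,
                 min_s_per_phoneme: float = 0.03,
                 max_s_per_phoneme: float = 0.5) -> list[str]:
    """Post-generation sanity checks against the chain-of-thought plan.
    Returns a list of detected issues (empty list = output accepted)."""
    issues = []
    # Check 1: does the realized duration match the predicted duration?
    if abs(out_duration_s - pred_duration_s) > tol * pred_duration_s:
        issues.append("duration_mismatch")
    # Check 2: is the duration plausible given the phoneme count?
    per_phoneme = out_duration_s / max(pred_phoneme_count, 1)
    if not (min_s_per_phoneme <= per_phoneme <= max_s_per_phoneme):
        issues.append("implausible_rate")
    return issues
```

A prediction of 10 seconds for 100 phonemes that comes back at 20 seconds, for instance, would be flagged as a likely hallucination.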
When an agent detects problems, it can prompt the TTS system to regenerate with different sampling parameters or fall back to alternative approaches.
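A regeneration loop of this kind might look like the following sketch, where `tts_generate` stands in for the TTS system plus its guardrail checks, and the temperature schedule is an illustrative assumption:

```python
def generate_with_retries(tts_generate, text: str, max_attempts: int = 3):
    """Hypothetical agentic retry loop: regenerate with progressively
    more conservative sampling when guardrail checks flag the output.
    tts_generate(text, temperature) -> (audio, list_of_issues)."""
    temperature = 0.8
    for _ in range(max_attempts):
        audio, issues = tts_generate(text, temperature=temperature)
        if not issues:  # empty issue list: output accepted
            return audio
        temperature *= 0.7  # sample more conservatively on retry
    return audio  # fall back to the last attempt
```

Lowering the temperature on retry is one simple "different sampling parameters" policy; a fuller agent could also switch decoding strategies or fall back to a non-LLM voice.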
Data filtering
To filter the data used to train the TTS model, we combine speech-recognition-based metrics with metrics based on the LLM's attention mechanism. Automatic speech recognition (ASR) catches actual transcription errors, while the attention-based metrics distinguish genuine text-audio misalignment from expressive delivery that ASR would mistake for error. Taken together, the metrics keep data that's genuinely well aligned while preserving expressiveness that ASR-only filtering would discard.
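One common attention-based alignment signal is how sharply each text token is attended to across the speech frames; diffuse attention suggests poor alignment. The sketch below combines such a score with a word-error-rate cutoff. The scoring rule and both thresholds are illustrative assumptions, not our production filter:

```python
import numpy as np

def alignment_score(attn: np.ndarray) -> float:
    """Sharpness of text-to-speech cross-attention. attn has shape
    (speech_frames, text_tokens); we average, over text tokens, the
    maximum attention weight each token receives from any frame."""
    return float(attn.max(axis=0).mean())

def keep_sample(wer: float, attn: np.ndarray,
                wer_max: float = 0.15, attn_min: float = 0.5) -> bool:
    """Keep a (text, audio) pair only when ASR word error rate is low
    AND the attention alignment is sharp."""
    return wer <= wer_max and alignment_score(attn) >= attn_min
```

An expressive but correctly aligned sample with a slightly elevated WER can still be rescued by loosening `wer_max` when the attention score is high, which is the kind of trade-off an ASR-only filter cannot make.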
On generic long-form text, our full array of techniques reduces critical errors to an average of less than one second per hour, where “critical errors” include hallucinations, cutoffs beyond one word, and mismatches between input text and output speech.
Conclusion
LLM-based TTS models sound noticeably more natural than traditional systems. However, in our experience, they introduce new failure modes that need to be addressed before they can be deployed reliably in production. We have found that LoRA-based fine-tuning addresses the heavy accent leakage observed in polyglot TTS, while classifier-free guidance is a useful tool for improving expressiveness. As for reliability, we find that smart data filtering and chain-of-thought reasoning, coupled with guardrails and agentic regeneration, can significantly reduce hallucination.