IMPACT: Iterative mask-based parallel decoding for text-to-audio generation with diffusion modeling

Kuan Po Huang; Shu-wen Yang; Huy Phan; Bo-Ru (Roy) Lu; Byeonggeun Kim; Sashank Macha; Qingming Tang; Shalini Ghosh; Hung-yi Lee; Chieh-Chi Kao; Chao Wang

2025

Text-to-audio generation synthesizes realistic sounds or music from a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency because diffusion sampling is slow. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high audio quality and fidelity while ensuring fast inference. IMPACT applies iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach removes the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics, including Fréchet Distance (FD) and Fréchet Audio Distance (FAD), while significantly reducing latency compared to prior models.
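To make the idea of iterative mask-based parallel decoding over continuous latents more concrete, the following toy sketch may help. It is not the authors' implementation: the cosine unmasking schedule, the random choice of positions, and the `sample_latents` function (a random-number stand-in for a learned, text-conditioned diffusion sampler) are all assumptions made for illustration, as are the function names and default sizes.

```python
import math
import numpy as np


def cosine_masked_count(step, num_steps, seq_len):
    """Positions left masked after `step` (1-indexed), under an assumed cosine schedule."""
    ratio = math.cos(math.pi / 2 * step / num_steps)
    return int(math.floor(seq_len * ratio))


def sample_latents(latents, mask, positions, rng):
    """Hypothetical stand-in for a learned sampler.

    A real system would condition on the text prompt and on the already
    decoded latents; here we simply draw Gaussian noise of the right shape.
    """
    return rng.standard_normal((len(positions), latents.shape[1]))


def iterative_parallel_decode(seq_len=16, dim=4, num_steps=4, seed=0):
    rng = np.random.default_rng(seed)
    latents = np.zeros((seq_len, dim))    # continuous latents, one row per position
    mask = np.ones(seq_len, dtype=bool)   # True means "still masked"
    filled_per_step = []

    for step in range(1, num_steps + 1):
        masked_idx = np.flatnonzero(mask)
        keep_masked = cosine_masked_count(step, num_steps, seq_len)
        n_fill = len(masked_idx) - keep_masked
        if step == num_steps:
            n_fill = len(masked_idx)      # the last step fills everything left
        filled_per_step.append(n_fill)
        if n_fill == 0:
            continue

        # Fill several positions at once: this is the "parallel" part.
        chosen = rng.choice(masked_idx, size=n_fill, replace=False)
        latents[chosen] = sample_latents(latents, mask, chosen, rng)
        mask[chosen] = False

    return latents, mask, filled_per_step


if __name__ == "__main__":
    latents, mask, filled = iterative_parallel_decode()
    print("positions filled per step:", filled)
    print("any position still masked:", bool(mask.any()))
```

In this sketch the number of decoding iterations (`num_steps`) is much smaller than the sequence length, which is where the speed advantage of parallel decoding over position-by-position generation would come from; because the latents are real-valued vectors rather than codebook indices, no quantization step is involved.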
