Improving faithfulness of text-to-image diffusion models through inference intervention
2025
Text-to-image diffusion models have shown remarkable capabilities in generating high-quality images. However, current models often struggle to adhere to the complete set of conditions specified in the input text and return unfaithful generations. Existing works address this problem either by fine-tuning the base model or by modifying the latent representations during inference with gradient-based updates. Not only are these approaches computationally expensive, but they also typically improve only limited kinds of errors (e.g., object counts). In this work, we propose an intervention-based mechanism that enhances the faithfulness of diffusion models by controlling the denoising process. Starting from layout-conditional diffusion models, our approach first detects incorrectly generated or missing objects during the denoising steps. Next, a layout is constructed from the erroneous objects (feedback). Finally, we return to an earlier denoising step and feed the new layout to the diffusion model to obtain its latent representation. Correction is applied by composing the new latents with the original ones and continuing the generation process, thereby steering the generation away from erroneous directions. As an additional feedback-and-correction strategy, we also explore retrieval-augmented generation to help the model recover missing objects. We conduct experiments on the VPEval and HRS-Bench datasets and measure faithfulness along four dimensions: presence of objects, object counts, scale of objects, and spatial relations between objects. Compared to GLIGEN, the state-of-the-art model on the VPEval dataset, our approach significantly improves on all metrics (+6.7% average accuracy). On the HRS-Bench dataset, it also outperforms existing models on the count and scale metrics.
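The detect-rewind-compose loop described above can be sketched in code. The following is a minimal, hypothetical Python/PyTorch rendering of that loop, not the authors' implementation: `model`, `detector`, the helpers `make_feedback_layout` and `layout_mask`, and the step indices are all assumed interfaces and placeholders chosen for illustration.

```python
# Hypothetical sketch of the inference-time intervention loop.
# All model/detector interfaces and helper functions are assumptions,
# not the paper's actual API.
import torch

def generate_with_intervention(model, detector, prompt, layout,
                               num_steps=50, check_step=25, rewind_step=35):
    """Denoise with a layout-conditional model, detect erroneous or missing
    objects at an intermediate step, rewind to a noisier step, and compose
    latents conditioned on a corrective (feedback) layout with the originals."""
    latents = torch.randn(model.latent_shape)   # start from pure noise (assumed attr)
    cache = {}                                  # latents saved at each denoising step
    feedback_layout = None                      # set once an error is detected
    t = num_steps - 1                           # higher t = noisier

    while t >= 0:
        cond_layout = layout if feedback_layout is None else feedback_layout
        new_latents = model.denoise_step(latents, t, prompt, cond_layout)

        if feedback_layout is not None and t in cache:
            # Composition: keep corrected latents only inside the feedback-layout
            # regions; reuse the original run's latents everywhere else.
            mask = layout_mask(feedback_layout, new_latents.shape)
            new_latents = mask * new_latents + (1 - mask) * cache[t]

        latents = new_latents
        cache[t] = latents

        if t == check_step and feedback_layout is None:
            preview = model.decode(latents)                     # rough intermediate decode
            errors = detector.find_missing_or_wrong(preview, layout)
            if errors:
                feedback_layout = make_feedback_layout(errors)  # layout built from errors
                latents = cache[rewind_step]                    # rewind to an earlier (noisier) step
                t = rewind_step - 1
                continue
        t -= 1

    return model.decode(latents)
```

In this reading, the rewind point and the checking step are free hyperparameters: checking too late leaves little noise for the correction to take effect, while checking too early makes object detection on the intermediate decode unreliable.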