Structured human assessment of text-to-image generative models
2025
Following the rapid progress in text-conditioned image generation, there is a pressing need to establish clear comparison benchmarks. Unfortunately, assessing the performance of such models is highly subjective and notoriously difficult. Current automatic assessment of generated image quality and alignment to text is approximate at best, while human assessment is subjective, poorly calibrated, and often ill-defined. To address these concerns, we propose GenomeBench, a new framework for assessing the quality of text-to-image generative models. It consists of a prompt dataset richly annotated with semantic components, based on a formalized grounding of language and images. On top of it, we define a procedure for collecting human assessments through a carefully guided question-answering process. These assessments are then summarized into a novel score built around quality and alignment to text. We show that the proposal achieves higher inter-annotator agreement than baseline human assessment and better correlation between quality and alignment than automatic assessment. Finally, we use this framework to dissect the performance of recent text-to-image models, providing insights into the strengths and weaknesses of each.
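To make the pipeline concrete, the sketch below shows one way per-prompt semantic components and guided question answers could be rolled up into alignment, quality, and a combined score. The component names, the 1-5 quality scale, and the weighted average are illustrative assumptions for exposition, not the scoring rule defined in the paper.

```python
# Illustrative sketch only: the actual prompt schema, question set, and scoring
# rule are not specified in the abstract. The per-component yes/no questions for
# alignment, the Likert quality rating, and the weighting below are assumptions.

from dataclasses import dataclass
from statistics import mean

@dataclass
class Annotation:
    """One annotator's guided question-answering pass over one generated image."""
    component_present: dict[str, bool]  # e.g. {"cube": True, "red": False}
    quality_rating: int                 # e.g. a 1-5 Likert scale

def alignment_score(ann: Annotation) -> float:
    """Fraction of annotated semantic components judged as rendered in the image."""
    answers = list(ann.component_present.values())
    return sum(answers) / len(answers) if answers else 0.0

def summarize(annotations: list[Annotation], quality_weight: float = 0.5) -> dict:
    """Aggregate annotator responses into alignment, quality, and a combined score."""
    alignment = mean(alignment_score(a) for a in annotations)
    quality = mean((a.quality_rating - 1) / 4 for a in annotations)  # map 1-5 to 0-1
    combined = quality_weight * quality + (1 - quality_weight) * alignment
    return {"alignment": alignment, "quality": quality, "combined": combined}

# Example: three annotators assessing one image generated for the prompt
# "a red cube on a wooden table".
anns = [
    Annotation({"cube": True, "red": True, "wooden table": True}, 4),
    Annotation({"cube": True, "red": False, "wooden table": True}, 3),
    Annotation({"cube": True, "red": True, "wooden table": True}, 5),
]
print(summarize(anns))
```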