SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
2025
Current large language model (LLM) evaluations primarily focus on single-answer tasks, whereas many real-world applications require identifying multiple correct answers. This capability remains under-explored due to the lack of dedicated evaluation frameworks. We introduce SATA-BENCH, a benchmark for evaluating LLMs on Select All That Apply (SATA) questions spanning six domains, including reading comprehension, legal reasoning, and biomedicine. Our evaluation of 32 models reveals substantial limitations: the strongest model achieves only a 75.3% Jaccard Index and 41.8% exact match accuracy. We identify three systematic biases underlying these failures: (i) unselection bias, where models systematically avoid certain correct answer choices; (ii) speculation bias, where models include incorrect answers when uncertain; and (iii) count bias, where models consistently under-predict the number of correct answers. To address these limitations, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding and abstention handling to guide models toward complete and accurate multi-answer selections. Choice Funnel improves exact match accuracy by up to 29% while reducing inference cost by more than 64% compared to existing approaches. We release SATA-BENCH and Choice Funnel to encourage the development of LLMs capable of robust decision-making in realistic multi-answer scenarios.
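To make the headline numbers concrete, the sketch below shows how the two reported metrics compare a predicted answer set against the gold set. The function names are illustrative, not the benchmark's released API; this is a minimal reimplementation of the standard set-based definitions, assuming answers are compared as unordered sets of choice labels.

```python
# Minimal sketch (illustrative, not SATA-BENCH's released code) of the two
# headline metrics for multi-answer selection: Jaccard Index and exact match.

def jaccard_index(predicted: set[str], gold: set[str]) -> float:
    """Intersection over union of the predicted and gold answer sets."""
    if not predicted and not gold:
        return 1.0  # convention: two empty sets count as a perfect match
    return len(predicted & gold) / len(predicted | gold)

def exact_match(predicted: set[str], gold: set[str]) -> float:
    """1.0 only if the predicted set equals the gold set exactly."""
    return float(predicted == gold)

# Example: selecting {A, C} when the gold answer is {A, B, C} scores 2/3 on
# Jaccard but 0 on exact match, which illustrates why exact match accuracy
# (41.8% for the strongest model) lags the Jaccard Index (75.3%): a single
# missed choice zeroes out exact match, consistent with the count bias of
# under-predicting how many answers are correct.
print(jaccard_index({"A", "C"}, {"A", "B", "C"}))  # 0.666...
print(exact_match({"A", "C"}, {"A", "B", "C"}))    # 0.0
```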