SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
2025
Current large language model (LLM) evaluations primarily focus on single-answer tasks, whereas many real-world applications require identifying multiple correct answers. This capability remains under-explored due to the lack of dedicated evaluation frameworks. We introduce SATA-BENCH, a benchmark for evaluating LLMs on Select All That Apply (SATA) questions spanning six domains, including reading comprehension, legal reasoning, and biomedicine. Our evaluation of 32 models reveals substantial limitations: the strongest model achieves only a 75.3% Jaccard Index and 41.8% exact match accuracy. We identify three systematic biases underlying these failures: (i) unselection bias, where models systematically avoid certain correct answer choices; (ii) speculation bias, where models include incorrect answers when uncertain; and (iii) count bias, where models consistently under-predict the number of correct answers. To address these limitations, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding and abstention handling to guide models toward complete and accurate multi-answer selections. Choice Funnel improves exact match accuracy by up to 29% while reducing inference cost by more than 64% compared to existing approaches. We release SATA-BENCH and Choice Funnel to encourage the development of LLMs capable of robust decision-making in realistic multi-answer scenarios.
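To make the headline numbers concrete, the sketch below shows how the two reported metrics compare a predicted answer set against the gold set. The function names are illustrative, not the benchmark's released API; this is a minimal reimplementation of the standard set-based definitions, assuming answers are compared as unordered sets of choice labels.

```python
# Minimal sketch (illustrative, not SATA-BENCH's released code) of the two
# headline metrics for multi-answer selection: Jaccard Index and exact match.

def jaccard_index(predicted: set[str], gold: set[str]) -> float:
    """Intersection over union of the predicted and gold answer sets."""
    if not predicted and not gold:
        return 1.0  # convention: two empty sets count as a perfect match
    return len(predicted & gold) / len(predicted | gold)

def exact_match(predicted: set[str], gold: set[str]) -> float:
    """1.0 only if the predicted set equals the gold set exactly."""
    return float(predicted == gold)

# Example: selecting {A, C} when the gold answer is {A, B, C} scores 2/3 on
# Jaccard but 0 on exact match, which illustrates why exact match accuracy
# (41.8% for the strongest model) lags the Jaccard Index (75.3%): a single
# missed choice zeroes out exact match, consistent with the count bias of
# under-predicting how many answers are correct.
print(jaccard_index({"A", "C"}, {"A", "B", "C"}))  # 0.666...
print(exact_match({"A", "C"}, {"A", "B", "C"}))    # 0.0
```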