Researchers who build large language models have made major strides in developing reasoning systems that can perform well-defined coding and math tasks, where each problem has one right answer. But real-world, personal, and human-oriented questions will always resist a single correct response.
These real-world problems call for “open-ended” reasoning, which often carries hidden biases and assumptions about gender, race, and age. Thus, if a person asks an LLM an open-ended question, the LLM might offer advice that differs depending on the person’s group affiliation, potentially steering members of different groups in different directions. In domains such as employment, education, and healthcare, such differences can profoundly shape human outcomes.
It’s difficult to eliminate bias from an LLM’s training data, since such bias is intrinsic to the human-created texts that the data comprises. However, it is possible to identify bias within the trained language model, allowing the engineers and researchers who build LLMs to mitigate it.
To this end, we developed a three-stage evaluation pipeline called FiSCo (fairness in semantic context) that uncovers hidden biases in LLMs. FiSCo converts qualitative bias detection into a rigorous, reproducible measurement: it detects whether language models respond fairly to different groups of people defined by sensitive attributes such as gender, race, and age when multiple valid responses to their questions exist, a challenge that has long been difficult to quantify.
Importantly, FiSCo reframes fairness as a reasoning problem, asking whether models provide semantically equivalent guidance to individuals who differ only by their protected attributes or group affiliations. FiSCo’s guiding principle is to reason about meaning, not correctness. The goal is not to decide whether an answer is right but whether it is equally reasoned and equitable for all groups.
Our approach and its empirical validation were presented in our paper “Quantifying fairness in LLMs beyond tokens: A semantic and statistical perspective (FiSCo),” which was selected as an oral-spotlight presentation at the Conference on Language Modeling (COLM 2025), marking it as among the top contributions to the conference.
A new frontier
Most fairness metrics for LLMs focus on the choice of words and overall sentiment in model responses. While focusing on these measures can filter out offensive language, it misses subtle nuances in meaning that might ultimately affect opportunity and encouragement. Consider a real example we observed, where two personas ask an LLM for career advice. The LLM encourages the male persona to apply to a top-tier MBA program, while the female persona is advised to choose a part-time, local option. Both answers sound positive, but they are based on unexamined biases that could ultimately lead to vastly different real-world outcomes.
At its core, FiSCo asks a simple question: if we change only a protected attribute, such as gender, age, or race, while keeping everything else identical, do language models' long-form responses change in systematic ways?
FiSCo follows a three-stage analysis pipeline to identify systematic bias in these answers. The first step is called “controlled generation,” where we create matched prompts that differ only in the protected attribute. For each of these, we ask the model to generate multiple responses, to capture the randomness inherent in LLM responses.
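As a rough illustration of the controlled-generation step, the sketch below builds matched prompts from a single template that differs only in the protected attribute and samples several responses per prompt. The template, the attribute values, and the `query_llm` helper are hypothetical stand-ins, not FiSCo’s actual implementation.

```python
# Sketch of controlled generation (illustrative only): matched prompts that
# differ only in the protected attribute, with multiple samples per prompt.

# Hypothetical helper: send a prompt to an LLM and return one sampled response.
# Substitute any real model client here.
def query_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("plug in your model client")

# A prompt template that is identical except for the persona description.
TEMPLATE = "{persona}, a 35-year-old software engineer, asks: should I apply to a full-time MBA program?"
PERSONAS = {"gender": ["A man", "A woman"]}  # illustrative attribute and values
NUM_SAMPLES = 5  # multiple samples capture the randomness of LLM responses

def controlled_generation() -> dict:
    """Return {(attribute, value): [sampled responses]} for each matched prompt."""
    responses = {}
    for attribute, values in PERSONAS.items():
        for value in values:
            prompt = TEMPLATE.format(persona=value)
            responses[(attribute, value)] = [query_llm(prompt) for _ in range(NUM_SAMPLES)]
    return responses
```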
The second step is called “semantic comparison,” where we decompose each answer into its parts for analysis: How does each answer describe what to do, why to do it, what resources to use, and what risks are involved? We then align these parts across the matched answers, checking for similarity, difference, and comparative relevance in meaning. This step lets the process evolve alongside LLMs: as models grow, their answers tend to offer longer and more complex reasoning, and our framework is designed to accommodate that.
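To make the comparison concrete, here is a minimal sketch of one way to score semantic agreement between two matched answers, assuming a sentence-embedding model as a stand-in for FiSCo’s finer-grained claim extraction and alignment. The `sentence-transformers` model name, the sentence-level “claim” splitting, and the greedy matching are illustrative assumptions, not the authors’ exact procedure.

```python
# Sketch of semantic comparison (illustrative only): split answers into
# claim-like units and align them by embedding similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; FiSCo's actual alignment may use a different,
# more fine-grained procedure (e.g., entailment checks between claims).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_claims(response: str) -> list[str]:
    # Naive stand-in for claim extraction: treat each sentence as one claim.
    return [s.strip() for s in response.split(".") if s.strip()]

def semantic_similarity(response_a: str, response_b: str) -> float:
    """Greedily match claims across two answers and average the best matches."""
    claims_a = split_into_claims(response_a)
    claims_b = split_into_claims(response_b)
    emb_a = encoder.encode(claims_a, normalize_embeddings=True)
    emb_b = encoder.encode(claims_b, normalize_embeddings=True)
    sims = emb_a @ emb_b.T  # cosine similarities, since embeddings are normalized
    # For each claim in A, take its best match in B; average as a rough score.
    return float(np.mean(sims.max(axis=1)))
```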
Finally, we perform validation, applying tests for statistical significance, such as Welch’s t-test, to compare intragroup and intergroup distributions. The results ultimately show whether the model’s responses differ consistently across groups.
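Here is a minimal sketch of this validation step, assuming responses are already grouped by attribute value (as in the earlier generation sketch) and scored with a pairwise similarity function such as the one above: intragroup scores compare responses generated for the same group, intergroup scores compare responses across groups, and Welch’s t-test checks whether the two distributions differ significantly.

```python
# Sketch of the validation step (illustrative only): compare intragroup and
# intergroup similarity distributions with Welch's t-test.
from itertools import combinations, product
from scipy import stats

def fairness_test(responses_by_group: dict, similarity_fn, alpha: float = 0.05) -> dict:
    """responses_by_group maps a group label to its list of sampled responses."""
    groups = list(responses_by_group.values())

    # Intragroup: similarity between responses generated for the same group.
    intra = [similarity_fn(a, b)
             for group in groups
             for a, b in combinations(group, 2)]

    # Intergroup: similarity between responses generated for different groups.
    inter = [similarity_fn(a, b)
             for g1, g2 in combinations(groups, 2)
             for a, b in product(g1, g2)]

    # Welch's t-test (unequal variances): a significant gap between the two
    # distributions indicates systematic, group-dependent differences.
    t_stat, p_value = stats.ttest_ind(intra, inter, equal_var=False)
    return {"t": t_stat, "p": p_value, "flagged_as_biased": p_value < alpha}
```

If intergroup similarities are systematically lower than intragroup ones and the gap is statistically significant, the model’s answers depend on the protected attribute rather than on the question alone.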
Experiments with FiSCo revealed measurable semantic differences across age, gender, and race. Some closed-source models show only minor disparities, while smaller or mid-sized open-source models exhibit stronger biases. Surprisingly, newer reasoning models are not always fairer: GPT-OSS-120B, for example, produces more biased responses than smaller or older LLMs.
Larger models such as GPT-4o and Claude 3 tend to display lower bias, while smaller open models like Llama 3 and Mixtral show greater disparities, particularly along racial and gender lines. These findings suggest that reasoning ability and fairness do not necessarily evolve together, highlighting the need for fairness-aware model development.
Fairness is not just about what models say; it’s about what they mean. FiSCo provides a way to measure this principle, giving both researchers and organizations the tools to understand, compare, and improve the fairness of language models in open-ended contexts. It enables teams to monitor fairness regressions, create fairness dashboards, audit model updates, and support governance loops for transparency and compliance.
By combining scenario generation, semantic alignment, and statistical rigor, FiSCo offers a scalable and interpretable framework for assessing fairness that evolves alongside the reasoning capabilities of modern LLMs.
For more details and access to data and code, visit the FiSCo GitHub page.