As large language models (LLMs) become increasingly useful across a variety of domains, the stakes of keeping them safe rise accordingly. Because bad actors might, for instance, try to use LLMs to write malicious code or make step-by-step guides for synthesizing toxic compounds, researchers are developing rigorous safeguards to keep LLMs from generating content that could pose serious public safety and security risks.
The most common way to assess the risks posed by LLMs is red-teaming, in which human evaluators design adversarial prompts intended to elicit harmful responses. But expert-curated sets of prompts cannot capture the full range of possible outcomes. Moreover, many evaluations focus on isolated prompts rather than conversations, which are where harmful behavior often emerges. Finally, today's benchmark failure metrics provide only a single score, rather than confidence bounds on worst-case conversational risks. This makes the findings unreliable and hard to generalize to the vast space of possible conversations.
In a paper we presented at this year's International Conference on Learning Representations (ICLR), we, along with researchers from the University of Illinois Urbana-Champaign (UIUC), address these red-teaming limitations by focusing on failures within conversational threat models and then assigning confidence bounds to the attack success rate, defined as the number of successful attacks divided by the total number of attacks. Our approach, called the C3LLM (certifying catastrophic conversational risks in LLMs) framework, shifts the focus of benchmarking failure from empirical spot-checking to statistical certification.
How to model a conversation
To build our framework, we first needed to model conversations, also known as "multiturn dialogues." We used a graph in which each node corresponds to a prompt, and the edges connecting nodes indicate that the prompts are semantically related. This graph approximates plausible conversational transitions, capturing how a user might naturally progress through related questions. In this way, we generate a more complete picture of queries, one that maintains the complexity of possible conversations.
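To make this concrete, here is a minimal sketch of one way to build such a prompt graph, assuming sentence-transformers embeddings and the networkx library; the embedding model and the similarity threshold are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: connect prompts whose embeddings are sufficiently similar.
# The model name and the 0.6 threshold are illustrative assumptions.
import itertools

import networkx as nx
from sentence_transformers import SentenceTransformer, util

def build_prompt_graph(prompts, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(prompts, convert_to_tensor=True)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(prompts)))  # one node per prompt
    # Add an edge between every pair of semantically related prompts.
    for i, j in itertools.combinations(range(len(prompts)), 2):
        if util.cos_sim(embeddings[i], embeddings[j]).item() >= threshold:
            graph.add_edge(i, j)
    return graph
```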
The graph also lets us define distributions of conversational threats, allowing us to determine the probability of harm across a range of adversarial capabilities. We simulate the lowest level of adversarial capability by sampling prompts independently, which is similar to traditional benchmarking in that it focuses on a single node, or query, at a time. We denote this approach Random Node with Jailbreak (RNwJ).
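In code, RNwJ-style sampling amounts to drawing a single node at random; a minimal sketch, in which the jailbreak_template placeholder is our assumption:

```python
import random

def sample_rnwj(prompts, jailbreak_template):
    # Lowest adversarial capability: draw one prompt uniformly at random,
    # ignoring graph structure, and wrap it in a jailbreak template,
    # mirroring traditional single-turn benchmarking.
    prompt = random.choice(prompts)
    return [jailbreak_template.format(prompt=prompt)]
```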
The next level up involves sampling a sequence of prompts that follows semantically connected paths through the graph. We developed two variants: Graph Path vanilla (GPv), in which each query is sampled by following the graph, and Graph Path harmful target constraint (GPh), which restricts the final query to come from a target set of harmful prompts. For the most advanced level of bad-actor capability, we approximate adversarial steering, in which a bad actor coaxes an LLM toward a harmful output. At this level, we sample adaptively, examining prior movements through the graph-based conversation to gauge the distance to a query that ultimately produces the harmful output. This approach, Adaptive with Rejection (AwR), mimics realistic red-teaming, in which an attacker adapts their phrasing to circumvent safety mechanisms.
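The path-based distributions can be sketched as random walks over the same graph. In the sketch below, leaving harmful_nodes unset corresponds to a GPv-style walk, while supplying a set of harmful target nodes approximates the GPh constraint via rejection; function and parameter names are ours, not the paper's.

```python
import random

def sample_graph_path(graph, prompts, length, harmful_nodes=None, max_tries=1000):
    # Random walk over the prompt graph: each step moves to a semantically
    # related prompt. If harmful_nodes is given, reject walks whose final
    # query is not in the target harmful set (GPh-style constraint).
    for _ in range(max_tries):
        path = [random.choice(list(graph.nodes))]
        for _ in range(length - 1):
            neighbors = list(graph.neighbors(path[-1]))
            if not neighbors:
                break
            path.append(random.choice(neighbors))
        if harmful_nodes is None or path[-1] in harmful_nodes:
            return [prompts[i] for i in path]
    raise RuntimeError("no admissible conversation path found")
```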
The graph gives us the ability to create sets of multiturn-dialogue prompts (specific sequences of queries) that we can run on a target LLM. We then label the LLM responses as catastrophic or non-catastrophic using a separate ChatGPT-based judge. This produces empirical estimates of the attack success rate under each conversational distribution. Given these estimates, C3LLM uses the Clopper-Pearson method to calculate lower and upper bounds on the probability of catastrophic risk.
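The Clopper-Pearson interval is a standard exact confidence interval for a binomial proportion, computable from beta-distribution quantiles. A minimal sketch using SciPy (the 95% confidence level is illustrative):

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    # Exact (1 - alpha) confidence bounds on the probability of a
    # catastrophic response, given `successes` harmful outcomes
    # observed in `trials` sampled conversations.
    lower = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lower, upper

# For example, 12 catastrophic responses in 200 sampled conversations:
print(clopper_pearson(12, 200))  # roughly (0.03, 0.10)
```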
Application: How does C3LLM perform on frontier LLMs?
UIUC researchers applied the proposed C3LLM framework to frontier proprietary models available at the time of the study, such as Claude-Sonnet-4 and Nova Premier, as well as open-weights models (models whose trained parameters are publicly available). The following figures show the certification results on the chemical/biological benchmark. Each panel shows the distribution of lower and upper bounds under different specifications for one LLM.
The following figures show the certification results on the cybercrime benchmark. Each panel shows the distribution of lower and upper bounds under different specifications for one LLM.
The results reveal that catastrophic risks are nontrivial for all frontier LLMs, with notable differences in safety across models. By comparing the bounds, we observe that among the models evaluated, Claude-Sonnet-4 and Nova Premier are safer than the others, while Mistral-Large and DeepSeek-R1 exhibit higher risks. In particular, Nova Premier demonstrates consistently low risk levels, largely because its built-in guardrails often block potentially unsafe content. On the other hand, DeepSeek-R1 reaches a certified lower bound of over 70% in cybercrime scenarios under RNwJ distributions.
Unlike prior work that reports attack success rates on fixed benchmarks, our approach provides high-confidence probabilistic bounds over large conversation spaces, enabling meaningful comparisons across models. We open-sourced the C3LLM framework for reproducibility and hope it enables researchers in industry and academia to perform more-principled safety studies.