As large language models (LLMs) become increasingly useful across a variety of domains, the stakes of keeping them safe rise accordingly. Because bad actors might, for instance, try to use LLMs to write malicious code or make step-by-step guides for synthesizing toxic compounds, researchers are developing rigorous safeguards to keep LLMs from generating content that could pose serious public safety and security risks.
The most common way to assess the risks posed by LLMs is red-teaming, in which human evaluators design adversarial prompts intended to elicit harmful responses. But expert-curated sets of prompts cannot capture the full range of possible outcomes. Moreover, many evaluations focus on isolated prompts rather than conversations, which are where harmful behavior often emerges. Finally, today's benchmark failure metrics provide only a single score, rather than confidence bounds on worst-case conversational risks. This makes the findings unreliable and hard to generalize to the vast space of possible conversations.
In a paper we presented at this year's International Conference on Learning Representations (ICLR), we, along with researchers from the University of Illinois Urbana-Champaign (UIUC), address these red-teaming limitations by focusing on failures within conversational threat models and then assigning confidence bounds to the attack success rate, defined as the number of successful attacks divided by the total number of attacks. Our approach, called the C3LLM (certifying catastrophic conversational risks in LLMs) framework, shifts the focus of benchmarking failure from empirical spot-checking to statistical certification.
How to model a conversation
To build our framework, we first needed to model conversations, also known as "multiturn dialogues." We used a graph in which each node corresponds to a prompt, and the edges connecting nodes indicate that the prompts are semantically related. This graph approximates plausible conversational transitions, capturing how a user might naturally progress through related questions. In this way, we generate a more complete picture of queries, one that maintains the complexity of possible conversations.
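To make this concrete, here is a minimal sketch of one way to build such a prompt graph, assuming sentence-transformers embeddings and the networkx library; the embedding model and the similarity threshold are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: connect prompts whose embeddings are sufficiently similar.
# The model name and the 0.6 threshold are illustrative assumptions.
import itertools

import networkx as nx
from sentence_transformers import SentenceTransformer, util

def build_prompt_graph(prompts, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(prompts, convert_to_tensor=True)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(prompts)))  # one node per prompt
    # Add an edge between every pair of semantically related prompts.
    for i, j in itertools.combinations(range(len(prompts)), 2):
        if util.cos_sim(embeddings[i], embeddings[j]).item() >= threshold:
            graph.add_edge(i, j)
    return graph
```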
The graph also lets us define distributions of conversational threats, allowing us to determine the probability of harm across a range of adversarial capabilities. We simulate the lowest level of adversarial capability by sampling prompts independently, which is similar to traditional benchmarking in that it focuses on a single node, or query, at a time. We denote this approach Random Node with Jailbreak (RNwJ).
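In code, RNwJ-style sampling amounts to drawing a single node at random; a minimal sketch, in which the jailbreak_template placeholder is our assumption:

```python
import random

def sample_rnwj(prompts, jailbreak_template):
    # Lowest adversarial capability: draw one prompt uniformly at random,
    # ignoring graph structure, and wrap it in a jailbreak template,
    # mirroring traditional single-turn benchmarking.
    prompt = random.choice(prompts)
    return [jailbreak_template.format(prompt=prompt)]
```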
The next level up involves sampling a sequence of prompts that follows semantically connected paths through the graph. We developed two variants: Graph Path vanilla (GPv), in which each query is sampled by following the graph, and Graph Path harmful target constraint (GPh), which restricts the final query to come from a target set of harmful prompts. For the most advanced level of bad-actor capability, we approximate adversarial steering, in which a bad actor coaxes an LLM toward a harmful output. At this level, we sample adaptively, examining prior movements through the graph-based conversation to gauge the distance to a query that ultimately produces the harmful output. This approach, Adaptive with Rejection (AwR), mimics realistic red-teaming, in which an attacker adapts their phrasing to circumvent safety mechanisms.
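The path-based distributions can be sketched as random walks over the same graph. In the sketch below, leaving harmful_nodes unset corresponds to a GPv-style walk, while supplying a set of harmful target nodes approximates the GPh constraint via rejection; function and parameter names are ours, not the paper's.

```python
import random

def sample_graph_path(graph, prompts, length, harmful_nodes=None, max_tries=1000):
    # Random walk over the prompt graph: each step moves to a semantically
    # related prompt. If harmful_nodes is given, reject walks whose final
    # query is not in the target harmful set (GPh-style constraint).
    for _ in range(max_tries):
        path = [random.choice(list(graph.nodes))]
        for _ in range(length - 1):
            neighbors = list(graph.neighbors(path[-1]))
            if not neighbors:
                break
            path.append(random.choice(neighbors))
        if harmful_nodes is None or path[-1] in harmful_nodes:
            return [prompts[i] for i in path]
    raise RuntimeError("no admissible conversation path found")
```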
The graph gives us the ability to create sets of multiturn-dialogue prompts (specific sequences of queries) that we can run on a target LLM. We then label the LLM responses as catastrophic or non-catastrophic using a separate ChatGPT-based judge. This produces empirical estimates of the attack success rate under each conversational distribution. Given these estimates, C3LLM uses the Clopper-Pearson method to calculate lower and upper bounds on the probability of catastrophic risk.
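The Clopper-Pearson interval is a standard exact confidence interval for a binomial proportion, computable from beta-distribution quantiles. A minimal sketch using SciPy (the 95% confidence level is illustrative):

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    # Exact (1 - alpha) confidence bounds on the probability of a
    # catastrophic response, given `successes` harmful outcomes
    # observed in `trials` sampled conversations.
    lower = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lower, upper

# For example, 12 catastrophic responses in 200 sampled conversations:
print(clopper_pearson(12, 200))  # roughly (0.03, 0.10)
```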
Application: How does C3LLM perform on frontier LLMs?
UIUC researchers applied the proposed C3LLM framework to frontier proprietary models available at the time of the study, such as Claude-Sonnet-4 and Nova Premier, as well as open-weights models (models whose trained parameters are publicly available). The following figures show the certification results on the chemical/biological benchmark. Each panel shows the distribution of lower and upper bounds under different specifications for one LLM.
The following figures show the certification results on the cybercrime benchmark. Each panel shows the distribution of lower and upper bounds under different specifications for one LLM.
The results reveal that catastrophic risks are nontrivial for all frontier LLMs, with notable differences in safety across models. By comparing the bounds, we observe that among the models evaluated, Claude-Sonnet-4 and Nova Premier are safer than the others, while Mistral-Large and DeepSeek-R1 exhibit higher risks. In particular, Nova Premier demonstrates consistently low risk levels, largely because its built-in guardrails often block potentially unsafe content. On the other hand, DeepSeek-R1 reaches a certified lower bound of over 70% in cybercrime scenarios under RNwJ distributions.
Unlike prior work that reports attack success rates on fixed benchmarks, our approach provides high-confidence probabilistic bounds over large conversation spaces, enabling meaningful comparisons across models. We open-sourced the C3LLM framework for reproducibility and hope it enables researchers in industry and academia to perform more-principled safety studies.