In 1986 the US Food and Drug Administration issued its first approval for human use of a therapeutic antibody. Despite steady advances in methodology, genetic sequencing, and biomedical science, 40 years later the process of discovering and optimizing therapeutic antibodies often remains prohibitively expensive in both money and time. Recent experiences with pandemic-scale infectious-disease outbreaks lend even greater urgency to the need to identify and develop these antibodies more quickly and efficiently.
Artificial-intelligence- and machine-learning-guided approaches to antibody design, in the form of biological foundation models (BioFMs), represent a significant opportunity to address these challenges. Models built on protein language models (pLMs) and structure-based deep-learning frameworks show real promise for predicting antibody developability properties — the characteristics that determine whether a molecule is manufacturable, stable, and safe as a therapeutic. Such tools could drastically shorten discovery timelines while also reducing experimental costs.
That potential, however, has been hindered by the lack of a public dataset that would allow researchers to benchmark those tools, a crucial step in the development of trustworthy in-silico tools for drug discovery. Existing public antibody datasets are often limited to a single antibody format or target. Others are composed of naturally occurring or clinically advanced antibodies, a bias that severely limits their utility for training or evaluating predictive models.
“Trust in the predictions made by these models must be grounded in evaluations against experimental data that is sufficiently large and diverse,” explained Luca Giancardo, an applied scientist with Amazon Web Services (AWS) who works on the Amazon Bio Discovery team. “That data must be representative of the real sequence space encountered during antibody engineering and balanced in terms of developability outcomes.”
Jeffrey Gray is a professor in the Chemical and Biomolecular Engineering Department at the Johns Hopkins Whiting School of Engineering, where he leads the Gray Lab, which focuses on the computational prediction and design of protein structures. He is also the original developer of RosettaDock, a tool for predicting the structures of protein complexes from their constituent proteins.
Gray noted that while AI has made tremendous progress in the prediction and design of antibody properties, his own lab’s benchmarks have shown that current models do not yet reliably predict critical developability features, such as solubility and specificity, needed for efficient design of therapeutics. He cited the lack of diverse data collected under standardized conditions as a primary limitation for training models. That, coupled with the absence of a comprehensive, heterogeneous, large-scale database, has significantly slowed the development of AI tools for antibody engineering.
Antibody developability benchmark
To that end, AWS, in collaboration with the Gray Lab and Johns Hopkins Engineering, is announcing the launch of the Antibody Developability Benchmark, powered by the largest and most diverse antibody dataset in the public literature. This is the first large-scale benchmark of antibody biophysical and biochemical properties designed to support the development and rigorous evaluation of in-silico antibody property predictors.
The Antibody Developability Benchmark is 20 times as diverse — in terms of antibody formats, targets, and developability profiles — as benchmarks currently available in the scientific literature. While other datasets may contain more individual antibody designs, they typically explore a single target or antibody framework with limited property coverage. The Antibody Developability Benchmark is unique in its combination of scale and heterogeneity, encompassing 50 seed antibodies, four structural formats, and 42 antigens. It also includes both favorable and unfavorable developability outcomes.
Gray lauded the opportunity to work with AWS experts, noting that the collaboration has enabled the creation of a dataset larger and more diverse than any of the publicly available datasets. He called the project an important next step toward fulfilling the promise of AI to improve human health.
The Antibody Developability Benchmark includes the first heterogeneous antibody-property dataset explicitly designed to capture favorable and unfavorable developability profiles across multiple antigens and mutation strategies. Crucially, all data was confirmed via wet-lab experiments, providing the ground-truth validation that existing public benchmarks lack.
“This dataset will allow researchers to confidently answer ‘Which model is better suited for our purposes?’” noted Giancardo, whose Bio Discovery team led the development of the dataset. “Today there are many computational models coming out that are mostly evaluated on either proprietary data or public datasets that are not representative of antibody heterogeneity. That means deciding what is better or worse is very, very hard — if not impossible.”
The unmatched diversity and deliberate heterogeneity of the Antibody Developability Benchmark will help make those determinations possible.
Michael Chungyoun, a PhD researcher at JHU who worked on the project, observed that the benchmark covers a wide space of antibodies, particularly in terms of their properties. He noted that allowing researchers to check against a very diverse benchmark can save time and labor by helping them compare models and choose the best approach.
The antibody dataset
The dataset consists of 50 clinically and scientifically relevant seed antibodies spanning four structural formats — IgG, VHH, NearGermline-IgG, and scFv — targeting 42 distinct antigens. For each antibody, it reports measurements of expression, purity, thermostability, aggregation, polyreactivity, and hydrophobicity — six traits that are essential in the development of viable therapeutic antibodies.
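To make that composition concrete, here is a minimal Python sketch of how such a table might be loaded and explored. The file name and column names are hypothetical; the article describes the dataset’s contents, not its release format.

```python
import pandas as pd

# Hypothetical file and column names: the benchmark spans 50 seed
# antibodies, four structural formats, and 42 antigens, with six
# measured developability traits per variant.
df = pd.read_csv("antibody_developability_benchmark.csv")

print(df["format"].unique())    # e.g., IgG, VHH, NearGermline-IgG, scFv
print(df["antigen"].nunique())  # 42 distinct antigens

# The six measured traits described above.
traits = ["expression", "purity", "thermostability",
          "aggregation", "polyreactivity", "hydrophobicity"]

# Example: inspect the spread of measurements for scFv variants.
scfv = df[df["format"] == "scFv"]
print(scfv[traits].describe())
```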
“The composition is a deliberate design choice,” Giancardo noted. “We strove to find a balance between heterogeneity of antibody classes, therapeutic targets, and mutation types, with the aim of creating benchmarks that would be generalizable across the structural diversity of the modern therapeutic-antibody landscape.”
Researchers at the Gray Lab, assisted by a sponsored research grant from AWS, helped select the seed antibodies for inclusion in the dataset. They were intentional about the seeds they chose, Chungyoun noted, opting in some cases for existing clinical-stage or FDA-approved antibodies. The team also selected germline antibodies: antibodies more akin to those that circulate in the human body but that aren’t approved therapeutics.
Germline antibodies, Chungyoun explained, have important biophysical characteristics. While some of those characteristics are shared with therapeutic antibodies, there are also differences between the two. The extent of those differences, and how to bridge that gap, is a vital and unanswered question.
Traditional antibody-based drug discovery begins with antibodies that come from animals or humans. Such germline antibodies, Chungyoun explained, occasionally need to be modified to look more like therapeutics, a process researchers are still exploring.
Mutation strategy
The dataset also includes engineered variants of each seed antibody, generated by applying systematic mutation strategies.
“Initially, the hardest thing was essentially coming up with example sequences that would cover the broad spectrum of properties and the ways of mutating these sequences,” Giancardo explained. “It’s challenging because you have to do it a priori; until you do it, you don’t know what will come out.”
Working with Johns Hopkins Engineering, Giancardo and his team systematically engineered variants employing a variety of approaches, including protein-language-model-guided (pLM-guided) versus non-pLM-guided mutation selection and amino acid substitutions versus insertions/deletions.
“Protein language models are essentially the equivalent of large language models [LLMs] for the protein world,” Giancardo said. “There are multiple ways of looking at proteins. A common way is expressing them as a string of amino acids, which are essentially letters.” When some of the letters in the amino acid chains are masked, the models can be trained to fill in the gaps — the same “self-supervised” approach used to train LLMs. The models can also be trained to predict the effect of swapping in a different letter or letters — that is, a mutation. That approach resulted in a wide variety of mutations — up to 99 engineered variants per seed.
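To make the masked-prediction idea concrete, here is a minimal sketch using ESM-2, one widely used open-source pLM, through the Hugging Face transformers library. The model choice, the toy heavy-chain fragment, and the masked position are illustrative assumptions; the article does not say which pLM the team used.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# ESM-2 (small 8M-parameter checkpoint) as a stand-in pLM.
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

# A toy heavy-chain framework fragment; a real workflow would
# typically target CDR positions.
sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"
position = 10  # 0-based index of the residue to mutate

# Mask one residue and ask the model to fill in the gap -- the
# self-supervised objective described above.
tokens = tokenizer(sequence, return_tensors="pt")
tokens["input_ids"][0, position + 1] = tokenizer.mask_token_id  # +1 skips CLS

with torch.no_grad():
    logits = model(**tokens).logits

# Rank amino acids by model likelihood at the masked position.
probs = logits[0, position + 1].softmax(dim=-1)
top = probs.topk(5)
for score, idx in zip(top.values, top.indices):
    print(tokenizer.convert_ids_to_tokens(idx.item()), f"{score:.3f}")
```

High-likelihood substitutions at a position become candidates for pLM-guided mutations; a non-pLM-guided strategy might instead draw substitutions at random or from fixed rules.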
The breadth and depth of those mutations contribute to another distinguishing feature of the Antibody Developability Benchmark: its deliberate heterogeneity. The inclusion of both favorable, or developable, and unfavorable, or poorly developable, examples sets it apart from existing datasets.
“This range is essential for training and evaluating machine learning models, which require balanced label distributions and exposure to the failure modes they are intended to predict and avoid,” Giancardo explained. He also clarified that those failures still fall within a range of viability.
“These are not examples that are obviously wrong but rather bad examples that have a fighting chance,” he added. “These all still meet some baseline quality assessment, meaning researchers could reasonably send them to a wet-lab partner to test.”
Zero-shot learning
Gray and his team at Hopkins Engineering also contributed by independently selecting and running existing open-source antibody design and prediction models. They then shared their findings with the Bio Discovery team, which compared the results those models generated against the benchmarking dataset without exposing the models to the information in that dataset.
“This is essentially zero-shot inference,” Giancardo said. That siloed approach allowed both sides to have greater confidence in the results the Antibody Developability Benchmark generated. “The fact that we operated separately gave us confidence that we were not introducing errors. There is no data leakage of any sort, even from an external perspective.”
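A minimal sketch of what such a zero-shot evaluation might look like: score every variant with a predictor that has never seen the benchmark’s labels, then compare its ranking against the wet-lab measurements. The naive hydropathy-based scorer, the file name, and the column names are stand-in assumptions, not part of the actual benchmark.

```python
import pandas as pd
from scipy.stats import spearmanr

# Kyte-Doolittle hydropathy values, used here as a deliberately naive
# stand-in for the open-source model under test.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def predict_hydrophobicity(sequence: str) -> float:
    # Mean hydropathy (GRAVY); "zero-shot" because no benchmark labels
    # are used to fit anything.
    return sum(KD[aa] for aa in sequence) / len(sequence)

# Hypothetical file and column names, as in the schema sketch above.
benchmark = pd.read_csv("antibody_developability_benchmark.csv")
scores = benchmark["sequence"].map(predict_hydrophobicity)

# Compare the predictor's ranking against wet-lab ground truth.
rho, pval = spearmanr(scores, benchmark["hydrophobicity"])
print(f"Spearman rho = {rho:.3f} (p = {pval:.2g})")
```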
The teams compared their data and used those results to further fine-tune the Antibody Developability Benchmark. That iterative process means researchers who use the benchmark can have greater confidence in the viability of their models before the necessary, and costly, step of working with a wet-lab partner. It can also shorten the overall experimental timeline.
“When you are confident enough to do a screen, then you can turn to the wet lab, get new metrics, and further train on those results, which will be much, much, much more meaningful,” Giancardo explained.
The future
Researchers at both AWS and Hopkins Engineering emphasized the importance of sharing model benchmarks based on the Antibody Developability Benchmark dataset with the larger scientific community. The benchmark results are now available as part of Amazon Bio Discovery; additional benchmarks will be added over time and released in a paper later this year.
The sharp uptick in proposed protein AI models has researchers excited, but the expense and time commitment of wet-lab work have thus far kept them from comparing those models head to head, Chungyoun observed. He noted that the launch of this dataset gives researchers an opportunity to learn which model characteristics improve performance, illuminating the connection between what models learn and how they can be improved to better predict developability properties.
The dataset won’t remain static either: more models and properties will be added in the future.
"The database has the potential to surface models and tools that may have previously gone unrecognized — research published in lesser-known venues or work that simply didn't receive the attention it deserved," said Nina Cheng, a senior science manager in the AWS Life Sciences organization. "This database can play a key role in bringing that kind of overlooked work to light."
Acknowledgements
Amazon Bio Discovery Science and product team: Luca Giancardo, Yue Zhao, Melih Yilmaz, Kemal Sonmez, Lan Guo, Gordon Trang, Edward Lee, Chuanyui Teh, Fangda Xu, Nina Cheng, Jiwon Kim.