In recent years, large language models (LLMs) have become indispensable assistants for software engineers and knowledge workers. Nimbus Therapeutics enlisted us at Amazon’s Generative AI Innovation Center and Artificial General Intelligence (AGI) organization to investigate whether it’s possible to make equally capable assistants for medicinal chemists discovering new drugs. Such an agent could significantly speed up drug discovery, potentially saving lives.
AI in drug discovery has traditionally involved models called graph neural networks, or GNNs. GNNs are the workhorses of molecular-property prediction across pharmaceutical R&D, and for good reason: they deliver strong accuracy on well-defined tasks.
Typically, multiple GNNs, specialized for different molecular properties, have to be built and maintained in-house — an expensive, operationally complex process. In recent years, the success of LLMs in a variety of research domains has caught the eye of biotech firms, but for drug discovery, general, off-the-shelf LLMs have proven to be less accurate than GNNs or other computational methods.
We have adopted a new approach that combines the accuracy of GNNs with the generalizability and reasoning ability of LLMs. Using supervised fine tuning (SFT) and reinforcement fine tuning (RFT) to customize a general-purpose LLM, we were able to achieve results comparable to those of using multiple GNNs, at a fraction of the time and labor.
Fine-tuned LLMs offer a significantly simplified workflow. In the traditional setting, each GNN has a separate interface, with its own quirks, data formats, and failure modes. Results come back as disconnected numbers that the chemist must manually integrate. When a new property needs to be predicted, someone must construct a multitask dataset and train and validate an entirely new model, a process that can take weeks.
In contrast, a single, fine-tuned LLM allows a chemist to submit one query and receive predictions on all molecular properties of interest. Adding a new property requires incremental fine tuning rather than building a new model from scratch. Moreover, a language model opens the door to a qualitatively different capability: conversation.
With a fine-tuned LLM, it’s now possible to ask for the reasoning behind the model outputs or to suggest molecular modifications that might yield the desired properties. This points toward an assistant that unifies molecular-property prediction and generation in one interactive experience, which we see as the ideal next step for AI-assisted drug design.
Customized LLMs unlock domain-specific scientific assistants, giving lean biotech teams a practical way to collaborate with AI systems that speak their scientific language.
Today, bringing a single drug to market takes 10 to 15 years and costs on average over $2 billion, with only about 8 percent of drug candidates that enter clinical trials receiving FDA approval. We believe that AI assistants could particularly improve productivity in the early stages of this pipeline, where chemists design molecules with druglike properties. Increasing the speed of development and the number of viable candidates would maximize the chances of delivering a safe and efficacious drug to the clinic.
What we looked at
Our work with Nimbus Therapeutics focused on properties spanning three categories critical to drug development:
- Lipophilicity (one associated property) determines whether a molecule can cross biological membranes. It is fundamental to drug absorption and distribution and affects all other characteristics of a drug.
- Permeability (four associated properties) measures how easily a drug enters the body via the bloodstream.
- Clearance (six associated properties) determines how quickly the body eliminates a drug. A drug that takes too long to be cleared could become toxic; one that is cleared too quickly won’t be effective.
These properties span different value ranges and exhibit complex interdependencies, in practice requiring separate multitask GNN models. We tested the general-purpose LLMs Claude Sonnet 4 and Nova 2 Lite on the task of predicting all three sets of properties for particular molecules. Despite their impressive capabilities elsewhere, the models significantly underperformed specialized GNNs: depending on the property, their root-mean-squared error (RMSE) was 40% to more than 200% higher.
However, we discovered that Nova 2 Lite with supervised fine tuning (SFT), followed by reinforcement fine tuning (RFT), could close that gap. Our single, fine-tuned LLM predicted 11 different molecular properties with accuracy similar to that of multiple separately trained multitask GNN models.
How we did it
Our approach to fine-tuning the LLM follows a principle common to both human-expertise development and machine learning: foundational knowledge must precede performance optimization. During SFT, the model learned core concepts such as molecular structure and property relationships. Then, during RFT, training shifted to the development of predictive judgment through practice and feedback.
During SFT, we exposed Nova 2 Lite to more than 55,000 molecules labeled with experimental measurements across 11 properties. SFT was essential because the domain-specific tasks we asked the model to perform fall far outside the distribution of Nova 2 Lite’s general pretraining data. For example, we use a notation called SMILES (simplified molecular-input line entry system) to represent chemical structures. Without SFT, the LLM wouldn’t have been able to perform a task like “predict chemical properties from SMILES strings in structured JSON format”.
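As an illustration of that task format, here is a minimal sketch in Python. The prompt wording, property names, and JSON schema are our own assumptions for the example, not the actual format used during fine-tuning:

```python
import json

def build_prompt(smiles: str, properties: list) -> str:
    """Assemble a hypothetical SMILES-to-property prediction request."""
    return (
        "Predict the following properties for the molecule below.\n"
        f"SMILES: {smiles}\n"
        f"Properties: {', '.join(properties)}\n"
        "Respond with a JSON object mapping each property to a number."
    )

def parse_response(text: str, properties: list) -> dict:
    """Parse the model's JSON reply, keeping only the requested properties."""
    raw = json.loads(text)
    return {p: float(raw[p]) for p in properties}

# Example round trip with a mocked model reply (caffeine's SMILES string).
props = ["logD", "papp_caco2"]  # illustrative property names
prompt = build_prompt("CN1C=NC2=C1C(=O)N(C)C(=O)N2C", props)
mock_reply = '{"logD": -0.6, "papp_caco2": 28.5}'
print(parse_response(mock_reply, props))
```

The structured JSON output is what makes a single query over many properties practical: one reply can be parsed into one number per property.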
The second training stage, reinforcement fine tuning (RFT), is especially critical for properties with limited experimental data, where SFT alone struggles to generalize. RFT also enables the intramodel transfer of learning across properties. For instance, lipophilicity affects permeability, and both can inform metabolism predictions. Further, RFT shifts the learning objective from pattern matching ("given molecule X, output value Y based on similar examples") to quality optimization ("minimize prediction error across all properties").
We tested the SFT and RFT models on 15,000 molecules unseen during training. We also built a system prompt that encoded knowledge of both core chemistry and our 11 chemical properties of interest, including their definitions and expected value ranges.
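Accuracy in these comparisons is reported as RMSE per property. As a minimal sketch of that evaluation loop (with made-up placeholder values and hypothetical property names, not the actual test data):

```python
import math

def rmse(y_true: list, y_pred: list) -> float:
    """Root mean squared error over paired measurements and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# One (measured, predicted) pair of lists per property; values are placeholders.
held_out = {
    "lipophilicity": ([1.2, 0.8, 2.1], [1.0, 0.9, 2.4]),
    "clearance_hlm": ([15.0, 42.0, 8.0], [18.0, 39.0, 11.0]),
}
for prop, (y_true, y_pred) in held_out.items():
    print(f"{prop}: RMSE = {rmse(y_true, y_pred):.3f}")
```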
During the RFT stage, we experimented with three strategies for generating the rewards that guided the learning process. Molecular-property prediction is particularly amenable to reward engineering for RFT: since the output is a single number, we can measure exactly how far off each prediction is.
Our first strategy was an exponential decay function, so that predictions closer to the true value received exponentially higher rewards. But at high error, improving from “terrible” to merely “bad” yielded almost no reward difference, keeping the model from learning from its worst predictions; at low error, small changes produced large reward differences, which made the reward signal noisy and ultimately unhelpful.
Our second strategy, binary pass/fail rewards, created the opposite problem. The model received zero reinforcement for gradual improvement: it either crossed an arbitrary threshold (in our case, correct within 10 percent) or learned nothing.
Rewards based on the Huber loss — a metric proposed in 1964 by the Swiss statistician Peter Huber, which limits the influence of outliers — solved both issues. Unlike exponential decay, Huber rewards don't become negligible on large errors — the model always receives a meaningful signal to improve — yet they remain stable near the correct answer, refining predictions without overreacting to small fluctuations. This yielded our best result, a 4.9% R² improvement over baseline, and we used the Huber reward as the default for training the model on multiple molecular properties simultaneously.
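The three reward shapes can be sketched as plain functions of the prediction error. The scale parameters below (the decay rate, the Huber delta, and the way the 10 percent tolerance is applied) are illustrative assumptions, not the values used in training:

```python
import math

def exp_decay_reward(error: float, alpha: float = 2.0) -> float:
    # Exponentially higher reward as the prediction nears the truth;
    # flattens at large error, so "terrible" and "bad" score almost alike.
    return math.exp(-alpha * abs(error))

def binary_reward(error: float, true_value: float, tol: float = 0.10) -> float:
    # Pass/fail: full reward within 10% of the true value, nothing otherwise.
    return 1.0 if abs(error) <= tol * abs(true_value) else 0.0

def huber_reward(error: float, delta: float = 1.0) -> float:
    # Negative Huber loss: quadratic near zero (stable refinement),
    # linear for large errors (always a meaningful signal to improve).
    a = abs(error)
    loss = 0.5 * a**2 if a <= delta else delta * (a - 0.5 * delta)
    return -loss

# At large errors, exponential decay barely distinguishes an error of 5
# from an error of 4, while the Huber reward still improves by ~1 per
# unit of error reduced.
print(exp_decay_reward(4.0) - exp_decay_reward(5.0))  # tiny (~3e-4)
print(huber_reward(4.0) - huber_reward(5.0))          # 1.0
```

The contrast in the last two lines is the whole argument: the Huber-based reward keeps a roughly constant incentive to reduce large errors, where the exponential reward has already saturated.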
Carrying this forward into multiproperty training, we fine-tuned a single model to predict all 11 properties simultaneously. Our best-performing model was Nova 2 Lite with RFT on top of full-rank SFT, meaning that all the model parameters were updated. It outperforms Claude Sonnet 4 by 39% and base Nova 2 Lite by 37% on average RMSE. While averaging 5% behind the baseline GNN, it matches or outperforms the GNN on 7 of 11 properties — a striking result given that a single LLM is going toe-to-toe with multiple independently trained multitask GNN models, reducing not just model count but the entire infrastructure footprint around training, deployment, and maintenance.
It’s important to note that Nova Forge — a service that allows Amazon Web Services customers to use proprietary data during both pretraining and SFT — supports both SFT and RFT on SageMaker, enabling extensive model customization. Since SageMaker handles the training framework and infrastructure maintenance internally, organizations avoid the cost of building and maintaining custom training pipelines from scratch.
What’s next?
Based on these initial experiments and results, Nimbus Therapeutics recently deployed its Novus model on Amazon Bedrock. Novus is the company’s custom-built LLM, created through Nova Forge. In its current form, Novus handles molecular-property prediction with an accuracy that is competitive with purpose-built GNNs.
The next milestone is extending those capabilities toward molecular design, enabling the model to propose structural modifications, predict their downstream properties, and explain its reasoning, all in a single conversation.
Acknowledgements
Leela Dodda (Nimbus), Aarush Garg (Nimbus), Matthew Medina (Nimbus), Md Tamzeed Islam, Elyse Zhang, Clement Perrot, Rohit Thekkanal, Shiv Vitaladevuni