As foundation models (FMs) — such as Transformer-based large language models (LLMs) — have grown in popularity, we have seen a pattern repeat: a new model launches with stunning benchmark scores, teams get excited and start testing, and then they hit production reality. The model that aced the public benchmarks struggles with the specific use cases organizations want to enable.
This is because the public benchmarks are “in the probability distribution” of the data used to train the models, whereas the use cases organizations are interested in are “out of distribution.” This distribution mismatch happens for two main reasons:
- The application depends on data, knowledge, and tools secured within an organization; these assets are not part of the public datasets used to train LLMs.
- Customer behavior and the application context keep evolving, so the new model is obsolete on the day it is deployed.
A few months back, we asked how we could meet these fundamental challenges. Our front-row seats to diverse, large-scale application-development efforts within Amazon helped us invent a whole new service called Amazon Nova Forge that empowers organizations to build their own expert foundation models using Amazon Nova.
In essence, Nova Forge gives you the training tools and recipes to make your differentiated use cases become “in distribution,” so your application can meet the highest standards of accuracy, reliability, cost effectiveness, and control. The result is a model that knows your organization and use cases as an expert in your domain. We call this model a “Novella” — a variant of Nova that is optimized for your organization.
Before Amazon Nova Forge
Historically, organizations have had three suboptimal choices for mitigating the challenges I described above.
First, they could fine-tune closed-weights LLMs using APIs that are typically based on low-rank adaptation (LoRA), which learns only a small low-rank update on top of frozen weights (sketched in code below). But such limited adaptation cannot give the customized model a deep understanding of proprietary domain knowledge and complex workflows.
Second, they could continue pretraining a base open-weights model or continue post-training one that is already instruction tuned for a set of use cases. But open-weights models do not come with the data used to train them or with their exact training recipes — e.g., how many training epochs, on which datasets, and at what learning rates. Consequently, it is extremely difficult to steer them to particular use cases without regressing on the core properties of the base model, a phenomenon known as “catastrophic forgetting.”
Third, they could build a frontier-scale model from scratch, but that requires massive computational resources, expert developers, and time.
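To make the first of these options concrete, here is a minimal PyTorch sketch of the LoRA idea referenced above: the pretrained weights stay frozen, and only a small low-rank correction is trained. The class, rank, and scaling below are illustrative, not any provider's fine-tuning API.

```python
# Minimal sketch of low-rank adaptation (LoRA). The pretrained projection
# stays frozen; only the low-rank factors A and B are trained, so the
# effective weight becomes W + (alpha / rank) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the trained low-rank correction.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Only A and B, a tiny fraction of the model's parameters, receive gradients, which is what makes this style of adaptation cheap but also limits how much new domain knowledge it can absorb.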
Nova Forge, by contrast, is built on an entirely new paradigm of “open training” and has two main pillars: access to checkpoints from each major stage of model development and the ability to mix proprietary data with the data curated for training Amazon Nova.
Access to checkpoints from each major stage of model development
Most state-of-the-art foundation models are trained in three stages. First is pretraining, where the model is trained to predict the next token (i.e., unit of the LLM’s vocabulary, such as a word or a word part) in a sequence of tokens using large quantities of unlabeled data.
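Concretely, the pretraining objective is just cross-entropy between the model's predictions and the actual next token at each position. A minimal PyTorch sketch, where `model` is a placeholder for any network that maps token IDs to per-position vocabulary logits:

```python
# Minimal sketch of the next-token-prediction objective used in pretraining.
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    # token_ids: (batch, seq_len) of integer token IDs from the vocabulary
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # shift by one
    logits = model(inputs)                    # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # score every position
        targets.reshape(-1),                  # against the actual next token
    )
```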
Second is mid-training, where real-world and synthetic user-system interactions (traces) help improve the model’s performance on a prioritized set of applications and tasks while increasing (or at least preserving) generalizability to previously unseen tasks. Mid-training is like pretraining, except that the data is specific to a set of tasks that the model provider wants the model to excel at, and the learning rate (i.e., how much a given training example modifies the model) is different.
Third is post-training, including supervised fine-tuning (SFT), where the model learns to complete tasks from curated demonstrations and instructions (e.g., from software engineering), and reinforcement learning (RL), which helps improve accuracy on these tasks and align the model’s outputs to specific policies.
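A common way to implement the SFT objective is the same next-token loss with the prompt positions masked out, so that the model is scored only on the demonstrated response. A minimal sketch, under that assumption and with placeholder names:

```python
# Minimal sketch of supervised fine-tuning on (prompt, response) pairs:
# identical next-token loss, but prompt positions are masked so the model
# learns to produce the demonstration, not to echo the prompt.
import torch
import torch.nn.functional as F

IGNORE = -100  # the target index that cross_entropy skips

def sft_loss(model, token_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = IGNORE     # no loss on prompt tokens
    logits = model(inputs)                    # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE,
    )
```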
Depending on the complexity of the target application and the relative importance of historical data and ongoing usage, organizations need the ability to infuse their data and knowledge into one or all of these stages. This is why Nova Forge provides three model checkpoints — pretrained, mid-trained, and post-trained — and the recipes and code to continue training from any of them.
If you are working in a novel domain that is not represented in the pretraining data at all (e.g., geospatial or radiology images) and have many trillions of tokens, you can continue pretraining from the pretrained checkpoint. If you have a few billion to a few trillion tokens of historical data or can synthesize interactions, you can continue from mid-trained checkpoints. You can also perform SFT and RL on the mid-trained checkpoints. Lastly, the most common use case is to continually update the model using RL from real-world feedback or synthetic data.
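As a rough rule of thumb, that guidance can be summarized in a few lines. The thresholds and checkpoint labels below simply paraphrase the paragraph above; they are not part of the Nova Forge interface.

```python
# Illustrative decision helper only: thresholds and names paraphrase the
# guidance above and are not an actual Nova Forge API.
def pick_starting_checkpoint(novel_domain: bool, tokens: int) -> str:
    if novel_domain and tokens >= 10**12:
        return "pretrained"    # trillions of tokens in an unrepresented domain
    if tokens >= 10**9:
        return "mid-trained"   # billions to ~a trillion tokens, then SFT/RL
    return "post-trained"      # continual RL from feedback or synthetic data
```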
Mixing proprietary and frontier data
Foundation models with frontier capabilities come from frontier-scale data. While techniques such as regularization and carefully crafted learning rates can help mitigate the challenges of catastrophic forgetting, the best way to infuse new knowledge into a model without losing existing capabilities is to mix frontier-scale data with your own proprietary data.
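At its core, such mixing can be as simple as weighted sampling between two streams of training examples. A minimal sketch, assuming two iterators of examples and a purely illustrative 10% proprietary share:

```python
# Minimal sketch of data mixing: interleave frontier-scale curated data
# with proprietary data at a chosen ratio.
import random

def mixed_stream(frontier, proprietary, p_proprietary: float = 0.1):
    """Yield training examples, drawing from `proprietary` with the given
    probability and from `frontier` otherwise (both infinite iterators)."""
    while True:
        source = proprietary if random.random() < p_proprietary else frontier
        yield next(source)
```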
This is why, for all stages of training, Nova Forge provides API-based mixing of the high-quality curated data used to train our frontier models with your proprietary data. To the best of our knowledge, no proprietary FM provider — or even open-weights-model developer — has provided the ability to mix frontier-scale data with proprietary data during pretraining, mid-training, and post-training.
When organizations blend their proprietary data with high-quality curated data at early stages, they achieve something fundamentally different from customization choices that were available before Nova Forge: they build models where expertise in their domain is the core capability of the model, not an afterthought. The model learns to reason about domain-specific concepts as fluently as it reasons about the general knowledge available in public sources.
Consider the experience of Nimbus Therapeutics, a clinical-stage drug discovery company, when building an AI system to accelerate molecular design. Drug discovery requires finding the right balance of many properties within a single molecule. It is an exponentially complex task that cannot be solved by manual exploration of candidate combinations. The goal was to build a model that could generate molecular designs, reason through complex problems, and predict which molecules are worth testing in the lab, where each experiment can cost thousands of dollars.
Off-the-shelf LLMs lacked the deep understanding of chemistry required for such specialized work. While Nimbus had already built a suite of specialized machine learning models to address this gap, these models still lacked true chemical-reasoning capabilities, and maintaining a collection of separate models had become increasingly complex and resource intensive.
The team began by testing Nova 2 Lite on pharmaceutical-patent analysis, where it achieved 95% accuracy without any customization. This impressive result gave them confidence to use Nova Forge for a more ambitious goal: creating one unified molecular-intelligence system. For instance, a model needs to understand not just how to connect atoms to make a realistic molecule but also how specific structural features in each molecule map to physicochemical properties, biological activities, and toxicophores. A grasp of these complex relationships is difficult to bolt on after a model's knowledge of structures has solidified.
Nova Forge enabled the team to bring in its own proprietary chemistry datasets and drive performance improvement using supervised fine-tuning and reinforcement learning. Early results show that the custom model built using Nova Forge already outperforms other leading LLMs on molecular-property prediction tasks by significant margins, with the promise of expanding into molecular generation — a cutting-edge technology that will help bring better medicines to patients more quickly than ever before.
The next frontier
We released Amazon Nova Forge as the first service that enables organizations to build their own frontier models with Nova through this “open training” approach.
The capabilities we recently launched with Nova 2 Lite and three other Nova 1 models address the two challenges I outlined earlier. We are now working to meet an emerging challenge — reducing the time and effort required to transfer knowledge from an existing, customized Nova model to a newly released Nova model.
To that end, we are offering Forge customers early access to a more capable model, Nova 2 Pro, at the same time that we are providing it to our internal teams. Forge customers can use Nova 2 Pro in Amazon Bedrock right away to build their applications. In a few weeks, we will provide recipes for training from multiple checkpoints of Nova 2 Pro. Such early access to even more powerful models in Forge makes it easy for organizations to plan ahead for the transfer of knowledge to newer, more capable Nova models.
Our open-training approach also makes it easy for the broader research community to explore fundamental research questions — and it is another reason I am excited by the potential of Nova Forge. Just as open-source software enabled the modern Internet, open training may enable a future where every organization can build its own frontier AI.
The so what
I gave Nova 2 Lite a description of Nova Forge and asked for a one-sentence summary for our customers. Nova 2 Lite came back with “Nova Forge: Your AI, your rules—built faster, smarter, and on your terms.” I could not have done a better job of summarizing the spirit of what we are trying to accomplish here, helping organizations of all sizes and expertise excel in their domains and deliver value with AI.