Modern AI coding tools have revolutionized software engineering, with developers now using AI assistants to write a substantial fraction of their code across a range of applications. As scientists studying the theory of machine learning, we’re already seeing a similar transformation in basic scientific methodology, especially for research of a mathematical nature.
More precisely, AI tools are now able to develop and write rigorous mathematical proofs from prompts providing only high-level proof sketches. These proofs are written in the longstanding “languages” used to detail mathematical arguments, in the same way that code is written in formal programming languages like Python. AI seems to have become proficient in both kinds of languages and their underlying logics.
We came to appreciate this proficiency during a three-week period last summer, when we used agentic AI tools to write a mathematical paper that normally would have taken months. The 50-page paper describes and solves an optimization problem based on concepts from graph theory and machine learning. A typical prompt we would give the AI to set up the general framework for our paper looked like this: “Imagine a directed acyclic network of linear least-squares learning agents, all of which share a common dataset but each of which sees only a different subset of the features.”
A typical prompt for a theorem statement and proof went like this: “We believe that if the network contains a sufficiently long chain of agents whose features cover the entire dataset, some agent in the chain should rapidly converge to the globally optimal linear model. The proof should use the fact that error monotonically decreases in the chain, which forces long sequences of agents to be multi-accurate with respect to each other’s features.” While incantations like these might be opaque to the casual reader, they all have precise, standard mathematical interpretations that the AI, thanks to its training, was aware of; it proceeded to translate our informal intuitions into precise definitions and statements. This translation was imperfect (as discussed below) but resulted in a great first draft that could then be corrected and smoothed.
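To give a concrete feel for this framework, here is a minimal Python sketch of one plausible reading of the setup; the chaining rule (each agent also sees its predecessor’s prediction) and all of the names and numbers here are our illustrative assumptions, not the actual construction from the paper.

```python
# A toy chain of linear least-squares agents over a shared dataset.
# Each agent sees its own feature subset plus its predecessor's prediction.
import numpy as np

rng = np.random.default_rng(seed=0)
n, d = 500, 6
X = rng.normal(size=(n, d))                            # shared dataset
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # noisy linear labels

# Each agent observes a different subset of the feature columns; together
# the subsets cover all of the features, as in the theorem sketched above.
feature_subsets = [[0, 1], [2, 3], [4, 5]]

prev_pred = np.zeros(n)  # a trivial "predecessor" for the first agent
for i, cols in enumerate(feature_subsets):
    # Regress y on this agent's features together with the upstream prediction.
    Z = np.column_stack([X[:, cols], prev_pred])
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    prev_pred = Z @ w
    print(f"agent {i}: mean squared error = {np.mean((y - prev_pred) ** 2):.5f}")
```

Including the upstream prediction as a regressor is one simple way to realize the monotone error decrease mentioned in the prompt: each agent can always reproduce its predecessor’s prediction, so its squared error can never be worse.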
To be clear, for this specific paper, we already knew the general outline of the proofs we had in mind. What AI did was to automate and dramatically speed up the process of filling in the missing details and writing them with formal precision. But more recently, we’ve written papers that we believe are substantially different and better than what we would have produced without AI assistance — in which the AI contributed key ideas that were crucial to the final results.
It’s important to note that AI tools are advancing quickly, which makes the future difficult to predict. While their use has shown the potential to produce faster and better research, it has also generated serious questions for those who care about the future of science and its relationship to the broader world. AI is already changing research norms and workflows, raising concerns about how to train future generations of scientists.
Specifically, how can intuition and “good taste” in scientific research be developed when AI automates many of the steps that have historically been used to train young researchers? Peer review is another challenge: AI-generated research papers, churned out quickly and at scale, expose the limitations of peer review and modern publishing structures, and they exacerbate emerging strains on the incentives for scientific success. Without claiming to have answers or solutions to these concerns, we are personally living through them and will discuss each in turn.
New ways of doing research
One of our major takeaways from our summer research project was that working with proof-based AI tools is akin to collaborating with a smart, broadly educated but occasionally error-prone colleague. You can verbally sketch a mathematical argument to an AI agent, much as you might to a human collaborator, and the agent can turn that sketch into a formally written lemma or theorem along with its proof.
Increasingly, AI agents can find proofs themselves without a sketch, especially when those proofs are “standard” in some area of mathematics. This is more useful than it sounds: many kinds of arguments are “standard” in some field, but often one in which you, the human author, are not an expert. An advantage of AI tools is that they are conversant in an enormous number of areas of mathematics and other scientific disciplines.
For example, as the AI worked through one of our main results from the sketch we provided incrementally, it spontaneously proved a simple but useful lemma we were not aware of, one that meaningfully simplified the argument we had in mind. The implications of this sort of creativity are exciting, especially for lowering the barrier to discovery: scientists without access to a diverse community of collaborators could participate in cutting-edge research in ways that were previously impossible.
Using these tools still requires caution and expertise, however. The proofs they generate are correct perhaps only three-quarters of the time. But when they’re wrong, if you can identify the errors, it is often possible to iterate to correctness and then continue along a promising path.
If the errors remain uncorrected, trying to continue often leads to a dead end. A 25% error rate is low enough to make the tools extremely useful to experts but high enough that the output can devolve into “AI research slop” (polished-looking but ultimately flawed or uninteresting work) when the tools are used without care or discernment. The models, after all, still don’t know what is “interesting” or “useful.”
We also noticed some recurring failure modes, or “rabbit holes”, in using the AI tools. While writing our paper, we asked the AI to generate a small, self-contained result, which it did perfectly in a matter of minutes, at which point we told it this subproject was completed. Nevertheless, over the following days, the AI would spontaneously suggest returning to the topic, despite being repeatedly told not to do so unless asked. This was an irritating reminder that generative AI does not have perfect recall, only an incomplete summary or embedding of the context. While working on the code for the experiments illustrating our theoretical findings, we found that the AI could alternate between writing large amounts of rather complex working code very rapidly and getting lost for hours on something trivial, like simply printing out which iteration of a loop was being executed.
Training the next generation
Historically, people have earned expertise in the mathematical sciences through struggle as junior researchers. PhD students spend years working through the details of technical arguments to gain hard-won intuitions about when a proof approach is promising, when a problem is leading them astray, or what constitutes a novel and interesting research direction.
But these aspects of being a researcher are exactly what AI tools are “giving away”. If doctoral students can simply ask AI for proofs — which is extremely tempting, especially when it is in service of advancing research — how do they develop the experience and skill that, for now at least, are required to use AI tools productively in the first place?
We may need to be more intentional about teaching these foundational skills to young researchers, perhaps adopting an advanced version of the way arithmetic is taught in grade school without calculators. The straightforward recommendation is to require junior researchers to write papers “the old-fashioned way”, even when their work could be sped up by AI.
Perhaps in a separate track, students would be trained to understand and work with emerging AI tools. This is an area of increasing importance that will likely require creative solutions. While we are strong believers that AI tools will do astounding things for science, it may be important to deliberately moderate their use in order to build researchers up to the point at which they can use them wisely and tastefully, not simply as shortcuts to second-rate (or worse) research.
These next-generation training challenges aren’t unique to scientists using AI. We see them across myriad fields, including engineering, customer service, law, writing, and design — really, any industry in which entry-level tasks, previously used to introduce young workers to a field, are now done using AI. To find creative solutions to this skills-training challenge, or to just better anticipate the changes at hand, it might be helpful to look at analogies across fields or over time.
After high-level programming languages and compilers came into widespread use in the early 1960s, most software engineers no longer wrote machine code or assembly language, which provided direct instructions to the underlying hardware but were tedious to program in. But the best programmers still understood enough about how compilers translate high-level languages into machine code to reason about correctness and performance. We hope that making it easier to construct and check technical arguments will let all researchers operate at a higher level of abstraction and “think bigger thoughts”. The culture we envision would emphasize taste, problem selection, and modeling skills, and devalue technical wizardry for its own sake.
Breaking and remaking peer review
From our perspective, peer review is not only, or even primarily, a process to verify the correctness and quality of research. Rather, its purpose is to focus a scarce resource — the attention of the research community — in the right places. Science progresses as researchers build on each other’s work, but there is already too much work out there for anyone to keep up with. The publication process should help identify the most interesting and promising directions, so they can be more efficiently and thoroughly developed.
How does AI influence this focusing of communal attention? AI tools make it much easier to produce work that looks polished and correct, dramatically lowering the barrier to generating “papers” that can be submitted to journals and conferences. Many of these papers are neither interesting nor actually correct — but discovering this requires significant effort from reviewers.
This is straining an already overburdened machine learning publishing ecosystem, which now struggles with tens of thousands of submissions per venue. We have seen that reducing the time and effort needed to produce “a paper” (not necessarily a good paper) is beginning to destabilize our existing institutions for peer review. The most recent iterations of AI and ML conferences have seen submissions grow by large multiples, with a significant number of AI-polished but ultimately low-quality papers making it surprisingly far through the review process before being noticed and called out.
This is a problem across research fields, in part because it has created a market for AI-generated papers. That market has in turn engendered a countermarket for AI-assisted detection of AI-generated papers, much like the familiar technological arms races around spam and its detection, but with the integrity of scientific publication at stake rather than just the filtering of annoying or fraudulent e-mails.
As a short-term fix, AI-driven automated correctness checks (e.g., formal verification of mathematical proofs), tools for which are already being deployed at major conferences, could be valuable. Think of this as a form of unit testing, for math instead of code. The aim is to filter out papers with nontrivial errors while focusing human reviewers on the parts of science they are best suited to evaluate: determining what we learn about the world from a new result, and how useful and interesting it is, rather than drowning them in the monotony of checking countless papers for technical correctness.
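As a toy illustration of what such a check looks like, here is a tiny example in the Lean proof assistant; the choice of Lean, and the examples themselves, are purely illustrative, and we are not describing any particular venue’s tooling.

```lean
-- Machine-checked mathematics: Lean accepts a theorem only if the
-- accompanying proof actually establishes it, much as a unit test
-- passes only if the code meets its specification.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Built-in decision procedures can discharge simple arithmetic facts
-- automatically; this is the flavor of routine check one could imagine
-- running over the mechanical steps of a submitted proof.
theorem sum_le_double (a b : Nat) (h : a ≤ b) : a + b ≤ 2 * b := by
  omega
```

A reviewer freed from verifying steps like these can spend their attention on whether the result matters.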
Without a serious, community-wide re-evaluation of peer review, AI threatens to arrest scientific progress at the community level even as it accelerates it at the level of individual researchers.
Looking ahead
We think AI is bringing a sea change in scientific research methodology, training, and peer review; there is no hiding from what is coming. But there are opportunities to proactively adapt and ensure that AI-assisted research fulfills its promise. What will research look like at the end of next year? The year after that? We’ve seen more change in the past year than in the previous decade, so all we can confidently predict is “different”.
Our scientific institutions — peer review, publishing, graduate education — evolved over decades to match the constraints of human cognition and effort. Those constraints are shifting rapidly, and our institutions will need to shift with them. Our goal should be to steer toward a world where AI amplifies human creativity and insight, accelerates discovery, and expands who can participate in the research enterprise — while preserving the joy and rigor that make science worthwhile.