New pretraining tasks enable better document understanding

DocFormerv2 makes sense of documents using local features, outperforming much bigger models.

In the digital era, when documents are generated and distributed at unprecedented rates, automatically understanding them is crucial. Consider the tasks of extracting payment information from invoices or digitizing historical records, where layouts and handwritten notes play an important role in understanding context. These scenarios highlight the complexity of document understanding, which requires not just recognizing text but also interpreting visual elements and their spatial relationships.

A mailing label from Harvard University Press, with several preprinted, labeled spaces for shipping data, such as "sold to", "ship to", and "date".
Visual document understanding (VDU): A snippet of a document receipt from the DocVQA dataset. A VDU model might be asked to predict the “sold to” address (visual question answering), to predict all relations (“sold to” → <address>, “ship to” → <address>), or to infer information from the table at the top of the document.

At this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI 2024), we proposed a model we call DocFormerv2, which doesn't just read documents but understands them, making sense of both textual and visual information in a way that mimics human comprehension. For example, just as a person might infer a report's key points from its layout, headings, text, and associated tables, DocFormerv2 analyzes these elements collectively to grasp the document's overall message.

Unlike its predecessors, DocFormerv2 employs a transformer-based architecture that excels in capturing local features within documents — small, specific details such as the style of a font, the way a paragraph is arranged, or how pictures are placed next to text. This means it can discern the significance of layout elements with higher accuracy than prior models.

A standout feature of DocFormerv2 is its use of self-supervised learning, the approach used in many of today’s most successful AI models, such as GPT. Self-supervised learning uses unannotated data, which enables training on enormous public datasets. In language modeling, for instance, next-token prediction (used by GPT) and masked-token prediction (used by T5 and BERT) are popular pretraining objectives.
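To make masked-token prediction concrete, here is a minimal sketch of how training pairs can be built from unannotated text. The `[MASK]` symbol, the 15% masking rate, and the function name are illustrative assumptions, not details taken from DocFormerv2.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one self-supervised example from plain tokens.

    Returns (masked_input, targets). A target is the original token at
    masked positions and None everywhere else, so no human labels are needed.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the mask...
            targets.append(tok)   # ...and must predict the hidden token
        else:
            masked.append(tok)
            targets.append(None)  # nothing to predict here
    return masked, targets


tokens = "sold to Harvard University Press".split()
masked_input, targets = mask_tokens(tokens, mask_prob=0.3)
```

The point of the sketch is that the supervision signal comes from the text itself, which is what allows training on large unannotated collections.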

A schematic of the DocFormerv2 architecture, which takes as input both images of the document and the associated OCR output, along with the spatial coordinates of text, and which is trained on two tasks, token to line and token to grid.
DocFormerv2 architecture.

For DocFormerv2, in addition to standard masked-token prediction, we propose two new tasks: token-to-line prediction and token-to-grid assignment. These tasks are designed to deepen the model's understanding of the intricate relationship between text and its spatial arrangement within documents. Let’s take a closer look at them.

Token to line

The token-to-line task trains DocFormerv2 to recognize how textual elements align within lines, imparting an understanding that goes beyond mere words to include the flow and structure of text as it appears in documents. This follows the intuition that most of the information needed for key-value prediction in a form or for visual question answering (VQA) is on either the same line or adjacent lines of a document. For instance, in the diagram below, in order to predict the value for "Total" (box a), the model has to look in the same line (box d, "$4.32"). Through this type of task, the model learns to give importance to information about the relative positions of tokens and their semantic implications.
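One way to picture the training signal is as a label derived from how many lines separate two OCR tokens. The sketch below is an illustration under stated assumptions, not the published implementation: the `Token` class, the line-grouping heuristic, the cap of 2 on the label, and the pixel coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float  # box corners in page pixels (made-up values below)


def assign_lines(tokens, tol=0.5):
    """Group tokens into text lines by vertical center; return a line index per token."""
    order = sorted(range(len(tokens)),
                   key=lambda i: (tokens[i].y0 + tokens[i].y1) / 2)
    line_ids = [0] * len(tokens)
    line, last_center, last_height = -1, None, None
    for i in order:
        t = tokens[i]
        center, height = (t.y0 + t.y1) / 2, t.y1 - t.y0
        # Start a new line when the vertical gap is large relative to text height.
        if last_center is None or abs(center - last_center) > tol * max(height, last_height):
            line += 1
        line_ids[i] = line
        last_center, last_height = center, height
    return line_ids


def line_distance_label(line_a, line_b, max_label=2):
    """0 = same line, 1 = adjacent lines, max_label = farther apart (assumed cap)."""
    return min(abs(line_a - line_b), max_label)


receipt = [
    Token("State Tax", 10, 100, 80, 110),
    Token("Total", 10, 120, 50, 130),
    Token("$4.32", 200, 121, 240, 131),
    Token("Change", 10, 140, 60, 150),
]
lines = assign_lines(receipt)
same_line = line_distance_label(lines[1], lines[2])  # "Total" vs. "$4.32"
```

In this toy receipt, "Total" and "$4.32" fall on the same line, which is the kind of relation the task encourages the model to pick up.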

At left is a store receipt with the labels "state tax", "total", and "change" surrounded by red boxes and labeled, respectively, b, a, and c and the total amount of the charge, $4.32, labeled d. At right is a product order form with a 16-cell red grid superimposed on it, each cell labeled with a blue number (1-16).
Novel document pretraining tasks: token to line and token to grid.

Token to grid

Semantic information varies across a document's different regions. For instance, financial documents might have headers at the top, fillable information in the middle, and footers or instructions at the bottom. Page numbers are usually found at the top or bottom of a document, while company names in receipts or invoices often appear at the top. Understanding a document accurately requires recognizing how its content is organized within a specific visual layout and structure. Armed with this intuition, the token-to-grid task pairs the semantics of texts with their locations (visual, spatial, or both) in the document. Specifically, a grid is superimposed on the document, and each OCR token is assigned a grid number. During training, DocFormerv2 is tasked with predicting the grid number for each token.
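The grid assignment described above can be sketched in a few lines. This is a minimal illustration, assuming a 4 x 4 grid (matching the 16 numbered cells in the figure), row-major numbering starting at 1, and assignment by the center of each token's bounding box; the function name and those choices are assumptions rather than details confirmed by the article.

```python
def grid_number(box, page_width, page_height, rows=4, cols=4):
    """Return the 1-based, row-major grid cell containing the center of `box`.

    `box` is (x0, y0, x1, y1) in the same units as the page dimensions.
    """
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2
    cy = (y0 + y1) / 2
    # Scale the center into grid coordinates and clamp to valid cells so that
    # boxes touching or slightly exceeding the page edge still get a label.
    col = min(max(int(cx / page_width * cols), 0), cols - 1)
    row = min(max(int(cy / page_height * rows), 0), rows - 1)
    return row * cols + col + 1


# Hypothetical OCR tokens on a 1000 x 1000 page.
ocr_tokens = {
    "Invoice": (10, 10, 50, 30),         # top-left corner of the page
    "Qty": (300, 260, 400, 280),         # second row, second column
    "Page 1": (900, 950, 1000, 1000),    # bottom-right corner
}
targets = {text: grid_number(box, 1000, 1000) for text, box in ocr_tokens.items()}
```

Each entry of `targets` is the grid number the model would be asked to predict for that token during pretraining.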

Target tasks and impact

On nine different datasets covering a range of document-understanding tasks, DocFormerv2 outperforms previous comparably sized models and even does better than much larger models — including one that is 106 times as big as DocFormerv2. Since text from documents is extracted using OCR models, which do make prediction errors, we also show that DocFormerv2 is more resilient to OCR errors than its predecessors.

One of the tasks we trained DocFormerv2 on is table VQA, a challenging task in which the model must answer questions about tables (with either images, text, or both as input). DocFormerv2 achieved a 4.3% absolute performance improvement over the next best model.

A spreadsheet table labeled "FM radio stations" whose column labels include "frequency", "call sign", "name", and "format". The entries in the "call sign" column are "KUSK", "KKYA", "KDAM", "WNAX-FM", and "KVHT". "WNAX-FM" is surrounded by a red box.
For the question "Which of these stations does not have a 'k' in its call sign?", DocFormerv2 correctly answers "WNAX-FM" (fourth row, second column). This requires reasoning over spatial, visual, and language features.
A spreadsheet table with three columns, labeled "District", "Location", and "Communities served". Four of the eight cells in the "Communities served" column — those whose entries begin "Roman Catholic Diocese of Cleveland" — are surrounded by red boxes.
For the question "How many of the schools serve the Roman Catholic diocese of Cleveland?", DocFormerv2 correctly answers "four". This requires arithmetic counting — a challenging task for machine learning models — and reasoning over multiple rows.
A police boat with the word "Police" written on its hull and, below the picture, the text query "What color is the word 'police' written in?"
In this example, an image and text (from an OCR model) are fed to DocFormerv2 along with the question “What color is the word ‘police’ written in?”. Due to its multimodal nature, DocFormerv2 can “see” the image and correctly answer “white”.

DocFormerv2 also displayed qualitative advantages over its predecessors. Because it’s trained to make sense of local features, DocFormerv2 can answer correctly when asked questions like “Which of these stations does not have a ‘k’ in its call sign?” or “How many of the schools serve the Roman Catholic diocese of Cleveland?” (The second question requires counting, a hard skill to learn.)

In order to show the versatility and generalizability of DocFormerv2, we also tested it on scene-text VQA, a task that’s related to but distinct from document understanding. Again, it significantly outperformed comparably sized predecessors.

While DocFormerv2 has made significant strides in interpreting complex documents, several challenges and exciting opportunities lie ahead, like teaching the model to deal with diverse document layouts and enhancing multimodal integration.
