New pretraining tasks enable better document understanding

DocFormerv2 makes sense of documents using local features, outperforming much bigger models.

In the digital era, when documents are generated and distributed at unprecedented rates, automatically understanding them is crucial. Consider the tasks of extracting payment information from invoices or digitizing historical records, where layouts and handwritten notes play an important role in understanding context. These scenarios highlight the complexity of document understanding, which requires not just recognizing text but also interpreting visual elements and their spatial relationships.

A mailing label from Harvard University Press, with several preprinted, labeled spaces for shipping data, such as "sold to", "ship to", and "date".
Visual document understanding (VDU): A snippet of a document receipt from the DocVQA dataset. A VDU model might be asked to predict the “sold to” address (visual question answering), to predict all relations (“sold to” → <address>, “ship to” → <address>), or to infer information from the table at the top of the document.

At this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI 2024), we proposed a model we call DocFormerv2, which doesn't just read documents but understands them, making sense of both textual and visual information in a way that mimics human comprehension. For example, just as a person might infer a report's key points from its layout, headings, text, and associated tables, DocFormerv2 analyzes these elements collectively to grasp the document's overall message.

Unlike its predecessors, DocFormerv2 employs a transformer-based architecture that excels in capturing local features within documents — small, specific details such as the style of a font, the way a paragraph is arranged, or how pictures are placed next to text. This means it can discern the significance of layout elements with higher accuracy than prior models.

A standout feature of DocFormerv2 is its use of self-supervised learning, the approach used in many of today’s most successful AI models, such as GPT. Self-supervised learning uses unannotated data, which enables training on enormous public datasets. In language modeling, for instance, next-token prediction (used by GPT) and masked-token prediction (used by T5 and BERT) are popular self-supervised objectives.

A schematic of the DocFormerv2 architecture, which takes as input both images of the document and the associated OCR output, along with the spatial coordinates of text, and which is trained on two tasks, token to line and token to grid.
DocFormerv2 architecture.

For DocFormerv2, in addition to standard masked-token prediction, we propose two additional tasks, token-to-line prediction and token-to-grid assignment. These tasks are designed to deepen the model's understanding of the intricate relationship between text and its spatial arrangement within documents. Let’s take a closer look at them.

Token to line

The token-to-line task trains DocFormerv2 to recognize how textual elements align within lines, imparting an understanding that goes beyond mere words to include the flow and structure of text as it appears in documents. This follows the intuition that most of the information needed for key-value prediction in a form or for visual question answering (VQA) is on either the same line or adjacent lines of a document. For instance, in the diagram below, in order to predict the value for "Total" (box a), the model has to look in the same line (box d, "$4.32"). Through this type of task, the model learns to give importance to the relative positions of tokens and their semantic implications.
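
To make the intuition concrete, here is a minimal Python sketch of how line-based training labels could be derived from OCR bounding boxes. It is an illustration under stated assumptions, not the method described in the paper: the function names (`assign_lines`, `line_distance_label`), the vertical-center clustering rule, the `tolerance` parameter, and the choice to cap the label at 2 are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, origin at top left.
Box = Tuple[float, float, float, float]


def assign_lines(boxes: List[Box], tolerance: float = 0.5) -> List[int]:
    """Assign each token a line index by clustering vertical centers.

    A token starts a new line when its vertical center is farther from the
    current line's first token than `tolerance` times its own height.
    This clustering rule is an assumption made for illustration.
    """
    # Visit tokens from top to bottom.
    order = sorted(range(len(boxes)), key=lambda i: (boxes[i][1] + boxes[i][3]) / 2)
    line_ids = [0] * len(boxes)
    current_line = -1
    line_center = None
    for i in order:
        _, y0, _, y1 = boxes[i]
        center = (y0 + y1) / 2
        height = y1 - y0
        if line_center is None or abs(center - line_center) > tolerance * height:
            current_line += 1
            line_center = center
        line_ids[i] = current_line
    return line_ids


def line_distance_label(line_ids: List[int], i: int, j: int, max_label: int = 2) -> int:
    """Label a token pair by how many lines apart the tokens are.

    0 = same line, 1 = adjacent lines, 2 = farther apart (capped; assumed).
    """
    return min(abs(line_ids[i] - line_ids[j]), max_label)


# Toy receipt: "State tax", "$0.32", "Total", "$4.32", "Cash", "Change".
boxes = [
    (10, 100, 60, 120),
    (200, 101, 250, 121),
    (10, 130, 60, 150),
    (200, 131, 250, 151),
    (10, 160, 70, 180),
    (10, 190, 70, 210),
]
line_ids = assign_lines(boxes)
total_vs_amount = line_distance_label(line_ids, 2, 3)  # "Total" vs. "$4.32"
```

In this toy layout, "Total" and "$4.32" receive the same line index, so a model trained on such labels would be rewarded for recognizing that the two tokens belong together.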

At left is a store receipt with the labels "state tax", "total", and "change" surrounded by red boxes and labeled, respectively, b, a, and c and the total amount of the charge, $4.32, labeled d. At right is a product order form with a 16-cell red grid superimposed on it, each cell labeled with a blue number (1-16).
Novel document pretraining tasks: token to line and token to grid.

Token to grid

Semantic information varies across a document's different regions. For instance, financial documents might have headers at the top, fillable information in the middle, and footers or instructions at the bottom. Page numbers are usually found at the top or bottom of a document, while company names in receipts or invoices often appear at the top. Understanding a document accurately requires recognizing how its content is organized within a specific visual layout and structure. Armed with this intuition, the token-to-grid task pairs the semantics of texts with their locations (visual, spatial, or both) in the document. Specifically, a grid is superimposed on the document, and each OCR token is assigned a grid number. During training, DocFormerv2 is tasked with predicting the grid number for each token.
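
The grid assignment described above can be sketched in a few lines of Python. This is an illustration rather than the paper's implementation: the 4 × 4 layout, the use of each token's box center, the row-major numbering starting at 1, and the function name `grid_labels` are assumptions chosen to match the 16-cell grid shown in the figure.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, origin at top left.
Box = Tuple[float, float, float, float]


def grid_labels(
    boxes: List[Box],
    page_width: float,
    page_height: float,
    rows: int = 4,
    cols: int = 4,
) -> List[int]:
    """Map each OCR token to a grid-cell number in 1..rows*cols.

    Cells are numbered row by row from the top left (an assumption).
    The cell is chosen by the center of the token's bounding box.
    """
    labels = []
    for x0, y0, x1, y1 in boxes:
        cx = (x0 + x1) / 2
        cy = (y0 + y1) / 2
        # Clamp so that boxes touching or exceeding the page edge stay in range.
        col = min(max(int(cx / page_width * cols), 0), cols - 1)
        row = min(max(int(cy / page_height * rows), 0), rows - 1)
        labels.append(row * cols + col + 1)
    return labels


# Three tokens on a 1000 x 1000 page: top-left corner, bottom-right corner,
# and one token in the lower-middle region.
labels = grid_labels(
    [(0, 0, 10, 10), (990, 990, 1000, 1000), (300, 600, 340, 620)],
    page_width=1000,
    page_height=1000,
)
```

During pretraining, labels like these would serve as per-token classification targets, so the model has to connect what a token says with where it sits on the page.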

Target tasks and impact

On nine different datasets covering a range of document-understanding tasks, DocFormerv2 outperforms previous comparably sized models and even does better than much larger models — including one that is 106 times as big as DocFormerv2. Since text from documents is extracted using OCR models, which do make prediction errors, we also show that DocFormerv2 is more resilient to OCR errors than its predecessors.

One of the tasks we trained DocFormerv2 on is table VQA, a challenging task in which the model must answer questions about tables (with either images, text, or both as input). DocFormerv2 achieved a 4.3% absolute performance improvement over the next-best model.

A spreadsheet table labeled "FM radio stations" whose column labels include "frequency", "call sign", "name", and "format". The entries in the "call sign" column are "KUSK", "KKYA", "KDAM", "WNAX-FM", and "KVHT". "WNAX-FM" is surrounded by a red box.
For the question "Which of these stations does not have a 'k' in its call sign?", DocFormerv2 correctly answers "WNAX-FM" (fourth row, second column). This requires reasoning over spatial, visual, and language features.
A spreadsheet table with three columns, labeled "District", "Location", and "Communities served". Four of the eight cells in the "Communities served" column — those whose entries begin "Roman Catholic Diocese of Cleveland" — are surrounded by red boxes.
For the question "How many of the schools serve the Roman Catholic diocese of Cleveland?", DocFormerv2 correctly answers "four". This requires arithmetic counting — a challenging task for machine learning models — and reasoning over multiple rows.
A police boat with the word "Police" written on its hull and, below the picture, the text query "What color is the word 'police' written in?"
In this example, an image and text (from an OCR model) are fed to DocFormerv2 along with the question “What color is the word ‘police’ written in?”. Due to its multimodal nature, DocFormerv2 can “see” the image and correctly answer “white”.

But DocFormerv2 also displayed more-qualitative advantages over its predecessors. Because it’s trained to make sense of local features, DocFormerv2 can answer correctly when asked questions like “Which of these stations does not have a ‘k’ in its call sign?” or “How many of the schools serve the Roman Catholic diocese of Cleveland?” (The second question requires counting, a hard skill to learn.)

In order to show the versatility and generalizability of DocFormerv2, we also tested it on scene-text VQA, a task that’s related to but distinct from document understanding. Again, it significantly outperformed comparably sized predecessors.

While DocFormerv2 has made significant strides in interpreting complex documents, several challenges and exciting opportunities lie ahead, like teaching the model to deal with diverse document layouts and enhancing multimodal integration.
