Author: Denis Avetisyan
Researchers have developed a system that intelligently extracts only the necessary information from visual documents, dramatically improving performance in question-answering and information retrieval.

AgenticOCR dynamically parses documents based on user queries, shifting optical character recognition from a static process to an efficient, agentic framework for visual reasoning.
Existing retrieval-augmented generation (RAG) systems struggle with efficiently processing complex visual documents, often delivering extraneous context that dilutes salient evidence. This work introduces ‘AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation’, a dynamic parsing paradigm that transforms optical character recognition (OCR) into a query-driven, on-demand extraction system. By autonomously analyzing document layout, AgenticOCR selectively recognizes regions of interest, decoupling retrieval granularity from rigid page-level chunking and improving both efficiency and accuracy. Could this agentic approach represent a crucial advancement toward truly scalable and reliable visual document understanding in RAG systems?
Beyond Static Capture: The Limitations of Traditional OCR
Traditional Optical Character Recognition (OCR) systems fundamentally approach document analysis as a conversion of static images into machine-readable text, a methodology that inherently limits their ability to extract complex information. This image-centric perspective treats the document’s layout – the relationships between text blocks, tables, and images – as mere visual noise to be discarded after character identification. Consequently, retrieving information requiring an understanding of document structure, such as locating a specific data point within a table or associating a caption with an image, becomes significantly more challenging. The system lacks inherent awareness of the document’s logical organization, forcing users to rely on cumbersome post-processing steps and manual intervention to reconstruct meaning lost during the initial static conversion. This limitation hinders the automation of workflows reliant on nuanced document understanding, restricting OCR’s effectiveness beyond simple text capture.
Traditional OCR systems, designed to interpret static document images, frequently falter when confronted with the complexities of real-world layouts. Documents rarely present text in simple, linear arrangements; instead, they incorporate multi-column formats, tables, embedded images, and varying font styles – all of which confound conventional algorithms. Consequently, the raw output from these systems often necessitates substantial post-processing – including error correction, layout reconstruction, and data normalization – before it can be reliably integrated into automated workflows. This extensive manual or programmatic refinement creates significant bottlenecks, increasing processing time, inflating costs, and limiting the scalability of document-based automation initiatives. The need for these corrective measures highlights a fundamental limitation of static OCR in handling the inherent variability and complexity of document structures.
Traditional OCR systems typically process entire documents uniformly, regardless of content relevance, which introduces significant limitations in efficiency and accuracy. This blanket approach fails to prioritize information-rich sections, forcing the system to expend resources on irrelevant areas like headers, footers, or boilerplate text. Consequently, crucial data may be obscured by errors stemming from the processing of these non-essential elements, or the overall process becomes computationally expensive and time-consuming. A system capable of dynamically focusing on pertinent document regions, such as tables, key phrases, or specific data fields, would dramatically reduce processing time, minimize errors, and unlock the true potential of automated document understanding by intelligently allocating resources where they matter most.

AgenticOCR: A Framework for Dynamic Document Intelligence
AgenticOCR departs from traditional Optical Character Recognition (OCR) by implementing an iterative processing loop driven by user queries. Instead of processing an entire document at once, the system dynamically identifies and focuses on specific regions relevant to the information requested. This is achieved through a feedback mechanism where initial query results inform the selection of subsequent document areas for analysis. The process continues until the query is satisfied or a predetermined iteration limit is reached, allowing for targeted information extraction and reducing computational cost by avoiding unnecessary processing of irrelevant document content. This approach contrasts with conventional OCR which typically performs a uniform analysis of the entire input document, regardless of the user’s specific needs.
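The loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual API: the function names (`relevance`, `recognize`, `answers`) stand in for the system's region scorer, OCR step, and query-satisfaction check.

```python
# Illustrative sketch of a query-driven parsing loop, assuming three
# pluggable components: a region scorer, a recognizer, and a stop check.
def agentic_parse(query, regions, relevance, recognize, answers, max_iters=5):
    """regions: candidate layout regions; relevance(query, region) -> score;
    recognize(region) -> text; answers(query, evidence) -> bool."""
    evidence = []
    candidates = list(regions)
    for _ in range(max_iters):
        if not candidates:
            break
        # Focus on the single most promising unprocessed region.
        best = max(candidates, key=lambda r: relevance(query, r))
        candidates.remove(best)
        evidence.append(recognize(best))
        if answers(query, evidence):  # feedback: stop once the query is satisfied
            break
    return evidence
```

With a toy keyword-overlap scorer, a query about revenue would cause only the revenue table to be recognized, leaving headers and footers untouched; that selective processing is the source of the efficiency gain.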
AgenticOCR utilizes Vision-Language Models (VLMs) to concurrently process visual and textual document elements, facilitating precise information retrieval. These models are trained to establish correlations between image regions and corresponding text, allowing the system to interpret document layout as semantically meaningful. This dual-modal understanding enables targeted extraction; rather than processing an entire document, the VLM identifies and focuses on regions likely to contain answers to specific queries. The models leverage attention mechanisms to weigh the importance of different visual and textual features, improving the accuracy of information extraction from complex documents with varying formats and structures.
AgenticOCR incorporates Document Layout Analysis (DLA) to enhance OCR performance on documents with non-sequential reading orders and complex structures. DLA identifies elements such as headings, paragraphs, tables, and lists, and their spatial relationships, allowing the system to process document regions in a logical sequence rather than a raster scan. This capability is crucial for accurately extracting information from forms, invoices, and other documents where data is not linearly arranged. By understanding the document’s structure, AgenticOCR minimizes errors associated with incorrect data association and reduces the need for post-processing correction, leading to improved accuracy and processing efficiency compared to traditional OCR methods.
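To make the reading-order idea concrete, here is a minimal sketch (an assumption for illustration, not the paper's DLA implementation) that orders detected regions column-by-column instead of by raster scan. Each region is a `(x, y, label)` tuple from a hypothetical layout detector.

```python
# Toy reading-order recovery: group detected regions into columns by their
# left edge, then read each column top-to-bottom, columns left-to-right.
def reading_order(regions, column_gap=50):
    """regions: list of (x, y, label); returns them in reading order."""
    columns = []
    for reg in sorted(regions, key=lambda r: r[0]):  # sweep left to right
        for col in columns:
            # Same column if left edges are within `column_gap` pixels.
            if abs(col[0][0] - reg[0]) < column_gap:
                col.append(reg)
                break
        else:
            columns.append([reg])
    ordered = []
    for col in columns:  # columns left-to-right, regions top-to-bottom
        ordered.extend(sorted(col, key=lambda r: r[1]))
    return ordered
```

A raster scan of a two-column page would interleave the columns; this grouping keeps each column's paragraphs together, which is the kind of error the paper attributes to layout-unaware OCR.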
AgenticOCR incorporates an Image Zoom Tool to facilitate granular examination of document areas identified as relevant to a given query. This tool allows the system to dynamically increase magnification on specific regions of interest, enabling more precise character recognition and improved accuracy when dealing with low-resolution images or complex typography. The zoom functionality is integrated with the Vision-Language Model, providing contextual awareness during the detailed analysis and enhancing the extraction of information from targeted document segments. This focused magnification improves the system’s ability to resolve ambiguities and accurately interpret content within those specific regions.
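A zoom tool of this kind reduces to crop-then-upsample. The sketch below is a simplified stand-in (pure Python, nearest-neighbor scaling over a 2-D pixel list); a real system would use an imaging library such as Pillow.

```python
# Minimal crop-and-magnify sketch: extract a box from a 2-D pixel grid and
# enlarge it by repeating each pixel `factor` times in both dimensions.
def zoom(image, box, factor=2):
    """image: 2-D list of pixels; box: (left, top, right, bottom)."""
    left, top, right, bottom = box
    crop = [row[left:right] for row in image[top:bottom]]
    zoomed = []
    for row in crop:
        # Nearest-neighbor upsampling along x...
        wide = [px for px in row for _ in range(factor)]
        # ...and along y.
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed
```

The recognizer then runs on the magnified crop rather than the full page, which is why small or low-resolution text in the region of interest becomes easier to resolve.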

Refining Intelligence: Optimization Through Policy and Distillation
AgenticOCR employs Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, to iteratively improve its query selection process. GRPO operates by treating multiple query selection policies as a group, allowing for more stable and efficient learning compared to training individual policies in isolation. This is achieved through a shared regularization term in the policy loss function, encouraging similar behavior among policies within the group. The algorithm optimizes the query selection policy based on rewards received from successfully extracting relevant information, thereby refining the agent’s ability to formulate effective queries for document analysis and information retrieval.
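The core of GRPO is scoring each sampled trajectory relative to its group rather than against a learned value baseline. The sketch below shows that group-relative advantage computation in its commonly used simplified form; the paper's exact loss may differ, and these advantages would then weight trajectory log-probabilities in a clipped, PPO-style objective.

```python
# Group-relative advantages: normalize each trajectory's reward against the
# mean and standard deviation of its sampling group, so updates favor
# trajectories that beat the group average.
def group_relative_advantages(rewards, eps=1e-8):
    """rewards: per-trajectory rewards for one group (same query/document)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, no separate critic network is needed, which is what makes the scheme comparatively stable and cheap to train.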
Rejection Sampling and trajectory distillation are employed to improve the quality of information extracted by AgenticOCR. Rejection Sampling functions as a filtering process, discarding lower-quality query trajectories to focus on those most likely to yield accurate results. Trajectory distillation then refines these selected trajectories by training a smaller model to mimic the behavior of the larger, more capable Gemini-3-Pro-Preview. This distillation process creates a condensed representation of effective query strategies, allowing the system to maintain high fidelity in information retrieval while improving computational efficiency and generalization performance.
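The filtering step can be illustrated as follows. This is an assumed sketch, not the paper's pipeline: multiple trajectories are sampled per question, and only those whose final answer scores well against a reference are kept as supervised training pairs for the student model.

```python
# Rejection sampling over candidate trajectories: keep only trajectories whose
# final answer clears a score threshold, building a distillation dataset.
def build_distillation_set(samples, reference_answers, score, threshold=0.9):
    """samples: {question: [trajectory, ...]}, each trajectory ending in an
    answer string; score(answer, reference) -> value in [0, 1]."""
    dataset = []
    for question, trajs in samples.items():
        ref = reference_answers[question]
        for traj in trajs:
            if score(traj[-1], ref) >= threshold:  # reject low-quality rollouts
                dataset.append((question, traj))
    return dataset
```

The surviving (question, trajectory) pairs form the supervised corpus on which the smaller model is fine-tuned, so only demonstrably successful query strategies are imitated.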
Gemini-3-Pro-Preview plays a critical role in trajectory distillation by serving as the model used to generate and evaluate potential query sequences, or trajectories. These trajectories, representing the agent’s interaction with documents, are assessed based on their ability to retrieve relevant information. The high-fidelity output of Gemini-3-Pro-Preview allows for the selection of optimal trajectories, effectively distilling the learned behaviors of the agent into a refined query selection policy. This process ensures that the final policy leverages high-quality examples, leading to improved accuracy and efficiency in information retrieval tasks.
AgenticOCR’s iterative process of querying documents and receiving feedback enables a continuous learning cycle. Each interaction provides data used to refine the query selection policy via reinforcement learning, specifically Group Relative Policy Optimization. This allows the model to assess the effectiveness of its queries and adjust its strategy for future interactions. The system learns to prioritize queries that yield relevant information, and to discard those that do not, ultimately increasing the efficiency and accuracy of key information location and extraction from documents over time.

Validation and Impact: Benchmarking Against Real-World Complexity
AgenticOCR’s robust performance stems from stringent evaluation against demanding datasets designed to mimic real-world document complexity. MMLongBench-Doc, a particularly challenging benchmark, tests the system’s ability to process lengthy, visually intricate documents, while FinRAGBench-V focuses on financial document understanding and question answering. These benchmarks aren’t simply about recognizing text; they assess the system’s capacity for nuanced understanding, requiring precise Key Information Extraction and accurate Element-Level Evidence Citation. By subjecting AgenticOCR to these rigorous trials, researchers ensure the system’s reliability and demonstrate its superiority over conventional methods in handling the complexities of real-world document processing tasks.
AgenticOCR demonstrates a marked advancement in document understanding through its proficiency in extracting key information and pinpointing the precise elements supporting its conclusions. Rigorous testing on the challenging MMLongBench-Doc dataset reveals an accuracy of 66.4%, a figure that notably exceeds the performance of human experts, who achieved a baseline score of 65.8% on the same benchmark. This capability signifies a shift towards more reliable and transparent document processing, as the system not only identifies crucial data but also provides verifiable evidence for its interpretations, offering a level of accountability often absent in traditional methods and paving the way for enhanced trust in automated document analysis.
AgenticOCR demonstrates a significant advancement in document understanding through its performance on the FinRAGBench-V dataset, achieving an accuracy of 78.6%. This result surpasses the capabilities of competing agentic frameworks, highlighting the system’s robust ability to process and interpret complex financial documents. The benchmark focuses on tasks demanding intricate reasoning and information retrieval from visually rich sources, and AgenticOCR’s superior score indicates a marked improvement in handling these challenges. This achievement underscores the effectiveness of the system’s architecture, particularly its capacity to synthesize information and provide accurate answers to questions posed about financial data – a crucial capability for applications ranging from automated compliance checks to intelligent financial analysis.
AgenticOCR demonstrates a remarkable capacity to locate relevant information within complex documents, as evidenced by its high recall rates on challenging benchmark datasets. The system achieves a combined recall of 68.8% across both text and layout-based tasks on MMLongBench-Doc, indicating its proficiency in identifying crucial details regardless of their presentation. Further analysis reveals exceptional performance in pinpointing entire pages containing relevant information, with page-level recall reaching 93.5% on MMLongBench-Doc and an even more impressive 95.3% on FinRAGBench-V; these figures highlight the system’s ability to effectively navigate and extract information from visually rich documents, surpassing the limitations of approaches that focus solely on textual content.
AgenticOCR establishes new benchmarks in complex question answering through a synergistic combination of Retrieval-Augmented Generation (RAG) and advanced visual document understanding. This approach moves beyond simple text extraction by enabling the system to not only ‘see’ the document’s layout but also to intelligently retrieve relevant information based on the question asked. By grounding its responses in visually-confirmed evidence within the document, AgenticOCR minimizes hallucination and maximizes accuracy – a capability demonstrated by its performance on challenging datasets. The integration of RAG allows the model to dynamically access and incorporate contextual information, effectively simulating a reasoned thought process when interpreting complex documents and formulating answers, resulting in state-of-the-art results and pushing the boundaries of document AI.

AgenticOCR embodies a pursuit of elegance in visual document understanding. The framework moves beyond exhaustive optical character recognition, instead focusing on parsing only the necessary information as dictated by the query – a demonstration of harmonious form and function. This approach mirrors a deeply held belief: that true intelligence isn’t about processing everything, but about discerning what matters. As Yann LeCun aptly stated, “Simplicity is a sign of profundity.” AgenticOCR, by streamlining the process and prioritizing relevant data, exemplifies this principle, achieving efficient retrieval-augmented generation through focused, intelligent design. The result is a system where beauty in code emerges through simplicity and clarity, a symphony of focused processing.
The Road Ahead
The elegance of AgenticOCR lies not merely in its performance gains, but in the subtle re-framing of optical character recognition itself. For too long, the field has treated document parsing as a brute-force exercise: extract everything, then ask questions. This work suggests a more harmonious approach, a dialogue between query and document. Yet, the true test will be extending this principle beyond retrieval-augmented generation. Can this agentic framework be generalized to other forms of visual reasoning, where the very act of ‘seeing’ is shaped by intent?
Current limitations hint at deeper, unresolved problems. The reliance on pre-trained language models, while pragmatic, introduces an inherent opacity. A truly robust system will require a more transparent understanding of why certain parsing choices are made, not just that they are effective. Consistency in these choices, a form of empathy for future analytical steps, remains a significant challenge.
The pursuit of truly intelligent document understanding isn’t about achieving perfect character recognition; it’s about building systems that can gracefully handle ambiguity, prioritize information based on context, and, ultimately, discern signal from noise. It’s a quiet architecture, one where the scaffolding disappears when the structure stands firm.
Original article: https://arxiv.org/pdf/2602.24134.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-02 17:38