Spot the Fake: The Human-AI Divide in Document Forensics

Author: Denis Avetisyan


New research reveals a surprising disconnect between human intuition and automated systems when it comes to identifying AI-generated documents like receipts.

Though visually convincing, down to realistic fonts and paper textures, receipts generated by a two-stage <span class="katex-eq" data-katex-display="false">GPT-4o</span> pipeline consistently exhibit subtle arithmetic errors undetectable through casual inspection, highlighting the persistent gap between superficial realism and functional correctness in generative models.

A comprehensive dataset and human study demonstrate that arithmetic verification outperforms visual inspection for detecting forgeries created by multimodal large language models.

Despite growing concerns about AI-generated document forgery, discerning synthetic from authentic content presents a surprising paradox. This is explored in ‘GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics’, which introduces a new benchmark and human study revealing that while people excel at spotting visual artifacts in AI-generated receipts, they are less effective at overall forgery detection than automated systems. This discrepancy stems from the fact that critical forensic signals – arithmetic errors – are invisible to human inspection but readily verifiable by large language models. Does this highlight a fundamental asymmetry in forensic detection, and what implications does it have for developing robust defenses against AI-driven document fraud?


The Illusion of Authenticity: Receipts in the Age of AI

The increasing sophistication of artificial intelligence now extends to the creation of remarkably convincing, yet entirely fabricated, receipts. This proliferation of synthetic documentation presents a growing threat to financial security, enabling fraudulent expense claims, insurance scams, and even money laundering schemes. Unlike traditional forgeries which often exhibit detectable physical flaws – inconsistencies in ink, paper stock, or printing quality – these AI-generated receipts are photorealistic, meticulously mimicking the appearance of legitimate transactions. The ease with which these documents can be created and disseminated – requiring minimal technical skill or specialized equipment – dramatically lowers the barrier to entry for fraudulent activity and complicates efforts to verify financial records. This new landscape demands a proactive shift in fraud detection, moving beyond visual inspection to more robust analytical methods capable of discerning authenticity at a deeper, data-driven level.

For decades, financial fraud investigations have depended on scrutinizing the physical characteristics of receipts – the paper stock, ink variations, thermal printer inconsistencies, and even the subtle imperfections introduced during the printing process. However, the advent of sophisticated artificial intelligence now allows for the creation of entirely synthetic receipts that flawlessly mimic these physical attributes. These digitally fabricated documents bypass traditional forensic techniques, rendering once-reliable methods increasingly ineffective. Because AI can generate receipts indistinguishable from authentic ones at a purely visual level, investigators are finding it difficult to detect manipulation using conventional means, demanding a shift towards analytical approaches that examine the underlying data and semantic coherence of the claimed transaction.

Addressing the escalating threat of synthetic receipts demands a shift in forensic analysis beyond traditional methods. Current techniques, focused on detecting physical imperfections in documents, are proving inadequate against increasingly sophisticated AI-generated forgeries. Consequently, research is now centered on developing detection systems that scrutinize the meaning and mathematics within receipts. These advanced techniques employ algorithms to verify the logical consistency of listed items, cross-reference pricing with known databases, and identify anomalies in numerical sequences – essentially, seeking discrepancies not in how a receipt looks, but in what it claims. This semantic and numerical analysis offers a powerful defense against fraudulent claims, as even a flawlessly rendered image can betray inconsistencies in its underlying data.
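To make the "what it claims" check concrete, here is a minimal Python sketch of the arithmetic side of such verification, assuming the receipt has already been parsed into line items and claimed totals. The function name, tolerance, and sample values are illustrative, not the paper's implementation:

```python
from decimal import Decimal

def verify_receipt(items, subtotal, tax, total, tolerance=Decimal("0.01")):
    """Check the internal arithmetic of a parsed receipt.

    items: list of (quantity, unit_price) string pairs.
    Returns a list of inconsistencies (empty if the numbers add up).
    """
    errors = []
    line_sum = sum(Decimal(q) * Decimal(p) for q, p in items)
    if abs(line_sum - subtotal) > tolerance:
        errors.append(f"line items sum to {line_sum}, receipt claims subtotal {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        errors.append(f"subtotal + tax = {subtotal + tax}, receipt claims total {total}")
    return errors

items = [("2", "3.50"), ("1", "4.25")]  # true line sum: 11.25
print(verify_receipt(items, Decimal("11.25"), Decimal("0.90"), Decimal("12.15")))  # []
print(verify_receipt(items, Decimal("11.25"), Decimal("0.90"), Decimal("12.75")))  # one inconsistency
```

Even a flawlessly rendered image fails this check if its printed numbers are internally inconsistent, which is precisely the signal invisible to casual visual inspection.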

Analysis of AI-generated receipt detection failure rates across error categories reveals that LLaMA 4 Scout consistently fails to identify such receipts, exhibiting near-zero detection accuracy, while other models show varying failure rates, with darker red shades indicating higher error rates.

Beyond Pixels: Multimodal LLMs and Forensic Accounting

Multimodal Large Language Models (LLMs) represent a novel approach to receipt authenticity verification by moving beyond solely textual analysis. Traditional forgery detection methods often focus on inconsistencies within the text of a receipt; however, multimodal LLMs incorporate both visual elements – such as the receipt’s layout, fonts, and logos – and textual data into their assessment. This integrated analysis allows the model to cross-reference information between the image and the text, identifying discrepancies that would be missed by methods examining only one modality. For example, a mismatch between the total displayed in the image of the receipt and the calculated total within the text would be flagged as a potential forgery, leveraging the model’s ability to process and correlate information from different input types.

Multimodal Large Language Models (LLMs) enhance forensic receipt analysis by employing factual consistency checks and arithmetic verification. Traditional forgery detection often relies on visual inspection or basic data matching; however, these methods are susceptible to sophisticated manipulations. LLMs, conversely, can cross-reference information within the receipt – such as verifying that subtotal, tax, and total amounts align – and externally validate details like merchant addresses or product pricing against known databases. This dual approach of internal consistency and external corroboration allows LLMs to identify discrepancies that would likely bypass conventional detection techniques, including alterations to numerical values or fabricated details.

The efficacy of multimodal Large Language Models (LLMs) in forensic receipt analysis is significantly determined by their capacity to detect nuanced manipulations, specifically numerical hallucinations – instances where the model generates incorrect or fabricated numerical data. Current performance benchmarks indicate Claude Sonnet 4 achieves a near-perfect F1 score of 0.975 in identifying these forged numerical values within receipts, demonstrating a high degree of accuracy in uncovering subtle inconsistencies that might otherwise go undetected by conventional methods. This high F1 score represents a balance between precision and recall, suggesting minimal false positives and false negatives in forgery detection.
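For context, the F1 score is the harmonic mean of precision and recall, computable directly from raw confusion counts. The counts below are illustrative, not the paper's actual figures:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp)  # share of flagged receipts that were truly forged
    recall = tp / (tp + fn)     # share of forged receipts that were caught
    return 2 * precision * recall / (precision + recall)

# e.g. 195 forgeries caught, 5 false alarms, 5 misses:
print(round(f1_score(195, 5, 5), 3))  # 0.975
```

A score near 0.975 therefore requires that both false positives and false negatives stay in the low single digits per two hundred cases, which is what makes it a demanding summary metric.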

Five multimodal large language models were evaluated on GPT4o-Receipt, demonstrating varying levels of accuracy, F1-score, and recall (<span class="katex-eq" data-katex-display="false">\uparrow</span>, higher is better) and false positive rate (<span class="katex-eq" data-katex-display="false">\downarrow</span>, lower is better).

The Devil’s in the Details: Uncovering Failure Modes

Hallucinations in generative models, manifesting as both textual inconsistencies and numerical errors, pose a significant threat to the reliability of generated receipt data. These failures are not random; the models can produce outputs that appear valid and plausible to human review, despite containing inaccuracies in item descriptions, pricing, totals, or merchant details. This is particularly concerning as these models are increasingly deployed in automated data extraction and validation pipelines, where such subtle errors can lead to incorrect financial reporting, fraudulent claims processing, or flawed data analysis. The ability of a model to generate convincing, yet false, information is therefore a critical failure mode requiring dedicated detection and mitigation strategies.

The GPT4o-Receipt dataset is a publicly available resource designed to quantitatively assess the performance of forensic techniques in identifying AI-generated receipt images. It comprises a large collection of both real and synthetically generated receipt images, created using GPT-4o, and includes detailed annotations indicating the origin of each image. This allows researchers to evaluate the robustness of detection methods against manipulations such as text and numerical hallucinations, and realistic forgeries. The dataset facilitates standardized benchmarking and comparison of different forensic approaches, providing a metric for evaluating their resilience to increasingly sophisticated AI-generated content. Performance is typically reported as recall and precision scores on this dataset.

Robust visual realism assessment in generative models benefits from advanced forensic techniques. Analyzing diffusion reconstruction error quantifies the discrepancy between a generated image and its reconstruction through a diffusion process, highlighting potential artificiality. Simultaneously, leveraging CLIP (Contrastive Language-Image Pre-training) features allows for comparison of visual and textual embeddings, identifying inconsistencies indicative of forgery. Recent evaluations utilizing these methods demonstrate high performance; specifically, the Claude Sonnet 4 model achieves a recall of 0.972 in detecting AI-generated receipts, indicating a substantial capacity to differentiate manipulated data from authentic examples.
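The reconstruction-error idea can be illustrated with a toy sketch: score an image by how closely a "reconstruction" reproduces it. Here the `smooth` lambda is a stand-in for a real diffusion round trip (noise, then denoise); the function names and synthetic data are assumptions for illustration only:

```python
import numpy as np

def reconstruction_error(image, reconstruct):
    """Mean squared error between an image and its reconstruction.

    In a real pipeline, `reconstruct` would be a round trip through a
    pretrained diffusion model; AI-generated images tend to reconstruct
    with lower error than camera captures, so a low score is suspicious.
    """
    img = image.astype(np.float64)
    return float(np.mean((img - reconstruct(img)) ** 2))

# Toy stand-ins: a noisy "camera-like" image and an overly smooth synthetic one.
rng = np.random.default_rng(0)
real_like = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
synthetic_like = np.full((64, 64), 128.0)

# Stub "reconstruction": pull every pixel halfway toward the image mean.
smooth = lambda img: (img + img.mean()) / 2.0

print(reconstruction_error(real_like, smooth) > reconstruction_error(synthetic_like, smooth))  # True
```

The gap between the two scores is the forensic signal: the overly regular synthetic image survives the round trip almost unchanged, while natural sensor noise does not.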

Arithmetic hardening minimally impacts the detection performance of Claude Sonnet 4, Gemini 2.5 Flash, and Grok 4, which retain over 94% of their baseline F1 scores, suggesting these models draw on diverse forensic signals; GPT-5 Nano and LLaMA 4 Scout, by contrast, depend more heavily on arithmetic consistency for detection.

Beyond Human Perception: Validating Authenticity in the Age of AI

Human perceptual studies remain critical in evaluating the realism of generated receipts due to the limitations of current automated detection methods. While algorithms excel at identifying broad inconsistencies, they often fail to capture subtle visual artifacts or imperfections that are readily apparent to the human eye. These artifacts can include inconsistencies in font rendering, minor distortions in graphical elements, or unnatural textures. Consequently, human annotators are employed to assess the overall visual quality and identify these nuanced discrepancies, providing a crucial validation step that complements automated forensic analysis and helps refine the training of more robust detection models.

Integrating human evaluation with automated forensic techniques, specifically frequency perception head analysis, improves the reliability of authenticity assessments. Frequency perception head analysis examines the distribution of frequencies within an image to detect inconsistencies indicative of manipulation; however, this method isn’t foolproof and can miss subtle forgeries. Human perceptual studies complement this by identifying artifacts imperceptible to algorithms, such as unnatural textures or illogical arrangements of elements. By combining the speed and scalability of automated analysis with the nuanced judgment of human evaluators, forensic assessments can achieve a higher degree of accuracy and reduce both false positives and false negatives in detecting document forgery.
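The paper does not prescribe a particular fusion rule, but one simple way to combine the two signals is a convex combination of an automated forensic score and a human judgment. The weight below is purely illustrative:

```python
def fuse_scores(auto_score, human_score, w_auto=0.7):
    """Convex combination of an automated forensic score and a human
    judgment, both expressed as forgery probabilities in [0, 1].
    The default weight is an illustrative assumption, not a tuned value."""
    return w_auto * auto_score + (1 - w_auto) * human_score

# Frequency analysis is fairly confident; the annotator is unsure.
print(round(fuse_scores(0.9, 0.5), 2))  # 0.78
```

In practice the weight would be calibrated on held-out data, and a disagreement between the two channels is itself a useful flag for manual review.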

Zero-shot forensic detection leverages multimodal Large Language Models (LLMs) to identify forgeries without requiring prior training on specific manipulation techniques, offering a significant improvement in adaptability to novel forgery methods. Recent evaluations demonstrate the efficacy of this approach; human annotators, tasked with identifying forged receipts, achieved a lower F1 score compared to the Claude Sonnet 4 LLM. This outcome indicates that Claude Sonnet 4 exhibits superior performance in forensic analysis, potentially due to its ability to process and integrate information from multiple data modalities – such as text and image – more effectively than human assessment.

Claude Sonnet 4 demonstrates the highest overall AI receipt detection performance (<span class="katex-eq" data-katex-display="false">F1 = 0.975</span>), while Gemini 2.5 Flash exhibits the best calibration with a low false positive rate of <span class="katex-eq" data-katex-display="false">0.023</span>, though LLaMA 4 Scout prioritizes minimizing false positives at the cost of recall, and human annotators fall in the middle with moderate performance.

The Arms Race Continues: Towards Truly Robust Detection

The pursuit of digital authenticity is characterized by a perpetual cycle of innovation and counter-innovation, an ongoing arms race between those who create forgeries and the developers of detection methods. As manipulation techniques become increasingly subtle and sophisticated – leveraging advancements in generative AI and image processing – forensic science must continually evolve to maintain its effectiveness. This isn’t merely a matter of incremental improvements; breakthroughs in areas like deep learning and multi-modal analysis are essential to stay ahead of increasingly realistic forgeries. Without continuous research and the development of novel techniques, the ability to reliably verify the integrity of digital evidence will be compromised, potentially undermining trust in online information and legal proceedings.

The escalating complexity of digital forgery demands a shift towards forensic methods built on inherent resilience. Current detection techniques often rely on identifying specific artifacts introduced by known manipulation tools, a strategy vulnerable to evolving adversarial tactics. Future research prioritizes the development of systems less susceptible to circumvention, focusing on fundamental inconsistencies introduced by any alteration – regardless of the method employed. This necessitates exploring techniques that analyze media at a deeper, more intrinsic level, potentially leveraging principles from signal processing, materials science, and even perceptual psychology. Such robust systems won’t simply recognize known forgeries; they will identify media that deviate from expected natural characteristics, providing a critical defense against currently unknown and future manipulation techniques.

The escalating sophistication of digital forgery demands detection systems that move beyond static analysis and embrace continuous learning. Future security hinges on the development of adaptive methods capable of identifying novel manipulation techniques as they emerge, rather than relying on pre-defined signatures. Recent advancements, such as Claude Sonnet 4, demonstrate the viability of this approach; the model impressively retains over 94% of its forgery detection accuracy even when deprived of arithmetic signal analysis. This resilience underscores the importance of leveraging multiple forensic channels – examining inconsistencies across various image characteristics – and building systems that aren’t solely dependent on any single feature, thereby ensuring continued effectiveness in a constantly evolving landscape of digital deception.

The pursuit of flawless forgery detection feels predictably optimistic. This work demonstrates a familiar pattern: human intuition, while surprisingly effective at visual assessment, is ultimately brittle against increasingly sophisticated AI. The reliance on arithmetic verification as a more robust method isn’t a victory for elegance, but a concession to reality. As Marvin Minsky observed, “Common sense is what stops us from picking up telephone poles and sticking them in our mouths.” Similarly, automated systems, however crude, provide a necessary check against the allure of visually convincing, yet fundamentally flawed, AI-generated documents. The asymmetry revealed – humans good at seeing the lie, machines better at proving it – isn’t progress, merely a shifting of the burden.

What’s Next?

The demonstrated human proficiency in spotting visually synthetic receipts is, predictably, a diminishing advantage. The illusion of authenticity, it seems, is easier to maintain against perception than against calculation. Any system built on ‘looking right’ will inevitably face adversaries who understand the subtle cues of imperfection, and who will exploit them. The real challenge isn’t building better visual detectors, but accepting that arithmetic verification, while less glamorous, offers a more robust, if brutally pragmatic, path forward. It’s a reminder that elegance often fails where simple constraints succeed.

Future work will undoubtedly explore scaling these verification techniques, and the inevitable arms race will focus on crafting receipts that almost add up. The paper correctly identifies adversarial robustness as a critical area, but it sidesteps the deeper problem: each layer of defense adds complexity, and complexity is merely deferred tech debt. Consider the cost of maintaining these systems, the inevitable edge cases, and the eventual need for another, more intricate solution. Documentation is, of course, a myth invented by managers.

The long-term trajectory suggests a move toward entirely machine-authored, machine-verified documents: a closed loop of synthetic truth. CI is, after all, the closest thing we have to a temple, and it’s a frightening thought that the future of forensic analysis might simply be ‘did it pass the tests?’ The pursuit of perfect forgery detection feels less like progress and more like rearranging the deck chairs on the Titanic.


Original article: https://arxiv.org/pdf/2603.11442.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-13 16:10