Author: Denis Avetisyan
A new approach combines the reasoning abilities of large language models with statistical anomaly detection to uncover hidden patterns in accounting data.
Prompt-engineered language models, integrated with Isolation Forest scores, demonstrate improved fraud detection and explainability in double-entry bookkeeping systems.
Detecting subtle fraud in financial records remains a critical challenge despite reliance on traditional Journal Entry Tests, which often generate overwhelming false positives. This paper, ‘AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping’, investigates the potential of large language models to serve as more effective anomaly detectors in double-entry bookkeeping systems. Our results demonstrate that prompt-engineered LLMs, particularly when combined with Isolation Forest scores, consistently outperform both rule-based methods and classical machine learning baselines while providing interpretable explanations. Could this represent a significant step towards AI-augmented auditing, strengthening financial integrity through human-AI collaboration?
The Inevitable Rise of Anomalies in Financial Records
The foundation of contemporary financial record-keeping rests upon the principles of Double-Entry Bookkeeping, a system where every financial transaction impacts at least two accounts, ensuring the accounting equation – assets equaling liabilities plus equity – remains in balance. While historically managed through manual processes, the sheer volume of transactions in modern commerce generates datasets of unprecedented scale and complexity. Each sale, purchase, payroll action, and investment contributes to this growing digital record, demanding increasingly sophisticated methods for storage, processing, and analysis. This proliferation of data, while offering opportunities for deeper financial insights, simultaneously presents significant challenges in maintaining data integrity and detecting irregularities, necessitating a shift towards automated analytical techniques to effectively manage and interpret these vast financial landscapes.
The bedrock of any functional tax system rests upon the uncompromised accuracy of ledger data, yet maintaining this integrity is becoming increasingly challenging. Historically, accountants have relied on manual review to identify unusual transactions or discrepancies indicative of error or fraud; however, the sheer volume of financial data generated by modern economies renders this approach unsustainable. As transaction rates escalate and data complexity grows, manual efforts become not only time-consuming and costly but also significantly prone to oversight. Human reviewers, susceptible to fatigue and cognitive biases, can easily miss subtle anomalies that, when aggregated, may signal substantial financial irregularities, highlighting the urgent need for automated anomaly detection systems capable of processing and analyzing vast datasets with greater speed and precision.
The escalating sophistication of financial fraud presents a significant challenge to conventional accounting practices. Historically, rule-based systems – designed to flag transactions violating pre-defined criteria – have formed the cornerstone of anomaly detection. However, these systems are proving increasingly inadequate as perpetrators employ ever more nuanced and adaptive techniques to circumvent detection. Fraudsters quickly learn to operate just outside these rigid parameters, rendering the rules obsolete and creating a constant cycle of updates and adjustments. This limitation highlights the critical need for automated solutions leveraging advanced analytical techniques, such as machine learning, capable of identifying subtle patterns and deviations indicative of fraudulent activity – patterns that would remain hidden to static, rule-based approaches and offer a more proactive and resilient defense against financial crime.
Unsupervised Learning: A Necessary Progression in Data Integrity
Anomaly detection within financial datasets utilizes algorithms to identify transactions that deviate significantly from established patterns. These systems analyze numerous variables – including transaction amount, frequency, location, and time – to establish a baseline of ‘normal’ activity. Suspicious transactions are flagged when they fall outside statistically defined boundaries or exhibit unusual combinations of characteristics. The scale of modern financial data – often involving millions of daily transactions – necessitates automated anomaly detection systems, as manual review is impractical. Identified anomalies are then subject to further investigation by fraud analysts to confirm legitimacy and prevent financial loss.
The utility of unsupervised learning in fraud detection stems from its ability to operate effectively without the need for pre-labeled datasets. Traditional supervised machine learning algorithms require extensive, accurately labeled data – a significant obstacle in fraud scenarios where fraudulent transactions represent a small minority of overall activity and labeling is time-consuming and expensive. Obtaining sufficient labeled fraud cases is often impractical due to the rarity of such events and the evolving nature of fraudulent tactics. Unsupervised methods, such as clustering and anomaly detection, circumvent this limitation by identifying patterns and outliers directly from the raw, unlabeled transaction data, thereby enabling proactive fraud detection even with limited prior knowledge of fraudulent behavior.
Unsupervised learning algorithms establish a baseline of expected data behavior by analyzing inherent relationships and distributions within the dataset. These algorithms, such as clustering and dimensionality reduction techniques, identify common patterns and create a representative model of “normal” activity. When new data points are introduced, the algorithm calculates their deviation from this established model, often using metrics like distance or density. Data points exhibiting significant deviations – falling outside established thresholds or clusters – are flagged as anomalies requiring manual review. This process allows for the identification of potentially fraudulent or erroneous transactions without requiring pre-defined labels for suspicious activity, making it suitable for datasets where anomalies are rare or previously unknown.
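The deviation-from-baseline idea above can be sketched with a minimal example: fit the mean and covariance of unlabeled data, then flag points whose Mahalanobis distance exceeds a threshold. The data, feature meanings, and the threshold of 5.0 are illustrative assumptions, not values from the paper.

```python
# Minimal "deviation from a learned baseline" sketch: flag points whose
# Mahalanobis distance from the bulk of the (unlabeled) data is large.
import numpy as np

rng = np.random.default_rng(1)

# Two illustrative features, e.g. transaction amount and hour of posting.
X = rng.normal(loc=[100.0, 12.0], scale=[10.0, 2.0], size=(400, 2))
X = np.vstack([X, [[250.0, 3.0]]])  # one injected outlier at index 400

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance of every point from the fitted baseline.
d = np.sqrt(np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu))

flagged = np.where(d > 5.0)[0]  # threshold chosen for illustration only
print(flagged)
```

No labels are used anywhere: the "normal" model is estimated from the raw data itself, and the injected outlier is recovered purely by its distance from that baseline.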
Isolation Forest: An Efficient Algorithm for Anomaly Pinpointing
The Isolation Forest algorithm is an unsupervised machine learning technique used for anomaly detection. It operates on the principle that anomalies are ‘few and different’ and can be isolated more easily than normal instances. This is achieved by randomly partitioning the data space using a tree-based approach. During partitioning, instances are recursively split based on randomly selected features and values. Anomalies, requiring fewer partitions to be isolated, exhibit shorter average path lengths in the tree structure. The algorithm quantifies anomaly scores based on these path lengths; lower scores indicate a higher probability of being an anomaly. This method’s efficiency stems from its ability to avoid explicitly modeling the normal data points, focusing instead on identifying instances that are easily separable.
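A minimal scikit-learn sketch of the mechanism described above; the synthetic two-feature data is an assumption for illustration, not the paper's dataset.

```python
# Isolation Forest sketch: anomalies are isolated with fewer random splits,
# so they receive lower scores from score_samples (lower = more anomalous).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# "Normal" entries cluster together; two extreme outliers are appended.
normal = rng.normal(loc=100.0, scale=10.0, size=(500, 2))
outliers = np.array([[300.0, 5.0], [-50.0, 400.0]])
X = np.vstack([normal, outliers])  # outliers land at indices 500 and 501

model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(X)

scores = model.score_samples(X)  # higher = more normal, lower = more anomalous
flags = model.predict(X)         # -1 = anomaly, 1 = normal

# The two injected outliers should receive the lowest scores.
print(np.argsort(scores)[:2])
```

Note that the forest never models the normal class explicitly: the score is derived entirely from the average depth at which each point is isolated across the random trees.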
Synthetic General Ledger Data was utilized as the initial validation dataset for the Isolation Forest algorithm to facilitate controlled experimentation and performance benchmarking. This synthetic data allowed for the creation of a known anomaly landscape, enabling precise evaluation of the algorithm’s ability to correctly identify outliers. The controlled nature of the dataset also permitted systematic adjustment of algorithm parameters and direct comparison against established anomaly detection techniques, such as traditional JETs, providing a quantifiable baseline for performance assessment and optimization prior to application on real-world financial data.
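The paper's synthetic ledger is not public, but the idea of a controlled, known anomaly landscape can be sketched as follows. The column choices (amount, hour of posting, account id), the distributions, and the anomaly pattern (large, off-hours postings) are all hypothetical assumptions for illustration.

```python
# Hypothetical synthetic general-ledger generator: normal entries follow
# known distributions; a small fraction of labeled anomalies is injected,
# giving a ground truth against which detectors can be benchmarked.
import numpy as np

rng = np.random.default_rng(7)

def make_ledger(n_entries=1000, anomaly_rate=0.02):
    """Return (features, labels); label 1 marks an injected anomaly."""
    n_anom = int(n_entries * anomaly_rate)
    n_norm = n_entries - n_anom

    # Normal postings: moderate amounts, during business hours.
    amounts = rng.lognormal(mean=5.0, sigma=0.5, size=n_norm)
    hours = rng.integers(8, 18, size=n_norm)
    accounts = rng.integers(0, 20, size=n_norm)

    # Anomalies: much larger amounts, posted at night.
    anom_amounts = rng.lognormal(mean=9.0, sigma=0.3, size=n_anom)
    anom_hours = rng.integers(0, 5, size=n_anom)
    anom_accounts = rng.integers(0, 20, size=n_anom)

    X = np.column_stack([
        np.concatenate([amounts, anom_amounts]),
        np.concatenate([hours, anom_hours]),
        np.concatenate([accounts, anom_accounts]),
    ])
    y = np.concatenate([np.zeros(n_norm), np.ones(n_anom)])
    return X, y

X, y = make_ledger()
print(X.shape, int(y.sum()))
```

Because the labels are known by construction, precision, recall, and false-positive counts can be computed exactly for any detector run on this data.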
The application of Isolation Forest in conjunction with a prompt-engineered Large Language Model (LLM) yielded a peak F1 score of 0.94 when tested against a Synthetic General Ledger dataset. This represents a substantial improvement over baseline methods: the combined approach generated only 12 false positives, whereas the Isolation Forest algorithm operating independently produced 169, and traditional Journal Entry Tests (JETs) resulted in a significantly higher 942. These results demonstrate the LLM's capacity to refine anomaly detection and minimize inaccurate classifications.
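The combination can be sketched as a simple pipeline: compute an Isolation Forest score per entry, then embed it in a prompt for an LLM to adjudicate. The prompt wording and the `call_llm` stub below are hypothetical assumptions, not the paper's actual prompts or API.

```python
# Hedged sketch of the score-plus-LLM pipeline. The LLM call is a stub;
# in practice it would be a chat-completion request to a hosted model.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(loc=100.0, scale=10.0, size=(200, 1))
X[0] = 900.0  # injected anomaly at index 0

model = IsolationForest(random_state=0).fit(X)
scores = model.score_samples(X)  # lower = more anomalous

def build_prompt(entry, score):
    """Embed the statistical score in the prompt so the LLM can weigh it."""
    return (
        "You are an audit assistant. Given the journal entry below and its "
        f"Isolation Forest anomaly score ({score:.3f}; lower = more unusual), "
        "answer ANOMALY or NORMAL and explain briefly.\n"
        f"Entry amount: {entry[0]:.2f}"
    )

def call_llm(prompt):
    # Hypothetical stub; replace with a real LLM API call.
    raise NotImplementedError

prompt = build_prompt(X[0], scores[0])
print(prompt.splitlines()[0])
```

Handing the LLM a calibrated statistical score, rather than raw numbers alone, is what lets it veto the statistically borderline cases that would otherwise become false positives.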
SHAP Values: Illuminating the Logic Behind Anomaly Detections
The inherent complexity of the Isolation Forest algorithm, while effective at identifying anomalies, often leaves the reasoning behind those detections opaque. To address this, researchers employed SHAP (SHapley Additive exPlanations) values, a method from game theory, to dissect the algorithm’s decision-making process. SHAP values quantify the contribution of each individual feature – such as transaction amount, time interval, or recipient details – to the overall anomaly score assigned to a given transaction. By attributing a specific impact to each feature, these values transform the ‘black box’ of the Isolation Forest into a more understandable system, revealing precisely which characteristics are driving the identification of potentially fraudulent or unusual activity. This feature-level explanation enhances trust in the model and allows for more targeted investigation of flagged transactions.
The Isolation Forest model, while effective at flagging unusual transactions, benefits from a detailed explanation of why a particular instance is flagged. SHAP values provide this insight by quantifying the contribution of each feature to the anomaly score; for example, a large transaction amount or an unusually high frequency of transfers can be identified as key drivers of the assessment. Similarly, the characteristics of the counterparty – perhaps a previously unseen entity or one associated with higher risk – can significantly influence the algorithm’s decision. This feature-level attribution allows for a nuanced understanding of the model’s reasoning, moving beyond a simple binary ‘anomaly/not anomaly’ classification and instead revealing the specific aspects of each transaction that triggered the alert.
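The game-theoretic attribution behind SHAP can be demonstrated from scratch for a small feature set. The sketch below computes exact Shapley values for an Isolation Forest's anomaly score by enumerating all feature coalitions (feasible only for a handful of features; the SHAP library uses efficient tree-specific algorithms instead). The data and baseline choice are illustrative assumptions.

```python
# Exact Shapley attribution for an Isolation Forest score, by brute-force
# coalition enumeration: a pedagogical stand-in for the SHAP library.
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[0] = [8.0, 0.0, 0.0]  # one outlier, driven entirely by feature 0

model = IsolationForest(random_state=0).fit(X)
baseline = X.mean(axis=0)  # "absent" features are replaced by the mean

def value(subset, x):
    """Model score with features outside `subset` set to the baseline."""
    z = baseline.copy()
    z[list(subset)] = x[list(subset)]
    return model.score_samples(z.reshape(1, -1))[0]

def shapley(x, d=3):
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):
            for S in combinations(others, r):
                # Shapley weight |S|! (d-|S|-1)! / d! for coalition S.
                w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[j] += w * (value(S + (j,), x) - value(S, x))
    return phi

phi = shapley(X[0])
print(phi)  # feature 0 should carry most of the attribution
```

By construction the attributions satisfy the efficiency property: they sum exactly to the difference between the outlier's score and the baseline's score, which is what makes a SHAP-style decomposition auditable rather than merely suggestive.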
The integration of SHAP values with the Isolation Forest model delivers not only accurate anomaly detection, but also a crucial layer of interpretability for those tasked with investigating flagged transactions. Auditors and investigators can move beyond simply knowing an anomaly exists, and instead rapidly discern why the algorithm determined a particular transaction to be unusual – whether due to an unexpectedly high amount, an unusual transaction frequency, or the involvement of a specific counterparty. This streamlined understanding accelerates the investigative process and supports more informed decision-making, as demonstrated by the approach’s strong performance on a synthetic dataset, achieving a precision of 0.90 and a recall of 0.98.
The pursuit of reliable anomaly detection, as demonstrated in this work with AuditCopilot, echoes a fundamental tenet of computational correctness. Ken Thompson famously stated, “Software is only ever approximately correct.” This sentiment highlights the inherent difficulty of achieving absolute certainty, even with rigorous testing. AuditCopilot attempts to move beyond simply identifying outliers, a statistical approximation, by leveraging LLMs to provide interpretable explanations, essentially building a case for why an entry is anomalous. This focus on justification isn’t merely about usability; it’s about approaching a more provable understanding of financial irregularities, aligning with a mathematical pursuit of truth within the complex domain of double-entry bookkeeping and fraud detection.
Future Directions
The demonstrated confluence of Large Language Models and statistical anomaly detection, while promising, merely scratches the surface of a fundamentally unsolved problem. The current reliance on prompt engineering, a process bordering on alchemy, is inherently brittle. A truly elegant solution will not require iteratively coaxing a model to approximate logical reasoning; it will be logical reasoning, expressed in a form directly amenable to formal verification. The current approach, for all its empirical success, remains a black box, trading interpretability for performance: a Faustian bargain the field must ultimately reject.
Further refinement necessitates a move beyond superficial anomaly scores. The Isolation Forest, while computationally efficient, offers limited insight into the why of an anomaly. Future work should explore methods for generating formal proofs of inconsistency, leveraging the LLM’s linguistic capabilities to translate financial transactions into logical statements. This pursuit demands a rigorous mathematical foundation, replacing heuristic prompt design with provably correct algorithms.
Ultimately, the goal is not simply to flag suspicious entries, but to construct an auditable, self-verifying financial system. Every byte of computational effort must contribute to a reduction in ambiguity, eliminating the potential for both fraud and error. The current landscape, replete with opaque models and hand-crafted prompts, represents a significant departure from this ideal, and a clear indication of work yet to be done.
Original article: https://arxiv.org/pdf/2512.02726.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-12-04 05:11