Checking the Facts: A New Approach to Financial Claim Verification

Author: Denis Avetisyan


Researchers have developed a novel framework that uses synthetically generated data to dramatically improve the efficiency and accuracy of fact-checking financial statements and claims.

FISCAL leverages synthetic data generation and parameter-efficient fine-tuning to achieve competitive accuracy with significantly smaller models.

Despite the increasing demand for reliable financial applications of large language models, current systems often struggle with factual accuracy and computational cost. This work introduces FISCAL (Financial Synthetic Claim-document Augmented Learning), a novel framework for generating domain-specific synthetic data to address these limitations. By training a compact 7B parameter model, MiniCheck-FISCAL, on this synthetic data, we demonstrate performance that rivals much larger systems, and even surpasses state-of-the-art models such as Gemini 1.5 Flash, on key financial fact-checking benchmarks. Could this approach unlock a new era of efficient, trustworthy, and scalable AI for financial intelligence?


The Challenge of Discerning Truth in Financial Reporting

Traditional fact-checking systems frequently falter when applied to financial reporting due to the intricate nature of economic data and market analysis. These systems often depend on identifying keywords or matching phrases against databases, a technique inadequate for discerning truth in claims that hinge on complex calculations, projections, or contextual interpretations. A statement regarding a company’s profitability, for example, isn’t simply true or false based on the presence of “profit” – it requires verification against detailed financial statements, understanding of accounting principles, and consideration of relevant market conditions. This reliance on superficial analysis renders existing methods vulnerable to manipulation through carefully worded statements or the presentation of selective data, ultimately hindering their effectiveness in combating the spread of financial misinformation and potentially misleading investors.

The escalating spread of deliberately misleading financial information necessitates a shift towards verification methods that move beyond simple keyword detection. Increasingly, misinformation leverages complex narratives, nuanced language, and sophisticated data presentation to appear legitimate, often exploiting the inherent complexities of financial markets and instruments. Existing fact-checking tools, designed for more straightforward claims, struggle to dissect these intricate falsehoods, requiring techniques capable of understanding context, identifying logical fallacies within financial reasoning, and verifying claims against multiple data sources. This demands a move towards systems that can not only assess the factual accuracy of statements but also evaluate the validity of the underlying arguments and the credibility of the information’s origins, effectively combating the growing threat of financially motivated deception.

Existing financial fact-checking systems demonstrate a concerning lack of adaptability. These approaches, frequently built upon pattern recognition and specific keyword identification, struggle when presented with even slight variations in how a financial claim is worded or structured within a document. A system trained to verify claims in a standardized annual report, for instance, often fails when encountering the same information presented in a news article, social media post, or even a differently formatted report. This brittleness stems from an inability to understand the underlying meaning of a claim independent of its surface presentation, hindering their capacity to generalize across diverse sources and claim formulations. Consequently, even minor alterations in phrasing or document layout can lead to verification failures, exposing a critical vulnerability in the face of increasingly sophisticated financial misinformation campaigns.

FISCAL: A Framework for Controlled Data Generation

The FISCAL framework utilizes a ‘Modular Claim-Document Generator’ to produce synthetic datasets consisting of claim-document pairs. This generator facilitates controlled experimentation by allowing researchers to systematically vary characteristics of the generated data, such as claim validity or document complexity. The creation of these synthetic datasets addresses the limitations of relying solely on real-world data, which can be scarce, biased, or lack necessary annotations for training and evaluating natural language processing models. By providing a scalable and customizable data source, the generator supports comprehensive model training across a range of tasks, including fact verification, evidence retrieval, and misinformation detection.
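A minimal sketch, in Python, of how such a modular pipeline could be composed; the `ClaimDocPair` type and the module interface here are illustrative assumptions, not the paper’s actual code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClaimDocPair:
    claim: str
    document: str
    label: str  # e.g. "supported", "conflicting", "distorted"

# A module is any transformation over a claim-document pair.
Module = Callable[[ClaimDocPair], ClaimDocPair]

def generate_dataset(seed_pairs: list[ClaimDocPair],
                     modules: list[Module]) -> list[ClaimDocPair]:
    """Apply each configured module to every seed pair,
    accumulating controlled variations of the original data."""
    out: list[ClaimDocPair] = []
    for pair in seed_pairs:
        out.append(pair)              # keep the clean original
        for module in modules:
            out.append(module(pair))  # add one perturbed variant
    return out
```

Because each module shares the same interface, researchers can toggle individual perturbations on or off to isolate their effect on downstream model performance.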

The FISCAL framework’s modular data generator utilizes specialized components to create diverse and challenging claim-document pairs. The Claim Paraphraser Module introduces lexical and syntactic variation in claims without altering their core meaning, increasing the robustness of evaluation. The Conflict Insertion Module deliberately introduces contradictory statements within or between claims and supporting documents, simulating scenarios requiring reasoning and evidence reconciliation. Finally, the Summarization Module generates condensed versions of source documents, testing a model’s ability to extract relevant information from incomplete or abstracted evidence.
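Continuing the sketch above, the conflict-insertion step might look like the following; the dispute heuristic is a hypothetical stand-in for the paper’s actual logic:

```python
import re

def conflict_insertion(pair: ClaimDocPair) -> ClaimDocPair:
    """Append a sentence that disputes the first numeric figure in the
    document, forcing a verifier to reconcile conflicting evidence."""
    match = re.search(r"\$?\d[\d,]*(?:\.\d+)?%?", pair.document)
    if match is None:
        return pair  # nothing to contradict; return unchanged
    contradiction = (f" However, a subsequent filing disputes the "
                     f"reported figure of {match.group(0)}.")
    return ClaimDocPair(claim=pair.claim,
                        document=pair.document + contradiction,
                        label="conflicting")
```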

FISCAL’s ability to simulate misinformation relies on modules that systematically alter factual content within generated documents. The ‘Fact Exclusion Module’ removes specific factual statements, creating instances where information is absent, while the ‘Fact Value Distortion Module’ modifies the values associated with existing facts – for example, changing quantities, dates, or locations. These manipulations allow FISCAL to generate synthetic claim-document pairs exhibiting various forms of misinformation, including omissions and inaccuracies, enabling the evaluation of model robustness against such scenarios. The degree and type of distortion are controllable parameters within these modules, facilitating the creation of datasets with specific characteristics of misinformation.
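In the same illustrative style, a fact-value distortion module could rescale numeric values by a controllable factor; the `max_relative_error` parameter is an assumption for demonstration, not a setting reported in the paper:

```python
import random
import re

def fact_value_distortion(pair: ClaimDocPair,
                          max_relative_error: float = 0.3) -> ClaimDocPair:
    """Rescale each number in the document by up to ±max_relative_error,
    producing a subtly inaccurate variant of the evidence. (Coarse: a
    real implementation would avoid distorting dates and identifiers.)"""
    def distort(match: re.Match) -> str:
        value = float(match.group(0).replace(",", ""))
        factor = 1.0 + random.uniform(-max_relative_error, max_relative_error)
        return f"{value * factor:.2f}"
    distorted = re.sub(r"\d[\d,]*(?:\.\d+)?", distort, pair.document)
    return ClaimDocPair(pair.claim, distorted, label="distorted")
```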

MiniCheck-FISCAL: A Fine-Tuned Language Model for Financial Fact-Checking

MiniCheck-FISCAL is a 7 billion parameter language model specifically fine-tuned for financial fact-checking. Training utilizes the ‘FISCAL-Data’ dataset, a collection of financial claims and supporting evidence. To efficiently adapt the model, Low-Rank Adaptation (LoRA) is employed. LoRA freezes the pre-trained model weights and injects trainable low-rank matrices, significantly reducing the number of trainable parameters and computational cost compared to full fine-tuning. This allows for effective adaptation with limited resources while maintaining performance.
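A hedged sketch of what such a LoRA setup might look like with the Hugging Face peft library; the rank, alpha, and target modules below are common defaults rather than the hyperparameters used for MiniCheck-FISCAL, and the checkpoint name is a placeholder:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id; the paper starts from MiniCheck-7B, whose exact
# checkpoint name on the Hub may differ.
model = AutoModelForCausalLM.from_pretrained("MiniCheck-7B")

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 7B total
```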

MiniCheck-FISCAL is trained using Causal Language Modeling (CLM), a technique where the model predicts the next token in a sequence. This approach is applied to claim-evidence pairs, formatted as text sequences where the claim is presented followed by the supporting evidence. By predicting subsequent tokens, the model learns to assess the relationship between a claim and its evidence, discerning whether the evidence supports, contradicts, or is neutral towards the claim. This allows MiniCheck-FISCAL to move beyond simple keyword matching and develop a more nuanced understanding of factual relationships, crucial for accurate financial fact-checking.
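One plausible serialization of claim-evidence pairs into CLM training sequences; the prompt template and verdict vocabulary are assumptions for illustration:

```python
def format_example(claim: str, evidence: str, label: str) -> str:
    """Serialize a claim-evidence pair into one training sequence.
    Since the model predicts every next token, it learns to emit the
    verdict conditioned on both the evidence and the claim."""
    return (f"Evidence: {evidence}\n"
            f"Claim: {claim}\n"
            f"Verdict: {label}")

print(format_example(
    claim="Q3 revenue grew 12% year over year.",
    evidence="The company reported Q3 revenue of $4.2B, up 12% YoY.",
    label="supported",
))
```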

MiniCheck-FISCAL leverages the pre-trained weights of the MiniCheck-7B model as a starting point, demonstrating competitive accuracy on financial fact-checking benchmarks despite its 7 billion parameter size. Empirical results indicate that MiniCheck-FISCAL achieves performance comparable to significantly larger models; specifically, it outperforms models containing up to 140 billion parameters on established financial fact-checking datasets. This efficiency is attributed to the focused fine-tuning process on the ‘FISCAL-Data’ dataset, allowing it to maximize performance within a constrained parameter budget.

Evaluating Data Integrity and Model Performance

To ensure the reliability of synthetically generated data, the research applies an ‘LLM as Judge’ protocol. This methodology leverages large language models not merely as data generators, but also as evaluators of data quality. Specifically, the LLM assesses whether individual claims within the synthetic data are atomic – meaning they express a single, verifiable fact – and whether the overall dataset maintains logical coherence. This automated evaluation moves beyond surface-level metrics, offering a nuanced view of data integrity by examining the fundamental building blocks of information and their relationships, thereby bolstering the trustworthiness of datasets used for downstream tasks and model training.
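A sketch of what an atomicity check under this protocol might look like; the prompt wording and the `complete` callable are hypothetical, not the study’s actual judging setup:

```python
JUDGE_PROMPT = """You are a data-quality judge.
Claim: "{claim}"
Is this claim atomic, i.e. does it assert exactly one verifiable fact?
Answer with exactly one word: yes or no."""

def judge_atomicity(claim: str, complete) -> bool:
    """Ask a judge LLM whether a synthetic claim is atomic.
    `complete` is any callable mapping a prompt string to the model's
    text completion (hypothetical interface)."""
    answer = complete(JUDGE_PROMPT.format(claim=claim))
    return answer.strip().lower().startswith("yes")
```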

To ensure the reliability of synthetic data evaluation, the consistency of judgements from multiple Large Language Models (LLMs) was rigorously assessed using Cohen’s Kappa, a statistical measure of inter-rater agreement. This metric goes beyond simple percentage agreement by accounting for the possibility of chance agreement, thus providing a more robust indication of genuine consensus. A high Cohen’s Kappa score signifies that the LLMs are not merely randomly assigning labels, but are consistently evaluating the data based on shared understanding of the task criteria. Utilizing this approach allows for a more confident determination of data quality, mitigating potential biases introduced by relying on a single judge and strengthening the overall validity of the synthetic dataset.
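Cohen’s Kappa is defined as $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. A minimal two-rater implementation, with a toy example of two LLM judges labeling six synthetic claims:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

a = ["valid", "valid", "invalid", "valid", "invalid", "valid"]
b = ["valid", "valid", "invalid", "invalid", "invalid", "valid"]
print(round(cohens_kappa(a, b), 3))  # 0.667: raw agreement is 5/6,
                                     # but chance agreement is 0.5
```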

Evaluations demonstrate that MiniCheck-FISCAL achieves a noteworthy 75.6% accuracy on the challenging FinDVer (FDV-IE subset), signifying its robust performance in financial document verification. This result positions the model as a strong performer, exceeding the capabilities of several established language models including Mistral-7B-v3, Gemma-7B, Llama-2-7B, Qwen2-72B, and Gemini-1.5-Flash. While GPT-4o remains the current leader with 78.5% accuracy, MiniCheck-FISCAL demonstrates a clear advancement in specialized financial data understanding, suggesting a promising trajectory for automated financial information extraction and validation systems.

Evaluations demonstrate that the MiniCheck-FISCAL model achieves a noteworthy F1 score of 86.43 when tested on the FISCAL dataset, signifying substantial improvements in its performance over the baseline MiniCheck-7B model. This advancement is particularly driven by a $26.8$ point increase in Recall, indicating a significantly enhanced ability to identify all relevant instances within the data, and an $8.22$ point increase in Precision, demonstrating a reduction in false positives. These combined gains suggest that MiniCheck-FISCAL not only finds more of the correct answers but also does so with greater accuracy, representing a robust and reliable solution for financial statement verification tasks.
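For context, F1 is the harmonic mean of precision and recall, so a large recall gain can lift F1 substantially even when precision improves more modestly; the figures in this quick check are illustrative, not the paper’s raw counts:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative only: a big recall jump dominates the F1 improvement.
print(f1(precision=0.80, recall=0.60))  # ≈ 0.686
print(f1(precision=0.88, recall=0.87))  # ≈ 0.875
```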

Evaluations reveal that MiniCheck-FISCAL significantly enhances performance on financial document verification tasks when contrasted with its baseline model. Specifically, the model achieves a $10.84$ point increase in F1 score on the FinDVer dataset, elevating its score to $70.53$ from $59.69$. This improvement extends to the Fin-Fact dataset, where MiniCheck-FISCAL demonstrates a $7.55$ point gain, reaching an F1 score of $60.69$ compared to the baseline’s $53.14$. These results indicate a substantial advancement in the model’s ability to accurately process and validate information contained within complex financial documents, suggesting its potential for reliable automation in financial verification processes.

The FISCAL framework, detailed in the study, prioritizes an elegant solution to the challenges of financial fact-checking, mirroring a philosophy that structure dictates behavior. By generating synthetic data, the system doesn’t attempt to brute-force accuracy with ever-larger models, but instead cultivates reliability through carefully constructed training material. This approach echoes Linus Torvalds’ sentiment that good engineering is driven by craft rather than scale: “Most good programmers do programming not because they expect to get paid or get adulation by the public, but because it is fun to program.” The creation of FISCAL, and the success of MiniCheck-FISCAL in achieving competitive results with limited parameters, suggests a mindful balance between complexity and efficiency, acknowledging that every simplification, however clever, carries inherent risks. The focus on modularity further embodies this principle, allowing for targeted improvements and adaptations without compromising the integrity of the whole system.

Future Directions

The pursuit of factual grounding in large language models often fixates on scaling: more parameters, more data. FISCAL offers a counterpoint, suggesting that intelligent data curation, even synthetic generation, might yield disproportionate gains in efficiency. However, the elegance of this approach hinges on the fidelity of the synthetic financial claims. The framework’s limitations will become apparent as adversarial attacks refine their ability to expose the subtle distinctions between authentic and artificially constructed financial reasoning. A crucial next step involves developing robust metrics not just for accuracy, but for detectability – how easily a model’s reliance on synthetic data can be identified.

The modularity of FISCAL is its quiet strength. It tacitly acknowledges that financial fact-checking isn’t monolithic; it’s a collection of sub-problems, each requiring specialized knowledge. Future work should explore the integration of diverse synthetic data modules – regulatory filings, earnings calls, macroeconomic indicators – to build a more comprehensive, and ultimately, more resilient fact-checking system. This necessitates moving beyond simple claim verification towards a more holistic assessment of financial narratives.

Ultimately, the success of approaches like FISCAL will depend on recognizing that a fact-checking model is not an isolated entity, but a component within a larger ecosystem. The true measure of progress lies not in achieving state-of-the-art accuracy on benchmarks, but in reducing the systemic risk embedded within financial information itself. A small, reliable model, keenly aware of its limitations, may prove more valuable than a behemoth confidently propagating subtle errors.


Original article: https://arxiv.org/pdf/2511.19671.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
