When Financial AI Gets It Wrong: A New Benchmark for Truthfulness

Author: Denis Avetisyan


Researchers have created a rigorous new test to expose how easily financial question-answering systems, even those powered by knowledge graphs, can be misled and generate inaccurate responses.

The study demonstrates that the proposed algorithm achieves superior performance across all evaluated methods, consistently minimizing the error function defined as $E = \sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ represents the actual value and $\hat{y}_i$ the predicted value for the $i$-th data point in the dataset.
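As a minimal sketch of the error function above (in Python; the function name is illustrative, not from the paper):

```python
def absolute_error_sum(actual, predicted):
    """Sum of absolute deviations: E = sum_i |y_i - yhat_i|."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))

# Toy data: three ground-truth values against one set of predictions.
y = [1.0, 2.0, 3.0]
print(absolute_error_sum(y, [1.1, 1.9, 3.2]))  # ≈ 0.4
```

A lower E means the predictions track the actual values more closely; summing absolute rather than squared deviations keeps single outliers from dominating the score.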

This paper introduces FinReflectKG — HalluBench, a benchmark and evaluation framework for assessing hallucination in knowledge graph-augmented financial question answering systems using SEC filings.

Despite growing reliance on AI-powered question answering in high-stakes financial contexts, current systems lack robust mechanisms to reliably detect factual inaccuracies, or “hallucinations.” To address this critical gap, we introduce ‘FinReflectKG — HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems’, a new benchmark dataset and evaluation framework designed to rigorously assess hallucination detection in knowledge graph-augmented financial question answering, using SEC 10-K filings. Our findings reveal that while several detection methods perform well under ideal conditions, performance degrades significantly on noisy knowledge graph data, with embedding-based approaches demonstrating comparatively greater robustness. How can we build truly reliable financial information systems that mitigate the risks of hallucination and ensure trustworthy AI-driven insights?


The Emerging Challenge of Hallucinations in Financial Reasoning

Large Language Models (LLMs) are rapidly being deployed in financial question answering systems, showcasing a remarkable aptitude for processing complex inquiries and generating seemingly coherent responses. However, this potential is tempered by a significant challenge: the propensity for ‘hallucinations’ – the generation of answers that, while grammatically correct and contextually relevant, are factually incorrect or lack supporting evidence from the provided data. These aren’t simply errors of calculation; rather, LLMs can confidently present fabricated information as truth, a critical issue within finance where precision is non-negotiable. The models, trained to predict the most likely sequence of words, sometimes prioritize fluency over factual accuracy, leading to plausible but ultimately unreliable responses. This necessitates the development of robust detection and mitigation strategies to ensure the trustworthiness of LLM-driven financial applications and prevent the dissemination of misleading information.

The integration of Large Language Models into financial question answering systems, while promising, introduces a critical vulnerability stemming from the propensity for ‘hallucinations’ – the generation of factually incorrect or unsupported responses. In the financial domain, even seemingly minor inaccuracies can have substantial consequences, ranging from flawed investment strategies to regulatory non-compliance and significant monetary losses. Therefore, the development of robust detection and mitigation strategies is not merely a technical challenge, but a fundamental necessity. Current research focuses on techniques like retrieval-augmented generation, fact verification using external knowledge bases, and uncertainty estimation to identify and correct these errors before they impact critical financial decisions. The pursuit of reliable and grounded responses is paramount to ensuring trust and responsible implementation of these powerful technologies within the financial sector.

Conventional question answering systems often falter when confronted with the intricacies of financial data, a domain characterized by complex regulations, evolving market dynamics, and subtle semantic variations. These systems, typically reliant on keyword matching and pre-defined rules, struggle to interpret the contextual dependencies within financial reports, news articles, and economic indicators. Consequently, they frequently misinterpret nuanced language, fail to identify relevant information buried within lengthy documents, and lack the capacity to reason about the implications of financial data. This inadequacy underscores the urgent need for novel approaches – leveraging techniques like knowledge graphs, semantic reasoning, and advanced natural language understanding – to ensure that financial question answering systems are not only accurate but also reliably grounded in verifiable data, fostering trust and mitigating potential risks.

Structuring Financial Knowledge: Knowledge Graphs as a Foundation

Knowledge Graphs represent a structured approach to data organization, moving beyond the unstructured nature of typical text corpora used to train Large Language Models (LLMs). These graphs utilize a subject-predicate-object triplet structure to define relationships between entities – for example, “Apple reports $383.9 billion in revenue”. By representing financial data in this format, extracted from sources like SEC filings, LLMs gain access to factual assertions distinct from the probabilistic relationships learned during training. This allows for explicit reasoning and verification, moving beyond the LLM’s inherent capacity for text generation and towards a system capable of validating information against a defined, structured knowledge base. The use of Knowledge Graphs, therefore, aims to improve the reliability and trustworthiness of LLM-driven applications in finance by grounding responses in verifiable data points.
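The subject-predicate-object structure described above can be sketched in a few lines of Python. The names (`Triple`, `assert_fact`, the `reports_revenue` predicate) are illustrative, not from FinReflectKG:

```python
from collections import namedtuple

# A factual assertion as a (subject, predicate, object) triple.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# A toy knowledge base built from the examples in the text.
kg = {
    Triple("Apple", "reports_revenue", "$383.9 billion"),
    Triple("Apple", "acquired", "Beats"),
}

def assert_fact(kg, subject, predicate, obj):
    """Check a candidate assertion against the structured knowledge base."""
    return Triple(subject, predicate, obj) in kg

print(assert_fact(kg, "Apple", "acquired", "Beats"))       # True
print(assert_fact(kg, "Apple", "acquired", "Spotify"))     # False
```

The point of the structure is exactly this membership test: unlike free text, a triple store gives a yes/no answer about whether a specific assertion is recorded.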

The FinReflectKG pipeline employs a Triplet Extraction process to automatically generate a Knowledge Graph from SEC 10-K filings. This process identifies factual relationships within the text and represents them as subject-predicate-object triplets. For example, a sentence stating “Apple acquired Beats in 2014” would be parsed into the triplet (Apple, acquired, Beats) with a temporal qualifier of 2014. These extracted triplets are then used to construct a Knowledge Graph, where entities become nodes and relationships become edges, resulting in a structured representation of financial information derived directly from company disclosures. The resulting graph contains millions of such triplets, offering a comprehensive and interconnected view of financial data.
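A toy stand-in for the extraction step, assuming a single sentence pattern (“X acquired Y in YYYY”). The real FinReflectKG pipeline is far more general; this only illustrates the (subject, predicate, object, qualifier) output shape described above:

```python
import re

# One hard-coded pattern; a real extractor would use NLP, not a regex.
PATTERN = re.compile(
    r"^(?P<subj>\w+) (?P<pred>acquired) (?P<obj>\w+) in (?P<year>\d{4})\.?$"
)

def extract_triplet(sentence):
    """Parse a matching sentence into a triplet plus a temporal qualifier."""
    m = PATTERN.match(sentence)
    if m is None:
        return None  # sentence does not match the supported pattern
    return (m["subj"], m["pred"], m["obj"], {"year": int(m["year"])})

print(extract_triplet("Apple acquired Beats in 2014"))
# ('Apple', 'acquired', 'Beats', {'year': 2014})
```

Each extracted tuple then becomes an edge in the graph, with the subject and object as nodes and the qualifier attached as edge metadata.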

Integrating structured knowledge from Knowledge Graphs into Large Language Models (LLMs) enhances the accuracy and reliability of financial Question Answering (QA) systems by providing a mechanism for answer verification. LLMs, while proficient in natural language processing, can generate inaccurate or hallucinated responses. By cross-referencing generated answers with the established facts and relationships within the Knowledge Graph, the system can validate the response against a trusted source of information derived from SEC 10-K filings. This verification process reduces the risk of incorrect information being presented and increases confidence in the system’s output, making it suitable for applications requiring high levels of factual correctness in financial analysis.
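The verification step can be sketched as a set-difference between the claims in a generated answer and the knowledge graph. Function and variable names are hypothetical; the paper's actual verification is more involved:

```python
def unsupported_claims(answer_triplets, knowledge_graph):
    """Return the answer's triplets that have no support in the KG.

    A non-empty result flags the answer as potentially hallucinated.
    """
    return [t for t in answer_triplets if t not in knowledge_graph]

# Toy KG and an answer containing one fabricated claim.
kg = {("Apple", "acquired", "Beats"), ("Apple", "hq_in", "Cupertino")}
answer = [("Apple", "acquired", "Beats"), ("Apple", "acquired", "Spotify")]

print(unsupported_claims(answer, kg))  # [('Apple', 'acquired', 'Spotify')]
```

In this framing the LLM remains responsible for fluency, while the graph acts as the arbiter of which asserted facts are actually grounded in the filings.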

Dissecting the Problem: Multi-Faceted Approaches to Hallucination Detection

Current hallucination detection techniques encompass a range of methodologies. Fine-Tuned LLM Classifiers utilize large language models trained to directly identify hallucinated content within generated text. LLM-as-Judge leverages the reasoning capabilities of LLMs to assess the factual consistency of responses. NLI (Natural Language Inference) Models determine the logical relationship between a generated answer and supporting evidence, flagging inconsistencies. Span Detectors pinpoint specific segments of text identified as potentially hallucinatory. Finally, Embedding Similarity measures the semantic similarity between generated content and source materials, with low similarity scores indicating potential hallucinations. These methods differ in their scope, with some evaluating entire responses while others focus on specific spans or logical consistency.
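The embedding-similarity approach in the list above reduces to a cosine-similarity threshold. In practice the vectors come from a learned sentence encoder; here they are hand-written toy vectors and the threshold is arbitrary:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def flag_hallucination(answer_vec, evidence_vec, threshold=0.5):
    """Flag the answer when its embedding drifts too far from the evidence."""
    return cosine_similarity(answer_vec, evidence_vec) < threshold

print(flag_hallucination([1.0, 0.0, 0.2], [0.9, 0.1, 0.3]))  # False: similar
print(flag_hallucination([1.0, 0.0, 0.2], [0.0, 1.0, 0.0]))  # True: dissimilar
```

Because the decision depends on a continuous score rather than a hard fact lookup, small corruptions of the evidence shift the score gradually, which is one intuition for why these methods degrade more gracefully under noise.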

Hallucination detection techniques employ diverse strategies for identifying inaccuracies in generated text. Some methods operate at the answer level, classifying entire responses as either factual or hallucinatory. Other approaches utilize span detection to pinpoint specific phrases within an answer that are unsupported by the source material. A third category assesses logical consistency, evaluating whether the generated answer can be logically inferred from the provided context, thereby identifying contradictions or unsupported claims. These varying methodologies allow for different granularities of analysis, from a holistic assessment of answer validity to the precise identification of problematic content.

FinReflectKG — HalluBench is a newly introduced benchmark designed specifically to evaluate the performance of hallucination detection methods within the context of knowledge-graph-augmented financial question answering. Evaluation using this benchmark under clean conditions, meaning the knowledge graph contains no noisy or incorrect triplets, demonstrates that the Qwen model achieves a leading F1 score of 0.863. This score represents the highest performance attained on the benchmark in the absence of confounding data within the knowledge graph, establishing a performance baseline for future evaluations and method comparisons.
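For reference, the F1 metric used here is the harmonic mean of precision and recall over detection decisions. The counts below are hypothetical, chosen only to land near the reported 0.863; they are not the paper's confusion matrix:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical: 88 hallucinations caught, 14 false alarms, 14 missed.
print(round(f1_score(tp=88, fp=14, fn=14), 3))  # 0.863
```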

Evaluation of current hallucination detection methods demonstrates substantial performance decline when presented with noisy knowledge graph data. While these methods achieve acceptable results in clean environments, their efficacy is compromised by inaccuracies or inconsistencies within the supporting knowledge graph. This degradation is observed across various detection techniques, including those classifying entire answers and those focused on identifying specific hallucinated spans. The susceptibility to noisy signals highlights a critical limitation in the robustness of current hallucination detection systems and necessitates the development of more resilient approaches capable of filtering or mitigating the impact of inaccurate knowledge graph information.
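A rough sketch of the kind of perturbation used to study this degradation: corrupt a fraction of triples by swapping in a wrong object entity. The paper's exact noising procedure may differ; all names here are illustrative:

```python
import random

def inject_triple_noise(triples, noise_rate, entity_pool, seed=0):
    """Corrupt a fraction of triples by replacing the object entity."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    noisy = []
    for s, p, o in triples:
        if rng.random() < noise_rate:
            # Swap in a random *different* entity from the pool.
            o = rng.choice([e for e in entity_pool if e != o])
        noisy.append((s, p, o))
    return noisy

clean = [("Apple", "acquired", "Beats"), ("Apple", "hq_in", "Cupertino")]
noisy = inject_triple_noise(clean, noise_rate=1.0,
                            entity_pool=["Beats", "Cupertino", "Spotify"])
print(noisy)  # every object replaced by some different entity
```

Running detectors on the clean graph and on graphs noised at increasing rates yields the robustness curves the evaluation describes.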

The strong positive correlation between errors across all methods indicates a shared underlying systematic bias in their predictions.

The Persistent Challenge: Addressing Knowledge Graph Imperfections and Future Directions

The efficacy of hallucination detection in knowledge graph-augmented question answering systems is demonstrably vulnerable to inaccuracies within the knowledge graph itself. Known as ‘Triple Noise’, these erroneous or imprecise triplets, the fundamental building blocks representing relationships between entities, can mislead detection mechanisms, causing them to incorrectly validate fabricated responses. Studies reveal a significant performance decline when hallucination detection methods are applied to graphs containing such noise, effectively masking the very falsehoods they are designed to identify. This sensitivity underscores a critical limitation in current approaches, as real-world knowledge graphs are rarely pristine and often contain inherent errors accumulated during extraction and construction. Consequently, addressing Triple Noise is not merely a data cleaning exercise, but a fundamental requirement for building reliable and trustworthy question answering systems that leverage external knowledge.

Evaluations revealed a significant decline in the reliability of hallucination detection when knowledge graphs contained inaccuracies – a phenomenon termed ‘Triple Noise’. Across multiple Large Language Model (LLM) judges, the presence of these noisy triplets resulted in a Mean MCC (Matthews Correlation Coefficient) Degradation ranging from 50 to 68 percent. This substantial decrease indicates that even sophisticated LLMs struggle to discern truth from falsehood when relying on imperfect knowledge sources, severely impacting the trustworthiness of the augmented question-answering systems. The observed performance drop underscores the critical need to address data quality within knowledge graphs before deploying them in applications where factual accuracy is paramount.
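The MCC referenced above, and the relative degradation figure, can be computed directly from confusion-matrix counts. The example numbers are hypothetical, chosen only to illustrate the 50-68% range; they are not the paper's per-judge scores:

```python
import math

def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_degradation(clean_mcc, noisy_mcc):
    """Relative MCC drop under noise, as a percentage of the clean score."""
    return 100.0 * (clean_mcc - noisy_mcc) / clean_mcc

print(round(matthews_corrcoef(tp=50, fp=10, fn=10, tn=30), 3))
print(round(mcc_degradation(0.80, 0.32), 1))  # 60.0, inside the 50-68% band
```

MCC is preferred over raw accuracy here because hallucination labels are typically imbalanced, and MCC accounts for all four cells of the confusion matrix.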

The large language model Qwen demonstrated a particularly acute sensitivity to inaccuracies within knowledge graphs, experiencing a substantial 73% reduction in performance, as measured by the Matthews Correlation Coefficient, when evaluated with noisy data. This significant drop indicates that even a moderate level of error in the underlying knowledge triples severely compromises Qwen’s ability to provide reliable answers. The model’s heightened vulnerability underscores the critical need for data quality control in knowledge graph-augmented question answering systems, suggesting that some models are considerably less resilient to imperfections than others and require more robust error mitigation strategies.

Despite demonstrating a degree of stability compared to other evaluation techniques, embedding-based methods are not immune to the detrimental effects of inaccuracies within knowledge graphs. Analysis revealed a measurable, though comparatively smaller, decline in performance – specifically, a 9 to 13% degradation in Matthews Correlation Coefficient (MCC) – when these methods were subjected to noisy triplet data. This indicates that while embeddings offer some robustness against knowledge graph imperfections, they still struggle to maintain reliability in the presence of errors, suggesting a need for further refinement in how these methods interpret and utilize potentially flawed information from knowledge graphs.

The practical implementation of Knowledge Graph-enhanced Question Answering systems demands careful attention to the inherent imperfections within those knowledge graphs. Specifically, the presence of inaccurate or noisy triplets – factual errors embedded within the graph’s structure – significantly undermines the reliability of these systems. Evaluations consistently demonstrate substantial performance degradation when models encounter such ‘triple noise,’ indicating that a system’s ability to discern truth isn’t solely dependent on the language model itself, but also on the quality of the underlying knowledge source. Consequently, assessing a system’s robustness to these imperfections is not merely an academic exercise, but a critical step in ensuring dependable performance when deployed in real-world applications, where knowledge graphs are rarely pristine and error-free.

Continued development necessitates a focus on bolstering the reliability of knowledge graphs and the responses they enable. Current limitations highlight the need for automated error detection and correction within these graphs, moving beyond manual curation to address the scale of information involved. Simultaneously, research should prioritize techniques that enhance the ‘groundedness’ of large language model outputs – ensuring generated text is consistently and demonstrably supported by the knowledge graph’s verified facts. This includes exploring methods for LLMs to explicitly cite supporting evidence and to flag instances where information is absent or ambiguous within the knowledge source, ultimately fostering greater trust and accuracy in knowledge-augmented question answering systems.

The pursuit of veracity in financial question answering, as detailed in this work on FinReflectKG — HalluBench, demands a rigorous adherence to provable correctness. The vulnerability of current hallucination detection methods to noisy knowledge graph signals underscores a critical need for algorithmic purity. Vinton Cerf aptly observes, “The internet is not about technology; it’s about people.” Similarly, the efficacy of a financial QA system is not merely about its ability to answer questions, but about the demonstrable truthfulness of those answers, built upon a foundation of robust, verifiable knowledge. A system operating on flawed data, however sophisticated, yields results no better than conjecture, highlighting the necessity of provable algorithms within the realm of financial NLP.

What’s Next?

The exposure of current hallucination detection methods’ fragility in the face of imperfect knowledge graphs should not be interpreted as a call for merely ‘better’ noise reduction. It is, rather, a stark reminder that statistical heuristics, however convenient, offer no principled defense against logical inconsistency. The pursuit of ‘robustness’ must yield to the demand for correctness. Future work should prioritize the development of methods capable of verifying the logical entailment between a question, the knowledge graph, and the generated answer, a task that demands formal reasoning rather than mere pattern recognition.

One avenue for exploration lies in integrating symbolic reasoning engines with neural architectures, though this will require a fundamental shift in evaluation metrics. Accuracy on benchmark datasets becomes less meaningful when the benchmarks themselves fail to rigorously assess logical validity. The field must embrace metrics that quantify the provability of answers, not merely their superficial plausibility.

Ultimately, the challenge transcends technical innovation. It requires a philosophical recalibration. The current emphasis on scaling models and generating fluent text has obscured the foundational problem: ensuring that these systems are not merely impressive mimics, but genuine reasoners. A truly intelligent system will not strive to appear correct; it will be correct, by necessity.


Original article: https://arxiv.org/pdf/2603.20252.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-24 11:24