Author: Denis Avetisyan
A novel information-theoretic method dramatically improves the detection of inaccurate statements generated by artificial intelligence in financial contexts.

Researchers introduce ECLIPSE, a framework that leverages perplexity decomposition to assess evidence quality and reduce AI hallucinations using only API access.
Despite the increasing fluency of large language models, their propensity for generating unsupported or ‘hallucinated’ content limits deployment in critical domains like finance. This paper introduces ECLIPSE, a novel information-theoretic framework, detailed in ‘Detecting AI Hallucinations in Finance: An Information-Theoretic Method Cuts Hallucination Rate by 92%’, that characterizes hallucinations as a mismatch between model uncertainty and evidence quality, achieving a 92% reduction in hallucination rate on a controlled financial dataset. By decomposing perplexity and leveraging token-level log probabilities, ECLIPSE effectively distinguishes between well-supported and fabricated responses using only API access. Could this approach offer a broadly applicable solution for mitigating hallucinations and building more trustworthy AI systems?
The Fragility of Surface-Level Truth
Even with the remarkable scaling of parameters and training data, large language models frequently exhibit a propensity for generating factually incorrect statements, commonly referred to as “hallucinations.” These aren’t simply random errors; the models confidently produce plausible-sounding information that lacks grounding in reality. This behavior isn’t necessarily tied to a lack of data, but rather a limitation in how the models process and synthesize information. They excel at identifying patterns and predicting the most likely sequence of tokens, but often fail to verify the truthfulness of the generated content. Consequently, a model can convincingly articulate a falsehood, making it challenging to distinguish between accurate knowledge and sophisticated fabrication, and highlighting a critical obstacle in deploying these systems for tasks demanding reliable information.
Current approaches to detecting hallucinations in large language models frequently assess the likelihood of each generated token, using log probabilities as a key metric for factual consistency. However, this reliance on surface-level token probabilities proves surprisingly fragile; when these models encounter less direct or informative textual cues – proxies that don’t strongly correlate with factual truth – the accuracy of hallucination detection plummets. This degradation isn’t simply a matter of noise; it reveals a fundamental limitation in these methods, demonstrating they often fail to discern why a statement is incorrect, instead focusing solely on how probable it appears given the training data. Consequently, models can confidently generate plausible falsehoods that bypass these token-level checks, highlighting the need for more robust techniques that evaluate the underlying reasoning process rather than merely assessing superficial statistical patterns.
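To make that baseline concrete, the sketch below scores an answer by the average log probability its own model assigned to each generated token, the kind of surface-level check described above. It is a minimal illustration, not the method evaluated in the paper: the threshold, helper names, and sample values are all assumptions, and real per-token log probabilities would come from an LLM API response.

```python
from typing import List

def mean_token_logprob(token_logprobs: List[float]) -> float:
    """Average per-token log probability of a generated answer: the
    surface-level signal that token-level detectors rely on."""
    return sum(token_logprobs) / len(token_logprobs)

def flag_hallucination(token_logprobs: List[float], threshold: float = -2.5) -> bool:
    """Flag the answer when the model itself found its tokens unlikely.
    The threshold is illustrative and would need tuning in practice."""
    return mean_token_logprob(token_logprobs) < threshold

# Hypothetical per-token log probabilities, as an API might return them.
logprobs = [-0.4, -1.1, -3.2, -2.8, -0.9]
print(flag_hallucination(logprobs))  # False: the mean (-1.68) sits above the threshold
```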
The observed limitations in hallucination detection, where methods falter beyond simple token comparisons, highlight a crucial need to shift focus from what a language model says to how it arrives at that conclusion. Current techniques largely assess the plausibility of individual tokens, effectively gauging surface-level correctness without examining the underlying chain of reasoning. A more robust approach necessitates methods that probe the model’s internal logic – its ability to connect evidence, draw inferences, and maintain consistency – rather than simply evaluating the statistical likelihood of generated text. This demands the development of tools capable of dissecting the reasoning process itself, potentially through techniques like attention analysis, knowledge graph traversal, or the construction of explicit reasoning traces, ultimately aiming to build models that don’t just sound correct, but are correct because of sound reasoning.

Beyond Probability: Discerning Evidence-Based Reasoning
Effective reasoning in artificial intelligence necessitates more than simply producing a statistically probable response; it demands demonstrable justification through supporting evidence. Models capable of articulating why an answer is correct, by referencing relevant data, exhibit a higher degree of reasoning capability than those that merely predict an outcome. This requires evaluating not only the plausibility of an answer but also the extent to which the provided evidence increases the confidence in that answer. The ability to trace an answer back to its evidentiary basis is crucial for building trustworthy and reliable AI systems, particularly in applications where accountability and transparency are paramount. Without this connection to supporting data, a model’s output remains a prediction, not a reasoned conclusion.
Perplexity Decomposition is a technique used to analyze how language models leverage provided evidence when generating responses. It works by calculating the overall perplexity of a model’s answer given both the question and the evidence, then decomposing this value into components representing the evidence’s contribution. Specifically, the technique assesses the extent to which the evidence reduces the uncertainty of the answer; a lower perplexity score indicates the evidence effectively constrained the possible answer distribution. This decomposition is achieved by calculating the log-likelihood of the answer given the question and evidence, $\log P(A|Q,E)$, and comparing it to the log-likelihood of the answer given only the question, $\log P(A|Q)$. The difference between these values quantifies the information gain from the evidence, enabling a granular understanding of how the model utilizes supporting information.
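The comparison can be sketched with only API-style access to token log probabilities. The snippet below is an illustrative outline, not the paper’s implementation: `score_answer_logprobs`-style access is represented by a hypothetical `Scorer` callable that returns per-token log probabilities of a fixed answer appended to a prompt, and the prompt templates are assumptions.

```python
from typing import Callable, List

# Hypothetical scorer: returns per-token log probabilities of `answer`
# when it is appended to `prompt`. In practice this would wrap an LLM API
# that can echo log probabilities for a supplied continuation.
Scorer = Callable[[str, str], List[float]]

def evidence_information_gain(score: Scorer, question: str, evidence: str, answer: str) -> float:
    """Information gain from evidence, in the spirit of perplexity decomposition:
    log P(A|Q,E) - log P(A|Q), summed over the answer tokens.

    Values near zero suggest the answer is not actually constrained by the
    evidence; large positive values mean the evidence makes the answer far
    more predictable."""
    with_evidence = sum(score(f"{question}\n\nEvidence: {evidence}\n\nAnswer:", answer))
    question_only = sum(score(f"{question}\n\nAnswer:", answer))
    return with_evidence - question_only
```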
The quantification of evidence constraint within Perplexity Decomposition relies on assessing Log-Likelihood, a measure of how well a probability model predicts a given set of observations. Specifically, Log-Likelihood, denoted as $LL(a|e) = \log P(a|e)$, calculates the natural logarithm of the probability of an answer ($a$) given the evidence ($e$). A higher Log-Likelihood indicates a stronger correlation between the evidence and the answer. By decomposing the overall perplexity into evidence-related and non-evidence-related components, the degree to which the evidence constrains the answer distribution, effectively reducing uncertainty, can be precisely quantified. This allows for a granular assessment of how much the evidence contributes to the model’s confidence in its response.
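For readers who want the bookkeeping spelled out, the standard definitions connect per-token log-likelihood, perplexity, and the information gain described above as follows (a routine derivation under the usual autoregressive factorization, not a formula quoted from the paper; $N$ is the number of answer tokens and $a_{<i}$ the tokens preceding position $i$):

$$LL(A \mid Q, E) = \sum_{i=1}^{N} \log P(a_i \mid Q, E, a_{<i}), \qquad \mathrm{PPL}(A \mid Q, E) = \exp\left(-\frac{1}{N} LL(A \mid Q, E)\right)$$

$$\Delta(E) = LL(A \mid Q, E) - LL(A \mid Q) = N\left[\log \mathrm{PPL}(A \mid Q) - \log \mathrm{PPL}(A \mid Q, E)\right]$$

A positive information gain $\Delta(E)$ is therefore exactly a drop in conditional perplexity once the evidence is supplied.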

ECLIPSE: Charting the Terrain of Uncertainty and Evidence
ECLIPSE establishes a framework for evaluating language models by quantifying the relationship between semantic entropy and evidence capacity. Semantic entropy, representing model uncertainty, is inversely proportional to the confidence in its predictions; higher entropy indicates greater uncertainty. Simultaneously, evidence capacity measures the amount of supporting information a model utilizes from provided context. The framework explicitly models this trade-off, acknowledging that informative evidence can reduce uncertainty, but excessive reliance on evidence may not necessarily improve reasoning quality. By jointly considering these factors, ECLIPSE aims to provide a more comprehensive assessment of a language model’s ability to reason effectively and avoid both overconfidence and insufficient justification.
ECLIPSE enhances reasoning evaluation by integrating Semantic Entropy and Evidence Capacity. Traditional metrics often fail to distinguish between models that exhibit low uncertainty due to genuine knowledge and those with low uncertainty resulting from overconfidence or lack of information. Semantic Entropy, a measure of a model’s uncertainty in its predictions, is therefore combined with Evidence Capacity, which quantifies the amount of relevant information the model utilizes from supporting evidence. This combined metric allows for a more nuanced assessment; a high Evidence Capacity mitigates the negative impact of high Semantic Entropy, and vice versa, providing a more robust signal of reasoning quality than either metric alone. The framework effectively penalizes models that are both uncertain and lack supporting evidence, while rewarding those that demonstrate informed uncertainty.
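The article does not spell out the exact functional form ECLIPSE uses to combine the two signals, so the sketch below is only an illustrative stand-in: semantic entropy is computed over meaning clusters of sampled answers, and a simple linear trade-off with an assumed weight `alpha` penalizes uncertainty while crediting evidence support.

```python
import math
from collections import Counter
from typing import List

def semantic_entropy(cluster_labels: List[int]) -> float:
    """Entropy over semantic clusters of sampled answers; higher means the
    samples disagree more about the answer's meaning. Clustering of
    paraphrases is assumed to happen upstream."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def combined_score(entropy: float, evidence_capacity: float, alpha: float = 1.0) -> float:
    """Illustrative trade-off: reward evidence support, penalize uncertainty.
    The linear form and the weight alpha are assumptions, not the paper's formula."""
    return evidence_capacity - alpha * entropy

# Example: five sampled answers fall into two meaning clusters, and the evidence
# capacity comes from a perplexity-decomposition step like the one sketched earlier.
labels = [0, 0, 0, 1, 1]
print(combined_score(semantic_entropy(labels), evidence_capacity=2.3))
```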
The ECLIPSE framework utilizes Fact Extraction to pinpoint relevant supporting evidence from source documents, a process crucial for assessing the validity of model responses. Quantification of this evidence’s impact is achieved through Perplexity Decomposition, which measures how well the language model predicts the extracted facts; lower perplexity indicates stronger evidence support. Evaluations on financial Question Answering datasets demonstrate the framework’s effectiveness, achieving an Area Under the Curve (AUC) of 0.89, indicating a high degree of accuracy in distinguishing between well-supported and unsupported answers.
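As a note on how such a score is evaluated, the toy example below computes ROC-AUC over a handful of made-up support scores and hallucination labels; it shows the evaluation recipe only and does not reproduce the reported 0.89.

```python
from sklearn.metrics import roc_auc_score

# 1 = answer supported by the retrieved evidence, 0 = hallucinated (toy labels).
labels = [1, 1, 0, 1, 0, 0, 1, 0]
# Scalar support scores from the detector; higher should mean better supported.
scores = [2.1, 1.7, -0.3, 0.3, 0.1, -1.2, 1.4, 0.4]

print(round(roc_auc_score(labels, scores), 3))  # 0.938 on this toy data
```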

Towards Reliable Systems: Calibration and the Pursuit of Trustworthy AI
Model calibration is crucial for reliable AI systems, and ECLIPSE directly addresses this by enabling more accurate estimations of confidence in generated responses. Traditionally, large language models can be overconfident in incorrect answers or underconfident in correct ones; ECLIPSE aims to rectify this misalignment. By assessing how well a model’s predicted probabilities match its actual accuracy, the framework refines the confidence scores assigned to each generated response. This improved calibration is not merely a statistical refinement; it has practical implications, allowing downstream applications to better discern trustworthy outputs from potentially unreliable ones, and ultimately fostering greater user trust in AI-driven systems. A well-calibrated model provides a more honest signal regarding its own uncertainty, which is especially important in high-stakes scenarios where incorrect information can have significant consequences.
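Calibration itself can be checked with a standard measurement such as expected calibration error, sketched below under assumed inputs: a confidence in $[0, 1]$ per answer (which could be derived from an ECLIPSE-style score) and a correctness label.

```python
from typing import List

def expected_calibration_error(confidences: List[float], correct: List[bool], n_bins: int = 10) -> float:
    """Expected Calibration Error: bin answers by stated confidence and average
    |confidence - accuracy| across bins, weighted by bin size. A well-calibrated
    model is right about 80% of the time on answers it gives 0.8 confidence."""
    buckets: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        buckets[min(int(conf * n_bins), n_bins - 1)].append(i)
    total, ece = len(confidences), 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(confidences[i] for i in bucket) / len(bucket)
        accuracy = sum(1 for i in bucket if correct[i]) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Toy check: an overconfident model states ~0.9 confidence but is right half the time.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.85], [True, False, True, False]))
```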
ECLIPSE builds upon established techniques for identifying inaccurate or fabricated information in language models, integrating methods like SelfCheckGPT and Semantic Entropy Probes into a cohesive framework. Rather than treating these approaches as isolated solutions, ECLIPSE provides a unified and principled way to leverage their strengths, allowing for more robust hallucination detection. This integration isn’t simply a matter of combining outputs; the framework allows for a systematic analysis of model confidence and evidence usage, enhancing the reliability of each individual method and enabling a more nuanced understanding of why a model might be generating untrustworthy content. The result is a versatile toolkit that adapts to various scenarios and offers a significant improvement over relying on any single detection technique in isolation.
ECLIPSE demonstrates particular efficacy when applied to Retrieval-Augmented Generation (RAG) systems, which rely heavily on external knowledge sources for accurate response generation. By prioritizing the evaluation of evidence usage – verifying whether a model’s claims are genuinely supported by retrieved documents – the framework significantly minimizes the occurrence of hallucinations. Studies reveal a remarkable 92% reduction in these inaccuracies at a coverage level of 30%, substantially outperforming methods that rely solely on semantic entropy for detection. This suggests that assessing how a model uses evidence, rather than simply measuring the unpredictability of its output, is a more robust approach to ensuring the reliability of RAG systems and, consequently, the trustworthiness of generated information.
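The coverage figure can be read as selective answering: rank answers by their evidence-support score, respond only on the most supported fraction, and abstain on the rest. The sketch below mirrors that style of reporting with invented data; the ranking score and the 30% cutoff are the only inputs.

```python
from typing import List

def hallucination_rate_at_coverage(scores: List[float], hallucinated: List[bool], coverage: float) -> float:
    """Answer only the top `coverage` fraction of queries by support score
    (abstaining on the rest) and report the hallucination rate among the
    answers actually given."""
    ranked = sorted(zip(scores, hallucinated), key=lambda pair: pair[0], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * coverage))]
    return sum(1 for _, was_hallucinated in kept if was_hallucinated) / len(kept)

# Toy data: at 30% coverage only the three best-supported answers are returned.
scores = [2.3, 1.9, 0.2, -0.5, 1.1, 0.7, -1.4, 0.9, 2.0, 0.1]
halluc = [False, False, True, True, False, True, True, False, False, True]
print(hallucination_rate_at_coverage(scores, halluc, coverage=0.3))  # 0.0 here
```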

The pursuit of reliable outputs from large language models, as detailed in this framework, inherently acknowledges the transient nature of any complex system. ECLIPSE, with its decomposition of perplexity and focus on evidence quality, attempts to chart the course of a system’s decline, to discern when its ‘improvements age faster than [it can] understand them.’ This methodology, by quantifying the relationship between model uncertainty and evidence, doesn’t seek to prevent decay, but to gracefully manage it. It’s a pragmatic acceptance that even the most advanced architectures are subject to entropy, and the key lies in building systems capable of self-assessment and adaptation. As Barbara Liskov aptly stated, “Programs must be correct, and they must be maintainable.”
What Lies Ahead?
The ECLIPSE framework, while demonstrably effective at reducing the incidence of hallucination, merely addresses a symptom. The underlying condition, the inherent tendency of these systems to construct plausible narratives detached from verifiable truth, remains. Each reduction in reported error carries the weight of past approximations, a historical debt accruing with every iteration. The pursuit of “truth” within a probabilistic model is, at best, a refined form of controlled confabulation.
Future work will inevitably focus on diminishing returns. Improvements to perplexity decomposition and entropy measurement will yield incremental gains, but the fundamental limitation remains: these models are not reasoning engines; they are sophisticated pattern completion machines. A more fruitful, though considerably more arduous, path lies in understanding how these systems degrade, identifying the specific architectural or training conditions that predispose them to fabrication. Only slow change, a careful mapping of failure modes, preserves resilience against the inevitable drift towards untethered generation.
The reliance on API access, while pragmatic, introduces an external dependency. The very definition of “hallucination” becomes subtly influenced by the provider’s data and operational constraints. A truly robust solution must move beyond external validation, towards an internal capacity for self-assessment: a metacognitive awareness of its own limitations. Whether such a capacity is even theoretically possible remains an open, and increasingly critical, question.
Original article: https://arxiv.org/pdf/2512.03107.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/