Author: Denis Avetisyan
A new framework combines the power of large language models with formal verification to create financial intelligence systems that are both accurate and demonstrably reliable.

VERAFI combines neurosymbolic AI with agentic processing to deliver financial reasoning policies that are formally verified via SMT-LIB.
Despite advances in retrieval-augmented generation, financial AI systems remain vulnerable to calculation errors and regulatory breaches during complex reasoning. This paper introduces VERAFI: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation, a novel framework that integrates agentic processing with formally verified financial reasoning policies. Our results demonstrate that VERAFI achieves 94.7% factual correctness on FinanceBench, an 81% relative improvement over traditional dense retrieval methods, largely due to its neurosymbolic layer targeting persistent logical and mathematical fallacies. Could this approach pave the way for truly trustworthy financial AI capable of meeting the rigorous demands of compliance, investment, and risk management?
From Data Deluge to Decisive Insight: The Evolving Landscape of Financial Intelligence
Historically, discerning financial insights demanded painstaking manual review of documents, a process acutely vulnerable to human error and inherently slow. Analysts would pore over contracts, reports, and statements, extracting key data points and assessing risk factors; even with skilled professionals, the task proved both time-consuming and expensive. This reliance on manual effort created significant bottlenecks in decision-making, hindering an organization’s ability to respond quickly to market changes or capitalize on emerging opportunities. The sheer volume of financial documentation, coupled with the subtle nuances of legal and regulatory language, consistently challenged the accuracy and efficiency of these traditional methods, making them increasingly unsustainable in the face of modern financial complexity.
The modern financial landscape generates data at an unprecedented rate, far exceeding the capacity of human analysts to effectively process it. Reports, transactions, news articles, and regulatory filings combine to create a deluge of information, while increasing globalization and intricate financial instruments add layers of complexity. This exponential growth isn’t merely a scaling problem; it demands solutions capable of nuanced understanding – systems that can discern subtle relationships, identify anomalies, and extract meaningful insights from unstructured data. Simply increasing processing power isn’t enough; automated solutions must move beyond keyword spotting and embrace semantic analysis to accurately interpret context, identify intent, and ultimately, make informed decisions within this increasingly complex financial ecosystem. The shift towards automation isn’t just about speed, but about retaining accuracy and insight in the face of overwhelming data volume.
Current natural language processing techniques, while powerful in many domains, frequently encounter limitations when applied to financial texts. The highly specialized vocabulary, complex sentence structures, and prevalence of legal and accounting jargon pose significant challenges for algorithms trained on general language corpora. Moreover, financial documents are subject to strict regulatory requirements – such as those concerning data privacy and reporting accuracy – which demand a level of precision and interpretability often absent in ‘black box’ NLP models. Consequently, simply applying standard NLP tools can lead to inaccurate extractions, misinterpretations of key financial indicators, and potential compliance issues, necessitating the development of bespoke approaches tailored to the unique characteristics of financial language and regulatory landscapes.

VERAFI: Orchestrating Financial Reasoning with an Agentic, Neurosymbolic System
VERAFI’s information retrieval uses a two-stage process designed to maximize both recall and precision when processing financial documents. The initial stage employs broad-coverage retrieval to identify a large set of potentially relevant passages, prioritizing recall across the document corpus. A precise relevance-ranking stage then filters these candidates with a more refined model focused on passages directly pertinent to the specific query. This two-stage approach mitigates the limitations of single-stage retrieval systems, which often struggle to balance breadth and accuracy, and enables VERAFI to efficiently locate critical financial data within complex documentation.
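A minimal sketch of this retrieve-then-rerank pattern follows; `embed`, `vector_index`, and `rerank_score` are illustrative stand-ins, not VERAFI’s actual components:

```python
def retrieve_two_stage(query, vector_index, embed, rerank_score,
                       broad_k=100, final_k=5):
    """Stage 1: broad recall via vector search; Stage 2: precise reranking."""
    # Stage 1: cast a wide net so relevant passages are unlikely to be missed.
    query_vec = embed(query)
    candidates = vector_index.search(query_vec, top_k=broad_k)

    # Stage 2: rescore each (query, passage) pair with a slower but more
    # accurate relevance model, and keep only the best few.
    scored = sorted(candidates, key=lambda p: rerank_score(query, p),
                    reverse=True)
    return scored[:final_k]
```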
VERAFI employs an agentic framework to address complex financial queries by decomposing them into a series of sequential steps. This framework utilizes tools – specifically, specialized financial calculators and data retrieval functions – to perform intermediate computations and access relevant information. The agent plans a multi-step reasoning path, executing each step and updating its internal state based on the results. This allows VERAFI to handle queries requiring multiple calculations, cross-document information synthesis, and conditional logic, effectively simulating a human financial analyst’s problem-solving process.
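The control flow can be pictured as a plan-act-observe loop. The toy sketch below illustrates that pattern; the planner contract and calculator tool are hypothetical stand-ins, not the paper’s actual framework:

```python
def calculator(expression: str) -> float:
    """Illustrative tool: evaluate a simple arithmetic expression."""
    # Toy only: eval with empty builtins is still not safe for untrusted input.
    return eval(expression, {"__builtins__": {}}, {})

TOOLS = {"calculator": calculator}

def run_agent(query: str, plan_step, max_steps: int = 8) -> str:
    """plan_step(query, history) -> ("answer", text) or (tool_name, tool_input).

    In a real system plan_step would be backed by an LLM; here it is any
    callable honoring that contract.
    """
    history = []
    for _ in range(max_steps):
        action, payload = plan_step(query, history)
        if action == "answer":                 # planner has enough evidence
            return payload
        observation = TOOLS[action](payload)   # execute the chosen tool
        history.append((action, payload, observation))  # update agent state
    return "No answer within the step budget."
```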
VERAFI employs policy-guided generation to ensure outputs align with established financial regulations and accounting principles, thereby improving the reliability of its insights. Evaluation demonstrates that VERAFI achieves 94.7% factual correctness when answering financial questions, an 81% relative improvement over traditional dense retrieval methods (implying a dense-retrieval baseline of roughly 94.7 / 1.81 ≈ 52% factual correctness) and a substantial gain in accuracy and adherence to financial standards attributable to policy-guided generation.

Deconstructing the Mechanics: A Deep Dive into Retrieval and Reasoning
Dense Passage Retrieval (DPR) is employed as a first-stage filtering mechanism to rapidly identify potentially relevant documents from a larger corpus. This process leverages neural models, such as Qwen3-Embedding-4B, to generate vector embeddings for both the query and each document passage. The similarity between the query embedding and passage embeddings is then calculated – typically using cosine similarity – to rank passages based on relevance. By converting textual data into a numerical vector space, DPR allows for efficient approximate nearest neighbor search, significantly reducing the number of passages that require more computationally expensive analysis in subsequent stages. This initial screening improves overall system performance and reduces latency compared to exhaustive search methods.
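As a concrete sketch, first-stage dense retrieval with normalized embeddings reduces cosine similarity to a dot product. This assumes the Qwen3 embedding model loads through the `sentence-transformers` library; any embedding model can stand in if not:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model id as named in the article; assumed to be loadable this way.
model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")

passages = [
    "Net revenue for FY2023 was $4.2B, up 12% year over year.",
    "The board declared a quarterly dividend of $0.25 per share.",
]
query = "What was the company's FY2023 revenue growth?"

# Normalized embeddings make cosine similarity a plain dot product.
passage_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = passage_vecs @ query_vec   # cosine similarities
ranked = np.argsort(-scores)        # best passages first
print([passages[i] for i in ranked])
```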
Cross-Encoder Reranking, implemented with the Jina-reranker-v3 model, operates as a secondary filtering stage following initial document retrieval. Unlike initial retrieval methods that assess relevance based on individual document or passage embeddings, cross-encoders process the query and each candidate passage jointly. This allows the model to consider the interaction between the query and the passage content, leading to a more nuanced understanding of relevance. Jina-reranker-v3 utilizes a transformer architecture trained to score the probability that a given passage answers the input query. The passages are then sorted by this score, effectively re-ranking the initial results and prioritizing those most likely to contain the answer. This process significantly improves precision at the expense of increased computational cost compared to simpler retrieval methods.
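The reranking step looks roughly like the sketch below. A widely available MS MARCO cross-encoder stands in for Jina-reranker-v3, whose own loading path may differ:

```python
from sentence_transformers import CrossEncoder

# Stand-in cross-encoder; the paper's Jina-reranker-v3 plays this role.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What was the company's FY2023 revenue growth?"
candidates = [
    "The board declared a quarterly dividend of $0.25 per share.",
    "Net revenue for FY2023 was $4.2B, up 12% year over year.",
]

# The cross-encoder reads query and passage jointly, so it can model their
# interaction rather than comparing independently computed embeddings.
scores = reranker.predict([(query, p) for p in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```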
The Strands Agentic Framework supplies the core agentic architecture that drives complex analytical workflows through dynamic planning and execution. The framework integrates a Python REPL, enabling the agent to perform computations and data manipulation directly within its processing loop, and incorporates Web Search capabilities, allowing the agent to pull external information into its analysis as needed. This combination of computational and information-retrieval resources lets Strands address tasks requiring both internal processing and real-time data acquisition, exceeding the limitations of static knowledge bases and pre-defined workflows.
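To make the tool-integration idea concrete, here is a minimal, self-contained sketch of a Python REPL tool with persistent state plus a stubbed web search. These signatures are invented for illustration and are not the Strands framework’s actual API:

```python
import io
import contextlib

def python_repl_tool(code: str, namespace: dict) -> str:
    """Execute agent-authored code and capture stdout as the observation."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # toy only; real frameworks sandbox this step
    return buffer.getvalue()

def web_search_tool(query: str) -> list[str]:
    """Stub: a real implementation would call a search API here."""
    return [f"[stub result for: {query}]"]

# A persistent namespace lets successive REPL calls share state, the way an
# interactive session does.
session = {}
print(python_repl_tool("revenue = 4.2e9; print(revenue * 0.12)", session))
print(python_repl_tool("print(revenue)", session))  # state persists
```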
Neurosymbolic autoformalization is a process that converts natural language descriptions of policies or rules into formal specifications written in SMT-LIB, a standard input language for Satisfiability Modulo Theories (SMT) solvers. This translation enables automated reasoning and verification of the policy’s logical consistency and correctness. The system parses the natural language, identifies key logical components (such as conditions, actions, and constraints), and then represents these components using SMT-LIB syntax. An SMT solver can then be used to determine if the formalized policy is satisfiable, consistent, and meets specified requirements. This allows for rigorous testing and validation of complex policies without manual translation to formal logic, reducing errors and increasing confidence in the system’s behavior.
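The end product of autoformalization can be pictured as follows: an invented covenant (“the current ratio must be at least 1.5”) rendered in SMT-LIB and checked with the Z3 solver (`pip install z3-solver`). Negating the property and obtaining `unsat` means no counterexample exists, i.e., the policy holds for the given figures:

```python
from z3 import Solver, parse_smt2_string, unsat

smtlib_policy = """
(declare-const current_assets Real)
(declare-const current_liabilities Real)
(assert (= current_assets 240.0))
(assert (= current_liabilities 120.0))
(assert (> current_liabilities 0.0))
; Negate the policy: if no counterexample exists, the policy holds.
(assert (not (>= (/ current_assets current_liabilities) 1.5)))
"""

solver = Solver()
solver.add(parse_smt2_string(smtlib_policy))
if solver.check() == unsat:
    print("Policy verified: current ratio >= 1.5 for the given figures.")
else:
    print("Policy violated; counterexample:", solver.model())
```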
Validating Insight and Charting a Course for Future Development
VERAFI’s capabilities undergo substantial scrutiny through evaluation on established financial benchmarks, notably FinanceBench, and conversational question-answering datasets such as ConvFinQA. These datasets provide a standardized and challenging environment to measure the system’s ability to not only retrieve relevant financial information, but also to synthesize it into coherent and accurate responses. Performance on these benchmarks allows for direct comparison with other state-of-the-art models and quantifies VERAFI’s progress in tackling complex financial reasoning tasks. Rigorous testing ensures the reliability and trustworthiness of the system’s analytical outputs, demonstrating its potential for practical application in financial analysis and decision-making.
To rigorously validate VERAFI’s outputs, the system employs a novel evaluation framework utilizing Large Language Models (LLMs) as automated judges. This approach moves beyond traditional metrics by assessing not only the presence of correct information, but also the completeness of responses to complex financial queries. Instead of relying solely on pre-defined ground truths, the LLM-as-a-Judge methodology allows for a more nuanced evaluation, determining if the generated analysis sufficiently addresses all relevant aspects of the question. The LLM effectively acts as an expert reviewer, comparing VERAFI’s responses against a broad understanding of financial principles and identifying any gaps or omissions in reasoning. This automated process dramatically increases the scalability and objectivity of the evaluation, ensuring a consistently high standard of factual accuracy and analytical thoroughness.
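A schematic of the LLM-as-a-Judge pattern might look like the following; the rubric and the `llm` callable are placeholders rather than the paper’s actual judging setup:

```python
import json

JUDGE_PROMPT = """You are grading a financial QA system.
Question: {question}
System answer: {answer}
Reference notes: {reference}

Return JSON: {{"factually_correct": true/false, "complete": true/false,
"missing_points": [...]}}"""

def judge(question: str, answer: str, reference: str, llm) -> dict:
    """`llm` is any callable mapping a prompt string to a completion string."""
    raw = llm(JUDGE_PROMPT.format(question=question, answer=answer,
                                  reference=reference))
    return json.loads(raw)  # assumes the judge model reliably emits JSON
```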
VERAFI distinguishes itself through the implementation of formalized, automated reasoning policies, establishing a novel framework for accountability in financial analysis. These policies aren’t simply guidelines; they are precisely defined rules embedded within the system, dictating how conclusions are reached from financial data. This structured approach moves beyond the ‘black box’ nature often associated with large language models, providing a clear audit trail for each analytical step. By explicitly outlining the reasoning process, VERAFI fosters trust in its outputs, enabling users to verify the logic behind financial recommendations and understand potential biases. The formalized policies are designed to ensure consistency and objectivity, reducing the risk of subjective interpretations and improving the reliability of financial insights derived from complex datasets.
Development of VERAFI is poised to extend beyond its current capabilities, with future iterations concentrating on a broader range of automated reasoning policies to enhance analytical rigor. This expansion will be coupled with integration into live financial data streams, enabling dynamic and up-to-the-minute assessments. Initial retrieval performance, measured by Recall@3 with Dense+Rerank, already demonstrates a strong foundation at 66.7%, suggesting a promising ability to efficiently access relevant information for complex financial inquiries and paving the way for a system capable of continuous, data-driven insights.
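For reference, a Recall@k figure such as the reported 66.7% is simply the fraction of queries whose gold passage appears among the top k retrieved results, as in this toy computation:

```python
def recall_at_k(results_per_query, gold_per_query, k=3):
    """Fraction of queries whose gold passage is in the top-k results."""
    hits = sum(
        gold in results[:k]
        for results, gold in zip(results_per_query, gold_per_query)
    )
    return hits / len(gold_per_query)

# Toy check: the gold passage is retrieved in the top 3 for 2 of 3 queries.
results = [["p4", "p1", "p9"], ["p2", "p7", "p3"], ["p8", "p5", "p6"]]
gold = ["p1", "p3", "p0"]
print(f"Recall@3 = {recall_at_k(results, gold):.1%}")  # 66.7%
```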
VERAFI’s architecture embodies a systemic approach to financial intelligence, mirroring the interconnectedness of living organisms. The framework doesn’t merely address isolated questions; it constructs a verified reasoning process, ensuring each component supports the overall integrity of the system. This holistic design philosophy aligns with Karl Popper’s assertion that “The method of ‘conjectures and refutations’ is not simply a procedure for discovering the truth, but an evolutionary method which trails along the path of trial and error.” VERAFI, through its neurosymbolic approach and formal verification, actively ‘refutes’ potential errors in financial reasoning, iteratively refining its policies toward a more robust and reliable system. The emphasis on formal verification isn’t just about correctness, but about building a resilient structure capable of adapting to the complexities of financial data.
The Road Ahead
The architecture presented in VERAFI suggests a path, though not a destination. One cannot simply graft formal verification onto a system built on statistical approximation and expect systemic resilience. The true challenge lies not in verifying individual policies, but in understanding how the agentic framework itself propagates uncertainty. The system’s intelligence, after all, resides not in any single reasoning step, but in the orchestration of those steps. A single faulty connection, a subtly biased retrieval, and the entire edifice of formal correctness becomes largely academic.
Future work must address the holistic behavior of such agents. Current methods largely treat retrieval-augmentation as a black box, ignoring the potential for spurious correlations to undermine the verified reasoning. A more nuanced understanding of information provenance – tracing the lineage of every fact – is essential. This necessitates a shift from verifying what the agent concludes, to verifying how it arrived at that conclusion, a problem that demands new tools for tracing causal dependencies within complex neurosymbolic networks.
Ultimately, the pursuit of ‘financial intelligence’ may reveal itself to be a fundamentally different endeavor than simply building better question-answering systems. The market, like any complex adaptive system, is defined by emergent behavior, by the unpredictable interplay of countless individual decisions. Attempting to impose formal correctness upon such a system may prove less fruitful than accepting a degree of inherent ambiguity, and focusing instead on building agents that are robust to, and capable of learning from, the inevitable noise.
Original article: https://arxiv.org/pdf/2512.14744.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/