Author: Denis Avetisyan
A new approach combines diverse knowledge sources to significantly improve the ability of artificial intelligence to answer complex financial questions.

This review details a multi-retriever RAG system leveraging both internal and external financial knowledge to enhance numerical reasoning in large language models for financial question answering.
Despite recent advances in large language models, complex financial question answering, particularly tasks demanding multi-step numerical reasoning, remains a significant challenge due to limitations in domain-specific knowledge. This research, ‘Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs’, addresses this by introducing a novel Retrieval-Augmented Generation (RAG) system that integrates both external financial knowledge and internal question context. Through this approach, the authors achieve state-of-the-art performance, demonstrating a substantial improvement in numerical reasoning capabilities. The persistent gap between model performance and human expertise, however, raises the question of how best to balance external knowledge integration against mitigating model hallucinations for truly reliable financial analysis.
The Rigorous Demands of Financial Reasoning
Successfully navigating financial question answering demands a sophisticated interplay between linguistic comprehension and quantitative skill, presenting a significant hurdle for contemporary artificial intelligence. Current models often struggle not simply with the calculations themselves (determining present value or calculating return on investment, for example) but with correctly interpreting the meaning embedded in financial terminology and applying it to the appropriate numerical operations. This necessitates a deep understanding of concepts like amortization, compound interest, and various investment vehicles, coupled with the ability to translate those concepts into mathematical formulations and execute them accurately. The challenge isn’t merely about processing numbers; it is about decoding the language of finance and then performing the correct calculations within the given context, a task requiring a level of integrated reasoning that remains elusive for many AI systems.
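To see how little slack such questions leave, consider the discounting step behind a typical present-value query. The snippet below is a minimal sketch of that single arithmetic step, shown only to make the computation concrete; it is not any model’s internal procedure.

```python
# Minimal illustration of the arithmetic a model must get right:
# the present value of a future cash flow under annual compounding.
def present_value(future_value: float, annual_rate: float, years: int) -> float:
    """PV = FV / (1 + r)^n, the standard discounting formula."""
    return future_value / (1 + annual_rate) ** years

# "What is $1,000 received in 5 years worth today at 4%?" requires mapping
# "worth today" to discounting before any arithmetic can run.
print(round(present_value(1_000, 0.04, 5), 2))  # 821.93
```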
Current approaches to Financial Numerical Reasoning QA frequently falter due to an inability to synthesize contextual information with precise calculation. These systems often treat numerical computation as a task separate from understanding the underlying financial principles and relevant background details. Consequently, they struggle with questions requiring the interpretation of nuanced financial scenarios, such as assessing risk tolerance or understanding the implications of specific economic indicators, and produce inaccurate or incomplete answers. The challenge isn’t simply performing A + B = C, but discerning when and how to apply specific formulas or financial rules based on the question’s subtle cues and the broader financial context, a task demanding a level of integrated reasoning that surpasses the capabilities of many traditional question answering systems.
External Knowledge: A Necessary Augmentation
Retrieval-Augmented Generation (RAG) addresses limitations in Large Language Models (LLMs) by supplementing their pre-trained knowledge with information dynamically retrieved from external sources. LLMs, while proficient in language understanding and generation, possess a fixed knowledge base and can generate inaccurate or outdated responses when faced with questions requiring current or specialized information. RAG systems mitigate this by first retrieving relevant documents or passages from a knowledge source, such as the financial reference Investopedia, based on the user’s query. These retrieved passages are then provided as context to the LLM, allowing it to generate responses grounded in factual, up-to-date information. This approach improves the accuracy, reliability, and specificity of LLM outputs, particularly in domains where knowledge evolves rapidly or requires access to proprietary data.
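The retrieve-then-generate loop at the heart of RAG fits in a few lines. In the sketch below, `search` and `call_llm` are placeholder callables standing in for any retriever and LLM API; this is an illustrative skeleton, not the system described in the paper.

```python
# Hypothetical RAG skeleton: retrieve top-k passages, then generate an
# answer grounded in them. `search` and `call_llm` are injected stand-ins.
def answer_with_rag(question: str, search, call_llm, k: int = 3) -> str:
    passages = search(question, k)            # retrieve relevant passages
    context = "\n\n".join(passages)           # concatenate them as context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)                   # generation grounded in retrieval
```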
DPR-FAISS is a system designed for efficiently retrieving relevant documents from a large corpus to provide external context for question answering. It combines Dense Passage Retrieval (DPR) with FAISS (Facebook AI Similarity Search). DPR utilizes a dual-encoder architecture to create dense vector embeddings of both queries and passages, allowing for semantic similarity searches. FAISS is a library that enables fast similarity search and clustering of dense vectors. By indexing the passage embeddings with FAISS, the system can quickly identify the most relevant passages given a query embedding, significantly improving the accuracy and reliability of responses generated by Large Language Models (LLMs) when factual recall is required.
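A minimal version of this pipeline can be assembled from the public DPR checkpoints and FAISS. The passages, index type, and checkpoints below are illustrative assumptions, not the configuration used in the paper.

```python
import faiss
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Public DPR dual-encoder checkpoints; any compatible pair would do.
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

passages = ["Compound interest is interest earned on previously earned interest.",
            "Amortization spreads a loan into a series of fixed payments."]

with torch.no_grad():
    # Embed passages and index them for inner-product (dot-product) search.
    p_vecs = ctx_enc(**ctx_tok(passages, return_tensors="pt", padding=True)).pooler_output
    index = faiss.IndexFlatIP(p_vecs.shape[1])
    index.add(p_vecs.numpy())

    # Embed the query and fetch its nearest passage.
    q_vec = q_enc(**q_tok("How does compounding work?", return_tensors="pt")).pooler_output
    scores, ids = index.search(q_vec.numpy(), 1)

print(passages[ids[0][0]])  # the semantically closest passage
```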
The FinRAD Dataset is a resource specifically designed for the financial domain, comprising 97,733 question-answer pairs sourced from financial reports, news articles, and investor documents. It provides a benchmark for evaluating knowledge retrieval systems, focusing on the ability to accurately identify relevant financial context to answer complex questions. The dataset is split into training, validation, and test sets, and includes both multiple-choice and open-ended question formats. FinRAD’s structure allows for assessing both the retrieval component (measuring the precision of relevant document identification) and the generation component, which evaluates the quality of answers formulated from the retrieved context. Its size and financial focus make it a suitable dataset for developing and benchmarking retrieval-augmented generation (RAG) systems tailored for financial question answering applications.
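Scoring the retrieval side of such a benchmark usually reduces to a hit-rate-style metric over the top-k results. The sketch below assumes hypothetical field names (`question`, `gold_passage`), since FinRAD’s actual schema is not reproduced here.

```python
# Hypothetical retrieval evaluation; the field names are illustrative only.
def hit_rate_at_k(dataset, search, k: int = 5) -> float:
    """Fraction of questions whose gold passage appears in the top-k results."""
    hits = 0
    for example in dataset:
        retrieved = search(example["question"], k)
        hits += example["gold_passage"] in retrieved
    return hits / len(dataset)
```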
Empirical Validation of Model Performance
Pre-trained language models, including architectures such as BERT, RoBERTa, SpanBERT, GPT-2, T5, GPT-4, and Gemini 1.5 Pro, serve as the core linguistic engine for these systems due to their capacity for robust language understanding. These models are initially trained on massive text corpora, enabling them to learn statistical relationships between words and phrases and to develop contextualized word embeddings. This pre-training provides a strong foundation for downstream tasks, reducing the need for extensive task-specific training data and improving performance on tasks requiring semantic understanding, reasoning, and generation. The models differ in their training objectives: BERT and RoBERTa, for example, use masked language modeling, while GPT-2 and later GPT versions use causal language modeling. All, however, leverage transfer learning to adapt pre-existing knowledge to new tasks.
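The difference between the two objectives is easy to demonstrate with the Hugging Face `transformers` pipelines; the prompts below are illustrative only.

```python
from transformers import pipeline

# Masked objective (BERT/RoBERTa family): predict a hidden token in context.
fill = pipeline("fill-mask", model="roberta-base")
print(fill("Revenue grew because <mask> increased.")[0]["token_str"])

# Causal objective (GPT family): continue the sequence left to right.
gen = pipeline("text-generation", model="gpt2")
print(gen("Revenue grew because", max_new_tokens=10)[0]["generated_text"])
```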
Performance gains in financial question answering (QA) systems are achieved through fine-tuning pre-trained language models using datasets such as the FinQA Dataset. This process involves optimizing model parameters with techniques including the Adam Optimizer, which adjusts learning rates adaptively, and the CrossEntropyLoss function, which quantifies the difference between predicted and actual values. Further refinement is accomplished with the ReduceLROnPlateau scheduler, which dynamically reduces the learning rate when a performance metric plateaus, preventing overfitting and enabling more precise convergence. These optimization strategies contribute to improved execution accuracy on complex financial reasoning tasks.
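How these three components interact is easiest to see in a skeleton training loop. The tiny linear model, random data, and hyperparameters below are placeholders, not the paper’s settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                      # stand-in for a pre-trained LM head
inputs = torch.randn(64, 16)                  # placeholder features
labels = torch.randint(0, 4, (64,))           # placeholder gold labels

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()             # compares logits with gold labels
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)  # halve LR when loss stalls

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())               # scheduler watches the tracked metric
```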
Specialized language models such as SecBERT are pre-trained on large corpora of financial documents, enabling improved performance when processing domain-specific terminology and tasks. Current state-of-the-art results on Financial Numerical Reasoning QA tasks are achieved by utilizing Gemini 1.5 Pro in conjunction with a multi-retriever Retrieval-Augmented Generation (RAG) system, demonstrating an execution accuracy of 68.39%. This approach currently represents the highest reported accuracy on this benchmark, exceeding previous results and highlighting the effectiveness of combining advanced language models with information retrieval techniques.
Performance evaluations demonstrate incremental gains achievable through model and architectural adjustments. Specifically, transitioning from the `Gemini-1.0-pro` model to `Gemini 1.5 Pro` yielded a 2% improvement in execution accuracy. Independently of this change, a neural symbolic model achieved an execution accuracy of 61.24% on the `FinQA Dataset`, exceeding the originally published results of the `FinQA` paper, which used 300 training epochs. Additionally, a system combining a `SecBERT Internal Retriever` with the `RoBERTa Large` language model improved execution accuracy by 3.46% relative to a baseline model.

Beyond Simple Recall: Towards Cognitive Integration
Retrieval-Augmented Generation, while a considerable advancement in leveraging external knowledge, represents a stepping stone toward genuinely intelligent systems. Future investigations are increasingly focused on techniques that move beyond simply retrieving relevant documents and instead prioritize integrating that knowledge into a cohesive, reasoned response. Prompt-based generator approaches, in particular, offer a promising pathway by dynamically constructing prompts that guide large language models to synthesize information and draw inferences. These methods don’t just present facts; they enable the model to actively reason with the retrieved knowledge, potentially unlocking more nuanced, accurate, and contextually aware outputs, ultimately surpassing the limitations of simple information recall and approaching a higher form of cognitive processing.
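Concretely, a prompt-based generator might assemble retrieved passages into an instruction that forces step-by-step, citation-grounded reasoning. The template below is a hypothetical sketch, not the paper’s prompt.

```python
# Hypothetical prompt builder; the instruction wording is illustrative only.
def build_reasoning_prompt(question: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "You are a financial analyst. Using only the numbered passages,\n"
        "reason step by step, cite passages as [n], then state a final number.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}\nStep-by-step answer:"
    )
```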
Achieving truly scalable and reliable retrieval-augmented generation (RAG) systems hinges on efficient knowledge access, and advancements in dense passage retrieval (DPR) coupled with fast similarity search libraries like FAISS represent a crucial step forward. DPR moves beyond traditional keyword-based methods by learning semantic representations of both queries and knowledge passages, allowing for more nuanced and relevant matches. FAISS then accelerates the search process, enabling rapid identification of the most pertinent information within massive datasets, a capability essential for real-time applications and for handling the ever-growing volumes of financial data. This combination significantly reduces latency and improves the accuracy of knowledge retrieval, forming a robust foundation for sophisticated financial models that can effectively leverage external knowledge sources.
Emerging research focuses on constructing Neural Symbolic Generator architectures, a novel approach designed to meld the pattern-recognition capabilities of neural networks with the logical rigor of symbolic reasoning. These hybrid systems aim to address a critical limitation of current financial modeling: a lack of transparency and interpretability. By integrating symbolic knowledge representation and reasoning, these generators can not only predict market trends but also articulate the underlying rationale for those predictions in a human-understandable format. This combination promises financial models that are demonstrably more trustworthy, allowing stakeholders to verify the logic driving critical decisions and fostering greater confidence in automated financial systems. The potential extends beyond simple prediction, enabling the generation of detailed financial reports and justifications grounded in both data analysis and established financial principles.
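One common realization of this idea, in the spirit of the operation programs used by the original FinQA work, has the neural component emit a small symbolic program that a deterministic interpreter then executes. The toy evaluator below shrinks that DSL to four arithmetic operations for illustration.

```python
# Toy evaluator for FinQA-style operation programs; the four-op DSL is a
# deliberate simplification, not the full grammar used in practice.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program):
    """Each step is (op, x, y); a string '#i' refers to the result of step i."""
    results = []
    for op, x, y in program:
        resolve = lambda v: results[int(v[1:])] if isinstance(v, str) else v
        results.append(OPS[op](resolve(x), resolve(y)))
    return results[-1]

# "Revenue rose from 90 to 100; what was the growth rate?"
print(run_program([("subtract", 100, 90), ("divide", "#0", 90)]))  # 0.111...
```

Because the arithmetic runs outside the network, each answer arrives with a program that can be audited step by step.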
The pursuit of robust financial question answering, as detailed in this work, demands a precision mirroring mathematical law. This research elegantly demonstrates how integrating structured, domain-specific knowledge (financial dictionaries and symbolic reasoning) into a Retrieval-Augmented Generation system elevates Large Language Models beyond mere pattern recognition. As Marvin Minsky stated, “Common sense is the collection of things everyone knows, but no one can explain.” This paper strives to explain financial common sense to these models, moving beyond superficial understanding towards verifiable, reasoned answers. The multi-retriever approach isn’t simply about finding more information, but about curating knowledge that supports provable conclusions: a harmony of symmetry and necessity in algorithmic design.
Beyond the Retrieval Horizon
The presented work, while demonstrating improved performance in financial question answering, merely addresses a symptom, not the disease. The continued reliance on Large Language Models as fundamentally probabilistic pattern matchers remains problematic. Numerical reasoning, in its truest form, demands logical deduction, not statistical inference. The integration of financial dictionaries, while a step towards grounding, does not imbue the system with genuine understanding of underlying economic principles. Each added knowledge source is, in essence, another layer of empirical observation, and a potentially leaky abstraction.
Future efforts must move beyond the superficial enrichment of retrieval mechanisms. A fruitful direction lies in exploring formal methods for representing financial knowledge: symbolic reasoning systems capable of verifying the logical consistency of answers, rather than simply generating plausible ones. The current paradigm prioritizes scaling model size; a more elegant solution would minimize computational complexity through mathematically rigorous representation. The pursuit of ‘state-of-the-art’ benchmarks should not overshadow the fundamental need for provable correctness.
Ultimately, the goal is not to build systems that appear intelligent, but systems that are demonstrably so. The field requires a shift in focus, from statistical prowess to logical precision. The elegance of a solution, after all, resides not in its empirical performance, but in the purity of its underlying mathematics.
Original article: https://arxiv.org/pdf/2512.23848.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/