Author: Denis Avetisyan
This research details a method for building specialized training data to enhance the reasoning capabilities of artificial intelligence in complex financial scenarios.

A pipeline for constructing synthetic instruction datasets with reasoning traces improves large language model performance in the Japanese financial domain.
Achieving robust reasoning capabilities in large language models remains a key challenge when adapting them to specialized domains. This is addressed in ‘Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain’, which proposes a method for automatically generating high-quality instruction datasets enriched with chain-of-thought reasoning traces. Evaluation on financial benchmarks demonstrates that training with this synthetic data significantly improves model performance in the targeted domain. Can this pipeline be generalized to facilitate rapid LLM adaptation across diverse specialized fields and unlock new levels of domain expertise?
The Challenge of Specialized Financial Reasoning
Despite the impressive general knowledge and language processing abilities of Large Language Models, applying them directly to the complexities of Japanese finance proves challenging. These models, typically trained on broad datasets, often lack the specialized understanding of unique financial terminology, intricate regulatory landscapes, and the specific reasoning patterns prevalent within the Japanese economic system. Consequently, a direct application frequently results in inaccurate interpretations, flawed analyses, and an inability to effectively address the nuanced questions common in this field. The gap between general linguistic competence and domain-specific expertise highlights the need for targeted adaptation strategies to unlock the full potential of LLMs in specialized financial contexts.
Pre-trained Large Language Models, while powerful in general language processing, often struggle when applied to the intricacies of financial analysis. Terms like ‘yield curve’ or ‘beta’ carry no intrinsic meaning for a model trained only on general text, and such models are unfamiliar with the complex regulatory landscapes governing financial institutions. Consequently, substantial adaptation is required; simply feeding a model financial reports will not yield insightful analysis. This adaptation involves not just expanding the model’s vocabulary, but also equipping it with the capacity for specialized reasoning (for instance, discerning subtle risks within a balance sheet or interpreting the implications of a changing interest rate environment), skills that require targeted training and fine-tuning beyond general language proficiency.
Successfully deploying Large Language Models in specialized fields like finance demands more than simply scaling model size; it requires deliberate knowledge transfer and performance enhancement strategies. Researchers are exploring techniques such as fine-tuning pre-trained models on domain-specific datasets, incorporating financial knowledge graphs to augment contextual understanding, and employing reinforcement learning from human feedback tailored to financial reasoning tasks. These approaches aim to bridge the gap between general language proficiency and the intricate requirements of financial analysis, risk assessment, and regulatory compliance. The goal is not to retrain models from scratch, but rather to efficiently adapt existing capabilities, allowing LLMs to navigate complex financial data, interpret nuanced language, and ultimately, provide reliable and insightful support for financial professionals.
Constructing a Robust Financial Instruction Dataset
The creation of a specialized instruction dataset is essential for effectively fine-tuning Large Language Models (LLMs) to accurately interpret and execute financial instructions. General-purpose LLMs often lack the nuanced understanding of financial terminology, regulatory constraints, and specific task requirements inherent in financial applications. A dedicated dataset, comprising diverse financial instructions paired with correct responses, allows the LLM to learn these specific patterns and relationships. This targeted training improves performance on tasks such as portfolio analysis, fraud detection, risk assessment, and customer service within the financial domain, ultimately increasing the reliability and usability of the LLM in real-world financial workflows.
Data cleaning within the financial instruction dataset utilized a combined approach of N-gram Filtering and MinHash & Locality Sensitive Hashing (LSH). N-gram Filtering identifies and removes repetitive or low-information content based on sequences of tokens, while MinHash & LSH efficiently detects and eliminates near-duplicate entries by creating hash signatures and grouping similar items. This dual-technique pipeline resulted in a 75.4% data filtering rate, significantly reducing redundancy and improving the overall quality of the dataset by focusing on unique and informative instructions. The filtering process prioritized retaining data points with high information density and minimizing the inclusion of near-identical examples.
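The near-duplicate detection step can be sketched in a few lines. This is a minimal, pure-stdlib stand-in for a MinHash library: each document is shingled into character n-grams, hashed under many seeded hash functions, and the fraction of matching minimum hashes estimates Jaccard similarity. A real pipeline would bucket signatures with LSH rather than comparing all pairs; the example texts and thresholds here are illustrative, not from the paper.

```python
import hashlib
from itertools import combinations

def ngrams(text, n=3):
    """Character n-gram shingles of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles, num_perm=128):
    """Per-seed minimum hash values; the fraction of positions where two
    signatures agree estimates the Jaccard similarity of the shingle sets."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(16, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingles))
    return tuple(sig)

def near_duplicates(texts, threshold=0.7, num_perm=128):
    """Index pairs whose estimated Jaccard similarity clears the threshold.
    (A production pipeline would use LSH buckets instead of all pairs.)"""
    sigs = [minhash_signature(ngrams(t), num_perm) for t in texts]
    hits = []
    for (i, a), (j, b) in combinations(enumerate(sigs), 2):
        sim = sum(x == y for x, y in zip(a, b)) / num_perm
        if sim >= threshold:
            hits.append((i, j))
    return hits

docs = [
    "What is the yield curve and why does it matter?",
    "What is the yield curve, and why does it matter?",
    "Explain beta as a measure of market risk.",
]
print(near_duplicates(docs))  # only the first two near-identical questions are flagged
```

N-gram filtering then handles what deduplication cannot: repetitive, low-information content within a single document rather than redundancy across documents.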
Multi-turn conversation generation was implemented to construct a dataset simulating realistic financial interactions. This technique moves beyond single-turn question-and-answer pairs to create dialogues consisting of multiple sequential exchanges. The generated conversations are designed to reflect the complexity of real-world financial inquiries, including follow-up questions, clarifications, and multi-step instructions. This approach allows for the fine-tuning of Large Language Models (LLMs) to better understand and respond to nuanced financial requests, and to maintain context throughout an extended interaction, improving performance on tasks requiring conversational reasoning.
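The turn-expansion loop can be sketched as follows. `ask_llm` is a hypothetical stand-in for a call to the generator model (a real pipeline would hit an LLM API); the key point is that every generated turn conditions on the full accumulated history, which is what preserves context across the dialogue.

```python
# `ask_llm` is a placeholder for the generator model; it is NOT from the paper.
def ask_llm(prompt: str) -> str:
    return f"[model output for: {prompt.splitlines()[0]}]"

def synthesize_dialogue(seed_question: str, num_turns: int = 3) -> list[dict]:
    """Grow a conversation by alternating generated answers and follow-up
    questions, always conditioning on the full history so that later turns
    stay in context."""
    history = [{"role": "user", "content": seed_question}]
    for _ in range(num_turns):
        context = "\n".join(f"{t['role']}: {t['content']}" for t in history)
        history.append({"role": "assistant",
                        "content": ask_llm("Answer the last user turn.\n" + context)})
        history.append({"role": "user",
                        "content": ask_llm("Write a plausible follow-up question.\n" + context)})
    return history

dialogue = synthesize_dialogue("What does a flattening yield curve signal?")
print(len(dialogue))  # 7 turns: the seed question plus three answer/follow-up pairs
```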
The financial instruction dataset underwent evaluation and refinement utilizing a Large Language Model (LLM) functioning as a judge. This LLM-as-a-Judge approach demonstrated an accuracy score of 81.7% in assessing the dataset’s quality and relevance, surpassing the performance of all other evaluation methodologies employed. This evaluation process involved assessing generated instructions for adherence to financial best practices, clarity, and logical consistency, allowing for iterative improvement of the dataset through targeted revisions and filtering of lower-quality examples.
Synthetic Data for Enhanced Reasoning Capabilities
Synthetic data generation was implemented to address limitations in available datasets for training large language models (LLMs) on complex financial reasoning. This process created a training corpus of 9.5 billion tokens specifically designed to encompass a wide range of financial scenarios. By programmatically generating data, we were able to significantly increase both the volume and diversity of training examples, focusing on instances requiring multi-step reasoning and analytical skills. This approach allows for targeted data creation, ensuring the LLM is exposed to a distribution of financial problems representative of real-world complexities, and mitigates biases present in existing, naturally sourced datasets.
The generation of synthetic training data leverages a Reasoning Large Language Model (LLM) to produce detailed reasoning traces. These traces are not simply answers to financial questions, but rather simulate the step-by-step thought process an analyst would undertake to arrive at a conclusion. This involves decomposing complex financial scenarios into a series of logical inferences, calculations, and justifications. The LLM is prompted to articulate each step in its reasoning, effectively creating a transcript of its analytical process. This approach aims to move beyond superficial pattern matching and enable the training of LLMs capable of transparent and explainable financial decision-making by explicitly modeling the reasoning pathway.
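One common way to elicit and store such traces is to prompt the model to delimit its reasoning, then split the completion into a trace and a final answer for use as separate supervision targets. The prompt template and `<think>` delimiters below are illustrative assumptions, not necessarily the paper's exact format.

```python
# Hypothetical trace-eliciting prompt; delimiter conventions vary by model.
TRACE_PROMPT = """You are a financial analyst. Answer the question below.
First write your step-by-step reasoning inside <think> ... </think>,
then give the final answer on a line starting with 'Answer:'.

Question: {question}"""

def split_trace(completion: str) -> tuple[str, str]:
    """Separate the reasoning trace from the final answer so both can be
    stored as training targets."""
    think, _, rest = completion.partition("</think>")
    trace = think.replace("<think>", "").strip()
    answer = rest.split("Answer:", 1)[-1].strip()
    return trace, answer

sample = ("<think>Rising short rates with flat long rates compress the "
          "spread, which often precedes slowdowns.</think>\n"
          "Answer: It can signal expectations of slower growth.")
trace, answer = split_trace(sample)
print(answer)  # It can signal expectations of slower growth.
```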
The length of generated reasoning traces significantly impacts model performance, but exhibits diminishing returns beyond a certain threshold. Empirical analysis of the synthetic data generation process revealed that increasing the reasoning trace length beyond 1024 tokens did not yield substantial improvements in the LLM’s ability to perform complex financial reasoning. While shorter traces may not provide sufficient context for accurate analysis, extending the trace length past this point resulted in minimal gains and increased computational cost, suggesting an optimal balance exists between trace length and model efficacy. This finding informed the data generation pipeline, prioritizing quality and diversity of reasoning steps over simply increasing the length of each trace.
Traditional Large Language Model (LLM) training often focuses on question-answering tasks, yielding outputs without discernible rationale. Our methodology shifts this paradigm by prioritizing the generation of detailed reasoning traces alongside answers. This compels the LLM to not only provide a conclusion but also to articulate the sequential steps and financial principles used to arrive at that conclusion. Consequently, the model’s outputs are inherently more transparent and explainable, allowing users to audit the logic and identify potential errors in the reasoning process. This capability is critical for high-stakes financial applications where trust and accountability are paramount, and moves beyond simple predictive accuracy towards verifiable and interpretable financial insights.

Validating Domain Adaptation and Performance Gains
To effectively tailor large language models (LLMs) for the complexities of the Japanese financial sector, a two-pronged adaptation strategy was implemented utilizing both Continued Pre-Training and Instruction Tuning. This process focused on refining the Qwen3 and gpt-oss models, initially exposing them to a substantial corpus of financial text to enhance their understanding of domain-specific vocabulary and context through Continued Pre-Training. Subsequently, Instruction Tuning was applied, leveraging a carefully curated dataset of instructions and corresponding financial responses to guide the models in performing specific tasks – such as analysis, summarization, and question answering – within the financial realm. This combined approach allows the LLMs to not only comprehend the nuances of Japanese financial language, but also to generate accurate and relevant outputs tailored to the unique demands of the industry.
The refinement of large language models for specialized fields, such as finance, hinges on their ability to interpret nuanced language and complex reasoning patterns. To achieve this, a meticulously curated Instruction Dataset is employed, comprising a diverse range of financial queries and corresponding solutions. This dataset isn’t simply provided; it undergoes a rigorous cleaning process to eliminate noise and inconsistencies, ensuring the LLM receives high-quality training data. By exposing the model to this refined dataset, it learns to not only recognize financial terminology but also to apply logical reasoning to solve problems and generate accurate, contextually relevant responses. This data-centric approach enables the LLM to move beyond general language understanding and develop a specialized proficiency in the complexities of financial language and reasoning.
Rigorous evaluation of the adapted language models relied on established financial benchmarks, providing a standardized measure of performance and facilitating direct comparison with existing models. A key metric employed was Pass@1, which assesses the model’s ability to generate a correct answer within a single attempt – a crucial indicator of reliability in financial applications. Results consistently demonstrated that the adapted models not only met but exceeded the performance of officially instruction-tuned models across all evaluated subtasks, highlighting the efficacy of the domain adaptation techniques and the quality of the instruction dataset used for refinement. This superior performance underscores the potential for these models to deliver accurate and dependable insights within the complex landscape of financial analysis and reasoning.
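With one completion per task, Pass@1 reduces to a simple fraction, as sketched below. Strict string matching is an assumption for illustration; the actual benchmarks may use task-specific answer matching.

```python
def pass_at_1(samples):
    """Pass@1 with a single sampled completion per task: the fraction of
    tasks whose one answer matches the reference."""
    correct = sum(pred.strip() == gold.strip() for pred, gold in samples)
    return correct / len(samples)

# Toy (prediction, reference) pairs for illustration.
results = [("42", "42"), ("going concern", "going concern"), ("3.2%", "3.5%")]
print(round(pass_at_1(results), 3))  # 0.667
```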
Rigorous evaluation reveals that the adapted large language models achieved substantial performance gains on key Japanese financial benchmarks. Specifically, the models demonstrated improvements ranging from 4.5 to 5.7 points on the japanese-lm-fin-harness, indicating a heightened ability to process and understand complex financial queries. Furthermore, a 0.4 point increase on the pfmt-bench-fin-ja benchmark highlights a strengthened capacity for financial reasoning and problem-solving. These results collectively validate the efficacy of a data-centric domain adaptation strategy, demonstrating that carefully curated and refined datasets are instrumental in tailoring large language models to specialized fields like finance and significantly boosting their performance.
Scaling and Generalizing the Data-Centric Approach
The current paradigm of adapting large language models (LLMs) often centers on model architecture and parameter tuning; however, recent work demonstrates the significant advantages of a data-centric approach, particularly when targeting highly specialized domains. This methodology prioritizes the quality, relevance, and representativeness of the training data itself, recognizing that even the most sophisticated model is limited by the information it receives. By carefully curating and refining datasets specific to complex fields – such as finance, legal analysis, or scientific research – researchers have shown substantial improvements in model performance, often exceeding those achieved through traditional model-centric techniques. This suggests a shift in focus towards systematically improving data, enabling LLMs to move beyond general language understanding and achieve genuine expertise within narrow, well-defined areas.
Continued development centers on broadening the applicability of this data-centric AI methodology beyond the financial sector. Researchers are investigating automated techniques for data generation, allowing for the creation of synthetic datasets tailored to specific domains where labeled data is scarce. Simultaneously, refinement strategies, including active learning and data augmentation, are being explored to improve the quality and efficiency of existing datasets. This scaling effort isn’t simply about applying the same process to new areas; it involves developing adaptable algorithms capable of identifying relevant data, mitigating bias, and ensuring the resulting language models maintain both accuracy and generalizability across diverse subject matter. The ultimate goal is to establish a flexible framework for rapidly customizing large language models to excel in any specialized field, minimizing the need for extensive manual data curation.
The scalability of this data-centric AI methodology hinges on effectively harnessing readily available web-scale data, and resources like Common Crawl represent a pivotal asset in this endeavor. Common Crawl’s massive archive of web pages offers a pre-existing, though often noisy, foundation for constructing domain-specific corpora without the prohibitive costs and time associated with manual data collection and annotation. Researchers are investigating automated filtering and refinement techniques to extract relevant financial information from this vast resource, transforming raw web content into structured datasets suitable for training and evaluating large language models. Successfully leveraging such publicly available data not only accelerates the development of specialized LLMs, but also democratizes access to advanced AI capabilities beyond organizations with substantial proprietary data holdings, ultimately fostering broader innovation in financial analysis and beyond.
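A cheap first pass over such raw web text might be a keyword filter like the sketch below. The term list and threshold are purely illustrative assumptions; a real pipeline would combine this with deduplication and model-based quality scoring before any text reaches training.

```python
import re

# Illustrative finance-term list (English and Japanese); not from the paper.
FINANCE_TERMS = re.compile(
    r"(yield curve|balance sheet|interest rate|portfolio|金利|有価証券)", re.I)

def looks_financial(page_text: str, min_hits: int = 2) -> bool:
    """Keep a page only if it mentions enough finance terms; a coarse
    recall-oriented filter meant to run ahead of costlier scoring."""
    return len(FINANCE_TERMS.findall(page_text)) >= min_hits

print(looks_financial("The yield curve steepened as interest rate cuts loomed."))  # True
print(looks_financial("Top 10 travel destinations for 2026."))  # False
```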
The long-term objective extends beyond simply enhancing language processing capabilities; it envisions large language models functioning as sophisticated financial analysts. This requires imbuing these models with the capacity for nuanced interpretation of financial data, predictive modeling, and the generation of actionable insights – moving beyond pattern recognition to genuine understanding. Success in this area demands not merely fluency in financial terminology, but the ability to synthesize information from diverse sources, assess risk with accuracy, and ultimately, provide reliable support for critical decision-making processes within the financial sector. The potential outcome is a paradigm shift, where LLMs serve as invaluable partners to human analysts, augmenting their expertise and driving more informed outcomes.
The construction of synthetic datasets, as detailed in this study, echoes a fundamental principle of systemic design. Just as a complex system’s behavior stems from its structure, the performance of a large language model is intrinsically linked to the quality and coherence of its training data. The paper’s focus on reasoning traces – explicitly outlining the steps taken to arrive at an answer – exemplifies this. As Bertrand Russell observed, “The point of the mind is to organize experience.” This meticulous organization of data, mirroring a chain-of-thought process, isn’t simply about scaling data volume; it’s about providing the model with a clear, scalable framework for understanding and responding to complex financial inquiries. The ecosystem thrives on clarity, and a well-structured dataset is its foundation.
Future Directions
The construction of synthetic datasets, particularly those incorporating reasoning traces, offers a seductive path toward domain adaptation for large language models. This work, focused on the Japanese financial sector, demonstrates the potential, yet simultaneously highlights the inherent fragility of such systems. The apparent gains are not merely about data quantity, but the structure of that data – the explicit encoding of reasoning. However, the fidelity of this structure remains a critical, and largely unaddressed, concern. How well do these synthetic traces mirror the nuances of genuine expert reasoning, and what are the cascading effects of subtle misrepresentations?
Future research must move beyond simply demonstrating performance improvements on curated benchmarks. A deeper investigation into the generalizability of these models is required, particularly when confronted with novel, ambiguous, or adversarial inputs. The reliance on chain-of-thought prompting, while effective, introduces another layer of complexity, demanding a more robust understanding of its limitations and potential failure modes. Furthermore, the cost of creating and validating these datasets, even synthetic ones, should not be dismissed.
Ultimately, the pursuit of domain-specific intelligence through synthetic data is a worthwhile endeavor, but one fraught with hidden assumptions and potential pitfalls. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.
Original article: https://arxiv.org/pdf/2603.01353.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-04 01:35