Teaching AI to Think Like a Finance Pro

Author: Denis Avetisyan


A new framework automatically refines large language models’ financial reasoning skills by learning from their mistakes, without retraining or new data.

The ASDA framework builds a skill library iteratively: beginning with an initial compilation of student errors, it then refines the library through phases that prioritize both comprehensive coverage of unresolved issues <span class="katex-eq" data-katex-display="false">Q^{\mathrm{gap}}</span> and the prevention of performance regressions <span class="katex-eq" data-katex-display="false">Q^{-}</span>, with each skill update rigorously validated against a defined correctness threshold before integration into the student’s learning prompt.

ASDA leverages error analysis and skill distillation to adapt language models for domain-specific expertise in financial reasoning.

Adapting large language models to specialized domains like financial reasoning presents a paradox: expensive fine-tuning locks expertise into model weights, while training-free methods yield only marginal gains. To address this, we introduce ‘ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning’, a framework that automatically generates human-readable, version-controlled skill artifacts through iterative error analysis, effectively distilling knowledge without modifying model weights. Evaluated on a challenging financial benchmark, ASDA achieves substantial performance improvements (up to +17.33% on arithmetic reasoning), outperforming existing training-free baselines and offering a practical path to auditable domain adaptation. Could this approach unlock a new paradigm for rapidly deploying LLMs across diverse, data-rich industries without the need for costly retraining?


The Limits of Scale: Reasoning’s Foundational Challenge

While large language models demonstrate remarkable proficiency in tasks like text generation and pattern recognition, their capacity for complex reasoning remains surprisingly limited. These models, despite their scale, frequently falter when confronted with problems requiring multiple sequential steps or the application of abstract principles. This isn’t simply a matter of insufficient data; the underlying architecture itself presents a bottleneck. LLMs excel at identifying correlations within vast datasets, but lack an inherent ability to perform deductive or inductive reasoning; essentially, they struggle to understand the relationships between concepts, rather than merely recognizing their co-occurrence. Consequently, tasks demanding genuine problem-solving, planning, or causal inference often expose this fundamental limitation, revealing that scaling parameters alone isn’t a sufficient path toward artificial general intelligence.

The pursuit of enhanced reasoning in large language models has often focused on increasing model size – the number of parameters – yet this approach yields diminishing returns. Studies reveal that simply making these models larger doesn’t necessarily translate to improved problem-solving capabilities; instead, performance gains frequently stem from increased memorization of training data. Consequently, LLMs can excel at tasks seen during training, but struggle with novel situations requiring genuine inference or extrapolation. This reliance on pattern matching, rather than true understanding, highlights a critical limitation: while scale can improve statistical correlations, it doesn’t inherently instill the capacity for abstract thought or reliable generalization – a crucial distinction between mimicking intelligence and actually possessing it.

The core limitation in large language model reasoning isn’t computational power, but a structural inability to effectively handle specialized information. Current architectures treat all knowledge as equally accessible within a vast parameter space, hindering the focused application of expertise crucial for complex problem-solving. Unlike human cognition, which readily compartmentalizes and retrieves domain-specific knowledge, LLMs struggle to discern and prioritize relevant information within their generalized datasets. This results in inefficient processing, where models often attempt to solve problems using broadly applicable patterns rather than precise, nuanced understandings. Consequently, achieving competency in new fields demands extensive and repeated training, a process that is both resource-intensive and ultimately unsustainable as the demand for increasingly specialized AI capabilities grows.

The current paradigm for adapting large language models to new areas of expertise presents a significant practical hurdle: extensive retraining is typically required with each novel domain. This process isn’t merely computationally expensive, demanding substantial energy and resources; it also proves unsustainable in rapidly evolving fields where knowledge is constantly updated. Each retraining cycle essentially restarts the learning process, failing to leverage previously acquired knowledge and creating a costly, iterative loop. This limitation hinders the broader application of these models, as maintaining proficiency across multiple disciplines demands a continuous investment of time and resources that quickly becomes prohibitive, effectively locking expertise within narrowly defined boundaries.

Augmenting the baseline model with a domain-specific skill, detailed in Figure 2, corrected an error on a fixed income question and enabled seven further corrections across related questions.

Distilling Skills: A Framework for Adaptive Reasoning

Automated Skill Distillation and Adaptation (ASDA) is a framework for creating reusable agent skills through the analysis of Large Language Model (LLM) failures. Unlike traditional methods requiring model weight updates, ASDA operates by externalizing specific reasoning processes as independently executable skills. This approach involves identifying patterns in LLM errors, then constructing modular skills – comprising both procedural logic and code – to address these deficiencies. The resulting skills are applied via API access, allowing the LLM to dynamically leverage them during inference without necessitating retraining or altering the core model parameters. This enables continuous improvement and adaptation to new challenges without incurring the substantial computational costs associated with full model fine-tuning.

The initial ‘Warm-up Phase’ of ASDA involves a systematic error analysis process applied to the Large Language Model (LLM). This phase doesn’t involve modifying the LLM itself, but rather focuses on actively querying the model with a diverse set of inputs and meticulously documenting the resulting errors. These errors are categorized based on type – such as logical fallacies, factual inaccuracies, or procedural mistakes – and tagged with relevant metadata including input characteristics and observed failure patterns. The output of this phase is a comprehensive catalog of common LLM errors, serving as the foundation for subsequent skill distillation and adaptation processes. This catalog provides quantifiable data on the LLM’s weaknesses, enabling targeted skill development.
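The warm-up catalog described above can be sketched as a small data structure. Everything below is illustrative: the field names, error categories, and sample records are assumptions made for the sketch, not the paper’s actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ErrorRecord:
    """One logged LLM failure from the warm-up phase (fields are illustrative)."""
    question_id: str
    category: str      # e.g. "arithmetic", "factual", "procedural"
    topic: str         # domain tag, e.g. "fixed_income"
    model_answer: str
    gold_answer: str

@dataclass
class ErrorCatalog:
    """Accumulates failures and quantifies where the model is weakest."""
    records: list = field(default_factory=list)

    def log(self, rec: ErrorRecord) -> None:
        self.records.append(rec)

    def top_categories(self, n: int = 3):
        """Rank error types by frequency to prioritize skill distillation."""
        return Counter(r.category for r in self.records).most_common(n)

catalog = ErrorCatalog()
catalog.log(ErrorRecord("q1", "arithmetic", "fixed_income", "97.2", "98.5"))
catalog.log(ErrorRecord("q2", "arithmetic", "derivatives", "1.10", "1.12"))
catalog.log(ErrorRecord("q3", "factual", "ethics", "A", "C"))
print(catalog.top_categories(1))  # most frequent error type drives the next skill
```

The point of the sketch is that the warm-up phase produces quantifiable weakness data, so skill development can target the dominant failure mode first.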

The Skill Library within ASDA functions as a repository of specialized reasoning modules generated from identified LLM error patterns. Each module, termed a ‘Skill File’, encapsulates a discrete domain-specific procedure – such as performing a specific calculation, retrieving information from a knowledge base, or applying a defined constraint – and includes associated code templates for execution. These Skill Files are not generalized algorithms but rather targeted solutions designed to address frequently observed error cases, allowing the system to bypass problematic areas in the LLM’s reasoning process. The library’s modular design facilitates both expansion with new skills and refinement of existing ones based on ongoing error analysis, contributing to continuous improvement in adaptive reasoning capabilities.

ASDA utilizes Application Programming Interfaces (APIs) to integrate newly distilled skills into a Large Language Model (LLM) without necessitating parameter updates or complete model retraining. This approach bypasses the computational expense and data requirements typically associated with fine-tuning. Specifically, when an LLM encounters a situation identified as solvable by a skill within the Skill Library, ASDA intercepts the process and executes the corresponding code template via the API. The output of this execution is then presented to the LLM as if it were the result of its own reasoning, effectively augmenting its capabilities on-demand. This API-based skill application allows for dynamic adaptation and avoids the limitations of static, pre-trained models.
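A minimal sketch of this on-demand execution path, assuming a skill file pairs a human-readable description with an executable code template; the fixed-income skill, the field layout, and the output formatting are all hypothetical stand-ins, not ASDA’s actual interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SkillFile:
    """A distilled skill: a procedure description plus an executable template.
    This structure is a sketch; the paper's skill-file format may differ."""
    name: str
    description: str
    run: Callable[..., float]  # the code template, here as a plain callable

# Hypothetical skill distilled from a fixed-income error pattern:
# compute the bond price deterministically instead of trusting LLM arithmetic.
zero_coupon = SkillFile(
    name="zero_coupon_price",
    description="Price a zero-coupon bond from face value, yield, and maturity.",
    run=lambda face, y, t: face / (1 + y) ** t,
)

def apply_skill(skill: SkillFile, **kwargs) -> str:
    """Execute the skill and format its output for injection back into the
    LLM's context, as if it were the model's own intermediate result."""
    result = skill.run(**kwargs)
    return f"[skill:{skill.name}] result = {result:.2f}"

print(apply_skill(zero_coupon, face=100, y=0.05, t=3))  # → [skill:zero_coupon_price] result = 86.38
```

The design choice this illustrates: arithmetic is delegated to deterministic code while the LLM keeps responsibility for interpreting the question and the returned value, which is what lets the approach avoid any parameter updates.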

Analysis of a Haiku 3.5 failure using Sonnet 4.5 generated a skill pattern, one of six found in the associated file, that highlights potential areas for improvement.

Dynamic Skill Injection and Performance Validation

The ‘Selector’ component functions as a dynamic prompt engineering module, analyzing incoming questions to identify relevant knowledge and reasoning strategies. This analysis triggers the injection of specific ‘Skill Files’ – pre-defined sets of instructions and examples – directly into the Large Language Model (LLM) prompt. These ‘Skill Files’ are not simply appended; the ‘Selector’ strategically positions them within the prompt to guide the LLM’s reasoning process, effectively providing contextual information and preferred solution pathways. This targeted injection aims to enhance the LLM’s ability to address complex queries by focusing its attention on pertinent skills and reducing the likelihood of errors stemming from insufficient or misapplied knowledge.
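To illustrate the injection flow, a naive keyword matcher can stand in for the Selector. The article does not specify the actual selection logic, so both the matching rule and the skill texts below are assumptions.

```python
def select_skills(question: str, library: dict) -> list:
    """Pick relevant skill texts for a question. A keyword match is a
    deliberately crude stand-in for the Selector's real analysis."""
    return [text for keyword, text in library.items()
            if keyword in question.lower()]

def build_prompt(question: str, library: dict) -> str:
    """Inject selected skill files ahead of the question so they can
    steer the LLM's reasoning toward the preferred solution path."""
    skills = select_skills(question, library)
    header = "\n\n".join(skills)
    return f"{header}\n\nQuestion: {question}" if skills else f"Question: {question}"

# Hypothetical skill library keyed by topic keyword.
library = {
    "duration": "SKILL: Modified duration = Macaulay duration / (1 + y/k).",
    "option": "SKILL: Use put-call parity: C - P = S - K * exp(-rT).",
}
prompt = build_prompt("What is the modified duration of this bond?", library)
print(prompt.splitlines()[0])  # only the matching skill is injected
```

Note that irrelevant skills stay out of the prompt entirely; targeted injection, rather than appending the whole library, is what keeps the model’s attention on pertinent expertise.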

Evaluations utilizing both the Haiku 3.5 and Haiku 4.5 language models demonstrate that the ASDA framework yields substantial performance gains on complex financial reasoning challenges, as measured by the FAMMA benchmark. Specifically, arithmetic reasoning accuracy improved by up to 17.33 percentage points, while non-arithmetic reasoning accuracy saw an increase of 5.95 percentage points when employing ASDA. These results indicate a significant enhancement in the models’ ability to accurately process and solve financially-oriented problems across both arithmetic and non-arithmetic domains.

Performance validation using the FAMMA benchmark dataset demonstrates significant accuracy improvements with the ASDA framework. Initial testing with the Haiku 3.5 model yielded an 8.67 percentage point increase in arithmetic reasoning accuracy. Two subsequent refinement epochs, in which the skill library was iteratively updated without any model training, further enhanced performance, for a total arithmetic accuracy improvement of 17.33 percentage points. These results indicate a substantial and progressive enhancement of the model’s capabilities through the dynamic skill injection and refinement process on the FAMMA dataset.

The ASDA framework incorporates a ‘Self-Teaching’ mechanism where the language model iteratively refines its own skill library without external supervision. Evaluation on arithmetic reasoning tasks demonstrates this process achieves a 6.33 percentage point improvement in accuracy. Notably, this gain represents 73% of the total accuracy improvement observed when utilizing a stronger, externally defined teacher model, indicating a high degree of self-improvement capability within the framework and reducing reliance on curated external datasets for refinement.

Iterative Refinement: Sustaining Robust Reasoning

The architecture of ASDA features a dedicated ‘Iterative Refinement’ phase, functioning as a continuous loop for enhancing the skill library’s overall capabilities. This process isn’t a one-time adjustment, but rather a persistent effort to bolster both performance and the breadth of scenarios the system can effectively handle. Through consistent analysis and targeted improvements, the framework moves beyond simply achieving initial success to actively seeking out and rectifying limitations. This dedication to ongoing refinement allows the skill library to become increasingly resilient, adaptable, and capable of delivering consistently reliable reasoning – ultimately minimizing the need for costly and time-consuming complete retraining cycles.

The iterative refinement phase within the framework actively targets two critical areas for improvement: coverage and safety. Coverage refinement systematically addresses instances where the skill library fails to provide a correct response within the existing training data, essentially filling gaps in its knowledge base. Simultaneously, safety refinement focuses on preventing ‘regressions’ – unexpected declines in performance on previously mastered tasks. This dual approach ensures not only that the system learns to handle a wider range of inputs but also that its existing capabilities remain stable and reliable over time, contributing to a consistently robust and trustworthy performance.
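The dual coverage/safety criterion suggests a simple acceptance gate for each candidate skill update: compare per-question correctness before and after the update, reject on any regression, and require the update to resolve a sufficient share of the previously unsolved questions. The exact acceptance rule below is an assumption for the sketch, not the paper’s stated algorithm.

```python
def accept_update(before: dict, after: dict, threshold: float = 1.0) -> bool:
    """Gate a candidate skill update against two criteria:
    safety  - no previously correct answer may become wrong (regressions),
    coverage - previously unsolved questions must now be answered
               correctly at or above `threshold`.
    `before`/`after` map question ids to correctness booleans."""
    regressions = [q for q in before if before[q] and not after[q]]
    if regressions:                                  # safety: reject outright
        return False
    gap = [q for q in before if not before[q]]       # previously unsolved set
    if not gap:
        return True
    fixed = sum(after[q] for q in gap)               # coverage on the gap set
    return fixed / len(gap) >= threshold

before = {"q1": True, "q2": False, "q3": False}
good   = {"q1": True, "q2": True, "q3": True}   # fills gaps, no regression
bad    = {"q1": False, "q2": True, "q3": True}  # fills gaps but breaks q1
print(accept_update(before, good), accept_update(before, bad))  # → True False
```

Rejecting on any single regression is a deliberately strict reading of the safety criterion; a softer gate could instead bound the regression rate while still requiring net coverage gains.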

The architecture proactively bolsters the skill library’s performance through continuous assessment and targeted improvement. This isn’t simply about achieving initial success, but about maintaining and enhancing capabilities over time. The system actively seeks out instances where the library falls short – identifying gaps in its knowledge or instances of unexpected behavior. Once identified, these weaknesses become the focus of refinement, with adjustments made to ensure consistently accurate and dependable responses. This dynamic process of self-correction isn’t a one-time fix, but an ongoing cycle, allowing the framework to adapt to evolving challenges and maintain a high degree of robustness.

The architecture culminates in a system designed to dramatically lessen the computational burden traditionally associated with maintaining large language model (LLM) performance. Rather than necessitate complete model retraining with each new challenge or identified weakness, this framework prioritizes targeted refinement of existing skills. By actively addressing failures and regressions within a dedicated iterative process, the system fosters a continuously improving skill library that adapts with minimal resource expenditure. This approach not only accelerates the development cycle but also unlocks the potential for more sustainable and cost-effective LLM deployment, paving the way for broadly accessible, robust reasoning capabilities.

The presented framework, ASDA, embodies a pursuit of essential functionality. It distills complex financial reasoning into executable skills through error analysis – a process aligning with the principle that unnecessary complexity obscures understanding. As Henri Poincaré stated, “It is through science that we arrive at truth, but it is through simplicity that we arrive at clarity.” ASDA’s training-free adaptation, focusing on model-specific expertise, demonstrates this simplicity in action. The system minimizes extraneous parameters and training data, offering a focused approach to LLM adaptation – a direct reduction toward the essential, and thus, a form of intellectual economy. The method prioritizes distilling knowledge into usable agent skills rather than relying on brute-force model adjustments, showcasing a preference for elegant solutions.

The Road Ahead

The proliferation of large language models necessitates a reckoning: scale alone does not confer understanding. This work, by automating skill distillation and adaptation, skirts the issue of true intelligence, instead focusing on a more pragmatic remediation of error. If a model stumbles, ASDA offers a bandage, not a cure. The elegance lies in its training-free approach, a tacit admission that further pre-training is often a diminishing return, a desperate piling of data onto a fundamentally flawed structure. Yet, this framework merely addresses symptoms. The underlying causes of these errors – the conceptual gaps, the logical fallacies – remain largely unexamined.

Future iterations should not settle for simply correcting mistakes, but attempt to diagnose why those mistakes occur. A true advance would involve ASDA not just generating executable skills, but also constructing an internal representation of the model’s deficiencies. This is not a matter of adding more parameters, but of distilling existing knowledge into a more coherent and accessible form. The current focus on adaptation is useful, but ultimately superficial.

The ultimate test will be generalization. Can these distilled skills transfer to genuinely novel problems, or are they merely clever patches for a limited set of scenarios? The temptation to demonstrate incremental improvement on benchmark datasets must be resisted. The goal should not be to achieve higher scores, but to build a system that can reason, not simply mimic reasoning. If that proves impossible, then perhaps the entire endeavor is, at its core, a beautiful distraction.


Original article: https://arxiv.org/pdf/2603.16112.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-18 15:40