Can AI Trade? New Benchmark Reveals Limits of Large Language Models in Finance

Author: Denis Avetisyan


A new study challenges the hype around artificial intelligence in financial markets, demonstrating that current large language models struggle with fundamental quantitative trading tasks.

Market-Bench, a novel benchmark, assesses large language models’ ability to implement and backtest basic trading strategies, revealing significant deficiencies in accuracy and risk management.

Despite advances in code generation, large language models (LLMs) still struggle with tasks requiring robust quantitative reasoning. This limitation is explored in ‘Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics’, which introduces a novel benchmark for assessing LLMs’ ability to construct and backtest basic trading strategies from natural language descriptions. Our findings reveal that while current models can often produce executable code, accurately implementing even simple strategies, such as scheduled trading, pairs trading, or delta hedging, remains a significant challenge. Can LLMs overcome these hurdles to become reliable tools for quantitative finance, or will deeper advancements in reasoning and numerical precision be required?


The Allure and Peril of LLMs in Quantitative Finance

Quantitative trading, at its core, demands the swift and accurate processing of vast datasets to identify fleeting market opportunities. Historically, this has been the domain of meticulously crafted algorithms – lines of specialized code designed by quantitative analysts and financial engineers – coupled with the judgment of experienced traders. These systems perform complex calculations, from statistical arbitrage to portfolio optimization, all within milliseconds to capitalize on price discrepancies or predict market movements. The process isn’t simply about number crunching; it requires translating abstract financial theories into concrete, executable strategies, and then continuously monitoring and adapting those strategies to changing market conditions. This traditionally involves a significant investment in both technical infrastructure and highly skilled personnel, making it a challenging and resource-intensive endeavor.

The integration of Large Language Models (LLMs) into quantitative trading presents a compelling, yet cautious, frontier. While these models demonstrate an ability to process and interpret vast datasets – a core requirement for identifying market patterns and executing trades – their application in high-stakes financial environments demands rigorous scrutiny. Current LLMs excel at tasks like sentiment analysis and news summarization, potentially streamlining aspects of financial research. However, translating this capability into consistently profitable trading strategies is a significant hurdle. The inherent unpredictability of financial markets, coupled with the potential for LLMs to generate statistically plausible but economically unsound decisions, raises concerns about their reliability. Thorough backtesting, robust risk management protocols, and continuous monitoring are crucial to mitigate these risks before widespread adoption can occur, ensuring that the promise of automation doesn’t overshadow the potential for substantial financial losses.

The apparent proficiency of Large Language Models in financial contexts can be misleading, as standard evaluation benchmarks often fail to reflect the complexities of actual market dynamics. A recent study utilizing Market-Bench demonstrates this limitation, revealing that even seemingly straightforward trading strategies prove challenging for current LLMs to implement correctly. These models frequently stumble on tasks requiring nuanced understanding of order types, transaction costs, and real-time market data – factors routinely handled by experienced financial professionals. Consequently, performance metrics derived from simplified benchmarks may offer an overly optimistic assessment of an LLM’s true capabilities, potentially leading to flawed investment decisions and highlighting the critical need for more robust and realistic evaluation methodologies.

Beyond Simple Accuracy: Evaluating Financial LLMs

Current financial Large Language Model (LLM) evaluations utilizing benchmarks such as FinanceQA and BizFinBench consistently reveal performance limitations when applied to realistic financial tasks. FinanceQA focuses on question answering requiring reasoning over financial documents, while BizFinBench assesses the ability to extract structured data from financial reports. Results from both benchmarks indicate that LLMs struggle with tasks demanding complex financial reasoning, accurate data extraction, and the synthesis of information from multiple sources. Specifically, LLMs exhibit difficulties in accurately interpreting nuanced financial language, correctly identifying key figures within reports, and drawing logical conclusions based on the presented data, highlighting deficiencies in their ability to effectively process and understand financial information.

FinEval-KR and CFinBench are specifically designed to measure an LLM’s understanding of financial concepts and terminology, employing question-answering formats to test declarative knowledge. Complementing these knowledge-based benchmarks, HumanEval and DS-1000 evaluate the code generation proficiency of LLMs, assessing their ability to produce functional code in response to prompts. This is relevant to financial applications as many tasks, such as quantitative analysis or algorithmic trading, require code implementation. HumanEval focuses on general programming skills, while DS-1000 emphasizes data science competencies, both of which are crucial for developing and deploying LLM-powered financial tools.

Current financial LLM benchmarks frequently assess individual skills, such as question answering or code generation, in isolation from one another. This fragmented evaluation fails to capture the complete performance profile of an LLM when applied to real-world financial tasks that require the integration of multiple capabilities. Market-Bench addresses this limitation by evaluating LLMs on the complete implementation of trading strategies, encompassing tasks from data retrieval and analysis to order execution and portfolio management. This end-to-end approach provides a more comprehensive and realistic assessment of an LLM’s ability to function effectively within a complex financial ecosystem, moving beyond isolated skill measurements to evaluate practical, system-level performance.
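To make the end-to-end idea concrete, here is a purely schematic sketch of such an evaluation loop: prompt a model for a backtest script, run it in isolation, and score the result against a reference implementation. This is not the paper's actual harness; the `generate_code` callable, the convention that the script prints a JSON PnL series, and the MAE comparison are all assumptions made for illustration.

```python
import json
import subprocess
import tempfile
from pathlib import Path

def evaluate_strategy(description: str, reference_pnl: list[float], generate_code) -> dict:
    """Schematic end-to-end check: ask a model for a backtest script, run it,
    and compare its reported PnL series against a reference implementation.
    `generate_code` is a stand-in for whatever LLM call a real harness would use."""
    script = generate_code(description)  # hypothetical LLM call
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "backtest.py"
        path.write_text(script)
        # Run the generated script in a subprocess; it is expected to print a JSON PnL list.
        proc = subprocess.run(
            ["python", str(path)], capture_output=True, text=True, timeout=60
        )
        if proc.returncode != 0:
            return {"executable": False, "mae": None}
        pnl = json.loads(proc.stdout)
        mae = sum(abs(a - b) for a, b in zip(pnl, reference_pnl)) / len(reference_pnl)
        return {"executable": True, "mae": mae}
```

The point of the sketch is the shape of the pipeline rather than any specific scoring rule: executability and numerical fidelity are measured on the same run, which is what distinguishes this style of evaluation from isolated question-answering or code-completion tests.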

Translating Strategy to Backtest: The Role of Market-Bench

Market-Bench is a benchmarking platform designed to assess the capacity of Large Language Models (LLMs) to convert textual descriptions of trading strategies into functional backtests. This process addresses the challenge of translating qualitative, natural language instructions into quantitative, executable code for financial analysis. The platform facilitates automated evaluation by providing a standardized environment and metrics for comparing LLM performance on strategy implementation. By enabling this translation, Market-Bench aims to bridge the gap between the accessibility of natural language and the precision required for algorithmic trading and quantitative research, allowing users to assess an LLM’s ability to understand and operationalize complex trading logic.
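As an illustration of the kind of artifact an LLM is asked to produce, the minimal sketch below backtests a scheduled (TWAP-style) buy program: a parent order split into equal child orders at evenly spaced times. It is not the benchmark's reference implementation; the function names and the synthetic price series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def backtest_scheduled_trading(prices: pd.Series, total_shares: float, n_slices: int) -> pd.DataFrame:
    """Toy backtest of a scheduled buy program: split a parent order into
    equal child orders executed at evenly spaced timestamps."""
    slice_idx = np.linspace(0, len(prices) - 1, n_slices, dtype=int)
    child_qty = total_shares / n_slices

    fills = pd.DataFrame({
        "time": prices.index[slice_idx],
        "price": prices.iloc[slice_idx].values,
        "qty": child_qty,
    })
    fills["cost"] = fills["price"] * fills["qty"]
    return fills

if __name__ == "__main__":
    # Synthetic minute-bar prices standing in for real market data.
    rng = np.random.default_rng(0)
    idx = pd.date_range("2024-01-02 09:30", periods=390, freq="min")
    prices = pd.Series(100 + rng.normal(0, 0.05, size=390).cumsum(), index=idx)

    fills = backtest_scheduled_trading(prices, total_shares=1_000, n_slices=10)
    avg_fill = fills["cost"].sum() / fills["qty"].sum()
    print(f"Average fill price: {avg_fill:.4f} vs mean price {prices.mean():.4f}")
```

Even a toy like this involves the steps that trip models up in practice: indexing prices at the right timestamps, accumulating cost and quantity consistently, and reporting a number that can be checked against a reference.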

LLMs utilize both real-world Market Data and artificially generated datasets to create simulated trading environments for strategy evaluation. Synthetic Book, a key component of these simulations, is derived from Level 10 (L10) Order Book Data, providing a granular representation of order flow and market depth. This allows LLMs to test and refine trading strategies – such as those focused on pairs trading, scheduled execution, or options delta hedging – without the risks and costs associated with live market deployment. The use of synthetic data, particularly Synthetic Book, enables scalable and controlled experimentation, facilitating iterative improvements in strategy performance and robustness prior to live implementation.
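To show how level-10 order-book data is typically consumed by a strategy, the hedged sketch below computes a few standard features (mid-price, spread, size-weighted micro-price, visible depth) from a single snapshot. The array layout and values are assumptions for illustration, not the paper's Synthetic Book format.

```python
import numpy as np

def book_features(bid_px, bid_sz, ask_px, ask_sz):
    """Basic features from a level-10 order-book snapshot:
    mid-price, spread, size-weighted micro-price, and total visible depth."""
    bid_px, bid_sz = np.asarray(bid_px, float), np.asarray(bid_sz, float)
    ask_px, ask_sz = np.asarray(ask_px, float), np.asarray(ask_sz, float)

    mid = (bid_px[0] + ask_px[0]) / 2      # midpoint of best bid and best ask
    spread = ask_px[0] - bid_px[0]
    # Micro-price tilts the mid toward the side with less resting size.
    micro = (bid_px[0] * ask_sz[0] + ask_px[0] * bid_sz[0]) / (bid_sz[0] + ask_sz[0])
    depth = bid_sz.sum() + ask_sz.sum()    # total visible size across 10 levels per side
    return {"mid": mid, "spread": spread, "micro": micro, "depth": depth}

if __name__ == "__main__":
    # Toy snapshot: ten bid and ask levels around 100.00, one cent apart.
    bid_px = 100.00 - 0.01 * np.arange(10)
    ask_px = 100.01 + 0.01 * np.arange(10)
    bid_sz = np.full(10, 500.0)
    ask_sz = np.full(10, 300.0)
    print(book_features(bid_px, bid_sz, ask_px, ask_sz))
```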

Market-Bench evaluates Large Language Models (LLMs) using a suite of trading strategies with varying complexity. Pairs Trading (Strategy 1) currently achieves a Pass@3 rate of 0.80, indicating LLMs successfully generate executable backtests 80% of the time when given three attempts. Scheduled Trading (Strategy 2) is more challenging, with a Pass@3 rate of 0.67, and Options Delta Hedging (Strategy 3) presents the highest difficulty, achieving a Pass@3 rate of 0.65. These differing success rates highlight the range of reasoning and data processing capabilities required for each strategy, and serve as benchmarks for assessing LLM performance in financial applications.
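Pass@k scores of this kind are conventionally computed with the unbiased estimator introduced alongside HumanEval; whether Market-Bench uses that exact formula is an assumption here. The sketch below shows the computation with purely illustrative task outcomes, not the paper's data.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated for a task,
    c = samples that pass, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Aggregate pass@3 across tasks: each entry is (samples drawn, samples passing).
task_results = [(3, 3), (3, 1), (3, 0), (3, 2)]  # illustrative outcomes only
score = sum(pass_at_k(n, c, 3) for n, c in task_results) / len(task_results)
print(f"pass@3 = {score:.2f}")
```

Note that when k equals the number of samples drawn, pass@k reduces to "did any attempt succeed", which is why a handful of attempts can mask how unreliable a single generation still is.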

Beyond Automation: Towards Robust and Insightful Financial AI

The rapid advancement of Large Language Models (LLMs) in finance is being fueled by the emergence of dedicated frameworks and open-source pipelines. Systems like PIXIU offer a structured environment for building and rigorously testing financial LLMs, while collaborative initiatives such as FinGPT and Open-FinLLMs are democratizing access to tools and data. These platforms streamline the development process, allowing researchers and practitioners to move beyond isolated experiments and towards reproducible, scalable solutions. By providing pre-built components, standardized benchmarks, and shared resources, these tools are not only accelerating innovation but also fostering a more collaborative and transparent landscape for financial AI, ultimately enabling more robust and reliable applications in areas like algorithmic trading, risk management, and financial forecasting.

The confluence of large language models and established quantitative finance techniques is poised to redefine financial practices. By synthesizing the analytical rigor of traditional methods – such as time series analysis and statistical arbitrage – with the pattern recognition and predictive capabilities of LLMs, a new era of automated and insightful decision-making is emerging. This integration isn’t simply about replacing existing systems; it’s about augmenting them, allowing for the automation of previously complex tasks like sentiment analysis of news sources, risk assessment based on unstructured data, and the identification of subtle market anomalies. The result is a potential for uncovering previously hidden opportunities, optimizing portfolio construction, and ultimately, improving financial outcomes through a more holistic and data-driven approach. This synergy promises to move beyond reactive strategies towards proactive, predictive models capable of navigating the complexities of modern financial markets.

Evaluating the performance of financial language models extends beyond simple accuracy metrics like Mean Absolute Error (MAE). While current models, such as Gemini 3 Pro – which achieves MAE values of 14.83, 52.22, and 1245.48 for Strategies 1, 2, and 3 respectively – demonstrate predictive capabilities, a truly comprehensive assessment demands consideration of robustness, interpretability, and fairness. Robustness ensures consistent performance across diverse market conditions and unforeseen events, while interpretability allows stakeholders to understand the reasoning behind predictions, fostering trust and enabling informed decision-making. Critically, fairness considerations are essential to mitigate potential biases embedded within the data or algorithms, preventing discriminatory outcomes and promoting equitable access to financial opportunities. These multifaceted evaluations are paramount for responsible deployment of financial AI and building confidence in its long-term viability.
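The metric itself is straightforward; what is compared (for example, a daily PnL series or an equity curve from the generated backtest versus a reference implementation) is not spelled out here, so the following is a minimal sketch of the computation applied to two illustrative equity curves rather than the paper's data.

```python
import numpy as np

def mean_absolute_error(reference: np.ndarray, generated: np.ndarray) -> float:
    """MAE between a reference backtest series and the series produced by
    an LLM-generated implementation of the same strategy."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    return float(np.mean(np.abs(reference - generated)))

# Illustrative numbers only: a reference equity curve vs. a slightly-off model output.
reference = np.array([100.0, 101.5, 99.8, 102.3, 103.1])
generated = np.array([100.0, 101.2, 100.4, 101.9, 103.6])
print(f"MAE = {mean_absolute_error(reference, generated):.3f}")
```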

Future Directions: Enhancing Reliability and Expanding Scope

The efficacy of large language models in finance hinges on robust evaluation, and future benchmarks must move beyond simplified datasets to mirror the intricacies of real-world financial systems. Current assessments often fail to adequately test an LLM’s performance under duress, necessitating the inclusion of comprehensive stress tests that simulate market crashes or economic downturns. Crucially, benchmarks should also verify adherence to complex regulatory compliance standards, such as those governing anti-money laundering or securities trading. Beyond individual firm risk, evaluations need to assess the potential for LLM-driven decisions to contribute to systemic risk – the propagation of instability across the entire financial network – a factor largely absent from current testing protocols. By incorporating these multifaceted challenges, future benchmarks will provide a more accurate and reliable measure of an LLM’s true capabilities and limitations within the financial domain.

Current large language models, while demonstrating proficiency in many tasks, often struggle with the inherent uncertainty of financial markets. Further investigation centers on enhancing their capacity for probabilistic reasoning, moving beyond simple predictions to quantify the range of possible outcomes and associated risks. This involves developing methods for LLMs to not only adapt to shifting economic landscapes and novel data, but also to articulate the rationale behind their analyses and forecasts in a manner accessible to financial professionals and regulators. Such transparency is crucial for building trust and ensuring responsible implementation, as it allows for effective validation of model outputs and identification of potential biases or limitations. Ultimately, improving these capabilities will be paramount for leveraging LLMs as reliable tools for complex financial decision-making and risk assessment.

The convergence of large language models with other artificial intelligence methodologies promises a significant leap forward in financial applications. Integrating LLMs with reinforcement learning allows for the development of adaptive trading strategies and portfolio optimization techniques, where models learn through trial and error in simulated market environments. Furthermore, combining LLMs with causal inference methods moves beyond mere correlation to establish genuine understanding of financial relationships, enabling more robust risk assessments and predictive modeling. This synergy facilitates the identification of true causal drivers of market behavior, rather than spurious correlations, which is crucial for preventing systemic failures and fostering innovation in areas like fraud detection and algorithmic compliance. The resulting systems will not only process information but also reason about it, offering a more nuanced and reliable approach to financial decision-making.

The pursuit of increasingly complex models, as observed in large language models evaluated by Market-Bench, often obscures fundamental deficiencies. The benchmark reveals a struggle with basic quantitative trading, a seeming paradox given the models’ capacity for code generation. This echoes a core tenet of elegant design: simplicity isn’t merely aesthetic; it’s functional. As Marvin Minsky once stated, “The more of a system you can see, the more of it you can understand.” Market-Bench highlights precisely this: the inability of these models to demonstrate a clear, comprehensible grasp of even introductory financial dynamics, despite producing syntactically correct code. The lack of transparency in their reasoning processes ultimately hinders reliable backtesting and risk management, proving that apparent complexity doesn’t equate to genuine intelligence or utility.

What Remains?

The exercise reveals not a deficit of code generation, but a fundamental disconnect. Current large language models can produce instructions for a trading strategy, yet consistently fail to reconcile those instructions with the reality of market simulation. The issue isn’t whether a model can write Python; it’s that it doesn’t, fundamentally, understand price. Stripping away the superficial ability to assemble syntax leaves a void where economic intuition should reside.

Future work must resist the urge to layer complexity atop this fragility. Elaborate risk models or high-frequency strategies are irrelevant while basic backtesting yields nonsensical results. The focus should be on embedding, or perhaps sculpting, a rudimentary understanding of market dynamics – supply, demand, impact – directly into the model’s core. This requires less data, less scaling, and more principled constraint.

The benchmark, in essence, has exposed a limitation not of capability, but of direction. The path forward isn’t to build larger language models, but to build models that speak a smaller, more truthful language about how markets function. What remains, after all, is what truly matters.


Original article: https://arxiv.org/pdf/2512.12264.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-16 15:15