Can AI Pick Stocks? A New Benchmark for Financial Reasoning

Author: Denis Avetisyan


Researchers have developed a rigorous testing ground to assess how well large language models perform at complex portfolio optimization tasks, revealing significant differences in their quantitative abilities.

This study introduces a benchmark framework for evaluating large language models’ performance on mean-variance optimization problems in financial decision-making, comparing GPT-4, Gemini 1.5 Pro, and Llama 3.1-70B.

While large language models (LLMs) demonstrate promise across diverse domains, rigorous evaluation of their quantitative reasoning abilities, particularly in financially relevant contexts, remains a challenge. This is addressed in ‘Constructing a Portfolio Optimization Benchmark Framework for Evaluating Large Language Models’, which introduces a novel benchmark for assessing LLM performance on portfolio optimization problems with mathematically defined solutions. Comparative analysis of GPT-4, Gemini 1.5 Pro, and Llama 3.1-70B reveals significant performance variations, with GPT-4 excelling in risk-based scenarios and Gemini showing strength in return maximization, highlighting both the potential and limitations of current LLMs in applying quantitative skills to finance. Can this framework accelerate the development of reliable and scalable LLM-driven solutions for portfolio management and investment strategy?


The Inherent Fragility of Conventional Financial Models

Conventional financial models, while historically useful, often stumble when confronted with the messy reality of market behavior. These models frequently depend on assumptions of normality, linearity, and rational actors – simplifications that rarely hold true in dynamic economic systems. Real-world data is characterized by outliers, non-linear relationships, and behavioral biases – factors that can significantly distort the predictions generated by these rigid frameworks. Consequently, models built on these foundations may underestimate risk, misprice assets, and fail to accurately forecast market movements. This disconnect between theoretical assumptions and empirical observation necessitates a shift towards more flexible and data-driven approaches capable of capturing the inherent complexity and unpredictability of financial landscapes.

Contemporary financial markets are characterized by interconnectedness, volatility, and a sheer volume of data that surpasses the capabilities of traditional analytical methods. This escalating complexity isn’t merely a quantitative issue; it fundamentally alters the nature of financial problems, demanding systems capable of learning, adapting, and making inferences from incomplete or noisy information. Static models, built on historical correlations, increasingly falter when confronted with novel events or rapidly shifting conditions. Consequently, there’s a growing need for decision-making systems that leverage techniques like machine learning and artificial intelligence – not to replace human judgment, but to augment it by identifying patterns, assessing risks, and generating insights previously hidden within the deluge of market data. The future of financial reasoning hinges on embracing these intelligent systems to navigate an environment where predictability is diminishing and adaptability is paramount.

A Rigorous Framework for Evaluating Financial Acumen

A standardized benchmark framework is crucial for evaluating the financial decision-making competency of Large Language Models (LLMs) due to the high-stakes nature of financial applications and the potential for significant error costs. This framework must provide a consistent and quantifiable method for assessing model performance across various financial tasks, enabling objective comparison between different LLM architectures and training methodologies. Without such a framework, evaluating LLM suitability for financial roles relies on subjective assessments and lacks the rigor required for deployment in regulated industries. Key characteristics of a robust framework include a clearly defined evaluation methodology, a diverse and representative dataset, and metrics that accurately reflect real-world financial performance criteria, such as accuracy, risk assessment, and return on investment.

The benchmark framework employs multiple-choice questions specifically designed to evaluate reasoning capabilities beyond simple factual recall in Large Language Models. These questions present scenarios and require models to apply financial principles to select the most appropriate answer, necessitating a deeper comprehension of the underlying concepts rather than just identifying memorized information. Question construction focuses on presenting information requiring analysis, calculation, and comparative judgement, effectively differentiating between models capable of true understanding and those reliant on pattern matching or keyword recognition. This approach allows for quantitative assessment of a model’s ability to synthesize information and draw logical conclusions within a financial context.
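To make this concrete, the sketch below shows one way a multiple-choice portfolio item could be represented and scored. The field names, the `BenchmarkItem` structure, and the `query_llm` helper are hypothetical illustrations, not the paper's actual benchmark code.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One multiple-choice portfolio question (field names are illustrative)."""
    prompt: str                # scenario text, e.g. expected returns and covariances
    options: dict[str, str]    # label -> candidate portfolio description
    answer: str                # label of the mathematically optimal option

def score_item(item: BenchmarkItem, query_llm) -> bool:
    """Ask the model for an option label and compare it with the ground truth.

    `query_llm` is a hypothetical callable that sends a prompt to a model
    and returns its raw text response.
    """
    choices = "\n".join(f"{label}) {text}" for label, text in item.options.items())
    response = query_llm(f"{item.prompt}\n\n{choices}\n\nAnswer with a single label.")
    predicted = response.strip()[:1].upper()   # take the first character as the label
    return predicted == item.answer

def accuracy(items, query_llm) -> float:
    """Fraction of items answered correctly."""
    return sum(score_item(it, query_llm) for it in items) / len(items)
```

Because every item has a mathematically defined correct option, accuracy can be computed exactly, which is what enables the objective model-to-model comparisons described above.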

FinQA and ConvFinQA represent key datasets for evaluating the contextual financial understanding of Large Language Models (LLMs). FinQA focuses on question answering related to financial documents, requiring models to extract relevant information and synthesize answers. ConvFinQA extends this evaluation to multi-turn conversational settings, simulating realistic financial dialogues where models must maintain context across multiple exchanges to accurately address user queries. Both benchmarks utilize datasets comprised of real-world financial questions and contexts, enabling a more comprehensive assessment of an LLM’s ability to not only identify financial concepts, but also to apply them appropriately within dynamic, conversational scenarios.

Constructing Plausible Financial Distractors

Distance-Based Distractor Generation constructs plausible but incorrect portfolio options by systematically altering asset allocations from a calculated optimal solution. This method operates by defining a multi-dimensional distance metric – often Euclidean or Manhattan distance – within the asset space. Incorrect options are then generated by sampling portfolios within a defined radius of the optimal solution, ensuring they remain relatively close in allocation but deviate sufficiently to produce suboptimal performance. The magnitude of acceptable deviation, and thus the difficulty of the distractor, is controlled by adjusting the radius and the weighting of individual asset variances. This approach avoids creating entirely unrealistic options, focusing instead on subtle variations that require nuanced financial reasoning to identify as incorrect.
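A minimal sketch of this idea follows, assuming Euclidean distance over long-only weight vectors; the radius, minimum-distance, and rejection-sampling choices are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def distance_based_distractors(optimal_weights, n_distractors=3,
                               radius=0.10, min_distance=0.03, seed=0):
    """Sample portfolio weight vectors near the optimal allocation.

    Each distractor lies within `radius` (Euclidean distance) of the optimum
    but at least `min_distance` away, so it stays plausible yet suboptimal.
    Both thresholds are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    w_opt = np.asarray(optimal_weights, dtype=float)
    distractors = []
    while len(distractors) < n_distractors:
        candidate = w_opt + rng.normal(scale=radius / 2, size=w_opt.shape)
        candidate = np.clip(candidate, 0.0, None)   # keep long-only weights
        candidate /= candidate.sum()                # re-normalise to sum to one
        dist = np.linalg.norm(candidate - w_opt)
        if min_distance <= dist <= radius:
            distractors.append(candidate)
    return distractors

# Example: perturb a three-asset optimal allocation
print(distance_based_distractors([0.5, 0.3, 0.2]))
```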

Threshold-Based Distractor Generation constructs plausible but incorrect answer options by introducing portfolio allocations that fall within a predefined performance threshold of the optimal solution. This method identifies alternatives that achieve a level of return reasonably close to the maximum, but at the cost of increased risk or deviation from specified investment constraints. The threshold is determined empirically to ensure distractors appear viable to a user performing a cursory analysis, yet are demonstrably suboptimal upon closer examination of key performance indicators like Sharpe Ratio or Sortino Ratio. This approach differs from random generation, as the alternatives are specifically engineered to test an LLM’s ability to discern nuanced differences in financial performance and justify the selection of the truly optimal portfolio.
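A complementary sketch for the threshold-based variant is shown below, assuming the viability band is defined relative to the optimal portfolio's Sharpe Ratio; the band endpoints, the Dirichlet sampling of candidates, and the long-only setting are illustrative choices, not the paper's procedure.

```python
import numpy as np

def sharpe_ratio(weights, mu, cov, rf=0.0):
    """Sharpe ratio of a weight vector given expected returns and covariance."""
    w = np.asarray(weights, dtype=float)
    return (w @ mu - rf) / np.sqrt(w @ cov @ w)

def threshold_based_distractors(optimal_weights, mu, cov,
                                n_distractors=3, lower=0.70, upper=0.95, seed=0):
    """Generate near-optimal but demonstrably inferior alternatives.

    Candidates are random long-only portfolios whose Sharpe ratio falls between
    `lower` and `upper` times the optimal Sharpe ratio; the band is an assumed
    parameter, tuned so a cursory check makes the options look viable.
    """
    rng = np.random.default_rng(seed)
    sr_opt = sharpe_ratio(optimal_weights, mu, cov)
    distractors = []
    while len(distractors) < n_distractors:
        candidate = rng.dirichlet(np.ones(len(optimal_weights)))  # random feasible weights
        if lower * sr_opt <= sharpe_ratio(candidate, mu, cov) <= upper * sr_opt:
            distractors.append(candidate)
    return distractors
```

Keeping the distractors inside a performance band, rather than sampling them uniformly, is what forces a model to actually compare risk-adjusted performance instead of spotting an obviously bad option.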

The implementation of distance-based and threshold-based distractor generation techniques is intended to move beyond superficial pattern matching in Large Language Models (LLMs) and assess true financial reasoning capability. By creating plausible but incorrect answer options that require nuanced evaluation of portfolio performance and allocation strategies, the benchmark increases in complexity. This forces LLMs to demonstrate an understanding of the underlying financial principles, rather than simply identifying statistically common responses or keywords. Successful performance necessitates a model’s ability to accurately calculate returns, assess risk, and justify optimal investment decisions, thereby providing a more robust measure of financial intelligence.

Quantifying Model Performance and Cognitive Capabilities

A comprehensive benchmark framework was employed to rigorously evaluate the capabilities of several large language models (LLMs), including GPT-4, Gemini 1.5 Pro, and Llama 3.1-70B. This evaluation moved beyond simple question-answering to assess complex reasoning skills crucial for financial applications and general cognitive ability. The framework facilitated a standardized comparison of these models, allowing for quantifiable insights into their strengths and weaknesses across a range of tasks. By employing consistent metrics and challenging problem sets, researchers were able to establish a clear understanding of each model’s potential and limitations, ultimately paving the way for more informed development and deployment of LLMs in various domains.

To understand the scope of these large language models beyond financial expertise, evaluations extended to established reasoning benchmarks like MMLU and HellaSwag. MMLU, a massive multi-task language understanding benchmark, tests knowledge across 57 diverse subjects, while HellaSwag focuses on commonsense reasoning through fill-in-the-blank scenarios. Performance on these tasks provides crucial insight into a model’s general cognitive abilities – its capacity for knowledge retention, logical deduction, and understanding nuanced contexts – independent of specialized financial data. This broader assessment helps determine whether observed success within FinQA stems from genuine reasoning prowess or simply from pattern recognition within a limited domain, ultimately revealing the true potential and limitations of each model’s intelligence.

Model efficacy was rigorously quantified through established portfolio optimization metrics, notably the Sharpe Ratio – a measure of risk-adjusted return – and Conditional Value-at-Risk, which assesses potential losses under adverse market conditions. Results indicate that GPT-4 consistently demonstrated superior performance in risk-based objectives, achieving the highest accuracy in minimizing both portfolio variance and Maximum Drawdown (MDD). This suggests GPT-4 possesses a stronger capacity to construct portfolios that prioritize capital preservation and downside protection compared to Gemini 1.5 Pro and Llama 3.1-70B, which exhibited comparatively lower accuracy in these crucial risk management tasks. The consistent outperformance of GPT-4 in these areas highlights its potential for practical application in financial modeling and investment strategies focused on mitigating potential losses.
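For reference, the three metrics named above can be computed from a return series using their standard definitions; this is a generic sketch, not the paper's evaluation code, and the synthetic return series is made up for illustration.

```python
import numpy as np

def sharpe_ratio(returns, rf=0.0):
    """Mean excess return divided by the volatility of the return series."""
    r = np.asarray(returns, dtype=float)
    return (r.mean() - rf) / r.std(ddof=1)

def conditional_var(returns, alpha=0.05):
    """Conditional Value-at-Risk: average loss in the worst alpha tail."""
    r = np.sort(np.asarray(returns, dtype=float))
    tail = r[: max(1, int(np.ceil(alpha * len(r))))]
    return -tail.mean()

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative return path."""
    wealth = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peaks = np.maximum.accumulate(wealth)
    return np.max((peaks - wealth) / peaks)

# Example with a small synthetic return series
r = np.array([0.02, -0.01, 0.015, -0.03, 0.01])
print(sharpe_ratio(r), conditional_var(r), max_drawdown(r))
```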

Current large language models exhibit a notable limitation in the complex task of maximizing the Sharpe Ratio, a crucial metric for evaluating risk-adjusted investment returns. Evaluations across diverse constraint types consistently revealed an accuracy rate below 10%, indicating a significant challenge in effectively balancing potential gains against inherent financial risk. This suggests that while these models can process financial data and understand portfolio concepts, they struggle to identify optimal investment strategies that consistently deliver high returns relative to the level of risk assumed. Further research and development are therefore needed to enhance the models’ capacity for nuanced financial reasoning and optimization, particularly in scenarios demanding sophisticated risk management.
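To make the difficulty concrete, the ground-truth answer for an unconstrained Sharpe-maximization item can be written in closed form as the textbook tangency portfolio, proportional to the inverse covariance matrix applied to expected excess returns. The sketch below assumes that standard formulation and made-up example inputs; it is not the paper's solver, and constrained variants (long-only, sector caps) require a numerical optimizer instead.

```python
import numpy as np

def tangency_portfolio(mu, cov, rf=0.0):
    """Closed-form maximum-Sharpe portfolio for the unconstrained case.

    w is proportional to inv(cov) @ (mu - rf), rescaled so the weights sum
    to one. Adding constraints removes the closed form, which is where the
    benchmark's harder items live.
    """
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    raw = np.linalg.solve(cov, mu - rf)
    return raw / raw.sum()

# Example: three assets with illustrative expected returns and covariance
mu = np.array([0.08, 0.12, 0.10])
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.06]])
w = tangency_portfolio(mu, cov, rf=0.02)
print(w, (w @ mu - 0.02) / np.sqrt(w @ cov @ w))   # weights and their Sharpe ratio
```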

Comparative analysis of the large language models revealed distinct performance profiles across portfolio optimization tasks. Gemini demonstrated a relative strength in achieving return-based objectives, indicating an aptitude for maximizing gains, though its overall accuracy consistently trailed that of GPT, especially when subjected to portfolio constraints. Conversely, Llama exhibited the lowest performance levels, registering significantly reduced accuracy compared to both Gemini and GPT across all evaluated metrics. This suggests a substantial gap in Llama’s capacity for complex financial reasoning and optimization, while Gemini, though proficient in certain areas, still requires refinement to match GPT’s consistent accuracy, particularly in navigating the complexities of constrained investment strategies.

The pursuit of a robust benchmark, as detailed in this work concerning large language models and portfolio optimization, echoes a fundamental principle of mathematical rigor. The varying performance observed among models like GPT-4, Gemini 1.5 Pro, and Llama 3.1-70B, when faced with quantitative reasoning tasks, highlights the necessity of provable correctness over mere functional output. As Albert Camus stated, “The struggle itself… is enough to fill a man’s heart. One must imagine Sisyphus happy.” This resonates with the iterative process of benchmark creation; refining the evaluation framework, while demanding, is a worthwhile endeavor, akin to perpetually pushing the stone – a pursuit of verifiable truth in the face of complex financial modeling.

Beyond the Numbers

The demonstrated variability in performance across large language models when applied to portfolio optimization highlights a fundamental issue: competence on synthetic benchmarks does not guarantee robust quantitative reasoning. While these models can manipulate the mathematics of mean-variance optimization, the true test lies in their ability to generalize to unseen market conditions, a capacity not readily assessed through isolated problem sets. The benchmark’s mathematically defined solutions demand reproducibility, yet the inherent stochasticity of markets introduces an unavoidable element of chance. The challenge, therefore, isn’t merely achieving numerical accuracy, but modeling uncertainty with mathematical rigor.

Future work must move beyond evaluating models on their ability to find optimal portfolios and focus instead on their capacity to describe the limitations of such optimizations. A truly intelligent system would not simply offer a solution, but quantify the degree of confidence in that solution, acknowledging the inherent unknowability of future market behavior. The current framework serves as a starting point, but lacks the capacity to assess a model’s understanding of its own fallibility – a critical attribute for any system entrusted with financial decision-making.

Ultimately, the pursuit of artificial intelligence in finance demands a shift in perspective. The goal is not to replicate human intuition, but to construct systems grounded in mathematical principles – systems whose behavior is predictable, provable, and, crucially, auditable. Any deviation from this standard risks introducing a new form of opacity, obscuring the very logic it purports to illuminate.


Original article: https://arxiv.org/pdf/2603.09301.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
