Author: Denis Avetisyan
New research explores whether large language models can reliably predict stock performance and reveals the critical role of human oversight.
This review finds that while large language models show promise in financial forecasting, they are susceptible to reasoning errors and require access to high-quality data like regulatory filings to achieve dependable results.
Despite advances in automated financial analysis, consistently outperforming market benchmarks remains a significant challenge. This paper, ‘Large Language Models and Stock Investing: Is the Human Factor Required?’, investigates the capacity of state-of-the-art large language models (LLMs) to generate reliable stock predictions across various prompting strategies. Our findings reveal that while LLMs demonstrate potential, their performance is hampered by recurring reasoning failures and reliance on potentially inaccurate information, necessitating robust human oversight and access to verified data sources like regulatory filings. Can strategically implemented LLMs, guided by human expertise, ultimately unlock new avenues for consistent, data-driven investment success?
The Elusive Signal: Beyond Traditional Market Prediction
Despite decades of refinement in financial modeling and the deployment of increasingly sophisticated data analysis techniques, consistently accurate stock market prediction continues to be a significant challenge. The market’s inherent complexity, driven by a multitude of interacting factors – from macroeconomic indicators and geopolitical events to investor sentiment and unforeseen ‘black swan’ occurrences – introduces substantial noise and uncertainty. While models can identify correlations and predict probabilities, they frequently struggle to account for the non-linear, often irrational, behaviors that characterize financial markets. Consequently, even the most advanced algorithms are prone to errors, highlighting the limitations of relying solely on quantitative analysis and the persistent difficulty in forecasting future market movements with absolute certainty.
Financial markets are notoriously complex adaptive systems, and analyses dependent solely on past performance frequently struggle to anticipate future trends. Static models, while useful for understanding certain baseline conditions, inherently lack the flexibility to account for the myriad of evolving factors – geopolitical events, shifts in investor sentiment, and unforeseen economic disruptions – that constantly reshape market dynamics. The reliance on historical data assumes a degree of predictability that simply doesn’t exist in a system driven by human behavior and subject to random shocks. Consequently, these traditional approaches often produce inaccurate forecasts, failing to capture the subtle, nonlinear relationships that govern price movements and leaving investors vulnerable to unexpected volatility.
The modern financial landscape is characterized by an unprecedented deluge of data, arriving at speeds that overwhelm conventional analytical techniques. This isn’t simply a matter of ‘more data’; the velocity – the rate at which information streams in – fundamentally alters the challenge. Traditional models, built on batch processing of historical data, struggle to keep pace with real-time fluctuations and interconnected events. Consequently, innovation focuses on techniques like high-frequency trading algorithms, machine learning applied to streaming data, and natural language processing of news and social media. These novel approaches aim to not only process the sheer volume of information but also to extract meaningful signals from the noise, identifying patterns and anticipating shifts before they fully materialize – a crucial capability in today’s hyper-competitive markets.
The Emergence of LLMs: A New Paradigm for Market Insight
Large Language Models (LLMs) present a novel approach to stock market prediction due to their capacity for advanced textual analysis. Traditional quantitative methods often struggle with unstructured data sources, whereas LLMs can process and interpret information from diverse text-based financial sources, including news articles, company reports (10-K, 10-Q), earnings call transcripts, and social media feeds. This capability allows LLMs to identify subtle correlations between textual content and subsequent stock price movements that may be missed by conventional algorithms. The models achieve this by employing techniques like natural language processing (NLP) and machine learning to extract relevant entities, assess sentiment, and identify predictive patterns within the textual data. Consequently, LLMs offer the potential to incorporate a broader range of information into predictive models, potentially improving forecast accuracy and offering new insights into market dynamics.
Large Language Models (LLMs) demonstrate potential in stock market prediction through training on extensive datasets comprising financial news articles, Securities and Exchange Commission (SEC) regulatory filings (including 10-K and 10-Q reports), and reports generated by financial analysts. This training allows the LLM to identify statistically significant correlations between textual data and subsequent stock performance. Specifically, LLMs can parse text to quantify sentiment (positive, negative, or neutral) associated with specific companies or market sectors. Furthermore, the models can detect patterns in reported financial data, such as revenue growth, earnings per share, and debt levels, that historically correlate with stock price movements. The capacity to process unstructured text data, combined with quantitative analysis, allows LLMs to potentially predict future stock behavior based on identified patterns and sentiment shifts.
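To make the sentiment-quantification step concrete, here is a minimal sketch using a toy word-count lexicon; a real system would use a trained model, and the word lists below are illustrative assumptions, not anything from the paper.

```python
# Toy lexicon-based sentiment scoring for financial text.
# A production pipeline would use a trained model (e.g., an LLM);
# these word lists are illustrative assumptions only.

POSITIVE = {"growth", "beat", "record", "strong", "upgraded"}
NEGATIVE = {"loss", "miss", "decline", "weak", "downgraded"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative token share."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

headline = "Company reports record revenue growth with strong margins"
print(sentiment_score(headline))  # 1.0 -> uniformly positive tokens
```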
Successful deployment of Large Language Models (LLMs) for financial analysis is heavily dependent on prompt engineering. LLMs do not inherently understand financial contexts; therefore, precisely formulated prompts are crucial to direct the model’s attention to relevant data and desired outputs. Strategies include specifying the type of analysis – such as sentiment analysis, trend identification, or risk assessment – and defining the scope of the input data, like focusing on specific companies or sectors. Furthermore, prompts should clearly articulate the desired output format, whether it’s a numerical prediction, a textual summary, or a categorized risk level. Iterative refinement of prompts, coupled with rigorous validation of generated insights against historical data, is essential to minimize inaccuracies and maximize predictive performance.
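As a rough illustration of the prompt-engineering principles above, the sketch below assembles a structured prompt that fixes the analysis type, data scope, and output format. The template, field names, and ticker are assumptions for demonstration, not the paper's actual prompts.

```python
# Sketch of a structured financial-analysis prompt. Template and
# example values are illustrative assumptions.

def build_prompt(ticker: str, analysis_type: str, sources: list[str]) -> str:
    return (
        f"You are a financial analyst. Task: {analysis_type} for {ticker}.\n"
        f"Use only these data sources: {', '.join(sources)}.\n"
        "Output format:\n"
        "1. One-sentence summary of the key signal.\n"
        "2. Sentiment: positive / neutral / negative.\n"
        "3. Risk level: low / medium / high, with one supporting fact.\n"
        "Do not speculate beyond the provided sources."
    )

print(build_prompt("AAPL", "sentiment analysis of the latest 10-Q",
                   ["SEC 10-Q filing", "earnings call transcript"]))
```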
From Naive Queries to Sophisticated Reasoning: The Evolution of Prompting
Initial experimentation with unrefined, or “naive,” query prompts yielded limited financial performance, registering an average monthly excess return of only 0.35%. This result indicated a significant deficiency in the model’s ability to interpret unstructured requests and generate actionable insights without specific direction. The low return rate underscored the necessity of implementing more formalized prompting techniques, providing the LLM with clearly defined parameters and analytical frameworks to improve the quality and utility of its responses. This initial benchmark served as a critical baseline for evaluating the effectiveness of subsequent, more structured prompting strategies.
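For readers unfamiliar with the metric, an average monthly excess return such as the 0.35% figure is computed as the strategy's monthly return minus the benchmark's, averaged across months. The return series in this sketch is fabricated purely to show the arithmetic.

```python
# Average monthly excess return: strategy minus benchmark, averaged.
# The return series below is fabricated for illustration.

strategy = [0.012, -0.004, 0.021, 0.008]   # monthly strategy returns
benchmark = [0.010, -0.006, 0.015, 0.006]  # monthly benchmark returns

excess = [s - b for s, b in zip(strategy, benchmark)]
avg_excess = sum(excess) / len(excess)
print(f"average monthly excess return: {avg_excess:.2%}")  # 0.30%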
The implementation of structured prompts, characterized by the provision of defined analytical frameworks and explicit instructions, resulted in a demonstrable improvement in Large Language Model (LLM) performance. Unlike naive prompting approaches, which yielded an average monthly excess return of 0.35%, structured prompts facilitated more consistent and accurate outputs. This improvement stems from the reduction of ambiguity and the channeling of the LLM’s processing towards specific, pre-defined tasks. The framework directs the model to follow a logical sequence, increasing the reliability of generated insights and enabling more effective downstream analysis.
Implementation of chain-of-thought reasoning, a prompting technique that encourages the LLM to articulate its reasoning process step-by-step, yielded a substantial improvement in financial forecasting accuracy. This method was coupled with iterative review loops and human validation of the LLM’s outputs to correct errors and refine the model’s approach. Quantitative results demonstrate an average monthly excess return of 3.04% when utilizing this combined prompting and validation strategy, representing a significant increase over initial naive prompting methods and highlighting the benefits of incorporating structured reasoning and human oversight.
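The loop described above can be sketched schematically as follows. Here `query_llm` and `human_review` are hypothetical stand-ins for an actual LLM API call and an analyst's validation step; the paper does not specify this interface.

```python
# Schematic of chain-of-thought prompting with iterative human review.
# `query_llm` and `human_review` are hypothetical stubs, not a real API.

COT_SUFFIX = "Think step by step and show your reasoning before the verdict."

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def human_review(answer: str) -> tuple[bool, str]:
    """Analyst approves the answer or returns a correction note."""
    raise NotImplementedError("plug in your review workflow here")

def forecast_with_oversight(base_prompt: str, max_rounds: int = 3) -> str:
    prompt = f"{base_prompt}\n{COT_SUFFIX}"
    answer = query_llm(prompt)
    for _ in range(max_rounds):
        approved, note = human_review(answer)
        if approved:
            return answer
        # Feed the reviewer's correction back into the next round.
        prompt = f"{prompt}\nReviewer feedback: {note}\nRevise your analysis."
        answer = query_llm(prompt)
    return answer  # best effort after max_rounds of review
```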
Evaluation of several Large Language Models – ChatGPT, Gemini, DeepSeek, and Perplexity – was conducted using the established prompting frameworks. Performance metrics indicated that Perplexity achieved a 17% improvement in Information Ratio when prompted to incorporate data sourced from regulatory filings. This suggests that Perplexity, relative to the other tested LLMs, is particularly effective at leveraging structured data from official sources to refine its analytical output and improve the risk-adjusted returns identified within the investment strategy.
Classification accuracy under structured prompting reached 0.579, meaning the model assigned the correct label to roughly 57.9% of assets based on the provided input and instructions, a statistically significant improvement over random chance. This level of discrimination is crucial for applications such as portfolio construction, risk assessment, and automated trading strategies, where accurate asset categorization is paramount.
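The metric itself is simply the fraction of correct labels; the buy/hold/sell labels in this toy sample are fabricated to show the computation, not taken from the study.

```python
# Classification accuracy: fraction of correctly predicted labels.
# The labels below are fabricated to illustrate the metric.

predicted = ["buy", "sell", "hold", "buy", "sell", "buy", "hold"]
actual    = ["buy", "hold", "hold", "buy", "sell", "sell", "hold"]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(f"accuracy = {accuracy:.3f}")  # 5/7 ≈ 0.714 in this toy sample
```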
The Necessary Human Element: Validating Insight and Mitigating Risk
Large language models, despite their capacity for identifying patterns and generating seemingly insightful predictions, are not immune to fundamental errors in reasoning. These models operate by statistically predicting the most likely continuation of a given input, a process that can lead to confidently stated, yet entirely inaccurate, conclusions – a phenomenon often described as “hallucination.” This susceptibility arises from the models’ lack of genuine understanding; they manipulate symbols without possessing the contextual awareness or common sense reasoning capabilities of a human analyst. Consequently, LLM-derived insights must be treated with caution, as the models may struggle with nuanced financial data, unforeseen market conditions, or even simple logical fallacies, potentially leading to flawed investment strategies if relied upon without verification.
Large language models, despite their increasing sophistication, are not infallible and can produce flawed reasoning or inaccurate predictions when applied to complex financial analysis. Consequently, human oversight remains a critical component of any investment strategy utilizing these technologies. Experts are needed to validate the outputs generated by LLMs, meticulously identifying potential errors or biases that might otherwise lead to unsound investment recommendations. This process involves not simply accepting the model’s conclusions, but rather scrutinizing the underlying logic and data used to arrive at those conclusions, ensuring that recommendations align with established financial principles and market realities. Integrating human judgment safeguards against the risks inherent in relying solely on algorithmic outputs, ultimately bolstering the reliability and trustworthiness of the entire investment process.
The predictive power of Large Language Models in financial analysis is notably strengthened when integrated with established quantitative methods. Rather than replacing traditional techniques, these models function optimally as a complement to financial ratios and multi-factor models, offering a more holistic evaluation of investment opportunities. These corroborative tools allow for verification of LLM-generated insights, identifying potential discrepancies or biases in the model’s reasoning. By cross-referencing LLM predictions with metrics such as price-to-earnings ratios, debt-to-equity ratios, and factors like value, momentum, and quality, analysts gain a more comprehensive perspective, reducing the risk of relying solely on potentially flawed algorithmic outputs and ultimately improving the robustness of investment strategies.
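One way such cross-referencing might look in practice is sketched below: an LLM verdict is checked against simple quantitative screens before being acted on. The thresholds and flag logic are illustrative assumptions, not rules from the paper.

```python
# Sketch of cross-checking an LLM verdict against quantitative screens.
# Thresholds are illustrative assumptions, not rules from the paper.

def quantitative_flags(pe_ratio: float, debt_to_equity: float,
                       momentum_6m: float) -> list[str]:
    flags = []
    if pe_ratio > 40:
        flags.append("valuation stretched (P/E > 40)")
    if debt_to_equity > 2.0:
        flags.append("high leverage (D/E > 2)")
    if momentum_6m < 0:
        flags.append("negative 6-month momentum")
    return flags

def validated_signal(llm_verdict: str, **fundamentals) -> str:
    flags = quantitative_flags(**fundamentals)
    if llm_verdict == "buy" and flags:
        return f"review required: LLM says buy, but {'; '.join(flags)}"
    return llm_verdict

print(validated_signal("buy", pe_ratio=55.0, debt_to_equity=0.8,
                       momentum_6m=0.12))
```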
The integration of human validation into large language model (LLM) stock market predictions demonstrably reduces investment risk and amplifies potential returns. This "human-in-the-loop" methodology doesn't replace algorithmic power, but rather tempers it with critical oversight, identifying and correcting potential reasoning errors within LLM outputs. The study reports that this combined approach, when paired with chain-of-thought prompting, achieves an Information Ratio of 0.68, a key indicator of risk-adjusted returns: for every unit of active risk taken, the strategy generates 0.68 units of excess return, suggesting a robust and effective system for leveraging LLMs in financial forecasting while maintaining a prudent risk profile.
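The Information Ratio behind that 0.68 figure is defined as the mean active return (strategy minus benchmark) divided by its standard deviation, the tracking error. The return series in this sketch is fabricated for illustration and will not reproduce the paper's number.

```python
# Information Ratio: mean active return / tracking error.
# The return series below is fabricated for illustration.

import statistics

strategy = [0.031, 0.012, 0.028, -0.005, 0.040, 0.019]
benchmark = [0.010, 0.008, 0.004, 0.002, 0.011, 0.006]

active = [s - b for s, b in zip(strategy, benchmark)]
ir = statistics.mean(active) / statistics.stdev(active)
print(f"information ratio = {ir:.2f}")
```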
The pursuit of automated stock prediction, as explored in this paper, often yields systems that appear functional yet lack genuine understanding: pipelines that patch together correlations without causal reasoning hold only as long as the duct tape does. This echoes Karl Popper's sentiment: "The more we learn about the world, the more we realize how little we know." The study highlights how LLMs, despite their capabilities, stumble on reasoning tasks without contextual data such as regulatory filings. Modularity, in the form of LLM components, doesn't guarantee control; it requires a holistic view, acknowledging the inherent limitations of data-driven prediction and the necessity of human oversight to mitigate reasoning failures.
Beyond the Algorithm
The pursuit of automated stock prediction, as this work demonstrates, quickly reveals a fundamental truth: systems are not collections of isolated components, but interwoven architectures. While large language models exhibit a surface aptitude for parsing financial data, their vulnerability to reasoning failures suggests the problem isn’t simply one of information access, but of understanding. Modifying one part of the predictive engine – adding more data, refining the model – triggers a cascade of consequences elsewhere, often unforeseen. The apparent promise of LLMs, therefore, resides not in replacing human analysis, but in augmenting it – providing a more comprehensive, yet still fallible, view of a complex landscape.
Future research must address the structural underpinnings of these failures. A focus on why an LLM misinterprets information, rather than merely that it does, is crucial. Access to high-quality data, like regulatory filings, is a necessary condition, but insufficient on its own. The architecture of the model itself – its capacity for causal reasoning, its ability to handle ambiguity – dictates its performance.
Ultimately, the question isn’t whether an algorithm can predict the market, but whether it can understand it. And understanding, it appears, demands a level of systemic awareness that remains, for the present, beyond the reach of purely computational systems. The challenge, then, lies not in building a better predictor, but in designing a more integrated system – one that acknowledges the inherent limitations of any single component.
Original article: https://arxiv.org/pdf/2603.19944.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/