Trading on Thin Air: When AI-Powered Signals Lose Their Edge

Author: Denis Avetisyan


New research reveals that while large language models can generate promising indicators for financial trading, their performance is fragile and prone to failure during periods of economic volatility.

During periods of heightened market volatility, as measured by the VIX, an agent augmented by large language models exhibits diminished performance relative to a standard baseline; this disparity lessens or inverts when volatility subsides, suggesting a sensitivity to systemic risk.

This paper investigates the regime-dependent behavior of features extracted from large language models used in reinforcement learning-based trading policies, highlighting the impact of distribution shift and the need for robust evaluation metrics.

Despite advances in machine learning for financial forecasting, translating predictive signals into robust trading strategies remains a persistent challenge. This paper, ‘When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies’, investigates the use of large language models (LLMs) to generate features for reinforcement learning-based trading agents, finding that while LLMs can identify predictive patterns in news and filings, their effectiveness is highly sensitive to changing market conditions. Specifically, LLM-derived features degrade performance during macroeconomic shifts, highlighting a disconnect between feature-level validity and policy-level robustness. Can we develop more adaptive methods for incorporating LLM signals that account for distributional changes and ensure consistent performance across diverse market regimes?


The Illusion of Prediction: Beyond Handcrafted Features

For decades, financial forecasting has depended on analysts meticulously constructing features – quantifiable variables – from raw data. These handcrafted features, while seemingly logical, often prove inflexible and unable to adapt to the ever-changing complexities of financial markets. The inherent limitations stem from the difficulty of anticipating every relevant signal and translating subjective insights into objective, measurable inputs. Consequently, these models struggle to capture the subtle nuances and contextual information embedded within unstructured data, such as news articles or company reports, leading to brittle performance when market conditions shift. This reliance on pre-defined indicators frequently results in missed opportunities and an inability to effectively respond to unforeseen events, highlighting the need for more dynamic and adaptable approaches to feature engineering.

Financial analysis has historically depended on meticulously engineered features – quantifiable data points selected by experts. However, a significant portion of potentially valuable market intelligence resides in unstructured data, such as news articles, analyst reports, and regulatory filings like those from the Securities and Exchange Commission. Large Language Models (LLMs) present a novel approach to unlock these insights, automatically processing textual information to identify contextual features previously inaccessible to traditional methods. By understanding the nuances of language, LLMs can discern sentiment, extract key entities, and identify relationships within text, effectively transforming qualitative data into quantitative signals. This capability moves beyond simple keyword searches, allowing for a more sophisticated understanding of how information disseminated through unstructured sources impacts financial markets and potentially improves predictive modeling.

Successfully integrating Large Language Models into financial modeling isn’t simply a matter of applying them to raw text; it demands sophisticated feature engineering. Recent work has focused on automating this process through a prompt-optimization loop, systematically refining the questions posed to the LLM to elicit the most predictive signals. This iterative approach yielded a substantial improvement in the Information Coefficient (IC), a key metric for evaluating predictive power – shifting from a negative correlation of -0.024, indicating a detrimental signal, to a positive correlation of +0.104. This demonstrates that a carefully tuned LLM, guided by automated optimization, can move beyond noise and uncover meaningful insights previously inaccessible through traditional, handcrafted features, highlighting the potential for LLMs to generate genuinely alpha-driving signals.
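The Information Coefficient cited here is conventionally computed as the rank correlation between a candidate feature and subsequent returns. A minimal sketch of that metric (the toy data and function name are illustrative, not the paper's pipeline):

```python
import numpy as np

def information_coefficient(feature: np.ndarray, forward_returns: np.ndarray) -> float:
    """Spearman rank correlation between a feature and next-period returns.

    An IC near zero means the feature carries no predictive signal; a negative
    IC (e.g. -0.024) is actively harmful, while a modestly positive IC
    (e.g. +0.104) is already meaningful at portfolio scale.
    """
    # Rank-transform both series, then take the Pearson correlation of the ranks.
    feature_ranks = np.argsort(np.argsort(feature))
    return_ranks = np.argsort(np.argsort(forward_returns))
    return float(np.corrcoef(feature_ranks, return_ranks)[0, 1])

# Toy example: a feature that weakly tracks future returns through heavy noise.
rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=500)
feature = returns + rng.normal(0, 0.03, size=500)
ic = information_coefficient(feature, returns)
```

Because the IC is a correlation, values are bounded in [-1, 1], and even a small positive value compounds across many independent bets.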

While Large Language Models unlock valuable signals from previously inaccessible data, their predictive power isn’t automatically guaranteed in financial forecasting. Studies reveal that LLM-derived features, though rich in contextual information, require careful integration with established macroeconomic indicators to consistently improve performance. Data quality presents a further challenge; biases or inaccuracies within the unstructured data fed to the LLM can significantly distort extracted features and lead to flawed predictions. Consequently, a robust analytical framework must prioritize both the cleanliness of input data and the conditioning of LLM insights with reliable macroeconomic variables, ensuring that the models reflect genuine market dynamics rather than spurious correlations or data artifacts.

Across 2025, the cumulative portfolio value demonstrates that while the ±1 standard deviation band around the 5-seed mean (solid lines) highlights performance variability, a clear regime split between H1 (red, tariff-driven shock) and H2 (yellow, calmer) reveals when LLM-derived features enhance or hinder portfolio performance.

The Allure of Autonomy: Reinforcement Learning as a System for Adaptation

Reinforcement Learning (RL) applies a computational approach to algorithmic trading where an agent learns to maximize cumulative rewards through interaction with a simulated or live market environment. Unlike traditional rule-based systems or supervised learning methods, RL does not require pre-labeled data; instead, the agent discovers optimal trading strategies via trial and error. The agent observes market states, takes actions (buy, sell, hold), and receives rewards or penalties based on the outcome of those actions. This iterative process allows the agent to adapt to changing market conditions and learn complex, non-linear relationships that might be difficult to capture with static strategies. The learning process typically involves balancing exploration – trying new actions to discover potentially better strategies – and exploitation – leveraging known successful strategies to maximize immediate rewards.
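The observe-act-reward loop described above can be sketched with a toy market: the price series, action set, and mark-to-market reward below are illustrative stand-ins, not the paper's environment.

```python
import random

PRICES = [100.0, 101.2, 100.5, 102.0, 103.1, 102.4, 104.0]
ACTIONS = ("hold", "buy", "sell")

def step(t: int, position: int, action: str):
    """Apply an action, return the new position and the reward (mark-to-market P&L)."""
    if action == "buy":
        position = 1
    elif action == "sell":
        position = 0
    reward = position * (PRICES[t + 1] - PRICES[t])  # P&L from holding into t+1
    return position, reward

def run_episode(policy, seed: int = 0) -> float:
    """One pass through the price series, accumulating the reward the agent maximizes."""
    random.seed(seed)
    position, total = 0, 0.0
    for t in range(len(PRICES) - 1):
        action = policy(t, position)          # agent observes state, chooses action
        position, reward = step(t, position, action)
        total += reward
    return total

# A trivial exploratory policy: mostly buy-and-hold, occasionally a random action,
# mirroring the exploration/exploitation trade-off described above.
def noisy_policy(t, position):
    return "buy" if random.random() > 0.1 else random.choice(ACTIONS)

episode_pnl = run_episode(noisy_policy)
```

In a real system the state would include prices, features, and the current position, and the policy would be a learned function rather than a hand-written rule.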

The Proximal Policy Optimization (PPO) agent is a policy gradient method commonly employed in reinforcement learning due to its stability and sample efficiency. Unlike methods that perform large policy updates which can lead to instability, PPO constrains the policy update to a trust region, ensuring the new policy remains close to the previous one. This is achieved through a clipped surrogate objective function that penalizes deviations beyond a specified ratio. In the context of algorithmic trading, this characteristic is crucial for navigating the non-stationary and complex dynamics of financial markets, where drastic strategy shifts based on limited data can be detrimental. PPO’s ability to learn incrementally and adapt reliably to changing market conditions, combined with its relative ease of implementation and hyperparameter tuning, makes it a preferred choice for developing robust trading agents.
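The clipped surrogate objective mentioned above fits in a few lines of numpy; this sketch uses the common default ε = 0.2, which is not necessarily the paper's setting.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, eps: float = 0.2) -> float:
    """PPO clipped surrogate: mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    where r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) is the probability ratio."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise minimum removes the incentive to push the
    # policy far from the old one, keeping updates inside the trust region.
    return float(np.mean(np.minimum(unclipped, clipped)))

# A 50% probability increase on a positive-advantage action is capped at 1 + eps:
capped = ppo_clip_objective(np.log([1.5]), np.log([1.0]), np.array([1.0]))  # ≈ 1.2
```

The clipping is exactly what makes PPO attractive for non-stationary markets: a single noisy batch of returns cannot drag the policy arbitrarily far in one update.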

The incorporation of Large Language Model (LLM)-derived features into a Reinforcement Learning (RL) framework enhances an agent’s ability to interpret and respond to dynamic market conditions. LLMs process textual data – including news articles, social media sentiment, and financial reports – to generate quantifiable features representing market events and their potential impact. These features, when integrated as part of the RL agent’s state space, provide context beyond historical price data. Testing demonstrated that, while LLM features alone yielded a Sharpe Ratio of -0.267 in H1 2025, combining them with macroeconomic features improved performance, suggesting that LLM-derived insights are most effective when considered alongside broader economic indicators.

The inclusion of macroeconomic features significantly improves the performance of reinforcement learning agents used in algorithmic trading. Analysis of trading results during the first half of 2025 demonstrates a substantial difference in Sharpe Ratio when macroeconomic conditioning is applied; a Sharpe Ratio of -0.007 was achieved when utilizing macro features alone, compared to -0.267 when utilizing large language model (LLM)-derived features without macroeconomic context. This indicates that while LLM features can provide valuable short-term signals, they require broader economic context provided by macroeconomic indicators to effectively navigate market trends and mitigate risk, leading to a considerable improvement in risk-adjusted returns.
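The Sharpe Ratios quoted in these comparisons are risk-adjusted returns. A standard annualized estimate from daily returns, assuming 252 trading days and a zero risk-free rate (the toy return series are illustrative):

```python
import numpy as np

def sharpe_ratio(daily_returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio: mean daily return over its sample volatility,
    scaled by the square root of the number of periods per year."""
    mu = daily_returns.mean()
    sigma = daily_returns.std(ddof=1)
    return float(np.sqrt(periods_per_year) * mu / sigma)

# H1-vs-H2 style comparison on toy return series:
h1 = np.array([-0.004, 0.002, -0.006, 0.001, -0.003])  # choppy, losing regime
h2 = np.array([0.003, 0.004, 0.002, 0.005, 0.003])     # calm, winning regime
h1_sharpe, h2_sharpe = sharpe_ratio(h1), sharpe_ratio(h2)
```

Because the metric divides by realized volatility, a strategy can post a worse Sharpe in a high-volatility half-year even at the same average return, which is why the H1/H2 split matters.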

The Ghosts in the Machine: Addressing the Realities of Deployment

Reinforcement learning (RL) agents deployed in dynamic environments, such as financial markets, are susceptible to performance degradation due to distribution shift and regime change. Distribution shift refers to alterations in the input data distribution, meaning the statistical properties of the market data change over time. Regime change indicates a more fundamental shift in market behavior, such as a transition from a period of high volatility to low volatility, or a change in overall market trend. Both phenomena invalidate the assumptions upon which the RL agent was trained, leading to suboptimal decision-making and reduced profitability. The agent’s learned policy, optimized for a specific historical distribution, becomes less effective when applied to new, unseen data that deviates significantly from the training distribution. Consequently, continuous monitoring and adaptation strategies, such as retraining or policy adaptation, are necessary to maintain consistent performance in non-stationary environments.
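A simple way to operationalize the regime monitoring described above is a rolling-volatility threshold, analogous to conditioning on the VIX; the window length and threshold below are arbitrary illustrations, not the paper's method.

```python
import numpy as np

def regime_labels(returns: np.ndarray, window: int = 20,
                  vol_threshold: float = 0.015) -> np.ndarray:
    """Label each day high-vol (True) or low-vol (False) by trailing stdev.
    Days before the first full window are labeled low-vol by convention."""
    labels = np.zeros(len(returns), dtype=bool)
    for t in range(window, len(returns)):
        labels[t] = returns[t - window:t].std() > vol_threshold
    return labels

# Calm first half, shocked second half -- a crude stand-in for an H1/H2 split:
rng = np.random.default_rng(1)
calm = rng.normal(0, 0.005, 100)
shock = rng.normal(0, 0.03, 100)
labels = regime_labels(np.concatenate([calm, shock]))
```

Such labels can gate retraining or down-weight features known to degrade in high-volatility regimes; production systems typically use richer change-point detectors, but the principle is the same.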

The Feature-Policy Gap describes a discrepancy between the features used to train a reinforcement learning (RL) agent and the actual policies that maximize reward in a given environment. This gap arises when the provided features are insufficient to accurately represent the state of the environment or lack predictive power regarding future rewards. Consequently, the agent may learn suboptimal policies, even with a powerful learning algorithm. Minimizing this gap necessitates careful feature engineering, focusing on identifying and incorporating variables that are strongly correlated with expected returns and provide a comprehensive representation of the relevant state space. Addressing the Feature-Policy Gap is critical for successful RL deployment, as it directly impacts the agent’s ability to generalize and perform effectively in complex, real-world scenarios.

The presence of conflicting signals within news articles introduces substantial noise and uncertainty for reinforcement learning (RL) agents tasked with making trading decisions. These conflicts arise when different parts of an article present opposing viewpoints on a particular asset or economic indicator, or when reported data contradicts prevailing market sentiment. This inconsistency complicates the agent’s ability to accurately assess information, leading to suboptimal or incorrect trading strategies. The agent must then discern the validity and relevance of each signal, increasing the computational burden and potentially diminishing predictive accuracy. Consequently, managing conflicting signals is critical for robust performance in real-world financial applications.

Optimizing prompts used to generate features from Large Language Models (LLMs) is critical for maximizing the predictive power of those features in Reinforcement Learning (RL) applications. Our research demonstrates that automated prompt optimization significantly improves feature quality, as quantified by the Information Coefficient (IC). Specifically, we observed an IC improvement from -0.024 to +0.104 through this process. This indicates a transition from features that were negatively correlated with future rewards to those that are positively correlated, thereby enhancing the RL agent’s ability to make informed decisions based on LLM-derived inputs.
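The prompt-optimization loop can be sketched as a simple search: generate candidate prompts, score the features each produces by validation-window IC, and keep the best. The `score_prompt` oracle below stands in for the LLM feature-extraction and scoring step, which cannot be reproduced here; the candidate prompts and their scores are hypothetical, echoing only the paper's reported range (-0.024 before optimization, +0.104 after).

```python
def optimize_prompt(candidates, score_prompt):
    """Greedy prompt search: keep the candidate whose extracted features
    achieve the highest Information Coefficient on validation data."""
    best_prompt, best_ic = None, float("-inf")
    for prompt in candidates:
        ic = score_prompt(prompt)  # IC of LLM features extracted with this prompt
        if ic > best_ic:
            best_prompt, best_ic = prompt, ic
    return best_prompt, best_ic

# Hypothetical candidates with stand-in validation scores:
scores = {
    "v0: summarize the filing": -0.024,
    "v1: rate 10-day return impact": 0.041,
    "v2: rate impact with macro context": 0.104,
}
best, ic = optimize_prompt(scores, scores.get)
```

Real loops add an LLM-driven mutation step that rewrites the current best prompt between rounds, but the selection criterion, out-of-sample IC, is the core of the method.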

Both agents achieved stable performance, plateauing in validation Sharpe ratio around 400-500k training steps, as indicated by the multi-seed cutoff.

The Illusion of Progress: Validation and the Limits of LLM-Enhanced RL

Rigorous benchmarking serves as the cornerstone of progress in reinforcement learning (RL) for financial applications. Establishing standardized evaluation protocols allows for meaningful comparison of diverse RL strategies, moving beyond isolated successes to identify genuinely superior approaches. This comparative analysis isn’t merely about ranking algorithms; it’s about pinpointing specific weaknesses and areas ripe for improvement. By consistently assessing performance across a defined set of market conditions and metrics – such as Sharpe Ratio, maximum drawdown, and transaction costs – researchers and practitioners can systematically refine algorithms, address limitations, and ultimately build more robust and reliable trading systems. Without such standardized evaluation, claims of performance gains remain anecdotal, hindering the field’s advancement and potentially leading to flawed implementations in live trading environments.
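Of the metrics listed, maximum drawdown is the least obvious to compute: it is the largest fractional peak-to-trough decline of the equity curve, a one-pass calculation.

```python
def max_drawdown(equity_curve) -> float:
    """Largest fractional drop from a running peak in a portfolio-value series."""
    peak, worst = float("-inf"), 0.0
    for value in equity_curve:
        peak = max(peak, value)              # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # deepest decline from that peak
    return worst

# Toy equity curve: the worst episode is the fall from 120 to 90.
curve = [100, 110, 95, 105, 120, 90, 130]
mdd = max_drawdown(curve)  # → 0.25
```

Unlike the Sharpe Ratio, drawdown is path-dependent, so two strategies with identical return distributions can score very differently; that is precisely why benchmarks should report both.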

The complexities of financial market simulation and reinforcement learning necessitate robust and reproducible research tools. FinRL addresses this need by offering a standardized library and environment specifically designed for training and evaluating RL agents in finance. This open-source framework streamlines the development process, providing pre-built environments that mimic real-world market dynamics and a suite of commonly used algorithms. Researchers and practitioners can leverage FinRL to benchmark different strategies, experiment with various reward functions, and rigorously test the performance of their agents on historical data, fostering greater transparency and accelerating innovation in algorithmic trading and portfolio management. By providing a common platform, FinRL facilitates collaboration and ensures that results are readily comparable, ultimately contributing to the advancement of reliable and effective financial applications of reinforcement learning.

The convergence of large language models (LLMs) and reinforcement learning (RL) presents a compelling opportunity to elevate algorithmic trading strategies. By incorporating LLM-derived features – insights extracted from news articles, social media, and financial reports – into RL frameworks, agents can potentially discern subtle market signals and nuanced sentiment previously inaccessible through traditional quantitative methods. This integration allows RL agents to move beyond purely numerical data, factoring in qualitative information that influences investor behavior and asset pricing. The resulting strategies demonstrate the capacity to adapt to evolving market dynamics and potentially outperform those relying solely on historical price data, although careful consideration must be given to mitigating the risks associated with noisy or biased LLM outputs and ensuring robust generalization across different market conditions.

Despite promising initial results, realizing the full potential of large language model (LLM)-enhanced reinforcement learning for algorithmic trading necessitates further investigation into key challenges. Current systems are susceptible to distribution shift, where changing market conditions degrade performance due to reliance on historical data; this is further complicated by the feature-policy gap, a misalignment between the information LLMs provide and the actions the RL agent takes. Conflicting signals derived from LLM analysis also present a hurdle, requiring sophisticated methods to reconcile disparate interpretations of market sentiment. Recent experiments illustrate this complexity; while the incorporation of LLM-derived features did not demonstrably improve performance – yielding a Sharpe Ratio of 1.038 in H2 2025 compared to 1.099 achieved with macro features alone – the study highlights the need for innovative approaches to feature selection, signal filtering, and adaptive learning strategies to build truly robust and reliable trading systems.

The pursuit of predictive signals from large language models, as explored within this study, resembles tending a garden in shifting sands. One anticipates blooms, crafting prompts as careful cultivation, yet macroeconomic shocks arrive as unforeseen frosts. The system, initially promising, reveals its limitations not through inherent flaws, but through the simple act of growing up within a changing environment. Ada Lovelace observed that “The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.” This rings true; the model’s efficacy isn’t about independent insight, but about the boundaries of the knowledge encoded within its training and the prompt engineering that guides it. The paper rightly emphasizes the importance of recognizing these regime boundaries, lest one mistake transient success for enduring wisdom.

The Shifting Sands

The pursuit of predictive features from large language models, as demonstrated by this work, feels less like engineering and more like divination. It reveals not a pathway to scalable intelligence, but a deepening awareness of just how contingent all signal is. The paper rightly identifies the fragility of these relationships under distributional shift; yet, the core problem persists. Scalability is simply the word used to justify complexity, and each added layer of abstraction is a prophecy of future failure, a pre-ordained point where the map no longer matches the territory.

The focus now must move beyond merely detecting regime change to anticipating its character. Robustness isn’t about surviving shocks; it’s about building systems that gracefully degrade, that yield information even when fractured. The information coefficient offers a useful diagnostic, but it treats the symptom, not the disease. The real challenge lies in acknowledging that the perfect architecture is a myth to keep us sane, and embracing instead the messy, emergent properties of systems designed for continual adaptation.

Ultimately, this research reinforces a humbling truth: everything optimized will someday lose flexibility. The future isn’t about finding the signal; it’s about building the capacity to listen for any signal, however faint, however transient, in the ever-shifting noise.


Original article: https://arxiv.org/pdf/2604.10996.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-14 11:24