Author: Denis Avetisyan
Researchers are exploring whether incorporating common human cognitive biases into artificial intelligence trading agents can improve financial decision-making.

This review examines the challenges of integrating behavioral finance principles, specifically loss aversion and overconfidence, into reinforcement learning algorithms for algorithmic trading.
While rational agent models dominate algorithmic finance, human financial decisions are demonstrably shaped by psychological biases. This is the central challenge addressed in ‘Incorporating Cognitive Biases into Reinforcement Learning for Financial Decision-Making’, which investigates integrating cognitive factors, specifically loss aversion and overconfidence, into reinforcement learning frameworks for trading. The study reveals that simply incorporating these biases doesn’t consistently yield improved risk-adjusted returns, highlighting the complexities of modeling human behavior in artificial intelligence. Can a more nuanced understanding of behavioral finance unlock the potential for truly human-like, and ultimately more robust, financial AI systems?
Decoding the Limits of Prediction
A significant limitation of many contemporary algorithmic trading strategies lies in their dependence on historical data analysis. These systems are frequently built upon the assumption that past market patterns will reliably predict future behavior, a premise increasingly challenged by the accelerating pace of change in financial landscapes. While backtesting on historical datasets can demonstrate potential profitability under specific conditions, such models often struggle to adapt when confronted with novel events, shifts in market sentiment, or unforeseen economic disruptions. This rigidity stems from an inability to incorporate real-time information effectively or to learn from evolving dynamics, leading to diminished performance and potentially substantial losses when market conditions deviate from those previously observed. Consequently, strategies anchored solely in historical analysis risk becoming obsolete, underscoring the necessity for adaptive systems capable of continuous learning and dynamic recalibration.
Conventional financial metrics, such as the Sharpe Ratio, frequently present a skewed picture of investment performance by assuming perfectly rational actors. These calculations typically assess risk-adjusted returns based on statistical volatility, yet fail to incorporate the pervasive influence of cognitive biases: systematic patterns of deviation from norm or rationality in judgment. Real-world traders are susceptible to loss aversion, overconfidence, and herd behavior, which can dramatically impact trading decisions and ultimately distort reported returns. Consequently, relying solely on these traditional metrics can lead to an overestimation of true profitability and an underestimation of the actual risks involved, as they do not fully capture the unpredictable element of human psychology in financial markets.
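As a point of reference, the Sharpe Ratio itself is a straightforward calculation; a minimal sketch (with synthetic returns standing in for real data) shows how little room the formula leaves for behavioral effects:

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe Ratio: mean excess return divided by return volatility."""
    excess = np.asarray(returns) - risk_free_rate / periods_per_year
    # Purely volatility-based risk adjustment; nothing here models the trader's psychology.
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# Synthetic daily returns of a hypothetical strategy
daily_returns = np.random.normal(0.0005, 0.01, size=252)
print(sharpe_ratio(daily_returns))
```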
Comprehensive backtesting revealed a consistent inability of traditional algorithmic strategies to generate positive returns across all configurations tested. This outcome underscores a critical limitation of approaches reliant on purely historical data and static models. The consistently negative cumulative returns observed suggest that prevailing strategies often fail to adequately address the inherent unpredictability of financial markets, as well as the subtle but powerful influence of cognitive biases in trading decisions. Consequently, a shift towards adaptive strategies, those capable of learning from new information and accounting for both market volatility and the imperfections of human-like decision-making, is essential for improved performance and sustainable gains.

Re-Engineering Adaptation: Reinforcement Learning
Reinforcement Learning (RL) provides a computational approach to developing trading agents capable of making sequential decisions without explicit programming for every market scenario. Unlike traditional algorithmic trading which relies on pre-defined rules, RL agents learn through trial and error, receiving rewards or penalties for each trade executed. This learning process allows the agent to adapt to changing market conditions and optimize its trading strategy over time. The framework centers on an agent interacting with an environment – in this case, a financial market – and iteratively refining a policy to maximize cumulative rewards, typically measured as profit. The agent’s actions influence the market state, which in turn dictates the received reward and shapes future learning.
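A minimal sketch of this interaction loop, with a toy environment and state encoding that are illustrative rather than drawn from the paper, might look like this:

```python
class TradingEnv:
    """Toy market environment: the agent observes a coarse state and trades one unit per step."""

    def __init__(self, prices):
        self.prices = prices
        self.t = 0

    def state(self):
        # Illustrative state: sign of the most recent price change (-1, 0, +1).
        if self.t == 0:
            return 0
        diff = self.prices[self.t] - self.prices[self.t - 1]
        return 1 if diff > 0 else (-1 if diff < 0 else 0)

    def step(self, action):
        # action: -1 = sell/short, 0 = hold, +1 = buy; reward is the one-step P&L that results.
        reward = action * (self.prices[self.t + 1] - self.prices[self.t])
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self.state(), reward, done

# Usage: a naive always-buy policy run over a short price history.
env = TradingEnv([100.0, 101.0, 100.5, 102.0])
done = False
while not done:
    state, reward, done = env.step(1)
```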
Q-Learning is a model-free, off-policy temporal difference learning algorithm used to train the trading agent. The agent learns a Q-function, Q(s,a), which estimates the expected cumulative reward for taking action a in state s. This function is iteratively updated based on the Bellman equation, allowing the agent to discover an optimal policy for action selection. The training process involves the agent interacting with the simulated market, receiving rewards based on trade outcomes, and updating the Q-function to maximize these rewards over time. Through repeated interaction and Q-function refinement, the agent aims to learn a trading strategy that consistently generates positive cumulative returns.
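A tabular Q-Learning agent consistent with this description can be sketched as follows; the hyperparameter values are illustrative, not those used in the study:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate
actions = [-1, 0, 1]                      # sell, hold, buy
Q = defaultdict(float)                    # Q[(state, action)] -> estimated cumulative reward

def choose_action(state):
    # Epsilon-greedy selection over the current Q estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Temporal-difference update derived from the Bellman equation.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```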
Synthetic financial data, utilized for training and evaluating the reinforcement learning agent, is generated via a Random Walk Model. This model simulates asset price movements as a stochastic process where each subsequent price is determined by adding a random variable to the current price. Specifically, price changes are modeled as independent and identically distributed random variables, typically drawn from a normal distribution with a mean of zero and a defined standard deviation representing volatility. The Random Walk Model offers a computationally efficient method for generating large datasets of price histories, allowing for extensive training and robust evaluation of the trading agent’s performance under controlled conditions, and enabling consistent benchmarking of different algorithmic strategies.
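A minimal sketch of such a generator, with illustrative parameters, is straightforward:

```python
import numpy as np

def random_walk_prices(n_steps=1000, start_price=100.0, volatility=1.0, seed=None):
    """Prices whose changes are i.i.d. normal draws with zero mean and fixed volatility."""
    rng = np.random.default_rng(seed)
    changes = rng.normal(loc=0.0, scale=volatility, size=n_steps)
    return start_price + np.cumsum(changes)

# Example: one synthetic year of daily prices for training or benchmarking.
prices = random_walk_prices(n_steps=252, volatility=0.5, seed=42)
```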

Mirroring Reality: Modeling Cognitive Biases
To simulate human-like decision-making, the trading agent’s reward function was altered to incorporate Loss Aversion, a cognitive bias where negative outcomes are felt more strongly than positive ones. This was achieved by multiplying losses by a factor λ greater than one, effectively increasing the perceived negative impact of unfavorable trades. The standard reward function, typically summing gains and losses, was therefore modified to R' = gains - λ * losses. Values of λ were tested to determine the impact of varying degrees of loss aversion on agent performance, with the hypothesis that a moderate level of loss aversion would better reflect real-world trading behavior.
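In code, the modification amounts to scaling only the negative side of the per-trade reward; the sketch below is illustrative rather than the study’s exact implementation:

```python
def loss_averse_reward(trade_pnl, loss_aversion=2.0):
    """Amplify losses by a factor lambda > 1 while leaving gains unchanged."""
    if trade_pnl < 0:
        return loss_aversion * trade_pnl   # losses are felt lambda times as strongly
    return trade_pnl
```

Applied to each trade’s profit or loss, this is equivalent to the modified reward R' = gains - λ * losses described above.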
The exploration strategy of the trading agent was modified to simulate overconfidence, a common cognitive bias. This was achieved through an Exploration Rate Adjustment where the probability of taking a random action (exploring) decreases more rapidly than in a standard ε-greedy approach. Specifically, the exploration rate was decayed exponentially after each trading period, with a decay factor determined by a confidence parameter. Higher confidence values resulted in a faster decrease in exploration, meaning the agent more quickly committed to its initial perceived optimal strategy, even in the face of new information. This reflects the human tendency to overestimate the accuracy of one’s own beliefs and underestimate the value of seeking further evidence.
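A sketch of such a schedule, assuming an exponential decay form and illustrative parameter names, might look like this:

```python
import math

def decayed_epsilon(initial_epsilon, episode, confidence=0.05, min_epsilon=0.01):
    """Exponentially shrink the exploration rate; higher confidence means faster commitment."""
    return max(min_epsilon, initial_epsilon * math.exp(-confidence * episode))

# An overconfident agent (confidence=0.2) stops exploring far sooner
# than a more cautious one (confidence=0.02).
```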
Testing of Loss Aversion was conducted by applying multipliers ranging from 1 to 3 to the loss component of the reward function. Results indicated a negative correlation between the Loss Aversion multiplier and agent performance; configurations with λ ≥ 2.5 consistently yielded suboptimal outcomes compared to lower values or no Loss Aversion (λ = 1). This suggests that while moderate loss aversion may reflect realistic behavioral traits, excessive weighting of losses actively impedes the learning process and reduces overall profitability within the simulated trading environment.
Analysis of the training runs demonstrated that the Sharpe Ratio, a measure of risk-adjusted return, exhibited substantial variance across all tested configurations of cognitive biases. Specifically, standard deviations of the Sharpe Ratio remained consistently high, irrespective of the Loss Aversion multiplier or Exploration Rate Adjustment applied to the trading agent. This indicates a lack of consistent convergence during the training process and suggests that the introduction of these behavioral biases, while aiming for realism, did not improve the stability of learning. The high variability observed implies that results obtained with any single set of bias parameters may not be reliably reproducible and that extended training periods or alternative optimization techniques would be necessary to achieve robust performance.

Refining the System: Stability and Impact
To effectively operate within the intricacies of a financial market, the agent leverages state space discretization – a technique that transforms the continuous range of possible market conditions into a finite set of discrete states. This simplification is crucial because a fully continuous state space would demand an impractical amount of computational resources and data to explore thoroughly. By categorizing market variables – such as price levels, trading volume, and technical indicators – into manageable groups, the agent can efficiently learn optimal trading strategies for each defined state. This allows for quicker adaptation to changing conditions and facilitates the development of a robust policy, even within the high-dimensional and noisy environment characteristic of real-world financial markets. The process effectively reduces the complexity of the decision-making process, enabling the agent to generalize learned behaviors across similar market states.
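One common way to perform such discretization is simple binning of each continuous observation; the bin edges below are illustrative, not those used in the study:

```python
import numpy as np

def discretize(value, edges):
    """Map a continuous observation to a discrete bin index."""
    return int(np.digitize(value, edges))

# Example: a continuous daily return collapsed into one of five coarse states.
return_edges = [-0.02, -0.005, 0.005, 0.02]   # strong loss / loss / flat / gain / strong gain
state = discretize(0.012, return_edges)        # -> 3, the "gain" bin
```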
The implementation of temporal continuity addresses a critical challenge in reinforcement learning: the inherent noisiness of financial reward signals. By smoothing the reward function over time, this technique effectively reduces the impact of short-term market fluctuations and prevents the agent from overreacting to transient events. This approach fosters a more stable learning process, discouraging erratic trading behaviors driven by immediate, potentially misleading, gains or losses. Consequently, the agent develops a more robust strategy, less susceptible to whipsaws and better equipped to navigate the complexities of the financial market with a more consistent and measured response to changing conditions.
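One plausible implementation of this smoothing is an exponential moving average over raw rewards; the sketch below assumes that form, which the article itself does not specify:

```python
class SmoothedReward:
    """Exponential moving average of rewards to damp short-term noise."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.value = 0.0

    def update(self, raw_reward):
        # Blend the new reward into the running average; larger beta means a smoother signal.
        self.value = self.beta * self.value + (1 - self.beta) * raw_reward
        return self.value
```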
Despite employing state space discretization and temporal continuity to foster stable learning within a complex market simulation, the incorporation of loss aversion into the reinforcement learning agent did not yield consistent performance improvements. Analysis revealed that while these techniques aimed to mitigate erratic behavior, the agent’s Sharpe Ratio (a measure of risk-adjusted return) remained notably volatile across various configurations. This suggests that, even with refined learning dynamics, the inherent psychological bias of prioritizing avoiding losses did not translate into a consistently advantageous trading strategy, and no tested configuration ultimately achieved positive cumulative returns. The study highlights the challenges of directly applying behavioral economics principles to automated trading systems without careful consideration of the market environment and algorithm design.
Even with these refinements to learning stability, the study ultimately revealed a consistent inability to achieve positive cumulative returns across all tested configurations. This outcome suggests that, while state space discretization and temporal continuity addressed certain aspects of agent behavior, fundamental challenges remain in applying reinforcement learning to financial markets. The consistent lack of profitability indicates that the complexities of market dynamics, even within a simplified simulation, present a significant hurdle to consistently outperforming baseline strategies. Further research is therefore needed to identify alternative approaches or more nuanced reward structures capable of driving sustained positive returns, as the current methodology, despite its improvements, did not yield the desired outcome of consistent profitability.

The exploration of behavioral finance within reinforcement learning, as detailed in this study, reveals a fascinating tension. It isn’t enough to simply include established cognitive biases; the system doesn’t automatically benefit. This echoes a sentiment expressed by Paul Erdős: “A mathematician knows a lot of things, but a physicist knows a lot more.” The paper demonstrates that understanding the rules of behavioral finance – the established biases – is only the first step. Truly modeling human decision-making, and thus improving algorithmic trading, requires a deeper understanding of how those rules interact within complex systems – a form of reverse-engineering reality. The study implies that a purely theoretical application of biases, without accounting for market dynamics, yields limited results, mirroring the need for a physicist’s intuition beyond mathematical formulas.
What Lies Ahead?
The attempt to graft behavioral quirks onto reinforcement learning, as this work demonstrates, isn’t a simple matter of code injection. Reality, after all, is open source – the underlying logic is there, but merely acknowledging loss aversion or overconfidence isn’t enough to unlock its predictive power. The results suggest that these biases, when treated as static parameters, may even introduce noise. The challenge isn’t simply including these effects, but modeling how they emerge, evolve, and interact within a complex system. A truly adaptive agent needs to not just feel fear of loss, but to learn when that feeling is statistically justified.
Future iterations must move beyond simply encoding pre-defined biases. Perhaps the focus should shift towards modeling the cognitive architecture itself – the mechanisms that generate these biases in the first place. Could reinforcement learning be used to simulate the evolution of cognitive shortcuts, and then test their efficacy in financial markets? Or might a more fruitful approach involve identifying the fundamental information asymmetries that drive these biases, rather than treating them as inherent flaws in decision-making?
Ultimately, this work serves as a useful, if humbling, reminder. Bridging the gap between behavioral finance and machine learning isn’t about finding the right algorithm; it’s about reverse-engineering the very process of intelligent behavior. The code is there, waiting to be deciphered. The question is whether current methods possess the necessary tools to do so, or if a fundamentally different approach is required.
Original article: https://arxiv.org/pdf/2601.08247.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/