When RL Gets Stuck: Solving Challenges in Dynamic Environments

Author: Denis Avetisyan


New research reveals why standard reinforcement learning agents falter in constantly changing situations and proposes a surprisingly simple fix.

An agent navigating a multiplicative dynamic optimizes wealth by selecting between safe and risky actions, where repetitions in action selection (returning updated wealth to the state) manage non-ergodic contexts, and differing optimization strategies (based on expected values versus growth rates) predict distinct indifference points between the two actions, p_E and p_T, respectively.

This review details model-agnostic methods for improving deep reinforcement learning performance in non-ergodic environments by addressing limitations in time-averaged reward estimation.

While reinforcement learning excels at finding optimal policies, its reliance on expected values can falter when applied to non-ergodic environments, systems where long-term outcomes depend heavily on initial conditions. This limitation is addressed in ‘Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts’, which demonstrates that standard deep reinforcement learning agents systematically underperform in such settings due to their inability to account for time-averaged growth. The authors show that simply allowing the agent to experience repeated trajectories, implicitly embedding temporal information, can correct this, improving performance without altering environmental feedback or objective functions. Could this approach offer a broadly applicable solution for deploying reinforcement learning in real-world systems characterized by non-stationary dynamics?


The Fragility of Expectation: Ergodicity in Reinforcement Learning

Traditional Reinforcement Learning algorithms are fundamentally built upon the principle of ergodicity, a concept asserting that the average performance of a single agent over time will mirror the average performance of a large ensemble of agents at any given moment. This seemingly abstract mathematical condition drastically simplifies the learning process, allowing algorithms to reliably estimate long-term rewards and converge on optimal policies by effectively assuming the system’s future behavior is representative of its past. Essentially, ergodicity provides the statistical foundation for RL by guaranteeing that sufficient exploration will eventually reveal a representative sample of the environment’s dynamics, enabling accurate value function estimation and policy improvement. Without this assumption, the learning process becomes significantly more complex, as algorithms struggle to generalize from limited experience and may converge on suboptimal solutions due to biased estimations.

Many real-world systems deviate from the ergodic principle, a cornerstone of traditional Reinforcement Learning (RL). These non-ergodic systems, often characterized by multiplicative dynamics – where changes are proportional to the current state rather than being additive – present a significant challenge to standard RL approaches. Consider financial markets or resource depletion scenarios; a single, exceptionally positive or negative event can dramatically alter the system’s trajectory, rendering past data an unreliable predictor of future outcomes. This violates the fundamental assumption underlying most RL algorithms, which rely on averaging experiences over time to estimate optimal policies. Consequently, applying standard RL to these non-ergodic environments can lead to inaccurate estimations of value functions and, ultimately, unstable or suboptimal policies, as the agent struggles to learn in a world where the rules are constantly shifting based on accumulated, rather than independent, events.
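
A minimal simulation, not taken from the paper, makes this divergence concrete: in a multiplicative gamble with a positive expected value, the ensemble average still grows while the typical individual trajectory decays (the 1.5x/0.6x factors below are illustrative assumptions, not the study's parameters).

```python
import numpy as np

rng = np.random.default_rng(0)

# Multiplicative gamble: each step wealth is multiplied by 1.5 (up) or 0.6
# (down) with equal probability. The expected growth factor per step is
# 0.5*1.5 + 0.5*0.6 = 1.05 > 1, yet the time-average growth factor is
# sqrt(1.5 * 0.6) ~= 0.95 < 1, so the typical trajectory shrinks.
n_agents, n_steps = 100_000, 20
factors = rng.choice([1.5, 0.6], size=(n_agents, n_steps))
terminal = np.cumprod(factors, axis=1)[:, -1]          # final wealth per agent

print("ensemble mean wealth   :", terminal.mean())     # ~ 1.05**20 ~ 2.65
print("typical (median) wealth:", np.median(terminal)) # ~ 0.95**20 ~ 0.35
```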

The practical limitations of conventional deep reinforcement learning become apparent when addressing non-ergodic environments, where historical data poorly predicts future system behavior. Standard algorithms, built on the premise of statistically stable averages, struggle to accurately estimate value functions and, consequently, generate unstable or suboptimal policies. Research demonstrates this deficiency through the inability of these agents to achieve genuinely time-optimal control; instead, they converge to solutions that prioritize stability over speed, even when faster, albeit riskier, paths are available. This inability to leverage potentially advantageous, yet statistically rare, events highlights a fundamental mismatch between the assumptions of traditional RL and the complexities of many real-world systems, demanding novel approaches capable of navigating non-stationary and unpredictable dynamics.

Increasing the number of repetitions M in the toy model shifts the policy’s preference from maximizing expected value (blue) towards minimizing the probability of the worst-case outcome (green), revealing a sensitivity to path dependence not observed in single-step training.

Beyond Simple Averages: The Pursuit of Time-Average Growth

In scenarios termed non-ergodic, where future outcomes are not solely determined by past performance and statistical averages are unreliable predictors of long-term results, simply maximizing average reward proves inadequate for sustained capital growth. This is because average reward prioritizes immediate gains without accounting for the compounding effects of both positive and negative returns over extended periods. Instead, the focus should shift to optimizing time-average growth – specifically, the asymptotic growth rate of capital over many trials. This metric, often expressed as \lim_{n \to \infty} (\text{Capital}_n)^{1/n}, represents the long-run geometric mean return and provides a more accurate assessment of a strategy’s ability to preserve and increase wealth in unpredictable environments. Strategies that maximize average reward may exhibit high variance and a significant risk of ruin, whereas optimizing for time-average growth inherently minimizes this risk by prioritizing consistent, albeit potentially smaller, positive returns.
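
The distinction can be checked on a single simulated trajectory; the 1.5x/0.6x gamble below is again an assumed toy dynamic rather than the paper's environment, and the time-average growth rate is estimated as the mean per-step log return.

```python
import numpy as np

rng = np.random.default_rng(1)

# One long trajectory of the 1.5x / 0.6x gamble, starting from wealth 1.
factors = rng.choice([1.5, 0.6], size=50_000)

# Average per-step reward (arithmetic mean of net returns) looks attractive...
avg_reward = (factors - 1.0).mean()          # ~ +0.05

# ...but the time-average growth rate, (1/n) * log(W_n / W_0), which equals
# the mean per-step log return, is negative: the gamble loses wealth over time.
# exp(time_avg_growth) ~ 0.95 is the geometric-mean factor (Capital_n)^(1/n).
time_avg_growth = np.log(factors).mean()     # ~ log(sqrt(1.5*0.6)) ~ -0.053

print(f"average reward per step : {avg_reward:+.3f}")
print(f"time-average growth rate: {time_avg_growth:+.3f}")
```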

The Kelly Criterion is a mathematical formula for determining the optimal size of a series of bets or investments to maximize the long-run growth rate. It achieves this by allocating capital in proportion to the edge – the probability-weighted expected return of an investment – while avoiding the risk of ruin. Rather than maximizing average reward, the criterion maximizes \log(W_t), where W_t represents wealth at time t. This logarithmic utility function prioritizes consistent, sustainable growth over large, infrequent gains, effectively balancing risk and reward. For a bet with win probability p and win-to-loss ratio b, the optimal fraction of wealth to wager is f^* = p - (1 - p)/b; wagering the full fraction f^* is known as full Kelly, while fractional Kelly bets (a fixed proportion of f^*) are often employed to reduce volatility.
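
The following sketch, using illustrative parameters rather than anything from the paper, computes the Kelly fraction and shows how betting beyond it erodes the growth rate:

```python
import numpy as np

def kelly_fraction(p: float, b: float) -> float:
    """Kelly bet size f* = p - (1 - p) / b for win probability p and
    win-to-loss ratio b (clipped at 0: never bet against a negative edge)."""
    return max(0.0, p - (1.0 - p) / b)

def growth_rate(frac: float, p: float, b: float) -> float:
    """Expected per-step log-growth of wealth when betting a fixed fraction:
    g(f) = p*log(1 + f*b) + (1 - p)*log(1 - f)."""
    return p * np.log1p(frac * b) + (1.0 - p) * np.log1p(-frac)

p, b = 0.6, 1.0                        # 60% win chance at even odds
f_star = kelly_fraction(p, b)          # 0.2 for these numbers
for f in (0.5 * f_star, f_star, 2.0 * f_star):
    print(f"bet {f:.2f} of wealth -> growth rate {growth_rate(f, p, b):+.4f}")
# Half Kelly grows more slowly but with less variance; at roughly twice the
# Kelly fraction the growth rate drops to about zero and then turns negative.
```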

Implementing the Kelly Criterion for resource allocation, such as in the Portfolio Assignment Problem, necessitates defining a mapping between system states, available actions, and the resulting expected growth rates. Each state represents the current distribution of resources, and each action represents a reallocation strategy. The Kelly Criterion then determines the optimal fraction of capital to allocate to each action based on the probability of that action yielding a positive return and the magnitude of that return; specifically, the optimal fraction is proportional to the edge – the expected return above zero – divided by the total available edges. This mapping requires quantifying the probability and expected value of each outcome for each action within a given state to calculate the resulting growth rate g = \log(1 + \text{net return}), and iteratively applying the criterion to maximize long-term growth.
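
One way such a mapping might be realized, sketched here with a hypothetical return distribution and a discretized action grid (both assumptions for illustration, not the paper's formulation), is to score each candidate allocation by its expected log-growth and select the maximizer:

```python
import numpy as np

def expected_log_growth(fraction: float, outcomes: np.ndarray,
                        probs: np.ndarray) -> float:
    """Expected growth rate g = E[log(1 + fraction * net_return)] when
    `fraction` of wealth is placed in an asset with the given return distribution."""
    return float(np.sum(probs * np.log1p(fraction * outcomes)))

# Risky asset: +50% return with prob 0.5, -40% return with prob 0.5 (illustrative).
outcomes = np.array([0.5, -0.4])
probs = np.array([0.5, 0.5])

# Discretized action space: fraction of wealth allocated to the risky asset.
actions = np.linspace(0.0, 1.0, 21)
growth = [expected_log_growth(a, outcomes, probs) for a in actions]
best = actions[int(np.argmax(growth))]
print(f"growth-optimal allocation: {best:.2f} of wealth")  # ~0.25 for these numbers
```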

The optimal investment strategy, determined by the Kelly criterion (dashed green curve), maximizes expected value compared to a standard policy (solid blue curve) when allocating wealth to a portfolio, as illustrated by the portfolio assignment problem.

Learning to Repeat: Navigating Non-Ergodicity Through Repetitions Training

Repetitions Training is an approach to reinforcement learning (RL) specifically developed to mitigate the difficulties arising from non-ergodic environments. Traditional RL algorithms rely on the assumption of stationarity and i.i.d. data, which is often violated in non-ergodic scenarios where the system’s future states are not independent of its past. This method addresses this limitation by intentionally repeating specific time steps during training. This repetition allows the agent to experience a more controlled and consistent learning signal, improving policy stability and enabling effective learning of long-term dependencies that are critical for navigating non-ergodic systems where simple averaging of rewards can lead to inaccurate estimations and suboptimal performance.

Repetitions Training enhances learning by deliberately increasing the agent’s exposure to specific temporal states. This is achieved through repeated presentations of the same time steps within an episode, effectively amplifying the signal associated with those states and their immediate transitions. By experiencing these states multiple times, the agent receives more frequent updates to its value estimations, leading to a more accurate assessment of long-term rewards. This emphasis on repeated exposure facilitates learning of time-average growth – the rate at which returns compound over extended interactions – as the agent develops a refined understanding of how actions influence future states and subsequent rewards within those repeated time windows.
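
A minimal sketch of this idea, assuming a toy safe/risky gamble and an arbitrary repetition count m (neither taken from the paper), is an environment in which each selected action is applied m consecutive times and the compounded wealth is written back into the observed state, while the reward signal itself is left unchanged:

```python
import numpy as np

class RepeatedGambleEnv:
    """Toy multiplicative environment with repetitions baked in.

    Action 0 is safe (wealth unchanged); action 1 is risky (wealth multiplied
    by an up or down factor). Each selected action is applied `m` consecutive
    times and the compounded wealth is written back into the observed state,
    so a single decision exposes the agent to path-dependent growth. The
    reward stays the raw change in wealth: the repetition, not a modified
    objective, carries the temporal information. Illustrative sketch only;
    the paper's exact environment may differ.
    """

    def __init__(self, m=10, p_up=0.5, up=1.5, down=0.6, seed=0):
        self.m, self.p_up, self.up, self.down = m, p_up, up, down
        self.rng = np.random.default_rng(seed)
        self.wealth = 1.0

    def reset(self):
        self.wealth = 1.0
        return np.array([self.wealth], dtype=np.float32)

    def step(self, action: int):
        wealth_before = self.wealth
        for _ in range(self.m):                  # repeat the chosen action m times
            if action == 1:                      # risky multiplicative shock
                factor = self.up if self.rng.random() < self.p_up else self.down
                self.wealth *= factor
        reward = self.wealth - wealth_before     # unchanged environmental feedback
        obs = np.array([self.wealth], dtype=np.float32)
        done = False                             # episode length handled by the caller
        return obs, reward, done
```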

Repetitions Training demonstrably improves agent performance in non-ergodic environments where standard Reinforcement Learning (RL) methods struggle with estimation errors and policy instability. Empirical results indicate a significant shift in the indifference point with increased repetitions, suggesting improved decision-making under non-ergodic conditions (Figure 3b). Furthermore, analysis of the Mean Squared Error (MSE) reveals a decreasing trend as the number of repetitions increases (Figure 5), quantitatively supporting the method’s ability to refine estimations and stabilize policies in scenarios where the environment’s statistical properties change over time, leading to more reliable long-term performance.

An actor-critic DRL model trained with path-dependent learning converges to the optimal portfolio assignment strategy defined by the Kelly Objective, as demonstrated by the consistent mean μ and median policies across agents.

Deep Learning as the Engine: Enabling Robust Decision-Making in Complex Systems

Deep Reinforcement Learning represents a significant advancement over traditional reinforcement learning approaches by integrating the power of deep neural networks. Where classic methods struggle with the ‘curse of dimensionality’ – the exponential increase in computational complexity as the state space grows – DRL employs these networks as function approximators. This allows the agent to estimate complex value functions – predicting the long-term reward for being in a given state – and ultimately learn optimal policies even in environments with vast and intricate state spaces. Instead of explicitly storing value estimates for every possible state, the neural network generalizes from observed states, enabling learning in scenarios previously considered intractable. This capability unlocks the potential for applying reinforcement learning to real-world problems involving high-dimensional inputs, such as image processing, robotics, and game playing.

Deep Q-Networks (DQN) and Actor-Critic models represent pivotal advancements in applying deep learning to reinforcement learning challenges within complex environments. Traditional reinforcement learning struggles with high-dimensional state spaces, requiring extensive computation to estimate optimal actions; however, these deep learning architectures circumvent this limitation by utilizing neural networks to approximate the Q-function – a mapping of state-action pairs to expected future rewards. DQN employs a deep neural network to directly estimate the Q-function, enabling it to generalize across similar states and actions, while Actor-Critic methods utilize two networks: an ‘actor’ that learns the optimal policy, and a ‘critic’ that evaluates the policy’s effectiveness. This synergistic approach allows for more efficient learning and adaptation in environments where exhaustive state-action enumeration is impractical, ultimately paving the way for robust decision-making in previously intractable scenarios.
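
For concreteness, a deliberately small one-step Actor-Critic sketch in PyTorch follows; the network sizes, learning rate, and single-transition update are generic assumptions rather than the architecture used in the study:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a 1-D state (e.g. current wealth) to a
    categorical distribution over two actions (safe vs. risky)."""
    def __init__(self, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                                 nn.Linear(32, n_actions))

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    """Value network: estimates the expected return of a 1-D state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

actor, critic = Actor(), Critic()
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()),
                             lr=1e-3)

def update(state, action, reward, next_state, gamma: float = 0.99) -> None:
    """One-step actor-critic update from a single transition.
    `state`/`next_state` are tensors of shape [1]; `action` and `reward`
    are scalar tensors."""
    td_target = reward + gamma * critic(next_state).detach()
    td_error = td_target - critic(state)                  # advantage estimate
    actor_loss = -actor(state).log_prob(action) * td_error.detach()
    critic_loss = td_error.pow(2)
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```

In a full training loop this update would be applied to transitions collected from an environment such as the repeated-gamble sketch above, with the usual additions of batching, discounted multi-step returns, and exploration control.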

Combining deep learning architectures with a technique called Repetitions Training creates a remarkably powerful system for making decisions in unpredictable settings. Traditional reinforcement learning struggles when environments aren’t ‘ergodic’ – meaning past experiences aren’t necessarily indicative of future outcomes – but this combined approach actively addresses that challenge. By repeatedly exposing the learning agent to similar, yet varied, scenarios, Repetitions Training stabilizes the learning process and allows the deep neural network to generalize more effectively. The resulting policies don’t just find a good solution, but converge towards the theoretically optimal ‘Kelly Objective’ policy, maximizing long-term gains even under uncertainty – as visually demonstrated in Figure 6 – and demonstrating a significant advancement in robust decision-making capabilities.

Despite random parameter selection, agents trained with an Actor-Critic model successfully learned complete and reasonably precise policies for portfolio assignment.

The pursuit of robust solutions in reinforcement learning often overlooks the fundamental assumptions about the environment. This work highlights the critical issue of non-ergodicity, revealing how reliance on time-averaged expected values can lead to suboptimal policies. The proposed method, repeating training episodes, implicitly captures the temporal dynamics absent in standard approaches, demonstrating a surprising elegance in its simplicity. It’s reminiscent of Paul Erdős’s observation: “A mathematician knows a lot of formulas, but a physicist knows a lot of tricks.” The ‘trick’ here – repeated exposure – allows the agent to learn the underlying structure of a non-ergodic world, revealing that good architecture is invisible until it breaks, and only then is the true cost of decisions visible.

Beyond the Average

The observed fragility of standard reinforcement learning in non-ergodic settings suggests a fundamental miscalibration. The pursuit of expected values, while elegant in its simplicity, proves insufficient when the underlying dynamics actively resist averaging. This work offers a pragmatic, if somewhat brute-force, solution – repetition – but it merely highlights a deeper need. The field must move beyond seeking universally optimal policies and embrace methods that explicitly model, or at least implicitly encode, the temporal structure of the environment. A clever algorithm is not the goal; a robust one is.

Future work should explore whether this implicit encoding of dynamics, achieved through repeated episodes, can be formalized and accelerated. Could curriculum learning, or techniques borrowed from time-series analysis, offer more efficient pathways to robust policies? The current reliance on model-agnostic approaches, while appealing in their generality, may ultimately be a limitation. Perhaps a degree of inductive bias, carefully chosen to reflect the likely structure of non-ergodic environments, will prove more fruitful.

The persistent allure of the ‘general’ agent should be tempered with a healthy skepticism. Complexity rarely confers resilience. Instead, a focus on minimal sufficient structure – a system that embodies only what is necessary to function reliably – seems a more promising direction. The challenge, then, is not to build a machine that can learn anything, but one that learns effectively within the constraints of a non-stationary world.


Original article: https://arxiv.org/pdf/2601.08726.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
