Learning to Decide: Reinforcement Learning’s Rise in Economics

Author: Denis Avetisyan


This review explores how reinforcement learning, a powerful technique for decision-making, is being applied to solve complex problems in economic modeling and analysis.

A neural network reinforcement learning from human feedback (RLHF) model, comprising approximately 4,800 parameters, demonstrates the fastest convergence toward optimal performance, while a structurally simpler model, defined by only four parameters, exhibits high initial variance but ultimately achieves complete recovery at <span class="katex-eq" data-katex-display="false"> K=5,000 </span>.

A comprehensive survey connecting reinforcement learning algorithms to classical dynamic programming and their applications in game theory, optimal control, and causal inference.

While classical dynamic programming faces limitations in high-dimensional economic models, demanding simplifying reductions, ‘A Survey of Reinforcement Learning For Economics’ reviews how reinforcement learning offers a natural, sample-based extension capable of addressing these complexities. The survey connects traditional planning methods to modern learning algorithms, including temporal difference learning and policy gradients, and demonstrates their application to areas such as pricing, game theory, and preference elicitation. Despite its promise, the success of reinforcement learning remains bounded by challenges such as sample inefficiency and reliance on accurate simulators; the open question is whether, guided by economic structure, it can become a robust tool for computational economists.


Laying the Foundation: Optimization in Complex Systems

Numerous practical challenges, from robotics and financial trading to resource management and healthcare, necessitate making a series of interconnected decisions where the outcomes are not fully predictable. These scenarios, characterized by sequential dependencies and inherent uncertainty, demand optimization strategies capable of adapting to evolving conditions and mitigating potential risks. Simply maximizing immediate rewards often proves insufficient; instead, robust optimization seeks policies that consistently deliver favorable results across a range of possible future states. This approach acknowledges that perfect information is rarely available and prioritizes solutions that perform reliably even when faced with unforeseen circumstances or noisy data, ensuring resilience and long-term success in complex, dynamic environments.

Dynamic programming, built upon the foundation of the Bellman Equation, historically provides a rigorous method for solving sequential decision problems by breaking them down into smaller, more manageable subproblems. However, this approach encounters a significant obstacle known as the “curse of dimensionality.” As the number of state variables or possible actions increases – a common characteristic of real-world complexities – the computational resources required to solve these subproblems grow exponentially. This makes it impractical, if not impossible, to apply classical dynamic programming to even moderately sized problems. The state space expands so rapidly that storing and processing the necessary information becomes prohibitive, limiting the scalability and applicability of the technique despite its theoretical elegance. Consequently, researchers have actively sought alternative methods, such as reinforcement learning, to circumvent this fundamental limitation and tackle complex decision-making scenarios.
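Concretely, the recursion at the heart of this decomposition is the Bellman optimality equation, written here in standard notation (states, actions, reward <span class="katex-eq" data-katex-display="false">r(s,a)</span>, discount factor <span class="katex-eq" data-katex-display="false">\gamma</span>, not the survey’s specific model): <span class="katex-eq" data-katex-display="false">V^{*}(s)=\max_{a}\big[r(s,a)+\gamma\textstyle\sum_{s'}P(s'\mid s,a)\,V^{*}(s')\big]</span>, whose unique fixed point is the optimal value function.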

The inherent complexities of dynamic programming, particularly its susceptibility to the ‘curse of dimensionality’, have propelled the development of reinforcement learning as a viable alternative for solving sequential decision-making problems. Unlike methods requiring complete environmental models, reinforcement learning algorithms refine optimal policies through direct interaction and experience. Classical planning results illustrate the baseline these methods extend: Value Iteration, the core dynamic-programming technique underlying much of reinforcement learning, converges to an optimal solution on a small benchmark MDP in as few as 9 iterations, a count tied to the Markov Decision Process (MDP) diameter. Policy Iteration converges in even fewer sweeps relative to problem size: on the Brock-Mirman benchmark it reaches the optimal policy in just 11 iterations, compared with 567 for Value Iteration, highlighting the potential for rapid convergence even in complex environments. This suggests such planning-based updates offer a computationally advantageous starting point for problems previously intractable for classical dynamic programming techniques.
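To make the iteration-count comparison concrete, here is a minimal sketch of both algorithms on a toy deterministic chain MDP (an illustrative construction, not the survey’s benchmark, so the absolute counts differ from those above):

```python
# Toy comparison of value iteration and policy iteration on a small
# deterministic chain MDP. The goal state is absorbing and pays 1 every
# period, so value iteration converges only geometrically at rate GAMMA.
GAMMA = 0.9
N = 6                      # states 0..5; state 5 is the absorbing goal
ACTIONS = (-1, +1)         # move left or right along the chain

def step(s, a):
    """Deterministic transition; the absorbing goal pays 1 each period."""
    if s == N - 1:
        return s, 1.0
    s2 = min(max(s + a, 0), N - 1)
    return s2, 0.0

def value_iteration(tol=1e-8):
    V = [0.0] * N
    for it in range(1, 10000):
        newV = [max(r + GAMMA * V[s2]
                    for s2, r in (step(s, a) for a in ACTIONS))
                for s in range(N)]
        if max(abs(x - y) for x, y in zip(newV, V)) < tol:
            return newV, it
        V = newV

def policy_iteration():
    policy = [-1] * N                      # start by always moving left
    for it in range(1, 100):
        V = [0.0] * N                      # iterative policy evaluation
        for _ in range(2000):
            V = [step(s, policy[s])[1] + GAMMA * V[step(s, policy[s])[0]]
                 for s in range(N)]
        # greedy policy improvement against the evaluated values
        new_policy = [max(ACTIONS,
                          key=lambda a, s=s: step(s, a)[1]
                          + GAMMA * V[step(s, a)[0]])
                      for s in range(N)]
        if new_policy == policy:
            return V, it
        policy = new_policy

V_vi, n_vi = value_iteration()
V_pi, n_pi = policy_iteration()
```

Policy iteration pays for its few sweeps with a full policy-evaluation pass per sweep, which is the Newton-style trade-off: fewer but more expensive iterations.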

In a Brock-Mirman economy with parameters <span class="katex-eq" data-katex-display="false">\alpha=0.36</span> and <span class="katex-eq" data-katex-display="false">\beta=0.96</span>, value iteration converges linearly to the optimal value function after 567 iterations, while policy iteration, functioning as Newton’s method, achieves convergence in just 11 iterations by solving for the fixed point of the policy operator.
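The Brock-Mirman economy above can be reproduced with a simple grid-based value iteration. The sketch below assumes log utility and full depreciation, under which the exact policy is <span class="katex-eq" data-katex-display="false">k' = \alpha\beta k^{\alpha}</span>; the grid and stopping tolerance are illustrative choices, so the iteration count approximates rather than matches the 567 reported:

```python
import math

# Grid-based value iteration for the Brock-Mirman growth model with
# alpha = 0.36, beta = 0.96, assuming log utility and full depreciation
# (the case with the closed-form policy k' = alpha * beta * k**alpha).
ALPHA, BETA = 0.36, 0.96
K = [0.04 + i * (0.40 - 0.04) / 59 for i in range(60)]   # capital grid

def value_iteration(tol=1e-6):
    V = [0.0] * len(K)
    for it in range(1, 5000):
        newV, policy = [], []
        for k in K:
            y = k ** ALPHA                  # output available for c + k'
            best_v, best_j = -1e18, 0
            for j, k2 in enumerate(K):
                c = y - k2
                if c <= 0:
                    break                   # grid is sorted: c only shrinks
                v = math.log(c) + BETA * V[j]
                if v > best_v:
                    best_v, best_j = v, j
            newV.append(best_v)
            policy.append(K[best_j])
        if max(abs(a - b) for a, b in zip(newV, V)) < tol:
            return policy, it
        V = newV

policy, iters = value_iteration()
```

The linear convergence of value iteration at rate <span class="katex-eq" data-katex-display="false">\beta</span> is why the iteration count is so sensitive to the stopping tolerance, and why policy iteration’s Newton-like behavior is such a dramatic improvement here.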

Bridging the Gap: Temporal Difference Learning in Action

Temporal Difference (TD) learning addresses the challenge of learning value estimates in situations where complete sequences of experience are unavailable. Unlike Monte Carlo methods which require an entire episode to complete before updating value estimates, TD learning updates these estimates based on the difference between the predicted reward and the actual reward received at each step. This “bootstrapping” process allows the agent to learn from incomplete sequences, as the value of a state is updated even before the final outcome is known. Consequently, TD learning is more efficient and can learn online, adapting to changing environments without requiring complete episodes of interaction, making it suitable for continuous learning tasks.
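A minimal tabular TD(0) sketch makes the bootstrapping concrete (a toy three-state chain constructed for illustration, not an example from the survey):

```python
GAMMA = 0.9

def td0_evaluate(episodes=2000, lr=0.05):
    """Tabular TD(0) on a toy 3-state chain A -> B -> C (terminal),
    with reward 1 on the final transition, so V(B) = 1 and V(A) = GAMMA."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(episodes):
        s = "A"
        while s != "C":
            s2 = {"A": "B", "B": "C"}[s]
            r = 1.0 if s2 == "C" else 0.0
            # bootstrapped update, applied before the episode finishes
            V[s] += lr * (r + GAMMA * V[s2] - V[s])
            s = s2
    return V

V = td0_evaluate()
```

Each update uses the current estimate of the successor state’s value rather than the eventual return, which is exactly what lets the method learn online from incomplete sequences.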

Q-learning and SARSA are both temporal difference learning algorithms that iteratively update value functions based on observed rewards and estimated future rewards. However, they differ in their update rules regarding exploration and exploitation. Q-learning is an off-policy algorithm; it learns the optimal policy by assuming the agent always takes the best possible action, regardless of the action it actually takes. This is achieved by updating the Q-value using the maximum possible future reward. Conversely, SARSA is an on-policy algorithm, meaning it learns the value function for the policy the agent is currently following. It updates the Q-value based on the actual action taken, incorporating the exploration strategy directly into the learning process. This distinction leads to different convergence behaviors and can affect the algorithm’s performance in environments with stochastic rewards or actions.
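The one-line difference between the two update rules can be seen directly (illustrative code; `Q` is a plain table and the transition values are made up):

```python
from collections import defaultdict

GAMMA, LR = 0.9, 0.1

def q_learning_update(Q, s, a, r, s2, actions):
    """Off-policy TD update: bootstraps from the greedy (max) next action."""
    target = r + GAMMA * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += LR * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy TD update: bootstraps from the action a2 actually taken next."""
    target = r + GAMMA * Q[(s2, a2)]
    Q[(s, a)] += LR * (target - Q[(s, a)])

# One illustrative transition: in state 0 the agent took "right", received
# reward 1, and landed in state 1, where Q(1, "right") is already 1.0 but
# the exploring behavior policy happens to pick "left" next.
Q_q, Q_s = defaultdict(float), defaultdict(float)
Q_q[(1, "right")] = Q_s[(1, "right")] = 1.0
q_learning_update(Q_q, 0, "right", 1.0, 1, ["left", "right"])  # max -> 0.19
sarsa_update(Q_s, 0, "right", 1.0, 1, "left")                  # actual -> 0.10
```

Q-learning credits the transition as if the greedy action will follow, while SARSA credits it according to what the exploring policy actually does, which is the source of their different convergence behavior.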

The application of temporal difference learning with function approximation, notably through neural networks, was prominently demonstrated by TD-Gammon, an AI backgammon player. This system achieved a level of play approaching that of world-champion human players and, crucially, developed strategies previously unknown to experts. Quantitative results further support the efficacy of these algorithms; in a 5×5 gridworld environment, both Q-learning and SARSA consistently converged with a Root Mean Squared Error (RMSE) of less than 0.01 across all 25 states, indicating a high degree of accuracy and stability in value estimation.

Off-policy algorithms (Q-learning, Q(λ), DQN) consistently converge to the optimal value function <span class="katex-eq" data-katex-display="false">V^{*}</span>, while on-policy algorithms (SARSA, REINFORCE, NPG, PPO) exhibit persistent discrepancies between the learned value function <span class="katex-eq" data-katex-display="false">V(s)</span> and the optimal one, particularly in states not visited during training, as determined by the convergence criterion <span class="katex-eq" data-katex-display="false">\max_{s}|V(s)-V^{*}(s)|<0.1</span>.

Direct Policy Optimization: Steering Towards Optimal Behavior

Policy Gradient methods represent a class of reinforcement learning algorithms that directly parameterize and optimize the policy function, rather than estimating a value function. This is achieved by estimating the gradient of the expected cumulative reward with respect to the policy parameters. The core principle involves adjusting the policy parameters in the direction that increases the probability of actions leading to higher rewards. Algorithms like REINFORCE accomplish this through Monte Carlo sampling, collecting trajectories and using the observed returns to update the policy. The gradient calculation relies on the likelihood of the observed actions given the policy and the associated returns, effectively guiding the policy towards more favorable behavior.
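A bare-bones REINFORCE sketch on a two-armed bandit shows the score-function gradient in action (the bandit, its rewards, and the learning rate are illustrative assumptions, not taken from the survey):

```python
import math
import random

random.seed(0)

def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

def reinforce_bandit(steps=3000, lr=0.1):
    """REINFORCE on a two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.2.
    For a softmax policy the score function is (1{a=i} - pi_i), so the
    per-step gradient estimate is (1{a=i} - pi_i) * G."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(theta)
        a = 0 if random.random() < probs[0] else 1
        G = 1.0 if a == 1 else 0.2           # return of this one-step episode
        for i in range(2):
            indicator = 1.0 if i == a else 0.0
            theta[i] += lr * (indicator - probs[i]) * G
    return softmax(theta)

probs = reinforce_bandit()
```

Subtracting a baseline from `G` leaves the gradient estimate unbiased while reducing its variance, which is one standard refinement of this basic scheme.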

Initial implementations of policy gradient methods, while theoretically sound, frequently exhibited instability during training. This instability stemmed from large policy updates that could drastically alter the agent’s behavior, leading to divergence from optimal policies. To address this, algorithms such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) were developed. These methods constrain the magnitude of policy updates within a specified trust region or through clipping mechanisms, preventing overly aggressive changes and promoting more stable learning. By limiting the step size, TRPO and PPO ensure that the new policy remains relatively close to the previous policy, improving the likelihood of consistent improvement and preventing catastrophic performance drops.
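PPO’s clipping mechanism reduces to a few lines; the sketch below shows the per-sample clipped surrogate with the commonly used epsilon of 0.2 (an illustrative value, not one specified here):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(ratio * advantage, clipped)

# A ratio far above 1 earns no extra credit when the advantage is positive...
assert ppo_clip_objective(1.5, 2.0) == 2.4       # capped at 1.2 * 2.0
# ...but a poor update is penalized in full when the advantage is negative.
assert ppo_clip_objective(1.5, -2.0) == -3.0     # the min keeps the worse value
```

The outer `min` makes the objective pessimistic: the policy gains nothing from moving far outside the trust region, but still feels the full cost of doing so.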

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) address the instability of naive policy gradient methods by limiting the magnitude of policy updates during each iteration, trading a small amount of per-step progress for reliable improvement. Empirical results also demonstrate the efficacy of Temporal Difference (TD) learning when applied to Conditional Choice Probability (CCP) estimation: TD-based CCP estimation reduced Mean Squared Error (MSE) by a factor of 4 to 11 relative to traditional state-space discretization techniques.

Off-policy methods consistently converge to the optimal policy <span class="katex-eq" data-katex-display="false">\pi^{*}</span> across all states, while on-policy methods exhibit persistent errors in states distant from the optimal trajectory, as demonstrated by retained incorrect actions.

Beyond Human Performance: The Rise of Deep Reinforcement Learning

Deep Q-Networks (DQNs) represented a pivotal advancement in artificial intelligence by showcasing the potential for an agent to learn directly from raw, high-dimensional inputs – in this case, the pixel data of Atari game screens. Prior to DQNs, reinforcement learning algorithms typically relied on hand-engineered features to represent the game state. However, DQNs utilized a deep neural network to process these pixels, effectively learning to identify relevant patterns and features automatically. This breakthrough enabled the agent to achieve human-level performance on a range of Atari 2600 games, demonstrating that complex skills could be acquired solely through trial and error and visual observation – a feat previously considered unattainable for artificial intelligence systems. The success of DQNs not only validated the combination of deep learning and reinforcement learning but also opened new avenues for tackling problems where defining informative features is challenging or impractical.

AlphaGo Zero represented a significant leap forward in artificial intelligence, achieving mastery of the complex board game Go not through supervised learning from human games, but entirely through self-play. This innovative approach utilized a single deep neural network to simultaneously predict both moves and the eventual outcome of games, iteratively improving its strategy by playing millions of matches against itself. The result was a system that not only equaled but demonstrably surpassed the performance of all previous Go programs, and crucially, consistently defeated the world’s top human players. This achievement underscored the potential of reinforcement learning to unlock superhuman capabilities in domains previously considered the exclusive province of human intellect, showcasing a capacity for strategic thinking and complex pattern recognition that redefined the boundaries of AI expertise.

The triumph of algorithms like AlphaGo Zero underscores a pivotal advancement in artificial intelligence: the synergistic combination of deep learning and reinforcement learning. These techniques, when united, empower agents to navigate intricate challenges demanding sequential decision-making – problems where optimal actions depend on a history of prior choices and observations. Deep learning provides the capacity to discern patterns and representations from raw, high-dimensional data, while reinforcement learning furnishes a framework for learning through trial and error, maximizing cumulative rewards over time. This pairing isn’t merely additive; it unlocks the potential for agents to not only master established domains like game-playing, but also to address real-world complexities in robotics, resource management, and autonomous systems, pushing the boundaries of what machines can learn and achieve independently.

Charting Future Directions: Beyond Current Limitations

A significant hurdle in the widespread adoption of reinforcement learning lies in its notorious data appetite. Contemporary algorithms frequently demand immense datasets – often exceeding those readily available in real-world applications – to achieve proficient performance. This reliance on extensive training severely restricts their utility in resource-constrained environments, such as robotics operating with limited sensor data, or healthcare applications where patient interactions are precious and experimentation costly. The need for millions of interactions to master even moderately complex tasks poses a practical barrier, hindering the deployment of these powerful techniques in scenarios where data acquisition is difficult, expensive, or time-consuming. Consequently, a central focus of ongoing research centers on developing methods that can learn effectively from fewer examples, paving the way for reinforcement learning’s integration into a broader spectrum of practical problems.

Contemporary research in reinforcement learning prioritizes strategies to overcome limitations in real-world application, specifically concerning the substantial data requirements of many algorithms. Current efforts center on enhancing sample efficiency – enabling agents to learn effectively from fewer interactions with their environment – and building algorithms that are more robust to noisy or unpredictable conditions. A significant hurdle lies in balancing exploration – discovering new and potentially rewarding actions – with exploitation – capitalizing on known successful strategies. Furthermore, improving an agent’s ability to generalize – applying learned knowledge to novel situations beyond the training data – remains a core focus, with techniques like domain randomization and transfer learning showing particular promise in bridging the gap between simulated environments and complex real-world scenarios.

The convergence of model-based reinforcement learning, hierarchical reinforcement learning, and meta-learning represents a compelling trajectory for advancing artificial intelligence. Model-based approaches, which involve learning a representation of the environment, offer the potential to significantly reduce the need for extensive trial-and-error, as agents can predict the consequences of their actions. This predictive capability is further enhanced when combined with hierarchical reinforcement learning, allowing complex tasks to be broken down into manageable sub-goals and learned more efficiently. Crucially, the addition of meta-learning, or ‘learning to learn’, enables agents to rapidly adapt to new, unseen environments by leveraging prior experience across a distribution of tasks. This synergistic combination promises to overcome current limitations in sample efficiency and generalization, ultimately paving the way for AI systems capable of robust, adaptable, and truly intelligent behavior in complex, real-world scenarios.

The survey meticulously details the progression of reinforcement learning algorithms, from the foundational principles of dynamic programming to the complexities of deep reinforcement learning. This echoes Albert Camus’ observation: “The struggle itself… is enough to fill a man’s heart. One must imagine Sisyphus happy.” The iterative nature of these algorithms – constantly refining policies through trial and error, much like Sisyphus endlessly pushing his boulder – highlights the inherent process of seeking optimal control. The value lies not merely in achieving a final solution, but in the continuous learning and adaptation embedded within the system’s architecture, a concept central to the article’s exploration of temporal difference learning and policy gradients.

Beyond the Algorithm

This survey reveals a field rapidly accumulating tools, algorithms branching like city streets. Yet the infrastructure itself warrants continued scrutiny. Reinforcement learning, while potent, often treats problems as isolated entities, optimizing for immediate reward without fully accounting for systemic consequences. A truly robust system, be it an economic model or a robotic controller, must exhibit resilience, adapting not just to changes within its defined parameters, but also to unanticipated shifts in the broader environment. The current emphasis on deep learning as a universal function approximator risks obscuring the importance of structural understanding; one does not fix a traffic jam by adding more lanes without redesigning the intersections.

The connection to causal inference, though promising, remains largely superficial. Establishing genuine counterfactual reasoning-understanding not just what happened, but why and what if-requires moving beyond correlation and embracing mechanisms. Algorithms must be designed to interrogate the underlying generative processes, not merely predict outcomes. This necessitates a return to first principles, incorporating domain knowledge and theoretical constraints rather than relying solely on data-driven approaches.

Future progress hinges not on inventing entirely new algorithms, but on evolving the existing framework. Like a well-planned city, the infrastructure should evolve without rebuilding the entire block. The challenge lies in creating systems that are not merely intelligent, but understandable, adaptable, and ultimately, sustainable in the face of inevitable complexity.


Original article: https://arxiv.org/pdf/2603.08956.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
