AI Takes the Wheel: Mastering F1 Race Strategy with Reinforcement Learning

Author: Denis Avetisyan


Researchers have developed an AI framework capable of learning optimal Formula 1 race strategies through self-play and real-time adaptation.

A training scheme iteratively refines an agent’s policy through self-play, initially pitting it against a single opponent before strategically incorporating past, high-performing iterations (those with the highest Elo scores) into a continually evolving opponent pool, ensuring progressive challenge and refinement.

This work presents a reinforcement learning approach for multi-agent race strategy optimization, incorporating competitor modeling, energy management, and pit stop scheduling in Formula 1.

Optimizing race strategy in Formula 1 demands real-time adaptation to dynamic conditions and competitor actions, a challenge for conventional approaches. This is addressed in ‘Learning-based Multi-agent Race Strategies in Formula 1’, which proposes a reinforcement learning framework for training multi-agent systems to collaboratively optimize pit-stop timing, tire selection, and energy management. Through self-play training and an interaction module accounting for opponent behavior, the framework generates robust and adaptive strategies capable of consistently improving race performance. Could such learning-based systems ultimately provide a competitive edge for race strategists both on and off the track?


The Inevitable Optimization: Charting a Course Through Chaos

For decades, Formula 1 race strategy has been largely shaped by the experience of engineers and the real-time judgments of pit wall strategists, supplemented by pre-race simulations that, while sophisticated, possess inherent limitations. These simulations often struggle to accurately model the chaotic and ever-changing conditions of a Grand Prix – factors like unpredictable weather, safety car deployments, and the evolving performance of tires all introduce significant variables. Consequently, strategies developed beforehand, or even calculated during the race based on incomplete data, frequently prove suboptimal as the race unfolds. The reliance on human intuition, though valuable, is susceptible to cognitive biases and the sheer speed at which events transpire, leading to missed opportunities for maximizing performance and highlighting a clear need for more robust and adaptive strategic tools.

Achieving the fastest possible race time in Formula 1 is far more than simply driving quickly; it’s a deeply intricate optimization problem spanning numerous interacting variables. Teams must simultaneously manage the vehicle’s energy systems – deploying and recovering electrical power for performance gains while avoiding depletion – and meticulously track tire degradation, as grip diminishes with use impacting lap times. Critically, this occurs not in isolation, but alongside the unpredictable actions of competitors. Each overtake, defensive maneuver, or pit stop alters the strategic landscape, demanding constant recalculation. This confluence of factors creates a ‘high-dimensional’ problem, meaning the number of variables and their potential interactions quickly becomes overwhelming for traditional analytical methods, pushing teams to explore advanced computational techniques like reinforcement learning to navigate the complexities and identify optimal strategies.

Current Formula 1 race strategy tools often fall short due to their limited ability to model the intricate aerodynamic wake interactions between vehicles. A leading car disrupts airflow, creating turbulent conditions that significantly impact the performance of following cars – a phenomenon known as ‘dirty air’. Existing computational methods typically simplify these interactions, failing to capture the subtle but crucial shifts in downforce and drag experienced by cars operating within another’s wake. This simplification introduces considerable uncertainty into performance predictions, as the actual aerodynamic effects can vary dramatically based on relative positioning, speed differentials, and even minor track variations. Consequently, strategies developed using these tools may miscalculate optimal overtaking opportunities, pit stop timings, and ultimately, overall race outcomes, highlighting the need for more sophisticated modeling techniques that accurately represent these complex aerodynamic dependencies.

The race car model receives the agent’s action <span class="katex-eq" data-katex-display="false">\mathbf{a}</span>, gap time <span class="katex-eq" data-katex-display="false">t_{gap}</span>, and interaction-induced lap time <span class="katex-eq" data-katex-display="false">\Delta T_{int}</span> to produce observations for both the ego car <span class="katex-eq" data-katex-display="false">\mathbf{o}</span> and its opponent <span class="katex-eq" data-katex-display="false">\mathbf{\tilde{o}}</span>, as detailed in [10].

Adaptive Intelligence: Forging a Strategy Through Experience

Reinforcement Learning (RL) was implemented to develop autonomous agents capable of optimizing race performance through strategic decision-making. This approach frames the racing problem as a Markov Decision Process, where the agent learns an optimal policy by interacting with a simulated environment and receiving rewards based on its actions. The agent observes the race state – including its own position, tire condition, energy reserves, and the gaps to competitors – and selects strategic actions such as energy deployment, pit-stop timing, and tire compound choice. Through iterative training, the agent adjusts its policy to maximize cumulative reward, effectively learning to manage the race and outmaneuver opponents to achieve the fastest possible race time. This differs from traditional rule-based or pre-computed strategies by enabling adaptive behavior and learning from experience.
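
The episode structure described above can be sketched as a minimal agent-environment loop. Everything here – the `RaceEnv` class, its toy lap-time model, and the placeholder policy – is an illustrative assumption, not the paper's implementation:

```python
# Minimal sketch of the MDP interaction loop; RaceEnv and its dynamics
# are hypothetical stand-ins for the paper's race simulator.
class RaceEnv:
    """Toy race environment: state is (laps remaining, tyre wear)."""
    def __init__(self, laps=5):
        self.laps = laps

    def reset(self):
        self.lap = 0
        self.wear = 0.0
        return (self.laps - self.lap, self.wear)

    def step(self, action):
        # action: fraction of available electrical energy deployed this lap
        self.lap += 1
        self.wear += 0.1
        lap_time = 90.0 - 2.0 * action + 5.0 * self.wear  # toy lap-time model
        reward = -lap_time                 # faster laps -> higher reward
        done = self.lap >= self.laps
        return (self.laps - self.lap, self.wear), reward, done

env = RaceEnv()
obs, total, done = env.reset(), 0.0, False
while not done:
    action = 0.5                           # placeholder for the learned policy
    obs, reward, done = env.step(action)
    total += reward
```

The negative-lap-time reward makes "minimize race time" and "maximize cumulative reward" the same objective, which is the framing the text describes.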

The Soft Actor-Critic (SAC) algorithm was selected for its capacity to efficiently learn optimal policies in continuous action spaces, crucial for fine-grained strategic decisions such as how much electrical energy to deploy on a given lap. SAC is an off-policy algorithm, allowing it to learn from a replay buffer of past experiences, improving data efficiency. This is achieved through a maximum entropy reinforcement learning approach, encouraging exploration and preventing premature convergence to suboptimal strategies. The algorithm balances reward maximization with entropy maximization, resulting in more robust and adaptable agents capable of quickly adjusting to varied track conditions, opponent behaviors, and race dynamics. Specifically, SAC utilizes an actor network to select actions and a pair of critic networks to evaluate their quality, iteratively refining the policy through gradient descent.
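
The entropy-regularised objective at the heart of SAC shows up most clearly in the critic's target value. The sketch below computes that one-step target; the numbers are arbitrary and the function is a pedagogical reduction of the algorithm, not the paper's training code:

```python
def soft_td_target(reward, gamma, q1_next, q2_next, logp_next, alpha):
    """Entropy-regularised TD target used by SAC-style critics.

    Takes the minimum of two critic estimates (clipped double-Q) and
    subtracts alpha * log-probability of the sampled next action, so
    higher-entropy policies receive higher value.
    """
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * soft_value

# One-step example: r=1, gamma=0.9, critics=(2.0, 3.0), log pi=-1.0, alpha=0.2
y = soft_td_target(1.0, 0.9, 2.0, 3.0, -1.0, 0.2)
```

The `alpha` temperature is what trades off exploitation against exploration: at `alpha=0` this reduces to an ordinary clipped double-Q Bellman target.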

The Agent Interaction Module is a core component of the reinforcement learning framework, designed to address the multi-agent nature of racing scenarios. This module enables the RL agent to predict the likely actions of competing agents based on observed behaviors and race context, such as position, speed, and track location. These predictions are then integrated into the agent’s decision-making process, allowing it to proactively adjust its strategy – for example, by blocking overtaking maneuvers or capitalizing on competitor errors. The module utilizes a recurrent neural network to maintain an internal state representing the observed history of each opponent’s actions, improving the accuracy of future action predictions and facilitating more sophisticated, reactive racing behavior.
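
The paper specifies only that a recurrent network maintains per-opponent state; the Elman-style tanh recurrence, the dimensions, and the linear readout below are illustrative assumptions used to show the shape of such a predictor:

```python
import numpy as np

class OpponentPredictor:
    """Minimal recurrent predictor of an opponent's next action from
    its observed history. Architecture details are assumptions; the
    source only states that a recurrent network tracks each opponent."""
    def __init__(self, obs_dim, hidden_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_x = rng.normal(0.0, 0.1, (hidden_dim, obs_dim))
        self.W_o = rng.normal(0.0, 0.1, (act_dim, hidden_dim))
        self.h = np.zeros(hidden_dim)      # internal state per opponent

    def step(self, opponent_obs):
        # Fold the latest observation of the opponent into the hidden
        # state, then read out a predicted next action.
        self.h = np.tanh(self.W_h @ self.h + self.W_x @ opponent_obs)
        return self.W_o @ self.h

pred = OpponentPredictor(obs_dim=4, hidden_dim=8, act_dim=2)
for t in range(3):                         # feed a short observation history
    a_hat = pred.step(np.ones(4) * t)
```

In the framework this prediction would feed back into the ego agent's own observation, which is what lets it block overtakes or exploit mistakes proactively rather than reactively.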

Self-play training involves the reinforcement learning agent competing against multiple instances of itself, iteratively improving its racing strategy through experience gained in countless simulated races. This process generates a diverse training dataset without the need for human-labeled data or predefined scenarios. The agent’s policy is updated after each race based on the outcome, effectively creating a continuously evolving opponent and accelerating the learning process. By repeatedly engaging in self-competition, the agent develops robustness against a wide range of racing styles and learns to exploit subtle strategic advantages, ultimately leading to a highly competitive and adaptable racing AI.
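
The Elo-gated opponent pool described in the training-scheme caption can be sketched as follows; the capacity, the uniform sampling, and the snapshot placeholders are assumptions for illustration:

```python
import random

class OpponentPool:
    """Sketch of an Elo-gated self-play pool: keep the top-k past
    policy snapshots by Elo score and sample among them as opponents.
    Snapshot objects here are just placeholder strings."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.pool = []                      # list of (elo, snapshot) pairs

    def add(self, elo, snapshot):
        self.pool.append((elo, snapshot))
        self.pool.sort(key=lambda p: p[0], reverse=True)
        del self.pool[self.capacity:]       # evict the weakest snapshots

    def sample(self, rng=random):
        return rng.choice(self.pool)[1]

pool = OpponentPool(capacity=3)
for elo, name in [(1000, "v1"), (1040, "v2"), (980, "v3"), (1100, "v4")]:
    pool.add(elo, name)
# The weakest snapshot "v3" has been evicted; v4, v2 and v1 remain.
```

Retaining only high-Elo snapshots is what keeps the curriculum progressive: the current policy always trains against opponents near or above its own level.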

The agent utilizes a frozen base policy [10] combined with a trainable interaction module to determine actions <span class="katex-eq" data-katex-display="false">\mathbf{a}</span> based on its own observation <span class="katex-eq" data-katex-display="false">\mathbf{o}</span> and that of the opponent <span class="katex-eq" data-katex-display="false">\mathbf{\tilde{o}}</span>, allowing for adaptive behavior.

The Engine of Strategy: Modeling Hybrid Power and Energy Flow

The Hybrid-Electric Power Unit (HEPU) is modeled with component-level fidelity, including the internal combustion engine, electric motor-generator, battery pack, and associated power electronics. This detailed representation allows for the simulation of energy flow between these components, accounting for efficiencies and limitations at each stage. Specifically, the model captures the engine’s torque curve as a function of speed and throttle input, the electric motor’s torque and power limits, and the battery’s state of charge and discharge rates. Furthermore, energy losses due to component inefficiencies – such as engine friction, motor losses, and battery internal resistance – are incorporated to provide a realistic assessment of available power and energy. This granularity is essential for accurately predicting the impact of control actions on energy deployment and, consequently, vehicle performance.

The model maintains discrete, time-varying representations of both fuel and battery energy content, quantified in joules. Fuel energy is depleted based on engine load and efficiency, while battery energy is affected by charging via regenerative braking and discharge to the electric motor. The rate of fuel consumption is determined by engine operating parameters such as throttle position and RPM, and is calculated using a polynomial fuel map. Battery state of charge (SoC) is tracked as a percentage, influencing both available power output and regenerative braking capacity. This explicit modeling of energy reserves allows the reinforcement learning agent to evaluate the trade-offs between utilizing internal combustion engine power and electric motor assistance, optimizing for performance and energy conservation across different driving conditions.
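
A single lap of this bookkeeping can be sketched as below. The linear `fuel_map` lambda stands in for the polynomial fuel map mentioned in the text, and all magnitudes are illustrative, not real F1 figures:

```python
def lap_energy_update(fuel_j, soc, throttle, deploy_j, regen_j,
                      battery_capacity_j=4.0e6,
                      fuel_map=lambda thr: 1.0e8 * thr):
    """One-lap update of fuel and battery reserves, in joules.

    fuel_map is a placeholder for the polynomial fuel map described in
    the text; deploy_j and regen_j are the lap's discharge to the motor
    and recovery from regenerative braking, respectively.
    """
    fuel_j = max(0.0, fuel_j - fuel_map(throttle))        # burn fuel
    battery_j = soc * battery_capacity_j
    battery_j = min(battery_capacity_j,
                    max(0.0, battery_j - deploy_j + regen_j))
    return fuel_j, battery_j / battery_capacity_j         # new fuel, new SoC

fuel, soc = lap_energy_update(fuel_j=5.0e8, soc=0.5,
                              throttle=0.8, deploy_j=1.0e6, regen_j=0.5e6)
```

The clipping at zero and at capacity is what forces the trade-off the text describes: deploying now spends a reserve that regenerative braking can only partially refill later.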

Effective energy management is a primary determinant of vehicle performance in racing scenarios, and consequently, is the core driver of the reinforcement learning agent’s policy. The agent’s decision-making process prioritizes maximizing cumulative reward, which is directly linked to minimizing lap times and optimizing overall race completion time. This is achieved by strategically allocating power between the internal combustion engine and the electric motor, balancing performance gains with energy expenditure. The agent learns to anticipate track conditions and energy demands, optimizing energy usage for both short-term speed and long-term race viability. Specifically, the agent’s actions – such as engine torque allocation and regenerative braking intensity – are continuously adjusted to maintain an optimal state of charge for the battery while delivering the necessary power for each driving phase.

The integration of a detailed hybrid powertrain model with a Reinforcement Learning (RL) framework facilitates the development of optimal control strategies for diverse racing scenarios. The model accurately simulates the interaction between the internal combustion engine, electric motor, and battery system, providing the RL agent with a realistic environment for learning. This allows the agent to explore various energy management policies and identify strategies that maximize performance in both short-duration, high-power qualifying laps and longer, strategically-focused race conditions. The RL algorithm optimizes control parameters – such as engine torque distribution and battery discharge rates – based on accumulated rewards, ultimately leading to learned policies tailored to the specific demands of each race phase.

During the duel, agent AA (blue) strategically allocated fuel and battery energy, and timed pit stops to overcome a <span class="katex-eq" data-katex-display="false">0.5</span> unit starting delay and ultimately gain an advantage over agent BB (red), as evidenced by the negative gap time.

The Dance of Aerodynamics: Unveiling Performance Through Interaction

The performance of a racing car is critically influenced by aerodynamic interactions with those nearby, most notably through the wake effect. As a leading vehicle cuts through the air, it generates turbulent air in its wake – a region of reduced pressure and disrupted airflow. This turbulence dramatically diminishes the downforce available to a following car, reducing grip and increasing drag, thereby impacting its cornering speed and overall pace. The model explicitly accounts for this complex phenomenon, simulating how a car’s aerodynamic efficiency is dynamically altered by its proximity to competitors, and realistically captures the challenges of maintaining speed while trailing another vehicle through corners and along straights. This nuanced representation of aerodynamic dependencies is central to the agent’s ability to optimize overtaking maneuvers and overall race strategy.

The agent’s enhanced racing strategy stems from its ability to model and react to aerodynamic turbulence, specifically the wake created by leading vehicles. Rather than treating following cars as static obstacles, the reinforcement learning agent learns to predict how this disturbed airflow will affect its own performance, allowing for proactive adjustments to steering and speed. This predictive capability is crucial for successfully navigating overtaking maneuvers; the agent doesn’t simply attempt a pass, but anticipates the resulting turbulence and compensates accordingly, maintaining control and minimizing speed loss. Consequently, the agent demonstrates a marked improvement in overtaking success rate, exploiting opportunities that a simpler, non-turbulent model would miss, and achieving a more efficient and competitive race pace.
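
The interaction-induced lap-time loss <span class="katex-eq" data-katex-display="false">\Delta T_{int}</span> introduced earlier can be sketched as a simple function of the gap time <span class="katex-eq" data-katex-display="false">t_{gap}</span>. The linear decay and the constants below are assumptions for illustration, not the paper's fitted aerodynamic model:

```python
def interaction_lap_time_delta(t_gap, max_penalty=0.8, cutoff=2.0):
    """Illustrative model of the interaction-induced lap-time loss
    delta_T_int (seconds) as a function of gap time t_gap (seconds):
    the penalty is largest when following closely and decays linearly
    to zero beyond the cutoff, where the wake no longer matters."""
    if t_gap >= cutoff:
        return 0.0
    return max_penalty * (1.0 - t_gap / cutoff)
```

Any monotonically decaying shape captures the qualitative behaviour the text describes: the closer a car follows, the more downforce it loses and the slower its lap becomes.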

The simulation model demonstrates a nuanced understanding of how a vehicle’s lap time is dynamically affected by both prevailing aerodynamic conditions and the immediate presence of competing cars. It doesn’t simply calculate an average drag coefficient; instead, the model predicts variations in lap time, accounting for turbulent wake effects and downforce reductions experienced when following closely behind another vehicle. This predictive capability allows the reinforcement learning agent to proactively adjust its racing line and speed, optimizing its pace not just for a single lap, but for the entire race duration. The accuracy of these lap time predictions is critical, as it directly informs the agent’s decision-making process, enabling it to capitalize on opportunities for overtaking and effectively defend its position against rivals – ultimately leading to significant performance gains.

Extensive computational simulations reveal a substantial performance advantage for the reinforcement learning agent when accounting for multi-car interactions. The agent consistently achieved faster lap times compared to a single-agent policy, culminating in an approximate 12.52-second reduction in overall race time. This improvement isn’t merely statistical; an Elo rating system further quantifies the agent’s dominance, assigning it a score of 1000 – a clear indication of superior racing capability when factoring in the complexities of aerodynamic drafting and turbulence experienced in a multi-vehicle environment. These findings demonstrate the critical role of incorporating realistic physical interactions into the training process for achieving optimized race performance.

The agent learns to navigate an environment with a fixed competitor car, where aerodynamic interactions and observations from both vehicles influence its control.

Toward Adaptive Strategy: A Vision for the Future of Racing Intelligence

To address the inherent unpredictability of Formula 1 racing, future research will integrate Monte Carlo Simulation with the reinforcement learning framework. This involves running numerous simulated race scenarios, each with slightly varied conditions – such as weather fluctuations or competitor actions – to evaluate the potential outcomes of different strategic choices. By repeatedly sampling these possibilities, the agent can move beyond deterministic planning and develop a more robust strategy that accounts for uncertainty. This allows for a probabilistic assessment of risk and reward, enabling the agent to identify optimal decisions even when faced with incomplete information or unexpected events. The result is a racing strategy less susceptible to disruption and better equipped to capitalize on opportunities as they arise, ultimately leading to improved performance and more consistent results.
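
The core of such a Monte Carlo evaluation is averaging a strategy's outcome over sampled scenarios. The toy race-time model, the scenario parameters, and all constants below are illustrative assumptions, not the proposed system:

```python
import random

def race_time(strategy, scenario):
    """Toy race-time model standing in for a full race simulation:
    later pit stops accumulate more degradation-induced time loss."""
    return (5400.0
            + strategy["pit_lap"] * scenario["degradation"]
            + scenario["noise"])

def evaluate_strategy(strategy, n_scenarios=1000, seed=0):
    """Monte Carlo evaluation: average race time over randomly sampled
    scenarios (a tyre degradation rate and a noise term covering
    unmodelled events such as safety cars)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_scenarios):
        scenario = {"degradation": rng.uniform(0.2, 0.8),
                    "noise": rng.gauss(0.0, 5.0)}
        total += race_time(strategy, scenario)
    return total / n_scenarios

early_pit = evaluate_strategy({"pit_lap": 15})
late_pit = evaluate_strategy({"pit_lap": 30})
```

Using the same seed for both evaluations is a common variance-reduction trick: the two strategies are compared on identical scenario draws, so the ranking reflects the strategies rather than sampling noise.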

Investigations are underway to integrate tire compound selection directly into the reinforcement learning framework, treating it as a fluid, adaptable parameter rather than a fixed decision. This approach allows the agent to dynamically assess track conditions, weather forecasts, and competitor strategies to optimize tire choices throughout a race in real-time. By modeling tire degradation and performance characteristics, the system aims to identify optimal pit stop timings and compound combinations, maximizing lap times and overall race performance. The intention is to move beyond pre-defined strategies and allow the agent to react intelligently to the evolving demands of each unique race scenario, potentially uncovering novel strategies previously unseen in Formula 1.

To rigorously assess and refine the developed reinforcement learning agents, an Elo Rating System – a method originally designed to rank chess players – will be implemented. This system facilitates a comparative analysis of agent performance by pitting them against each other in simulated races, dynamically adjusting their ratings based on win-loss outcomes. The resulting Elo scores provide a quantifiable measure of strategic skill, enabling researchers to identify the most effective algorithms and track improvements over time. By continually challenging agents against increasingly skilled opponents, the system fosters a cycle of continuous learning and optimization, ultimately pushing the boundaries of automated race strategy and providing a robust framework for benchmarking future advancements in the field.
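
The standard Elo update underlying such a ranking is compact enough to show in full; the K-factor of 32 is a conventional default, not a value from the paper:

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update: expected score from the logistic curve,
    ratings adjusted by K times the surprise. score_a is 1.0 for a win,
    0.5 for a draw, and 0.0 for a loss by agent A."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Equal ratings: the winner gains exactly K/2 points, the loser drops K/2.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because upsets (wins against much stronger opponents) move ratings more than expected results, the system rewards agents that beat increasingly skilled opponents, which is exactly the continuous-improvement loop described above.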

The convergence of reinforcement learning, Monte Carlo simulation, and dynamic parameter optimization holds the potential to fundamentally reshape Formula 1 strategy. Current strategic decisions, reliant on complex modeling and human intuition, may soon be surpassed by agents capable of continuously learning and adapting to the unpredictable nature of a race. By precisely evaluating countless scenarios and optimizing choices like tire compound selection in real-time, these systems aren’t simply improving existing strategies – they are unlocking performance gains previously considered unattainable. The implementation of an Elo Rating System allows for rigorous benchmarking, fostering a cycle of continuous improvement that could redefine the competitive landscape of motorsport, moving beyond incremental gains to genuine strategic innovation.

Duels between agents reveal that pit stop and tire compound strategies – using medium (yellow) or soft (red) tires as indicated by the tire icons – significantly impact race time differences, as measured relative to a baseline race between agent AA and agent CC, with the upper agent starting <span class="katex-eq" data-katex-display="false">0.5</span> laps behind the lower agent.

The pursuit of optimal race strategies, as detailed in this work, echoes a fundamental principle of all systems: adaptation. This paper’s reinforcement learning framework, allowing agents to refine strategies through self-play, isn’t merely about winning a race; it’s about gracefully responding to the inevitable entropy of competition. As Alan Turing observed, “Sometimes people who are unhappy tend to look for happiness in the wrong places.” Similarly, suboptimal initial strategies are not failures, but signals: opportunities to learn and recalibrate within the complex environment of Formula 1. The energy management and pit stop optimization detailed within demonstrate a dialogue with past iterations, refining approaches to delay, if not avoid, eventual decay.

The Long Run

The pursuit of optimal race strategy, as demonstrated by this work, is less about achieving a final, perfect solution and more about building systems resilient to inevitable entropy. Each iteration of self-play isn’t simply refinement; it’s a versioning of accumulated experience, a form of memory against the ceaseless degradation of predictive accuracy. The very notion of ‘optimization’ is a snapshot, a momentarily stable state within a dynamic system always tending toward disorder. The arrow of time always points toward refactoring, toward adaptation to competitors whose strategies themselves are not static points, but evolving trajectories.

A key limitation remains the fidelity of the simulation. Real-world Formula 1 is a chaotic confluence of factors – unpredictable weather, mechanical failures, driver error – all absent from the controlled environment. Extending the framework to incorporate these stochastic elements isn’t merely a technical challenge; it’s an acknowledgement that perfect prediction is impossible, and robust adaptability is paramount. The true test will lie in the system’s capacity to gracefully degrade, to maintain competitive performance even as its underlying assumptions are violated.

Future work might explore the integration of anticipatory mechanisms, systems capable of not just reacting to opponent moves, but modeling their intentions. However, even such foresight is ultimately bounded by uncertainty. The most interesting question isn’t how to build the best strategy, but how to build a strategy that can survive long enough to become legend, a testament to its capacity for continual renewal.


Original article: https://arxiv.org/pdf/2602.23056.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-02-28 10:07