Learning to Route: AI Surpasses Heuristics for the Traveling Salesman Problem

Author: Denis Avetisyan


A new approach using offline reinforcement learning trains an AI to find better routes than traditional methods, even without live experimentation.

The system calculates node embeddings from graph coordinates, then utilizes a causal transformer decoder – informed by both node transitions and predicted relational trajectory graphs <span class="katex-eq" data-katex-display="false">\hat{R}_{t}</span> – to forecast subsequent relational trajectory graphs <span class="katex-eq" data-katex-display="false">\tilde{R}_{t}</span>, demonstrating a method for relational reasoning within dynamic graph structures.

Decision Transformers, trained on existing solutions and leveraging goal-conditioned learning, achieve state-of-the-art performance on the Traveling Salesman Problem.

Despite decades of algorithmic development, solving computationally intractable problems like the Traveling Salesman Problem remains a significant challenge. This is addressed in ‘Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem’, which introduces a novel approach leveraging offline reinforcement learning to learn superior strategies directly from existing heuristic solutions. By integrating a Pointer Network with the Decision Transformer architecture and employing expectile regression for optimistic reward conditioning, the authors demonstrate consistent performance gains over classical heuristics. Could this framework unlock the potential of offline RL to surpass embedded domain knowledge in a wider range of combinatorial optimization tasks?


The Inherent Limits of Expedient Routes

The Traveling Salesman Problem (TSP) serves as a crucial test case in the field of optimization, yet its inherent computational complexity poses a substantial challenge to conventional algorithmic solutions. The problem asks for the shortest possible route that visits each of a given set of cities exactly once and returns to the origin city; however, the number of possible routes grows factorially with the number of cities – meaning even a modest increase in cities rapidly renders exhaustive search impractical. Specifically, with n cities there are (n-1)! possible routes, quickly exceeding the capabilities of even the most powerful computers. This factorial growth in complexity highlights why traditional methods, while effective for smaller instances, struggle to scale and provide timely solutions for real-world applications involving numerous locations, driving the need for innovative approaches to this fundamental problem.
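The factorial blow-up described above can be made concrete with a few lines of arithmetic. This is an illustrative sketch, not code from the paper; fixing the start city of a tour with n cities leaves (n-1)! orderings of the remaining cities.

```python
from math import factorial

# Number of distinct tours for a TSP with n cities, fixing the start city:
# (n - 1)! orderings of the remaining cities.
def route_count(n: int) -> int:
    return factorial(n - 1)

for n in (5, 10, 20):
    print(n, route_count(n))
```

At n=20 the count already exceeds 10^17, which is why exhaustive search is abandoned in favor of heuristics and learned policies.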

Traditional heuristic methods for combinatorial optimization, though capable of delivering reasonably good solutions, frequently encounter limitations when confronted with real-world complexity and increasing problem size. Algorithms like simulated annealing or genetic algorithms, while versatile, often require substantial computational resources as the number of variables and constraints grows – a phenomenon known as poor scalability. Furthermore, these methods typically rely on hand-crafted rules and parameters, demanding significant expert knowledge for effective tuning in diverse scenarios. Consequently, adapting these heuristics to novel or intricate problems can be a laborious process, often yielding suboptimal results or failing to converge within practical timeframes. This inflexibility highlights a critical need for more robust and adaptable approaches capable of handling the inherent challenges of large-scale, complex combinatorial landscapes.

The inherent difficulties faced by traditional algorithms when addressing combinatorial optimization problems, such as the Traveling Salesman Problem, are driving a surge in research utilizing machine learning techniques. These methods offer the potential to overcome the scalability issues and limitations in adapting to intricate scenarios that often plague conventional heuristics. Recent work demonstrates a significant advancement in this area; through the implementation of a novel machine learning framework, researchers have successfully achieved solutions with an optimality gap of less than 10%. This indicates a substantial improvement in both the quality of solutions generated and the computational efficiency with which they are found, suggesting machine learning is a promising path toward resolving previously intractable optimization challenges.

The Transformer as a Cartographer of Possibility

The Decision Transformer is an offline reinforcement learning framework designed to learn policies directly from pre-collected datasets of demonstrated solutions. This approach deviates from traditional reinforcement learning by eliminating the need for active environment interaction during the learning process. Instead, the framework treats reinforcement learning as a sequence modeling problem, leveraging the Transformer architecture to predict future actions based on past states and returns contained within the static dataset. The input to the Decision Transformer consists of sequences of (return-to-go, state, action) tuples, allowing it to learn from data generated by any behavioral policy, including heuristic methods, and subsequently generate its own policies for decision-making.
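The sequence construction can be sketched as follows. This is a minimal illustration under the standard Decision Transformer formulation (function names are hypothetical, not from the paper): per-step rewards are converted into returns-to-go, then interleaved with states and actions into one flat token sequence for the transformer.

```python
# Sketch of Decision Transformer input construction (names hypothetical).
# Per-step rewards become returns-to-go (suffix sums), then each timestep
# contributes three tokens: return-to-go, state, action.
def returns_to_go(rewards):
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def flatten_trajectory(states, actions, rewards):
    rtg = returns_to_go(rewards)
    tokens = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens

# A 3-step trajectory with negative rewards (e.g. tour-length penalties).
seq = flatten_trajectory(states=[0, 1, 2], actions=[1, 2, 0],
                         rewards=[-1.0, -2.0, -0.5])
```

Conditioning on the leading return-to-go token is what lets the model be steered toward high-return behavior at inference time.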

The Decision Transformer differentiates itself from conventional reinforcement learning by eliminating the need for active environment interaction during the training phase. Traditional RL algorithms require an agent to explore an environment and learn through trial and error, a process which can be time-consuming and resource-intensive. In contrast, the Decision Transformer operates solely on pre-collected static datasets of past experiences or solutions. This capability allows it to directly utilize existing knowledge – such as datasets of human-provided solutions or previously recorded agent behavior – without requiring further environmental sampling, making it particularly suitable for scenarios where real-time interaction is impractical or costly.

The Decision Transformer demonstrates improved performance in combinatorial optimization by reframing the problem as a sequential modeling task within an offline reinforcement learning framework. Empirical results indicate consistent outperformance against four established heuristics – Nearest Neighbor (NN), Nearest Insertion (NI), Farthest Insertion (FI), and Simulated Annealing (SA) – across varying problem instances. Specifically, the Decision Transformer achieved superior results on problems with sizes N=20, N=50, and N=100, suggesting scalability and robustness in modeling complex sequential decision-making processes inherent in combinatorial optimization tasks.

A Chorus of Heuristics: Constructing the Learning Dataset

The training of our Decision Transformer relies on a dataset generated using four established heuristics for solving the Traveling Salesman Problem: Nearest Neighbor, Nearest Insertion, Farthest Insertion, and Simulated Annealing. Nearest Neighbor constructs a tour by iteratively visiting the closest unvisited city. Nearest Insertion adds cities to the tour based on the minimum increase in total distance, while Farthest Insertion prioritizes adding the city furthest from the existing tour. Simulated Annealing employs a metaheuristic approach, probabilistically accepting worse solutions to escape local optima. Combining the outputs of these diverse heuristics provides a robust and varied training set, exposing the Decision Transformer to a range of solution strategies and improving its generalization capability.
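The simplest of these heuristics, Nearest Neighbor, can be sketched in a few lines. This is an illustrative implementation, not the authors' data-generation code; a real pipeline would vary the start city and record each resulting tour as a training sequence.

```python
import math

# Nearest Neighbor construction heuristic: start at city 0 and repeatedly
# visit the closest unvisited city until the tour is complete.
def nearest_neighbor_tour(coords):
    n = len(coords)
    unvisited = set(range(1, n))
    tour = [0]
    while unvisited:
        last = coords[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, coords[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Total length of a closed tour (returns to the start city).
def tour_length(coords, tour):
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))
```

On a unit square of four corner cities, this greedy construction happens to find the optimal closed tour of length 4; on larger random instances it typically does not, which is exactly the gap the learned policy is meant to close.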

The integration of Nearest Neighbor, Nearest Insertion, Farthest Insertion, and Simulated Annealing heuristics into the training data generation process yields a robust dataset due to the complementary characteristics of each algorithm. Nearest and Farthest Insertion methods efficiently construct relatively short tours, while Nearest Neighbor provides a baseline for comparison. Simulated Annealing, as a stochastic optimization technique, explores a wider solution space, generating more diverse, though potentially suboptimal, routes. This diversity is critical; by exposing the Decision Transformer to solutions derived from varied algorithmic approaches, the model avoids overfitting to a single solution style and develops a generalized policy capable of effectively addressing a broader range of Traveling Salesman Problem instances.

The training dataset is constructed by executing four established Traveling Salesman Problem (TSP) heuristics – Nearest Neighbor, Nearest Insertion, Farthest Insertion, and Simulated Annealing – on a range of problem instances. Each execution of a heuristic generates a complete tour, represented as an ordered sequence of city visitations, which constitutes a single training example. This sequence of actions, detailing the path taken to solve the TSP, serves as the input for the Decision Transformer. The resulting dataset, comprised of numerous such action sequences, provides the necessary data for the model to learn a policy for solving the TSP, effectively mapping problem states to optimal or near-optimal actions.

The Allure of Optimism: Guiding Exploration with Anticipated Rewards

The Decision Transformer’s capabilities are significantly augmented through the implementation of Expectile Regression, a technique used to forecast optimistic Return-to-Go (RTG) values. Rather than predicting the most likely cumulative reward, this approach focuses on estimating potential upper bounds, effectively guiding the model to prioritize exploration of trajectories that could yield substantially higher rewards. By consistently aiming for these potentially superior outcomes, the system is encouraged to venture beyond conventionally promising paths, ultimately discovering solutions that might otherwise remain hidden within the vast solution space. This targeted optimism doesn’t simply introduce randomness; it provides a learned bias towards exploring areas with genuine potential for improvement, leading to a more efficient and effective search process.
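The asymmetry at the heart of expectile regression is easy to state as a loss function. The sketch below follows the standard expectile loss (parameter names are ours, not necessarily the paper's): under-predictions are weighted by α and over-predictions by 1-α, so α > 0.5 pulls the fitted return-to-go toward optimistic upper values, while α = 0.5 recovers ordinary squared error up to a constant factor.

```python
# Expectile regression loss (sketch). For residual u = target - prediction,
# positive residuals (under-prediction) are weighted by alpha and negative
# residuals (over-prediction) by 1 - alpha. With alpha > 0.5 the minimizer
# is an optimistic upper expectile of the target distribution.
def expectile_loss(pred: float, target: float, alpha: float = 0.9) -> float:
    u = target - pred
    weight = alpha if u > 0 else 1.0 - alpha
    return weight * u * u
```

Minimizing this loss over a dataset of heuristic returns yields a conditioning signal above the average observed return, which is what licenses the model to aim higher than the heuristics that produced its training data.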

The model’s capacity for effective exploration is substantially broadened by prioritizing paths anticipated to yield greater rewards. Conventional reinforcement learning algorithms often become fixated on locally optimal solutions, failing to investigate potentially more fruitful, yet initially less promising, trajectories. By strategically predicting and emphasizing optimistic Return-to-Go values, the system is incentivized to venture beyond these familiar routes, actively seeking out solutions that would otherwise remain hidden. This directed exploration doesn’t rely on random chance; instead, the model intelligently focuses its search on areas with a higher likelihood of containing superior rewards, ultimately leading to the discovery of more effective strategies and a significantly expanded solution space.

The implementation of Optimistic Return-to-Go prediction demonstrably elevates the quality of solutions generated by the Decision Transformer while simultaneously broadening its exploratory capacity. This approach, which prioritizes paths anticipated to yield higher rewards, achieves approximately twofold performance gains when contrasted with the established Simulated Annealing heuristic. Rigorous testing reveals consistent outperformance compared to models utilizing squared error loss – specifically with an α=0.5 parameter – indicating a substantial improvement in both efficiency and the ability to discover optimal strategies. By encouraging the model to consider a wider range of possibilities, this mechanism facilitates the identification of solutions previously inaccessible through conventional methods, ultimately leading to more robust and effective decision-making.

The Pointer Network: Navigating Discrete Action Spaces

Addressing the complexities of the Traveling Salesman Problem requires innovative approaches to action selection, and this work introduces an integration of a Pointer Network within the Decision Transformer architecture. This combination enables the model to effectively navigate the discrete space of possible cities to visit, a significant challenge for sequence-to-sequence models. The Pointer Network functions as an attention mechanism that dynamically focuses on the most relevant city at each step of the tour, allowing the Decision Transformer to learn a policy for constructing optimal or near-optimal routes. By directly pointing to the next city in the sequence, the model avoids generating invalid or infeasible solutions, ensuring the creation of complete and valid tours, and ultimately improving performance on this complex combinatorial optimization problem.

The core innovation lies in the Pointer Network’s ability to address the Traveling Salesman Problem’s discrete action space; instead of predicting a complete tour directly, the model iteratively selects each subsequent city to visit. This is achieved through an attention mechanism that dynamically focuses on the remaining unvisited cities, effectively ‘pointing’ to the next destination in the optimal route. Crucially, this approach guarantees the generation of valid and feasible tours, as the model is constrained to select only from available cities and avoids repeating visits. By decoupling the tour construction from a fixed-size output, the Pointer Network enables the Decision Transformer to learn a policy for sequential decision-making within the combinatorial landscape, leading to significantly improved solution quality and a demonstrable reduction in the optimality gap.
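The feasibility guarantee described above comes down to masking. A minimal sketch of the selection step (names hypothetical, and using plain lists rather than tensors): attention scores over cities are masked so already-visited cities receive zero probability before the next destination is chosen.

```python
import math

# Pointer-style selection sketch: softmax over attention scores, with
# visited cities masked out so the decoder can never revisit a city.
def masked_softmax(scores, visited):
    exp = [0.0 if i in visited else math.exp(s) for i, s in enumerate(scores)]
    total = sum(exp)
    return [e / total for e in exp]

# Greedy decoding: pick the highest-probability unvisited city.
def pick_next(scores, visited):
    probs = masked_softmax(scores, visited)
    return max(range(len(probs)), key=lambda i: probs[i])
```

Because masked entries are exactly zero, every decoded sequence is a permutation of the cities by construction; no post-hoc repair of infeasible tours is needed.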

The synergy between the Decision Transformer and Pointer Network architectures demonstrates considerable potential beyond the Traveling Salesman Problem, offering a novel approach to diverse combinatorial optimization tasks. By leveraging the Decision Transformer’s ability to learn from sequential data and the Pointer Network’s dynamic action selection, the combined system effectively navigates complex, discrete solution spaces. Evaluations across multiple benchmarks reveal an overall optimality gap of less than 10%, indicating a high degree of solution quality and suggesting that this framework could be adapted to address challenges in areas such as resource allocation, scheduling, and routing – potentially redefining approaches to these traditionally difficult problems.

The pursuit of optimization, as demonstrated by this work on the Traveling Salesman Problem, inherently acknowledges the imperfect nature of systems. Each heuristic, each attempted solution, represents a step in the system’s evolution, a refinement born from previous iterations. This mirrors the inevitable decay all systems experience. As Bertrand Russell observed, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” The Decision Transformer, by learning from existing heuristics – even flawed ones – doesn’t seek to erase the past, but to build upon it. This approach, leveraging goal-conditioned learning and optimistic reward prediction, recognizes that progress isn’t about achieving perfect solutions, but about continually refining the process itself, accepting that incidents – or suboptimal solutions – are integral to maturity.

What Lies Ahead?

The demonstration that a Decision Transformer can distill, and ultimately surpass, heuristic solutions to the Traveling Salesman Problem is less a triumph of algorithmic ingenuity than an acknowledgement of entropy. Heuristics, after all, are merely localized minima in a vast search space: temporary bulwarks against the inevitable decay toward suboptimal solutions. This work doesn’t solve the TSP; it extends the lifespan of effective strategies, logging the system’s chronicle in the weights of a neural network. The true challenge isn’t achieving better performance on known instances, but maintaining competence as the distribution of problems shifts – a test of generalization, not optimization.

The reliance on offline datasets, while pragmatic, introduces a temporal constraint. The learned policy is, by definition, anchored to a past era of solution-finding. Future work must address the problem of continual adaptation, allowing the system to incorporate new data without catastrophic forgetting. Furthermore, the choice of ‘return-to-go’ as a conditioning signal, though effective, feels somewhat arbitrary. Exploring alternative goal representations – signals that capture not just where to go, but how to explore – may unlock more robust and transferable policies.

Ultimately, this research highlights a broader trend: the displacement of explicit algorithms by implicit ones. The Decision Transformer isn’t programmed to solve the TSP; it’s trained to behave as if it has. Deployment is merely a moment on the timeline, and the true measure of success won’t be peak performance, but the rate at which performance degrades – the elegance with which the system ages.


Original article: https://arxiv.org/pdf/2603.25241.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-27 17:46