Author: Denis Avetisyan
A novel framework leverages inexpensive labels and self-supervision to enhance the robustness and efficiency of surrogate-based optimization for complex problems.

This work introduces a three-stage amortized optimization approach that improves the training of neural surrogates used in constrained optimization and related fields like physics-informed learning.
Scaling optimization and simulation often demands expensive, high-quality data, creating a fundamental tension between accuracy and cost. The work presented in ‘Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels’ addresses this challenge with a novel framework leveraging imperfect, low-cost labels to train neural surrogates. By combining supervised pretraining with self-supervision, the authors demonstrate that models require only modest amounts of inexact data to converge rapidly and achieve improved performance across challenging domains like constrained optimization and power-grid operation. Could this approach unlock efficient solutions for a broader range of complex, real-world problems currently limited by data acquisition costs?
Beyond Iteration: A Paradigm Shift in Optimization
Many optimization challenges, from logistical planning to engineering design, are traditionally addressed using iterative algorithms. These methods begin with an initial guess and progressively refine it through repeated calculations, aiming to approach an optimal solution. However, this process can be remarkably demanding in terms of computational resources, particularly as the complexity of the problem increases. Each iteration requires evaluating the objective function and updating the solution, and the number of iterations needed to achieve a satisfactory result can grow exponentially with the dimensionality of the problem. This computational expense becomes a significant bottleneck in real-time applications or when dealing with large-scale systems, hindering the ability to efficiently find optimal or even near-optimal solutions within a reasonable timeframe. Consequently, researchers are actively exploring alternative approaches that circumvent the limitations of iterative methods and offer faster, more efficient pathways to solving complex optimization problems.
Conventional optimization techniques frequently encounter significant challenges when applied to real-world problems characterized by a large number of variables and intricate relationships. Power systems, for instance, involve coordinating countless generators, transmission lines, and loads, creating a high-dimensional space where finding the optimal operating point becomes computationally prohibitive. Similarly, dynamic simulations, which model everything from fluid flow to complex mechanical systems, demand solving equations across numerous time steps and spatial dimensions. The computational burden escalates rapidly with increasing dimensionality, often rendering iterative methods impractically slow or even incapable of converging to a solution within a reasonable timeframe. This limitation highlights the need for alternative approaches capable of efficiently tackling the complexities inherent in these high-dimensional, real-world scenarios.
Rather than repeatedly refining an approximate solution, a paradigm shift is occurring where machine learning models directly learn to solve optimization problems. This approach treats the solution process itself as the target of learning, enabling models to bypass iterative refinement and predict optimal or near-optimal solutions directly from problem specifications. By training on vast datasets of problem instances, these learned solvers can generalize to unseen scenarios, potentially achieving significant speedups and improved performance compared to traditional methods – especially in complex domains where each iteration of a conventional algorithm is computationally demanding. This offers a pathway toward real-time optimization and control in areas like power grid management, robotics, and dynamic system simulations, promising a future where solutions aren’t calculated, but recognized.

Accelerated Convergence: Establishing Robust Initial Conditions
Warm-Start Pretraining significantly improves the training efficiency of learned solvers by establishing a robust initial parameter configuration. Rather than initiating the learning process from random values, this technique leverages prior knowledge or data to generate a starting point close to an optimal solution. This pre-training phase effectively reduces the search space for the solver, allowing it to quickly refine the initial parameters and converge on a high-quality solution with fewer iterations. The resultant improvement in training speed is substantial, as demonstrated by a reported 59x reduction in offline computational time when compared to training from a randomly initialized state.
Cheap Label Generation is employed as a pretraining strategy to minimize the computational expense associated with creating training datasets. This process involves generating approximate labels for data instances using methods that are significantly faster and less resource-intensive than traditional, manual labeling or fully supervised approaches. While these generated labels may not be perfectly accurate, they provide a sufficient signal for the learned solver to establish a reasonable initial state. The resulting dataset, though approximate, allows for pretraining at a reduced cost, ultimately accelerating the convergence of the model during subsequent refinement stages.
Initializing a learned solver from a reasonably accurate starting point, rather than from random values, both accelerates convergence and improves solution reliability. Because the pretrained model begins near an approximate solution instead of a completely random state, it requires far fewer iterative refinement steps to reach an optimal solution; in testing, this translated into a 59x reduction in offline computation compared to training the same model with fully supervised learning techniques.
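As a concrete illustration of this recipe, the following numpy sketch (a toy construction under assumed forms, not the paper's implementation) pretrains a one-parameter surrogate on inexpensive labels produced by a few truncated gradient steps, then refines it with a self-supervised stage that minimizes the problem's objective value directly, with no labels at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem family: minimize f(x) = 0.5*a*x^2 - b*x, whose optimum is x* = b/a (a > 0).
# "Cheap labels": a few truncated gradient steps from x = 0 give an inexact
# approximation of x*, far cheaper than solving each instance to convergence.
def cheap_label(a, b, steps=3, lr=0.1):
    x = 0.0
    for _ in range(steps):
        x = x - lr * (a * x - b)   # gradient of f w.r.t. x
    return x

# A batch of problem instances.
a = rng.uniform(1.0, 2.0, size=256)
b = rng.uniform(-1.0, 1.0, size=256)
y_cheap = cheap_label(a, b)        # inexact, low-cost labels

# Tiny surrogate: x_hat = w * (b / a); the exact solution map corresponds to w = 1.
feat = b / a

# Stage 1: supervised pretraining on the cheap labels (closed-form least squares).
w = float(feat @ y_cheap / (feat @ feat))

# Stage 2: self-supervised refinement, descending the objective value itself:
# d/dw E[f(w * feat)] = E[(a * w * feat - b) * feat].
for _ in range(200):
    x_hat = w * feat
    grad_w = np.mean((a * x_hat - b) * feat)
    w -= 0.5 * grad_w

print(w)   # converges to ~1.0, i.e. the exact solution map x* = b/a
```

The pretraining stage lands w near the optimum despite the labels being inexact, so the label-free refinement stage only has to close a small gap, which is the intuition behind the reported speedups.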

Direct Solution Mapping: A Learned Function of Problem Instance
Amortized optimization fundamentally shifts the problem-solving paradigm by training a machine learning model to directly predict solutions given a specific problem instance as input. Traditional optimization methods iteratively search for a solution, requiring substantial computational effort for each new instance. In contrast, an amortized approach learns a mapping from the problem instance space to the solution space, effectively ‘learning to solve’ rather than solving each instance from scratch. This is achieved by representing the solution as a function parameterized by a neural network, which is then trained on a dataset of problem instances and their corresponding optimal solutions, allowing for rapid solution generation for unseen instances.
Supervised Learning forms the primary training methodology for Direct Solution Mapping. This involves constructing a dataset comprising numerous problem instances paired with their corresponding optimal, or highly accurate, solutions, typically obtained through established numerical solvers or experimental data. The machine learning model is then trained on this dataset to learn the functional relationship between problem input and solution output. This allows the model to generalize and predict solutions for new, unseen problem instances. The quality of the training data, both in terms of dataset size and the accuracy of the provided solutions, directly impacts the performance and generalization capability of the resulting model.
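A minimal sketch of this setup, assuming a toy constrained problem (projection onto [0, 1]) and a small hand-rolled numpy network in place of whatever architecture the paper uses: solver outputs serve as supervised labels, and inference afterwards is a single forward pass instead of a solver call.

```python
import numpy as np

rng = np.random.default_rng(1)

# Constrained problem family: min_x 0.5*(x - b)^2  s.t.  0 <= x <= 1.
# Its exact solution map is a projection: x*(b) = clip(b, 0, 1).
def solve(b):
    return np.clip(b, 0.0, 1.0)

# Dataset: problem instances b paired with solver-produced labels x*(b).
B = rng.uniform(-1.0, 2.0, size=(256, 1))
X = solve(B)

# One-hidden-layer ReLU network trained to imitate the solver.
W1 = rng.normal(0.0, 1.0, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.3, size=(16, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(3000):
    H = np.maximum(B @ W1 + b1, 0.0)      # forward pass
    P = H @ W2 + b2
    dP = 2.0 * (P - X) / len(B)           # gradient of MSE w.r.t. predictions
    dW2 = H.T @ dP; db2 = dP.sum(0)
    dH = (dP @ W2.T) * (H > 0)            # backprop through ReLU
    dW1 = B.T @ dH; db1 = dH.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

# Amortized inference: one forward pass replaces the solver on unseen instances.
b_test = np.linspace(-1.0, 2.0, 50).reshape(-1, 1)
pred = np.maximum(b_test @ W1 + b1, 0.0) @ W2 + b2
err = float(np.max(np.abs(pred - solve(b_test))))
print(err)
```

The trade-off described in the text is visible here: label quality and dataset size bound how faithfully the learned map tracks the true projection, while inference cost becomes independent of the original solver.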
Evaluations demonstrate that models produced through direct solution mapping achieve a mean squared error (MSE) in physics residuals that is statistically comparable to fully supervised learning approaches. Importantly, these models exhibit a significantly faster convergence rate during training when benchmarked against traditional Physics-Informed Neural Network (PINN) training methodologies. This accelerated convergence is observed across a range of tested problem instances, indicating improved computational efficiency without sacrificing solution accuracy, as quantified by the physics residual MSE.

Taming Dynamics: Integrating Physics into the Neural Network
Solving dynamic systems – those that evolve over time – often presents significant computational hurdles. Recent advancements utilize Physics-Informed Neural Networks (PINNs) to overcome these challenges by fundamentally altering the modeling approach. Instead of treating the system as a ‘black box’, PINNs integrate known physical laws – expressed as differential equations – directly into the neural network’s learning process. This isn’t simply adding constraints after training; the network is trained to satisfy these laws. The physical equations become part of the loss function, guiding the network towards solutions that are not only accurate but also physically plausible. Consequently, PINNs require less data to achieve reliable results, demonstrate improved generalization capabilities, and offer a pathway to modeling complex systems where traditional numerical methods struggle.
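Folding the governing equation into the loss can be shown on a toy ODE. The sketch below is an illustrative construction, not the paper's model: it uses a one-parameter ansatz in place of a neural network and minimizes the mean-squared physics residual of u'(t) = -u(t) with u(0) = 1, whose exact solution is u(t) = exp(-t).

```python
import numpy as np

# Collocation points on the domain of interest.
t = np.linspace(0.0, 2.0, 64)

# Toy "network": u(t; a) = exp(a*t), which satisfies the initial condition
# u(0) = 1 by construction; the exact solution corresponds to a = -1.
def residual(a):
    u = np.exp(a * t)
    du = a * u                      # analytic derivative of the ansatz
    return du + u                   # r = u' + u, identically zero for the true solution

a = 0.5                             # deliberately bad initial guess
for _ in range(500):
    r = residual(a)
    # d r / d a = exp(a*t) * (1 + (a + 1) * t)
    dr = np.exp(a * t) * (1.0 + (a + 1.0) * t)
    grad = np.mean(2.0 * r * dr)    # gradient of the mean-squared residual
    a -= 0.05 * grad

print(a)   # approaches -1.0
```

In a full PINN the ansatz is a neural network and the derivative comes from automatic differentiation, but the loss has exactly this shape: the differential equation itself, evaluated at collocation points, is the training signal, so no solution data is required.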
Effective training of neural networks for dynamic systems relies on strategically managing computational resources and the complexity of the learning process. Techniques such as Time Curriculum address this by initially training the network on short-term predictions, gradually extending the prediction horizon as learning progresses, analogous to a student mastering basic concepts before tackling more complex problems. Complementing this is Residual-Based Adaptive Refinement, a method that dynamically concentrates computational effort on areas of the solution exhibiting rapid changes or high error. This targeted approach, which focuses on the ‘interesting’ parts of the dynamic system, not only accelerates training but also enhances the accuracy of predictions, particularly in scenarios where the system’s behavior is highly sensitive to initial conditions or external forces. By intelligently allocating resources, these techniques enable the network to efficiently learn and accurately represent the complexities inherent in dynamic systems.
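Both heuristics can be sketched in a few lines. The toy loop below uses assumed forms of the two techniques (and the same one-parameter ODE ansatz u(t; a) = exp(a*t) for u' = -u as an illustration, not the paper's setup): it grows the time horizon in stages, and within each stage it trains only on the collocation points with the largest residuals.

```python
import numpy as np

rng = np.random.default_rng(2)

# Residual of u' + u = 0 under the ansatz u(t; a) = exp(a*t); true solution: a = -1.
def residual(t, a):
    return (a + 1.0) * np.exp(a * t)

a = 0.5
horizons = [0.5, 1.0, 2.0]               # time curriculum: short -> long horizons
for T in horizons:
    for _ in range(200):
        # Dense candidate pool on the current horizon.
        cand = rng.uniform(0.0, T, size=256)
        r = residual(cand, a)
        # Residual-based adaptive refinement: keep only the worst 64 points.
        idx = np.argsort(-np.abs(r))[:64]
        tt, rt = cand[idx], r[idx]
        dr = np.exp(a * tt) * (1.0 + (a + 1.0) * tt)
        a -= 0.05 * np.mean(2.0 * rt * dr)

print(a)   # approaches -1.0
```

The curriculum keeps early training on the easy short-horizon regime, while the refinement step spends each gradient evaluation where the physics residual is currently worst.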
The developed framework exhibits a notable increase in robustness when predicting dynamic system behavior. This improvement is quantified by demonstrating expanded basins of attraction – the range of initial conditions that lead to a stable solution – across numerous independent simulations initiated with different random seeds. Critically, the incidence of ‘degenerate runs’, where the model fails to converge to a physically plausible solution, is significantly reduced. This enhanced reliability suggests the physics-informed neural network is less sensitive to minor perturbations or uncertainties in initial conditions, offering more consistent and trustworthy predictions for complex dynamic systems, even when faced with noisy or incomplete data.

Enhancing Generalization & Robustness: Towards Autonomous Scientific Discovery
Batch Normalization represents a significant advancement in training deep neural networks, especially when dealing with the intricacies of complex systems. This technique normalizes the activations of each layer, reducing internal covariate shift – the change in the distribution of network activations during training. By stabilizing the learning process, Batch Normalization allows for the use of higher learning rates and mitigates the vanishing/exploding gradient problem, ultimately leading to faster convergence and improved generalization performance. The effect is particularly pronounced in deep networks where the compounding effect of small changes in weights can easily destabilize training; Batch Normalization effectively smooths this process, enabling the network to learn more robust and reliable representations of the underlying data, and preventing overfitting to the training set.
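The normalization step itself is compact. A minimal training-mode forward pass in numpy (omitting the running statistics that standard implementations keep for inference):

```python
import numpy as np

# Batch normalization, training mode: normalize each feature over the batch,
# then apply a learnable scale (gamma) and shift (beta).
def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # ~zero mean, ~unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(3)
x = rng.normal(5.0, 3.0, size=(128, 4))     # badly scaled activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))        # ~0 and ~1 per feature
```

With gamma = 1 and beta = 0 the layer simply standardizes each feature across the batch; the learnable parameters let the network recover any scale and offset it actually needs, which is why normalization does not cost expressive power.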
The techniques developed demonstrate broad applicability, extending beyond standard classification and regression tasks to encompass the nuanced analysis of time series data through ‘Sliding Window’ methodologies. This approach allows the system to process sequential data by focusing on localized segments, enabling the identification of patterns and anomalies within dynamic systems – for example, predicting equipment failure from sensor readings or forecasting financial market trends. By applying these methods to sliding windows, the network effectively learns temporal dependencies and generalizes its understanding across different time scales, proving particularly valuable in scenarios where data exhibits non-stationarity or evolving characteristics. The adaptability of this framework suggests its potential integration into diverse fields, including signal processing, climate modeling, and even natural language processing where sequential data is paramount.
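The windowing itself is straightforward. A minimal numpy sketch that converts a series into (window, next-value) pairs, the usual supervised setup for the forecasting tasks mentioned above:

```python
import numpy as np

# Slide a fixed-width window along a 1-D series; each window predicts the
# value immediately after it.
def sliding_windows(series, width):
    X = np.stack([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]                # target: the value right after each window
    return X, y

series = np.arange(10, dtype=float)   # stand-in for e.g. sensor readings
X, y = sliding_windows(series, width=3)
print(X.shape, y.shape)               # (7, 3) and (7,)
print(X[0], y[0])                     # [0. 1. 2.] -> 3.0
```

Any of the models discussed earlier can then be trained on (X, y); varying the window width is how the network is exposed to different time scales.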
Investigations are now directed towards extending the capabilities of these techniques to significantly more intricate systems, moving beyond current limitations and embracing challenges presented by high-dimensional data and non-linear dynamics. A central aim is the development of algorithms capable of not merely modeling observed phenomena, but of autonomously discerning the fundamental physical laws governing those systems – essentially, enabling machines to perform scientific discovery. This involves exploring novel approaches to symbolic regression and causal inference, combined with the power of neural networks to handle complex relationships, potentially unlocking a new era of automated scientific understanding and predictive modeling across diverse fields like climate science, materials discovery, and drug development.

The pursuit of efficient optimization, as detailed in this work, fundamentally relies on establishing a rigorous framework for evaluating surrogate models. Robert Tarjan once stated, “The best algorithm is the one that solves the problem.” This sentiment echoes the paper’s core approach: rather than relying on complex, computationally expensive methods, the authors prioritize a three-stage amortized optimization utilizing inexpensive labels. This allows for a demonstrably stable and accurate learning process, particularly crucial when navigating challenging constrained optimization landscapes. The method’s success stems from a commitment to provable efficacy, mirroring Tarjan’s emphasis on solving the problem correctly, not simply achieving a result that appears to work.
What’s Next?
The pursuit of inexpensive optimization, as demonstrated, merely shifts the burden: from computationally expensive evaluations to the acquisition of ‘cheap’ labels. The elegance of this transfer remains to be proven. While supervised pretraining and self-supervision offer pragmatic gains in navigating complex landscapes, they do not address the fundamental issue: the inherent ill-conditioning often present in constrained optimization. The basin of attraction, that siren song of convergence, remains frustratingly sensitive to initialization and problem structure.
Future work must move beyond empirical demonstration and toward provable guarantees. The current framework, while demonstrably effective, lacks the mathematical rigor to definitively state why it works, or under what conditions its performance will degrade. A deeper exploration of the interplay between surrogate model fidelity and optimization algorithm robustness is crucial. Specifically, investigations into the spectral properties of the Hessian, and the development of preconditioners tailored to the learned surrogate, are warranted.
Ultimately, the field requires a shift in perspective. It is not enough to simply find solutions; one must understand the geometry of the optimization space itself. Until then, even the most sophisticated amortized framework remains a clever heuristic, a temporary reprieve from the inescapable complexities of true mathematical elegance.
Original article: https://arxiv.org/pdf/2603.05495.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-08 06:25