Smarter Data Sampling for Faster, More Accurate Machine Learning

Author: Denis Avetisyan

A new strategy combines Gaussian processes and graph neural networks to dramatically improve the efficiency of building predictive models for complex field data.

A coupled framework adaptively samples scalar fields by leveraging a Gaussian process surrogate, which estimates means $\mu_{GP}$ and variances $\sigma_{GP}$ from inputs $\bm{\xi}$, and a field model that integrates scalar quantities to approximate means $\mu_{GNN}$ and variances $\sigma_{GNN}$. The resulting misfit and epistemic uncertainties drive an iterative infill criterion that refines the sampling points and updates both surrogates.

This review details goal-driven adaptive sampling techniques for surrogate modeling, uncertainty quantification, and aerodynamic prediction using machine learning.

Despite the increasing reliance on expensive black-box simulations in fields like computational fluid dynamics, achieving desired accuracy with minimal computational cost remains a significant challenge. This is addressed in ‘Goal-Driven Adaptive Sampling Strategies for Machine Learning Models Predicting Fields’, which introduces an active learning strategy combining Gaussian process regression with graph neural networks to efficiently build surrogate models for field predictions. By simultaneously reducing both epistemic uncertainty and the discrepancy between scalar and field predictions, this approach improves accuracy and reduces computational expense, as shown with uncertainty propagation using the NASA Common Research Model. Could this adaptive sampling framework unlock more robust and cost-effective solutions for complex scientific modeling and prediction tasks?

The Challenge of Accurate Flow Prediction

The pursuit of efficient and high-performing aircraft fundamentally relies on the accurate prediction of how air flows around and through their designs. Aerodynamic forces dictate lift, drag, and stability, meaning even subtle inaccuracies in flow prediction can translate to significant performance deficits or even catastrophic failure. Particularly challenging is the modeling of turbulent regimes, where chaotic, swirling motions dominate, requiring immense computational power to resolve. Optimizing aircraft for fuel efficiency, reducing noise pollution, and ensuring structural integrity all demand a deep understanding of these complex flows, necessitating predictive tools that move beyond simplified assumptions and capture the full breadth of aerodynamic behavior. Consequently, advancements in flow prediction aren’t merely academic exercises, but rather critical drivers of innovation within the aerospace industry, directly impacting both economic viability and flight safety.

Conventional computational fluid dynamics (CFD) relies heavily on solving the Reynolds-Averaged Navier-Stokes (RANS) equations, a process demanding significant computational resources. This expense stems from the need to discretize complex geometries and simulate time-dependent flow fields, often requiring high-resolution meshes and iterative solution schemes. Even with powerful supercomputers, a single high-fidelity simulation can take days or weeks to complete, severely limiting the number of design iterations possible during the aircraft development cycle. The time-consuming nature of these simulations also presents challenges for real-time applications, such as flight control systems or aerodynamic shape optimization. Consequently, researchers continually seek more efficient methods to achieve accurate flow predictions without sacrificing computational speed, including exploring turbulence modeling improvements and leveraging advanced computing architectures.

Current computational fluid dynamics approaches, while powerful, frequently encounter limitations when modeling the intricate details of turbulent flows around aircraft. The inherent chaotic nature of turbulence introduces uncertainties that are difficult to resolve with traditional methods, leading to inaccuracies in predicting aerodynamic behavior. This difficulty in capturing complex phenomena – such as flow separation, vortex shedding, and shockwave interactions – not only compromises the fidelity of simulations but also severely restricts a designer’s ability to confidently explore a wide range of design options. Without reliable uncertainty quantification, engineers are often forced to rely on conservative estimates, potentially leading to over-designed, less efficient aircraft, or, conversely, designs that operate too close to performance limits and risk failure.

Computational fluid dynamics simulations, validated by experimental data, demonstrate that surface skin friction coefficients vary with angle of attack, with uncertainty quantified using the surrogate-based uncertainty quantification (SBUQ) approach and a GP-GNN infill criterion.

Surrogate Modeling: A Path to Computational Efficiency

Surrogate modeling presents a computationally efficient alternative to traditional Computational Fluid Dynamics (CFD) simulations by leveraging data-driven approximations. Rather than solving the governing equations of fluid flow for each design iteration or operating condition, a surrogate model is trained on data generated from a limited number of high-fidelity CFD runs. This trained model, typically a machine learning algorithm, then predicts aerodynamic quantities – such as lift, drag, and pressure distribution – with significantly reduced computational cost. The accuracy of the surrogate model is dependent on the quantity and quality of the training data, as well as the suitability of the chosen modeling technique for the specific flow physics and design space.

Gaussian Process Regression (GPR) and Graph Neural Networks (GNNs) provide methods for efficiently predicting aerodynamic quantities by learning from existing Computational Fluid Dynamics (CFD) data. GPR is a probabilistic, non-parametric technique well-suited for modeling complex, non-linear relationships between input parameters and aerodynamic coefficients such as lift, drag, and pitching moment. It provides not only a prediction but also a measure of uncertainty. GNNs, conversely, excel at processing data with inherent graph structures, allowing them to directly model the relationships between discrete points in a flow field or the connectivity of an aerodynamic shape. This allows for the prediction of pressure distributions, velocity fields, or other flow features with reduced computational cost compared to running full CFD simulations for each new input condition. Both techniques require a training phase using CFD data but subsequently offer significantly faster prediction times, enabling rapid exploration of the design space.
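To make the "prediction plus uncertainty" property of GPR concrete, here is a minimal sketch of zero-mean Gaussian process regression on a single scalar input, written with the Python standard library only. The squared-exponential kernel, the fixed length scale, the small noise term, and every function name are assumptions chosen for illustration; the paper's actual kernel and hyperparameter training are not reproduced.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel with unit prior variance (an assumed choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_new, length=1.0, noise=1e-6):
    """Return the posterior mean and standard deviation at x_new."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j], length) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))
    k_star = [rbf(x_new, a, length) for a in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    variance = rbf(x_new, x_new, length) - sum(vi * vi for vi in v)
    return mean, math.sqrt(max(variance, 0.0))
```

Calling `gp_predict` at a training input returns a mean close to the stored value with a near-zero standard deviation, while a query far from all training inputs falls back to the prior mean of zero and a standard deviation close to one. That growth of uncertainty away from the data is what an adaptive sampler can exploit.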

Proper Orthogonal Decomposition (POD) is a dimensionality reduction technique applied to computational fluid dynamics (CFD) data to create a lower-dimensional representation of the flow field. This is achieved by identifying the dominant modes – the patterns that capture the most energy – within a set of high-fidelity CFD snapshots. By projecting the original, high-dimensional flow data onto these POD modes, the dimensionality of the problem is significantly reduced while retaining a specified level of accuracy, typically measured by the amount of energy preserved in the reduced-order model. This simplification allows for faster computation and reduced computational cost in subsequent analyses, such as surrogate model training or uncertainty quantification, without sacrificing critical flow physics.
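The steps described above can be written compactly. The notation below is an assumption introduced for illustration (it is not taken from the paper): $s(\bm{\xi}_j)$ is the $j$-th CFD snapshot, $\bar{s}$ the snapshot mean, $u_k$ the $k$-th POD mode, $\sigma_k$ the $k$-th singular value, and $r$ the number of retained modes.

$$
S = \big[\, s(\bm{\xi}_1) - \bar{s}, \;\dots,\; s(\bm{\xi}_m) - \bar{s} \,\big] = U \Sigma V^{T},
\qquad
E(r) = \frac{\sum_{k=1}^{r} \sigma_k^{2}}{\sum_{k} \sigma_k^{2}},
\qquad
s(\bm{\xi}) \approx \bar{s} + \sum_{k=1}^{r} a_k(\bm{\xi})\, u_k
$$

Here $E(r)$ is the fraction of snapshot energy retained by the first $r$ modes, so $r$ is typically chosen as the smallest value for which $E(r)$ exceeds a target such as 0.99. A surrogate then only needs to predict the $r$ coefficients $a_k(\bm{\xi})$ rather than the full field.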

Surrogate predictions of the surface pressure coefficient are more accurate within the high-probability region of the input space (points a and b) than near the distribution boundary (point c). Models trained on 60 design-of-experiments (DoE) samples differ noticeably from the computational fluid dynamics (CFD) results in the less accurate regions.

Adaptive Sampling: Refining Accuracy Through Intelligent Exploration

Adaptive sampling is an iterative process used to improve the accuracy of machine learning models by intelligently expanding the training dataset. Rather than random sampling, this technique prioritizes the selection of new data points in regions where the model exhibits the greatest uncertainty. This is typically quantified using metrics derived from the model’s predictive distribution, such as predictive variance. By focusing on areas of high uncertainty, adaptive sampling efficiently explores the input space and reduces epistemic uncertainty – the uncertainty stemming from a lack of training data – with fewer samples than traditional methods. The process continues until a pre-defined convergence criterion is met, resulting in a refined dataset that improves model performance and generalization capabilities.
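The iterative process described above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the paper's implementation: `expensive_simulation` is a hypothetical stand-in for a CFD run, and the distance to the nearest existing sample is used as a crude proxy for a model's predictive standard deviation (for a stationary GP, uncertainty also grows with distance from the data).

```python
import math

def expensive_simulation(x):
    """Hypothetical stand-in for a CFD run; returns a made-up scalar response."""
    return math.sin(x) + 0.1 * x

def uncertainty_proxy(x, samples):
    """Distance to the nearest sampled input, standing in for predictive std."""
    return min(abs(x - s) for s in samples)

def adaptive_sampling(initial, candidates, tol=0.25, budget=20):
    """Greedily add the most uncertain candidate until tol or budget is reached."""
    xs = list(initial)
    ys = [expensive_simulation(x) for x in xs]
    for _ in range(budget):
        # Pick the candidate where the current model is least certain.
        best = max(candidates, key=lambda c: uncertainty_proxy(c, xs))
        # Convergence criterion: stop once the worst-case uncertainty is small.
        if uncertainty_proxy(best, xs) < tol:
            break
        xs.append(best)
        ys.append(expensive_simulation(best))
    return xs, ys

if __name__ == "__main__":
    grid = [0.1 * i for i in range(51)]          # candidate inputs on [0, 5]
    xs, ys = adaptive_sampling([0.0, 5.0], grid)
    print(len(xs), "samples selected")
```

In a real workflow the proxy would be replaced by the surrogate's own predictive variance and the surrogate would be retrained after each new sample; the stopping rule and budget play the role of the pre-defined convergence criterion.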

Infill criteria are central to adaptive sampling, directing the selection of new data points to maximize model improvement. Surrogate Error with Misfit (SEwMisfit) is one such criterion. It combines predictive variance, which represents model uncertainty, with the misfit between the two surrogates: the discrepancy between the scalar predicted by the Gaussian process and the corresponding scalar obtained by integrating the field prediction. By prioritizing regions exhibiting both high variance and significant misfit, SEwMisfit focuses sampling efforts on areas where the surrogates are uncertain or disagree with each other. This combined approach contrasts with strategies relying solely on predictive variance, as it also targets systematic disagreement that variance alone does not reveal, leading to more efficient exploration of the design space and faster convergence to an accurate surrogate model.
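A minimal sketch of how a variance-plus-misfit score might rank candidate points follows. The exact SEwMisfit formula is not stated in this article, so the additive form, the weighting by the input probability density, and all names and numbers are assumptions for illustration only. Following the figure captions, the misfit is taken as the gap between the GP scalar and the scalar integrated from the GNN field.

```python
def sewmisfit_score(mu_gp, sigma_gp, mu_gnn, sigma_gnn, density=1.0):
    """Assumed score: input density times (both uncertainties plus surrogate misfit)."""
    misfit = abs(mu_gp - mu_gnn)
    return density * (sigma_gp + sigma_gnn + misfit)

def select_infill(candidates):
    """Return the input x of the candidate with the highest assumed score.

    Each candidate is a dict with keys:
    x, mu_gp, sigma_gp, mu_gnn, sigma_gnn, density.
    """
    best = max(
        candidates,
        key=lambda c: sewmisfit_score(
            c["mu_gp"], c["sigma_gp"], c["mu_gnn"], c["sigma_gnn"], c["density"]
        ),
    )
    return best["x"]

if __name__ == "__main__":
    # Made-up drag-coefficient predictions at three candidate inputs.
    candidates = [
        {"x": 1.0, "mu_gp": 0.020, "sigma_gp": 0.004,
         "mu_gnn": 0.020, "sigma_gnn": 0.003, "density": 1.0},  # uncertain, but surrogates agree
        {"x": 2.0, "mu_gp": 0.025, "sigma_gp": 0.001,
         "mu_gnn": 0.035, "sigma_gnn": 0.001, "density": 1.0},  # confident, but surrogates disagree
        {"x": 3.0, "mu_gp": 0.030, "sigma_gp": 0.001,
         "mu_gnn": 0.030, "sigma_gnn": 0.001, "density": 1.0},  # confident and consistent
    ]
    print(select_infill(candidates))
```

With these made-up numbers the second candidate wins even though both surrogates report low variance there, which illustrates why a misfit term can catch errors that a variance-only criterion would miss.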

Combining Jensen-Shannon Divergence (JSD) with Gaussian Process Regression (GP) and Graph Neural Networks (GNN) facilitates efficient exploration of the design space during adaptive sampling. Specifically, the coupled infill strategies – Surrogate Error with Misfit (SEwMisfit) and JSD – demonstrate a significant reduction in epistemic uncertainty. Comparative analysis indicates these coupled approaches achieve up to a 75% decrease in uncertainty relative to utilizing either GP or GNN independently. This improvement stems from JSD’s ability to quantify distributional divergence, complementing the predictive capabilities of GP and GNN to more effectively identify regions of high model uncertainty for targeted sampling.
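As a hedged illustration of what "distributional divergence" means here, the sketch below computes the Jensen-Shannon divergence between two Gaussian predictive distributions by numerical integration. It assumes that the GP and GNN predictions at a point are each summarized by a mean and a standard deviation; how the paper actually forms and weights the two distributions is not reproduced, and the function names and grid settings are illustrative.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def jensen_shannon(mu_p, sigma_p, mu_q, sigma_q, n=4001):
    """JSD (in nats) between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2).

    Uses a simple Riemann sum on a grid covering 8 standard deviations
    around both means. The result lies between 0 and ln(2).
    """
    lo = min(mu_p - 8.0 * sigma_p, mu_q - 8.0 * sigma_q)
    hi = max(mu_p + 8.0 * sigma_p, mu_q + 8.0 * sigma_q)
    dy = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        y = lo + i * dy
        p = normal_pdf(y, mu_p, sigma_p)
        q = normal_pdf(y, mu_q, sigma_q)
        m = 0.5 * (p + q)  # mixture density
        # Guard against log(0) where a density underflows to zero.
        if p > 0.0:
            total += 0.5 * p * math.log(p / m) * dy
        if q > 0.0:
            total += 0.5 * q * math.log(q / m) * dy
    return total
```

The divergence is zero when the two surrogates predict the same distribution and approaches $\ln 2$ when their predictions barely overlap, so it grows both when the means disagree and when the reported uncertainties differ. This makes it a single number that can rank candidate sampling locations.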

Four adaptive sampling criteria (surrogate error applied to the GP and GNN individually, surrogate error with misfit between the GP and GNN, and Jensen-Shannon divergence coupling the surrogates) guide infill sampling by prioritizing locations $x_{i}^{*}$ where prediction uncertainty is highest, as indicated by the shaded bands around the drag coefficient prediction $C_{D}(x)$ and the joint input probability density.

Validation and Impact on the Future of Aerodynamic Design

The proposed surrogate modeling framework underwent rigorous validation through its application to the NASA Common Research Model, a widely-used configuration for high-speed aerodynamics research. This implementation showcased the framework’s capacity to accurately replicate complex flow phenomena around a realistic aircraft geometry. By efficiently approximating computationally expensive high-fidelity simulations, the model enabled a substantial reduction in analysis time without sacrificing predictive accuracy. The success with the NASA Common Research Model establishes this surrogate modeling approach as a viable and effective tool for aerodynamic design, paving the way for faster and more comprehensive exploration of design spaces and ultimately, improved aircraft performance.

The proposed methodology achieves high-fidelity predictions of critical aerodynamic characteristics – specifically, the Surface Pressure Coefficient and Skin Friction Coefficient – while significantly reducing computational demands. Traditional high-resolution simulations, essential for detailed aerodynamic analysis, are notoriously resource-intensive; however, this framework leverages surrogate modeling to approximate complex flow behavior at a fraction of the computational cost. By accurately capturing these key quantities, which directly influence drag, lift, and overall performance, designers can explore a wider range of design iterations and optimize aerodynamic surfaces more efficiently. This reduction in computational burden doesn’t compromise accuracy, enabling rapid prototyping and a more thorough understanding of the flow physics without the limitations of expensive and time-consuming simulations.
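Because the field model's scalar outputs come from integrating predicted surface quantities, it may help to see the simplest version of that integration. The sketch below uses the standard two-dimensional relation $c_n = \int_0^1 (C_{p,\mathrm{lower}} - C_{p,\mathrm{upper}})\, d(x/c)$ with the trapezoidal rule. It is a deliberate simplification: the paper works with full three-dimensional surface fields, and the function name and sample data here are assumptions.

```python
def normal_force_coefficient(x_over_c, cp_upper, cp_lower):
    """Integrate the pressure difference along the chord with the trapezoidal rule.

    x_over_c : chordwise stations from 0 (leading edge) to 1 (trailing edge)
    cp_upper : pressure coefficient on the upper surface at each station
    cp_lower : pressure coefficient on the lower surface at each station
    """
    total = 0.0
    for i in range(1, len(x_over_c)):
        dx = x_over_c[i] - x_over_c[i - 1]
        left = cp_lower[i - 1] - cp_upper[i - 1]
        right = cp_lower[i] - cp_upper[i]
        total += 0.5 * (left + right) * dx
    return total

if __name__ == "__main__":
    # Made-up section data: suction on the upper surface decaying toward the trailing edge.
    x = [0.0, 0.25, 0.5, 0.75, 1.0]
    upper = [-(1.0 - xi) for xi in x]
    lower = [0.0 for _ in x]
    print(normal_force_coefficient(x, upper, lower))
```

A scalar obtained this way from the predicted field can be compared directly with the scalar surrogate's prediction, which is the kind of consistency check the misfit-based criteria rely on.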

The developed surrogate modeling framework facilitates a comprehensive assessment of design uncertainty, empowering engineers to move beyond single-point optimizations. By systematically exploring variations in operational parameters – such as turbulent freestream intensity – the framework quantifies the impact of these uncertainties on aerodynamic performance. This capability allows for the robust optimization of designs, ensuring consistent performance even under fluctuating conditions. Validation using adaptive sampling strategies yielded highly accurate predictions, demonstrated by an r² score of 0.99 and a root mean squared error (RMSE) of less than 3% for critical lift and drag coefficients, confirming the framework’s reliability and potential for enhancing aerodynamic designs.
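For readers who want to reproduce this kind of validation on their own data, the two metrics can be computed as sketched below. The article does not say how the RMSE percentage is normalized, so dividing by the range of the reference values is an assumption, as are the function names and the sample numbers.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def relative_rmse(y_true, y_pred):
    """RMSE divided by the range of the reference values (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))

if __name__ == "__main__":
    # Made-up lift coefficients: reference (CFD) values versus surrogate predictions.
    cl_reference = [0.40, 0.50, 0.60, 0.70]
    cl_surrogate = [0.41, 0.49, 0.60, 0.71]
    print(r2_score(cl_reference, cl_surrogate))
    print(relative_rmse(cl_reference, cl_surrogate))
```

Both metrics should be evaluated on held-out samples that were not used for training or infill, otherwise they overstate the surrogate's accuracy.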

Using the SBUQ approach with a field surrogate and GP-GNN Infill (SEwMisfit criterion), the predicted surface pressure coefficient along the wingspan at a $1.5^{\circ}$ angle of attack is bounded by $\pm 2$ standard deviations, representing uncertainty propagation.
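A band of $\pm 2$ standard deviations like the one described in this caption is typically obtained by sampling the uncertain inputs and pushing each sample through the trained surrogate. The sketch below shows that Monte Carlo step for a single scalar output. The closed-form `surrogate_cd`, the normal input distributions, and their parameters are all hypothetical placeholders, not values from the paper.

```python
import random
import statistics

def surrogate_cd(alpha_deg, mach):
    """Hypothetical closed-form stand-in for a trained drag-coefficient surrogate."""
    return 0.015 + 0.002 * alpha_deg ** 2 + 0.05 * (mach - 0.85)

def propagate(n_samples=20000, seed=0):
    """Monte Carlo propagation of input uncertainty through the surrogate.

    Returns the output mean and the (mean - 2 std, mean + 2 std) band.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    outputs = []
    for _ in range(n_samples):
        alpha = rng.gauss(1.5, 0.1)    # assumed uncertainty in angle of attack (degrees)
        mach = rng.gauss(0.85, 0.005)  # assumed uncertainty in Mach number
        outputs.append(surrogate_cd(alpha, mach))
    mean = statistics.fmean(outputs)
    std = statistics.stdev(outputs)
    return mean, (mean - 2.0 * std, mean + 2.0 * std)

if __name__ == "__main__":
    mean, (low, high) = propagate()
    print(mean, low, high)
```

Because each surrogate evaluation is cheap, tens of thousands of samples are affordable, which is what makes this kind of propagation practical compared with running CFD for every sample. For a field surrogate the same loop is applied at every surface point to obtain a band along the span.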

The pursuit of accurate field prediction, as detailed in the article, often leads to increasingly complex models. However, this work champions a different path, one focused on intelligent data selection to achieve robustness with fewer resources. This aligns perfectly with Grace Hopper’s sentiment: “It’s easier to ask forgiveness than it is to get permission.” The adaptive sampling strategy detailed isn’t about exhaustive data gathering, but rather a calculated approach – daring to refine the model with only the most informative samples. By prioritizing data that actively reduces uncertainty, the methodology embodies a pragmatic elegance: a willingness to streamline and focus on what truly matters for reliable aerodynamic prediction.

Where to Next?

The presented work, while demonstrating a pragmatic advance in adaptive sampling, merely scratches the surface of a deeper issue: the relentless pursuit of prediction without a corresponding reckoning with inherent model limitations. The coupling of Gaussian processes with graph neural networks offers efficiency, certainly, but efficiency in propagating error is still error. The true challenge lies not in predicting more fields, but in predicting when a prediction is unreliable, and acting accordingly. A focus on robust failure modes, rather than asymptotic accuracy, would be a welcome recalibration.

Further refinement should eschew complexity for clarity. The current architecture, while functional, invites unnecessary abstraction. Code should be as self-evident as gravity; each component’s purpose immediately apparent. Exploration of alternative surrogate models – those predicated on physical constraints, for example – may yield more interpretable and, ultimately, more trustworthy results. Intuition is the best compiler; a relentless drive toward elegance, even at the cost of immediate performance gains, will prove more fruitful in the long run.

Finally, a critical examination of the ‘oracle’ inherent in active learning strategies is required. The assumption of a perfect, costless evaluation function is a convenient fiction. Real-world deployment necessitates grappling with noisy, delayed, and potentially adversarial feedback. Addressing this reality, rather than smoothing it over with idealized simulations, is the necessary, if unglamorous, path forward.

Original article: https://arxiv.org/pdf/2601.21832.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
